Tag: Academic publishing

Journal innovations – the next step is augmented reality?

In the previous post, I noted that a chemistry publisher is about to repeat an earlier experiment in serving pre-prints of journal articles. It would be fair to suggest that following the first great period of journal innovation, the boom in rapid-publication “camera-ready” articles in the 1960s, the next period of rapid innovation started around 1994, driven by the uptake of the World Wide Web. The CLIC project[cite]10.1080/13614579509516846[/cite] aimed to embed additional data-based components into the online presentation of the journal Chemical Communications, taking the form of pop-up interactive 3D molecular models and spectra. The Internet Journal of Chemistry was designed from scratch to take advantage of this new medium.[cite]10.1080/00987913.2000.10764578[/cite] Here I take a look at one recent experiment in innovation which incorporates “augmented reality”.[cite]10.1055/s-0035-1562579[/cite]

The title is interesting: “Combination of Enabling Technologies to Improve and Describe the Stereoselectivity of Wolff–Staudinger Cascade Reaction”. One of these technologies relates to “microwave-assisted flow generation of primary ketenes by thermal decomposition of α-diazoketones at high temperature”, but the journal presentation itself attempts the “faster interpretation of computed data via a new web-based molecular viewer, which takes advantage from Augmented Reality (AR) technology”. This component can be accessed directly at https://leyscigateway.ch.cam.ac.uk/staudinger/ and is not incorporated into the journal infrastructure, as the CLIC project attempted; it is perhaps closer to the model I noted in the previous post of supporting (FAIR) data associated with the article and hosted separately from the journal.

What happens next depends rather on the Web browser you are using. With many browsers and tablets, a conventional 3D molecular presentation appears; there is no button present where the red arrow points. You will find that this is because “Augmented Reality is not available in your browser, as the getUserMedia() API is not supported”.

Some browsers (the latest Opera, Firefox, Chrome) do support this feature, and a new AR button appears. Selecting this now layers the video from the device camera onto the 3D molecular model; the molecule now floats in the scene captured by the camera (which in the case below is the room I am sitting in). After a few seconds you are urged to “point the camera towards the AR marker”. The supporting information contains such AR markers as a navigation aid for the 3D coordinates contained there. An example is:

If this marker is now brought into the camera view (by printing it, sic, and holding it in front of the camera), the marker resolves into further data relevant to the molecule of interest, layered into the existing scene of the room and the molecule. For the marker above, it resolves to a reaction energy profile which reveals where the specific molecule sits energetically in terms of the overall reaction.

This layering of “heads up” molecular data into a scene comprising a 3D molecular model and the human viewer of that molecule captured in video is what defines the concept of “augmented reality” (the data being the augmentation, rather than the human).

Having now tried it out, I was left wondering whether this truly was a great advance in enabling technology for chemistry journals. The role of the camera seems primarily to capture the AR markers contained in the supporting information; the presence of the reader in the video image apparently inspecting the molecule could be regarded as a distraction. The AR markers (QR codes) are merely visual representations of a URL, which in the form of a DOI (as used in this blog) to locate data is rather more familiar to most readers. The DOI, by the way, carries further information in the form of metadata, and which when sent to e.g. DataCite, enables the data to be found. Does the data need to be layered onto the molecule (and apparently floating in front of the reader) to become usable? Could it instead be placed in a pop-up or separate window of its own (as the 1994 CLIC project achieved)? Do the AR markers enable the data to be FAIR? One can Find the data (albeit only by reading and printing the supporting information) and view it in the AR scene, but is it Accessible (can one access the underlying numerical data?) or Interoperable (place it into another program) or Re-usable?
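The point that a DOI is in effect a compact, resolvable pointer can be illustrated with a minimal sketch. It assumes only the public resolver at https://doi.org/; the helper name doi_to_url is hypothetical and is not part of the article or its viewer.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"  # public DOI proxy (assumed here as the resolver)


def doi_to_url(doi: str) -> str:
    """Return a resolver URL for a bare DOI such as '10.1055/s-0035-1562579'."""
    doi = doi.strip()
    # Very light sanity check: DOIs start with '10.' and contain a '/' separator.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Keep '/' readable; percent-encode anything else that is unsafe in a URL.
    return DOI_RESOLVER + quote(doi, safe="/")


if __name__ == "__main__":
    # The DOI of the article discussed in this post.
    print(doi_to_url("10.1055/s-0035-1562579"))
```

Such a URL is exactly the kind of string that a printed marker or QR code could encode, which is why the two can be regarded as interchangeable ways of locating the same data.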

As with all enabling technologies, one has to always ask if that technology helps or hinders. Or is the principle of KISS (keep it simple) sometimes better? It is however good to see research groups experimenting with these themes and meanwhile readers can judge for themselves whether “heads up” AR augmentation of the data describing research is indeed the next big thing.

August 17, 2016
Chemistry preprint servers (revisited).
This week the ACS announced its intention to establish a “ChemRxiv preprint server to promote early research sharing”. This was first tried quite a few years ago, following the example of the physicists in particular. As I recollect, the experiment lasted about a year, attracted few submissions and even fewer of high quality. Will the concept succeed this time, in particular as promoted by a commercial publisher rather than a community of scientists (as was the original physicists' model)?

The RSC (itself a highly successful commercial publisher) has picked up on this and run its own commentary. You will find quotes from yours truly there, along with Peter Murray-Rust, a long time ardent promoter of community driven open science. One interesting aspect is that the ACS runs around 50 journals, and the decision on whether each will accept preprints for publication will (shortly = next few weeks) be made by the individual editors. I wonder if the eventual list of those supporting the project will bring any surprises (bets on J. Am. Chem. Soc. preprints anyone)?

But I want to pick up on the declared aspiration “to promote early research sharing”. Here I couple research sharing with data sharing. If you share your research, you should also share the data resulting from that research. We are now entering a new era of data sharing (in part as a result of mandation by various funding bodies) and so one has to ask whether a pre-print server will encourage people to create and share FAIR data (data which is findable, accessible, inter-operable and re-usable) as a model to replace the current one of “supporting information” held in enormous PDF files (mostly unFAIR on at least three counts). This question is indeed posed in the RSC commentary. What I would like to see happen are projects such as that described here, which create what were described as “first class research objects”, and which I think amply fulfil the criteria of being FAIR. So, will ChemRxiv preprint servers help promote such FAIR data sharing as part of early research sharing? We will find out soon.

The ACS supports OA (Open Access) sharing of articles, provided the authors pay (or arrange payment of) the appropriate APC or article processing charge. These charges are complex, being subject to various discounts (for example if you as an author are an ACS member or not) but are generally not insignificant (> $1000). I wondered whether preprints might be subject to an APC, and so I asked the ACS. The response was “we don’t anticipate any submission or usages fees at this time”. I think that means free at point of submission, and free at point of readership “at this time”.

Finally, let me now summarise as I understand the current family of “research publications”:
1. The preprint
2. The final author version as submitted to a journal
3. The “version of record” (VoR) as published by the journal
4. Any FAIR published data associated with the article
All four of these are attempts at “research sharing”. Each may be located in a different location, and each may have its own DOI. And of course we cannot easily know how much overlap there is between each of them. Thus, how might 1-3 differ in terms of the story or “narrative” of scientific claims? Does 4 agree or support 1-3? Does 4 agree with perhaps data subsets contained in 1-3? If keeping abreast of the current research literature is a challenge, imagine having to cope with/reconcile up to four versions of each “publication”!

Lots of food for thought here. We have not heard the last of these themes.
August 16, 2016
Data-free research data management? Not an oxymoron.
I occasionally post about "RDM" (research data management), an activity that has recently become a formalised essential part of the research processes. I say recently formalised, since researchers have of course kept research notebooks recording their activities and their data since the dawn of science, but not always in an open and transparent manner. The desirability of doing so was revealed by the 2009 "Climategate" events. In the UK, Climategate was apparently the catalyst which persuaded the funding councils (such as the EPSRC, the Royal Society, etc) to formulate policies which required all their funded researchers to adopt the principles of RDM by May 2015 and in their future researches. An early career researcher here, anxious to conform to the funding body instructions, sent me an email a few days ago asking about one aspect of RDM which got me thinking.

The question related to the divide between data as a separate research object (and which therefore has to be managed), and data as an inseparable part of the article narrative, which is of course ostensibly managed by the journal publication processes. Such data may often be the description of a process rather than simply tables of numbers or graphs. In chemistry it may include chemical names and chemical terms as part of an experimental procedure. For one nice illustration of such embedded data, go look at the chemical tagger page. Here the data is blending with the semantics, and the two are not easily separated. So, when such separation is not easily achieved, should the specific processes required by RDM as illustrated in the five bullet points below actually be followed?
1. Specify a data management plan to be followed, as for example points 2-5 below.
2. Decide upon a location for your data, separated into one for "live" or working data (the purpose simply being to ensure it is properly backed up) and the other for a sub-set of formally "published data" which has to be available for at least ten years after its publication.
3. Use 2 to gather metadata (see 6-14 below) and in return get a DOI representing the location of the combined metadata + data, from a suitable registration authority such as DataCite.
4. Quote this DOI(s) in the article describing the results of analysing the data and presenting hypotheses, and conversely once the article itself is allocated its own DOI from a registration authority such as CrossRef, update the metadata in item 3 so as to achieve a bidirectional link between the data and its narrative (and we assume that DataCite and CrossRef will also increasingly exchange the metadata they each hold about the items).
5. Add both the data and the article DOIs to any institutional CRIS or current research information system (parenthetically, I regard this last stipulation as rather redundant if items 3 and 4 are working effectively, but it's a good interim measure whilst the overall system matures).
So, should step 2 be included if the data itself is inextricably intertwined with the narrative and cannot be separated? The slightly surprising advice I would suggest is yes! And the answer is that it IS possible to generate metadata (data about the, possibly entwined, data) which CAN be processed in such a step. What forms would such metadata take?
6. Identification of the researcher(s) involved. This would nowadays take the form of an ORCID (Open Research and Collaborator Identifier).
7. Identification of the hosting institution where the data has been produced. There is currently no equivalent to an ORCID for institutions, but it is very likely to come in the future.
8. A date stamp formalising when the (meta)data is actually deposited.
9. A title for the project being described. Here we see a blurring between the narrative/article and the data; a title is the shortest possible description of the narrative/article, and it may also apply to the data object(s) or it could have its own title.
10. A slightly fuller abstract of the project being described. Here we see further blurring between the narrative and data objects.
11. One can include "related identifiers", in particular the DOIs of any other relevant articles that might have been published which may expand the context of the data, and also the DOIs of any other relevant datasets which may have been allocated in step 2 above.
12. It is also beneficial to include "chemical identifiers". These can take the form of InChI strings and InChI keys, which allow discretely defined molecular objects which were the object of the research to be tracked and which relate to both the narrative and any other data objects.
13. If specific software has been used to analyse data, it too can be included as a "related identifier" (e.g. [cite]10.5281/zenodo.19272[/cite]).
14. Potentially at least, if a well-defined instrument has been involved, it too could be included with its own "related identifier". With both 13-14, other issues may need addressing, such as versioning etc, but this no doubt will be sorted in due course.
15. etc.
So items such as 6-14 can be collected and sent to e.g. DataCite with a DOI received in return as part of item 2 of the RDM processes. No "pure" data need be involved, only metadata. Nonetheless such metadata can only increase the visibility and discoverability of the research, as illustrated in how such metadata can be searched for the components described above.
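As a rough illustration of such a data-free deposition, the sketch below assembles a metadata-only record as a plain Python dictionary. The field names are loosely modelled on the DataCite metadata schema and should be checked against its current version before use; the function name, the title, the abstract and the institution are hypothetical placeholders. The ORCID, InChI key and software DOI are ones that appear elsewhere on this blog and serve only as sample values.

```python
import json
from datetime import date


def build_metadata_record(title, abstract, orcids, institution,
                          related_dois, inchi_keys, deposited):
    """Assemble a metadata-only record; no "pure" data is included."""
    return {
        # Researcher(s) involved, identified by ORCID.
        "creators": [
            {"nameIdentifier": orcid, "nameIdentifierScheme": "ORCID"}
            for orcid in orcids
        ],
        # Hosting institution (free text, since no institutional identifier is assumed).
        "publisher": institution,
        # Date stamp for the deposition.
        "dates": [{"date": deposited.isoformat(), "dateType": "Submitted"}],
        # Title and abstract, shared between narrative and data.
        "titles": [{"title": title}],
        "descriptions": [
            {"description": abstract, "descriptionType": "Abstract"}
        ],
        # Related articles, datasets, software or instruments, each as (DOI, relation).
        "relatedIdentifiers": [
            {
                "relatedIdentifier": doi,
                "relatedIdentifierType": "DOI",
                "relationType": relation,
            }
            for doi, relation in related_dois
        ],
        # Chemical identifiers for the molecules studied.
        "alternateIdentifiers": [
            {"alternateIdentifier": key, "alternateIdentifierType": "InChIKey"}
            for key in inchi_keys
        ],
    }


record = build_metadata_record(
    title="Hypothetical project title",
    abstract="Hypothetical abstract describing the project.",
    orcids=["0000-0002-8635-8390"],
    institution="Hypothetical University",
    related_dois=[("10.5281/zenodo.19272", "References")],
    inchi_keys=["CULPUXIDFLIQBT-UHFFFAOYSA-N"],
    deposited=date(2016, 8, 16),
)
print(json.dumps(record, indent=2))
```

A record of this kind would then be sent to a registration agency in exchange for a DOI; that submission step depends on the repository and is deliberately not shown here.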
May 24, 2016
Collaborative FAIR data sharing.
I want to describe a recent attempt by a group of collaborators to share the research data associated with their just published article.[cite]10.1021/jacs.5b13070[/cite]

I am here introducing things in a hierarchical form (i.e. not necessarily the serial order in which actions were taken).
1. The data repository selected for the data sharing is described by (m3data) doi: 10.17616/R3K64N[cite]10.17616/R3K64N[/cite]
2. A collaborative project collection was established on this repository (doi: 10.14469/hpc/244[cite]10.14469/hpc/244[/cite]). This data collection has some of the following attributes:
3. Its metadata is sent here: https://search.datacite.org/ui?&q=10.14469/hpc/244 where it can be queried for other details.
4. The project collaborators are all identified by their ORCID, used to obtain further individual information about the researchers. This information is also propagated to the metadata sent to DataCite.
5. In the section labelled associated DOIs there is a link to the recently published peer-reviewed article, which itself cites the data via doi: 10.14469/hpc/244 and which thus establishes a bidirectional link between the article and its data.
6. Also in the associated DOIs section are other DOIs (to two figures and two tables) held in a separate location. One example (doi: 10.14469/hpc/332[cite]10.14469/hpc/332[/cite]) illustrates the original type of data sharing we started about 10 years ago. This form has been variously called a "WEO" or Web-enhanced object (by the ACS) or interactivity boxes (RSC, etc). In such WEOs, we wrap the data into an interactive visual appearance using Jmol or JSmol software. The data itself is directly available to the reader using the Jmol export functions (right mouse click in the visual window).
  - In this specific example the WEO has been assigned its DOI using the repository noted above.[cite]10.17616/R3K64N[/cite]
  - We have in the past also used Figshare[cite]10.17616/R3PK5R[/cite] for this purpose, see e.g. 10.6084/m9.figshare.1181739^‡
  - The WEO can itself reference a more complete set of data used to create the visual appearance, for example data that allows the wavefunction of the molecule to be computed, doi: 10.6084/m9.figshare.2581987.v1[cite]10.6084/m9.figshare.2581987.v1[/cite]. In this instance this is held on the Figshare[cite]10.17616/R3PK5R[/cite] repository.
7. The collection has another section labelled Members. These are individual datasets associated with the collection and held on the SAME repository as the collection itself. In this case, there are five such members, two of which are listed below:
  1. 10.14469/hpc/281[cite]10.14469/hpc/281[/cite] contains a variety of other data such as outputs from an IRC (intrinsic reaction coordinate), energy profile diagrams and ZIP archives of other calculations.
  2. 10.14469/hpc/272[cite]10.14469/hpc/272[/cite] itself contains five members, one of which is e.g.
    
    10.14469/hpc/267[cite]10.14469/hpc/267[/cite] which contains a ZIP archive with NMR data (see here for how this might be packaged in the future) and a file for a GPC (chromatography) instrument.
    
    This last item also contains a new section labelled Metadata, which includes e.g. the InChI key and InChI string for the molecule whose properties are reported.
If this mode of presenting data seems a little more complex than a single monolithic PDF file, it's because it's designed for:
1. collaboration between scientists, potentially at different locations and institutions.
2. attribution of provenance/credit for the individual items (via ORCID).
3. separate date stamping by the various contributors.
4. providing bi-directional links between data and publications.
5. holding what we call FAIR (findable, accessible, interoperable and reusable) data, rather than just data encapsulated in a PDF file.
6. collecting, storing and sending metadata for aggregation in a formal way, i.e. to DataCite using a formal schema to render the metadata properly searchable.
Thus 10.14469/hpc/244 represents our most complex attempt yet at such collaborative FAIR data sharing with multiple contributors. The tools for packaging many of the datasets are still quite limited (see again here) and the design is still being optimised (call it α). When the repository[cite]10.17616/R3K64N[/cite] has been more extensively tested, we intend to make it available as open source for others to experiment with. And of course, when this happens the source code too will have its own DOI!
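The nesting of collection, members and sub-members described above can be pictured as a small tree. The sketch below is only an illustration of that hierarchy, not the repository's actual data model: the class name DataObject and the walk helper are hypothetical, and only the members explicitly named in this post are included (the real collection holds more).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataObject:
    """A DOI-identified object that may contain further member objects."""
    doi: str
    note: str = ""
    members: list[DataObject] = field(default_factory=list)


def walk(obj: DataObject, depth: int = 0):
    """Yield (depth, object) pairs, parents before their members."""
    yield depth, obj
    for member in obj.members:
        yield from walk(member, depth + 1)


# Only the members explicitly named in the post are listed here.
collection = DataObject(
    "10.14469/hpc/244",
    "collaborative project collection",
    members=[
        DataObject("10.14469/hpc/281", "IRC outputs, energy profiles, ZIP archives"),
        DataObject(
            "10.14469/hpc/272",
            "sub-collection",
            members=[DataObject("10.14469/hpc/267", "NMR and GPC data")],
        ),
    ],
)

for depth, obj in walk(collection):
    print("  " * depth + f"{obj.doi}  ({obj.note})")
```

Because every node carries its own DOI, each level can be cited, date-stamped and attributed independently, which is the design goal listed above.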

^‡A refactoring of the Figshare site in December 2015 meant that the DOI no longer points directly to the WEO, and you have to follow a manually inserted link on that page to see it.
April 17, 2016

Metametadata: data about data about (chemical) data.

Scientists are familiar with the term data, at least in a scientific or chemical context, but appreciating metadata (the prefix "meta" meaning "after" or "beyond") is slightly more subtle, in the sense of using it to mean data about data. The challenge lies in clarifying where the boundary between data and its metadata lies and in specifying and controlling the vocabulary used for these metadata descriptions. Items in a chemical metadata dictionary might include e.g. subject classifications such as Organic Molecular Chemistry or identifiers such as InChIkey. But what could metametadata be? Here I briefly show some examples by way of illustration.

Let me start by defining a data repository as a store of both data and the metadata describing it. The metadata is to be exposed in a standard manner which allows it to be aggregated by other agencies. Nowadays, it is becoming common to identify such a data object together with its metadata using a persistent identifier, or DOI. But to decide if any particular repository and the data objects contained therein is generally useful to you, you need information about the metadata itself. Technically, this is defined using a schema[cite]10.2312/re3.008[/cite] describing the metadata (which might e.g. identify any dictionaries used); hence metametadata. Now you need to store the metametadata and so I introduce the concept of a registry which does this. This metametadata object is itself assigned a DOI^‡ and here I list these DOIs for a personal selection of some chemically oriented examples, in this case deriving from the largest registry of research data repositories re3data.org. You can search for your own entry at their site: http://service.re3data.org/search.

Data repository	The repository metametadata DOI^♣
Figshare	10.17616/R3PK5R[cite]10.17616/R3PK5R[/cite]
Zenodo	10.17616/R3QP53[cite]10.17616/R3QP53[/cite]
Cambridge Structural Database	10.17616/R36011[cite]10.17616/R36011[/cite]
Crystallography Open Database	10.17616/R37S31[cite]10.17616/R37S31[/cite]
Oxford University Research Archive	10.17616/R3Q056[cite]10.17616/R3Q056[/cite]
Open Notebook Science	10.17616/R3859D[cite]10.17616/R3859D[/cite]
Usefulchem	10.17616/R3Z89N[cite]10.17616/R3Z89N[/cite]
Chemotion	10.17616/R34P5T[cite]10.17616/R34P5T[/cite]
Chemspider	10.17616/R38P4P[cite]10.17616/R38P4P[/cite]
Chemical Database Service	10.17616/R36P42[cite]10.17616/R36P42[/cite]
Imperial College HPC data repository.	10.17616/R3K64N[cite]10.17616/R3K64N[/cite],[cite]10.14469/hpc/382[/cite]
Imperial College SPECTRa repository.[cite]10.1021/ci7004737[/cite]	10.17616/R30316[cite]10.17616/R30316[/cite]

Not all of the repositories listed in the table above assign formal DOIs to their data collections, meaning that the metadata for their entries cannot be aggregated in a searchable manner using e.g. search.datacite.org/ui (or search.datacite.org/api for the machine version). Currently, the metametadata does not fully carry this information, an aspect which I gather will be rectified in a future revision of the re3data schema.[cite]10.2312/re3.008[/cite]

Importantly, both metadata and (repository) metametadata can be searched using APIs (application programming interfaces), ensuring that the entire flow of meta information can be subject to automated software analysis rather than just visual inspection by a human. This should allow a rich and open infrastructure for handling research objects or data to be built up using hierarchical metadata. The examples above indeed show that the chemical space is already the largest component of the Natural Sciences space.

Although the edifice is still largely in its infancy, already I think we can start to see an alternative open approach emerging to "Googling" for data, or the even older traditional bespoke (i.e. non-open) services offered by commercial human-based abstractors of chemical metadata.

^‡This DOI is information about the metametadata, and hence it is metametametadata, or m3data. Sorry!
^♣The citations at the foot of this post are generated entirely automatically (by a WordPress plugin called Kcite) from the m3data associated with each entry, i.e. the DOI listed. Were the persistent identifier for the entry ever to be changed, this would propagate automatically to the citation, unlike the static entries in the table.

April 16, 2016

Publishing embargoes.
Publishing embargoes seem a relatively new phenomenon, probably starting in areas of science when the data produced for a scientific article was considered more valuable than the narrative of that article. However, the concept of the embargo seems to be spreading to cover other aspects of publishing, and I came across one recently which appears to take such embargoes into new and uncharted territory.

One example (there are many others) of embargoes continuing to operate in the era of open science and open data relates to crystallographically derived coordinates for macromolecules. Biomolecular structures are allowed to be embargoed for a maximum of one year before becoming openly available or "released" (considered a friendlier term than embargo). A more recent phenomenon is the embargo on press releases which may be prepared by authors and/or publishers to accompany the appearance of any article considered especially newsworthy. The publisher will then request that the press release is only released to coincide with the actual publication time and date of the article itself. Both of these types of embargo are more or less accepted by both parties. But in the last five years or so, new types of embargo have been introduced and it is these I want to discuss here.
1. The self-archive or "green open access" version of an article, in the form of the last author version of an accepted manuscript prior to copy-editing and other operations by a publisher. Such Green OA versions are now a mandatory requirement from funders (in the UK), arising from the need to conduct a "REF" or research excellence framework assessment of all (UK) universities every seven years or so. In order to allow assessors and funding councils unencumbered access to these research outputs, the authors must self-archive their publications in a suitable institutional repository. In general therefore, there should always exist two versions of any scientific paper authored within these guidelines, the AV (author version) and VoR (Version of Record, held by the publisher, and carrying the guarantee of peer review). Publishers now embargo author versions until the VoR version has been published, and sometimes even up to 18 months beyond this period.
2. The "supporting information" or SI embargo. This is closely related to the crystallographic data embargo noted above, but it applies in general to most other data and information associated with an article. Until very recently, most SI was in fact handled by the publisher themselves, and so it was released at the same time as the article. Since it is becoming more common to deposit data and SI in a separate repository, some publishers mandate that the release dates of this material must not precede the article itself. Deposition of such data has also become a mandatory requirement from (UK) funders since May 2015, and I have blogged about such "research data management" often here. In effect, both the scientific article and the data supporting it achieve their own DOIs or persistent digital identifiers, allowing easy and independent access to either the article OR its data. In fact, assigning such a DOI has a more subtle effect; creating a DOI means that metadata describing the object is also created and then aggregated by the agency issuing the DOI such as CrossRef and DataCite. Importantly, one should note that SI which is handled purely by the publisher will not have its own separate DOI and it will not have its own metadata. The data metadata for example can include the DOI for the article, and vice versa. I have shown examples of the utility of such metadata for data in an earlier post.
3. So now we come to the most recent embargo, which has surfaced since around May 2015, as increasingly data has become a first class object in its own right with its own DOI and importantly its own metadata. There is now evidence that some publishers are requesting that this very metadata about data is also subjected to an embargo, not to be released before the article which makes use of that data is itself released. So data can be deposited in "dark form" prior to a publication, but the metadata (which carries the date stamp and provenance for the deposition) may have to be "dark" or embargoed. Actually, this is not yet very common; for example I asked the Royal Society of Chemistry what their policy was, with the reply "the Royal Society of Chemistry wouldn’t require metadata about the data files to be embargoed".
We live in an era where the very careers of researchers can be determined by their claim to priority about scientific discoveries. The date stamps for priority continue to be largely controlled and issued by publishers and some may decide that it will be in their business interests to extend their control to data. Perhaps they may even wish to control all aspects of publication including the data and its metadata, acting as self-proclaimed research facilitators.

At this moment, this has not happened; both data and its metadata can remain open and FAIR. Which is where I think we should go in the future in the interests of open science itself.
April 13, 2016

Global initiatives in research data management and discovery: searching metadata.

The upcoming ACS national meeting in San Diego has a CINF (chemical information division) session entitled "Global initiatives in research data management and discovery". I have highlighted here just one slide from my contribution to this session, which addresses the discovery aspect of the session.

Data, if you think about it, is rarely discoverable other than by intimate association with a narrative or journal article. Even then, the standard procedure is to identify the article itself as being of interest, and then to dig out the "supporting information", which normally takes the form of a single paginated PDF document. If you are truly lucky, you might also get a CIF file (for crystal structures). But such data has little life of its own outside of its parent, the article. Put another way, it has no metadata it can call its own (metadata is data about an object, in this case research data). An alternative is to try to find the data by searching conventional databases such as CAS, Beilstein/Reaxys or CSD, and there of course the searches can be very precise. But (someone) has to pay the bills for such accessibility.

We are now starting to see quite different solutions to finding data (the F in FAIR data, the other letters representing accessibility, interoperability and re-usability). These solutions depend on metadata being a part of the solution from the outset, rather than any afterthought produced as a commercial solution. The collection of metadata is part of the overall process called RDM, or research data management, perhaps even the most important part of it. In exchange for identifying metadata about one's data, one gets back a "receipt" in the form of a persistent identifier for the data, more commonly known as a DOI. The agency that issues the DOI also undertakes to look after the donated metadata, and to make it searchable. The table below shows eight searches of such metadata, one example of how to acquire statistics relating to the usage of the data and one search of how to find repositories containing the data.

Search queries enabled by the use of metadata in data publication
#	Search query^*	Instances retrieved:
1	http://search.datacite.org/ui?q=alternateIdentifier:InChIKey\:*	InChI key
2	http://search.datacite.org/ui?q=alternateIdentifier:InChI\:*	InChI identifier
3	http://search.datacite.org/ui?q=alternateIdentifier:InChIKey:CULPUXIDFLIQBT-UHFFFAOYSA-N	InChI key CULPUXIDFLIQBT-UHFFFAOYSA-N
4	http://search.datacite.org/ui?q=ORCID:0000-0002-8635-8390+alternateIdentifier:InChIKey\:*	ORCID 0000-0002-8635-8390 AND (boolean) InChI key.
5	http://search.datacite.org/ui?q=ORCID:0000-0002-8635-8390+alternateIdentifier:InChI\:InChI=1S\/C9H11N5O3*	ORCID 0000-0002-8635-8390 AND (boolean) + InChI string 1S/C9H11N5O3 with the * wild.
6	http://search.datacite.org/ui?q=has_media:true&fq=prefix:10.14469	Has content media^‡ for Publisher 10.14469 (Imperial College)
7	http://search.datacite.org/ui?q=format:chemical/x-*	Data format type chemical/x-*
8	http://search.datacite.org/api?&q=prefix:10.14469&fq=alternateIdentifier:InChIKey\:&fl=doi,title,alternateIdentifier&wt=json&rows=15 or http://api.labs.datacite.org/works?q=prefix:10.14469+AND+alternateIdentifier:InChIKey\:	First 15 hits in JSON format, batch query mode
9	http://stats.datacite.org/?fq=datacentre_facet:"BL.IMPERIAL – Imperial College London"	resolution statistics for publisher 10.14469 (Imperial College) per month
10	http://service.re3data.org/search?query=&subjects[]=31 Chemistry	Research data repository search for Chemistry (135 hits)

^‡In this instance the three MIME media types are chemical/x-wavefunction, chemical/x-gaussian-checkpoint and chemical/x-gaussian-log. See[cite]10.1021/ci9803233[/cite] for chemical MIME (multipurpose internet media extensions).

Anyone familiar with the standard ways of finding data (CAS, CSD, Reaxys) will appreciate that the above does not yet have the finesse to find e.g. sub-structures of chemical structures, synthetic procedures or molecular properties. My including it here is primarily to show some of the potential such systems have, and to remark particularly that the batch query capability of this infrastructure could indeed be used in the future to construct much more sophisticated systems. Oh, and to the end-user at least, the searches shown above do not require institutional licenses to use. Both the data and its metadata are free, mostly with a CC0 or CC BY 3.0 license for re-use (the R of FAIR).
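As a minimal sketch of how the batch query in row 8 of the table might be assembled programmatically, the Python below builds the request URL from its parts. The endpoint and field names are copied from the table; the function name `build_batch_query`, its parameters, and the use of the row 1 wildcard form in the filter are illustrative assumptions, and the service itself may have changed since these queries were written.

```python
from urllib.parse import urlencode

# Endpoint as it appears in row 8 of the table; it may have moved since.
DATACITE_SEARCH_API = "http://search.datacite.org/api"


def build_batch_query(prefix, identifier_type="InChIKey", rows=15):
    """Return a URL asking for DOIs under `prefix` that carry a chemical identifier.

    Hypothetical helper: only the query fields come from the table above.
    """
    params = [
        ("q", f"prefix:{prefix}"),
        # The backslash escapes the colon separating the identifier type
        # from its value; the * wildcard (as in row 1) matches any value.
        ("fq", f"alternateIdentifier:{identifier_type}\\:*"),
        ("fl", "doi,title,alternateIdentifier"),
        ("wt", "json"),
        ("rows", str(rows)),
    ]
    # urlencode percent-encodes the colons, commas, backslash and asterisk.
    return f"{DATACITE_SEARCH_API}?{urlencode(params)}"


url = build_batch_query("10.14469")
print(url)
```

The function only constructs the URL; fetching and paging through the JSON response is left out because the shape of that response is not described here.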

If more of interest related to this topic emerges at the ACS session, I will report back here.

March 7, 2016

LEARN Workshop: Embedding Research Data as part of the research cycle
I attended the first (of a proposed five) workshops organised by LEARN (an EU-funded project that aims to ...Raise awareness in research data management (RDM) issues & research policy) on Friday. Here I give some quick bullet points relating to things that caught my attention and/or interest. The program (and Twitter feed) can be found at https://learnrdm.wordpress.com where others' comments can also be seen.
- Henry Oldenburg, founder member and first secretary of the Royal Society, was the first Open Scientist.
- About 100 people attended the workshop. Of these ~3-5 identified themselves as researchers creating data, and the rest comprised research data managers, administrators, librarians, publishers (but see below) etc. Many were new to their posts.
- Not publishing scientific data should become recognised as scientific malpractice.
- Central libraries should pro-actively disperse their knowledge to data scientists in departments.
- If a scientist is concerned that openly publishing their data might give advantage to their competitors, they are urged to counteract this by "being cleverer than the others".
- The three great bastions of open science are (a) Open Data, (b) Open access articles and (c) doing science openly. Examples of this third category include open notebook science (ONS), a form notably pioneered by Jean-Claude Bradley. One attribute of ONS was noted as no insider knowledge.
- Learned societies should endow medals for Open Science.
- (Some) publishers are reinventing themselves as Research Facilitators.
The plenaries are all well worth dipping into (certainly the video and in some cases all the slides are scheduled to appear).

If you are a researcher (undergraduate students, PGs, PDRAs, early career researchers and academics) you should immediately track down your local evangelist/expert in RDM and ask what the local infrastructures are (or will be shortly built).
February 1, 2016
Single Figure (nano)publications, reddit AMAs and other new approaches to research reporting

I recently received two emails each with a subject line new approaches to research reporting. The traditional 350 year-old model of the (scientific) journal is undergoing upheavals at the moment with the introduction of APCs (article processing charges), a refereeing crisis and much more. Some argue that brand new thinking is now required. Here are two such innovations (and I leave you to judge whether that last word should have an appended ?).

To set the scene for the first, I will quote the abstract: “The single figure publication is a novel, efficient format by which to communicate scholarly advances. It will serve as a forerunner of the nano-publication, a modular unit of information critical for machine-driven data aggregation and knowledge integration”.[cite]10.12688/f1000research.6742.1[/cite] The kernel of this suggestion is (again I quote) “We offer the idea of the micro-publication unit, the single figure publication (SFP), to provide scholars with a real-world, manageable method to inform research.” I was struck by the overlap between this suggestion and the one you may find on many of the posts on this blog, where what I refer to as FAIR Data is assigned a digital object identifier (DOI) and included in the citation lists at the end of the post. The key phrase in the above abstract is machine-driven data aggregation and knowledge, although the article does not really go into any mechanisms for easily achieving this. It is my argument that the act of assigning a DOI carries with it the association that there is machine searchable metadata which can be retrieved and used for the aggregation and knowledge mining. The authors of this article, Do and Mobley, advocate adoption of nanopublications defined by inclusion of just a single figure (notably, not a table of results!) and some accompanying context, which they claim would reduce the unit of publication to a more tractable size. This does raise the question of whether science needs more publications (in chemistry alone there are said to be more than a million published each year) or whether we should instead be concentrating our efforts on improving the data side of things by increasing its semantic content and formalising its structures, its preservation and curation. I certainly argue that far too little effort has been poured into these latter activities.
You only have to look at the typical SI (supporting information) associated with many chemistry articles to realise that in many cases they are still hardly fit for purpose. There is one concept introduced by Do and Mobley that also deserves mention. Their nanopublications are structured to be read by machines, not people. They will therefore not be refereed by people (my inference). They do not really discuss how else the quality will be assessed, but of course if you treat their nanopublication as essentially FAIR data, then it does become possible to develop methods of machine refereeing.
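The argument above, that a DOI implies machine-retrievable metadata, can be illustrated with DOI content negotiation, in which the resolver is asked for structured metadata rather than the landing page. The sketch below only builds such a request; the helper name `metadata_request` is hypothetical, and whether a given registration agency returns this format for a given DOI is an assumption to be checked.

```python
from urllib.request import Request

# Media type for Citation Style Language JSON, requested via the Accept header.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def metadata_request(doi):
    """Build (but do not send) a request for a DOI's machine-readable metadata."""
    return Request(f"https://doi.org/{doi}", headers={"Accept": CSL_JSON})


# The DOI of the single figure publication article cited above.
req = metadata_request("10.12688/f1000research.6742.1")
print(req.full_url)
print(req.get_header("Accept"))
```

Passing `req` to `urllib.request.urlopen` and decoding the body as JSON would be the next step for a machine aggregating such records.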

The second email alerted me to an article[cite]10.15200/winn.143871.12809[/cite] in the Winnower, a forum that offers a bridge between “traditional scholarly publishing tools to traditional and non-traditional scholarly outputs—because scholarly communication doesn’t just happen in scholarly journals”. Here, the concept of scholarly communication is extended to the New Reddit Journal of Science, which introduces the concept pioneered by reddit of the AMA, or “ask me anything” environment. I occasionally publish some of the posts on this blog to the Winnower, receiving in return the increasingly ubiquitous DOI. I have also occasionally quoted these DOIs in articles submitted to conventional chemistry journals. What we see now is the propagation of a Winnower DOI on to e.g. https://www.reddit.com/r/science/ where anyone^† can post a question related to the original research reporting. I must state that I do have some reservations about this. Whilst the majority of traditional scholarly reporting is likely to receive no AMAs (just as a very high proportion of research articles attract few if any citations in other articles over a period of decades), it is also likely that the quality of posted AMAs may turn out to be very low. At that point the original researcher has to make a judgement as to whether to devote any of their increasingly precious and fragmented time to answering them. And if few or no answers are posted in response to an AMA, the system seems unlikely to flourish.

But what we see here are two serious attempts to develop new approaches to research reporting, and no doubt others will emerge. To quote Yogi Berra, the future is not what it used to be.

^†Anyone can also post to this blog to ask similar questions. But note that associating an ORCID with such comments is highly recommended. I do not think that reddit currently supports ORCID, but I would argue if the intent is serious, it certainly should.

August 5, 2015

Personal web pages on digital repositories.

The university sector in the UK has quality inspections of its research outputs conducted every seven years, going by the name of REF or Research Excellence Framework. The next one is due around 2020, and already preparations are under way! Here I describe how I have interpreted one of its strictures: that all UK funded research outputs (i.e. research publications in international journals) must be made available in open unrestricted form within three months of the article being accepted for publication, or they will not be eligible for consideration in 2020.

At the outset, I should say that one infrastructure to help researchers adhere to the guidelines is being implemented in the form of the Symplectic system. This allows a researcher to upload the final accepted version of a manuscript. At Imperial College, a digital repository called Spiral serves this purpose and also acts as the front end for collecting informative metadata to enhance discoverability. The final accepted version is then converted by the publisher into a version-of-record. This contains styling unique to the publisher and the content is subjected to further scrutiny by the authors as proof corrections. In an ideal world, these latter changes should also be faithfully propagated back to the final accepted version, as would all the supporting information associated with the article. Since most authors do not exactly enjoy the delights of proof corrections, this final reconciliation of the two versions may not always be assiduously undertaken.

I became concerned about the existence of two versions of any given scientific report and that the task of ensuring total fidelity in the content of both versions may negatively impact on the author’s time. Much better if the publisher could grant permission for the author to archive the version-of-record into a digital repository.

Some experiments were needed, and I decided to start them in reverse, by archiving my oldest publications. Since Symplectic now provides a system to do this, I began by using it. Symplectic identifies each publisher’s policies for archival, of which the most liberal are known as ROMEO GREEN. To quote from the definition, this colour allows the author to “archive pre-print and post-print or publisher’s version/PDF”. In an afternoon I had processed most of my ROMEO GREEN articles. You know how it is sometimes: you do not read the fine print! And so the library soon informed me that archival of ROMEO GREEN articles was in fact only permitted on the author’s “personal web page”. Spiral, as an institutional repository, does not apparently constitute a personal web page for me and so none of my Symplectic submissions could be accepted for archival there.

Time to rethink the experiment. Firstly, I very much wanted the reprints to be held by a proper digital repository rather than a conventional web page. Why? I wanted my reprints to adhere as much as possible to FAIR: findable, accessible, interoperable and re-usable. Well, at least the first two of those (the last two relate more to data). A repository is designed to hold metadata in a formal and standards-based manner and metadata helps achieve FAIR. So I asked the Royal Society of Chemistry (as a ROMEO GREEN publisher) whether a personal web page hosted on a digital repository would qualify. I was soon informed that I had proposed a neat solution here, and they couldn’t see an issue.

Now, all I had to do was find a repository where I could create such a personal web page. The chemistry department at Imperial College has for ten years hosted a DSpace repository called SPECTRa[cite]10.1021/ci7004737[/cite] which already has the functionality for individuals to create personal collections. I had also picked up on the increasing attention being given to Zenodo, like the World-Wide Web itself an offshoot of CERN (of Large Hadron Collider fame) and born from the need for researchers to more permanently archive the outputs of their researches. These outputs include software, videos, images, presentations, posters, publications and (most obviously for CERN) datasets. I thought I would include them in my experiment as well. The results are summarised below.

	DSpace-SPECTRa	Zenodo
Community	Henry Rzepa personal web page reprint collection	Rzepa personal computational chemistry data and reprint page
Collection	Royal Society of Chemistry reprints
Publication	10042/195577	10.5281/zenodo.18758[cite]10.5281/zenodo.18758[/cite]
Thesis	10044/1/20860[cite]http://doi.org/10044/1/20860[/cite]	10.5281/zenodo.18777[cite]10.5281/zenodo.18777[/cite]
Dataset	10.14469/ch/191342[cite]10.14469/ch/191342[/cite]	10.5281/zenodo.18632[cite]10.5281/zenodo.18632[/cite]
Harvesting	OAI-ORE	OAI-PMH

The last line of this table includes a link to another design feature of a repository, facilitating the ability to harvest the content. The ContentMine project (“The right to read is the right to mine!“) has shown how such harvesting of facts from the literature can be automated on a vast scale, and (IMHO) represents an example of those disruptive innovations that have the power to change the world forever. It also enshrines the idea that scientific facts funded by the public purse should be capable of being openly liberated from their containers. A harvestable repository seems an ideal container for achieving this.
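To make the harvesting idea concrete, here is a minimal sketch of the OAI-PMH side of the table: one function builds a `ListRecords` request and another pulls identifiers and titles out of a response. The base URL, the function names and the sample XML are all placeholders invented for illustration (the sample is hand-written, not harvested data), and OAI-ORE, the other protocol in the table, is not covered.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces used by OAI-PMH responses carrying Dublin Core metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def titles_by_identifier(xml_text):
    """Map each record's OAI identifier to its first Dublin Core title."""
    root = ET.fromstring(xml_text)
    found = {}
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        found[identifier] = title
    return found


# Hand-written stand-in for a repository response; not real harvested data.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example reprint</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url("https://repository.example.org/oai"))
print(titles_by_identifier(SAMPLE))
```

A real harvester would also need to follow the resumption tokens that repositories use to split long result lists, which this sketch omits.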

My experiment is part of what might be seen as the increasingly subtle interplay between:

- scientific authors, whose creative endeavour research is and without whom scientific publishers would not exist
- publishers, who create a business model from the content freely given them by authors but also (especially if a commercial publisher) need to be accountable to their shareholders
- the funding councils, many of whom now wish the outcomes of the research they fund to be openly available to all
- the local libraries/administrators who have to adhere to/enforce all the rules contractually handed down to them by publishers whose direct customers they are, but who also need to serve their community of readers and authors
- researchers who would rather do research than fret about the above, and who would rather spend limited resources doing that research rather than diverting an increasing amount of their attention into the above system
- readers, who need unimpeded access to the research endeavours of others, but often have little influence on the policies and actions of all the other stakeholders, since they are NOT considered customers (of the publishers)
- etc. etc.

My experiment was in part designed to explore these rules, their interpretations and their boundaries. For the time being at least I seem to have found an arrangement that allows me to distribute versions-of-record of my own work, thanks to a generous and far-sighted learned society publisher. Watch this space!

Acknowledgments

This post has been cross-posted in PDF format at Authorea.

June 20, 2015