Tag: Academic publishing

The “Accessible” in FAIR (data).
In a previous post, I looked at the Findability of FAIR data in common chemistry journals. Here I move on to the next letter, the A = Accessible.

The attributes of A[cite]10.1038/sdata.2016.18[/cite] include:
1. (meta)data are retrievable by their identifier using a standardized communication protocol.
2. the protocol is open, free and universally implementable.
3. the protocol allows for an authentication and authorization procedure.
4. metadata are accessible, even when the data are no longer available.
5. The metadata should include access information that enables automatic processing by a machine as well as a person.
Items 1-2 are covered by associating a DOI (digital object identifier) with the metadata. Item 3 relates to data which is not necessarily also OPEN (FAIR and OPEN are complementary, but do not mean the same).

Item 4 mandates that a copy of the metadata be held separately from the data itself; currently the favoured repository is DataCite (and this metadata may well be duplicated at CrossRef, thus providing a measure of redundancy). It also addresses an interesting debate on whether the container for data, such as a ZIP or other compressed archive, should also contain the full metadata descriptors internally. Doing so would not by itself satisfy item 4, but it could if a copy of the metadata were also registered externally with e.g. DataCite.

Item 4 also implies some measure of separation between the data and its metadata, which raises an interesting and separate issue (introduced with this post): the metadata can be considered a living object, with some attributes being updated after deposition of the data itself. Such metadata could, for example, include an identifier for the journal article relating to the data, information that only appears after the FAIR data itself is published, or pointers to other datasets published at a later date. Updating metadata contained in an archive along with the data itself would be problematic, since the data itself should not be a living object.

Item 5 is the need for Accessibility to relate both to a human acquiring FAIR data and to a machine. The latter needs direct information on exactly how to access the data. To illustrate this, I will use data deposited in support of the previous post, for which a representative example of metadata can be found at a separate location (item 4):
data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/5496

This contains the components:
1. <relatedIdentifier relatedIdentifierType="URL" relationType="HasMetadata" relatedMetadataScheme="ORE" schemeURI="http://www.openarchives.org/ore/ ">https://data.hpc.imperial.ac.uk/resolve/?ore=5496</relatedIdentifier>
2. <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart" relatedMetadataScheme="Filename" schemeURI="filename://aW5wdXQuZ2pm">https://data.hpc.imperial.ac.uk/resolve/?doi=5496&file=1</relatedIdentifier>
The first component is a machine-suitable RDF declaration of the full metadata record. The second allows direct access to the datafile. This in turn allows programmed interfaces to the data to be constructed, which include e.g. components for immediate visualisation and/or analysis. It also allows access on a large scale (mining), something a human is unlikely to try.
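As a minimal sketch of what "automatic processing by a machine" could look like, the Python below parses the two relatedIdentifier components listed above and returns the ORE manifest URL and the direct file URL. Several things here are assumptions rather than documented behaviour: the sample is a namespace-free copy of the two components (a full DataCite record carries an XML namespace and many more elements), the helper names `datacite_xml_url` and `access_points` are invented for illustration, and the idea that the text after `filename://` is a base64-encoded file name is an inference from the value itself.

```python
import base64
import xml.etree.ElementTree as ET


def datacite_xml_url(doi: str) -> str:
    """Build a metadata URL following the pattern of the link quoted above."""
    return "https://data.datacite.org/application/vnd.datacite.datacite+xml/" + doi


# Namespace-free copy of the two components quoted above, wrapped in a parent
# element. Inside XML the "&" of the file URL has to be written "&amp;".
SAMPLE = """<relatedIdentifiers>
<relatedIdentifier relatedIdentifierType="URL" relationType="HasMetadata" relatedMetadataScheme="ORE" schemeURI="http://www.openarchives.org/ore/">https://data.hpc.imperial.ac.uk/resolve/?ore=5496</relatedIdentifier>
<relatedIdentifier relatedIdentifierType="URL" relationType="HasPart" relatedMetadataScheme="Filename" schemeURI="filename://aW5wdXQuZ2pm">https://data.hpc.imperial.ac.uk/resolve/?doi=5496&amp;file=1</relatedIdentifier>
</relatedIdentifiers>"""


def access_points(xml_text: str):
    """Return (ORE manifest URL, [(filename, file URL), ...]) from a record."""
    ore_url, files = None, []
    for el in ET.fromstring(xml_text).iter():
        # Ignore any XML namespace so the same code copes with a full record.
        if el.tag.rsplit("}", 1)[-1] != "relatedIdentifier":
            continue
        url = (el.text or "").strip()
        scheme = el.get("relatedMetadataScheme")
        if scheme == "ORE":
            ore_url = url
        elif scheme == "Filename":
            # Assumption: the part after "filename://" is a base64-encoded name.
            encoded = el.get("schemeURI", "").split("://", 1)[-1]
            files.append((base64.b64decode(encoded).decode("utf-8"), url))
    return ore_url, files


ore_url, files = access_points(SAMPLE)
print(ore_url)
for name, url in files:
    print(name, url)
```

With the sample above, the encoded file name decodes to input.gjf, which is consistent with a Gaussian input file. Fetching the record itself (for instance from `datacite_xml_url("10.14469/hpc/5496")`) is deliberately left out, since whether that endpoint still responds is not something this sketch checks.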

It would be fair to say that the A of FAIR is still evolving. Moreover, searches of the DataCite metadata database are not yet at the point where one can automatically identify metadata records that have these attributes. When they do become available, I will show some examples here.

Added: This search: https://search.test.datacite.org/works?query=relatedIdentifiers.relatedMetadataScheme:ORE shows how it might operate.
April 18, 2019
A search of some major chemistry publishers for FAIR data records.
In recent years, findable data has become ever more important (the F in FAIR). Here I test that F using the DataCite search service.

Firstly an introduction to this service. This is a metadata database about datasets and other research objects. One of the properties is relatedIdentifier which records other identifiers associated with the dataset, being say the DOI of any published article associated with the data, but it could also be pointers to related datasets.

One can query thus:
1. https://search.datacite.org/works?query=relatedIdentifiers.relatedIdentifier:*
  which retrieves the very healthy looking 6,179,287 works.
2. One can restrict this to a specific publisher by the DOI prefix assigned to that publisher:
  ?query=relatedIdentifiers.relatedIdentifier:10.1021*
  which returns a respectable 210,240 works.
3. It turns out that the major contributors to FAIR data are currently crystal structures from the CCDC. One can remove them from the search to see what is left over:
  ?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+NOT+(identifier:*10.5517*)
  and one is down to 14,213 works, of which many nevertheless still appear to be crystal structures. These may be links to other crystal datasets.
I have performed searches 2 and 3 for some popular publishers of chemistry (the same set that were analysed here).

Publisher	Search 2	Search 3
ACS	210,240	14,213
RSC	138,147	1,279
Elsevier	185,351	56,373
Nature	12,316	8,104
Wiley	135,874	9,283
Science	3,384	2,343

These publishers all have significant numbers of datasets which at least accord with the F of FAIR. A lot of data sets may not have metadata which in fact points back to a published article, since this can be something that has to be done only when the DOI of that article appears, in other words AFTER the publication of the dataset. So these numbers are probably low rather than high.
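To make the comparison between the two searches easier to read, here is a small sketch that turns the counts in the table into the fraction of each publisher's linked records that survive the removal of CCDC entries. The counts are copied from the table; the dictionary layout and the function name `non_ccdc_share` are illustrative choices only.

```python
# (Search 2, Search 3) counts copied from the table above.
COUNTS = {
    "ACS": (210_240, 14_213),
    "RSC": (138_147, 1_279),
    "Elsevier": (185_351, 56_373),
    "Nature": (12_316, 8_104),
    "Wiley": (135_874, 9_283),
    "Science": (3_384, 2_343),
}


def non_ccdc_share(counts):
    """Fraction of each publisher's linked records left after excluding CCDC DOIs."""
    return {name: search3 / search2 for name, (search2, search3) in counts.items()}


shares = non_ccdc_share(COUNTS)
# List publishers from the largest non-CCDC share to the smallest.
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:10s} {share:6.1%}")
```

On these numbers the spread is wide: for some publishers well over half of the linked records are something other than CCDC crystal structures, for others it is a few percent or less.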

How about the other way around? Rather than datasets that have a journal article as a related identifier, we could search for articles that have a dataset as a related identifier?
1. ?query=(identifier:*10.1039*)+AND+(relatedIdentifiers.relatedIdentifier:*)
  returns a rather mysterious “nothing found”. It might be that there is no mapping of this search between the CrossRef and DataCite metadata schemas.
2. And just to show the searches are behaving as expected:
  ?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+AND+(identifier:*10.5517*)
  returns 196,027 works.
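The three query forms used in this post differ only in the publisher prefix and the Boolean operator, so they can be generated rather than typed. The sketch below assembles the query strings exactly as written above and checks that the reported ACS counts are mutually consistent. The function names are hypothetical, and nothing here contacts the search service, which may have changed since the counts were recorded.

```python
BASE = "https://search.datacite.org/works?query="  # service URL as quoted in the post
CCDC_PREFIX = "10.5517"  # DOI prefix excluded or required in the searches above


def related_to(prefix: str) -> str:
    """Query term: records whose related identifier starts with a publisher prefix."""
    return f"relatedIdentifiers.relatedIdentifier:{prefix}*"


def search2(prefix: str) -> str:
    """All records linked to the publisher."""
    return BASE + related_to(prefix)


def search3(prefix: str) -> str:
    """Records linked to the publisher, excluding CCDC identifiers."""
    return f"{BASE}({related_to(prefix)})+NOT+(identifier:*{CCDC_PREFIX}*)"


def ccdc_only(prefix: str) -> str:
    """Records linked to the publisher that are CCDC identifiers."""
    return f"{BASE}({related_to(prefix)})+AND+(identifier:*{CCDC_PREFIX}*)"


for build in (search2, search3, ccdc_only):
    print(build("10.1021"))

# Consistency check on the counts reported for ACS (prefix 10.1021):
# all linked records minus the non-CCDC ones should equal the CCDC-only count.
all_acs, non_ccdc, ccdc = 210_240, 14_213, 196_027
print(all_acs - non_ccdc == ccdc)
```

The NOT and AND searches partition the full set, which is why the three ACS numbers add up exactly.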
It will also be of interest to show how these numbers change over time. Is there an exponential increase? We shall see.

Finally, we have not really explored adherence to eg the AIR of FAIR. That is for another post.
April 12, 2019
Questions about the (metadata) components of a scientific article.
The conventional procedure for reporting analysis or new results in science is to compose an “article”, augment that perhaps with “supporting information” or “SI”, and submit it to a journal which undertakes peer review, with revision as necessary for acceptance and finally publication. If errors in the original are later identified, a separate corrigendum can be submitted to the same journal, although this is relatively rare. Any new information which appears post-publication is then considered for a new article, and the cycle continues. Here I consider the possibilities for variations in this sequence of events.

The new disruptors in the processes of scientific communication are the “data”, which can now be given a separate existence (as FAIR data) from the article and its co-published “SI”. Nowadays both the “article+SI” and any separate “data” have another, mostly invisible component, the “metadata”. Few authors ever see this metadata. For the article, it is generated by the publisher (as part of the service to the authors), and sent to CrossRef, which acts as a global registration agency for this particular metadata. For the data, it is assembled when the data is submitted to a “data repository”, either by the authors providing the information manually, or by automated workflows installed in the repository or by a combination of both. It might also be assembled by the article publisher as part of a complete metadata package covering both article and data, rather than being separated from the article metadata. Then, the metadata about data is registered with the global agency DataCite (and occasionally with CrossRef for historical reasons).^‡ Few depositors ever inspect this metadata after it is registered; even fewer authors are involved in decisions about that metadata, or have any inputs to the processes involved in its creation.

Let me analyse a recent example.
1. For the article[cite]10.1021/acsomega.8b03005[/cite] you can see the “landing page” for the associated metadata as https://search.crossref.org/?q=10.1021/acsomega.8b03005 and actually retrieve the metadata using https://api.crossref.org/v1/works/10.1021/acsomega.8b03005, albeit in a rather human-unfriendly manner.^† This may be because metadata as such is considered by CrossRef as something just for machines to process and not for humans to see!
  - This metadata indicates "references-count":22, which is a bit odd since 37 are actually cited in the article. It is not immediately obvious why there is a difference of 15 (I am querying this with the editor of the journal). None of the references themselves are included in the metadata record, because the publisher does not currently support liberation using Open References, which makes it difficult to track the missing ones down.
    
    This last inference can be tested using metadata from this article[cite]10.1039/C7SC03595K[/cite] using e.g.
    https://api.crossref.org/v1/works/10.1039/C7SC03595K or
    https://data.datacite.org/application/vnd.datacite.datacite+xml/10.1039/C7SC03595K
    which reveals a full citation list, including explicit citations to data objects as per: https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/1620
    
    Of the 37 citations listed in the article itself,[cite]10.1021/acsomega.8b03005[/cite] #22, #24 and #37 are different, being citations to different data sources. The first of these, #22 is an explicit reference to its data partner for the article.
    
    An alternative method of invoking a metadata record:
    https://data.datacite.org/application/vnd.datacite.datacite+xml/10.1021/acsomega.8b03005
    retrieves a sub-set of the article metadata available using the CrossRef query,^‡ but again with no included references and again nothing for the data citation #22.
2. Citation #22 in the above does have its own metadata record, obtainable using:
  https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/4751
  - This has an entry
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsReferencedBy">10.1021/acsomega.8b03005</relatedIdentifier>
    which points back to the article.[cite]10.1021/acsomega.8b03005[/cite]
3. To summarise, the article noted above[cite]10.1021/acsomega.8b03005[/cite] has a metadata record that does not include any information about the references/citations (apart from an ambiguous count). A human reading the article can however easily identify one citation pointing to the article data, which it turns out DOES have a metadata record which both human and machine can identify as pointing back to the article. Let us hope the publisher (the American Chemical Society) corrects this asymmetry in the future; it can be done as shown here![cite]10.1039/C7SC03595K[/cite]
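The mismatch described in item 1 can be checked mechanically. The sketch below works on a heavily trimmed, hand-written stand-in for a Crossref response (a real record has many more fields); it compares the declared `references-count` with the number of entries actually deposited under `reference` and with the number of citations a reader counts in the article. The function name `reference_summary` and the sample dictionary are illustrative assumptions, and the figures 22 and 37 are those reported above.

```python
# Hand-written stand-in for the JSON returned by the Crossref works route.
# Only the fields discussed in the text are kept; there is no "reference" list
# because, as noted above, the publisher does not expose the references.
SAMPLE_RESPONSE = {
    "status": "ok",
    "message": {
        "DOI": "10.1021/acsomega.8b03005",
        "references-count": 22,
    },
}


def reference_summary(response: dict, cited_in_article: int) -> dict:
    """Compare declared, deposited and human-counted references for one work."""
    message = response["message"]
    declared = message.get("references-count", 0)
    deposited = len(message.get("reference", []))  # empty when references are closed
    return {
        "declared": declared,
        "deposited": deposited,
        "unaccounted": cited_in_article - declared,
    }


print(reference_summary(SAMPLE_RESPONSE, cited_in_article=37))
```

A record with open references would show `deposited` equal to, or at least close to, `declared`; here the deposited list is empty and 15 citations are unaccounted for even in the count.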
For both types of metadata record, it is the publisher that retains any rights to modify them. Here however we encounter an interesting difference. The publishers of the data are, in this case, also the authors of the article! A modification to this record was made post-publication by this author so as to include the journal article identifier once it had been received from the publisher,[cite]10.1021/acsomega.8b03005[/cite] as in 2 above. Subsequently, these topics were discussed at a workshop on FAIR data, during which further pertinent articles[cite]10.1002/mrc.4806[/cite], [cite]10.1006/jmre.1997.1214[/cite], [cite]10.1006/jmre.2000.2071[/cite] relating to the one discussed above[cite]10.1021/acsomega.8b03005[/cite] were shown in a slide by one of the speakers. Since this was deemed to add value to the context of the data for the original article, identifiers for these articles were also appended to the metadata record of the data.

This now raises the following questions:
1. Should a metadata record be considered a living object, capable of being updated to reflect new information received after its first publication?
2. If metadata records are an intrinsic part of both a scientific article and any data associated with that article, should authors be fully aware of their contents (if only as part of due diligence to correct errors or to query omissions)?
3. Should the referees of such works also be made aware of the metadata records? It is of course enough of a challenge to get referees to inspect data (whether as SI or as FAIR), never mind metadata! Put another way, should metadata records be considered as part of the materials reviewed by referees, or something independent of referees and the responsibility of their publishers?
4. More generally, how would/should the peer-review system respond to living metadata records? Should there be guidelines regarding such records? Or ethical considerations?
I pose these questions because I am not aware of much discussion around these topics; I suggest there probably should be!

^‡Actually CrossRef and DataCite exchange each other’s metadata. However, each uses a somewhat different schema, so some components may be lost in this transit.

^†JSON, which is not particularly human friendly.
April 8, 2019
“Richer metadata makes content more useful”
The title of this post comes from the site www.crossref.org/members/prep/ Here you can explore how your favourite publisher of scientific articles exposes metadata for their journal.

Firstly, a reminder that when an article is published, the publisher collects information about the article (the “metadata”) and registers this information with CrossRef in exchange for a DOI. This metadata in turn is used to power e.g. a search engine which allows “rich” or “deep” searching of the articles to be undertaken. There is also what is called an API (Application Programming Interface) which allows services to be built offering deeper insights into what are referred to as scientific objects. One such service is “Event Data”, which attempts to create links between various research objects such as publications, citations, data and even commentaries in social media. A live feed can be seen here.

So here are the results for the metadata provided by six publishers familiar to most chemists, with categories including:
1. References
2. Open References
3. ORCID IDs
4. Text mining URLs
5. Abstracts
[Figures: Crossref participation reports for RSC, ACS, Elsevier, Springer-Nature, Wiley and Science]

One immediately notices the large differences between publishers. Thus most have 0% metadata for the article abstracts, but one (the RSC) has 87%! Another striking difference is those that support open references (OpenCitations). The RSC and Springer Nature are 99-100% compliant whilst the ACS is 0%. Yet another variation is the adoption of the ORCID (Open Researcher and Collaborator Identifier), where the learned society publishers (RSC, ACS) achieve > 80%, but the commercial publishers are in the lower range of 20-49%.

To me the most intriguing was the Text mining URLs. From the help pages, “The Crossref REST API can be used by researchers to locate the full text of content across publisher sites. Publishers register these URLs – often including multiple links for different formats such as PDF or XML – and researchers can request them programatically”. Here the RSC is at 0%, ACS is at 8% but the commercial publishers are 80+%. I tried to find out more at e.g. https://www.springernature.com/gp/researchers/text-and-data-mining but the site was down when I tried. This can be quite a controversial area. Sometimes the publisher exerts strict control over how the text mining can be carried out and how any results can be disseminated. Aaron Swartz famously fell foul of this.

I am intrigued as to how, as a reader with no particular pre-assembled toolkit for text mining, I can use this metadata provided by the publishers to enhance my science. After all, 80+% of articles with some of the publishers apparently have a mining URL that I could use programmatically. If anyone reading this can send some examples of the process, I would be very grateful.
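One possible first step, offered only as a sketch: a Crossref work record can carry a `link` list whose entries are marked with an `intended-application`, and the entries marked `text-mining` are the URLs in question. The code below filters such a list. The sample record, its DOI placeholder and its URLs are entirely made up for illustration, and the function name `text_mining_links` is hypothetical; whether a given URL can actually be downloaded still depends on the publisher's licence and access controls.

```python
# Invented example of the "link" section of a Crossref work record.
SAMPLE_WORK = {
    "DOI": "10.0000/example",  # placeholder, not a real DOI
    "link": [
        {
            "URL": "https://publisher.example/fulltext.pdf",
            "content-type": "application/pdf",
            "intended-application": "text-mining",
        },
        {
            "URL": "https://publisher.example/fulltext.xml",
            "content-type": "application/xml",
            "intended-application": "text-mining",
        },
        {
            "URL": "https://publisher.example/fulltext.html",
            "content-type": "text/html",
            "intended-application": "similarity-checking",
        },
    ],
}


def text_mining_links(work: dict, content_type=None) -> list:
    """Return the URLs registered for text mining, optionally for one format."""
    urls = []
    for link in work.get("link", []):
        if link.get("intended-application") != "text-mining":
            continue
        if content_type is not None and link.get("content-type") != content_type:
            continue
        urls.append(link["URL"])
    return urls


print(text_mining_links(SAMPLE_WORK))
print(text_mining_links(SAMPLE_WORK, content_type="application/xml"))
```

Preferring the XML link where one exists is a common choice, since structured full text is usually easier to process than PDF.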

Finally I note the absence of any metadata in the above categories relating to FAIR data. Such data also has the potential for programmatic procedures to retrieve and re-use it (some examples are available here[cite]10.1021/acsomega.8b03005[/cite]), but apparently publishers do not (yet) collect metadata relating to FAIR. Hopefully they soon will.
February 16, 2019

Re-inventing the anatomy of a research article.

The traditional structure of the research article has been honed and perfected for over 350 years by its custodians, the publishers of scientific journals. Nowadays, for some journals at least, it might be viewed as much as a profit centre as the perfected mechanism for scientific communication. Here I take a look at the components of such articles to try to envisage its future, with the focus on molecules and chemistry.

The formula which is mostly adopted by authors when they sit down to describe their chemical discoveries is more or less as follows:

1. An introduction, setting the scene for the unfolding narrative.
2. Results. This is where much of the data from which the narrative is derived is introduced. Such data can be presented in the form of:
  - Tables
  - Figures and schemes
  - Numerical and logical data embedded in narrative text
3. Discussion, where the models constructed from the data are illustrated and new inferences presented. Very often categories 2 and 3 are conflated into one single narrative.
4. Conclusions, where everything is brought together to describe the essential aspects of the new science.
5. Bibliography, where previous articles pertinent to the narrative are listed.

In the last decade or so, the management of research data has developed as a field of its own, with three phases:

6. Setting out a data management plan at the start of the project, often a set of aspirations together with putative actions,
7. the day-to-day management of the data as it emerges in the form of an electronic laboratory notebook (ELN),
8. the publication of selected data from the ELN into a repository, together with the registration of metadata describing the properties of the data.

In the latter category, item 8 can be said to be a game-changer, a true disruptive influence on the entire process. The key aspect is that it constitutes independent publication of data to sit alongside the object constructed from 1-5. More disruption emerges from the open citations project, whereby category 5 above can be released by publishers to adopt its own separate existence. So now we see that of the five essential anatomic components of a research article, two are already starting to achieve their own independence. Clearly the re-invention of the anatomy of the research article is well under way already.

Next I take a look at what sorts of object might be found in category 8, drawing very much on our own experience of implementing 7 and 8 over the last twelve years or so. I start by observing that in category 2 above, figures are perhaps the object most in need of disruptive re-invention. In the 1980s, authors were much taken by the introduction of colour as a means of conveying information within a figure more clearly, although the significant costs then had to be borne directly by these authors (and with a few journals this persists to this day). By the early 1990s, the introduction of the Web[cite]10.1039/C39940001907[/cite] offered new opportunities not only of colour but of an extra dimension (or at least the illusion of one) by means of introducing interactivity for three-dimensional models. Some examples resulting from combining figures from category 2 with 8 above are listed in the table below.

Examples of re-invented data objects from category 2
Example	Object title	Object DOI	Article DOI
1	Figure 9. Catalytic cycle involving one amine …etc.	10.14469/hpc/1854	10.1039/C7SC03595K
2	FAIR Data Figure. Mechanistic insights into boron-catalysed direct amidation reactions	10.14469/hpc/4919	10.1039/C7SC03595K
3	FAIR Data table. Computed relative reaction free energies (kcal/mol-1) of Obtusallene derived oxonium and chloronium cations	10.14469/hpc/1248	10.1021/acs.joc.6b02008
4	(raw) NMR data for Epimeric Face-Selective Oxidations …	10.14469/hpc/1267	10.1021/acs.joc.6b02008
5	Bibliography	10.14469/hpc/1116	10.1021/acs.joc.6b02008

Example 1 illustrates how a figure from category 2 above can be augmented with active hyperlinks specifying the DOI of the data in category 8 from which the figure is derived, thus creating a direct and contextual connection between the research article and the research data it is based upon. These links are embedded only in the Acrobat (PDF) version of the article as part of the production process undertaken by the journal publisher. Download Figure 9 from the link here and try it for yourself or try the entire article from the journal, where more figures are so enhanced.

Example 2 takes this one stage further. The hyperlinks in the published figure in example 1 were embedded in software capable of resolving them, namely a PDF viewer. But that is all that this software allows. By relocating the hyperlink into a Web browser instead, one can add further functionality in the form of JavaScripts, perhaps better described as workflows (supported by browsers but not by Acrobat). There are three such workflows in example 2.

1. The first uses an image map to associate a region of the figure with a data object defined by a DOI.
2. The second interrogates the metadata specifically associated with the DOI (the same DOIs that are seen in the figure itself) to see if there is any so-called ORE metadata available (ORE = Object Re-use and Exchange). If there is, it uses this information to retrieve the data itself and pass it through to
3. the third workflow, represented by a set of JavaScripts known as JSmol. These interpret the data received and construct an interactive visual 3D molecular model representing the retrieved data.

All this additional workflowed activity is implemented in a data repository. It is not impossible that it could also be implemented at the journal publisher end of things, but it is an action that would have to be supported by multiple publishers. Arguably this sort of enhancement is far better suited and more easily implemented by a specialised data publisher, i.e. a data repository.

Example 3 does the same thing for a table.

Example 4 enhances in a different manner. Conventionally NMR data is added to the supporting information file associated with a journal article, but such data is already heavily processed and interpreted. The raw instrumental data is never submitted to the journal and is almost always available only by direct request from the original researchers (at least if the request is made whilst the original researchers are still contactable!). The data repository provides a new mechanism for making such raw instrumental (and indeed computational) data an integral part of the scientific process.

Example 5 shows how a bibliography can be linked to a secondary bibliography (citations 35 and 36 in this example in the narrative article) and perhaps in the future to Open Citations semantic searches for further cross references.

So by deconstructing the components of the standard scientific article, re-assembling some of them in a better-suited environment and then linking the two sets of components to each other, one can start to re-invent the genre and hopefully add more tools for researchers to use to benefit their basic research processes. The scope for innovation seems considerable. The issue of course is (a) whether publishers see this as a viable business model or instead wish to protect their current model of the research article and (b) whether authors wish to climb the learning curve and make the additional effort to go in this direction. As I have noted before, the current model is deficient in various ways; I do not think it can continue without significant reinvention for much longer. And I have to ask: if reinvention does emerge, will science be the prime beneficiary?

December 29, 2018

Open Access journal publishing debates – the elephant in the room?

For perhaps ten years now, the future of scientific publishing has been hotly debated. The traditional models are often thought to be badly broken, although convergence to a consensus of what a better model should be is not apparently close. But to my mind, much of this debate seems to miss one important point, how to publish data.

Thus, at one extreme is cOAlition S, a model which promotes the key principle that “after 1 January 2020 scientific publications on the results from research funded by public grants provided by national and European research councils and funding bodies, must be published in compliant Open Access Journals or on compliant Open Access Platforms.” This includes ten principles, one of which, “The ‘hybrid’ model of publishing is not compliant with the above principles”, has revealed some strong dissent (see forbetterscience.com/2018/09/11/response-to-plan-s-from-academic-researchers-unethical-too-risky). I should explain that hybrid journals are those where the business model includes both institutional closed access to the journal via a subscription charge paid by the library, coupled with the option for individual authors to purchase an Open Access release of an article so that it sits outside the subscription. The dissenters argue that non-OA and hybrid journals include many traditional ones, which especially in chemistry are regarded as those with the best impact factors and very much as the journals to publish in to maximise the readership, and hence the impact of the research and thus the researcher’s career prospects. Many (not all) of the American Chemical Society (ACS) and Royal Society of Chemistry (RSC) journals currently fall into this category, as well as journals from commercial publishers such as Nature, Nature Chemistry, Science, Angew. Chemie, etc.

So the debate is whether funded top ranking research in chemistry should in future always appear in non-hybrid OA journals (where the cost of publication is borne by article processing charges, or APCs) or in traditional subscription journals where the costs are borne by those institutions that can afford the subscription charges, but which of course also limit the access. One measure of how important and topical the debate has become is that there is now even a movie devoted to it, which makes the point of how profitable commercial scientific publishing now is and hence how much resource is being diverted into these profit margins at the expense of funding basic science.

None of these debates however really takes a close look at the nature of the modern research paper. In chemistry at least, the evolution of such articles in the last 20 years (~ corresponding to the online era) has meant that whilst the size of the average article has remained static at around 10 “pages” (in quotes because of course the “page” is one of those legacy concepts related to print), another much newer component known as “Supporting information” or SI^♥ has ballooned to absurd sizes. It can reach 1000 pages[cite]10.1021/jacs.6b13229[/cite] and there are rumours of even larger SIs. The content of SI is of course mostly data. The size is often because the data is present in visual form (think spectra). As visual information, it is not easily “inter-operable” or “accessible”. Nor is it “findable” until commercial abstracting agencies choose to index it. Searches of such indexed data are most certainly “closed” (again depending on institutional purchases of access) and not “open access”. You may recognise these attributes as those of FAIR (Findable, accessible, inter-operable and re-usable). So even if an article in chemistry is published in pure OA form, in order to get FAIR access to the data associated with the article, you will probably have to go to a non-OA resource run by a commercial organisation for profit. Thus a 10 page article might itself be OA, but the full potential of its 1000+ page data (an elephant if ever there was one) ends up being very much not OA.

You might argue that the 1000+ pages of data does not require the services of an abstracting agency to be useful. Surely a human can get all the information they want from inspecting a visual spectrum? Here I raise the future prospects of AI (artificial intelligence). The ~1000 page SI I noted above[cite]10.1021/jacs.6b13229[/cite] includes e.g. NMR spectra for around 70 compounds (I tried to count them all visually, but could not be certain I found them all). A machine, trained to identify spectra from associated metadata (a feature of FAIR), could extract vastly more information than a human could from FAIR raw data^‡ (a spectrum is already processed data, with implied information/data loss) in a given time. And for many articles, not just one. Thus FAIR data is very much targeted not only at humans but at the AI-trained machines of the future.

So I repeat my assertion that focussing on whether an article is OA or not, and on whether publishing in hybrid journals is to be allowed or not by funders, is missing that 100-fold bigger elephant in the room. For me, a publishing model that is fit for the future should include as a top priority a declaration of whether the data associated with it is FAIR. Yet in the ten Plan S principles, FAIR is not mentioned at all. Only when FAIR-enabled data becomes part of the debates can we truly say that the article and its data are on their way to being properly open access.

^‡The FAIR concept did not originally differentiate between processed data (i.e. spectra) and the underlying primary or raw data on which the processed data is based. Our own implementation of FAIR data includes both types of data: raw for machine reprocessing if required, and processed data for human interpretation, along with a rich set of metadata, itself often created using carefully designed workflows conducted by machines.

^♥The proportion of articles relating to chemistry which do not include some form of SI is probably low. These would include articles which simply provide a new model or interpretation of previously published data, reporting no new data of their own. A famous historical example is Michael Dewar’s re-interpretation of the structure of stipitatic acid[cite]10.1038/155050b0[/cite] which founded the new area of non-benzenoid aromaticity.

November 4, 2018
Harnessing FAIR data: A suggested useful persistent identifier (PID) for quantum chemical calculations.
Harnessing FAIR data is an event being held in London on September 3rd; no doubt all the speakers will espouse its virtues and speculate about how to realize its potential.^♥ Admirable aspirations indeed. Capturing hearts and minds also needs lots of real life applications! Whilst assembling a forthcoming post on this blog, I realized I might have one nice application which also pushes the envelope a bit further, in a manner that I describe below.

The post I refer to above is about using quantum chemical calculations to chart possible mechanistic pathways for the reaction between a carboxylic acid and an amine to form an amide. The FAIR data for the entire project is collected at DOI: 10.14469/hpc/4598. Part of what makes it FAIR is the metadata not only collected about this data but also formally registered with the DataCite agency. Registration in turn enables Finding; it is this aspect I want to demonstrate here.

The metadata for the above DOI includes information such as;
1. The ORCID persistent identifier (PID) for the creator of the data (in this instance myself)
2. Date stamps for the original creation date and subsequent modifications.
3. A rights declaration, in this case the CC0 license which describes how the data can be re-used.
4. Related identifiers, in this case describing members of this collection.
The data itself is held in the members of the collection, each of which is described by a more specific set of metadata in addition to the more general types in the above list (e.g. 10.14469/hpc/4606).
1. One important additional metadata descriptor is the ORE locator (Object Re-use and Exchange, itself almost a synonym for FAIR). This allows a machine to deduce a direct path to the data file itself, and hence to retrieve it automatically if desired. It is important to note that the DOI itself (i.e. 10.14469/hpc/4606) points only to the “landing page” for the dataset, and does not necessarily describe the direct path to any specific file in the dataset. The ORE path can be used with e.g. software such as JSmol to directly load a molecule based only on its DOI. You can see an example of this here.
2. Each molecule-based dataset contains additional specific metadata relating to the molecule itself. For example this is how the InChIKey, an identifier specific to that molecule, is expressed in metadata;
  <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">PVXKWVPAMVWJSQ-UHFFFAOYSA-N</subject>
  The advantage of expressing the metadata in this way is that a general search of the type:
  https://commons.datacite.org/doi.org?query=subjects.subjectScheme:inchikey+AND+subjects.subject:CZABGBRSHXZJCF-UHFFFAOYSA-N
  can be used to track down any molecule with metadata corresponding to the above InChIKey.
3. Here is more metadata, introduced in this blog. It relates to the (computed) value of the Gibbs energy (the energy unit is in Hartree^†), as returned by the Gaussian program;
  <subject subjectScheme="Gibbs_Energy" schemeURI="https://goldbook.iupac.org/html/G/G02629.html" valueURI="http://gaussian.com/thermo/">-649.732417</subject>
  I here argue that it represents a unique identifier for a molecule calculation using the quantum mechanical procedures implemented in e.g. Gaussian. This identifier is different from the InChIKey, in that it can be truncated to provide different levels of information.
  - At the coarsest level, a search of the type
    https://commons.datacite.org/doi.org?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:\-649.*
    should reveal all molecules with the same number of atoms and electrons whose Gibbs energy has been calculated, but not necessarily with the same InChI (i.e. they may be isomers, or transition states, etc). This level might be useful for revealing most (not necessarily all^‡) molecules involved in say a reaction mechanism. It should also be insensitive to the program system used, since most quantum codes will return a value for the Gibbs energy if the same procedures have been used (i.e. DFT method, basis set, solvation model and dispersion correction) accurate to probably 0.01 Hartree.
  - The top level of precision however is high enough to almost certainly relate to a specific molecule and probably using a specific program;
    https://commons.datacite.org/doi.org?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:\-649.732417
  - The searcher can experiment with different levels of precision to narrow or broaden the search.
  - I would also address the issue (before someone asks) of why I have used the Gibbs energy rather than the Total energy. Put simply, the Gibbs energy is far more useful in a chemical context. It can be used to relate the relative Gibbs energies of different isomers of the same molecule to e.g. the equilibrium constant that might be measured. Or the difference in Gibbs energies between a reactant and a transition state can be used to derive the free energy activation barrier for a reaction. The total energy is not so useful in such contexts, although of course it too could be added as a subject in the metadata above if a real use for it is found.
4. The searcher can also use Boolean combinations of metadata, such as specifying both the InChIKey and the Gibbs Energy, along with say the ORCID of the person who may have published the data;
  https://commons.datacite.org/doi.org?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:\-649.*+AND+subjects.subjectScheme:inchikey+AND+subjects.subject:CZABGBRSHXZJCF-UHFFFAOYSA-N+AND+contributors.nameIdentifiers.nameIdentifier:*0000-0002-8635-8390^♥
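The truncated searches of item 3 and the Boolean combination of item 4 can also be assembled programmatically. What follows is a minimal sketch in Python, assuming only the query form visible in the URLs above. The function names (`subject_clause`, `gibbs_pattern`, `search_url`) are hypothetical helpers introduced for illustration and are not part of any DataCite client library. The code only builds the query strings; it does not contact DataCite.

```python
from typing import Optional

# Search URL prefix copied from the examples in this post.
BASE = "https://commons.datacite.org/doi.org?query="


def subject_clause(scheme: str, value: str) -> str:
    """Pair a subjectScheme with a subject value, as in the queries above."""
    return f"subjects.subjectScheme:{scheme}+AND+subjects.subject:{value}"


def gibbs_pattern(energy: str, decimals: Optional[int] = None) -> str:
    """Turn an energy string such as "-649.732417" into a search pattern.

    decimals=None keeps the full recorded precision. Otherwise the value is
    truncated (not rounded) to that many decimal places and a trailing
    wildcard is appended. The minus sign is escaped as in the post's URLs.
    """
    sign = "\\-" if energy.startswith("-") else ""
    digits = energy.lstrip("-")
    if decimals is None:
        return sign + digits
    whole, _, frac = digits.partition(".")
    return f"{sign}{whole}.{frac[:decimals]}*"


def search_url(*clauses: str) -> str:
    """Join any number of clauses with AND into one search URL."""
    return BASE + "+AND+".join(clauses)


# Coarsest level: everything whose Gibbs energy starts with -649.
coarse = search_url(subject_clause("Gibbs_Energy", gibbs_pattern("-649.732417", 0)))

# Full precision: almost certainly one specific calculation.
exact = search_url(subject_clause("Gibbs_Energy", gibbs_pattern("-649.732417")))

# Boolean combination of energy and InChIKey, as in item 4.
combined = search_url(
    subject_clause("Gibbs_Energy", gibbs_pattern("-649.732417", 0)),
    subject_clause("inchikey", "CZABGBRSHXZJCF-UHFFFAOYSA-N"),
)
```

Truncating the string rather than rounding the number is deliberate here, since a wildcard search matches on the leading characters of the stored value.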
I have tried to show above how FAIR data implies some form of rich (registered) metadata. And how the metadata can be used to Find (the F in FAIR) data with very specific properties, thus Harnessing FAIR data.

^†It is a current limitation of the V4.1 DataCite schema that there appears no way to specify the data type of the subject, including any units.
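Because the unit cannot be recorded in the metadata, anyone re-using the value has to assume it is in Hartree. As a sketch of the chemical use mentioned in item 3 above, the snippet below converts the difference between two Gibbs energies into kJ/mol and then into an equilibrium constant via K = exp(-ΔG/RT). The function names are hypothetical, and the second energy in the usage lines is an invented value for a supposed isomer, not a result from the dataset.

```python
import math

HARTREE_TO_KJ_PER_MOL = 2625.4996   # 1 Hartree expressed in kJ/mol
R_KJ_PER_MOL_K = 8.314462618e-3     # gas constant in kJ/(mol K)


def delta_g_kj_per_mol(g_a_hartree: float, g_b_hartree: float) -> float:
    """Gibbs energy change for A -> B in kJ/mol, from values in Hartree."""
    return (g_b_hartree - g_a_hartree) * HARTREE_TO_KJ_PER_MOL


def equilibrium_constant(g_a_hartree: float, g_b_hartree: float,
                         temperature_k: float = 298.15) -> float:
    """K = [B]/[A] for the equilibrium between A and B at the given temperature."""
    dg = delta_g_kj_per_mol(g_a_hartree, g_b_hartree)
    return math.exp(-dg / (R_KJ_PER_MOL_K * temperature_k))


g_a = -649.732417   # value recorded in the metadata shown above
g_b = -649.731417   # hypothetical isomer, 0.001 Hartree higher (assumption)
print(delta_g_kj_per_mol(g_a, g_b), equilibrium_constant(g_a, g_b))
```

The same difference taken between a reactant and a transition state gives the free energy barrier in kJ/mol directly from `delta_g_kj_per_mol`.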

^‡In theory, a range query of the type:
https://commons.datacite.org/doi.org?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:[\-649.1 TO \-649.8] should be more specific, but I have not yet got it to work, probably because the lack of data-typing means it is not recognised as a range of numeric values.
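Until such a range query works on the server, one workaround is to run the coarse wildcard search and then filter the returned records numerically on the client. The sketch below assumes the metadata has already been retrieved as DataCite-style XML; the `RECORD` fragment is a simplified stand-in built from the two subject elements quoted earlier, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Simplified stand-in for one record's subjects (real records carry more
# elements and usually an XML namespace, which is tolerated below).
RECORD = """<subjects>
  <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">PVXKWVPAMVWJSQ-UHFFFAOYSA-N</subject>
  <subject subjectScheme="Gibbs_Energy" schemeURI="https://goldbook.iupac.org/html/G/G02629.html" valueURI="http://gaussian.com/thermo/">-649.732417</subject>
</subjects>"""


def gibbs_energy(xml_text: str) -> Optional[float]:
    """Return the Gibbs_Energy subject as a float, or None if absent."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        local_name = element.tag.split("}")[-1]   # drop any namespace prefix
        if local_name == "subject" and element.get("subjectScheme") == "Gibbs_Energy":
            return float(element.text.strip())
    return None


def in_range(value: float, bound_1: float, bound_2: float) -> bool:
    """Numeric range test; the two bounds may be given in either order."""
    low, high = sorted((bound_1, bound_2))
    return low <= value <= high


energy = gibbs_energy(RECORD)
keep = energy is not None and in_range(energy, -649.1, -649.8)
```

Sorting the bounds matters for negative energies, where -649.8 is the lower limit even though it looks like the larger number.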

^♥Implicit in this search is the grouping
https://commons.datacite.org/doi.org?query=(subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:\-649.*) + (subjects.subjectScheme:inchikey+AND+subjects.subject:CZABGBRSHXZJCF-UHFFFAOYSA-N)+AND+contributors.nameIdentifiers.nameIdentifier:*0000-0002-8635-8390
Currently however DataCite do not correctly honour this form of grouping.

^♥Video of the speakers and the panel session at the end is now available.
August 7, 2018
First, Open Access, then Open (and FAIR) Data, now Open Citations.

The topic of open citations was presented at the PIDapalooza conference and represents a third component in the increasing corpus of open scientific information.

David Shotton gave us an update on Citations as First Class data objects – Citation Identifiers and introduced me to the blog where he discusses this topic. The citations or bibliography has long been regarded as an essential, and until recently inseparable, component at the end of a scientific article. It is also a component easily susceptible to “game play”. Authors can be tempted to self-cite, possibly to excess and, perhaps worse, to cite their friends and colleagues for other than purely scientific reasons. There are other issues. Thus to infer the context of any particular citation, one has to read the text where it is cited and this too can be subjected to game play. One may have to “read between the lines” to try to judge whether the citation is being cited favourably as supporting any case being made, or instead to indicate disagreement with the cited authors. An article that is being cited because one disagrees with the conclusions therein may still go on to contribute to the cited author’s “h-index” of esteem. So there are various aspects of citations that deserve improvement, or certainly development and evolution.

Shotton told us that many publishers are now releasing article citations as open (CC0) data in their own right, as urged on the Initiative for Open Citations site. A corpus of some 13 million of these is now available as RDF triples with a SPARQL end-point. This latter means that semantic searches of the corpus can be undertaken. So what are the benefits? Worthy aspirations such as to explore connections between knowledge fields, and to follow the evolution of ideas and scholarly disciplines (similar in fact to the new Dimensions product I discussed in the previous post). When I probed into the various sites linked above, I had in mind to identify some clear scientific outcomes of making them available in this manner, perchance even in the field of chemistry. When I succeed I will follow up on this post, but at the moment I am not yet in a position to illustrate these benefits with chemical stories. If anyone reading this post has such, please let us know!

I will conclude here by noting much discussion at universities of the future of the scientific article itself; whether it should be increasingly mandated as GOLD Open Access (made so by payment of an article processing charge, or APC, by its authors), or whether journals should retain the hybrid publishing models where only a proportion of articles are GOLD, and the remainder are paid for by subscription fees for licensing access to the non-GOLD articles in the journal. Meanwhile, in what sometimes seems a separate conversation, the article itself is being dis-assembled into components such as open and/or FAIR data, open citations, infographics, social media and yes, even blogs. Are these two evolutions headed in different directions? Certainly, I think the future is not what it used to be!

February 3, 2018
PIDapalooza 2018. A conference like no other!
Another occasional conference report (day 1). So why is one about “persistent identifiers” important, and particularly to the chemistry domain?

The PID most familiar to most chemists is the DOI (digital object identifier). In fact there are many; some 60 types have been collected by ORCID (themselves purveyors of researcher identifiers). They sometimes even have different names; in life sciences they tend to be known instead as accession numbers. One theme common to many (probably not all) is that they represent sources of metadata about the object being identified; information which allows you (or a machine) to decide if acquiring the full object is worthwhile. So in no particular order, here are some of the things I learnt today.
1. Mark Hahnel noted the recent launch of the Dimensions resource which links research data with other research activities; I have not yet had a chance to learn its capabilities, but it seems an interesting alternative to stalwarts such as Google Scholar.
  You can try this example: https://app.dimensions.ai/discover/publication?search_text=10.6084&search_type=kws&full_search=true which retrieves articles in which the data repository with prefix 10.6084 (Figshare) is cited. Try also the prefix 10.14469 which is the Imperial College repository.
2. Andy Mabbett talked about the deployment and use of persistent identifiers (the Q numbers) in Wikidata, which increasingly underpin the basis for the various flavours of Wikipedia. He also noted their use of some 50 different identifiers.
3. Johanna McEntyre noted some 5M published articles in life sciences which reference 1M+ ORCID identifiers, easily the domain with the fastest uptake of this type. Also noted was the new FREYA project; aiming to connect open identifiers for discovery, access and use of research resources.
4. Tom Gillespie talked about RRID, or Research Resource Identifiers. These cover hardware, including instruments, with around 6000 RRIDs systematized so far. They argue this area promotes both the A and I of FAIR (accessible and inter-operable). Of course A and I mean many things to many people.
5. Several other presentations talked about the finer detail of metadata, such as sub-classifications into e.g. descriptive/admin/technical, but I did rather miss demos showing how search queries of such fine-grained metadata could be constructed.
Apart from the presentations themselves, PIDapalooza is unusual for some other activities. Thus you could go get your PIDnails done, with a selection of 8 or so tasteful logos to choose from. There will be tattoos tomorrow (this is a conference for younger people after all). I may grab a photo or two to provide evidence!
January 23, 2018
Two stories about Open Peer Review (OPR), the next stage in Open Access (OA).

We have heard a lot about OA or Open Access (of journal articles) in the last five years, often in association with the APC (Article Processing Charge) model of funding such OA availability. Rather less discussed is how the model of the peer review of these articles might also evolve into an Open environment. Here I muse about two experiences I had recently.

Organising the peer review of journal articles is often now seen as the single most important activity a journal publisher can undertake on behalf of the scientific community; the very reputation of the journal depends on this process being conducted responsibly, thoroughly and with integrity by the selected reviewers. Reviewers undertake this process voluntarily, mostly anonymously, without remuneration or recognition and often with short deadlines for completion. After one such review, I recently received an interesting follow-up email from the journal, suggesting I register my activity with Publons.com, a site set up to register and give non-anonymous credit for reviewing activities. I should say that Publons is a commercial company, set up in 2012 to “address the static state of peer-reviewing practices in scholarly communication, with a view to encourage collaboration and speed up scientific development”. Worthy aims, but like many a .com company nowadays, one might ask what the back-story might be. Thus many of the Internet giants, Google, Facebook, Twitter etc, do have back-stories, which often underpin their business models, but which may only emerge years after their founding. With only a hazy idea of what Publons’ back-story might be, I went ahead and registered my reviewing activity.

After doing so, I then accessed my entry. You only learn that I have reviewed for a particular journal, but nothing about the actual process itself. I did not really think that this experiment had done much to encourage collaboration and speed up scientific development. It might be useful for early career researchers to get their name exposed however.

I can almost understand why the review itself might not be publicly displayed, but as a result you learn nothing about the factual basis of the review and whether it might have been conducted responsibly, thoroughly and with integrity. Instead, I now suspect that the presence of my name on this site might merely encourage other publishers to deluge me with requests for further (freely donated) refereeing.

Discussing this at lunch, a colleague (thanks Ed!) reminded me of a venerable journal called Organic Syntheses. Here, authors submit a synthetic procedure and openly identified “checkers” are invited to repeat the procedure and comment on it. The two roles are kept separate (i.e. the checkers do not become co-authors), but they could get credit for their activity. Thus if you view a typical recent entry[cite]10.15227/orgsyn.094.0217[/cite] you will see a full biography and affiliation of the checkers given at the end, with footnotes often describing their own observations if they differ from those of the authors.

This set me thinking whether an open peer review process might also contain such an element of checking, as well as informed comment, nay opinion, about the article itself and the conclusions it makes. The opportunity arose when I was contacted by an author who was about to submit a computational article to a journal. This journal allowed open peer review. If I agreed to review, my name would be attached to the article if accepted for publication. I undertook this on the basis that I would use this review to conduct some limited checking of the computations and other assumptions underpinning the conclusions in the submitted article. I also wanted this open process to include the data on which my review was based. Most importantly if anyone wished to replicate my replication, the barriers to doing so should be as low as is possible. Shortly thereafter, I received a formal invitation from the journal and I set about my task. Crucially, all my own calculations supporting the review were archived in a data repository, albeit under embargo. In my cover letter I included the DOI for my data and the embargo access code, so that the authors (and the editor of the journal if they so wished) could inspect the data against which I wrote my review.

Then followed standard procedures, whereby the authors took my comments into consideration, revised the article and the final version was indeed accepted and published.[cite]10.1073/pnas.1709586114[/cite] You will find the two referees/checkers listed, although unlike Organic Syntheses, there is no biographical information about them or their affiliation. I did ask the journal if they could at least link my ORCID identifier to my name, but that request was refused. If my name had been a common one, then disambiguating it into a unique identity could be a challenge. There was also no mechanism to associate my identity on the journal with any data on which I had based my review. Really, the only open aspect of this process was just my (potentially ambiguous) name, nothing else. No follow-up was received from the journal to add the review to Publons.

The next stage was to contact the author who had originally set the process under way to ask them if they would mind my releasing the data on which my review had been based. They agreed, as also they did to my telling this story. The overall outcome is thus a published article with the reviewers (if not their reviews or any supporting evidence for their review) openly named. In this specific case, there is also an open dataset with a formal link back to the article in the form of a DOI (10.14469/hpc/2640, although I suspect this aspect is unique, even precedent setting), but one driven by the reviewer and not the journal. It would be nice to have bidirectional links between both article and the review data, but I do not know any publishers currently operating such a mechanism (if anyone knows such, please tell).

Now to the broader questions about the process described above. I think that the aspiration to encourage collaboration and speed up scientific development may indeed have been promoted by this association between article and the data assembled by the reviewer. Whether the final article was improved as a result of the processes described here I will leave to the authors to comment on, if they wish. As with the checkers employed by Organic Syntheses, such a review process takes not just time, but resources. Resources that currently have to be freely donated by the reviewers and their host institution and which clearly cannot become expensive, time-consuming or onerous. That was not the case as it happens here; my contributions were facilitated by my having sufficient expertise to perform the tasks I undertook really quite quickly.

I will raise one more issue; that of whether to add my review to the dataset which is now openly available. In fact it is not included, in part because it related to the initially submitted version of the MS. The final MS version has been revised and so many of the comments in my review may only make sense if you have the first version to hand. It would be perhaps unreasonable to make the first drafts of manuscripts routinely available (although historians of science would probably love that!) alongside the reviews of that first draft. But I could also see a case for doing so if the community agreed to it. One to discuss for the future I think. There is also the associated issue of what should happen to any dataset associated with a review in the event that the final article is rejected. Should the data remain permanently under embargo and the reviewer’s identity permanently anonymous? Perhaps opening up even such datasets might nevertheless encourage collaboration and speed up scientific development, but I fancy some would consider that a step too far!

October 5, 2017