Tag: Technology/Internet

The “Accessible” in FAIR (data).
In a previous post, I looked at the Findability of FAIR data in common chemistry journals. Here I move on to the next letter, the A = Accessible.

The attributes of A[cite]10.1038/sdata.2016.18[/cite] include:
1. (meta)data are retrievable by their identifier using a standardized communication protocol.
2. the protocol is open, free and universally implementable.
3. the protocol allows for an authentication and authorization procedure.
4. metadata are accessible, even when the data are no longer available.
5. The metadata should include access information that enables automatic processing by a machine as well as a person.
Items 1-2 are covered by associating a DOI (digital object identifier) with the metadata. Item 3 relates to data which is not necessarily also OPEN (FAIR and OPEN are complementary, but do not mean the same).

Item 4 mandates that a copy of the metadata be held separately from the data itself; currently the favoured repository is DataCite (and this metadata may well be duplicated at CrossRef, thus providing a measure of redundancy). It also addresses an interesting debate on whether the container for data such as a ZIP or other compressed archive should also contain the full metadata descriptors internally, which would not directly address item 4, but could do so by also registering a copy of the metadata externally with e.g. DataCite.

Item 4 also implies some measure of separation between the data and its metadata, which now raises an interesting and separate issue (introduced with this post) that the metadata can be considered a living object, with some attributes being updated post deposition of the data itself. Thus such metadata could include an identifier to the journal article relating to the data, information that only appears after the FAIR data itself is published. Or pointers to other datasets published at a later date. Such updating of metadata contained in an archive along with the data itself would be problematic, since the data itself should not be a living object.

Item 5 is the need for Accessibility to relate both to a human acquiring FAIR data and to a machine. The latter needs direct information on exactly how to access the data. To illustrate this, I will use data deposited in support of the previous post and for which a representative example of metadata can be found at (item 4) a separate location at:
data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/5496

This contains the components:
6. <relatedIdentifier relatedIdentifierType="URL" relationType="HasMetadata" relatedMetadataScheme="ORE" schemeURI="http://www.openarchives.org/ore/">https://data.hpc.imperial.ac.uk/resolve/?ore=5496</relatedIdentifier>
7. <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart" relatedMetadataScheme="Filename" schemeURI="filename://aW5wdXQuZ2pm">https://data.hpc.imperial.ac.uk/resolve/?doi=5496&file=1</relatedIdentifier>
Item 6 is a machine-suitable RDF declaration of the full metadata record. Item 7 allows direct access to the datafile. This in turn allows programmed interfaces to the data to be constructed, which include e.g. components for immediate visualisation and/or analysis. It also allows access on a large scale (mining), something a human is unlikely to try.
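As a minimal sketch of what such a programmed interface might look like, the Python below parses a fragment modelled on the two relatedIdentifier entries quoted above and extracts the ORE manifest URL and the direct file link. Two things here are assumptions rather than anything stated by the repository: the XML namespace that a full DataCite record declares is omitted for brevity, and the string following filename:// is treated as a Base64-encoded file name. The function name access_points is purely illustrative.

```python
import base64
import xml.etree.ElementTree as ET

# Fragment modelled on the two entries quoted above. A full DataCite record
# also declares an XML namespace, which is left out here (an assumption made
# to keep the sketch short). Note the "&" must be escaped in well-formed XML.
RECORD = """<relatedIdentifiers>
  <relatedIdentifier relatedIdentifierType="URL" relationType="HasMetadata"
      relatedMetadataScheme="ORE"
      schemeURI="http://www.openarchives.org/ore/">https://data.hpc.imperial.ac.uk/resolve/?ore=5496</relatedIdentifier>
  <relatedIdentifier relatedIdentifierType="URL" relationType="HasPart"
      relatedMetadataScheme="Filename"
      schemeURI="filename://aW5wdXQuZ2pm">https://data.hpc.imperial.ac.uk/resolve/?doi=5496&amp;file=1</relatedIdentifier>
</relatedIdentifiers>"""


def access_points(xml_text):
    """Return (ORE manifest URL, {file name: download URL}) for a record."""
    ore_url = None
    files = {}
    for element in ET.fromstring(xml_text).iter("relatedIdentifier"):
        scheme = element.get("relatedMetadataScheme")
        url = (element.text or "").strip()
        if scheme == "ORE":
            ore_url = url
        elif scheme == "Filename":
            # Assumption: the text after "filename://" is a Base64-encoded name.
            encoded = element.get("schemeURI", "").split("filename://", 1)[-1]
            files[base64.b64decode(encoded).decode("utf-8")] = url
    return ore_url, files


ore, files = access_points(RECORD)
print(ore)
print(files)
```

Under the Base64 assumption, the file behind the second link decodes to the name input.gjf, which a script could use to decide how to visualise or analyse the file before downloading it.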

It would be fair to say that the A of FAIR is still evolving. Moreover, searches of the DataCite metadata database are not yet at the point where one can automatically identify metadata records that have these attributes. When they do become available, I will show some examples here.

Added: This search: https://search.test.datacite.org/works?query=relatedIdentifiers.relatedMetadataScheme:ORE shows how it might operate.
April 18, 2019
Questions about the (metadata) components of a scientific article.
The conventional procedure for reporting analysis or new results in science is to compose an “article”, augment that perhaps with “supporting information” or “SI”, and submit it to a journal which undertakes peer review, with revision as necessary for acceptance and finally publication. If errors in the original are later identified, a separate corrigendum can be submitted to the same journal, although this is relatively rare. Any new information which appears post-publication is then considered for a new article, and the cycle continues. Here I consider the possibilities for variations in this sequence of events.

The new disruptors in the processes of scientific communication are the “data”, which can now be given a separate existence (as FAIR data) from the article and its co-published “SI”. Nowadays both the “article+SI” and any separate “data” have another, mostly invisible component, the “metadata”. Few authors ever see this metadata. For the article, it is generated by the publisher (as part of the service to the authors), and sent to CrossRef, which acts as a global registration agency for this particular metadata. For the data, it is assembled when the data is submitted to a “data repository”, either by the authors providing the information manually, or by automated workflows installed in the repository or by a combination of both. It might also be assembled by the article publisher as part of a complete metadata package covering both article and data, rather than being separated from the article metadata. Then, the metadata about data is registered with the global agency DataCite (and occasionally with CrossRef for historical reasons).^‡ Few depositors ever inspect this metadata after it is registered; even fewer authors are involved in decisions about that metadata, or have any inputs to the processes involved in its creation.

Let me analyse a recent example.
1. For the article[cite]10.1021/acsomega.8b03005[/cite] you can see the “landing page” for the associated metadata as https://search.crossref.org/?q=10.1021/acsomega.8b03005 and actually retrieve the metadata using https://api.crossref.org/v1/works/10.1021/acsomega.8b03005, albeit in a rather human-unfriendly manner.^† This may be because metadata as such is considered by CrossRef as something just for machines to process and not for humans to see!
  - This metadata indicates "references-count":22, which is a bit odd since 37 are actually cited in the article. It is not immediately obvious why there is a difference of 15 (I am querying this with the editor of the journal). None of the references themselves are included in the metadata record, because the publisher does not currently support liberation using Open References, which makes it difficult to track the missing ones down.
    
    This last inference can be tested using metadata from this article[cite]10.1039/C7SC03595K[/cite] using e.g.
    https://api.crossref.org/v1/works/10.1039/C7SC03595K or
    https://data.datacite.org/application/vnd.datacite.datacite+xml/10.1039/C7SC03595K
    which reveals a full citation list, including explicit citations to data objects as per: https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/1620
    
    Of the 37 citations listed in the article itself,[cite]10.1021/acsomega.8b03005[/cite] #22, #24 and #37 are different, being citations to different data sources. The first of these, #22 is an explicit reference to its data partner for the article.
    
    An alternative method of invoking a metadata record:
    https://data.datacite.org/application/vnd.datacite.datacite+xml/10.1021/acsomega.8b03005
    retrieves a sub-set of the article metadata available using the CrossRef query,^‡ but again with no included references and again nothing for the data citation #22.
2. Citation #22 in the above does have its own metadata record, obtainable using:
  https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/4751
  - This has an entry
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsReferencedBy">10.1021/acsomega.8b03005</relatedIdentifier>
    which points back to the article.[cite]10.1021/acsomega.8b03005[/cite]
3. To summarise, the article noted above[cite]10.1021/acsomega.8b03005[/cite] has a metadata record that does not include any information about the references/citations (apart from an ambiguous count). A human reading the article can however easily identify one citation pointing to the article data, which it turns out DOES have a metadata record which both human and machine can identify as pointing back to the article. Let us hope the publisher (the American Chemical Society) corrects this asymmetry in the future; it can be done as shown here![cite]10.1039/C7SC03595K[/cite]
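The comparison made by hand above, between the declared reference count and the references actually deposited, could be automated along the following lines. This is only a sketch: the SAMPLE record is hypothetical and heavily trimmed, merely following the general shape of a CrossRef works response (a "message" object carrying "references-count" and, where the publisher deposits them, a "reference" list), and the function name reference_summary is my own.

```python
import json

# Hypothetical, heavily trimmed response in the general shape of a CrossRef
# works record; the DOIs and values are illustrative only.
SAMPLE = json.loads("""
{"status": "ok",
 "message": {"DOI": "10.0000/example",
             "references-count": 3,
             "reference": [
                 {"key": "ref1", "DOI": "10.0000/a"},
                 {"key": "ref2", "unstructured": "A. Author, J. Ex., 2000"}
             ]}}
""")


def reference_summary(response):
    """Compare the declared reference count with the deposited references."""
    work = response["message"]
    declared = work.get("references-count", 0)
    # The "reference" list is absent when references are not openly deposited.
    deposited = work.get("reference", [])
    return {
        "declared": declared,
        "deposited": len(deposited),
        "with_doi": [ref["DOI"] for ref in deposited if "DOI" in ref],
        "missing": declared - len(deposited),
    }


print(reference_summary(SAMPLE))
```

A record with no deposited references at all would report its full declared count as missing, which is the situation described for the article above.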
For both types of metadata record, it is the publisher that retains any rights to modify them. Here however we encounter an interesting difference. The publishers of the data are, in this case, also the authors of the article! A modification to this record was made post-publication by this author so as to include the journal article identifier once it had been received from the publisher,[cite]10.1021/acsomega.8b03005[/cite] as in 2 above. Subsequently, these topics were discussed at a workshop on FAIR data, during which further pertinent articles[cite]10.1002/mrc.4806[/cite], [cite]10.1006/jmre.1997.1214[/cite], [cite]10.1006/jmre.2000.2071[/cite] relating to the one discussed above[cite]10.1021/acsomega.8b03005[/cite] were shown in a slide by one of the speakers. Since this was deemed to add value to the context of the data for the original article, identifiers for these articles were also appended to the metadata record of the data.

This now raises the following questions:
1. Should a metadata record be considered a living object, capable of being updated to reflect new information received after its first publication?
2. If metadata records are an intrinsic part of both a scientific article and any data associated with that article, should authors be fully aware of their contents (if only as part of due diligence to correct errors or to query omissions)?
3. Should the referees of such works also be made aware of the metadata records? It is of course enough of a challenge to get referees to inspect data (whether as SI or as FAIR), never mind metadata! Put another way, should metadata records be considered as part of the materials reviewed by referees, or something independent of referees and the responsibility of their publishers?
4. More generally, how would/should the peer-review system respond to living metadata records? Should there be guidelines regarding such records? Or ethical considerations?
I pose these questions because I am not aware of much discussion around these topics; I suggest there probably should be!

^‡Actually CrossRef and DataCite exchange each other’s metadata. However, each uses a somewhat different schema, so some components may be lost in this transit.

^†JSON, which is not particularly human friendly.
April 8, 2019
“Richer metadata makes content more useful”
The title of this post comes from the site www.crossref.org/members/prep/ Here you can explore how your favourite publisher of scientific articles exposes metadata for their journal.

Firstly, a reminder that when an article is published, the publisher collects information about the article (the “metadata”) and registers this information with CrossRef in exchange for a DOI. This metadata in turn is used to power e.g. a search engine which allows “rich” or “deep” searching of the articles to be undertaken. There is also what is called an API (Application Programming Interface) which allows services to be built offering deeper insights into what are referred to as scientific objects. One such service is “Event Data”, which attempts to create links between various research objects such as publications, citations, data and even commentaries in social media. A live feed can be seen here.

So here are the results for the metadata provided by six publishers familiar to most chemists, with categories including:
1. References
2. Open References
3. ORCID IDs
4. Text mining URLs
5. Abstracts
[Charts of metadata coverage for each publisher, in order: RSC, ACS, Elsevier, Springer-Nature, Wiley, Science.]

One immediately notices the large differences between publishers. Thus most have 0% metadata for the article abstracts, but one (the RSC) has 87%! Another striking difference is those that support open references (OpenCitations). The RSC and Springer Nature are 99-100% compliant whilst the ACS is 0%. Yet another variation is the adoption of the ORCID (Open Researcher and Collaborator Identifier), where the learned society publishers (RSC, ACS) achieve > 80%, but the commercial publishers are in the lower range of 20-49%.

To me the most intriguing was the Text mining URLs. From the help pages, “The Crossref REST API can be used by researchers to locate the full text of content across publisher sites. Publishers register these URLs – often including multiple links for different formats such as PDF or XML – and researchers can request them programatically“. Here the RSC is at 0%, ACS is at 8% but the commercial publishers are 80+%. I tried to find out more at e.g. https://www.springernature.com/gp/researchers/text-and-data-mining but the site was down when I tried. This can be quite a controversial area. Sometimes the publisher exerts strict control over how the text mining can be carried out and how any results can be disseminated. Aaron Swartz famously fell foul of this.

I am intrigued as to how, as a reader with no particular pre-assembled toolkit for text mining, I can use this metadata provided by the publishers to enhance my science. After all, 80+% of articles with some of the publishers apparently have a mining URL that I could use programmatically. If anyone reading this can send some examples of the process, I would be very grateful.
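As a tentative first step towards such an example, the sketch below picks the text-mining links out of an article's CrossRef metadata. The SAMPLE_WORK record and its publisher.example URLs are hypothetical; only the general shape (a "link" list whose entries carry "URL", "content-type" and "intended-application") is meant to follow what the CrossRef API returns. Actually fetching the full text is not shown, since that may depend on a subscription, a licence or a publisher-issued token.

```python
# Hypothetical work record; the URLs are placeholders, not real endpoints.
SAMPLE_WORK = {
    "DOI": "10.0000/example",
    "link": [
        {"URL": "https://publisher.example/article.pdf",
         "content-type": "application/pdf",
         "intended-application": "text-mining"},
        {"URL": "https://publisher.example/article.xml",
         "content-type": "application/xml",
         "intended-application": "text-mining"},
        {"URL": "https://publisher.example/article.html",
         "content-type": "text/html",
         "intended-application": "similarity-checking"},
    ],
}


def text_mining_urls(work, preferred=("application/xml", "text/xml", "application/pdf")):
    """Return (content type, URL) pairs flagged for text mining, best format first."""
    links = [link for link in work.get("link", [])
             if link.get("intended-application") == "text-mining"]
    # Rank structured formats (XML) ahead of PDF; unknown types sort last.
    rank = {ctype: position for position, ctype in enumerate(preferred)}
    links.sort(key=lambda link: rank.get(link.get("content-type"), len(rank)))
    return [(link.get("content-type"), link["URL"]) for link in links]


for content_type, url in text_mining_urls(SAMPLE_WORK):
    print(content_type, url)
```

Preferring XML over PDF is a design choice of this sketch, on the grounds that structured text is usually easier to mine; the order is easily changed through the preferred argument.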

Finally I note the absence of any metadata in the above categories relating to FAIR data. Such data also has the potential for programmatic procedures to retrieve and re-use it (some examples are available here[cite]10.1021/acsomega.8b03005[/cite]), but apparently publishers do not (yet) collect metadata relating to FAIR. Hopefully they soon will.
February 16, 2019

Re-inventing the anatomy of a research article.

The traditional structure of the research article has been honed and perfected for over 350 years by its custodians, the publishers of scientific journals. Nowadays, for some journals at least, it might be viewed as much as a profit centre as the perfected mechanism for scientific communication. Here I take a look at the components of such articles to try to envisage its future, with the focus on molecules and chemistry.

The formula which is mostly adopted by authors when they sit down to describe their chemical discoveries is more or less as follows:

1. An introduction, setting the scene for the unfolding narrative.
2. Results. This is where much of the data from which the narrative is derived is introduced. Such data can be presented in the form of:
  - Tables
  - Figures and schemes
  - Numerical and logical data embedded in narrative text
3. Discussion, where the models constructed from the data are illustrated and new inferences presented. Very often categories 2 and 3 are conflated into one single narrative.
4. Conclusions, where everything is brought together to describe the essential aspects of the new science.
5. Bibliography, where previous articles pertinent to the narrative are listed.

In the last decade or so, the management of research data has developed as a field of its own, with three phases:

6. Setting out a data management plan at the start of the project, often a set of aspirations together with putative actions,
7. the day-to-day management of the data as it emerges in the form of an electronic laboratory notebook (ELN),
8. the publication of selected data from the ELN into a repository, together with the registration of metadata describing the properties of the data.

In the latter category, item 8 can be said to be a game-changer, a true disruptive influence on the entire process. The key aspect is that it constitutes independent publication of data to sit alongside the object constructed from 1-5. More disruption emerges from the open citations project, whereby category 5 above can be released by publishers to adopt its own separate existence. So now we see that of the five essential anatomic components of a research article, two are already starting to achieve their own independence. Clearly the re-invention of the anatomy of the research article is well under way already.

Next I take a look at what sorts of object might be found in category 8, drawing very much on our own experience of implementing 7 and 8 over the last twelve years or so. I start by observing that in 2 above, figures are perhaps the object most in need of disruptive re-invention. In the 1980s, authors were much taken by the introduction of colour as a means of conveying information within a figure more clearly; although the significant costs then had to be borne directly by these authors (and with a few journals this persists to this day). By the early 1990s, the introduction of the Web[cite]10.1039/C39940001907[/cite] offered new opportunities not only of colour but of an extra dimension (or at least the illusion of one) by means of introducing interactivity for three-dimensional models. Some examples resulting from combining figures from category 2 with 8 above are listed in the table below.

Examples of re-invented data objects from category 2
Example	Object title	Object DOI	Article DOI
1	Figure 9. Catalytic cycle involving one amine …etc.	10.14469/hpc/1854	10.1039/C7SC03595K
2	FAIR Data Figure. Mechanistic insights into boron-catalysed direct amidation reactions	10.14469/hpc/4919	10.1039/C7SC03595K
3	FAIR Data table. Computed relative reaction free energies (kcal/mol-1) of Obtusallene derived oxonium and chloronium cations	10.14469/hpc/1248	10.1021/acs.joc.6b02008
4	(raw) NMR data for Epimeric Face-Selective Oxidations …	10.14469/hpc/1267	10.1021/acs.joc.6b02008
5	Bibliography	10.14469/hpc/1116	10.1021/acs.joc.6b02008

Example 1 illustrates how a figure from category 2 above can be augmented with active hyperlinks specifying the DOI of the data in category 8 from which the figure is derived, thus creating a direct and contextual connection between the research article and the research data it is based upon. These links are embedded only in the Acrobat (PDF) version of the article as part of the production process undertaken by the journal publisher. Download Figure 9 from the link here and try it for yourself or try the entire article from the journal, where more figures are so enhanced.

Example 2 takes this one stage further. The hyperlinks in the published figure in example 1 were embedded in software capable of resolving them, namely a PDF viewer. But that is all that this software allows. By relocating the hyperlink into a Web browser instead, one can add further functionality in the form of Javascripts perhaps better described as workflows (supported by browsers but not supported by Acrobat). There are three such workflows in example 2.

1. The first uses an image map to associate a region of the figure with a data object defined by a DOI.
2. The second interrogates the metadata specifically associated with the DOI (the same DOIs that are seen in the figure itself) to see if there is any so-called ORE metadata available (ORE = Object Re-use and Exchange). If there is, it uses this information to retrieve the data itself and pass it through to the third workflow.
3. The third is represented by a set of JavaScripts known as JSmol. These interpret the data received and construct an interactive visual 3D molecular model representing the retrieved data.

All this additional workflowed activity is implemented in a data repository. It is not impossible that it could also be implemented at the journal publisher end of things, but it is an action that would have to be supported by multiple publishers. Arguably this sort of enhancement is far better suited and more easily implemented by a specialised data publisher, i.e. a data repository.

Example 3 does the same thing for a table.

Example 4 enhances in a different manner. Conventionally NMR data is added to the supporting information file associated with a journal article, but such data is already heavily processed and interpreted. The raw instrumental data is never submitted to the journal and is almost always available only by direct request from the original researchers (at least if the request is made whilst the original researchers are still contactable!). The data repository provides a new mechanism for making such raw instrumental (and indeed computational) data an integral part of the scientific process.

Example 5 shows how a bibliography can be linked to a secondary bibliography (citations 35 and 36 in this example in the narrative article) and perhaps in the future to Open Citations semantic searches for further cross references.

So by deconstructing the components of the standard scientific article, re-assembling some of them in a better-suited environment and then linking the two sets of components to each other, one can start to re-invent the genre and hopefully add more tools for researchers to use to benefit their basic research processes. The scope for innovation seems considerable. The issue of course is (a) whether publishers see this as a viable business model or whether they instead wish to protect their current model of the research article and (b) whether authors wish to undertake the learning curve and additional effort to go in this direction. As I have noted before, the current model is deficient in various ways; I do not think it can continue without significant reinvention for much longer. And I have to ask: if reinvention does emerge, will science be the prime beneficiary?

December 29, 2018

Open Access journal publishing debates – the elephant in the room?

For perhaps ten years now, the future of scientific publishing has been hotly debated. The traditional models are often thought to be badly broken, although convergence to a consensus of what a better model should be is not apparently close. But to my mind, much of this debate seems to miss one important point, how to publish data.

Thus, at one extreme is COAlition S, a model which promotes the key principle that “after 1 January 2020 scientific publications on the results from research funded by public grants provided by national and European research councils and funding bodies, must be published in compliant Open Access Journals or on compliant Open Access Platforms.” This includes ten principles, one of which “The ‘hybrid’ model of publishing is not compliant with the above principles” has revealed some strong dissent, as seen at forbetterscience.com/2018/09/11/response-to-plan-s-from-academic-researchers-unethical-too-risky I should explain that hybrid journals are those where the business model includes both institutional closed-access to the journal via a subscription charge paid by the library, coupled with the option for individual authors to purchase an Open Access release of an article so that it sits outside the subscription. The dissenters argue that non-OA and hybrid journals include many traditional ones, which especially in chemistry are regarded as those with the best impact factors and very much as the journals to publish in to maximise the readership, hence the impact of the research and thus researchers’ career prospects. Thus many (not all) of the American Chemical Society (ACS) and Royal Society of Chemistry (RSC) journals currently fall into this category, as well as commercial publishers of journals such as Nature, Nature Chemistry, Science, Angew. Chemie, etc.

So the debate is whether funded top ranking research in chemistry should in future always appear in non-hybrid OA journals (where the cost of publication is borne by article processing charges, or APCs) or in traditional subscription journals where the costs are borne by those institutions that can afford the subscription charges, but which of course also limit the access. A measure of how important and topical the debate has become is that there is now even a movie devoted to the topic, which makes the point of how profitable commercial scientific publishing now is and hence how much resource is being diverted into these profit margins at the expense of funding basic science.

None of these debates however really takes a close look at the nature of the modern research paper. In chemistry at least, the evolution of such articles in the last 20 years (~ corresponding to the online era) has meant that whilst the size of the average article has remained static at around 10 “pages” (in quotes because of course the “page” is one of those legacy concepts related to print), another much newer component known as “Supporting information” or SI^♥ has ballooned to absurd sizes. It can reach 1000 pages[cite]10.1021/jacs.6b13229[/cite] and there are rumours of even larger SIs. The content of SI is of course mostly data. The size is often because the data is present in visual form (think spectra). As visual information, it is not easily “inter-operable” or “accessible”. Nor is it “findable” until commercial abstracting agencies choose to index it. Searches of such indexed data are most certainly “closed” (again depending on institutional purchases of access) and not “open access”. You may recognise these attributes as those of FAIR (Findable, accessible, inter-operable and re-usable). So even if an article in chemistry is published in pure OA form, in order to get FAIR access to the data associated with the article, you will probably have to go to a non-OA resource run by a commercial organisation for profit. Thus a 10 page article might itself be OA, but the full potential of its 1000+ page data (an elephant if ever there was one) ends up being very much not OA.

You might argue that the 1000+ pages of data do not require the services of an abstracting agency to be useful. Surely a human can get all the information they want from inspecting a visual spectrum? Here I raise the future prospects of AI (artificial intelligence). The ~1000 page SI I noted above[cite]10.1021/jacs.6b13229[/cite] includes e.g. NMR spectra for around 70 compounds (I tried to count them all visually, but could not be certain I found them all). A machine, trained to identify spectra from associated metadata (a feature of FAIR), could extract vastly more information than a human could from FAIR raw data^‡ (a spectrum is already processed data, with implied information/data loss) in a given time. And for many articles, not just one. Thus FAIR data is very much targeted not only at humans but at the AI-trained machines of the future.

So I again repeat my assertion that focussing on whether an article is OA or not and whether publishing in hybrid journals is to be allowed or not by funders is missing that 100-fold bigger elephant in the room. For me, a publishing model that is fit for the future should include as a top priority a declaration of whether the data associated with it is FAIR. Yet in the Plan-S ten principles, FAIR is not mentioned at all. Only when FAIR-enabled data becomes part of the debates can we truly say that the article and its data are on their way to being properly open access.

^‡The FAIR concept did not originally differentiate between processed data (i.e. spectra) and the underlying primary or raw data on which the processed data is based. Our own implementation of FAIR data includes both types of data: raw for machine reprocessing if required, and processed data for human interpretation, along with a rich set of metadata, itself often created using carefully designed workflows conducted by machines.

^♥The proportion of articles relating to chemistry which do not include some form of SI is probably low. These would include articles which simply provide a new model or interpretation of previously published data, reporting no new data of their own. A famous historical example is Michael Dewar’s re-interpretation of the structure of stipitatic acid[cite]10.1038/155050b0[/cite] which founded the new area of non-benzenoid aromaticity.

November 4, 2018
A Theoretical Method for Distinguishing X‐H Bond Activation Mechanisms.
Consider the four reactions. The first two are taught in introductory organic chemistry as (a) a proton transfer, often abbreviated PT, from X to B (a base) and (b) a hydride transfer from X to A (an acid). The third example is taught as a hydrogen atom transfer or HAT from X to (in this example) O. Recently an article has appeared[cite]10.1002/anie.201805511[/cite] citing an example of a fourth fundamental type (d), which is given the acronym cPCET which I will expand later. Here I explore this last type a bit further, in the context that X-H bond activations are currently a very active area of research.

To help understand these four types, I have colour-coded the electron pair constituting the X-H covalent bond in red.
1. In mechanism (a), this electron pair stays with X, thus liberating a proton which is captured by the base.
2. The hydride transfer (b) is so-called because in fact this electron pair travels together with the proton, hence the term hydride or H^–.
3. Hydrogen atom transfers as in (c) in effect transfer both a proton and one electron to another atom (oxygen in the example above), leaving behind one electron on X. The electron and the proton are said to travel together as a “true” hydrogen atom.
4. The fourth mechanism (d) is fundamentally different from (c) in that whilst the electron and the proton travel in concert (at the same time), they do not travel together. In this example the proton travels to the oxygen, whilst the electron travels to the iron centre, in the process reducing its oxidation state. This mode is now called a concerted proton-coupled electron transfer, or cPCET as above.
The tool employed to distinguish between mechanisms (c) and (d) is the IBO or intrinsic bond orbital localisation scheme.[cite]10.1021/ct400687b[/cite] One practical advantage of such a scheme over better known localisation methods such as NBO (Natural bond orbitals) is that IBOs can be made to transform smoothly during the course of a reaction, as followed by say an IRC (Intrinsic reaction coordinate). NBOs may instead show discontinuous behaviour along a reaction IRC. Klein and Knizia have located transition states for examples of both (c) and (d) above and studied the IBOs along such IRCs. The two IBO reaction transformations are very different, as illustrated below (used, with permission, from the article itself). For the HAT type (X=C above), an α-spin IBO morphs from a C-H bond into a H-O bond, whilst the β-spin counterpart morphs from being located on the C-H bond into a carbon-centered radical. For the cPCET mode, the α-spin IBO morphs from C-H to a C-centered radical, but the β-spin region grows onto an iron d-orbital. It is in fact even more complex than the diagram above implies, since some reorganisation of the O-Fe region occurs and the H…:O region is still anti-bonding at the transition state.

We can see from this that mechanistic reaction analysis is starting to track the “curly arrows” we conventionally use to represent reactions in some detail, as well as informing us about the detailed relative timing of the various curly arrows used. Of course this latter aspect cannot be easily represented by conventional curly arrows. It seems timely to revisit the vast corpus of organic and organometallic “curly arrow pushing” to start adding such information!
July 25, 2018
FAIR Data in Amsterdam – FAIR data points.
FAIR is one of those acronyms that spreads rapidly, acquires a life of its own and can mean many things to different groups. A two-day event has just been held in Amsterdam to bring some of those groups from the chemical sciences together to better understand FAIR. Here I note a few items that caught my attention.
1. Fairsharing.org was the basis for several presentations. It serves as “a curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies.” It promotes establishing metrics which strive to quantify how FAIR any given resource is.[cite]10.1038/sdata.2018.118[/cite] Any site which achieves a good FAIR metric can be described as a FAIR data point (a term new to me), and which can serve as an exemplar of what FAIR data aspires to.
2. Intrigued, I offered this page and hope to establish its FAIR metric in the near future, if only to understand how to improve its “score” so that future pages can be improved. It is based on the following Figure[cite]10.1039/c7sc03595k[/cite] which appeared in a recent article and appears to be a publishing “first” in as much as the figure contains hyperlinks directly to the data sources upon which it is based. The putative FAIR data point takes this a little further by wrapping the figure with visualisation tools which take the FAIR data and convert it to interactive models with the help of an added toolbox.
3. Another topic for discussion was spectroscopy and a venerable file format for its distribution, JCAMP-DX. One emerging theme is the idea of promoting two types of spectral distribution. The first is the use of a common standard format (JCAMP-DX) which strives to eliminate much of the proprietary character associated with data emerging from instruments. At the other extreme is to offer readers the raw instrumental data,[cite]10.1039/c7np00064b[/cite] which has the advantage of having none of the inevitable loss of information when transforming to standard formats. The downside is that it can almost always be processed only using proprietary software provided by the instrument vendor. One way of avoiding this is Mpublish (the topic of an earlier blog) and we heard interesting updates on progress from MestreLabs, the originators of this procedure. It is still my hope that more vendors (both of instruments and of software) will adopt such a model.
4. A further topic was metadata, which is at the heart of each of the terms in FAIR (F = findable, A = accessible, I = interoperable and R = re-usable), which are all defined in part at least by the metadata associated with any item. The state of metadata associated with research data is often dire, and often too little resource has been assigned to its improvement. I presented an example of how richer metadata might be injected. The below is a snippet of the metadata associated with one entry in a data repository (download the metadata here and open the file with a text editor). An advantage of doing this is that rich searches against these terms become enabled.
5. Finally, I note that events such as Harnessing FAIR data are starting to spring up. This one is at Queen Mary University of London on 3rd September 2018, for which “PhDs and Post Docs from a range of disciplines” are welcomed, they of course being the pre-eminent generators of data and often the ones in charge of making it “FAIR”.
July 18, 2018
Ten years on: Jmol and WordPress.

Ten years is a long time when it comes to (recent) technologies. The first post on this blog was on the topic of how to present chemistry with three intact dimensions. I had in mind molecular models, molecular isosurfaces and molecular vibrations (arguably a further dimension). Here I reflect on how ten years of progress in technology has required changes and the challenge of how any necessary changes might be kept “under the hood” of this blog.

That first post described how the Java-based applet Jmol could be used to present 3D models and animations. Gradually over this decade, use of the Java technology has become more challenging, largely as a result of efforts to improve Web-page security. Java was implemented into web browsers via something called Netscape Plugin Application Programming Interface or NPAPI, dating from around 1995. NPAPI has now been withdrawn from pretty much all modern browsers.^‡ Modern replacements are based on JavaScript, and the standard tool for presenting molecular models, Jmol, has been totally refactored into JSmol.^† Now the challenge becomes how to replace Jmol by JSmol, whilst retaining the original Jmol Java-based syntax (as described in the original post). Modern JSmol uses its own improved syntax, but fortunately one can use a syntax converter script Jmol2.js which interprets the old syntax for you. Well, almost all syntax, but not in fact the variation I had used throughout this blog, which took the form:

[caption]<img onclick="jmolApplet([450,450],'load a-data-file;spin 3;');" src="static-image-file" width="450" /> Click for 3D structure[/caption]

This design was originally intended to allow browsers which did not have the Java plugin installed to default to a static image, whilst clicking on the image would allow browsers that did support Java to replace (in a new window) the static image with a 3D model generated from the contents of a-data-file. The Jmol2.js converter script had not been coded to detect such invocations. Fortunately Angel came to my rescue and wrote a 39-line JavaScript file that does just that (my JavaScript coding skills do not extend that far!). Thanks Angel!!

In fact I did have to make one unavoidable change, to:

[caption]<img onclick="jmolApplet([450,450],'load a-data-file;spin 3;','c1');" src="image-file" width="450" /> Click for 3D structure[/caption]

to correct an error present in the original. It manifests when one has more than one such model present in the same document, and this necessitates that each instance has a unique name/identifier (e.g. c1). So now, in the WordPress header for the theme used here (in fact the default theme), the following script requests are added to the top of each page, the third of which is the new script.

<script type="text/javascript" src="JSmol.min.js"></script>
<script type="text/javascript" src="js/Jmol2.js"></script>
<script type="text/javascript" src="JmolAppletNew.js"></script>

The result is e.g.

Click for 3D structure of GAVFIS

Click for 3D interaction

This solution unfortunately is also likely to be unstable over the longer term. As standards (and security) evolve, invocations such as onclick= have come to be considered “bad practice” (and may even become unsupported). Even more complex procedures will have to be devised to keep up with the changes in web browser behaviour and so I may have to again rescue the 3D models in this blog at some stage!^¶ Once upon a time, the expected usable lifetime of e.g. a Scientific Journal (print!) was a very long period (>300 years). Since ~1998 when most journals went online, that lifetime has considerably shortened (or at least requires periodic, very expensive, maintenance). For more ambitious types of content such as the 3D models discussed here, it might be judged to be <10 years, perhaps much less, before maintenance again becomes necessary. Sigh!
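As a minimal sketch of what moving away from inline onclick= might look like (this is not the script used on this blog), the parameters of each model could be held in data- attributes on the image and the click handler attached from a separate script. The class name jmol-click, the attributes data-size, data-script and data-name, and the functions readModelSpec and attachModelHandlers are all hypothetical names chosen for illustration; the only thing taken from the above is the three-argument jmolApplet(size, script, name) form, which is assumed to be provided by Jmol2.js.

```javascript
// Hypothetical markup this sketch assumes:
//   <img class="jmol-click" data-size="450" data-name="c1"
//        data-script="load a-data-file;spin 3;" src="image-file" width="450" />

// Turn the data- attributes of one image into the arguments jmolApplet expects.
function readModelSpec(dataset) {
  const size = Number(dataset.size) || 450; // fall back to 450 if missing or invalid
  return {
    size: [size, size],
    script: dataset.script || "",
    name: dataset.name || "",
  };
}

// Attach a click listener to every marked image under `root`.
// `openModel` is whatever function actually displays the model,
// e.g. the jmolApplet function defined by Jmol2.js (an assumption here).
function attachModelHandlers(root, openModel) {
  root.querySelectorAll("img.jmol-click").forEach((img) => {
    img.addEventListener("click", () => {
      const spec = readModelSpec(img.dataset);
      openModel(spec.size, spec.script, spec.name);
    });
  });
}

// Only wire things up when running in a browser where jmolApplet exists.
if (typeof document !== "undefined" && typeof jmolApplet === "function") {
  document.addEventListener("DOMContentLoaded", () => {
    attachModelHandlers(document, jmolApplet);
  });
}
```

Keeping the model description in the markup and the behaviour in one script means that any future change in how models are launched would, in principle, need editing in only one place rather than in every post.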

^‡ At the time of writing, WaterFox is one of the few browsers to still support it.
^† An early issue with using JavaScript instead of Java was performance. For some tasks, the former was often 10-50 times slower. Improvements in both hardware and software have now largely eliminated this issue.
^¶ Thus using jQuery.

May 16, 2018
Examples please of FAIR (data); good and bad.

The site fairsharing.org is a repository of information about FAIR (Findable, Accessible, Interoperable and Reusable) objects such as research data.

A project to inject chemical components, rather sparse at the moment at the above site, is being promoted by workshops under the auspices of e.g. IUPAC and CODATA and the GO-FAIR initiative. One aspect of this activity is to help identify examples of both good (FAIR) and indeed less good (unFAIR) research data as associated with contemporary scientific journal publications.

Here is one example I came across in 2017.[cite]10.1021/jacs.6b13229[/cite] The data associated with this article is certainly copious, 907 pages of it, not including data for 21 crystal structures! The latter is a good example of FAIR, being offered in a standard format (CIF) well-adapted for the type of data contained therein and for which there are numerous programs capable of visualising and inter-operating (i.e. re-using) it. The former is in PDF, not a format originally developed for data and one could argue is closer to the unFAIR end of the spectrum. More so when you consider this one 907-page paginated document contains diverse information including spectra on around 60 molecules. Thus the spectra are all purely visual; they are obviously data but in a form largely designed for human consumption and not re-use by software. The text-based content of this PDF does have numerous patterns, which lend themselves to pattern recognition software such as OSCAR, but patterns are easily broken by errors or inexperience and so we cannot be certain what proportion of this can be recovered. The metadata associated with such a collection, if there is any at all, must be general and cannot be easily related to specific molecules in the collection. So I would argue that 907 pages of data as wrapped in PDF is not a good example of FAIR. But it is how almost all of the data currently being reported in chemistry journals is expressed. Indeed many a journal data editor (a relatively new introduction to the editorial teams) exerts a rigorous oversight over the data presented as part of article submissions to ensure it adheres to this monolithic PDF format.

You can also visit this article in Chemistry World (rsc.li/2HG7lTk) for an alternative view of what could be regarded as rather more FAIR data. The article has citations to the FAIR components, which are not published as part of the article or indeed by the journal itself but are held separately in a research data repository. You will find that at doi: 10.14469/hpc/3657 where examples of computational, crystallographic and spectroscopic data are available.

The workshop I allude to above will be held in July. Can I ask anyone reading this blog who has a favourite FAIR or indeed unFAIR example of data they have come across to share these here? We also need to identify areas simply crying out for FAIRer data to be made available as part of the publishing process beyond the types noted above. I hope to report back on both such feedback and the events at this workshop in due course.

May 6, 2018

How FAIR are the data associated with the 2017 Molecules-of-the-Year?

C&EN has again run a vote for the 2017 Molecules of the year. Here I take a look not just at these molecules, but at how FAIR (Findable, Accessible, Interoperable and Reusable) the data associated with these molecules actually is.

I went about finding out as follows:

1. The article DOI for all seven candidates was linked to the C&EN site.
2. From there I manually tracked down the Supporting information.
3. Some of this SI gave a CCDC deposition number for crystal structure data for the molecule in question. The easiest way of going directly to the data was to use the search.datacite.org search engine and to enter the keywords CCDC + deposition number. This gives a DOI for the data, examples of which are included in the table below.
4. In other examples, I used the CSD Conquest search program and entered the names of 2-3 of the authors of the articles. This also worked well.
5. Most of the SI files, downloaded as PDF files, also had static images of NMR spectra included. This is not active data, and hence does not fulfil the F and I of FAIR, and probably the A as well. None of it is FAIR as defined by my post here although it is actually really easy to make it so. One of the examples had ~116 spectra so unFAIRed.
6. In another example there was also computational data, included simply as a set of XYZ coordinates and again contained in the PDF file. This too is not really FAIR, since one has to know how to extract it from this container and repurpose it. It also represents a tiny subset of the data potentially available.
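The DataCite lookup described in the steps above could in principle be scripted rather than done by hand. The sketch below is an illustration only: it assumes the public DataCite REST API at api.datacite.org (rather than the search.datacite.org web page I actually used), assumes that the JSON response lists records under data with the DOI at attributes.doi, and uses hypothetical function names (buildCcdcQueryUrl, extractDois, findCcdcDois) of my own choosing.

```javascript
// Assumed endpoint of the DataCite REST API (not the web search page used above).
const DATACITE_API = "https://api.datacite.org/dois";

// Build a query URL from the keywords "CCDC" + deposition number.
function buildCcdcQueryUrl(depositionNumber) {
  const params = new URLSearchParams({ query: `CCDC ${depositionNumber}` });
  return `${DATACITE_API}?${params.toString()}`;
}

// Pull the DOI strings out of a response, assuming the layout
// { data: [ { attributes: { doi: "..." } }, ... ] }.
function extractDois(responseJson) {
  return (responseJson.data || []).map((record) => record.attributes.doi);
}

// Run the query and return any DOIs found (requires a runtime with fetch,
// such as a modern browser or a recent version of Node.js).
async function findCcdcDois(depositionNumber) {
  const response = await fetch(buildCcdcQueryUrl(depositionNumber));
  if (!response.ok) {
    throw new Error(`DataCite query failed with status ${response.status}`);
  }
  return extractDois(await response.json());
}
```

One might then call findCcdcDois(1457983) for the first deposition number quoted in the table and inspect the list that comes back. A free-text query of this kind can return more than one record, or none, so the result would still need checking by eye.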

How FAIR are the data associated with the 2017 Molecules-of-the-Year?
#	Title	Article DOI	Data DOI
1	Persulfurated Coronene: A New Generation of “Sunflower”	10.1021/jacs.6b12630	Data available only as PDF Hosted by Figshare The SI also has its own DOI: 10.1021/jacs.6b12630.s001
2	A Truncated Molecular Star	10.1021/jacs.6b12630	Crystal structure data: 10.5517/ccdc.csd.cc1nb303
3	Synthesis of trinorbornane	10.1039/c7cc06273g	Crystal structure data: 10.5517/ccdc.csd.cc1p7806
4	Braiding a molecular knot with eight crossings	10.1126/science.aal1619	Crystal structure data: 10.5517/ccdc.csd.cc1m85y0
5	Unique physicochemical and catalytic properties dictated by the B₃NO₂ ring system	10.1038/nchem.2708	Crystal structure data: 10.5517/ccdc.csd.cc1lkff0
6	Total synthesis of mycobacterial arabinogalactan containing 92 monosaccharide units	10.1038/ncomms148510	116 NMR spectra available only as PDF. No crystal structure
7	Nitrogen Lewis Acids	10.1021/jacs.6b12360	NMR spectra available only as PDF. Computed coordinates available only as PDF Crystal structures data: CCDC 1457983-1457987,1458000-1458001 e.g. 10.5517/ccdc.csd.cc1ky4qc 10.5517/ccdc.csd.cc1ky4rd

The FAIRness of the data for these molecules of the year is largely rescued by the crystal structure data deposited with the CCDC in their CSD database and rendered F of FAIR by the persistent identifiers such as the (parochial) deposition numbers or the more general DOI. Now if the NMR and computational data were also covered in this way, we would be making great progress. There are of course many other types of data included with these examples, and procedures for making such data also FAIR have to be worked out by the community.

In order to construct the table above, I had to put about two hours of effort into tracking down the items (and this only because I have done this sort of search before). Perhaps next year I might persuade C&EN to include such a table in their own article!

March 7, 2018