Tag: Identifiers

  • A search of some major chemistry publishers for FAIR data records.

    In recent years, findable data has become ever more important (the F in FAIR). Here I test that F using the DataCite search service.

    First, an introduction to this service: it is a metadata database describing datasets and other research objects. One of its properties is relatedIdentifier, which records other identifiers associated with the dataset, say the DOI of any published article associated with the data, although it could also be a pointer to related datasets.

    One can query thus:

    1. https://search.datacite.org/works?query=relatedIdentifiers.relatedIdentifier:*
      which retrieves the very healthy looking 6,179,287 works.
    2. One can restrict this to a specific publisher by the DOI prefix assigned to that publisher:
      ?query=relatedIdentifiers.relatedIdentifier:10.1021*
      which returns a respectable 210,240 works.
    3. It turns out that the major contributor to FAIR data is currently crystal structures from the CCDC. One can remove them from the search to see what is left over:
      ?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+NOT+(identifier:*10.5517*) 
      and one is down to 14,213 works, of which many nevertheless still appear to be crystal structures. These may be links to other crystal datasets.
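
    The three searches above differ only in the DOI prefix and in whether CCDC records are excluded, so they can be generated programmatically. What follows is a minimal Python sketch, assuming the query syntax shown above. The helper names search_2 and search_3 are hypothetical, and the code only builds the URLs rather than contacting the service:

```python
BASE = "https://search.datacite.org/works?query="
CCDC_PREFIX = "10.5517"  # the prefix excluded in search 3 above


def search_2(publisher_prefix: str) -> str:
    """Datasets whose related identifiers start with the publisher's DOI prefix."""
    return f"{BASE}relatedIdentifiers.relatedIdentifier:{publisher_prefix}*"


def search_3(publisher_prefix: str) -> str:
    """As search 2, but excluding records whose own identifier contains the CCDC prefix."""
    return (
        f"{BASE}(relatedIdentifiers.relatedIdentifier:{publisher_prefix}*)"
        f"+NOT+(identifier:*{CCDC_PREFIX}*)"
    )


print(search_2("10.1021"))
print(search_3("10.1021"))
```

    Swapping in another publisher's prefix (for example 10.1039, used further down) gives the corresponding pair of queries.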

    I have performed searches 2 and 3 for some popular publishers of chemistry (the same set that were analysed here).

    Publisher   Search 2   Search 3
    ACS          210,240     14,213
    RSC          138,147      1,279
    Elsevier     185,351     56,373
    Nature        12,316      8,104
    Wiley        135,874      9,283
    Science        3,384      2,343

    These publishers all have significant numbers of datasets which at least accord with the F of FAIR. Many datasets may lack metadata pointing back to a published article, since that link can only be added once the DOI of the article appears, in other words AFTER the publication of the dataset. So these numbers are probably underestimates.

    How about the other way around? Rather than datasets that have a journal article as a related identifier, we could search for articles that have a dataset as a related identifier?

    1. ?query=(identifier:*10.1039*)+AND+(relatedIdentifiers.relatedIdentifier:*)
      returns a rather mysterious "nothing found". It might be that there is no mapping of this search between the CrossRef and DataCite metadata schemas.
    2. And just to show the searches are behaving as expected:
      ?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+AND+(identifier:*10.5517*)
      returns 196,027 works.
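
    The ACS numbers can be cross-checked: the count for CCDC-only records just quoted should equal search 2 minus search 3. The short sketch below uses only the counts reported in this post (the variable and function names are illustrative) and also computes what fraction of each publisher's linked datasets remains once CCDC records are removed:

```python
# Counts copied from the table above: (search 2, search 3)
counts = {
    "ACS": (210_240, 14_213),
    "RSC": (138_147, 1_279),
    "Elsevier": (185_351, 56_373),
    "Nature": (12_316, 8_104),
    "Wiley": (135_874, 9_283),
    "Science": (3_384, 2_343),
}

# ACS-linked records that ARE CCDC entries, as returned by the last search above
acs_ccdc_only = 196_027

all_acs, non_ccdc_acs = counts["ACS"]
# Search 2 minus search 3 should reproduce the CCDC-only count
assert all_acs - non_ccdc_acs == acs_ccdc_only


def non_ccdc_fraction(publisher: str) -> float:
    """Fraction of a publisher's linked datasets left after excluding the CCDC."""
    total, remaining = counts[publisher]
    return remaining / total


for name in counts:
    print(f"{name:10s} {100 * non_ccdc_fraction(name):5.1f}% not from the CCDC")
```

    For ACS the remaining fraction is a little under 7%, which underlines how strongly crystal structures dominate these counts.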

    It will also be of interest to show how these numbers change over time. Is there an exponential increase? We shall see.

    Finally, we have not really explored adherence to e.g. the AIR of FAIR. That is for another post.

  • Harnessing FAIR data: A suggested useful persistent identifier (PID) for quantum chemical calculations.

    Harnessing FAIR data is an event being held in London on September 3rd; no doubt all the speakers will espouse its virtues and speculate about how to realize its potential.♥ Admirable aspirations indeed. Capturing hearts and minds also needs lots of real life applications! Whilst assembling a forthcoming post on this blog, I realized I might have one nice application which also pushes the envelope a bit further, in a manner that I describe below.

    The post I refer to above is about using quantum chemical calculations to chart possible mechanistic pathways for the reaction between a carboxylic acid and an amine to form an amide. The FAIR data for the entire project is collected at DOI: 10.14469/hpc/4598. Part of what makes it FAIR is the metadata not only collected about this data but also formally registered with the DataCite agency. Registration in turn enables Finding; it is this aspect I want to demonstrate here.

    The metadata for the above DOI includes information such as:

    1. The ORCID persistent identifier (PID) for the creator of the data (in this instance myself)
    2. Date stamps for the original creation date and subsequent modifications.
    3. A rights declaration, in this case the CC0 license which describes how the data can be re-used.
    4. Related identifiers, in this case describing members of this collection.

    The data itself is held in the members of the collection, each of which is described by a more specific set of metadata in addition to the more general types in the above list (e.g. 10.14469/hpc/4606).

    1. One important additional metadata descriptor is the ORE locator (Object Re-use and Exchange, itself almost a synonym for FAIR). This allows a machine to deduce a direct path to the data file itself, and hence to retrieve it automatically if desired. It is important to note that the DOI itself (i.e. 10.14469/hpc/4606) points only to the “landing page” for the dataset, and does not necessarily describe the direct path to any specific file in the dataset. The ORE path can be used with e.g. software such as JSmol to directly load a molecule based only on its DOI. You can see an example of this here.
    2. Each molecule-based dataset contains additional specific metadata relating to the molecule itself. For example this is how the InChIKey, an identifier specific to that molecule, is expressed in metadata:
      <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">PVXKWVPAMVWJSQ-UHFFFAOYSA-N</subject>
      The advantage of expressing the metadata in this way is that a general search of the type:
      https://commons.datacite.org/doi.org?query=subjects.subjectScheme:inchikey+AND+subjects.subject:CZABGBRSHXZJCF-UHFFFAOYSA-N
      can be used to track down any molecule with metadata corresponding to the InChIKey specified in the query.
    3. Here is more metadata, introduced in this blog. It relates to the (computed) value of the Gibbs energy (in units of Hartree†), as returned by the Gaussian program:
      <subject subjectScheme="Gibbs_Energy" schemeURI="https://goldbook.iupac.org/html/G/G02629.html" valueURI="http://gaussian.com/thermo/">-649.732417</subject>
      I here argue that it represents a unique identifier for a molecule calculation using the quantum mechanical procedures implemented in e.g. Gaussian. This identifier is different from the InChIKey, in that it can be truncated to provide different levels of information.

      • At the coarsest level, a search of the type
        https://commons.datacite.org/doi.org?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:\-649.*
        should reveal all molecules with the same number of atoms and electrons whose Gibbs energy has been calculated, but not necessarily with the same InChI (i.e. they may be isomers, or transition states, etc). This level might be useful for revealing most (not necessarily all‡) molecules involved in say a reaction mechanism. It should also be insensitive to the program system used, since most quantum codes will return a value for the Gibbs energy if the same procedures have been used (i.e. DFT method, basis set, solvation model and dispersion correction) accurate to probably 0.01 Hartree.
      • The top level of precision however is high enough to almost certainly relate to a specific molecule and probably using a specific program;
        https://commons.datacite.org/doi.org?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:\-649.732417
      • The searcher can experiment with different levels of precision to narrow or broaden the search.
      • I would also address the issue (before someone asks) of why I have used the Gibbs energy rather than the Total energy. Put simply, the Gibbs energy is far more useful in a chemical context. It can be used to relate the relative Gibbs energies of different isomers of the same molecule to e.g. the equilibrium constant that might be measured. Or the difference in Gibbs energies between a reactant and a transition state can be used to derive the free energy activation barrier for a reaction. The total energy is not so useful in such contexts, although of course it too could be added as a subject in the metadata above if a real use for it is found.
    4. The searcher can also use Boolean combinations of metadata, such as specifying both the InChIKey and the Gibbs Energy, along with say the ORCID of the person who may have published the data:
      https://commons.datacite.org/doi.org?query=subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:\-649.*+AND+subjects.subjectScheme:inchikey+AND+subjects.subject:CZABGBRSHXZJCF-UHFFFAOYSA-N+AND+contributors.nameIdentifiers.nameIdentifier:*0000-0002-8635-8390♥
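
    The idea of truncating the Gibbs energy to broaden or narrow a search can be sketched in a few lines of Python. This only assembles query strings in the form shown above. The function name gibbs_query and its decimals parameter are hypothetical, introduced purely for illustration:

```python
from typing import Optional

BASE = "https://commons.datacite.org/doi.org?query="


def gibbs_query(energy: str, decimals: Optional[int] = None) -> str:
    """Build a Gibbs_Energy subject query.

    energy   -- the value as it appears in the metadata, e.g. "-649.732417"
    decimals -- number of decimal places to keep before adding a wildcard;
                None keeps the full value (the most precise search).
    """
    if decimals is None:
        pattern = energy
    else:
        whole, _, frac = energy.partition(".")
        pattern = f"{whole}.{frac[:decimals]}*"
    # The leading minus sign is escaped with a backslash, as in the queries above
    pattern = pattern.replace("-", "\\-", 1)
    return (
        f"{BASE}subjects.subjectScheme:Gibbs_Energy"
        f"+AND+subjects.subject:{pattern}"
    )


print(gibbs_query("-649.732417", decimals=0))  # coarsest level
print(gibbs_query("-649.732417", decimals=2))  # intermediate level
print(gibbs_query("-649.732417"))              # full precision
```

    The first and last calls reproduce the coarse and precise queries given above; intermediate values of decimals give the levels in between.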

    I have tried to show above how FAIR data implies some form of rich (registered) metadata. And how the metadata can be used to Find (the F in FAIR) data with very specific properties, thus Harnessing FAIR data.


    †It is a current limitation of the V4.1 DataCite schema that there appears no way to specify the data type of the subject, including any units.

    ‡In theory, a range query of the type:
    https://commons.datacite.org/doi.org?query=subjects.subjectScheme:Gibbs_energy+AND+subjects.subject:[\-649.1 TO \-649.8] should be more specific, but I have not yet gotten it to work, probably because the lack of data-typing means it is not recognised as a range of numeric values.
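
    Until such a range query works on the server, one possible workaround is to run the coarse wildcard search and then filter the returned values numerically on the client. This is a minimal sketch and not something the DataCite service provides. It assumes the subject elements have already been retrieved as XML, reuses the two subject elements quoted in this post, and wraps them in a parent element so that they parse. Real records also carry an XML namespace, which would need handling. The helper energies_in_range is hypothetical:

```python
import xml.etree.ElementTree as ET

# Two <subject> elements copied from the metadata examples above,
# wrapped in a <subjects> parent so they parse as one document.
XML = """<subjects>
  <subject subjectScheme="inchikey" schemeURI="http://www.inchi-trust.org/">PVXKWVPAMVWJSQ-UHFFFAOYSA-N</subject>
  <subject subjectScheme="Gibbs_Energy" schemeURI="https://goldbook.iupac.org/html/G/G02629.html" valueURI="http://gaussian.com/thermo/">-649.732417</subject>
</subjects>"""


def energies_in_range(xml_text, low, high):
    """Return Gibbs energies (as floats, in Hartree) with low <= value <= high."""
    root = ET.fromstring(xml_text)
    hits = []
    for subject in root.iter("subject"):
        if subject.get("subjectScheme") != "Gibbs_Energy":
            continue  # skip InChIKey and any other subject types
        value = float(subject.text)  # the schema stores the number as plain text
        if low <= value <= high:
            hits.append(value)
    return hits


print(energies_in_range(XML, -649.8, -649.1))
```

    Note that the bounds are passed lowest first (-649.8 before -649.1), since the comparison is numeric.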

    ♥Implicit in this search is the grouping
    https://commons.datacite.org/doi.org?query=(subjects.subjectScheme:Gibbs_Energy+AND+subjects.subject:\-649.*) + (subjects.subjectScheme:inchikey+AND+subjects.subject:CZABGBRSHXZJCF-UHFFFAOYSA-N)+AND+contributors.nameIdentifiers.nameIdentifier:*0000-0002-8635-8390
    Currently however DataCite do not correctly honour this form of grouping.

    ♥Video of the speakers and the panel session at the end is now available.

  • PIDapalooza 2018. A conference like no other!

    Another occasional conference report (day 1). So why is one about “persistent identifiers” important, and particularly to the chemistry domain?

    The PID most familiar to most chemists is the DOI (digital object identifier). In fact there are many types of PID; some 60 have been collected by ORCID (themselves purveyors of researcher identifiers). They sometimes even have different names; in the life sciences they tend to be known instead as accession numbers. One theme common to many (probably not all) is that they represent sources of metadata about the object being identified: further information which allows you (or a machine) to decide whether acquiring the full object is worthwhile. So in no particular order, here are some of the things I learnt today.

    1. Mark Hahnel noted the recent launch of the Dimensions resource which links research data with other research activities; I have not yet had a chance to learn its capabilities, but it seems an interesting alternative to other stalwarts such as Google Scholar.

      You can try this example: https://app.dimensions.ai/discover/publication?search_text=10.6084&search_type=kws&full_search=true which retrieves articles in which the data repository with prefix 10.6084 (Figshare) is cited. Try also the prefix 10.14469 which is the Imperial College repository.

    2. Andy Mabbett talked about the deployment and use of persistent identifiers (the Q numbers) in Wikidata, which increasingly underpin the basis for the various flavours of Wikipedia. He also noted their use of some 50 different identifiers.
    3. Johanna McEntyre noted some 5M published articles in life sciences which reference 1M+ ORCID identifiers, easily the domain with the fastest uptake of this type. Also noted was the new FREYA project, which aims to connect open identifiers for discovery, access and use of research resources.
    4. Tom Gillespie talked about RRIDs, or Research Resource Identifiers. These cover hardware, including instruments, with around 6000 RRIDs systematized so far. They argue that this area promotes both the A and I of FAIR (accessible and interoperable). Of course A and I mean many things to many people.
    5. Several other presentations talked about the finer detail of metadata, such as sub-classifications into e.g. descriptive/admin/technical, but I did rather miss demos showing how search queries of such fine-grained metadata could be constructed.

    Apart from the presentations themselves, PIDapalooza is unusual for some other activities. Thus you could go get your PIDnails done, with a selection of 8 or so tasteful logos to choose from. There will be tattoos tomorrow (this is a conference for younger people after all). I may grab a photo or two to provide evidence!

  • Data-free research data management? Not an oxymoron.

    I occasionally post about "RDM" (research data management), an activity that has recently become a formalised and essential part of the research process. I say recently formalised, since researchers have of course kept research notebooks recording their activities and their data since the dawn of science, but not always in an open and transparent manner. The desirability of doing so was revealed by the 2009 "Climategate" events. In the UK, Climategate was apparently the catalyst which persuaded the funding councils (such as the EPSRC, the Royal Society, etc) to formulate policies which required all their funded researchers to adopt the principles of RDM by May 2015 and in their future research. An early career researcher here, anxious to conform to the funding body instructions, sent me an email a few days ago asking about one aspect of RDM which got me thinking.

    The question related to the divide between data as a separate research object (and which therefore has to be managed), and data as an inseparable part of the article narrative, which is of course ostensibly managed by the journal publication processes. Such data may often be the description of a process rather than simply tables of numbers or graphs. In chemistry it may include chemical names and chemical terms as part of an experimental procedure. For one nice illustration of such embedded data, go look at the chemical tagger page. Here the data is blending with the semantics, and the two are not easily separated. So, when such separation is not easily achieved, should the specific processes required by RDM as illustrated in the five bullet points below actually be followed?

    1. Specify a data management plan to be followed, as for example points 2-5 below.
    2. Decide upon a location for your data, separated into one for "live" or working data (the purpose simply being to ensure it is properly backed up) and the other for a sub-set of formally "published data" which has to be available for at least ten years after its publication.
    3. Use 2 to gather metadata (see 6-14 below) and in return get a DOI representing the location of the combined metadata + data, from a suitable registration authority such as DataCite.
    4. Quote this DOI(s) in the article describing the results of analysing the data and presenting hypotheses, and conversely once the article itself is allocated its own DOI from a registration authority such as CrossRef, update the metadata in item 3 so as to achieve a bidirectional link between the data and its narrative (and we assume that DataCite and CrossRef will also increasingly exchange the metadata they each hold about the items).
    5. Add both the data and the article DOIs to any institutional CRIS or current research information system (parenthetically, I regard this last stipulation as rather redundant if items 3 and 4 are working effectively, but it's a good interim measure whilst the overall system matures).

    So, should step 2 be included if the data itself is inextricably intertwined with the narrative and cannot be separated? My slightly surprising answer is yes! The reason is that it IS possible to generate metadata (data about the, possibly entwined, data) which CAN be processed in such a step. What forms would such metadata take?

    6. Identification of the researcher(s) involved. This would nowadays take the form of an ORCID (Open Researcher and Contributor ID).
    7. Identification of the hosting institution where the data has been produced. There is currently no equivalent to an ORCID for institutions, but it is very likely to come in the future.
    8. A date stamp formalising when the (meta)data is actually deposited.
    9. A title for the project being described. Here we see a blurring between the narrative/article and the data; a title is the shortest possible description of the narrative/article, and it may also apply to the data object(s) or it could have its own title.
    10. A slightly fuller abstract of the project being described. Here we see further blurring between the narrative and data objects.
    11. One can include "related identifiers", in particular the DOIs of any other relevant articles that might have been published which may expand the context of the data, and also the DOIs of any other relevant datasets which may have been allocated in step 2 above.
    12. It is also beneficial to include "chemical identifiers". These can take the form of InChI strings and InChI keys, which allow discretely defined molecular objects which were the object of the research to be tracked and which relate to both the narrative and any other data objects.
    13. If specific software has been used to analyse data, it too can be included as a "related identifier" (e.g. [cite]10.5281/zenodo.19272[/cite]).
    14. Potentially at least, if a well-defined instrument has been involved, it too could be included with its own "related identifier". With both 13-14, other issues may need addressing, such as versioning etc, but this no doubt will be sorted in due course.
    15. etc.

    So items such as 6-14 can be collected and sent to e.g. DataCite with a DOI received in return as part of item 2 of the RDM processes. No "pure" data need be involved, only metadata. Nonetheless such metadata can only increase the visibility and discoverability of the research, as illustrated in how such metadata can be searched for the components described above.
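
    To make this concrete, a metadata-only deposit might be assembled along the following lines. This is a sketch rather than the DataCite schema itself: the field names are loosely modelled on DataCite properties but should be treated as assumptions, and the title, abstract, date and InChIKey are deliberate placeholders. The software DOI is the one quoted in the list above, and the ORCID is the author's, as quoted elsewhere on this blog. The helper has_data_files is hypothetical:

```python
import json

record = {
    "creators": [
        {"nameIdentifier": "0000-0002-8635-8390", "nameIdentifierScheme": "ORCID"}
    ],
    "dates": [{"dateType": "Submitted", "date": "YYYY-MM-DD"}],  # placeholder
    "titles": ["<project title>"],                               # placeholder
    "descriptions": ["<short abstract>"],                        # placeholder
    "relatedIdentifiers": [
        # software used to analyse the data, from the list above
        {"relatedIdentifierType": "DOI", "relatedIdentifier": "10.5281/zenodo.19272"},
    ],
    "subjects": [
        {"subjectScheme": "inchikey", "subject": "<InChIKey of molecule>"},  # placeholder
    ],
}


def has_data_files(rec: dict) -> bool:
    """A 'data-free' record carries no file entries, only metadata."""
    return bool(rec.get("files"))


payload = json.dumps(record, indent=2)
print(payload)
print("contains data files:", has_data_files(record))
```

    The point of the sketch is simply that every field is metadata; nothing in the record requires a data file to exist.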

  • Collaborative FAIR data sharing.

    I want to describe a recent attempt by a group of collaborators to share the research data associated with their just published article.[cite]10.1021/jacs.5b13070[/cite]

    I am here introducing things in a hierarchical form (i.e. not necessarily the serial order in which actions were taken).

    1. The data repository selected for the data sharing is described by (re3data) doi: 10.17616/R3K64N[cite]10.17616/R3K64N[/cite]
    2. A collaborative project collection was established on this repository (doi: 10.14469/hpc/244[cite]10.14469/hpc/244[/cite]). This data collection has some of the following attributes:
    3. Its metadata is sent here: https://search.datacite.org/ui?&q=10.14469/hpc/244 where it can be queried for other details.
    4. The project collaborators are all identified by their ORCID, used to obtain further individual information about the researchers. This information is also propagated to the metadata sent to DataCite.
    5. In the section labelled associated DOIs there is a link to the recently published peer-reviewed article, which itself cites the data via doi: 10.14469/hpc/244 and which thus establishes a bidirectional link between the article and its data.
    6. Also in the associated DOIs section are other DOIs (to two figures and two tables) held in a separate location. One example is doi: 10.14469/hpc/332[cite]10.14469/hpc/332[/cite], which illustrates the original type of data sharing we started about 10 years ago. This form has been variously called a "WEO" or Web-enhanced object (by the ACS) or interactivity boxes (RSC, etc). In such WEOs, we wrap the data into an interactive visual appearance using Jmol or JSmol software. The data itself is directly available to the reader using the Jmol export functions (right mouse click in the visual window).

      • In this specific example the WEO has been assigned its DOI using the repository noted above.[cite]10.17616/R3K64N[/cite] 
      • We have in the past also used Figshare[cite]10.17616/R3PK5R[/cite] for this purpose, see e.g. 10.6084/m9.figshare.1181739
      • The WEO can itself reference a more complete set of data used to create the visual appearance, for example data that allows the wavefunction of the molecule to be computed, doi: 10.6084/m9.figshare.2581987.v1[cite]10.6084/m9.figshare.2581987.v1[/cite]. In this instance this is held on the Figshare[cite]10.17616/R3PK5R[/cite] repository.
    7. The collection has another section labelled Members. These are individual datasets associated with the collection and held on the SAME repository as the collection itself. In this case, there are five such members, two of which are listed below:

      1. 10.14469/hpc/281[cite]10.14469/hpc/281[/cite] contains a variety of other data such as outputs from an IRC (intrinsic reaction coordinate), energy profile diagrams and ZIP archives of other calculations.
      2. 10.14469/hpc/272[cite]10.14469/hpc/272[/cite] itself contains five members, one of which is e.g.

        • 10.14469/hpc/267[cite]10.14469/hpc/267[/cite] which contains a ZIP archive with NMR data (see here for how this might be packaged in the future) and a file for a GPC (chromatography) instrument.
        • This last item also contains a new section labelled Metadata, which includes e.g. the InChI key and InChI string for the molecule whose properties are reported.
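
    The hierarchy described above (a collection whose members may themselves be collections) can be pictured as a simple tree. The sketch below is only an illustration of that structure, not of how the repository stores it. It includes just the members explicitly named in the text, and the function name walk is hypothetical:

```python
# Only the members explicitly named in the text are included; both
# collections are described as having further members not listed here.
collection = {
    "doi": "10.14469/hpc/244",
    "members": [
        {"doi": "10.14469/hpc/281", "members": []},
        {
            "doi": "10.14469/hpc/272",
            "members": [
                {"doi": "10.14469/hpc/267", "members": []},
            ],
        },
    ],
}


def walk(node, depth=0):
    """Yield (depth, doi) for a node and all of its descendants, parents first."""
    yield depth, node["doi"]
    for child in node["members"]:
        yield from walk(child, depth + 1)


for depth, doi in walk(collection):
    print("  " * depth + doi)
```

    Because every node has its own DOI, each level of the tree can be cited, date-stamped and attributed separately.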

    If this mode of presenting data seems a little more complex than a single monolithic PDF file, it's because it's designed for:

    1. collaboration between scientists, potentially at different locations and institutions.
    2. attribution of provenance/credit for the individual items (via ORCID).
    3. separate date stamping by the various contributors.
    4. providing bi-directional links between data and publications.
    5. holding what we call FAIR (findable, accessible, interoperable and reusable) data, rather than just data encapsulated in a PDF file.
    6. collecting, storing and sending metadata for aggregation in a formal way, i.e. to DataCite using a formal schema to render the metadata properly searchable.

    Thus 10.14469/hpc/244 represents our most complex attempt yet at such collaborative FAIR data sharing with multiple contributors. The tools for packaging many of the datasets are still quite limited (see again here) and the design is still being optimised (call it α). When the repository[cite]10.17616/R3K64N[/cite] has been more extensively tested, we intend to make it available as open source for others to experiment with. And of course, when this happens the source code too will have its own DOI!


    A refactoring of the Figshare site in December 2015 meant that the DOI no longer points directly to the WEO, and you have to follow a manually inserted link on that page to see it.

  • Global initiatives in research data management and discovery: searching metadata.

    The upcoming ACS national meeting in San Diego has a CINF (chemical information division) session entitled "Global initiatives in research data management and discovery". I have highlighted here just one slide from my contribution to this session, which addresses the discovery aspect of the session.

    Data, if you think about it, is rarely discoverable other than by intimate association with a narrative or journal article. Even then, the standard procedure is to identify the article itself as being of interest, and then to dig out the "supporting information", which normally takes the form of a single paginated PDF document. If you are truly lucky, you might also get a CIF file (for crystal structures). But such data has little life of its own outside of its parent, the article. Put another way, it has no metadata it can call its own (metadata is data about an object, in this case research data). An alternative is to try to find the data by searching conventional databases such as CAS, Beilstein/Reaxys or CSD, and there of course the searches can be very precise. But (someone) has to pay the bills for such accessibility.

    We are now starting to see quite different solutions to finding data (the F in FAIR data, the other letters representing accessibility, interoperability and re-usability). These solutions depend on metadata being a part of the solution from the outset, rather than any afterthought produced as a commercial solution. The collection of metadata is part of the overall process called RDM, or research data management, perhaps even the most important part of it. In exchange for identifying metadata about one's data, one gets back a "receipt" in the form of a persistent identifier for the data, more commonly known as a DOI. The agency that issues the DOI also undertakes to look after the donated metadata, and to make it searchable. The table below shows eight searches of such metadata, one example of how to acquire statistics relating to the usage of the data and one search of how to find repositories containing the data.

    Search queries enabled by the use of metadata in data publication
    # Search query* Instances retrieved:
    1 http://search.datacite.org/ui?q=alternateIdentifier:InChIKey\:*  InChI key
    2 http://search.datacite.org/ui?q=alternateIdentifier:InChI\:*  InChI identifier
    3 http://search.datacite.org/ui?q=alternateIdentifier:InChIKey:CULPUXIDFLIQBT-UHFFFAOYSA-N InChI key CULPUXIDFLIQBT-UHFFFAOYSA-N 
    4 http://search.datacite.org/ui?q=ORCID:0000-0002-8635-8390+alternateIdentifier:InChIKey\:* ORCID 0000-0002-8635-8390 AND (boolean) InChI key.
    5 http://search.datacite.org/ui?q=ORCID:0000-0002-8635-8390+alternateIdentifier:InChI\:InChI=1S\/C9H11N5O3* ORCID 0000-0002-8635-8390 AND (boolean) + InChI string 1S/C9H11N5O3 with the * wild.
    6 http://search.datacite.org/ui?q=has_media:true&fq=prefix:10.14469 Has content media for Publisher 10.14469 (Imperial College)
    7 http://search.datacite.org/ui?q=format:chemical/x-* Data format type chemical/x-* 
    8 http://search.datacite.org/api?&q=prefix:10.14469&fq=alternateIdentifier:InChIKey\:*&fl=doi,title,alternateIdentifier&wt=json&rows=15
      http://api.labs.datacite.org/works?q=prefix:10.14469+AND+alternateIdentifier:InChIKey\:*
      First 15 hits in JSON format, batch query mode
    9 http://stats.datacite.org/?fq=datacentre_facet:"BL.IMPERIAL – Imperial College London" resolution statistics for publisher 10.14469 (Imperial College) per month
    10 http://service.re3data.org/search?query=&subjects[]=31 Chemistry Research data repository search for Chemistry (135 hits)

    In this instance the three MIME media types are chemical/x-wavefunction, chemical/x-gaussian-checkpoint and chemical/x-gaussian-log. See [cite]10.1021/ci9803233[/cite] for chemical MIME (multipurpose internet mail extensions).


    Anyone familiar with the standard ways of finding data (CAS, CSD, Reaxys) will appreciate that the above does not yet have the finesse to find e.g. sub-structures of chemical structures, synthetic procedures or molecular properties. My including it here is primarily to show some of the potential such systems have, and to remark particularly that the batch query capability of this infrastructure could indeed be used in the future to construct much more sophisticated systems. Oh, and to the end-user at least, the searches shown above do not require institutional licenses to use. Both the data and its metadata are free, mostly with a CC0 or CC BY 3.0 license for re-use (the R of FAIR).
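
    As a small illustration of that batch query capability, the query in row 8 of the table can be assembled from its individual parameters. The sketch below only builds the URL, using the parameter names visible in that row, and does not assume the endpoint still responds. The choice of which characters to leave unescaped is made so that the result reads like the URL in the table:

```python
from urllib.parse import urlencode

API = "http://search.datacite.org/api"

params = {
    "q": "prefix:10.14469",                    # restrict to one DOI prefix
    "fq": "alternateIdentifier:InChIKey\\:*",  # only records carrying an InChI key
    "fl": "doi,title,alternateIdentifier",     # fields to return
    "wt": "json",                              # response format
    "rows": 15,                                # number of hits
}

# Keep ':', '*', ',' and the backslash unescaped so the URL matches the table
url = f"{API}?{urlencode(params, safe=':*,' + chr(92))}"
print(url)
```

    Changing rows, fl or the prefix in the dictionary is then all that is needed to generate a family of related batch queries.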

    If more of interest related to this topic emerges at the ACS session, I will report back here.