Tag: Information science

  • A search of some major chemistry publishers for FAIR data records.

    In recent years, findable data has become ever more important (the F in FAIR). Here I test that F using the DataCite search service.

    First, an introduction to this service. It is a metadata database describing datasets and other research objects. One of its properties is relatedIdentifier, which records other identifiers associated with the dataset, for example the DOI of any published article associated with the data, although it could also be a pointer to related datasets.

    One can query thus:

    1. https://search.datacite.org/works?query=relatedIdentifiers.relatedIdentifier:*
      which retrieves the very healthy looking 6,179,287 works.
    2. One can restrict this to a specific publisher by the DOI prefix assigned to that publisher:
      ?query=relatedIdentifiers.relatedIdentifier:10.1021*
      which returns a respectable 210,240 works.
    3. It turns out that the major contributors to FAIR currently are crystal structures from the CCDC. One can remove them from the search to see what is left over:
      ?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+NOT+(identifier:*10.5517*) 
      and one is down to 14,213 works, of which many nevertheless still appear to be crystal structures. These may be links to other crystal datasets.
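    The three searches above differ only in the query string, so they can be generated for any DOI prefix. The following is a minimal sketch, assuming the base URL and query syntax quoted in this post still apply (the service may have changed since). The helper names related_to, search_url, search_2 and search_3 are hypothetical, introduced only for illustration, and the code builds the URLs without contacting the service.

```python
from urllib.parse import quote_plus

# Base URL as quoted in the post; assumed, not checked against the live service.
SEARCH_BASE = "https://search.datacite.org/works"
# DOI prefix the post uses to identify CCDC crystal structure records.
CCDC_PREFIX = "10.5517"


def related_to(prefix: str) -> str:
    """Clause matching works whose related identifier starts with a DOI prefix."""
    return f"relatedIdentifiers.relatedIdentifier:{prefix}*"


def search_url(query: str) -> str:
    """Encode a query in the same style as the URLs in the post (spaces become '+')."""
    return f"{SEARCH_BASE}?query={quote_plus(query, safe=':*()')}"


def search_2(prefix: str) -> str:
    """Search 2: all works related to articles under one publisher prefix."""
    return search_url(related_to(prefix))


def search_3(prefix: str) -> str:
    """Search 3: as search 2, but excluding records with the CCDC prefix."""
    return search_url(f"({related_to(prefix)}) NOT (identifier:*{CCDC_PREFIX}*)")


print(search_2("10.1021"))
print(search_3("10.1021"))
```

    Calling search_2 and search_3 with another publisher's DOI prefix gives the corresponding pair of URLs for that publisher.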

    I have performed searches 2 and 3 for some popular publishers of chemistry (the same set that were analysed here).

    Publisher    Search 2    Search 3
    ACS           210,240      14,213
    RSC           138,147       1,279
    Elsevier      185,351      56,373
    Nature         12,316       8,104
    Wiley         135,874       9,283
    Science         3,384       2,343
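    One way to read the table is as the share of each publisher's records that do not carry the CCDC prefix, i.e. search 3 divided by search 2. The sketch below uses only the counts listed above; the name non_ccdc_share is hypothetical. Since many search 3 hits still appear to be crystal structures, this share is at best an upper bound on the non-crystallographic data.

```python
# Counts copied from the table above: (search 2, search 3) for each publisher.
counts = {
    "ACS": (210_240, 14_213),
    "RSC": (138_147, 1_279),
    "Elsevier": (185_351, 56_373),
    "Nature": (12_316, 8_104),
    "Wiley": (135_874, 9_283),
    "Science": (3_384, 2_343),
}


def non_ccdc_share(search2: int, search3: int) -> float:
    """Fraction of search 2 records that remain once the CCDC prefix is excluded."""
    return search3 / search2


# Print one line per publisher with the share formatted as a percentage.
for publisher, (s2, s3) in counts.items():
    print(f"{publisher:<9}{non_ccdc_share(s2, s3):6.1%}")
```

    On these numbers the share is under 1% for RSC, about 7% for ACS and Wiley, about 30% for Elsevier, and roughly two thirds for Nature and Science.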

    These publishers all have significant numbers of datasets which at least accord with the F of FAIR. Many datasets may lack metadata pointing back to a published article, since that link can only be added once the DOI of the article appears, in other words AFTER the publication of the dataset. So these numbers are probably underestimates.

    How about the other way around? Rather than datasets that have a journal article as a related identifier, we could search for articles that have a dataset as a related identifier.

    1. ?query=(identifier:*10.1039*)+AND+(relatedIdentifiers.relatedIdentifier:*)
      returns, rather mysteriously, nothing found. It may be that there is no mapping of this search between the CrossRef and DataCite metadata schemas.
    2. And just to show the searches are behaving as expected:
      ?query=(relatedIdentifiers.relatedIdentifier:10.1021*)+AND+(identifier:*10.5517*)
      returns 196,027 works, which together with the 14,213 from search 3 accounts for all 210,240 works found in search 2.

    It will also be of interest to show how these numbers change over time. Is there an exponential increase? We shall see.

    Finally, we have not really explored adherence to e.g. the AIR of FAIR. That is for another post.

  • Examples please of FAIR (data); good and bad.

    The site fairsharing.org is a repository of information about FAIR (Findable, Accessible, Interoperable and Reusable) objects such as research data.

    A project to inject chemical components, rather sparse at the moment at the above site, is being promoted by workshops under the auspices of e.g. IUPAC and CODATA and the GO-FAIR initiative. One aspect of this activity is to help identify examples of both good (FAIR) and indeed less good (unFAIR) research data as associated with contemporary scientific journal publications.

    Here is one example I came across in 2017.[cite]10.1021/jacs.6b13229[/cite] The data associated with this article is certainly copious, 907 pages of it, not including data for 21 crystal structures! The latter is a good example of FAIR, being offered in a standard format (CIF) well-adapted for the type of data contained therein and for which there are numerous programs capable of visualising and inter-operating with (i.e. re-using) it. The former is in PDF, not a format originally developed for data and, one could argue, closer to the unFAIR end of the spectrum. More so when you consider that this one 907-page paginated document contains diverse information, including spectra on around 60 molecules. The spectra are all purely visual; they are obviously data, but in a form largely designed for human consumption and not for re-use by software. The text-based content of this PDF does have numerous patterns, which lend themselves to pattern recognition software such as OSCAR, but patterns are easily broken by errors or inexperience and so we cannot be certain what proportion of this can be recovered. The metadata associated with such a collection, if there is any at all, must be general and cannot be easily related to specific molecules in the collection. So I would argue that 907 pages of data as wrapped in PDF is not a good example of FAIR. But it is how almost all of the data currently being reported in chemistry journals is expressed. Indeed many a journal data editor (a relatively new introduction to the editorial teams) exerts a rigorous oversight over the data presented as part of article submissions to ensure it adheres to this monolithic PDF format.

    You can also visit this article in Chemistry World (rsc.li/2HG7lTk) for an alternative view of what could be regarded as rather more FAIR data. The article has citations to the FAIR components, which are not published as part of the article or indeed by the journal itself but are held separately in a research data repository. You will find them at doi: 10.14469/hpc/3657, where examples of computational, crystallographic and spectroscopic data are available.
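    Part of what makes a repository entry like this FAIRer is that a program can ask for its metadata directly via the DOI. The following is a minimal sketch, assuming the DOI resolver honours content negotiation for the DataCite JSON media type for this particular record (not verified here). The function name doi_metadata_request is hypothetical, and the request is only constructed, not sent.

```python
from urllib.request import Request

# Media type for DataCite JSON metadata; assumed to be served for this DOI.
DATACITE_JSON = "application/vnd.datacite.datacite+json"


def doi_metadata_request(doi: str) -> Request:
    """Build (without sending) a request asking the DOI resolver for metadata."""
    return Request(f"https://doi.org/{doi}", headers={"Accept": DATACITE_JSON})


# The repository DOI quoted in the post.
req = doi_metadata_request("10.14469/hpc/3657")
print(req.full_url, req.get_header("Accept"))
```

    Passing req to urllib.request.urlopen would perform the actual lookup. No equivalent single request is possible for the contents of a 907-page PDF.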

    The workshop I allude to above will be held in July. Can I ask anyone reading this blog who has a favourite FAIR or indeed unFAIR example of data they have come across to share it here? We also need to identify areas simply crying out for FAIRer data to be made available as part of the publishing process, beyond the types noted above. I hope to report back on both such feedback and the events at this workshop in due course.

  • PIDapalooza 2018. A conference like no other!

    Another occasional conference report (day 1). So why is one about “persistent identifiers” important, and particularly to the chemistry domain?

    The PID most familiar to most chemists is the DOI (digital object identifier). In fact there are many; some 60 types have been collected by ORCID (themselves purveyors of researcher identifiers). They sometimes even have different names; in life sciences they tend to be known instead as accession numbers. One theme common to many (probably not all) is that they represent sources of metadata about the object being identified, further information which allows you (or a machine) to decide if acquiring the full object is worthwhile. So in no particular order, here are some of the things I learnt today.

    1. Mark Hahnel noted the recent launch of the Dimensions resource, which links research data with other research activities; I have not yet had a chance to learn its capabilities, but it seems an interesting alternative to other stalwarts such as Google Scholar.

      You can try this example: https://app.dimensions.ai/discover/publication?search_text=10.6084&search_type=kws&full_search=true which retrieves articles in which the data repository with prefix 10.6084 (Figshare) is cited. Try also the prefix 10.14469 which is the Imperial College repository.

    2. Andy Mabbett talked about the deployment and use of persistent identifiers (the Q numbers) in Wikidata, which increasingly underpin the various flavours of Wikipedia. He also noted their use of some 50 different identifiers.
    3. Johanna McEntyre noted some 5M published articles in life sciences which reference 1M+ ORCID identifiers, easily the domain with the fastest uptake of this type. Also noted was the new FREYA project, which aims to connect open identifiers for discovery, access and use of research resources.
    4. Tom Gillespie talked about RRID, or Research Resource Identifiers. These include hardware such as instruments, with around 6000 RRIDs systematized so far. They argue this area promotes both the A and I of FAIR (accessible and inter-operable). Of course A and I mean many things to many people.
    5. Several other presentations talked about the finer detail of metadata, such as sub-classifications into e.g. descriptive/admin/technical, but I did rather miss demos showing how search queries of such fine-grained metadata could be constructed.
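    The Dimensions example in item 1 differs between repositories only in its search_text parameter. Here is a minimal sketch that rebuilds that URL for any DOI prefix, assuming the URL pattern quoted above remains valid; the function name dimensions_url is hypothetical.

```python
from urllib.parse import urlencode

# URL pattern copied from the example in item 1; assumed unchanged.
DIMENSIONS_BASE = "https://app.dimensions.ai/discover/publication"


def dimensions_url(search_text: str) -> str:
    """Build a Dimensions full-text keyword search URL for the given text."""
    params = {
        "search_text": search_text,
        "search_type": "kws",
        "full_search": "true",
    }
    return f"{DIMENSIONS_BASE}?{urlencode(params)}"


# The two repository prefixes mentioned in item 1.
for prefix in ("10.6084", "10.14469"):
    print(dimensions_url(prefix))
```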

    Apart from the presentations themselves, PIDapalooza is unusual for some other activities. Thus you could go get your PIDnails done, with a selection of 8 or so tasteful logos to choose from. There will be tattoos tomorrow (this is a conference for younger people after all). I may grab a photo or two to provide evidence!

     

  • PIDapalooza 2018: the open festival for persistent identifiers.

    PIDapalooza is a new forum concerned with discussing all things persistent, hence PID. You might wonder what possible interest a chemist might have in such an apparently arcane subject, but think of it in terms of how to find the proverbial needle in a haystack in a time when needles might all look very similar. Even needles need descriptions; they are not all alike, and PIDs are a way of providing high quality information (metadata) about a digital object.

    The topics for discussion along with descriptions are now available at https://pidapalooza18.sched.com/list/descriptions/ and yes, before you ask, the event has its own PID (DOI: 10.5438/11.0002). Check out the speakers at https://pidapalooza18.sched.com/directory/speakers. I will be telling some stories from chemistry, and who knows, even some of the posts on this blog might feature. And if you do not brush up on the topic, no doubt your librarian, your funding body and your publisher will be telling you about it soon!