Tag: Technology/Internet

Metametadata: data about data about (chemical) data.

Scientists are familiar with the term data, at least in a scientific or chemical context, but appreciating metadata (the prefix meta meaning "after", or "beyond") is slightly more subtle, in the sense of using it to mean data about data. The challenge lies in clarifying where the boundary between data and its metadata lies and in specifying and controlling the vocabulary used for these metadata descriptions. Items in a chemical metadata dictionary might include e.g. subject classifications such as Organic Molecular Chemistry or identifiers such as InChIKey. But what could metametadata be? Here I briefly show some examples by way of illustration.

Let me start by defining a data repository as a store of both data and the metadata describing it. The metadata is to be exposed in a standard manner which allows it to be aggregated by other agencies. Nowadays, it is becoming common to identify such a data object together with its metadata using a persistent identifier, or DOI. But to decide if any particular repository and the data objects contained therein are generally useful to you, you need information about the metadata itself. Technically, this is defined using a schema[cite]10.2312/re3.008[/cite] describing the metadata (which might e.g. identify any dictionaries used); hence metametadata. Now you need to store the metametadata and so I introduce the concept of a registry which does this. This metametadata object is itself assigned a DOI^‡ and here I list these DOIs for a personal selection of some chemically oriented examples, in this case deriving from the largest registry of research data repositories re3data.org. You can search for your own entry at their site: http://service.re3data.org/search.

Data repository	The repository metametadata DOI^♣	Badge
Figshare	10.17616/R3PK5R[cite]10.17616/R3PK5R[/cite]
Zenodo	10.17616/R3QP53[cite]10.17616/R3QP53[/cite]
Cambridge structure database	10.17616/R36011[cite]10.17616/R36011[/cite]
Crystallographic open database	10.17616/R37S31[cite]10.17616/R37S31[/cite]
Oxford University Research Archive	10.17616/R3Q056[cite]10.17616/R3Q056[/cite]
Open Notebook Science	10.17616/R3859D[cite]10.17616/R3859D[/cite]
Usefulchem	10.17616/R3Z89N[cite]10.17616/R3Z89N[/cite]
Chemotion	10.17616/R34P5T[cite]10.17616/R34P5T[/cite]
Chemspider	10.17616/R38P4P[cite]10.17616/R38P4P[/cite]
Chemical Database Service	10.17616/R36P42[cite]10.17616/R36P42[/cite]
Imperial College HPC data repository.	10.17616/R3K64N[cite]10.17616/R3K64N[/cite],[cite]10.14469/hpc/382[/cite]
Imperial College SPECTRa repository.[cite]10.1021/ci7004737[/cite]	10.17616/R30316[cite]10.17616/R30316[/cite]

Not all of the repositories listed in the table above assign formal DOIs to their data collections, meaning that the metadata for their entries cannot be aggregated in a searchable manner using e.g. search.datacite.org/ui (or search.datacite.org/api for the machine version). Currently, the metametadata does not fully carry this information, an aspect which I gather will be rectified in a future revision of the re3data schema.[cite]10.2312/re3.008[/cite]

Importantly, both metadata and (repository) metametadata can be searched using APIs (application programming interfaces), ensuring that the entire flow of meta information can be subject to automated software analysis rather than just visual inspections by a human. This should allow a rich and open infrastructure for handling research objects or data to be built up using hierarchical metadata. The examples above indeed show that the chemical space is already the largest component of the Natural Sciences space.

Although the edifice is still largely in its infancy, already I think we can start to see an alternative open approach emerging to "Googling" for data, or the even older traditional bespoke (i.e. non-open) services offered by commercial human-based abstractors of chemical metadata.

^‡This DOI is information about the metametadata, and hence it is metametametadata, or m3data. Sorry!

^♣The citations at the foot of this post are generated entirely automatically (by a WordPress plugin called Kcite) from the m3data associated with each entry, i.e. the DOI listed. Were the persistent identifier for the entry ever to be changed, this would propagate automatically to the citation, unlike the static entries in the table.
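
The automatic citation generation mentioned in the second footnote depends on a DOI being resolvable to machine-readable metadata rather than only to a landing page. Below is a minimal sketch of how such a request might be prepared. It assumes that the DOI resolver at doi.org honours HTTP content negotiation for the DOI in question and that the CSL JSON media type is among the representations offered; how Kcite itself does this is not described here. The helper name `metadata_request` is hypothetical, and the request is only constructed, not sent.

```python
import urllib.request

# Assumption: the DOI proxy supports content negotiation for this DOI.
DOI_RESOLVER = "https://doi.org/"
# Assumption: CSL JSON is one of the metadata representations on offer.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def metadata_request(doi, media_type=CSL_JSON):
    """Build (but do not send) a request for the metadata behind a DOI."""
    return urllib.request.Request(
        DOI_RESOLVER + doi,
        headers={"Accept": media_type},
    )


# DOI taken from the table above (the re3data entry for Zenodo).
request = metadata_request("10.17616/R3QP53")
print(request.full_url)
print(request.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would perform the actual lookup; what comes back depends on the agency that registered the DOI.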

April 16, 2016

Publishing embargoes.
Publishing embargoes seem a relatively new phenomenon, probably starting in areas of science when the data produced for a scientific article was considered more valuable than the narrative of that article. However, the concept of the embargo seems to be spreading to cover other aspects of publishing, and I came across one recently which appears to take such embargoes into new and uncharted territory.

One example (there are many others) of embargoes continuing to operate in the era of open science and open data relates to crystallographically derived coordinates for macromolecules. Biomolecular structures are allowed to be embargoed for a maximum of one year before becoming openly available or "released" (considered a friendlier term than embargo). A more recent phenomenon is of embargoes on press releases which may be prepared by authors and/or publishers to accompany the appearance of any article considered especially newsworthy. The publisher will then request that the press release is only released to coincide with the actual publication time and date of the article itself. Both of these types of embargo are more or less accepted by both parties. But in the last five years or so, new types of embargo have been introduced and it is these I want to discuss here.
1. The self-archive or "green open access" version of an article, in the form of the last author version of an accepted manuscript prior to copy-editing and other operations by a publisher. Such Green OA versions are now a mandatory requirement from funders (in the UK), arising from the need to conduct a "REF" or research excellence framework assessment of all (UK) universities every seven years or so. In order to allow assessors and funding councils unencumbered access to these research outputs, the authors must self-archive their publications in a suitable institutional repository. In general therefore, there should always exist two versions of any scientific paper authored within these guidelines, the AV (author version) and VoR (Version of Record, held by the publisher, and carrying the guarantee of peer review). Publishers now embargo author versions until the VoR version has been published, and sometimes even up to 18 months beyond this period.
2. The "supporting information" or SI embargo. This is closely related to the crystallographic data embargo noted above, but it applies in general to most other data and information associated with an article. Until very recently, most SI was in fact handled by the publisher themselves, and so it was released at the same time as the article. Since it is becoming more common to deposit data and SI in a separate repository, some publishers mandate that the release dates of this material must not precede the article itself. Deposition of such data has also become a mandatory requirement from (UK) funders since May 2015, and I have blogged about such "research data management" often here. In effect, both the scientific article and the data supporting it achieve their own DOIs or persistent digital identifiers, allowing easy and independent access to either the article OR its data. In fact, assigning such a DOI has a more subtle effect; creating a DOI means that metadata describing the object is also created and then aggregated by the agency issuing the DOI such as CrossRef and DataCite. Importantly, one should note that SI which is handled purely by the publisher will not have its own separate DOI and it will not have its own metadata. The data metadata for example can include the DOI for the article, and vice versa. I have shown examples of the utility of such metadata for data in an earlier post.
3. So now we come to the most recent embargo, which has surfaced since around May 2015, as increasingly data has become a first class object in its own right with its own DOI and importantly its own metadata. There is now evidence that some publishers are requesting that this very metadata about data is also subjected to an embargo, not to be released before the article which makes use of that data is itself released. So data can be deposited in "dark form" prior to a publication, but the metadata (which carries the date stamp and provenance for the deposition) may have to be "dark" or embargoed. Actually, this is not yet very common; for example I asked the Royal Society of Chemistry what their policy was, with the reply "the Royal Society of Chemistry wouldn’t require metadata about the data files to be embargoed".
We live in an era where the very careers of researchers can be determined by their claim to priority about scientific discoveries. The date stamps for priority continue to be largely controlled and issued by publishers and some may decide that it will be in their business interests to extend their control to data. Perhaps they may even wish to control all aspects of publication including the data and its metadata, acting as self-proclaimed research facilitators.

At this moment, this has not happened; both data and its metadata can remain open and FAIR. Which is where I think we should go in the future in the interests of open science itself.
April 13, 2016

Global initiatives in research data management and discovery: searching metadata.

The upcoming ACS national meeting in San Diego has a CINF (chemical information division) session entitled "Global initiatives in research data management and discovery". I have highlighted here just one slide from my contribution to this session, which addresses the discovery aspect of the session.

Data, if you think about it, is rarely discoverable other than by intimate association with a narrative or journal article. Even then, the standard procedure is to identify the article itself as being of interest, and then digging out the "supporting information", which normally takes the form of a single paginated PDF document. If you are truly lucky, you might also get a CIF file (for crystal structures). But such data has little life of its own outside of its parent, the article. Put another way, it has no metadata it can call its own (metadata is data about an object, in this case research data). An alternative is to try to find the data by searching conventional databases such as CAS, Beilstein/Reaxys or CSD, and there of course the searches can be very precise. But (someone) has to pay the bills for such accessibility.

We are now starting to see quite different solutions to finding data (the F in FAIR data, the other letters representing accessibility, interoperability and re-usability). These solutions depend on metadata being a part of the solution from the outset, rather than any afterthought produced as a commercial solution. The collection of metadata is part of the overall process called RDM, or research data management, perhaps even the most important part of it. In exchange for identifying metadata about one's data, one gets back a "receipt" in the form of a persistent identifier for the data, more commonly known as a DOI. The agency that issues the DOI also undertakes to look after the donated metadata, and to make it searchable. The table below shows eight searches of such metadata, one example of how to acquire statistics relating to the usage of the data and one search of how to find repositories containing the data.

Search queries enabled by the use of metadata in data publication
#	Search query^*	Instances retrieved:
1	http://search.datacite.org/ui?q=alternateIdentifier:InChIKey\:*	InChI key
2	http://search.datacite.org/ui?q=alternateIdentifier:InChI\:*	InChI identifier
3	http://search.datacite.org/ui?q=alternateIdentifier:InChIKey:CULPUXIDFLIQBT-UHFFFAOYSA-N	InChI key CULPUXIDFLIQBT-UHFFFAOYSA-N
4	http://search.datacite.org/ui?q=ORCID:0000-0002-8635-8390+alternateIdentifier:InChIKey\:*	ORCID 0000-0002-8635-8390 AND (boolean) InChI key.
5	http://search.datacite.org/ui?q=ORCID:0000-0002-8635-8390+alternateIdentifier:InChI\:InChI=1S\/C9H11N5O3*	ORCID 0000-0002-8635-8390 AND (boolean) InChI string 1S/C9H11N5O3 with the * wildcard.
6	http://search.datacite.org/ui?q=has_media:true&fq=prefix:10.14469	Has content media^‡ for Publisher 10.14469 (Imperial College)
7	http://search.datacite.org/ui?q=format:chemical/x-*	Data format type chemical/x-*
8	http://search.datacite.org/api?&q=prefix:10.14469& fq=alternateIdentifier:InChIKey\:& fl=doi,title,alternateIdentifier& wt=json&rows=15 http://api.labs.datacite.org/works?q=prefix:10.14469+AND+alternateIdentifier:InChIKey\:	First 15 hits in JSON format, batch query mode
9	http://stats.datacite.org/?fq=datacentre_facet:"BL.IMPERIAL – Imperial College London"	resolution statistics for publisher 10.14469 (Imperial College) per month
10	http://service.re3data.org/search?query=&subjects[]=31 Chemistry	Research data repository search for Chemistry (135 hits)

^‡In this instance the three MIME media types are chemical/x-wavefunction, chemical/x-gaussian-checkpoint and chemical/x-gaussian-log. See[cite]10.1021/ci9803233[/cite] for chemical MIME (multipurpose internet media extensions).

Anyone familiar with the standard ways of finding data (CAS, CSD, Reaxys) will appreciate that the above does not yet have the finesse to find e.g. sub-structures of chemical structures, synthetic procedures or molecular properties. My including it here is primarily to show some of the potential such systems have, and to remark particularly that the batch query capability of this infrastructure could indeed be used in the future to construct much more sophisticated systems. Oh, and to the end-user at least, the searches shown above do not require institutional licenses to use. Both the data and its metadata are free, mostly with a CC0 or CC BY 3.0 license for re-use (the R of FAIR).
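
As an illustration of the batch query mode, here is a minimal sketch that assembles a URL of the kind shown in row 8 of the table. The endpoint and the parameter names (`q`, `fq`, `fl`, `wt`, `rows`) are copied from that row, and the wildcard form of the identifier filter follows row 1; the service may since have moved or changed. The function names `escape_colon` and `build_query` are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Endpoint as quoted in row 8 of the table; it may since have moved.
API_ENDPOINT = "http://search.datacite.org/api"


def escape_colon(term):
    """Backslash-escape colons so they are not read as field separators."""
    return term.replace(":", "\\:")


def build_query(prefix, identifier_scheme, rows=15):
    """Assemble a batch query URL from the parameters used in row 8."""
    # Wildcard filter in the style of row 1, e.g. alternateIdentifier:InChIKey\:*
    identifier_filter = (
        "alternateIdentifier:" + escape_colon(identifier_scheme + ":") + "*"
    )
    params = [
        ("q", "prefix:" + prefix),
        ("fq", identifier_filter),
        ("fl", "doi,title,alternateIdentifier"),
        ("wt", "json"),
        ("rows", str(rows)),
    ]
    # urlencode percent-encodes the colons, backslash and asterisk.
    return API_ENDPOINT + "?" + urlencode(params)


url = build_query("10.14469", "InChIKey")
print(url)
```

Building the query from a list of parameters rather than by string concatenation leaves the percent-encoding to the library, which matters here because the query syntax itself uses colons and backslashes.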

If more of interest related to this topic emerges at the ACS session, I will report back here.

March 7, 2016

LEARN Workshop: Embedding Research Data as part of the research cycle
I attended the first (of a proposed five) workshops organised by LEARN (an EU-funded project that aims to ...Raise awareness in research data management (RDM) issues & research policy) on Friday. Here I give some quick bullet points relating to things that caught my attention and/or interest. The program (and Twitter feed) can be found at https://learnrdm.wordpress.com where others' comments can also be seen.
- Henry Oldenburg, founder member and first secretary of the Royal Society, was the first Open Scientist.
- About 100 people attended the workshop. Of these ~3-5 identified themselves as researchers creating data, and the rest comprised research data managers, administrators, librarians, publishers (but see below) etc. Many were new to their posts.
- Not publishing scientific data should become recognised as scientific malpractice.
- Central libraries should pro-actively disperse their knowledge to data scientists in departments.
- If a scientist is concerned that openly publishing their data might give advantage to their competitors, they are urged to counteract this by "being cleverer than the others".
- The three great bastions of open science are (a) Open Data, (b) Open access articles and (c) doing science openly. Examples of this third category include open notebook science (ONS), a form notably pioneered by Jean-Claude Bradley. One attribute of ONS was noted as no insider knowledge.
- Learned societies should endow medals for Open Science.
- (Some) publishers are reinventing themselves as Research Facilitators.
The plenaries are all well worth dipping into (certainly the video and in some cases all the slides are scheduled to appear).

If you are a researcher (undergraduate students, PGs, PDRAs, early career researchers and academics) you should immediately track down your local evangelist/expert in RDM and ask what the local infrastructures are (or will be shortly built).
February 1, 2016
A visualization of the anomeric effect from crystal structures.
The anomeric effect is best known in sugars, occurring in sub-structures such as RO-C-OR. Its origins relate to how the lone pairs on each oxygen atom align with the adjacent C-O bonds. When the alignment is 180°, one oxygen lone pair can donate into the C-O σ* empty orbital and a stabilisation occurs. Here I explore whether crystal structures reflect this effect.

The torsion angles along each O-C bond are specified, along with the two C-O distances. All the bonds are declared acyclic, and the usual R < 5%, no disorder and no errors specified.
1. You can see from the plot below that the hotspot occurs when both RO-CO torsions are ~65°. From this we will assume that the two (unseen)^‡ lone pairs at any one of the oxygens are distributed approximately tetrahedrally around each oxygen, and if this is true then one of them must by definition be oriented ~180° with respect to the same RO-CO bond (the other is therefore oriented ~-60°). This allows it to be antiperiplanar to the adjacent C-O bond and hence interact with its σ* empty orbital. So the hotspot corresponds to structures where BOTH oxygen atoms have lone pairs which interact with the adjacent O-C antibonding (σ*) orbital.
2. There is a tiny cluster for which both RO-CO torsions are ~180° and hence neither oxygen has an antiperiplanar lone pair.
3. Only slightly larger are clusters where one torsion is ~65° and the other ~180°, meaning that only one oxygen has an antiperiplanar lone pair.
4. A plot of the two C-O lengths indeed shows an overall hotspot at ~1.40Å for both distances. If the search is filtered to include only torsions in the range 150-180°, the hotspot value increases to 1.415Å for both. If one torsion is restricted to 40-80° and the other to 150-180° the hotspot shows one C-O bond is about 0.012Å shorter than the other.
I also include a further constraint, that the diffraction data must be collected below 140K. The hotspot moves to ~ 55/60° indicating values free of some vibrational noise.
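
The geometric argument in the list above can be sketched in a few lines. The first function computes a torsion angle from four sets of atomic coordinates using a standard vector formula; the second places the two lone pairs at ±120° from the R-O-C-O torsion, which is the idealised tetrahedral assumption made in point 1 and not something measured. The function names and the test coordinates are hypothetical illustrations, not part of the crystallographic search itself.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Torsion angle p0-p1-p2-p3 in degrees, between -180 and 180."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    length = math.sqrt(_dot(b1, b1))
    axis = (b1[0] / length, b1[1] / length, b1[2] / length)
    # Components of the two outer bonds perpendicular to the central bond.
    k0 = _dot(b0, axis)
    k2 = _dot(b2, axis)
    v = _sub(b0, (axis[0] * k0, axis[1] * k0, axis[2] * k0))
    w = _sub(b2, (axis[0] * k2, axis[1] * k2, axis[2] * k2))
    x = _dot(v, w)
    y = _dot(_cross(axis, v), w)
    return math.degrees(math.atan2(y, x))


def wrap(angle):
    """Map an angle in degrees into the range (-180, 180]."""
    angle = angle % 360.0
    return angle - 360.0 if angle > 180.0 else angle


def lone_pair_torsions(ro_co_torsion):
    """Idealised lone pair torsions, assuming a tetrahedral oxygen."""
    return wrap(ro_co_torsion + 120.0), wrap(ro_co_torsion - 120.0)


# A perfectly staggered 60° torsion puts one lone pair antiperiplanar.
print(lone_pair_torsions(60.0))
# The observed hotspot of ~65° leaves it within a few degrees of 180°.
print(lone_pair_torsions(65.0))
```

For the ideal 60° case the two values come out as 180° and -60°, matching the figures quoted in point 1; at 65° the antiperiplanar lone pair is displaced by only 5° from perfect alignment.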

Interestingly, replacing oxygen with nitrogen reveals relatively few examples of the effect (C(NR₂)₄ is an exception). Replacing O by divalent S produces only 13 hits, with the surprising result (below) that in all of them only one S sets up an anomeric interaction. Arguably, the number of examples is too low to draw any firm conclusions from this observation.

^‡Most diffractometers measure low angle scattering of X-rays by high density electrons. These are the core electrons associated with a nucleus rather than the valence electrons associated with lone pairs. Hence very few positions of valence lone pairs have ever been crystallographically measured.

Acknowledgments

This post has been cross-posted in PDF format at Authorea.
August 27, 2015
Single Figure (nano)publications, reddit AMAs and other new approaches to research reporting

I recently received two emails each with a subject line new approaches to research reporting. The traditional 350 year-old model of the (scientific) journal is undergoing upheavals at the moment with the introduction of APCs (article processing charges), a refereeing crisis and much more. Some argue that brand new thinking is now required. Here are two such innovations (and I leave you to judge whether that last word should have an appended ?).

To set the scene for the first, I will quote the abstract: “The single figure publication is a novel, efficient format by which to communicate scholarly advances. It will serve as a forerunner of the nano-publication, a modular unit of information critical for machine-driven data aggregation and knowledge integration.”[cite]10.12688/f1000research.6742.1[/cite] The kernel of this suggestion is (again I quote) “We offer the idea of the micro-publication unit, the single figure publication (SFP), to provide scholars with a real-world, manageable method to inform research.” I was struck by the overlap between this suggestion and the one you may find on many of the posts on this blog, where what I refer to as FAIR Data is assigned a digital object identifier (DOI) and included in the citation lists at the end of the post. The key phrase in the above abstract is machine-driven data aggregation and knowledge, although the article does not really go into any mechanisms for easily achieving this. It is my argument that the act of assigning a DOI carries with it the association that there is machine searchable metadata which can be retrieved and used for the aggregation and knowledge mining. The authors of this article, Do and Mobley, advocate adoption of nanopublications defined by inclusion of just a single figure (notably, not a table of results!) and some accompanying context which they claim would reduce the unit of publication to a more tractable size. This does raise the question of whether science needs more publications (in chemistry alone there are said to be more than a million published each year) or whether we should instead be concentrating our efforts on improving the data side of things by increasing its semantic content and formalising its structures, its preservation and curation. I certainly argue that far too little effort has been poured into these latter activities.
You only have to look at the typical SI (supporting information) associated with many chemistry articles to realise that in many cases they are still hardly fit for purpose. There is one concept introduced by Do and Mobley that also deserves mention. Their nanopublications are structured to be read by machines, not people. They will therefore not be refereed by people (my inference). They do not really discuss how else the quality will be assessed, but of course if you treat their nanopublication as essentially FAIR data, then it does become possible to develop methods of machine refereeing.

The second email alerted me to an article[cite]10.15200/winn.143871.12809[/cite] in the Winnower, a forum that offers a bridge between “traditional scholarly publishing tools to traditional and non-traditional scholarly outputs—because scholarly communication doesn’t just happen in scholarly journals”. Here, the concept of scholarly communication is extended to the New Reddit Journal of Science and introduces the concept pioneered by reddit of the AMA, or “ask me anything” environment. I occasionally publish some of the posts on this blog to the Winnower, receiving in return the increasingly ubiquitous DOI. I have also occasionally quoted these DOIs in articles submitted to conventional chemistry journals. What we see now is the propagation of a Winnower DOI on to e.g. https://www.reddit.com/r/science/ where anyone^† can post a question related to the original research reporting. I must state that I do have some reservations about this. Whilst the majority of traditional scholarly reporting is likely to receive no AMAs (just as a very high proportion of research articles attract few if any citations in other articles over a period of decades), it is also likely that the quality of posted AMAs may turn out to be very low. At which point the original researcher has to make a judgement as to whether to devote any of their increasingly precious and fragmented time to answering them. And if few if any answers are posted in response to an AMA, the system seems unlikely to flourish.

But what we see here are two serious attempts to develop new approaches to research reporting, and no doubt others will emerge. To quote Yogi Berra, the future is not what it used to be.

^†Anyone can also post to this blog to ask similar questions. But note that associating an ORCID with such comments is highly recommended. I do not think that reddit currently supports ORCID, but I would argue if the intent is serious, it certainly should.

August 5, 2015
The 2015 Bradley-Mason prize for open chemistry.

Open principles in the sciences in general and chemistry in particular are increasingly nowadays preached from funding councils down, but it can be more of a challenge to find innovative practitioners. Part of the problem perhaps is that many of the current reward systems for scientists do not always help promote openness. Jean-Claude Bradley was a young scientist who was passionately committed to practising open chemistry, even though when he started he could not have anticipated any honours for doing so. A year ago a one day meeting at Cambridge was held to celebrate his achievements, followed up with a special issue of the Journal of Cheminformatics. Peter Murray-Rust and I both contributed and following the meeting we decided to help promote Open Chemistry via an annual award to be called the Bradley-Mason prize. This would celebrate both “JC” himself and Nick Mason, who also made outstanding contributions to the cause whilst studying at Imperial College. The prize was initially to be given to an undergraduate student at Imperial, but was also extended to postgraduate students who have promoted and showcased open chemistry in their PhD researches.

Peter and I are delighted to announce the inaugural winners of this prize.

The postgraduate winner is Tom Phillips for his open blog describing his experiences as a PhD student and for leading by example. He has published his instrumental codes on Github (and now Zenodo[cite]10.5281/zenodo.19033[/cite]) and data and codes for reproducing the graphs in his work on the “lab on a chip” in Figshare[cite]10.6084/m9.figshare.1447208[/cite] and through his blog has encouraged other research students to do the same. Tom has worked assiduously to ensure that all the articles describing his PhD work are or will be open access.[cite]10.1039/C5LC00430F[/cite]

The undergraduate winner is Tom Arrow for his “spare time” involvement with Wikimedia (the foundation that underpins the open Wikipedia), including participating in a Wikimedia EU hackathon in Lyon, France, and feeding his experiences and skills back into his undergraduate environment as well as enhancing the teaching Wiki used by his fellow students. Tom took the lead in introducing us to Wikidata[cite]10.1145/2629489[/cite] for storing chemical data in an open Wikibase data repository and in promoting its use for enriching Wikipedia chemistry pages and showcasing open data in undergraduate teaching environments.

June 26, 2015

Personal web pages on digital repositories.

The university sector in the UK has quality inspections of its research outputs conducted every seven years, going by the name of REF or Research Excellence Framework. The next one is due around 2020, and already preparations are under way! Here I describe how I have interpreted one of its strictures; that all UK funded research outputs (i.e. research publications in international journals) must be made available in open unrestricted form within three months of the article being accepted for publication, or they will not be eligible for consideration in 2020.

At the outset, I should say that one infrastructure to help researchers adhere to the guidelines is being implemented in the form of the Symplectic system. This allows a researcher to upload the final accepted version of a manuscript. At Imperial College, a digital repository called Spiral serves this purpose and also acts as the front end for collecting informative metadata to enhance discoverability. The final accepted version is then converted by the publisher into a version-of-record. This contains styling unique to the publisher and the content is subjected to further scrutiny by the authors as proof corrections. In an ideal world, these latter changes should also be faithfully propagated back to the final accepted version, as would all the supporting information associated with the article. Since most authors do not exactly enjoy the delights of proof corrections, this final reconciliation of the two versions may not always be assiduously undertaken.

I became concerned about the existence of two versions of any given scientific report and that the task of ensuring total fidelity in the content of both versions may negatively impact on the author’s time. Much better if the publisher could grant permission for the author to archive the version-of-record into a digital repository.

Some experiments were needed, and I decided to start them in reverse, by archiving my oldest publications. Since Symplectic now provides a system to do this, I began by using it. Symplectic identifies each publisher’s policies for archival, of which the most liberal are known as ROMEO GREEN. To quote from the definition, this colour allows the author to “archive pre-print and post-print or publisher’s version/PDF”. In an afternoon I had processed most of my ROMEO GREEN articles. You know how it is sometimes, you do not read the fine print! And so the library soon informed me that archival of ROMEO GREEN was in fact only permitted on the author’s “personal web page”. Spiral, as an institutional repository, does not apparently constitute a personal web page for me and so none of my Symplectic submissions could be accepted for archival there.

Time to rethink the experiment. Firstly, I very much wanted the reprints to be held by a proper digital repository rather than a conventional web page. Why? I wanted my reprints to adhere as much as possible to FAIR: findable, accessible, interoperable and re-usable. Well, at least the first two of those (the last two relate more to data). A repository is designed to hold metadata in a formal and standards-based manner and metadata helps achieve FAIR. So I asked the Royal Society of Chemistry (as a ROMEO GREEN publisher) whether a personal web page hosted on a digital repository would qualify. I was soon informed that I had proposed a neat solution here, and they couldn’t see an issue.

Now, all I had to do was find a repository where I could create such a personal web page. The chemistry department at Imperial College has for ten years hosted a DSpace repository called SPECTRa[cite]10.1021/ci7004737[/cite] which already has the functionality for individuals to create personal collections. I had also picked up on the increasing attention being given to Zenodo, like the World-Wide Web itself an offshoot of CERN (of Large Hadron Collider fame) and born from the need for researchers to more permanently archive the outputs of their researches. These outputs include software, videos, images, presentations, posters, publications and (most obviously for CERN) datasets. I thought I would include them in my experiment as well. The results are summarised below.

	DSpace-SPECTRa	Zenodo
Community	Henry Rzepa personal web page reprint collection	Rzepa personal computational chemistry data and reprint page
Collection	Royal Society of Chemistry reprints
Publication	10042/195577	10.5281/zenodo.18758[cite]10.5281/zenodo.18758[/cite]
Thesis	10044/1/20860[cite]http://doi.org/10044/1/20860[/cite]	10.5281/zenodo.18777[cite]10.5281/zenodo.18777[/cite]
Dataset	10.14469/ch/191342[cite]10.14469/ch/191342[/cite]	10.5281/zenodo.18632[cite]10.5281/zenodo.18632[/cite]
Harvesting	OAI-ORE	OAI-PMH

The last line of this table includes a link to another design feature of a repository, facilitating the ability to harvest the content. The ContentMine project (“The right to read is the right to mine!”) has shown how such harvesting of facts from the literature can be automated on a vast scale, and (IMHO) represents an example of those disruptive innovations that have the power to change the world forever. It also enshrines the idea that scientific facts funded by the public purse should be capable of being openly liberated from their containers. A harvestable repository seems an ideal container for achieving this.
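
To give a flavour of what harvesting involves, here is a minimal sketch that builds an OAI-PMH request for the records in a repository. The `ListRecords` verb and the `metadataPrefix` and `from` arguments are part of the OAI-PMH protocol, and `oai_dc` (Dublin Core) is the format every compliant repository must support. The endpoint address used for Zenodo is an assumption and should be checked against the repository's own documentation; the function name `list_records_url` is hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords URL for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # Optional lower bound on the record datestamp (YYYY-MM-DD).
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


# Assumption: the address of Zenodo's OAI-PMH endpoint.
url = list_records_url("https://zenodo.org/oai2d", from_date="2015-06-01")
print(url)
```

Fetching that URL would return an XML document of metadata records, which a harvester then pages through using the resumption token the protocol provides.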

My experiment is part of what might be seen as the increasingly subtle interplay between:

- scientific authors, whose creative endeavour research is and without whom scientific publishers would not exist
- publishers who create a business model from the content freely given them by authors but also (especially if a commercial publisher) need to be accountable to their shareholders
- the funding councils, many of whom now wish the outcomes of the research they fund to be openly available to all
- the local libraries/administrators who have to adhere to/enforce all the rules contractually handed down to them by publishers whose direct customers they are, but who also need to serve their community of readers and authors
- researchers who would rather do research than fret about the above, and who would rather spend limited resources doing that research rather than diverting an increasing amount of their attention into the above system
- readers, who need unimpeded access to the research endeavours of others, but often have little influence on the policies and actions of all the other stakeholders, since they are NOT considered customers (of the publishers)
- etc. etc.

My experiment was in part designed to explore these rules, their interpretations and their boundaries. For the time being at least I seem to have found an arrangement that allows me to distribute versions-of-record of my own work, thanks to a generous and far-sighted learned society publisher. Watch this space!

Acknowledgments

This post has been cross-posted in PDF format at Authorea.

June 20, 2015

Discovering chemical concepts from crystal structure statistics: The Jahn-Teller effect
I am on a mission to persuade my colleagues that the statistical analysis of crystal structures is a useful teaching tool. One colleague asked for a demonstration and suggested exploring the classical Jahn-Teller effect (thanks Milo!). This is a geometrical distortion associated with certain molecular electronic configurations, of which the best example is octahedral copper complexes with a d⁹ electronic configuration. The e_g level shown below is occupied by three electrons, and the complex can therefore distort in one of two ways to remove the e_g degeneracy, placing the odd electron into either an x²-y² or a z² orbital. Here I explore how this effect can be teased out of crystal structures.

The search is set up with Cu specified as precisely 6-coordinate, and X=oxygen. The six X-Cu distances are defined as DIST1-DIST6. The R-factor is specified as < 0.05 (no disorder, no errors). The problem now is how to plot what is in effect a six-dimensional set of data, from which we are exploring whether four of the distances are different from the other two, and whether those four are the longer or the shorter. This requires analysis beyond the capability (as far as I know) of the ConQuest program, and so here I will show sets of plots showing just the relationship between any two distances at a time. Of the 15 possible combinations of two distances, only four are shown below.
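
The count of 15 is simply the number of ways of choosing two distances from six. A small Python sketch makes the pairs explicit and shows how the points for any one pair-wise plot could be pulled out of an exported table; the row layout (one dictionary per structure, keyed by DIST1-DIST6) is an assumption about how the search results might be exported, not a feature of any particular program.

```python
from itertools import combinations

# The six X-Cu distance labels used in the search.
DISTANCE_NAMES = [f"DIST{i}" for i in range(1, 7)]


def distance_pairs(names=DISTANCE_NAMES):
    """Return every unordered pair of distance labels (15 pairs for six)."""
    return list(combinations(names, 2))


def pair_points(rows, pair):
    """Return (x, y) points for one pair-wise scatter plot.

    `rows` is assumed to be a list of dicts, one per crystal structure,
    each mapping a distance label to a bond length in Angstrom.
    """
    x_name, y_name = pair
    return [(row[x_name], row[y_name]) for row in rows]


if __name__ == "__main__":
    for pair in distance_pairs():
        print(pair)
```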

Some obvious patterns can already be spotted in the 400 or so compounds which satisfy the search criteria.
- The largest clustering occurs at ~1.95Å, with two clusters each of fewer hits at ~2.5Å. The Wikipedia page notes that for Cu(OH₂)₆ the Jahn-Teller distortion favours four short bonds at ~1.95Å and two long ones at ~2.38Å, which agrees approximately with the positions and sizes of the centroids of these clusters.^†
- Plots 1 and 2 show very little along the diagonals, where the two plotted distances have the same value. This probably means that one of the distances relates to an equatorial ligand and the other to an axial ligand.
- Plots 3 and 4 show a strong diagonal trend, and so these distances both relate to either axial or equatorial, but not one of each.
- All four plots show a hot spot at ~1.95Å, which hints that the Jahn-Teller distortion is four short bonds/two long.
- Plot 4 also shows a green spot at ~2.5Å which is a tantalising suggestion of examples of four long bonds/two short.^‡
Clearly this analysis can be followed up by a visual inspection of individual molecules in each cluster (as well as the outliers which appear to follow no pattern!), together with a more bespoke analysis of the six distances. Unfortunately, the spin state of the complexes cannot be quickly checked (are they all doublets?) since the database does not record it. But the basic search described above takes only a few minutes to do, and it is surprising how quickly the Jahn-Teller effect can be statistically tested with real experimental data obtained for ~400 molecules. Of course, here I have only explored X=O but this can easily be extended to X=N or X=Cl, to other metals or to alternative coordination numbers such as e.g. 4 where the Jahn-Teller effect can also in principle operate.
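
One possible form for such a bespoke analysis is sketched below in Python. It is not what was done for the plots above. The idea is to sort the six distances for each structure and look for the largest gap between neighbouring values: a gap after the fourth distance suggests four short/two long bonds, and a gap after the second suggests two short/four long. The function names and the 0.15Å threshold are arbitrary assumptions chosen for illustration and would need tuning against real data.

```python
from collections import Counter


def classify_distortion(distances, min_gap=0.15):
    """Classify one 6-coordinate centre from its six bond lengths (Angstrom).

    `min_gap` is an arbitrary illustrative threshold: if the largest gap
    between consecutive sorted distances is smaller, the centre is treated
    as undistorted ("regular").
    """
    if len(distances) != 6:
        raise ValueError("expected exactly six distances")
    ordered = sorted(distances)
    gaps = [ordered[i + 1] - ordered[i] for i in range(5)]
    largest = max(gaps)
    if largest < min_gap:
        return "regular"
    n_short = gaps.index(largest) + 1  # bonds below the largest gap
    if n_short == 4:
        return "elongated"   # four short / two long
    if n_short == 2:
        return "compressed"  # two short / four long
    return "other"


def tally(structures, min_gap=0.15):
    """Count how many structures fall into each class."""
    return Counter(classify_distortion(d, min_gap) for d in structures)


if __name__ == "__main__":
    # Illustrative bond lengths only, not taken from the database.
    example = [1.97, 1.98, 1.99, 1.97, 2.37, 2.39]
    print(classify_distortion(example))
```

Sorting removes the dependence on which ligand happens to be labelled DIST1 or DIST6, which is what makes the pair-wise plots awkward to interpret.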

^‡ One genuine example of this type, also called compressed octahedral coordination, was reported for the species CuFAsF₆ and CsCuAlF₆[cite]10.1002/chem.200400397[/cite]

^† The measured geometry of Cu(H₂O)₆ may in fact manifest with six equal Cu-O bond lengths due to the dynamic Jahn-Teller effect, because the kinetic barrier separating one Jahn-Teller distorted form and another (equivalent) isomer is small and hence averaged atom positions are measured which mask the effect. Thus the Jahn-Teller effects shown in the plots above may be under-estimated because of this dynamic masking. Reducing the temperature of the sample at which data was collected would reduce this dynamic effect. Indeed, Cu(D₂O)₆ collected at 93K shows a very clear Jahn-Teller distortion[cite]10.1021/ja905399x[/cite] with four short bonds ranging from 1.97-1.99Å and two long bonds 2.37-2.39Å.[cite]10.5517/CCTBSPL[/cite] Another example measured at 89K with dimethyl formamide replacing water and coordinated via oxygen[cite]10.5517/CC14CL36[/cite] shows four short (1.97-1.98Å) and two long (2.315Å) bonds. This latter example is also noteworthy because this analysis is as yet unpublished in a journal, but the data itself has a DOI via which it can be acquired. A nice example of modern research data management!
May 30, 2015