Tag: Technology/Internet

  • Publishing embargoes.

    Publishing embargoes seem a relatively new phenomenon, probably starting in areas of science when the data produced for a scientific article was considered more valuable than the narrative of that article. However, the concept of the embargo seems to be spreading to cover other aspects of publishing, and I came across one recently which appears to take such embargoes into new and uncharted territory.

    One example (there are many others) of embargoes continuing to operate in the era of open science and open data relates to crystallographically derived coordinates for macromolecules. Biomolecular structures are allowed to be embargoed for a maximum of one year before becoming openly available or "released" (considered a friendlier term than embargo). A more recent phenomenon is the embargo on press releases, which may be prepared by authors and/or publishers to accompany the appearance of any article considered especially newsworthy. The publisher will then request that the press release be released only to coincide with the actual publication time and date of the article itself. Both of these types of embargo are more or less accepted by both parties. But in the last five years or so, new types of embargo have been introduced and it is these I want to discuss here.

    1. The self-archive or "green open access" version of an article, in the form of the last author version of an accepted manuscript prior to copy-editing and other operations by a publisher. Such Green OA versions are now a mandatory requirement from funders (in the UK), arising from the need to conduct a "REF" or research excellence framework assessment of all (UK) universities every seven years or so. In order to allow assessors and funding councils unencumbered access to these research outputs, the authors must self-archive their publications in a suitable institutional repository. In general therefore, there should always exist two versions of any scientific paper authored within these guidelines, the AV (author version) and VoR (Version of Record, held by the publisher, and carrying the guarantee of peer review). Publishers now embargo author versions until the VoR version has been published, and sometimes even up to 18 months beyond this period.
    2. The "supporting information" or SI embargo. This is closely related to the crystallographic data embargo noted above, but it applies in general to most other data and information associated with an article. Until very recently, most SI was in fact handled by the publisher themselves, and so it was released at the same time as the article. Since it is becoming more common to deposit data and SI in a separate repository, some publishers mandate that the release dates of this material must not precede the article itself. Deposition of such data has also become a mandatory requirement from (UK) funders since May 2015, and I have blogged about such "research data management" often here. In effect, both the scientific article and the data supporting it achieve their own DOIs or persistent digital identifiers, allowing easy and independent access to either the article OR its data. In fact, assigning such a DOI has a more subtle effect; creating a DOI means that metadata describing the object is also created and then aggregated by the agency issuing the DOI such as CrossRef and DataCite. Importantly, one should note that SI which is handled purely by the publisher will not have its own separate DOI and it will not have its own metadata. The data metadata for example can include the DOI for the article, and vice versa. I have shown examples of the utility of such metadata for data in an earlier post.
    3. So now we come to the most recent embargo, which has surfaced since around May 2015, as increasingly data has become a first class object in its own right with its own DOI and importantly its own metadata. There is now evidence that some publishers are requesting that this very metadata about data is also subjected to an embargo, not to be released before the article which makes use of that data is itself released. So data can be deposited in "dark form" prior to a publication, but the metadata (which carries the date stamp and provenance for the deposition) may have to be "dark" or embargoed. Actually, this is not yet very common; for example I asked the Royal Society of Chemistry what their policy was, with the reply "the Royal Society of Chemistry wouldn’t require metadata about the data files to be embargoed".
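
    To make the difference between the second and third types of embargo concrete, here is a minimal sketch of a data record that cross-references its article, together with a check of when the deposition date stamp becomes publicly visible. The DOIs, dates, field names and the function `metadata_visible` are all hypothetical and chosen only for illustration; the relation label is borrowed from the kind of vocabulary used in DOI metadata, and nothing here is any publisher's or registration agency's actual system.

```python
from datetime import date

# Hypothetical article and dataset records; DOIs and dates are invented.
article = {"doi": "10.1234/example.article", "published": date(2016, 3, 1)}
dataset = {
    "doi": "10.1234/example.data",
    "deposited": date(2015, 11, 20),
    # The data metadata points at the article (and the article could point back).
    "related": [{"relation": "IsSupplementTo", "doi": article["doi"]}],
}


def metadata_visible(dataset, article, today, embargo_metadata):
    """Return True if the dataset's metadata (and its date stamp) is public.

    Without a metadata embargo the record is visible from the deposition date.
    With one, it stays "dark" until the article itself is published.
    """
    release = article["published"] if embargo_metadata else dataset["deposited"]
    return today >= release


check_day = date(2015, 12, 1)  # after deposition, before the article appears
print(metadata_visible(dataset, article, check_day, embargo_metadata=False))
print(metadata_visible(dataset, article, check_day, embargo_metadata=True))
```

    In this toy model the only thing a metadata embargo changes is who can see the earlier date stamp, and when, which is exactly why it matters for claims of priority.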

    We live in an era where the very careers of researchers can be determined by their claim to priority about scientific discoveries. The date stamps for priority continue to be largely controlled and issued by publishers and some may decide that it will be in their business interests to extend their control to data. Perhaps they may even wish to control all aspects of publication including the data and its metadata, acting as self-proclaimed research facilitators.

    At this moment, this has not happened; both data and its metadata can remain open and FAIR. Which is where I think we should go in the future in the interests of open science itself.

  • Global initiatives in research data management and discovery: searching metadata.

    The upcoming ACS national meeting in San Diego has a CINF (chemical information division) session entitled "Global initiatives in research data management and discovery". I have highlighted here just one slide from my contribution to this session, which addresses the discovery aspect of the session.

    Data, if you think about it, is rarely discoverable other than by intimate association with a narrative or journal article. Even then, the standard procedure is to identify the article itself as being of interest, and then to dig out the "supporting information", which normally takes the form of a single paginated PDF document. If you are truly lucky, you might also get a CIF file (for crystal structures). But such data has little life of its own outside of its parent, the article. Put another way, it has no metadata it can call its own (metadata is data about an object, in this case research data). An alternative is to try to find the data by searching conventional databases such as CAS, Beilstein/Reaxys or CSD, and there of course the searches can be very precise. But (someone) has to pay the bills for such accessibility.

    We are now starting to see quite different solutions to finding data (the F in FAIR data, the other letters representing accessibility, interoperability and re-usability). These solutions depend on metadata being a part of the solution from the outset, rather than any afterthought produced as a commercial solution. The collection of metadata is part of the overall process called RDM, or research data management, perhaps even the most important part of it. In exchange for identifying metadata about one's data, one gets back a "receipt" in the form of a persistent identifier for the data, more commonly known as a DOI. The agency that issues the DOI also undertakes to look after the donated metadata, and to make it searchable. The table below shows eight searches of such metadata, one example of how to acquire statistics relating to the usage of the data and one search of how to find repositories containing the data.

    Search queries enabled by the use of metadata in data publication
    # | Search query* | Instances retrieved
    1 | http://search.datacite.org/ui?q=alternateIdentifier:InChIKey\:* | InChI key
    2 | http://search.datacite.org/ui?q=alternateIdentifier:InChI\:* | InChI identifier
    3 | http://search.datacite.org/ui?q=alternateIdentifier:InChIKey:CULPUXIDFLIQBT-UHFFFAOYSA-N | InChI key CULPUXIDFLIQBT-UHFFFAOYSA-N
    4 | http://search.datacite.org/ui?q=ORCID:0000-0002-8635-8390+alternateIdentifier:InChIKey\:* | ORCID 0000-0002-8635-8390 AND (boolean) InChI key
    5 | http://search.datacite.org/ui?q=ORCID:0000-0002-8635-8390+alternateIdentifier:InChI\:InChI=1S\/C9H11N5O3* | ORCID 0000-0002-8635-8390 AND (boolean) InChI string 1S/C9H11N5O3 with the * wildcard
    6 | http://search.datacite.org/ui?q=has_media:true&fq=prefix:10.14469 | Has content media for Publisher 10.14469 (Imperial College)
    7 | http://search.datacite.org/ui?q=format:chemical/x-* | Data format type chemical/x-*
    8 | http://search.datacite.org/api?&q=prefix:10.14469&fq=alternateIdentifier:InChIKey\:*&fl=doi,title,alternateIdentifier&wt=json&rows=15 or http://api.labs.datacite.org/works?q=prefix:10.14469+AND+alternateIdentifier:InChIKey\:* | First 15 hits in JSON format, batch query mode
    9 | http://stats.datacite.org/?fq=datacentre_facet:"BL.IMPERIAL – Imperial College London" | Resolution statistics for publisher 10.14469 (Imperial College) per month
    10 | http://service.re3data.org/search?query=&subjects[]=31 Chemistry | Research data repository search for Chemistry (135 hits)

    In this instance the three MIME media types are chemical/x-wavefunction, chemical/x-gaussian-checkpoint and chemical/x-gaussian-log. See[cite]10.1021/ci9803233[/cite] for chemical MIME (multipurpose internet media extensions).
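
    As a rough illustration of how the batch query in row 8 is put together, the sketch below assembles that kind of URL from its parts. It only builds strings and makes no network request. The endpoint and parameter names are copied from the table, and the service may have changed since; the helper names `escape_term`, `identifier_filter` and `batch_query_url` are my own inventions and not part of any DataCite library.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the table above; assumed, and possibly no longer live.
API = "http://search.datacite.org/api"


def escape_term(text):
    """Backslash-escape ':' and '/', as done in the queries of rows 1-5."""
    return "".join("\\" + ch if ch in ":/" else ch for ch in text)


def identifier_filter(scheme, value="*"):
    """Build an alternateIdentifier filter such as alternateIdentifier:InChIKey\\:*"""
    return f"alternateIdentifier:{scheme}\\:{value}"


def batch_query_url(prefix, scheme, rows=15):
    """Assemble a JSON batch query for one DOI prefix and one identifier scheme."""
    params = {
        "q": f"prefix:{prefix}",
        "fq": identifier_filter(scheme),
        "fl": "doi,title,alternateIdentifier",
        "wt": "json",
        "rows": rows,
    }
    # Keep ':', '*', ',' and the backslash readable rather than percent-encoded.
    return API + "?" + urlencode(params, safe=":*,\\")


url = batch_query_url("10.14469", "InChIKey")
print(url)

# The partial InChI string of row 5, with a trailing wildcard.
partial = identifier_filter("InChI", escape_term("InChI=1S/C9H11N5O3") + "*")
print(partial)
```

    Looping such a function over a list of prefixes or identifier schemes is the sort of batch use that could underpin more sophisticated search tools.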


    Anyone familiar with the standard ways of finding data (CAS, CSD, Reaxys) will appreciate that the above does not yet have the finesse to find e.g. sub-structures of chemical structures, synthetic procedures or molecular properties. My including it here is primarily to show some of the potential such systems have, and to remark particularly that the batch query capability of this infrastructure could indeed be used in the future to construct much more sophisticated systems. Oh, and to the end-user at least, the searches shown above do not require institutional licenses to use. Both the data and its metadata are free, mostly with a CC0 or CC BY 3.0 license for re-use (the R of FAIR).

    If more of interest related to this topic emerges at the ACS session, I will report back here.

  • LEARN Workshop: Embedding Research Data as part of the research cycle

    I attended the first (of a proposed five) workshops organised by LEARN (an EU-funded project that aims to ...Raise awareness in research data management (RDM) issues & research policy) on Friday. Here I give some quick bullet points relating to things that caught my attention and/or interest. The program (and Twitter feed) can be found at https://learnrdm.wordpress.com where others' comments can also be seen.

    • Henry Oldenburg, founder member and first secretary of the Royal Society, was the first Open Scientist.
    • About 100 people attended the workshop. Of these ~3-5 identified themselves as researchers creating data, and the rest comprised research data managers, administrators, librarians, publishers (but see below) etc. Many were new to their posts.
    • Not publishing scientific data should become recognised as scientific malpractice.
    • Central libraries should pro-actively disperse their knowledge to data scientists in departments.
    • If a scientist is concerned that openly publishing their data might give advantage to their competitors, they are urged to counteract this by "being cleverer than the others". 
    • The three great bastions of open science are (a) Open Data, (b) Open access articles and (c) doing science openly. Examples of this third category include open notebook science (ONS), a form notably pioneered by Jean-Claude Bradley. One attribute of ONS was noted as no insider knowledge.
    • Learned societies should endow medals for Open Science.
    • (Some) publishers are reinventing themselves as Research Facilitators.

    The plenaries are all well worth dipping into (certainly the video and in some cases all the slides are scheduled to appear).

    If you are a researcher (undergraduate students, PGs, PDRAs, early career researchers and academics) you should immediately track down your local evangelist/expert in RDM and ask what the local infrastructures are (or will shortly be built).

  • A visualization of the anomeric effect from crystal structures.

    The anomeric effect is best known in sugars, occurring in sub-structures such as RO-C-OR. Its origins relate to how the lone pairs on each oxygen atom align with the adjacent C-O bonds. When the alignment is 180°, one oxygen lone pair can donate into the C-O σ* empty orbital and a stabilisation occurs. Here I explore whether crystal structures reflect this effect.

    Scheme

    The torsion angles along each O-C bond are specified, along with the two C-O distances. All the bonds are declared acyclic, and the usual R < 5%, no disorder and no errors specified.

    1. You can see from the plot below that the hotspot occurs when both RO-CO torsions are ~65°. From this we will assume that the two (unseen) lone pairs are distributed approximately tetrahedrally around each oxygen; if so, one of them must be oriented at ~180° with respect to the same RO-CO bond (and the other at about -60°). This places it antiperiplanar to the adjacent C-O bond, where it can interact with that bond's empty σ* orbital. So the hotspot corresponds to structures where BOTH oxygen atoms have a lone pair interacting with the adjacent C-O σ* antibonding orbital.
    2. There is a tiny cluster for which both RO-CO torsions are ~180° and hence neither oxygen has an antiperiplanar lone pair.
    3. Only slightly larger are clusters where one torsion is ~65° and the other ~180°, meaning that only one oxygen has an antiperiplanar lone pair.
    4. A plot of the two C-O lengths indeed shows an overall hotspot at ~1.40Å for both distances. If the search is filtered to include only torsions in the range 150-180°, the hotspot value increases to 1.415Å for both. If one torsion is restricted to 40-80° and the other to 150-180° the hotspot shows one C-O bond is about 0.012Å shorter than the other.
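
    The geometric argument in the list above can be sketched in a few lines. The assumption, which is mine and deliberately crude, is that the two lone pairs sit at exactly ±120° from the R substituent when viewed along the O-C bond (idealised tetrahedral oxygen). The function names and the 30° tolerance for "antiperiplanar" are also arbitrary choices for illustration, not search settings used for the plots.

```python
def wrap(angle):
    """Wrap an angle in degrees into the interval (-180, 180]."""
    wrapped = (angle + 180.0) % 360.0 - 180.0
    return 180.0 if wrapped == -180.0 else wrapped


def lone_pair_torsions(ro_co_torsion):
    """Idealised lone-pair torsions: offset by +120 and -120 degrees from R."""
    return (wrap(ro_co_torsion + 120.0), wrap(ro_co_torsion - 120.0))


def has_antiperiplanar_lone_pair(ro_co_torsion, tolerance=30.0):
    """True if either idealised lone pair lies within tolerance of 180 degrees."""
    return any(abs(abs(lp) - 180.0) <= tolerance
               for lp in lone_pair_torsions(ro_co_torsion))


def classify(torsion_1, torsion_2):
    """Count how many of the two oxygens can donate into the adjacent C-O sigma*."""
    donors = sum(has_antiperiplanar_lone_pair(t) for t in (torsion_1, torsion_2))
    return {2: "both oxygens donate",
            1: "one oxygen donates",
            0: "neither oxygen donates"}[donors]


print(lone_pair_torsions(65.0))   # idealised lone pairs for the hotspot torsion
print(classify(65.0, 65.0))       # main hotspot
print(classify(65.0, 180.0))      # the smaller mixed clusters
print(classify(180.0, 180.0))     # the tiny cluster
```

    With a 65° torsion the idealised lone pairs come out at -175° and -55°, close to the ~180° and -60° quoted in point 1, while a 180° torsion leaves both lone pairs near ±60° and hence no antiperiplanar donor.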

    Scheme

    Scheme

    I also include a further constraint, that the diffraction data must be collected below 140K. The hotspot moves to ~ 55/60° indicating values free of some vibrational noise.

    Scheme

    Interestingly, replacing oxygen with nitrogen reveals relatively few examples of the effect (C(NR2)4 is an exception). Replacing O by divalent S produces only 13 hits, with the surprising result (below) that in all of them only one S sets up an anomeric interaction. Arguably, the number of examples is too low to draw any firm conclusions from this observation.

    Scheme


    Most diffractometers measure low angle scattering of X-rays by high density electrons. These are the core electrons associated with a nucleus rather than the valence electrons associated with lone pairs. Hence very few positions of valence lone pairs have ever been crystallographically measured.


    Acknowledgments

    This post has been cross-posted in PDF format at Authorea.

  • Single Figure (nano)publications, reddit AMAs and other new approaches to research reporting

    I recently received two emails each with a subject line new approaches to research reporting. The traditional 350 year-old model of the (scientific) journal is undergoing upheavals at the moment with the introduction of APCs (article processing charges), a refereeing crisis and much more. Some argue that brand new thinking is now required. Here are two such innovations (and I leave you to judge whether that last word should have an appended ?).

    To set the scene for the first, I will quote the abstract: “The single figure publication is a novel, efficient format by which to communicate scholarly advances. It will serve as a forerunner of the nano-publication, a modular unit of information critical for machine-driven data aggregation and knowledge integration.”[cite]10.12688/f1000research.6742.1[/cite] The kernel of this suggestion is (again I quote) “We offer the idea of the micro-publication unit, the single figure publication (SFP), to provide scholars with a real-world, manageable method to inform research.” I was struck by the overlap between this suggestion and the one you may find on many of the posts on this blog, where what I refer to as FAIR Data is assigned a digital object identifier (DOI) and included in the citation lists at the end of the post. The key phrase in the above abstract is machine-driven data aggregation and knowledge, although the article does not really go into any mechanisms for easily achieving this. It is my argument that the act of assigning a DOI carries with it the association that there is machine searchable metadata which can be retrieved and used for the aggregation and knowledge mining. The authors of this article, Do and Mobley, advocate adoption of nanopublications defined by inclusion of just a single figure (notably, not a table of results!) and some accompanying context which they claim would reduce the unit of publication to a more tractable size. This does raise the question of whether science needs more publications (in chemistry alone there are said to be more than a million published each year) or whether we should instead be concentrating our efforts on improving the data side of things by increasing its semantic content and formalising its structures, its preservation and curation. I certainly argue that far too little effort has been poured into these latter activities.

    You only have to look at the typical SI (supporting information) associated with many chemistry articles to realise that in many cases they are still hardly fit for purpose. There is one concept introduced by Do and Mobley that also deserves mention. Their nanopublications are structured to be read by machines, not people. They will therefore not be refereed by people (my inference). They do not really discuss how else the quality will be assessed, but of course if you treat their nanopublication as essentially FAIR data, then it does become possible to develop methods of machine refereeing.

    The second email alerted me to an article[cite]10.15200/winn.143871.12809[/cite] in the Winnower, a forum that offers a bridge between “traditional scholarly publishing tools to traditional and non-traditional scholarly outputs—because scholarly communication doesn’t just happen in scholarly journals”. Here, the concept of scholarly communication is extended to the New Reddit Journal of Science and introduces the concept pioneered by reddit of the AMA, or “ask me anything” environment. I occasionally publish some of the posts on this blog to the Winnower, receiving in return the increasingly ubiquitous DOI. I have also occasionally quoted these DOIs in articles submitted to conventional chemistry journals. What we see now is the propagation of a Winnower DOI on to e.g. https://www.reddit.com/r/science/ where anyone can post a question related to the original research reporting. I must state that I do have some reservations about this. Whilst the majority of traditional scholarly reporting is likely to receive no AMAs (just as a very high proportion of research articles attract few if any citations in other articles over a period of decades), it is also likely that the quality of posted AMAs may turn out to be very low. At which point the original researcher has to make a judgement as to whether to devote any of their increasingly precious and fragmented time to answering them. And if few if any answers are posted in response to an AMA, the system seems unlikely to flourish.

    But what we see here are two serious attempts to develop new approaches to research reporting, and no doubt others will emerge. To quote Yogi Berra, the future is not what it used to be.


    Anyone can also post to this blog to ask similar questions. But note that associating an ORCID with such comments is highly recommended. I do not think that reddit currently supports ORCID, but I would argue that if the intent is serious, it certainly should.

  • The 2015 Bradley-Mason prize for open chemistry.

    Open principles in the sciences in general and chemistry in particular are increasingly nowadays preached from funding councils down, but it can be more of a challenge to find innovative practitioners. Part of the problem perhaps is that many of the current reward systems for scientists do not always help promote openness. Jean-Claude Bradley was a young scientist who was passionately committed to practising open chemistry, even though when he started he could not have anticipated any honours for doing so. A year ago a one day meeting at Cambridge was held to celebrate his achievements, followed up with a special issue of the Journal of Cheminformatics. Peter Murray-Rust and I both contributed and following the meeting we decided to help promote Open Chemistry via an annual award to be called the Bradley-Mason prize. This would celebrate both “JC” himself and Nick Mason, who also made outstanding contributions to the cause whilst studying at Imperial College. The prize was initially to be given to an undergraduate student at Imperial, but was also extended to postgraduate students who have promoted and showcased open chemistry in their PhD researches.

    Peter and I are delighted to announce the inaugural winners of this prize.

    The postgraduate winner is Tom Phillips for his open blog describing his experiences as a PhD student and for leading by example. He has published his instrumental codes on Github (and now Zenodo[cite]10.5281/zenodo.19033[/cite]) and data and codes for reproducing the graphs in his work on the “lab on a chip” in Figshare[cite]10.6084/m9.figshare.1447208[/cite] and through his blog has encouraged other research students to do the same. Tom has worked assiduously to ensure that all the articles describing his PhD work are or will be open access.[cite]10.1039/C5LC00430F[/cite]

    The undergraduate winner is Tom Arrow for his “spare time” involvement with WikiMedia (the foundation that underpins the open Wikipedia), including participating in a Wikimedia EU hackathon in Lyon, France, and feeding his experiences and skills back into his undergraduate environment as well as enhancing the teaching Wiki used by his fellow students. Tom took the lead in introducing us to Wikidata[cite]10.1145/2629489[/cite] for storing chemical data in an open Wikibase data repository and in promoting its use for enriching Wikipedia chemistry pages and showcasing open data in undergraduate teaching environments.

  • Discovering chemical concepts from crystal structure statistics: The Jahn-Teller effect

    I am on a mission to persuade my colleagues that the statistical analysis of crystal structures is a useful teaching tool. One colleague asked for a demonstration and suggested exploring the classical Jahn-Teller effect (thanks Milo!). This is a geometrical distortion associated with certain molecular electronic configurations, of which the best example is illustrated by octahedral copper complexes which have a d⁹ electronic configuration. The e_g level shown below is occupied by three electrons, and the complex can therefore distort in one of two ways to remove the e_g degeneracy, placing the odd electron into either an x²-y² or a z² orbital. Here I explore how this effect can be teased out of crystal structures.

    JT

    The search is set up with Cu specified as precisely 6-coordinate, and X=oxygen. The six X-Cu distances are defined as DIST1-DIST6. The R-factor is specified as < 0.05 (no disorder, no errors). The problem now is how to plot what is in effect a six-dimensional set of data, from which we are exploring whether four of the distances are different from the other two, and whether those four are the longer or the shorter. This requires analysis beyond the capability (as far as I know) of the ConQuest program, and so here I will show sets of plots showing just the relationship between any two distances at a time. Of the 15 possible combinations of two distances, only four are shown below.

    Some obvious patterns can already be spotted in the 400 or so compounds which satisfy the search criteria.

    • The largest clustering occurs at ~1.95Å, with two clusters each of fewer hits at ~2.5Å. The Wikipedia page notes that for Cu(OH2)6 the Jahn-Teller distortion favours four short bonds at ~1.95Å and two long ones at ~2.38Å, which agrees approximately with the positions and sizes of the centroids of these clusters.
    • Plots 1 and 2 show very little along the diagonals, where the two plotted distances have the same value. This probably means that one of the distances relates to an equatorial ligand and the other to an axial ligand.
    • Plots 3 and 4 show a strong diagonal trend, and so these distances both relate to either axial or equatorial, but not one of each.
    • All four plots show a hot spot at ~1.95Å, which hints that the Jahn-Teller distortion is four short bonds/two long.
    • Plot 4 also shows a green spot at ~2.5Å which is a tantalising suggestion of examples of four long bonds/two short.
    1. CuO-12
    2. CuO-34
    3. CuO-56
    4. CuO-13

    Clearly this analysis can be followed up by a visual inspection of individual molecules in each cluster (as well as the outliers which appear to follow no pattern!), together with a more bespoke analysis of the six distances. Unfortunately, the spin state of the complexes cannot be quickly checked (are they all doublets?) since the database does not record these. But the basic search described above takes only a few minutes to do, and it is surprising how quickly the Jahn-Teller effect can be statistically tested with real experimental data obtained for ~400 molecules. Of course, here I have only explored X=O but this can easily be extended to X=N or X=Cl, to other metals or to alternative coordination numbers such as 4, where the Jahn-Teller effect can also in principle operate.
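
    One possible form for such a bespoke analysis is sketched here, on the assumption that the six Cu-O distances for each hit have been exported as plain numbers. Sorting the six values and looking for the largest gap separates "four short/two long" from "two short/four long". The function names and the 0.1Å threshold are my own arbitrary choices; the first test geometry uses the approximate Cu(OH2)6 values quoted in the bullets above, and the second is an invented compressed example.

```python
from itertools import combinations


def distance_pairs(n_distances=6):
    """All pairs of distance labels, i.e. the possible two-distance plots."""
    return list(combinations(range(1, n_distances + 1), 2))


def classify_distortion(distances, threshold=0.1):
    """Classify six metal-ligand distances (in Angstrom) by their sorted gaps.

    threshold is an assumed minimum gap below which the six bonds are
    treated as equal; it is not a value taken from the search itself.
    """
    d = sorted(distances)
    if len(d) != 6:
        raise ValueError("expected exactly six distances")
    gap_elongated = d[4] - d[3]    # gap between the 4th and 5th shortest bonds
    gap_compressed = d[2] - d[1]   # gap between the 2nd and 3rd shortest bonds
    if max(gap_elongated, gap_compressed) < threshold:
        return "no clear distortion"
    if gap_elongated >= gap_compressed:
        return "elongated (four short, two long)"
    return "compressed (two short, four long)"


print(len(distance_pairs()))  # number of two-distance plots
print(classify_distortion([1.95, 1.95, 1.95, 1.95, 2.38, 2.38]))
print(classify_distortion([1.95, 1.95, 2.50, 2.50, 2.50, 2.50]))  # invented values
```

    Applied to every hit, a classification of this kind would give a direct count of elongated versus compressed geometries without having to read it off pairwise plots.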


    One genuine example of this type, also called compressed octahedral coordination, was reported for the species CuFAsF6 and CsCuAlF6.[cite]10.1002/chem.200400397[/cite]


    The measured geometry of Cu(H2O)6 may in fact manifest with six equal Cu-O bond lengths due to the dynamic Jahn-Teller effect, because the kinetic barrier separating one Jahn-Teller distorted form and another (equivalent) isomer is small and hence averaged atom positions are measured which mask the effect. Thus the Jahn-Teller effects shown in the plots above may be under-estimated because of this dynamic masking. Reducing the temperature of the sample at which data was collected would reduce this dynamic effect. Indeed, Cu(D2O)6 collected at 93K shows a very clear Jahn-Teller distortion[cite]10.1021/ja905399x[/cite] with four short bonds ranging from 1.97-1.99Å and two long bonds 2.37-2.39Å.[cite]10.5517/CCTBSPL[/cite] Another example measured at 89K with dimethyl formamide replacing water and coordinated via oxygen[cite]10.5517/CC14CL36[/cite] shows four short (1.97-1.98Å) and two long (2.315Å) bonds. This latter example is also noteworthy because this analysis is as yet unpublished in a journal, but the data itself has a DOI via which it can be acquired. A nice example of modern research data management!