
Data Discovery: A pick-n-mix library of useful FAIR Data searches – and a call for new search suggestions.
With AI and Machine learning needing data in abundance, interest in data discovery is intense. However, this type of discovery is somewhat different from more traditional database searches, in that it is designed to be carried out by machines as well as by humans. The discovery searches are conducted using an aggregated and federated metadata store, such as that curated by DataCite. How to construct a suitable search is however still not entirely human-friendly. The start point for understanding how to search is this resource: XML to JSON mappings and the XML referred to can be found here. [cite]10.14454/g8e5-6293[/cite] Since the learning curve to construct such data searches can be quite steep, I thought I would share as a library some recent searches I constructed for a talk I am giving. This post is essentially an extension and update of an earlier challenge I was set along these lines and which appeared here.[cite]10.1255/sew.2022.a10[/cite]

You can see that the searches come as components linked by Boolean operators, separated by strings such as +AND+, +OR+ or +NOT+. Essentially like a Lego constructor set, you can create your own searches by combining these components to suit your own needs. No doubt some AI-based procedure will come along that will convert natural language expressions of the intended search into the JSON-friendly strings you see below – at least that is the hope.
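To make the Lego analogy concrete, here is a minimal Python sketch of how such components might be snapped together. The helper names (`term`, `any_of`, `by_orcid`) are my own inventions for illustration and are not part of any DataCite tooling; the field names and the ORCID fragment are taken from search 6 in Part 1.

```python
# Hypothetical helpers for assembling a DataCite query string from components.
COMMONS = "https://commons.datacite.org/doi.org?query="

def term(field, value):
    """One field:value component, e.g. types.resourceTypeGeneral:Dataset."""
    return f"{field}:{value}"

def any_of(*parts):
    """Join components with +OR+ and wrap them in brackets."""
    return "(" + "+OR+".join(parts) + ")"

def by_orcid(orcid_tail):
    """Match a researcher whether listed as a contributor or as a creator."""
    return any_of(
        term("contributors.nameIdentifiers.nameIdentifier", "*" + orcid_tail),
        term("creators.nameIdentifiers.nameIdentifier", "*" + orcid_tail),
    )

# Datasets only, for one researcher (the same string as search 6 in Part 1).
query = "+AND+".join([
    term("types.resourceTypeGeneral", "Dataset"),
    by_orcid("0000-0002-7816-0042"),
])
url = COMMONS + query
print(url)
```

Swapping `by_orcid` for a different component, or appending further components to the list, gives the other searches in the same way.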

Part 1: Data discovery based on general properties such as the reporting Institution, the publisher or the Researcher
1. Find all Data-related Works associated with Cambridge University and the American Chemical Society Publisher
  - https://commons.datacite.org/doi.org?query=((contributors.affiliation.affiliationIdentifier:*013meh722)+AND+(contributors.affiliation.affiliationIdentifierScheme:ROR))+OR+((creators.affiliation.affiliationIdentifier:*013meh722)+AND+(creators.affiliation.affiliationIdentifierScheme:ROR))+AND+relatedIdentifiers.relatedIdentifier:10.1021*
    232 Works
2. Find all Data-related Works associated with Imperial College and the American Chemical Society Publisher
  - ?query=((contributors.affiliation.affiliationIdentifier:*041kmwe10)+AND+(contributors.affiliation.affiliationIdentifierScheme:ROR))+OR+((creators.affiliation.affiliationIdentifier:*041kmwe10)+AND+(creators.affiliation.affiliationIdentifierScheme:ROR))+AND+relatedIdentifiers.relatedIdentifier:10.1021*
    304 Works
3. Find all Datasets OR Collections associated with Imperial College and the American Chemical Society Publisher and the term
  Pyrazol in the Title or Description
  - ?query=((contributors.affiliation.affiliationIdentifier:*041kmwe10)+AND+(contributors.affiliation.affiliationIdentifierScheme:ROR))+OR+((creators.affiliation.affiliationIdentifier:*041kmwe10)+AND+(creators.affiliation.affiliationIdentifierScheme:ROR))+AND+relatedIdentifiers.relatedIdentifier:10.1021*+AND+(titles.title:*pyrazol*+OR+descriptions.description:*pyrazol*)+AND+((types.resourceTypeGeneral:Dataset)+OR+(types.resourceTypeGeneral:Collection))
    3 Works
4. Find all Datasets OR Collections associated with Imperial College and the American Chemical Society Publisher and the term
  Pyrazol in the Title or Description and a specified Researcher
  - ?query=((contributors.affiliation.affiliationIdentifier:*041kmwe10)+AND+(contributors.affiliation.affiliationIdentifierScheme:ROR))+OR+((creators.affiliation.affiliationIdentifier:*041kmwe10)+AND+(creators.affiliation.affiliationIdentifierScheme:ROR))+AND+relatedIdentifiers.relatedIdentifier:10.1021*+AND+(titles.title:*pyrazol*+OR+descriptions.description:*pyrazol*)+AND+((types.resourceTypeGeneral:Dataset)+OR+(types.resourceTypeGeneral:Collection))+AND+((contributors.nameIdentifiers.nameIdentifier:*000-0002-3296-6817)+OR+(creators.nameIdentifiers.nameIdentifier:*000-0002-3296-6817))
    1 Work
5. Find Datasets only associated with Imperial College and the term Pyrazol in the Title or Description
  - ?query=((contributors.affiliation.affiliationIdentifier:*041kmwe10)+AND+(contributors.affiliation.affiliationIdentifierScheme:ROR))+OR+((creators.affiliation.affiliationIdentifier:*041kmwe10)+AND+(creators.affiliation.affiliationIdentifierScheme:ROR))+AND+(titles.title:*pyrazol*+OR+descriptions.description:*pyrazol*)+AND+types.resourceTypeGeneral:Dataset
    270 Works
6. Find just Datasets associated with a specific researcher
  - ?query=types.resourceTypeGeneral:Dataset+AND+(contributors.nameIdentifiers.nameIdentifier:*0000-0002-7816-0042+OR+creators.nameIdentifiers.nameIdentifier:*0000-0002-7816-0042)
    8 Works
7. Find Data-related Works associated with Cambridge University, the SubjectScheme FOS (Field of Science) and the Subject term *Chemical*
  - ?query=(subjects.subjectScheme:*FOS*)+AND+(subjects.subject:*Chemical*)+AND+((creators.affiliation.affiliationIdentifier:*013meh722)+OR+(contributors.affiliation.affiliationIdentifier:*013meh722))
    440 Works
8. Establish if a specified publication with a specified author has an associated FAIR Dataset or FAIR Collection:
  - ?query=(types.resourceTypeGeneral:Dataset+OR+types.resourceTypeGeneral:Collection)+AND+(contributors.nameIdentifiers.nameIdentifier:*0000-0002-8635-8390+OR+creators.nameIdentifiers.nameIdentifier:*0000-0002-8635-8390)+AND+(relatedIdentifiers.relatedIdentifierType:DOI+AND+relatedIdentifiers.resourceTypeGeneral:JournalArticle+AND+relatedIdentifiers.relatedIdentifier:10.1021/acs.inorgchem.3c01506)
    
    1 Work
9. Establish how many journal publications by a specified author have an associated FAIR Dataset or FAIR Collection:
  - ?query=(types.resourceTypeGeneral:Dataset+OR+types.resourceTypeGeneral:Collection)+AND+(contributors.nameIdentifiers.nameIdentifier:*0000-0002-8635-8390+OR+creators.nameIdentifiers.nameIdentifier:*0000-0002-8635-8390)+AND+(relatedIdentifiers.relatedIdentifierType:DOI+AND+relatedIdentifiers.resourceTypeGeneral:JournalArticle+AND+relatedIdentifiers.relatedIdentifier:*)
    
    1 Work
Part 2: Data discovery based on chemical properties such as NMR, IR or X-ray spectroscopy
1. Find all Datasets associated with Chemical structure representation and NMR Media types,
  NMR as a Subject and the title or description term
  “Pyrazol”
  - ?query=(media.media_type:chemical/x-cdxml+OR+media.media_type:chemical/x-mdl-molfile)+AND+(media.media_type:application/zip+OR+media.media_type:chemical/x-mnova)+AND+(subjects.subjectScheme:*NMR*)+AND+(titles.title:*pyrazol*+OR+descriptions.description:*pyrazol*)
    150 datasets
2. Find all Datasets associated with Chemical structure representation and NMR Media types,
  NMR Nuclei as a Subject, for 13C and the title or description term
  “Pyrazol”
  - ?query=(media.media_type:application/zip+OR+media.media_type:chemical/x-mnova)+AND+(subjects.subjectScheme:*NMR_Nucleus)+AND+(subjects.subject:13C)+AND+(titles.title:*pyrazol*+OR+descriptions.description:*pyrazol*)
    41 datasets
3. Find all Datasets associated with Chemical structure representation and NMR Media types,
  NMR as a Subject, for HMBC Experiments and the title or description term
  “Pyrazol”
  - ?query=(media.media_type:application/zip+OR+media.media_type:chemical/x-mnova)+AND+(subjects.subjectScheme:*NMR_Expt)+AND+(subjects.subject:HMBC)+AND+(titles.title:*pyrazol*+OR+descriptions.description:*pyrazol*)
    26 datasets
4. Find all Datasets associated with Chemical structure representation and NMR Media types,
  NMR as a Subject, using solvent “CD₃OD” and the title or description term
  “Pyrazol”
  - ?query=(media.media_type:application/zip+OR+media.media_type:chemical/x-mnova)+AND+(subjects.subjectScheme:*NMR_Solvent)+AND+(subjects.subject:*CD3OD)+AND+(titles.title:*pyrazol*+OR+descriptions.description:*pyrazol*)
    22 datasets
5. Find all Datasets associated with NMR Media types,
  NMR as a Subject and InChIKey : OZEYXLXJQKVGCZ-UHFFFAOYSA-L
  - ?query=(media.media_type:application/zip+OR+media.media_type:chemical/x-mnova)+AND+(subjects.subjectScheme:*NMR*)+AND+((subjects.subjectScheme:inchikey)+AND+(subjects.subject:OZEYXLXJQKVGCZ-UHFFFAOYSA-L))
    5 datasets
6. Find all Datasets associated with NMR Media types,
  NMR as a Subject and the molecular formula component of the full InChI : InChI=1S/2C18H16N2O3.2C2H6O.Ca/c2*1-23-15-9-7-13 etc
  - ?query=(media.media_type:application/zip+OR+media.media_type:chemical/x-mnova)+AND+(subjects.subjectScheme:*NMR*)+AND+((subjects.subjectScheme:inchikey)+AND+(subjects.subject:InChI=1S/2C18H16N2O3.2C2H6O.Ca*))
    5 datasets
7. Find all Datasets associated with Chemical structure representation Media types,
  IR as a Subject and the title or description term
  “Pyrazol”
  - ?query=media.media_type:chemical/x-cdxml+AND+(subjects.subjectScheme:*IFD.IR*)+AND+(titles.title:*pyrazol*+OR+descriptions.description:*pyrazol*)
    36 datasets
8. Find all Datasets associated with a Chemical structure representation and Crystal structure
  Media types, XRAY as a Subject and the
  title or description term “Pyrazol”
  - ?query=media.media_type:chemical/x-cif+AND+(subjects.subjectScheme:*IFD.XRAY*)+AND+(titles.title:*pyrazol*+OR+descriptions.description:*pyrazol*)
    38 datasets
Part 3: Data discovery based on chemical properties such as Computational modelling
1. Find all Datasets associated with Chemical structure representation and Computation Media
  types, COMP as a Subject and the title
  or description term “Pyrazol”
  - ?query=(media.media_type:chemical/x-gaussian-log+OR+media.media_type:chemical/x-gaussian-checkpoint)+AND+(subjects.subjectScheme:*IFD.Comp*)+AND+(titles.title:*pyrazol*+OR+descriptions.description:*pyrazol*)
    4 datasets
2. Find all Datasets associated with Computation Media types and the subject KIE for Hydrogen isotopes.
  - Visual search:
    ?query=(media.media_type:chemical/x-gaussian-log+OR+media.media_type:chemical/x-gaussian-checkpoint)+AND+media.media_type:text/plain+AND+(titles.title:*Endo*+OR+descriptions.description:*Endo*+OR+titles.title:*Exo*+OR+descriptions.description:*Exo*)+AND+(subjects.subjectScheme:*KIE*)+AND+subjects.subject:1H/2H
    17 datasets
  - API Search:
    https://api.datacite.org/dois/?query=(media.media_type:chemical/x-gaussian-log+OR+media.media_type:chemical/x-gaussian-checkpoint)+AND+media.media_type:text/plain+AND+(titles.title:*Endo*+OR+descriptions.description:*Endo*+OR+titles.title:*Exo*+OR+descriptions.description:*Exo*)+AND+(subjects.subjectScheme:*KIE*)+AND+subjects.subject:1H/2H
  - Command line search:
    curl "https://api.datacite.org/dois/?query=(media.media_type:chemical/x-gaussian-log+OR+media.media_type:chemical/x-gaussian-checkpoint)+AND+media.media_type:text/plain+AND+(titles.title:*Endo*+OR+descriptions.description:*Endo*+OR+titles.title:*Exo*+OR+descriptions.description:*Exo*)+AND+(subjects.subjectScheme:*KIE*)+AND+subjects.subject:1H/2H"
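The API search can also be driven from a script. The following is a sketch only, using the Python standard library; the function names are hypothetical, and it assumes the JSON response reports the number of hits under `meta.total` and that `page[size]` can be used to keep the response small.

```python
import json
import urllib.request

API = "https://api.datacite.org/dois/"

def api_url(query, page_size=1):
    """Build an API URL from one of the '+'-joined query strings listed above."""
    return f"{API}?query={query}&page[size]={page_size}"

def total_hits(payload):
    """Hit count from a parsed response (assumption: it lives under meta.total)."""
    return payload["meta"]["total"]

def count_works(query):
    """Run the search against the live API and return the number of matching works."""
    with urllib.request.urlopen(api_url(query)) as response:
        return total_hits(json.load(response))

# Usage (needs network access), with the query from search 6 in Part 1:
# count_works("types.resourceTypeGeneral:Dataset+AND+"
#             "(contributors.nameIdentifiers.nameIdentifier:*0000-0002-7816-0042"
#             "+OR+creators.nameIdentifiers.nameIdentifier:*0000-0002-7816-0042)")
```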
One feature of this approach is that the results of the searches, which run across a globally aggregated metadata store, can change with time. So repeating some of the searches at defined time intervals can also give a dynamic indication of how a particular area of data is growing. Other searches are of course designed to give a single hit which probably will not change with time.
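As a rough illustration of tracking such growth, the sketch below turns a series of dated hit counts into the number of works added per interval. The function name is hypothetical and the counts are placeholder values invented purely to show the arithmetic; they are not results from any real search.

```python
from datetime import date

def growth(snapshots):
    """Given (date, hit_count) pairs sorted by date, return
    (start, end, works_added, works_added_per_day) for each interval."""
    intervals = []
    for (d0, n0), (d1, n1) in zip(snapshots, snapshots[1:]):
        days = (d1 - d0).days
        intervals.append((d0, d1, n1 - n0, (n1 - n0) / days))
    return intervals

# Placeholder counts only, NOT real search results.
snapshots = [
    (date(2024, 1, 1), 100),
    (date(2024, 1, 31), 130),
    (date(2024, 3, 1), 190),
]
for start, end, added, per_day in growth(snapshots):
    print(f"{start} to {end}: {added} new works ({per_day:.2f} per day)")
```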

The above is based on an interpretation and implementation of the DataCite Schema, one which will eventually need to be agreed by the communities and sub-communities that might wish to use them. So beware, there may be other implementations covering similar data that would not eg be found by the above searches, particularly in the way the subject terms above are used. They are therefore included here purely to raise awareness of the potential that such an approach has – along with my observation that I had never attended any presentation where they have been discussed or shown. In the future, it seems likely that these JSON-based searches will themselves get automated and generated by software rather than by a human as here. When that comes, searching will never be the same again!

I also welcome suggestions for new search queries. These might either be accommodated using the existing metadata, or might require new additions to the metadata record. Please send them here as comments.
November 25, 2024
Mechanism of the Masamune-Bergman reaction. Part 4. Why was the DFT energy barrier too high for the Calicheamicin reaction?

Michael in a comment here on the mechanism of the Masamune-Bergman reaction notes that when it occurs as part of the Calicheamicin (an antibody-drug conjugate or ADC) version of this mechanism, a pre-step is first necessary. As discussed in this review article,[cite]10.3390/ph14050442[/cite] the trisulfide linkage is reduced and the resulting thiolate undergoes a facile 1,4-addition to the adjacent enone.

DFT calculations on the new form (FAIR Data DOI: [cite]10.14469/hpc/14632[/cite]) show that the free energy barrier is reduced from 38.6 kcal/mol to 26.2 kcal/mol.

This is now a reasonable value for a thermal reaction, being a 12.4 kcal/mol reduction from the unactivated species. We can conclude that Michael’s suggestion was spot on, which in turn suggests that a DFT-biradicaloid calculation is in fact a reasonable procedure for modelling this type of system.
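To get a feel for what these two barriers mean in terms of rates, one can put them through the Eyring equation of transition state theory, k = (k_B T/h) exp(-ΔG^‡/RT). This is only a rough guide: the sketch assumes a transmission coefficient of 1, treats the computed free energies as exact, and uses 310K, the temperature quoted for the cyclisation in Part 1.

```python
import math

K_B = 1.380649e-23   # Boltzmann constant, J/K
H = 6.62607015e-34   # Planck constant, J s
R = 8.314462618      # gas constant, J/(mol K)
KCAL = 4184.0        # joules per kcal

def eyring_rate(dg_kcal, temp_k=310.0):
    """First-order rate constant (s^-1) from the Eyring equation,
    assuming a transmission coefficient of 1."""
    return (K_B * temp_k / H) * math.exp(-dg_kcal * KCAL / (R * temp_k))

def half_life_s(dg_kcal, temp_k=310.0):
    """Half-life in seconds for a first-order process with this barrier."""
    return math.log(2) / eyring_rate(dg_kcal, temp_k)

for barrier in (38.6, 26.2):
    print(f"{barrier} kcal/mol: k = {eyring_rate(barrier):.2e} s^-1, "
          f"half-life = {half_life_s(barrier):.2e} s")

speed_up = eyring_rate(26.2) / eyring_rate(38.6)
print(f"rate increase: {speed_up:.1e}")
```

On this estimate the 12.4 kcal/mol reduction is worth more than eight orders of magnitude in rate, taking the half-life from millions of years to the order of days.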

October 29, 2024
A one-electron bond in methyl-λ1-borane.

In exploring one-electron carbon-carbon bonds, I had noted previously[cite]10.59350/88k04-2x509[/cite] that both hexafluoroethane and ethane itself could each lose an electron to produce such species. A discussion developed in which a molecule isoelectronic with ethane radical cation, namely the methyl-λ1-borane radical (H₃B-CH₃) was proposed by Jacob. The optimised structure at the ωB97XD/6-31G(d) level exhibited a B-C bond length of 1.57Å, with two of the B-H hydrogens forming a 3c-3e bond with boron and so a one-electron B-C bond was discounted. Here I take a closer look at this system.

At the ωB97XD/Def2-TZVPP level, I located an alternative structure with a longer B-C bond of 1.737Å[cite]10.14469/hpc/14662[/cite] and an “agostic” like interaction between C and one B-H bond.

The electron density difference map between methyl-λ1-borane and its monocation is shown below, followed by the density difference map between the corresponding anion and the methyl-λ1-borane radical. These are very similar to the maps obtained previously for hexafluoroethane and ethane and support the hypothesis that the differences between the two-electron/zero-electron species and the one-electron radical originate at least in part in the B-C bond.

A contour map of the negative region of the electron density Laplacian (-0.04 au) again shows that it lies along the B-C bond, suggesting covalency. Note the -ve Laplacian in the region of the agostic interaction! The NCI (non-covalent-interaction) plot is featureless.

The computed methyl-λ1-borane radical has a B-C stretching vibration corresponding to 494 cm^-1, a Wiberg bond order of 0.660 and Wiberg bond index totals of 3.51 for carbon and 3.28 for boron. These can all be reasonably interpreted as a one-electron “half” bond between C and B. With a computed bond length of 1.737Å, it represents the shortest “one electron” bond thus far identified, and hence extends the length range of such bonds to around 1.16Å.

Postscript 1
I also looked at the radical anion H₃B-BH₃^–, which is isoelectronic with methyl-λ1-borane. It has r_B-B 2.124Å and a classic “ethane” D_3d like structure. The electron density difference map between H₃B-BH₃^– and the neutral H₃B-BH₃ is shown below, revealing a considerable reorganisation of the electron density, only one aspect being the B-B region, and different from the reorganisation of the radical cation of ethane itself. This reveals that simply talking about a two-atom region for this sort of system is very simplistic and misleading. The Wiberg B-B bond index is 0.383 and the B-B stretching vibration is 384 cm^-1.

The electron density Laplacian of H₃B-BH₃^–, contoured for a -ve value of -0.04 au, again implies a covalent B-B bond.

Postscript 2

Here I add hexamethylethane radical cation to the list. Firstly the density difference map. Note the longer C-C bond (2.31Å) than for ethane radical cation (1.933Å). In this sense, the hexamethyl radical cation has a weaker C-C bond than does the unsubstituted version (191 cm^-1 vs 477 cm^-1).

The Laplacian shows no -ve value in the C-C region (isosurface value -0.01), again placing it in the weak bond category.

Finally some NCI plots. Here the density cut-off threshold is crucial. Typically a second period element covalent density is taken as 0.05 au, and this is removed from the NCI analysis. The feature seen along the C-C bond at this level is typical of weak covalent interactions however.

Reducing the density to 0.023 (typical of density in which one atom is of the third period, ie Si) removes the central C-C feature, leaving only NCI effects between the hydrogen atoms of the methyl groups. These in fact form a continuous weakly stabilizing surface between the two halves.

So with hexamethylethane radical cation, we get messages that the interaction between the two carbons is weak, but also that it is not a non-covalent interaction. So this is a very weak covalent bond perhaps, but in this strange region, it is difficult to ascribe a single description to it.

October 9, 2024
The one-electron carbon-carbon bond: Hexafluoroethane and ethane radical cations.

In the previous post, I looked[cite]10.59350/xp5a3-zsa24[/cite] at the recently reported[cite]10.1021/ja02261a002[/cite] hexa-arylethane containing a carbon-carbon one-electron bond, its structure having been determined by x-ray diffraction (XRD). The measured C-C bond length was ~2.9Å and my conclusion was that the C…C region represented more of a weak “interaction” than of a bond as such. How about a much simpler system, hexafluoroethane? Here, the two-electron C-F bonds are much lower in energy than the C-C bond, so when the molecule is ionised, the electron is removed from the C-C bond rather than from any of the C-F bonds. Below is the structure computed at the ωB97XD/Def2-TZVPP level, revealing a much shorter C-C bond of 2.149Å. The computed C-C stretching vibrational frequency is 179 cm^-1 (FAIR data DOI: [cite]10.14469/hpc/14642[/cite]).

An electron density difference map, obtained by subtracting the computed density of the dication from that of the radical cation at the geometry of the former is shown below, confirming that the electron has been removed from the C-C region, with a smaller removal from the C-F bonds.

The Laplacian of the electron density is shown below contoured for negative values of this function. Unlike the previous molecule, this now has a (small) negative value along the C-C region (contour -0.001).

A calculation of the NCI surface gave a null result! The parameters for computing a non-covalent analysis are thus: [0.5 1 0.0005 0.05 0.95 1.00], being the ones used in the previous analysis. The value of 0.05 is the density cutoff used to remove covalent density and using this value, no non-covalent features are detected. Or, put another way, only covalent features are present, as supported by the -ve Laplacian noted above.

Whilst C₂F₆^+. cannot be claimed to be typical of a molecule with a hypothetical “pure” one-electron C-C bond, it is certainly very different from the previous example.[cite]10.59350/xp5a3-zsa24[/cite],[cite]10.1021/ja02261a002[/cite] Time to go all the way and try ethane itself, C₂H₆^+.. Again the same behaviour is seen, whilst the calculated C-C length reduces to 1.933Å. The C-C stretching vibrational frequency is elevated to 477 cm^-1. We might take these last values as the natural ones for a one-electron C-C bond?

This alternative subtraction involves the density difference between neutral ethane and its radical cation. The result is essentially the same.

So these two ethane derivatives add some further context to the properties of a one-electron C-C bond. We have seen them range from a low of ~1.9Å to a high of ~2.9Å. This variation of around 1Å as a function of the substituents on the two carbons must be the largest ever seen for any kind of bond!

October 3, 2024
A carbon-carbon one-electron bond! Or a weak carbon-carbon interaction?

More than 100 years ago, before the quantum mechanical treatment of molecules had been formulated, G. N. Lewis proposed[cite]10.1021/ja02261a002[/cite] a simple model for chemical bonding that is still taught today. This is the idea of the three categories of bond we know as single, double and triple, comprising respectively two, four and six shared electrons each, at least for the very common carbon-carbon bond. A little more than a decade ago, this was extended upwards to the eight-electron quadruple bond.[cite]10.1038/nchem.1263[/cite] Now, at the other extreme, a molecule has been characterised in the solid state with a one-electron C-C bond.[cite]10.1038/s41586-024-07965-1[/cite] In this sub-two-electron region, bonds such as hydrogen bonds have long been recognised and they form part of a class of “weak” bonding known instead as exhibiting “non-covalent-interactions” or NCI. But specifically a one-electron carbon-carbon bond stands apart from these weaker types and so it is certainly news when one such is reported and characterised in the crystalline state by x-ray diffraction.

To start the investigation, a search of the crystal structure database was performed using the following more general query of the structure above. The central C-C bond (in green below) was not added, leaving the two carbons as 3-coordinate.

This resulted in 10 hits, all revealed as dications, with the central C-C distance ranging from 2.8Å to 3.0Å. So the unique feature of this new report is that they were able to find a system where oxidation did not proceed directly to the dication, but stopped at the 1-electron level to give a radical cation instead. This new structure poses a bit of a quandary for the curators of the CSD. The index for this database is built on the basis of whether any two atoms in a molecule are connected by a “bond”, and the allowed values for bonds range from single to quadruple, with various intermediate descriptions (such as aromatic) and finally “any”. This latter basically means any of the previous, but what I am pretty certain of is that it does not mean “one-electron”, or “half”. The new compound has not yet been indexed in my current version of the CSD, so this presumption is not yet tested.^‡

The authors[cite]10.1038/s41586-024-07965-1[/cite] did also make the dication and they report a length of 3.03Å for this species, broadly in accord with the range shown above and a reduced value of 2.92Å for the radical cation (Δ_r 0.11Å). This is quite a small contraction induced by the formation of the one-electron bond, which is already hinting that it might actually be a weak bond.

Next, I proceeded by performing my own DFT calculations on these species, at the ωB97XD/Def2-TZVPP level (FAIR data DOI: [cite]10.14469/hpc/14642[/cite]). At this level the di- and monocationic C-C bond lengths came out as 3.075Å and 2.867Å (Δ_r 0.21Å), a slightly larger contraction than that reported, but still representing a weak bond.

With wavefunctions now available for the species, I decided to inspect the electron densities. The density of the radical cation was calculated at its own geometry, then that of the dication at the same geometry, and the two electron densities subtracted. The resulting density surface (contour level 0.002au) representing one electron is shown below. As expected, the most significant feature occurs in the C-C region, but quite a lot of this one electron is distributed around the aromatic rings (I must find out how to integrate regions!). So already we see that this “1-electron” bond is in fact only a fraction of one electron. Again an indication that it is a weak bond.
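On the matter of integrating regions: if the density difference is available as values on a regular grid, one simple approach is to sum the values at the grid points falling inside the region of interest and multiply by the volume of one grid cell. The sketch below shows only that idea and is not what was done for this post. It uses a synthetic Gaussian "density" normalised to one electron in place of real data, and the function names are hypothetical.

```python
import math

def gaussian_density(x, y, z, alpha=1.0):
    """Synthetic stand-in for a density difference: a 3D Gaussian
    that integrates to exactly one electron over all space."""
    norm = (alpha / math.pi) ** 1.5
    return norm * math.exp(-alpha * (x * x + y * y + z * z))

def integrate_region(density, inside, lo=-5.0, hi=5.0, n=40):
    """Midpoint-rule integral of density(x, y, z) over the cells of a cubic
    grid whose centres satisfy inside(x, y, z)."""
    step = (hi - lo) / n
    cell_volume = step ** 3
    centres = [lo + (i + 0.5) * step for i in range(n)]
    total = 0.0
    for x in centres:
        for y in centres:
            for z in centres:
                if inside(x, y, z):
                    total += density(x, y, z) * cell_volume
    return total

# Whole box, then only a sphere of radius 1 around the origin.
whole = integrate_region(gaussian_density, lambda x, y, z: True)
core = integrate_region(gaussian_density, lambda x, y, z: x * x + y * y + z * z <= 1.0)
print(f"whole box: {whole:.3f} electrons, inner sphere: {core:.3f} electrons")
```

For a real molecule the `inside` test would be replaced by whatever defines the C-C region, for example a cylinder around the internuclear axis, and the choice of that boundary is itself a judgement call.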

A procedure often used to identify weak bonds is called NCI, or noncovalent-interactions.[cite]10.1021/ja100936w[/cite] These are by definition interactions weaker than the single bonds, often being hydrogen bonds and other unusual interactions such as a π-π stacking region (rather than a bond). So here, we see that below the single bond type, we get a continuum of interactions rather than bonds as such. The resulting NCI analysis is shown below for firstly the radical cation and then the di-cation at the same geometry.

The colour coding in the NCI surface analysis above means that dark blue are strong non-covalent interactions such as hydrogen bonds, paler blue or cyan areas are weaker ones and green is weaker still and typical of π-π stacking regions rather than bonds between two atoms. These are all deemed stabilising, whereas orange and red regions are destabilising. Click on the image above to inspect the full three dimensional surface of this NCI function and you will find the π-π stacking features, but also three cyan regions. Enclosed by two of the cyan regions are dark blue ones, whilst the third cyan region contains only a small blue part. This third cyan region is indeed in the C-C one-electron bond region, but using this analysis it emerges as only a “weak” interaction.

But a surprise! The two dark blue regions, deemed strong “interactions” are between a C-H of an aryl group and the two carbon atoms shown with blue dots in the diagram above and these are apparently more stabilizing than the one-electron C-C “bond”. Should they not also be bonds then?

The plot above is for the di-cation at the radical cation geometry. It emerges as very similar to the radical cation itself, although the C-C cyan NCI region is less intense than that for the latter and now contains little trace of the dark blue inner core.

We might conclude from this inspection of the newly reported molecule containing a one-electron C-C bond that it probably belongs to the class known as an “interaction” rather than an actual bond. Even as an interaction, it is not particularly strong – in part this is probably because only a proportion of that one electron is actually located in the C-C region, with the rest being distributed around the aromatic rings. However, I rather suspect that despite it resembling an interaction, it will no doubt become known as a bond!

Added in response to comment

Below is shown the Laplacian of the electron density (a definition can be found at eg [cite]10.59350/bk5zm-6rk67[/cite]). Negative values of the Laplacian appear here in purple and positive values in orange (contour value 0.125 a.u). The regular C-C bonds are all enclosed in a negative region of the Laplacian, whilst the one-electron C-C bond lies in the orange region.

^‡ Note added 4/03/2026. The archived crystal structures at CCDC are shown without a bond in the C-C region.[cite]10.5517/ccdc.csd.cc2h7dw1[/cite],[cite]10.5517/ccdc.csd.cc2h7dv0[/cite],[cite]10.5517/ccdc.csd.cc2h7dx2[/cite],[cite]10.5517/ccdc.csd.cc2h7dy3[/cite]

October 1, 2024
Mechanism of the Masamune-Bergman reaction. Part 3: The transition state for Calicheamicin models.

Calicheamicin was noted in the previous post as a natural product with antitumour properties and having many weird structural features such as an unusual “enediyne” motif. The representation is shown below.

A partial structure shown below for Calicheamicin replaces the -(CH₂)₄- substructure with a four carbon chain that includes two sp² centres instead of two sp³ centres. The purpose is to find out how these structural modifications to the classic Bergman reaction affect the mechanism.

TS1 is shown below for this model and the computed free energy barrier for this cyclisation is 42.5 kcal/mol at the uωB97XD/Def2-TZVPP level, <S²> = 0.345. FAIR Data DOI[cite]10.14469/hpc/14583[/cite]. This compares with 33.0 kcal/mol calculated for the -(CH₂)₄- version, for which <S²> = 0.266. To prepare for modelling the full Calicheamicin molecule, the basis set for this model was reduced to Def2-SVPP and at this level ΔG^‡ was 43.0 kcal/mol, <S²> = 0.368, the difference being small enough that the reduction in basis set seems unlikely to affect the results. The C-C bond forming lengths are 1.957 (Def2-TZVPP) and 1.989Å (Def2-SVPP).

Now for a larger model containing the entire Calicheamicin molecule. Two possibilities were explored. In the first, the geometry of the system was fully optimised in isolation to yield a conformation for Calicheamicin which folded in upon itself and for which ΔG^‡ (Def2-SVPP) is 40.1 kcal/mol, <S²> 0.368.

The second model used the initial geometry of Calicheamicin as obtained from a crystal structure of the ligand folded into the minor groove of a DNA fragment and which has a much more linear form. The reactant in this model was 6.1 kcal/mol higher in energy than the previous and TS1 was 4.6 kcal/mol higher, leading to ΔG^‡ 38.6 kcal/mol, <S²> 0.367.
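The bookkeeping behind that last barrier can be written out explicitly, reading the two offsets as being relative to the reactant and to TS1 of the folded model respectively. The subscripts "folded" and "linear" are just my labels for the two models.

```latex
\[
\Delta G^{\ddagger}_{\mathrm{linear}}
  = \Delta G^{\ddagger}_{\mathrm{folded}}
    + \Delta\Delta G(\mathrm{TS1})
    - \Delta\Delta G(\mathrm{reactant})
  = 40.1 + 4.6 - 6.1
  = 38.6\ \mathrm{kcal/mol}
\]
```

The barrier drops only because the linear reactant is destabilised by more than the linear transition state is.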

So what conclusions can we draw from these extended models of the Bergman cyclisation? The activation free energies for all three models are in the range 42.5 – 38.6 kcal/mol, which is a great deal higher than a value commensurate with a facile room temperature reaction (~22±3 kcal/mol). The observation that Calicheamicin can in fact be characterised as a crystal structure when bound to DNA suggests that the cyclisation barrier cannot be too low, but conversely the range 42.5 – 38.6 kcal/mol appears too large for Calicheamicin to easily activate into a biradical in order to abstract a hydrogen atom and end up causing strand scission. Might the simplistic model of a split UHF wavefunction resulting in values of <S²> 0.37 be the problem? Well, a similar approach was taken to modelling the Stevens rearrangement [cite]10.59350/4010f-fvr26[/cite]. Using a plain non-biradical closed shell wavefunction, a barrier of ~48 kcal/mol was obtained, but this reduced to 14 kcal/mol when the UHF method was applied (<S²> 0.421), so this model appears to work well in those circumstances. The jury must still be out on whether the Bergman cyclisation mechanism is being correctly modelled here or whether something more complex is going on.

September 11, 2024
Mechanism of the Masamune-Bergman reaction. Part 2: a possible 3D Model for Calicheamicin revealing the non-covalent-interactions (NCI) present.

Calicheamicin is a natural product with antitumour properties discovered in the 1980s, with the structure shown below. As noted elsewhere, this structure has many weird properties, including amongst other features an unusual “enediyne” motif and the presence of an iodo group on an aromatic ring. Its isolated 3D structure is quite difficult to get hold of (embedded structures in a DNA fragment are available however); the 3D model associated with the Wikipedia entry is essentially only in 2D. The representation shown below, including the absolute stereochemistry, was obtained from the SciFinder entry.

As a prelude to modelling the mechanism of the Bergman cyclisation (for Part 1 in which a simple cycloendiyne is explored, see DOI: 10.59350/jczra-f0r90 [cite]10.59350/jczra-f0r90[/cite]) of the enediyne ring on this actual molecule, a 3D model was constructed.^‡ One possible such model is shown below, built to maximise wherever possible interactions such as hydrogen bonds and weak dispersion attractions from eg methyl groups. A side benefit of doing this is the natural emergence of a “cavity” in which the very large iodine atom snuggles, as it happens adjacent to the enediyne component – something you would not naturally infer from the structure representation shown above! A spacefill model of this conformation is shown below (click on the image to get an interactive version), emerging from an ωB97XD/Def2-SVPP energy minimisation (DOI: 10.14469/hpc/14586).[cite]10.14469/hpc/14586[/cite]

The below shows a crystal structure (2pik) of Calicheamicin embedded into a DNA duplex,^† which shows a stretched linear conformation of Calicheamicin rather than the compact form more appropriate for an isolated molecule.

The next step was to use the ωB97XD/Def2-SVPP wavefunction[cite]10.14469/hpc/14586[/cite] to calculate the full electron density for the molecule, and using this to evaluate the NCI (non-covalent-interaction) isosurfaces. These are shown below, and the eye is immediately drawn to the regions surrounding that iodine atom, which are replete with attractive green surfaces. Blue and cyan coloured surfaces derive from hydrogen bonds formed within the 3D structure (click on the image to get an interactive version, but be patient, it takes a little while to load).

The next stage, using the model to evaluate the energetics of the Masamune-Bergman cyclisation for Calicheamicin itself, will be reported in part 3.

^‡For those interested, this was constructed in stages. The structure representation had been drawn in ChemDraw, saved as a pseudo 3D molfile and then loaded into GaussView. There, it was subjected to several cycles of energy minimisation using the MMFF94 molecular mechanics force field. The stereochemistry of all the centres was carefully checked at each stage and, if necessary, corrected and re-optimised. The next stage was to subject it to a PM7 semiempirical SCF minimisation, a method which includes dispersion attraction terms and which tends to give geometries that are quite close to eg those obtained using dispersion-corrected DFT methods, in this example ωB97XD/Def2-SVPP.

August 26, 2024

Mechanism of the Masamune-Bergman reaction. Part 1.

The Masamune-Bergman reaction[cite]10.1039/C29710001516[/cite],[cite]10.1021/ja00757a071[/cite] is an example of a highly unusual class of chemical mechanism[cite]10.1021/cr4000682[/cite] involving the presumed formation of the biradical species shown as Int1 below by cyclisation of a cycloenediyne reactant. Such a species is so reactive that it will be quickly trapped, as for example by dihydrobenzene to form the final product. This cycloenediyne is not just an obscure chemical curiosity; the motif is incorporated into the natural product Calicheamicin, which is a potent antitumor antibiotic discovered in the 1980s. This drug owes its activity to the cyclisation TS1 shown below, which for n=2 occurs at the low temperature of 310K. The resulting biradical Int1 is a potent hydrogen abstractor, the species acting this way for hydrogen atoms associated with deoxyribose of DNA, ultimately leading to strand scission. Although I have explored many a mechanism on this blog using computational methods, I have never included any biradical examples. Here I explore the computational aspects of this reaction, and also include a pathway proceeding via TS2 – Int2 – TS3 in which hydrogen abstraction precedes cyclisation, in order to see how competitive such an alternative might be as a function of the ring size (n in scheme below).

The computational procedure was ωB97XD/Def2-TZVPP and the FAIR data is collected at DOI: 10.14469/hpc/14546 [cite]10.14469/hpc/14546[/cite]. A spin unrestricted procedure is adopted using an approximation to allow for biradicaloid species, namely an initial first guess at the wavefunction using the keyword guess(mixed) which mixes what would be the HOMO and the LUMO of the molecule in a closed shell sense to allow a combination which includes an open shell singlet with one electron in the HOMO and one electron in the LUMO (a biradical). Part of the purpose of this approach is to try to find out if it gives reasonable results for such a mechanism. I will introduce the spin expectation operator <S²> to help identify biradicals. For closed shell singlets it has the value 0.0, for a pure biradical it has the value 1.0. Thus for species Int1, the values are typically ~0.995 and for the preceding TS1 ~ 0.3 to 0.57. IRC (Intrinsic reaction coordinate) calculations for TS1 show a smooth transition from values of <S²> = 0.0 (Reactant) through to 1.0 (Int1).

The results are shown below for three values of n, revealing that as the ring size increases (ending with an acyclic system Et₂) the free energy barrier increases significantly, as indeed is reported[cite]10.1039/C29710001516[/cite],[cite]10.1021/ja00757a071[/cite]. The alternative pathway proceeding via TS2 is always higher in free energy and varies much less with ring size. This route can therefore be firmly excluded from contention.

Table. Free energies (Hartree) and relative free energies (kcal/mol) for two mechanistic routes
System	Reactant	TS1	TS2	Int2	TS3	Int1
n=1	-580.843968 0.0	-580.806441 23.6	-580.771042 45.8	-580.797829 29.0	-580.795875 30.2	-580.838524 3.5
n=2	-620.142895 0.0	-620.090298 33.0^†	-620.068740 46.5	-620.094239 30.5	-620.088955 33.8	-620.131731 7.0^‡
n=3	-659.434065 0.0	-659.370635 39.8	-659.356146 48.9	-659.384469 31.1	-659.375211 36.9	-659.413969 12.6
Et₂	-621.348992 0.0	-621.278904 44.0	-621.265041 52.7	-621.292104 35.7	-621.280607 42.9	-621.319521 18.5

^†<S²> = 0.27. ^‡Final Product (n=2) = -620.327868 (-116.1 kcal/mol)
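The relative values in the table follow from the absolute free energies by a unit conversion. A minimal sketch of that arithmetic for the n=2 row is given here, assuming the absolute values are in Hartree and using the conversion factor 1 Hartree = 627.5095 kcal/mol (an assumption; the factor used for the table is not stated). The function name `relative_free_energies` is purely illustrative.

```python
# Assumed conversion factor: 1 Hartree = 627.5095 kcal/mol.
HARTREE_TO_KCAL = 627.5095

# Free energies (Hartree) copied from the n=2 row of the table,
# plus the final product quoted in the footnote.
n2 = {
    "Reactant": -620.142895,
    "TS1": -620.090298,
    "TS2": -620.068740,
    "Int2": -620.094239,
    "TS3": -620.088955,
    "Int1": -620.131731,
    "Product": -620.327868,
}


def relative_free_energies(energies, reference="Reactant"):
    """Return free energies in kcal/mol relative to the reference species."""
    ref = energies[reference]
    return {name: (g - ref) * HARTREE_TO_KCAL for name, g in energies.items()}


relative_n2 = relative_free_energies(n2)
for name, value in relative_n2.items():
    print(f"{name:8s} {value:7.1f} kcal/mol")
```

With this factor the n=2 entries agree with the tabulated relative values after rounding to one decimal place; other rows may differ in the last digit depending on the conversion factor and rounding used.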

This computational modelling largely agrees with the observations made for this reaction, with just one inconsistency. For n=2, the reaction is reported as taking place at 37°C, for which a typical free energy barrier would be in the region of ~24±2 kcal/mol,[cite]10.1021/ja00045a005[/cite] around 9 kcal/mol lower than the computed value at this level of theory. This could originate from either a deficiency in the computational model, possibly in the handling of the open shell biradicaloid character by use of a simple spin unrestricted model,[cite]10.1021/ja402445a[/cite] or incursion of some lower energy process into the mechanism (free radical involvement?). I will continue probing this issue to see if its origins can be identified.
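To get a feel for what a 9 kcal/mol discrepancy means in practice, the barrier can be converted into an approximate rate using the Eyring equation, k = (k_B·T/h)·exp(−ΔG‡/RT). The sketch below is only illustrative: it assumes a unimolecular first-order process, a transmission coefficient of 1, and takes 37°C as 310.15 K. The function names are hypothetical.

```python
import math

K_B = 1.380649e-23     # Boltzmann constant, J/K
H = 6.62607015e-34     # Planck constant, J s
R = 1.98720425864e-3   # gas constant, kcal/(mol K)


def eyring_rate(dg_kcal, temperature_k):
    """First-order rate constant (1/s) from a free energy barrier in kcal/mol.

    Assumes a transmission coefficient of 1.
    """
    return (K_B * temperature_k / H) * math.exp(-dg_kcal / (R * temperature_k))


def half_life_hours(dg_kcal, temperature_k):
    """Half-life in hours of a first-order process with the given barrier."""
    return math.log(2) / eyring_rate(dg_kcal, temperature_k) / 3600.0


T = 310.15  # 37 °C in kelvin
for barrier in (24.0, 33.0):
    print(f"{barrier:4.1f} kcal/mol -> half-life {half_life_hours(barrier, T):.3g} h")
```

Under these assumptions a barrier of 24 kcal/mol corresponds to a half-life of a few hours at body temperature, whereas the computed 33.0 kcal/mol would correspond to a half-life of centuries, which is why the difference matters.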

In the next part of this blog, I will investigate the mechanism as applied to Calicheamicin to see how the more complex bicycloenediyne nature of this natural product affects it.

August 24, 2024

Revisiting open/transparent peer review.

Back in 2017, I was asked to peer review an article and its author asked if I would like the review to be “open” – that is, that my name would be shown as a reviewer;[cite]10.1073/pnas.1709586114[/cite] indeed it was!

Open peer review

However, I soon found out that neither of the reviews themselves would be shown alongside the article, an experience that I commented on at the time.[cite]10.59350/j44h6-d3e36[/cite] Replication in Science was – and still is – a hot topic and I had taken the opportunity with this article to try to (successfully I might add) replicate its main (computational) findings. This is something relatively easy to do with computation, but of course far more of a challenge to do for experimental work for obvious reasons. I still regularly attempt some level of replication when I review articles nowadays.

So on to 2024, when I was asked – this time as an author – whether I would like the reviews of our own article to be so included, [cite]10.1039/D3DD00246B[/cite] now called transparent peer review.

Now the open aspects have been inverted! Whereas the identity of the reviewers continues to be withheld, their actual reviews are now available to be read, along with the authors’ responses. There is still no way in which any attempt at “replication” can be indicated – the reviews themselves are in free-text form and the reader has to judge for themselves what they might mean and whether replication was part of the process. I also wonder if replication whilst preserving reviewer anonymity can be achieved?

Not all journals by the Royal Society of Chemistry publisher offer transparent review and it is optional of course. But a search of the string “To support increased transparency, we offer authors the option to publish the peer review history alongside their article” suggests around 73 articles in several journals have such review. What is more difficult to establish is what proportion of published articles expose their reviews – is it a high or a low percentage? Time will probably reveal this aspect.

It is also worth noting another experiment along these lines, the so-called Octopus publishing[cite]10.59350/qxjaz-a2298[/cite] model, where a scholarly article can have up to eight distinct components, each in theory written by different authors and where any one section could have several contributions – including a replication study. Each set of authors gets credit, in the form of one or more publication DOIs. This publishing experiment has been running now for almost four years, although I note there are few if any submissions in the area of physical sciences and chemistry.

It might be fair to suggest that with innovations such as these, scholarly publishing is likely to evolve significantly over the next few years.

July 31, 2024
How should data be cited in journal articles? A Crossref request for public comment!
Metadata is something that goes on behind the scenes and is rarely of concern to either authors or readers of scientific articles. Here I tell a story where it has rather greater exposure. For journals in science and chemistry, each article published has a corresponding metadata record, associated with the persistent identifier of the article and known to most as its DOI. The metadata contains information about the article such as its authors and their affiliations, the title of the article and its abstract, and is submitted to/registered with Crossref – an organisation set up in 1999 on behalf of publishers, libraries, research institutions and funders. Relatively recent additions to Crossref metadata are the citations included in the article, so-called Open Citations. Doing so has helped to create the new area of article metrics, used by e.g. Altmetrics or Dimensions to help identify the impacts that science publications have. Basically, if one article is cited by another, it is making an impact. Many citations of a given article by other articles means a larger impact. Most researchers love to have a high – and of course positive – impact and perhaps for better or worse, academic careers to some extent depend on such impacts.

With that as the background, I now move to a recent article of ours.[cite]10.1039/D3DD00246B[/cite] The metadata record for this article can be obtained using the query:
https://api.crossref.org/works/10.1039/D3DD00246B/transform/application/vnd.crossref.unixsd+xml (retrieved 14/07/2024).^‡

This has 63 citations in the body of the article, with the unusual but pertinent aspect that 30 of these relate not to other articles or to web links, but to data – specifically FAIR data. We even comment on this in our conclusions – “The citations noted here are included in the metadata record for the article, which is registered with Crossref, albeit with one significant current limitation in that there is currently no formal declaration of these citations as specific pointers to a FAIR data collection.” This statement was made on the premise that the article citations would show a 1:1 match with the metadata entries (which they do, see below; but see also here[cite]10.48550/arXiv.2310.02192[/cite]).

Before I take a look at this, I note that Crossref metadata does not treat all citations equally. The traditional form of citation appears as follows for reference 25 (there are 29 of these in total).
```
<citation key="D3DD00246B/cit25/1">
<journal_title>J. Chem. Phys.</journal_title>
<author>Scalmani</author>
<cYear>2010</cYear>
<first_page>114110</first_page>
<doi>10.1063/1.3359469</doi>
</citation>
```
A variant of this is used for items related to journal articles, such as preprints, where an “unstructured” component is added to the citation. This is often used as a short commentary added by the authors relating to the citation – in this case indicating that it relates to a preprint of the article itself. The term “unstructured” also means that the commentary may not have any predictable patterns, or use any terms from a specified dictionary, and may need the special expertise of a human to process it. In other words, “unstructured” components may not be “machine friendly”, or at least a machine may have to work quite hard to work out what to do with the commentary.
```
<citation key="D3DD00246B/cit10/1">
<volume_title>ChemRxiv</volume_title>
<author>Braddock</author>
<cYear>2024</cYear>
<doi>10.26434/chemrxiv-2023-vcmcl</doi>
<unstructured_citation>For a preprint, see, D. C.Braddock, S.Lee and H. S.Rzepa, SWERN Oxidation. 
transition structure Theory is OK, ChemRxiv, 2023, preprint, 10.26434/chemrxiv-2023-vcmcl
</unstructured_citation>
</citation>
```
A third variation on this is present, but this time apparently relating to data itself. Note again the use of an “unstructured” commentary, which effectively adds the information that the citation might “apparently” relate to data. To be fair, the volume title also does that, but this should not be its job!
```
<citation key="D3DD00246B/cit19/1">
<volume_title>Imperial College Research Data Repository</volume_title>
<author>Braddock</author>
<cYear>2023</cYear>
<doi>10.14469/hpc/13108</doi>
<unstructured_citation>
D. C.Braddock , H. S.Rzepa and S.Lee, Imperial College Research Data Repository, 2023, 
10.14469/hpc/13108</unstructured_citation>
</citation>
```
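The three citation variants above can be told apart programmatically, at least in outline. The sketch below parses an abridged sample built from those three examples and records, for each citation, its DOI and whether it carries an `<unstructured_citation>`. It is a sketch under assumptions: the wrapping `<citation_list>` element and the helper name `summarise_citations` are illustrative, and tags are matched by local name only because the full Crossref record is namespaced and more deeply nested than this sample.

```python
import xml.etree.ElementTree as ET

# Abridged sample assembled from the three citation examples shown above.
SAMPLE = """<citation_list>
  <citation key="D3DD00246B/cit25/1">
    <journal_title>J. Chem. Phys.</journal_title>
    <doi>10.1063/1.3359469</doi>
  </citation>
  <citation key="D3DD00246B/cit10/1">
    <volume_title>ChemRxiv</volume_title>
    <doi>10.26434/chemrxiv-2023-vcmcl</doi>
    <unstructured_citation>For a preprint, see ...</unstructured_citation>
  </citation>
  <citation key="D3DD00246B/cit19/1">
    <volume_title>Imperial College Research Data Repository</volume_title>
    <doi>10.14469/hpc/13108</doi>
    <unstructured_citation>D. C.Braddock ...</unstructured_citation>
  </citation>
</citation_list>"""


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def summarise_citations(xml_text):
    """Return one dict per <citation> element found anywhere in the XML."""
    root = ET.fromstring(xml_text)
    summary = []
    for element in root.iter():
        if local_name(element.tag) != "citation":
            continue
        children = {local_name(c.tag): (c.text or "").strip() for c in element}
        summary.append({
            "key": element.get("key"),
            "doi": children.get("doi"),
            "container": children.get("journal_title") or children.get("volume_title"),
            "has_unstructured": "unstructured_citation" in children,
        })
    return summary


citations = summarise_citations(SAMPLE)
for entry in citations:
    print(entry)
```

Note that nothing in this output distinguishes the preprint from the dataset except the free text and the container title, which is exactly the limitation discussed here.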
Why might this be important? Well, the mantra nowadays is that information has to be processable not only by humans but also by machines undertaking learning or “artificial intelligence”. Such ML/AI is at least in part about finding predictable patterns in data, and unstructured citations imply a certain lack of predictability! A machine can “read” a journal article and that should also be possible for the data on which inferences reported in the article are made. So that data has to be findable and accessible in the first instance and then interoperable and re-useable in the second instance. These attributes are known as FAIR. So it would be great if the metadata for the article could indicate to a machine when associated data might be available – and even better to suggest that this data might have attributes of FAIR.

So we now understand that there does need to be a formal agreed way of specifically expressing a data citation in the Crossref metadata, rather than just carrying an unstructured commentary in the citation. The good news is that such a mechanism is on the way! A public discussion document requests comments by August 15th, 2024 and introduces two new Crossref additions to the metadata, which are interpreted below in terms of the article we are discussing.
1. ```
<citation type="dataset" key="D3DD00246B/cit19/1">
<volume_title>Imperial College Research Data Repository</volume_title>
<author>Braddock</author>
<cYear>2023</cYear>
<doi>10.14469/hpc/13108</doi>
</citation>
```
A more formal statement is also now added, and I quote Crossref’s reasons for its inclusion: “we’d like to support several types of free-text statements in our metadata records as we’ve had feedback that they can be useful for downstream metadata users who are able to parse out and refine chunks of text in ways that may be useful. The statements are also useful for re-use in some situations.” In some ways, it replaces the unstructured citation from the example above, but now using a controlled dictionary term to specifically relate to data.
2. ```
<statement type="data availability">Data Availability and Discovery Statement</statement>
```
Let us now see how all this is handled for the article we are discussing.[cite]10.1039/D3DD00246B[/cite]
- The data itself, found as a collection with its own metadata record[cite]10.14469/hpc/13058[/cite] can and does cite the article[cite]10.1039/D3DD00246B[/cite].
- The Crossref metadata record for the article as of 17.07.2024 has 38 entries which include an <unstructured_citation>, including 30 relating to data (which are currently inferred by a human).
- If the metadata changes noted above are implemented, the 30 data citations will be clearly identified as such, as in the example shown in item 1 above, and no human inference would be needed.
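If the proposed `type` attribute is adopted, counting the data citations in a record becomes a trivial machine task. The sketch below illustrates this on a two-entry sample; it assumes the attribute takes the form interpreted in this post (`type="dataset"`), which may differ from whatever Crossref finally implements, and the `<citation_list>` wrapper and the function name `count_citation_types` are illustrative only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical record using the proposed type attribute as interpreted here.
PROPOSED = """<citation_list>
  <citation type="dataset" key="D3DD00246B/cit19/1">
    <volume_title>Imperial College Research Data Repository</volume_title>
    <author>Braddock</author>
    <cYear>2023</cYear>
    <doi>10.14469/hpc/13108</doi>
  </citation>
  <citation key="D3DD00246B/cit25/1">
    <journal_title>J. Chem. Phys.</journal_title>
    <author>Scalmani</author>
    <cYear>2010</cYear>
    <doi>10.1063/1.3359469</doi>
  </citation>
</citation_list>"""


def count_citation_types(xml_text):
    """Count <citation> elements by their type attribute.

    Citations without the attribute are counted as 'unspecified'.
    """
    root = ET.fromstring(xml_text)
    return Counter(
        element.get("type", "unspecified")
        for element in root.iter()
        if element.tag.rsplit("}", 1)[-1] == "citation"
    )


counts = count_citation_types(PROPOSED)
print(dict(counts))
```

Applied to an updated record for the article discussed here, one would hope such a count to report the 30 data citations directly, with no human inference needed.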
The Crossref public discussion document will remain available for another four weeks or so – meanwhile, public comments are requested! Once these enhancements have been implemented, we hope that the article metadata record we are analysing here can in turn be updated^‡ to reflect the FAIR data richness of the article. And then perhaps Altmetrics or Dimensions can start producing metrics relating to the impact of cited data. Watch this space!

^‡One difference between the article itself and its metadata record is that the former does not change (unless a corrigendum is issued) – it is a so-called Version-of-Record or VOR, whereas the metadata record itself can be responsibly updated when deemed necessary. So it is important to note the date associated with any given version of a metadata record.
July 18, 2024
