Author: Henry Rzepa

Cycloheptasulfur sulfoxide, S₇O – Anomeric effects galore!

The monosulfoxide of cyclo-heptasulfur was reported along with cycloheptasulfur itself in 1977,[cite]10.1002/anie.197707161[/cite] together with the remarks that “The δ modification of S₇ contains bonds of widely differing length: this has never been observed before in an unsubstituted molecule.” and “the same effect having also been observed in other sulfur rings (S₈O, S₇I¹⁺ and S₇O).” Here I take a look at the last of these other molecules, the monosulfoxide of S₇, as a follow-up to the commentary on S₇ itself.[cite]10.59350/rzepa.28407[/cite]

The axial oxygen isomer is calculated as being 3.68 kcal/mol more stable than the equatorial form[cite]10.14469/hpc/15228[/cite] and a comparison of its calculated (MN15L/Def2-TZVPP) and observed structure is shown below. The S-S lengths do indeed vary widely.

As before, an explanation is provided by analysing the orbitals of the molecule using NBO7. The interactions tabulated below are listed with the largest first. That from the oxygen into the S4-S5 antibonding NBO (28.2 kcal/mol) is the biggest I have observed for an anomeric effect involving an S-S bond. The greatest all-sulfur effect (16.8 kcal/mol) is increased compared to that previously found for S₇ itself (12.35 kcal/mol).

Donor lone Pair	Acceptor antibonding NBO	E(2), kcal/mol	Acceptor bond distance, Å
O8	S4-S5	28.2	2.28
O8	S5-S6	20.2	2.15
S7	S4-S5	16.8	2.28
S4	S3-S7	14.8	2.18
S2	S3-S7	12.5	2.18
S3	S5-S6	10.3	2.15
O8	S5-S6	9.6	2.15
S6	S1-S2	9.1	2.10
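As a rough cross-check of the table, the E(2) values can be summed for each acceptor bond. This is only a sketch: the function name is hypothetical, and treating E(2) energies as simply additive is an assumption made for illustration, not part of the NBO analysis itself.

```python
from collections import defaultdict

# (donor lone pair, acceptor antibonding NBO, E(2) in kcal/mol, acceptor bond length in Å)
# Values copied from the table above.
INTERACTIONS = [
    ("O8", "S4-S5", 28.2, 2.28),
    ("O8", "S5-S6", 20.2, 2.15),
    ("S7", "S4-S5", 16.8, 2.28),
    ("S4", "S3-S7", 14.8, 2.18),
    ("S2", "S3-S7", 12.5, 2.18),
    ("S3", "S5-S6", 10.3, 2.15),
    ("O8", "S5-S6", 9.6, 2.15),
    ("S6", "S1-S2", 9.1, 2.10),
]


def total_donation_per_bond(interactions):
    """Sum E(2) over all donors for each acceptor bond (assumes additivity)."""
    totals = defaultdict(float)
    lengths = {}
    for _donor, acceptor, e2, length in interactions:
        totals[acceptor] += e2
        lengths[acceptor] = length
    # Round to one decimal place, the precision of the tabulated values.
    return {bond: (round(total, 1), lengths[bond]) for bond, total in totals.items()}


if __name__ == "__main__":
    summary = total_donation_per_bond(INTERACTIONS)
    # Print bonds from the largest to the smallest summed donation.
    for bond, (total, length) in sorted(summary.items(), key=lambda kv: -kv[1][0]):
        print(f"{bond}: total E(2) = {total:5.1f} kcal/mol, length = {length:.2f} Å")
```

On this crude measure the longest bond, S4-S5, receives the largest summed donation and the shortest acceptor bond, S1-S2, the smallest, although the ordering of the two intermediate bonds does not follow the sums exactly.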

[Figure: NBO overlaps^‡ for the interactions with E(2) = 28.2, 20.0, 16.8, 14.8, 12.5, 10.3, 9.6 and 9.1 kcal/mol; each image links to a 3D rotatable model.]

The S-S stretching modes also vary by more than a factor of two; ν_4-7 619 cm^-1, ν_2-3 528 cm^-1, ν_1-6 548 cm^-1, ν_3-7 368 cm^-1, ν_5-6 331 cm^-1, ν_4-5 287 cm^-1.

It is indeed remarkable that this small molecule can exhibit as many as eight different anomeric interactions, including two unusually large ones and three regular ones. The result is the profusion of different S-S bond lengths originally commented on,[cite]10.1002/anie.197707161[/cite] accompanied by a wide variety of S-S stretching modes. Can this record be beaten, either in the number or the magnitude of the effects? The answer is YES, but not for a known molecule. See next post!

May 19, 2025

Cyclo-Heptasulfur, S₇ – a classic anomeric effect discovered during a pub lunch!

Way back in 1977, the crystal structure of the sulfur ring S₇ was reported.[cite]10.1002/anie.197707151[/cite] The authors noted that “The δ modification of S₇ contains bonds of widely differing length: this has never been observed before in an unsubstituted molecule.” No explanation was offered, although they note that similar effects have been observed in S₈O, S₇I⁺ and S₇O. The S₇ molecule was yesterday brought to my attention (thanks Derek!) over a pub lunch and in the time-honoured manner of scientists, sketched out on a napkin – with a pen obtained from the waitress! As an “organic chemist”, I immediately thought “anomeric effects”. And so indeed it has proven. A calculation using the MN15L/Def2-TZVPP DFT method and analysis using the Weinhold NBO7 procedure[cite]10.14469/hpc/15228[/cite] reveals the following structure (with Cs symmetry) and indeed the four unique S-S distances are all different (experimental values in parentheses). So how does this arise?

Effect 1 is the donation of a lone pair from sulfur S4 or S2 into the antibonding orbital of the long S3-S7 bond labelled 2.174Å. The NBO E(2) perturbation energy is 12.35 kcal/mol, a fairly large effect when you consider that the more conventional value involving oxygen instead of sulfur is ~16 kcal/mol. There are two such donations (black and red) and so this long bond is doubly lengthened. Simultaneously the S4-S7 or S2-S3 bonds associated with the donor sulfur are shortened to 1.982Å.

You can see the orbitals involved below (click on the image to obtain a 3D rotatable model) and consider that the blue phase overlaps positively with the purple and also the red with orange. These overlaps conspire to move electrons from the S4 lone pair into the S4-S7 bond and to move electrons from the S3-S7 bond into an S3 lone pair and hence to shorten the first to give it some π-bond character (Wiberg bond index 1.1796) and to lengthen the second bond (Wiberg bond index 0.8295).

Effect 2 is the donation of a lone pair from sulfur S3 or S7 into the antibonding orbital of the S1-S2 bond with length 2.087Å. Only one donation – E(2) is now 10.12 kcal/mol – for each of the two S-S antibonding orbitals occurs (S1-S2 and S4-S5) and hence the lengthening of these is less than before. This again serves to shorten the S2-S3 and S4-S7 bonds labelled with the distance of 1.982Å.

A smaller effect (E(2) 4.6 kcal/mol) occurs between S2/S4 and S1-S6/S5-S6.

So this adds a nice stereoelectronic explanation to an observation made almost 50 years ago. Perhaps this example should be included in all taught inorganic curricula?

Postscript: The S-S stretching frequencies vary a great deal. The symmetric and antisymmetric S2-S3 and S4-S7 modes are respectively ν 564 and 557 cm^-1 whilst the S3-S7 mode is way less at 370 cm^-1.

Postscript 1: The smaller S₅ ring also shows this effect, but to a smaller extent (E(2) = 6.1 kcal/mol) and νS1-S2 = 382 vs 546 and 540 cm^-1.

Also for fun, how about singlet state cyclo-O₇ (heptaoxolane)? Unsurprisingly, the anomeric effects noted for S₇ itself are amplified to the point that the molecule dissociates to O₃ and 2O₂ (singlet).

Finally, singlet state cyclo-O₅ (pentaoxolane)

Here ν_O-O cover the remarkable range from 1519, 1101, 953, 227 to 200 cm^-1 (purple values in diagram above)

These vibrations are associated with the following NBO E(2) Energies; O2_Lp-O3O4_σ* 34.1, O2_Lp-O1O5_σ* 23.3, O4_Lp-O1O5_σ* 20.7, O5_Lp-O3O4_σ* 19.9, O1_Lp-O2O3_σ* 12.1, O3_Lp-O1O2_σ* 10.6.

In addition to these lone pair to σ* interactions, there are two very high σ to σ* interactions (O1O5 to O3O4* 39.9 and O3O4 to O1O5* 33.5 kcal/mol) which strongly suggest very high so-called multi-reference character to the wavefunction.

Although not a molecule that is ever likely to be isolated in a laboratory, cyclo-O₅ still has a lot to teach us.

Note added April 2026: Anomeric effects in linear polysulfide anions such as S₈^2-[cite]https://doi.org/10.5517/ccykj88[/cite] were previously noted on this blog[cite]10.59350/ae8gx-pqy35[/cite].

May 16, 2025
Referencing and citing a science-based blog post.
Back in early 2012, I pondered the relationships between a science-based blog post and a science-based journal article[cite]10.59350/3pbz1-vcd67[/cite]. This was in part induced by my discovering a blog plugin called Kcite, which allows journal articles to be appended to the blog in the form of a numbered reference list. The only required input for Kcite was the DOI of the article (as you can see earlier in this paragraph). For around 500 posts after that moment, I always strove to add such references to my posts. Around 2016, I started including references to data in the form of repository DOIs to sit alongside the journal references, but this feature stopped working a year or two later because of changes in the metadata resolved by the DOI. Kcite itself lasted until January 2024 for this blog, when a required update to the software running the blog (WordPress) meant that it no longer worked and had to be removed as a plugin. Two years ago, Rogue Scholar (Science blogging on steroids) started coming along to the rescue.[cite]10.53731/axtz227-73n18e7[/cite],[cite]10.53731/4bvt3-hmd07[/cite] It provides some amazing automated features and infrastructure to blogs; I will illustrate from those listed on the top page of Rogue Scholar itself:
1. No waiting time — blogs can join via a simple form. Blog posts are automatically archived within minutes after publication on your blog.
2. No fees — blog posts are archived without fees to either readers or authors. Rogue Scholar is sustained by donations and sponsorships.
3. Archived — blog posts are archived by Rogue Scholar, and semiannually by the Internet Archive Archive-It service.
4. Findable — every blog post is searchable via rich metadata and full-text search.
5. Citeable — every blog post is assigned a Digital Object Identifier (DOI), to make them citable and trackable. Rogue Scholar shows citations to blog posts found by Crossref.
6. Interoperable — metadata are distributed via Crossref and ORCID, and downstream services using their metadata catalogs.
7. Reusable — the full-text of every blog post is distributed under the terms of the Creative Commons Attribution 4.0 license.
8. Communities — blog posts automatically become part of communities for your blog, the blog subject area, and topic communities based on blog post tags.
Part of the stuff that goes on behind the scenes is integration with CrossRef (which handles information about journal articles) and that in turn enables insights such as how Blogs abstracted by Rogue Scholar can be cited within journal articles and other blogs and gives some idea of the impact that these blogs are making. Here I illustrate some searches so enabled by having Rogue Scholar abstract a blog;
1. https://rogue-scholar.org/search?q=references:*&sort=newest This shows that Rogue Scholar has captured (currently) 2003 references abstracted from blogs.
2. https://rogue-scholar.org/communities/rzepa/records?q=references:*&sort=newest Of these (currently) 504 have come from mostly identifying the [cite]…[/cite] entries in my own blogs.
3. https://rogue-scholar.org/search?q=citations:*&sort=newest shows all citations of the blogs in the Rogue Scholar community, currently at 519.
4. https://rogue-scholar.org/search?q=citations:10.59350/*&sort=newest This lists the number of citations originating from the DOI prefix 10.59350 (which is that of the Rogue Scholar community).
5. https://docs.rogue-scholar.org/dashboard lists other statistics. These are revealing, indicating that only 6% of posts currently have references, although the uptake of institutional origins (ROR) and researcher ID (ORCID) is much better.
6. The distribution amongst subject areas is 6.8% in the chemical sciences:
Work is under way to resuscitate the Kcite plugin, so that references are once again collected at the bottom of each post. Meanwhile, such a list can instead be found at the archived version of the posts at Rogue Scholar, as for example for this post itself. Also for the future is identifying how many of the references cited in blogs relate to research objects such as journal articles, and how many are instead to data held in e.g. data repositories. Such data reference richness in journal articles themselves is gradually increasing[cite]10.59350/th26w-gev67[/cite],[cite]10.1039/D3DD00246B[/cite] and it is to be hoped that the same will happen in science-based blogs in the future.
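The search URLs listed above all follow the same pattern, so they can be rebuilt from a query string. The following is a minimal sketch using only the Python standard library; the helper name `search_url` is hypothetical, and it does nothing more than reproduce the URL strings quoted in this post (it makes no request and assumes nothing else about how Rogue Scholar handles queries).

```python
from urllib.parse import urlencode

BASE = "https://rogue-scholar.org"


def search_url(query, community=None, sort="newest"):
    """Rebuild a Rogue Scholar search URL of the form quoted in the post."""
    # Community searches use /communities/<name>/records, global ones use /search.
    path = f"/communities/{community}/records" if community else "/search"
    # safe=":*/" leaves the colon, asterisk and slash unescaped, as in the listed URLs.
    return f"{BASE}{path}?{urlencode({'q': query, 'sort': sort}, safe=':*/')}"


if __name__ == "__main__":
    print(search_url("references:*"))                     # all captured references
    print(search_url("references:*", community="rzepa"))  # references from one blog
    print(search_url("citations:*"))                      # all citations of blogs
    print(search_url("citations:10.59350/*"))             # citations by DOI prefix
```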
April 8, 2025
Crystallography meets DFT Quantum modelling.
X-ray crystallography is the technique of using the diffraction of x-rays by the electrons in a molecule to determine the positions of all the atoms in that molecule. Quantum theory teaches us that the electrons are to be found in shells around the atomic nuclei. There are two broad types, the outermost shell (also called the valence shell) and all the inner or core shells. The density of the core electrons is much higher (more compact) than the more diffuse valence shell for all but the hydrogen atom, which only has valence electrons. How does this relate to x-ray diffraction by electrons? Well, core electrons, because of their relative compactness, diffract X-rays more strongly than the valence electrons. This compactness of the core also means that its electron density distribution can be well (but not exactly) approximated by a sphere, with the nucleus at the centre of that sphere. And from this it follows that the density for each atom can be treated independently, the so-called IAM or independent atom model. For example all the carbon atoms in a molecule are approximated as having the same value for the electron density of their core shell. But the IAM approximation is much less good for hydrogen atoms, especially when they are attached to very polar atoms (Li, O, F, etc) and even atoms such as carbon or oxygen have noticeable deviations as illustrated in figure 1 below. [cite]10.1039/d0sc05526c[/cite]

Figure 1 from [cite]10.1039/d0sc05526c[/cite] with caption: Deformation Hirshfeld densities for the carbon (left) and oxygen (right) atoms in the carboxylate group of Gly-l-Ala, i.e. difference between the spherical atomic electron density used in the IAM and the non-spherical Hirshfeld atom density used in Hirshfeld atom refinement=HAR (IAM minus HAR). Red = negative, blue = positive. Isovalue = 0.17 eÅ⁻³.^‡

X-ray crystallography is all about matching the electron density map of a model structure with the electron density map derived from the diffraction data. In “conventional” X-ray crystallography – i.e. that used by most crystallographers – the electron density map of the model is calculated using the IAM approach, where no consideration is given to any distortion of the electron density distribution caused by things like bonds – each atom is treated independently (hence the name). This method especially struggles with hydrogens and hence the inferred position of the hydrogen nucleus at the centre of an assumed spherical distribution is often difficult to obtain accurately. Enter quantum crystallography, whereby a model of the electron density distribution in a molecule can be calculated by solving the Schrödinger equation, nowadays to a very reasonable approximation in a reasonable time (minutes) using so-called density functional theory, or DFT. The resulting electron density map for the model structure might be expected to more closely match reality than the IAM approach. Most obviously affected by this change is the handling of hydrogen atoms. If one considers a C–H bond from an sp³ carbon atom, using an IAM approach the hydrogen atom (i.e. its nucleus or proton) would be placed at the centre of maximum electron density, in the full knowledge that this is not actually where the hydrogen atom nucleus itself is. The direction of the C–H vector would be correct, but the distance would be too short. In the quantum crystallography approach, the positions of e.g. hydrogen atom nuclei are not exactly coincident with the electron density maxima, amounting in effect to non-spherical atoms, thus avoiding the systematic errors seen in the IAM approach. Smaller, but possibly still significant such errors might be expected for e.g. the 2nd row elements and beyond.

Getting reliable hydrogen atom positions has previously required a neutron diffraction study, which is difficult, expensive and time consuming. So the idea of using the non-spherical DFT densities rather than the spherical IAM approach to build a model using X-ray diffraction data is very appealing. But does it work? To test this, we decided to go back to some previously published structures that were handled using the IAM approach, and re-refine them using quantum crystallography. We do not have the corresponding neutron studies to check the answers against, but we can still see how well the structures themselves refine and what new problems this approach might throw up.

Method

The original published structures[cite]10.14469/hpc/2297[/cite] were refined with SHELX-2014[cite]10.1107/S2053229614024218[/cite] which uses an independent atom model (IAM) approach. The results reported here employed NoSpherA2[cite]10.1039/d0sc05526c[/cite], [cite]10.1107/S0021889808042726[/cite] using Hirshfeld atom refinement[cite]10.1107/S2052252514014845[/cite] and selecting Def2-SVP as the (all-atom) basis set^† and ωB97X-V as the DFT method (the results seem relatively insensitive to either), implemented in the ORCA program.[cite]10.1002/wcms.81[/cite] For the first attempts no changes were made to the structures beyond the anisotropic refinement of the now unconstrained hydrogen atoms. For four of the structures a number of the hydrogen atoms went non-positive definite (i.e. one of the radii of a thermal ellipsoid refined to a negative length), which is physically nonsensical and would be a significant barrier to publication. (We don’t quite want to say “unpublishable” as there are almost always exceptions, but a non-positive definite thermal parameter is pretty close to being unacceptable.) For these cases, a second version was created (V2) where all of the hydrogen atoms were refined isotropically but with the distances and thermal parameters still allowed to refine. For AB1709 (18b), this still had the isotropic thermal parameter of one of the hydrogen atoms (H11) go non-positive definite, so for that one hydrogen atom the free isotropic thermal parameter was replaced with a riding one.

The results

We chose a set of seven structures published in 2017[cite]10.1021/acsomega.7b00482[/cite] and refined as noted above using conventional methods. These seven also comprise one of the very first sets of crystal structures for which full diffraction data were made available,[cite]10.14469/hpc/2297[/cite] rather than just the refined structure in the form of a CIF file. The new results have also been deposited[cite]10.14469/hpc/15030[/cite] to augment the record for these compounds. Spreadsheets corresponding to the images below can be obtained by clicking on the image.
1. All seven structures saw a reduction in the final R-factor.[cite]10.14469/hpc/15030[/cite] However, all of the structures also saw a significant increase in the number of parameters (as the hydrogen atoms went from using zero parameters each in a fully riding model to nine parameters each in a fully free anisotropic model). However, all the QM refinements passed the Hamilton test, suggesting that the reduced R-factors do indeed reflect a better model, rather than just being a consequence of the significantly increased number of parameters.
2. All four of the structures containing bromine atoms had a number of the hydrogen atoms go non-positive definite when refined anisotropically. It is not clear exactly why this happened – there does not appear to be any correlation with data quality or intensity (as crudely measured by R(int) and mean I/σ respectively), and though the redundancy for these structures is fairly low (between 1.5 and 1.7), those for the non-bromine structures are not much better (1.5, 2.3 and 4.9). These data sets were the result of experiments designed to collect 98.5% of the symmetry unique data with no consideration for redundancy at all. However, comparison of the initial and secondary versions of the refinements of these four structures does show that the substantial majority of the observed R-factor decrease can be achieved without using anisotropic hydrogen atoms.
3. As regards the precision of the structures, using one C(sp2)–C(sp3) bond as a proxy (the C7–C8 bond) we can see that the estimated standard deviation is either the same or only slightly lower in all seven structures, suggesting that getting lower e.s.d.s would not be a motivating factor for using quantum crystallography.
4. One of the more unexpected results was the variation in F(000). In X-ray crystallography (deliberate emphasis on X-ray, as neutron diffraction is different) F(000) is supposed to be the total number of electrons present in the unit cell, and is used as an overall scale factor for the electron density map. It is very much not supposed to be variable, and any discrepancy would indicate an error in the calculated or reported formula and should be corrected. We do not understand why the QM refinements give a different answer than the IAM ones (some up and some down; normalised to a per molecule basis the range is –1.1 to +2.2), though it seems likely to be associated with cut-offs (boundaries) in measuring the “smeared out” electron density in the QM models. The IAM models all give the expected “correct” values.
5. Based on the checkCIF reports for the QM structures, if quantum crystallography catches on in a big way, then checkCIF will probably need to be updated, there now being a number of high level alerts for long X–H bonds.
6. One of the major areas of uncertainty with quantum crystallography is what/how much data needs to be collected. Symmetry unique data to 0.84 Å seems insufficient, but what would be sufficient — full sphere, redundancy, higher resolution? Would the final results be worth the extra time investment? None of the above aspects are clear at this stage, but it will be interesting to see how the technique develops.
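The Hamilton test mentioned in point 1 can be sketched as follows. In the usual statement of the R-factor ratio test, the ratio of the weighted R-factor of the model with fewer parameters to that of the model with more parameters is compared with a critical value built from an F-distribution quantile. The function names below are hypothetical, the F quantile has to be supplied from tables or a statistics package, and this is not necessarily how the test was run for these seven structures.

```python
import math


def hamilton_critical_ratio(extra_params, n_obs, n_params_full, f_quantile):
    """Critical R-factor ratio, sqrt(b/(n-m) * F + 1).

    extra_params  (b): parameters added on going to the fuller model
    n_obs         (n): number of observations (reflections)
    n_params_full (m): parameters in the fuller model
    f_quantile    (F): F-distribution quantile for (b, n-m) degrees of
                       freedom at the chosen significance level
    """
    dof = n_obs - n_params_full
    if dof <= 0:
        raise ValueError("need more observations than parameters")
    return math.sqrt(extra_params / dof * f_quantile + 1.0)


def passes_hamilton(wr_restricted, wr_full, extra_params, n_obs,
                    n_params_full, f_quantile):
    """True if the R-factor drop is larger than the extra parameters alone explain."""
    ratio = wr_restricted / wr_full
    return ratio > hamilton_critical_ratio(extra_params, n_obs,
                                           n_params_full, f_quantile)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only, not taken from the deposited data:
    # 20 hydrogen atoms going from 0 to 9 parameters each adds 180 parameters.
    extra = 20 * 9
    print(hamilton_critical_ratio(extra, n_obs=5000, n_params_full=400, f_quantile=1.3))
```

The point of the test is visible in the formula: the more parameters are added relative to the remaining degrees of freedom, the larger the drop in R-factor has to be before it counts as a genuinely better model.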
These seven crystal structures also occupy an interesting position for posterity. Data for them have been made available over a span of eight years, illustrating two significantly different refinement methods used during this period, and the original complete diffraction image data remain accessible to allow any completely new analysis to be made in the future. Who knows, maybe in eight years' time an even better method may become available for comparison with the results reported here.

^‡To put this into context, 0.17 eÅ⁻³ would generally be regarded as a pretty low level background noise, similar to the value of the maximum residual electron density a crystallographer might be happy with.
^†The structure which showed the smallest change in R factor on using quantum crystallography, i.e. AB1608b, was re-run with the triple-ζ Def2-TZVPP basis set. This did give lower R factors but by very little (3.38% to 3.36% aniso with npd; 3.39 to 3.38 iso).
March 17, 2025
Finding and Discovery Aids as part of data availability statements for research articles.

Starting around 2016, journal publishers started including mandatory “Data Availability” statements as part of research articles; a typical (dated) example is linked here, including guidelines for how to cite the data itself. I wrote about these aspects last year in a blog post for the RSC journal Digital Discovery[cite]10.26434/chemrxiv-2024-dz2dv[/cite] and here I follow up with more news.

In a recently published article about Direct Amidation Reactions[cite]10.1039/D4SC07744J[/cite], the following version of a data availability statement appears: “An IUPAC FAIRSpec Finding Aid for the NMR spectroscopic data is available at DOI: 10.14469/hpc/14884. A selection of data discovery searches can be found at DOI: 10.14469/hpc/14822”, and it introduces the concept of a Finding Aid. Put simply, knowing where the data supporting a piece of research is available will not necessarily lead you to the particular datum you might be looking for, especially if there is a lot of data. Data is still frequently made available in the form of a supporting document called ESI, and such documents can contain many tens of compounds and possibly hundreds of associated spectra. The aim of a Finding Aid is to help you find the ones you are interested in.

If you are interested in how this works, go explore either of the two links given above. The Finding Aid tool was created by Bob Hanson as part of an IUPAC working party on how to create spectroscopic data in so-called FAIR form (The F of FAIR and the F of Finding Aid are one and the same of course!). This represents its first deployment for a newly published article. The creation tool itself is still α-stage – further tools are being developed – of which more later.

February 19, 2025
Au-pseudocarbyne – an unusual example of twelve-coordination by carbon.

Derek Lowe tells the story of “carbyne”, a potential further allotrope of carbon, comprising linear chains of carbon atoms, C-C≡C-C≡C-C. Whether such a molecule can exist on its own has long been the topic of speculation. Now a report has appeared of a “pseudocarbyne”, stabilised by gold atoms.[cite]10.1038/s41598-024-80359-5[/cite]

The now thankfully almost ubiquitous data availability statement includes the DOI: https://doi.org/10.48349/ASU/3TWEI0 [cite]10.48349/ASU/3TWEI0[/cite] as a data repository source of replication data and one of the files found there is a CIF containing the crystal data. Playing with this, I noticed one unusual feature of this structure, which oddly is not apparently mentioned in the article itself and so I thought I would tease it out here – 12 coordination.

The simplest unit comprises three eight-membered carbon rings, each connected by 4-membered rings to form a local structure with D3h symmetry and hence revealing twelve C-Au bonds of the same length; 2.415Å. Click on the image above to view a 3D model.

A larger section of the (polymeric) structure is shown below, now with D2h symmetry and again with twelve identical C-Au bond lengths.

Is such coordination unusual?^‡ Well, not for metal clusters, including Au clusters. There are in fact 2014 hits (1985 examples where Y is constrained to be a metal, hence 29 where the central atom is NOT a metal) in the Cambridge crystal structure database for the general search X₁₂Y where X and Y can be any atom, with 244 for X=Au and 576 for X=O but none yet for X=C (the current example has not yet appeared in the distributed database). So certainly Au-pseudocarbyne is a unique and unusual molecule. This also shows that 3D coordinates can always be a useful adjunct to articles to allow quick access for spotting perhaps unexpected features with just a single click!

^‡You might be surprised that a similar search finds 138 hits for X₁₄Y and 16 for X₁₆Y.

February 1, 2025
Molecules of the Year 2024: Molecular shuttle in a box.
This is another in the C&E News list of candidates for the Molecule of the Year, Molecular shuttle in a box [cite]10.1002/anie.202318829[/cite]
1. Mirror-image cyclodextrin [cite]10.1038/s44160-024-00495-8[/cite]
2. Molecular shuttle in a box [cite]10.1002/anie.202318829[/cite]
3. Rule-bending strained alkene [cite]10.1126/science.adq3519[/cite]
4. First soluble promethium complex [cite]10.1038/s41586-024-07267-6[/cite]
5. Single-electron carbon-carbon bond [cite]10.1038/s41586-024-07965-1[/cite]
6. Hot MOF for capturing carbon [cite]10.1126/science.adk5697[/cite]
The molecule shown below inside the cavity is coronene. A free energy barrier of ~13 kcal/mol was determined using NMR peak coalescence temperatures, and inferred to correspond to the energy required to move the coronene from one end of the cavity to the other. Here I perform a simple reality check on this result using ωB97XD/Def2-SVP DFT calculations.[cite]10.5281/zenodo.14746877[/cite],[cite]10.5281/zenodo.14746910[/cite],[cite]10.5281/zenodo.14746936[/cite] This functional includes a second generation dispersion correction, which is the primary effect controlling the position of the coronene inside the cavity.

Firstly, the fully optimised geometry of the complex.

A spacefill representation shows the coronene is a perfect fit inside the cavity!

An NCI analysis (non-covalent-interaction) shows the NCI region around the coronene providing the dispersion stabilisation of the complex. The red regions by the way are related to the Ir, which has very different NCI cut-offs compared to C,N,O and shows up as an artefact.

The barrier is induced by steric interactions between the coronene and the t-butyl groups attached to the edge of the cavitand, shown with a red arrow in the spacefill representation below.

Here is the crunch, the calculated ωB97XD/Def2-SVP barrier is ΔG^‡ 5.7 kcal/mol, significantly less than the value of ~13 kcal/mol measured for this dynamic process. But wait, another intermediate was located, shown below, now only 3.8 kcal/mol above the structure shown above. So the energy potential inside the cavity is more complex than just two minima and one transition state!

What are we to make of the disparity between the measured NMR barrier for the shuttle to move from one end of the cavitand to the other and the calculated value? Well, the barrier is likely to mostly arise from dispersion interactions, thus making this molecule a very sensitive test of how accurately the dispersion interactions are being calculated. It is known that the ωB97XD method does rather over-estimate these, and perhaps this is resulting in a barrier which is considerably too low? So this makes this molecule a useful test of potentially more-accurate dispersion corrected methods! The B3LYP+GD3+BJ value for this barrier is 6.6 kcal/mol [cite]10.5281/zenodo.14748031[/cite],[cite]10.5281/zenodo.14748035[/cite]. When new dispersion methods become available, I might add these as well to see if a trend develops.
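To get a feel for how large this disparity is in kinetic terms, the barriers can be converted to rate constants with the Eyring equation, k = (k_B·T/h)·exp(−ΔG^‡/RT). This is only a sketch under stated assumptions: the temperature of 298.15 K is assumed (the coalescence temperature is not quoted here), the transmission coefficient is taken as 1, and the function name is hypothetical.

```python
import math

K_B = 1.380649e-23    # Boltzmann constant, J/K
H = 6.62607015e-34    # Planck constant, J s
R = 1.987204e-3       # gas constant, kcal/(mol K)


def eyring_rate(dg_kcal_mol, temperature_k=298.15):
    """First-order rate constant (s^-1) from a free energy barrier in kcal/mol.

    Assumes a transmission coefficient of 1 and, by default, T = 298.15 K.
    """
    prefactor = K_B * temperature_k / H
    return prefactor * math.exp(-dg_kcal_mol / (R * temperature_k))


if __name__ == "__main__":
    # Barriers quoted in the post: measured, ωB97XD and B3LYP+GD3+BJ.
    for label, barrier in [("NMR, ~13", 13.0), ("ωB97XD", 5.7), ("B3LYP+GD3+BJ", 6.6)]:
        print(f"{label:>14}: ΔG‡ = {barrier:4.1f} kcal/mol, k ≈ {eyring_rate(barrier):.2e} s^-1")
```

At the assumed temperature, the measured and the ωB97XD barriers correspond to rates that differ by more than five orders of magnitude, which is why the molecule is such a sensitive probe of the dispersion treatment.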
January 25, 2025

Molecules of the Year 2024: A crystal structure perspective on anti-Bredt olefins.

Each year C&E News publishes a list of candidates for the Molecule of the Year. For 2024 the list is (in order of votes cast for each)

1. Mirror-image cyclodextrin [cite]10.1038/s44160-024-00495-8[/cite]
2. Molecular shuttle in a box [cite]10.1002/anie.202318829[/cite]
3. Rule-bending strained alkene [cite]10.1126/science.adq3519[/cite]
4. First soluble promethium complex [cite]10.1038/s41586-024-07267-6[/cite]
5. Single-electron carbon-carbon bond [cite]10.1038/s41586-024-07965-1[/cite]
6. Hot MOF for capturing carbon [cite]10.1126/science.adk5697[/cite]

I dealt at length with entry 5 (single-electron carbon-carbon bond) last year, my conclusions rather negating the statement made about it being an example of such a bond. Here I take a look at number 3, A solution to the anti-Bredt olefin synthesis problem.[cite]10.1126/science.adq3519[/cite] Four molecules below (1-4) were identified as examples of anti-Bredt rule compounds from trapping experiments (their properties such as NMR or indeed structures are not reported). Julius Bredt had predicted 100 years ago that such compounds would be particularly unstable.[cite]10.1002/jlac.19244370102[/cite]

One way of putting these molecules into context is to search for any similarly strained alkenes in the Cambridge crystal database. The search query used defined a centroid of the plane defined by the three carbon atoms attached to the bridgehead carbon atom, and then the distance from that centroid to the carbon atom itself. For entirely planar coordination of that atom, the distance would be ~zero and the deviation from zero is one way of measuring how strained the alkene is.
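The centroid distance described above is straightforward to compute from Cartesian coordinates. The sketch below is illustrative only: the function names are hypothetical, the test geometries are idealised (three equal bonds spaced evenly around an axis) rather than taken from any crystal structure, and it is not the query language used for the database search itself.

```python
import math


def centroid_distance(central, neighbours):
    """Distance (same units as the input) from an atom to the centroid of its three neighbours."""
    if len(neighbours) != 3:
        raise ValueError("exactly three attached atoms are expected")
    centroid = tuple(sum(atom[i] for atom in neighbours) / 3.0 for i in range(3))
    return math.dist(central, centroid)


def ideal_neighbours(bond_length, z_component):
    """Three atoms spaced 120° apart around the z axis, at a given bond length.

    z_component is the z part of each unit bond vector: 0 for a planar
    centre, -1/3 for an ideal tetrahedral centre (fourth bond along +z).
    """
    radial = math.sqrt(1.0 - z_component ** 2)
    return [
        (bond_length * radial * math.cos(phi),
         bond_length * radial * math.sin(phi),
         bond_length * z_component)
        for phi in (0.0, 2.0 * math.pi / 3.0, 4.0 * math.pi / 3.0)
    ]


if __name__ == "__main__":
    origin = (0.0, 0.0, 0.0)
    # Idealised planar centre: the distance is essentially zero.
    print(centroid_distance(origin, ideal_neighbours(1.40, 0.0)))
    # Idealised tetrahedral centre with 1.54 Å bonds: one third of the bond length.
    print(centroid_distance(origin, ideal_neighbours(1.54, -1.0 / 3.0)))
```

For an idealised tetrahedral carbon with 1.54 Å bonds the distance is one third of the bond length, about 0.51 Å, which sets the scale against which the strained alkenes can be judged.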

The results of the search (for which fullerenes are excluded as special cases) are shown below. The upward limits of the centroid distance are between ~0.3 – 0.34Å; the outlier at 0.47Å appears to be an error, since the corresponding C=C distance is 1.565Å.

For comparison, the centroid distance to four-coordinate carbon (a central carbon with four attached carbon ligands) is shown below – the most probable value being ~0.51Å.

Since compounds 1-4 were not actually isolated, no crystal structures or NMR data are available. ωB97XD/Def2-TZVPP calculations were performed to establish trends in these properties (FAIR Data [cite]10.14469/hpc/14898[/cite]).

Molecule	Centroid distance, Å	C=C length	ν cm^-1	δ ¹H	δ ¹³C
1 (“ABO 12”)	0.510 (adduct 62 [cite]10.1126/science.adq3519[/cite])	1.346	1611	6.76	189.8
2	0.505 (adduct 58 [cite]10.1126/science.adq3519[/cite])	1.349	1594	6.76	196.8
3	0.341 (adduct 72 [cite]10.1126/science.adq3519[/cite])	1.341	1684	5.84	170.3
4	0.357 (adduct 68 [cite]10.1126/science.adq3519[/cite])	1.336	1694	6.26	173.3

For compounds 1-2, the largest ring of the three associated with the bridgehead carbon is six, whereas for compounds 3-4 it is seven. This is reflected in the values shown in the table above. The centroid distance for the six-ring examples is close to 0.5Å, for which no examples exist in the crystal structure database. The centroid distance for the seven-ring examples is 0.34-0.36Å, for which a number of crystalline examples are evident. It seems likely then that compounds 3-4 stand a better chance of being isolated as such, rather than having their existence inferred from the cycloadducts they form. Perhaps a modification to the experimental procedures might accomplish this? The predicted ¹H and ¹³C spectra are shown in the table to aid identification if this is ever achieved. Also noteworthy are the C=C stretching vibrations, which are lowered significantly for 1-2 compared to 3-4.

It's good to have experimental evidence for compounds that 100 years ago were predicted to be unusually unstable. Perhaps the next step is to isolate them as pure compounds and study their properties.

January 8, 2025

The secrets of FAIR Metadata: optimisation for Chemical Compounds.
The idea of so-called FAIR (Findable, Accessible, Interoperable and Reusable) data is that each object has an associated metadata record which serves to enable the four aspects of FAIR. Each such record is itself identified by a persistent identifier known as a DOI. The trick in producing useful FAIR data is defining what might be termed the “granularity” of the data objects, so that they are the most readily findable and most usefully enable the other three attributes of FAIR.

To set the scene for how to do this optimally, I first set out two extreme examples of FAIR objects relating to chemical spectroscopy such as NMR. These will be directly associated with a journal article describing, for argument's sake, say 50 compounds new to science, with the existence of these data objects identified via a data availability statement appended to the article. Each compound might be characterised by, say, spectroscopic and crystallographic information and perhaps some computational analysis. For the spectroscopic analysis, perhaps 5 types of NMR experiments might be included, giving a total of around 10 separate types of datasets for each compound, or in round numbers let's say 500 data sets for the 50 compounds reported in such an article.^†
- Method A: The data associated with an article takes the form of a ZIP (or other type of compressed) archive containing all 500 of the intended FAIR data sets. The resulting ZIP file is then described with a single metadata record and assigned a single DOI using e.g. the tools of a data repository. That one metadata record has the (mammoth) task of describing all of these datasets, across perhaps ten different kinds of experiment. This type of monolithic object is in fact not unusual, for several reasons. Some repositories impose a significant charge for each deposition, and so the temptation to reduce costs would be to adopt this expedient.
- Method B: The other extreme is to literally deposit all 500 data sets separately and assign 500 DOIs, each with a separate metadata record. The issue now is less how well the metadata record can describe each dataset, and more how to establish the relationships between these 501 objects (the journal article and each dataset). Such relationships could include:
  - that between the compound molecular structure and the dataset
  - that between say the dataset and the type of spectroscopic experiment (e.g. IR, MS, NMR, XRD, Comp)
  - that between different e.g. NMR experiments for the same compound (the nucleus, the pulse sequence, the solvent, etc).
  - These could in total represent a great many individual relationships between both the 500 data sets and the article itself (formally around 501²/2, or roughly 125,000 pairs!)
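The size of that number can be checked with a few lines of arithmetic. This is only a sketch of the counting argument, treating every unordered pair of objects as one potential relationship; the variable names are illustrative.

```python
from math import comb

n_datasets = 500
n_objects = n_datasets + 1          # the 500 data sets plus the journal article

exact_pairs = comb(n_objects, 2)    # every unordered pair of distinct objects
rough_estimate = n_objects ** 2 / 2 # the round-number estimate, 501 squared over 2

print(exact_pairs)     # 125250
print(rough_estimate)  # 125500.5
```

Either way of counting gives about 125,000 potential pairwise relationships.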
Before setting out our solution, I show below how a typical repository such as Zenodo handles the relationships between data objects noted above.

The relation type is selected from a controlled list of about 30, and is entered for each individual metadata record associated with a DOI. So clearly, relationships in the second category (Method B) would have to be individually entered, hardly feasible for 501²/2 entries. And in the first category (Method A), only one relationship between the single large archive of data and the journal DOI can be added. Among the more important relationships in this context are the “Has part” and “Is part of” ones (diagram above).

The use of these two relationships now constitutes Method C.
1. One starts by creating what could be called a top or level 1 entry, which will contain important core metadata information such as the contributing authors, the institute where the data was obtained, the title and overall description of the datasets to come, a license, a date, a declaration of the published article associated with the data and finally the DOI of this metadata record. This top-level entry would also list all the compounds on level 2 for which data is available and each being referenced by a “Has part” declaration via a DOI for each compound.
2. Each compound on level 2 would in turn point back to level 1 by an “Is part of” metadata declaration. Each compound on level 2 would also list the spectroscopic experiments available for that compound, for example the NMR method as part of level 3. That level 3 entry would have an “Is part of” declaration pointing back to the compound level 2 entry.
3. The list of the different NMR experiments on level 3 also has “Has part” declarations pointing to each of the NMR experiments on level 4.
4. Each NMR experiment conducted on level 4 would contain an “Is part of” declaration back to level 3 and a list of “Has part” entries which describe the individual data files available for that experiment in the metadata record for level 4.
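The four levels above can be sketched as a small tree of records, each carrying DataCite-style related identifiers. This is a minimal illustration rather than the repository's actual implementation: the `Record` class, the `link` helper and the `10.0000/demo.*` DOIs are hypothetical placeholders, while `HasPart` and `IsPartOf` are relation types from the DataCite controlled list.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    doi: str
    title: str
    related: list = field(default_factory=list)  # DataCite-style relatedIdentifiers

def link(parent: Record, child: Record) -> None:
    """Declare one parent/child pair in both directions."""
    parent.related.append({"relationType": "HasPart",
                           "relatedIdentifierType": "DOI",
                           "relatedIdentifier": child.doi})
    child.related.append({"relationType": "IsPartOf",
                          "relatedIdentifierType": "DOI",
                          "relatedIdentifier": parent.doi})

# Placeholder DOIs, one record per level (hypothetical values).
top = Record("10.0000/demo.1", "Data for the article (level 1)")
compound = Record("10.0000/demo.2", "Compound 1 (level 2)")
nmr = Record("10.0000/demo.3", "NMR data for compound 1 (level 3)")
expt = Record("10.0000/demo.4", "13C experiment for compound 1 (level 4)")

link(top, compound)
link(compound, nmr)
link(nmr, expt)

for rec in (top, compound, nmr, expt):
    for rel in rec.related:
        print(rec.doi, rel["relationType"], rel["relatedIdentifier"])
```

In a tree of N records there are only N-1 parent/child pairs, so 2(N-1) declarations in total, and a deposition tool can generate them automatically as each child record is created.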
If you wish, you can inspect all “Has part”/“Is part of” declarations in the metadata records for these various levels by invoking e.g. https://data.datacite.org/application/vnd.datacite.datacite+xml/10.14469/hpc/11446 (replacing e.g. 11446 by any of the DOI suffixes shown in red in the diagram below). They are all associated with this published article.[cite]10.1021/acs.inorgchem.3c01506[/cite]
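For anyone wanting to inspect these declarations programmatically, here is a minimal sketch using only the Python standard library. The URL prefix is the one quoted above; the function names and the embedded sample XML (with placeholder DOIs) are assumptions for illustration, and a real record will contain many more elements. Whether the resolver still answers at that address unchanged is also an assumption.

```python
import urllib.request
import xml.etree.ElementTree as ET

BASE = "https://data.datacite.org/application/vnd.datacite.datacite+xml/"

def metadata_url(doi: str) -> str:
    """Build the XML metadata URL for a DOI, following the pattern quoted above."""
    return BASE + doi

def fetch_xml(doi: str) -> str:
    """Download the metadata record (requires network access; not called here)."""
    with urllib.request.urlopen(metadata_url(doi)) as resp:
        return resp.read().decode("utf-8")

def part_relations(xml_text: str):
    """Return (relationType, identifier) pairs for HasPart/IsPartOf entries."""
    root = ET.fromstring(xml_text)
    pairs = []
    # "{*}" matches the element in any XML namespace (Python 3.8 or later).
    for el in root.findall(".//{*}relatedIdentifier"):
        if el.get("relationType") in ("HasPart", "IsPartOf"):
            pairs.append((el.get("relationType"), (el.text or "").strip()))
    return pairs

# Hand-written sample with placeholder DOIs, shaped like a DataCite record.
SAMPLE = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <identifier identifierType="DOI">10.0000/demo.2</identifier>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsPartOf">10.0000/demo.1</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="HasPart">10.0000/demo.3</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsSupplementTo">10.0000/article</relatedIdentifier>
  </relatedIdentifiers>
</resource>"""

print(metadata_url("10.14469/hpc/11446"))
print(part_relations(SAMPLE))
```

Replacing `SAMPLE` with `fetch_xml("10.14469/hpc/11446")` would apply the same filter to the live record.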

What does this use of relational parts declarations achieve? Well, compared to Method A, where everything had to be achieved within a single metadata record (and in practice never is) or Method B, where a very large number of relationships would have to be declared (and again never are), Method C achieves a good balance between the two.^‡ By collecting the metadata information into groups, one can achieve a more readily navigable structure for the information and also allow sub-groups to effectively inherit properties from the higher group.

I end by noting that far too few FAIR data collections associated with published journal articles adopt such procedures, in large part because of very little current exploitation of relationships between the data such as the one used above (“Has part”/“Is part of”). The repository itself has to be carefully designed to do this as automatically as possible and not require the human depositor to invoke each instance by hand (as shown for e.g. Zenodo above). An example of just such a repository is described here.[cite]10.1186/s13321-017-0190-6[/cite]

^†The data sets themselves might be made available in more than one form (for NMR, a Bruker ZIP archive, an Mnova file, a JCAMP-DX format or just a PDF spectrum), thus increasing the number even further.
^‡It reminds me of when I used to teach molecular orbital theory using the Hückel method, which requires a secular matrix to be diagonalised. For e.g. naphthalene, this operation would have to be conducted on a 10×10 matrix, something almost impossible by hand. However, one could use group theory to block diagonalise this matrix into much smaller matrices with the off-diagonal elements between them set to zero, thus considerably reducing the task at hand.
December 11, 2024
Data Discovery: A pick-n-mix library of useful FAIR Data searches – and a call for new search suggestions.
With AI and machine learning needing data in abundance, interest in data discovery is intense. However, this type of discovery is somewhat different from more traditional database searches, in that it is particularly suited to discovery by machines as well as by humans. The discovery searches are conducted using an aggregated and federated metadata store, such as that curated by DataCite. How to construct a suitable search is however still not entirely human-friendly. The starting point for understanding how to search is this resource: XML to JSON mappings, and the XML referred to can be found here.[cite]10.14454/g8e5-6293[/cite] Since the learning curve to construct such data searches can be quite steep, I thought I would share as a library some recent searches I constructed for a talk I am giving. This post is essentially an extension and update of an earlier challenge I was set along these lines and which appeared here.[cite]10.1255/sew.2022.a10[/cite]

You can see that the searches come as components linked by Boolean operators, separated by strings such as +AND+, +OR+ or +NOT+. Essentially like a Lego constructor set, you can create your own searches by combining these components to suit your own needs. No doubt some AI-based procedure will come along that will convert natural language expressions of the intended search into the JSON-friendly strings you see below – at least that is the hope.
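The Lego-like construction can be sketched with three small helpers that join components using the +AND+ and +OR+ separators. The helper names (`term`, `any_of`, `all_of`) are hypothetical; the field names, the ORCID value and the search URL prefix are copied from the searches listed in Part 1 below.

```python
SEARCH_PREFIX = "https://commons.datacite.org/doi.org?query="

def term(field: str, value: str) -> str:
    """A single field:value component."""
    return f"{field}:{value}"

def any_of(*parts: str) -> str:
    """Bracketed group of components joined by OR."""
    return "(" + "+OR+".join(parts) + ")"

def all_of(*parts: str) -> str:
    """Components joined by AND."""
    return "+AND+".join(parts)

orcid = "*0000-0002-7816-0042"
query = all_of(
    term("types.resourceTypeGeneral", "Dataset"),
    any_of(
        term("contributors.nameIdentifiers.nameIdentifier", orcid),
        term("creators.nameIdentifiers.nameIdentifier", orcid),
    ),
)

print(SEARCH_PREFIX + query)
```

This rebuilds search 6 of Part 1 (datasets associated with a specific researcher). Bracketing every OR group explicitly, as `any_of` does, avoids having to rely on operator precedence when AND and OR are mixed.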

Part 1: Data discovery based on general properties such as the reporting Institution, the publisher or the Researcher
1. Find all Data-related Works associated with Cambridge University and the American Chemical Society Publisher
  - https://commons.datacite.org/doi.org?query=((contributors.affiliation.affiliationIdentifier:*013meh722)+AND+(contributors.affiliation.affiliationIdentifierScheme:ROR))+OR+((creators.affiliation.affiliationIdentifier:*013meh722)+AND+(creators.affiliation.affiliationIdentifierScheme:ROR))+AND+relatedIdentifiers.relatedIdentifier:10.1021*
    232 Works
2. Find all Data-related Works associated with Imperial College and the American Chemical Society Publisher
  - ?query=((contributors.affiliation.affiliationIdentifier:*041kmwe10)+AND+(contributors.affiliation.affiliationIdentifierScheme:ROR))+OR+((creators.affiliation.affiliationIdentifier:*041kmwe10)+AND+(creators.affiliation.affiliationIdentifierScheme:ROR))+AND+relatedIdentifiers.relatedIdentifier:10.1021*
    304 Works
3. Find all Datasets OR Collections associated with Imperial College and the American Chemical Society Publisher and the term Pyrazol in the Title or Description
  - ?query=((contributors.affiliation.affiliationIdentifier:*041kmwe10)+AND+(contributors.affiliation.affiliationIdentifierScheme:ROR))+OR+((creators.affiliation.affiliationIdentifier:*041kmwe10)+AND+(creators.affiliation.affiliationIdentifierScheme:ROR))+AND+relatedIdentifiers.relatedIdentifier:10.1021*+AND+(titles.title:*pyrazol*+OR+descriptions.description:*pyrazol*)+AND+((types.resourceTypeGeneral:Dataset)+OR+(types.resourceTypeGeneral:Collection))
    3 Works
4. Find all Datasets OR Collections associated with Imperial College and the American Chemical Society Publisher and the term Pyrazol in the Title or Description and a specified Researcher
  - ?query=((contributors.affiliation.affiliationIdentifier:*041kmwe10)+AND+(contributors.affiliation.affiliationIdentifierScheme:ROR))+OR+((creators.affiliation.affiliationIdentifier:*041kmwe10)+AND+(creators.affiliation.affiliationIdentifierScheme:ROR))+AND+relatedIdentifiers.relatedIdentifier:10.1021*+AND+(titles.title:*pyrazol*+OR+descriptions.description:*pyrazol*)+AND+((types.resourceTypeGeneral:Dataset)+OR+(types.resourceTypeGeneral:Collection))+AND+((contributors.nameIdentifiers.nameIdentifier:*000-0002-3296-6817)+OR+(creators.nameIdentifiers.nameIdentifier:*000-0002-3296-6817))
    1 Work
5. Find Datasets only associated with Imperial College and the term Pyrazol in the Title or Description
  - ?query=((contributors.affiliation.affiliationIdentifier:*041kmwe10)+AND+(contributors.affiliation.affiliationIdentifierScheme:ROR))+OR+((creators.affiliation.affiliationIdentifier:*041kmwe10)+AND+(creators.affiliation.affiliationIdentifierScheme:ROR))+AND+(titles.title:*pyrazol*+OR+descriptions.description:*pyrazol*)+AND+types.resourceTypeGeneral:Dataset
    270 Works
6. Find just Datasets associated with a specific researcher
  - ?query=types.resourceTypeGeneral:Dataset+AND+(contributors.nameIdentifiers.nameIdentifier:*0000-0002-7816-0042+OR+creators.nameIdentifiers.nameIdentifier:*0000-0002-7816-0042)
    8 Works
7. Find Data-related Works associated with Cambridge University, the SubjectScheme FOS (Field of Science) and the Subject term *Chemical*
  - ?query=(subjects.subjectScheme:*FOS*)+AND+(subjects.subject:*Chemical*)+AND+((creators.affiliation.affiliationIdentifier:*013meh722)+OR+(contributors.affiliation.affiliationIdentifier:*013meh722))
    440 Works
8. Establish if a specified publication with a specified author has an associated FAIR Dataset or FAIR Collection:
  - ?query=(types.resourceTypeGeneral:Dataset+OR+types.resourceTypeGeneral:Collection)+AND+(contributors.nameIdentifiers.nameIdentifier:*0000-0002-8635-8390+OR+creators.nameIdentifiers.nameIdentifier:*0000-0002-8635-8390)+AND+(relatedIdentifiers.relatedIdentifierType:DOI+AND+relatedIdentifiers.resourceTypeGeneral:JournalArticle+AND+relatedIdentifiers.relatedIdentifier:10.1021/acs.inorgchem.3c01506)
    
    1 Work
9. Establish how many journal publications by a specified author have an associated FAIR Dataset or FAIR Collection:
  - ?query=(types.resourceTypeGeneral:Dataset+OR+types.resourceTypeGeneral:Collection)+AND+(contributors.nameIdentifiers.nameIdentifier:*0000-0002-8635-8390+OR+creators.nameIdentifiers.nameIdentifier:*0000-0002-8635-8390)+AND+(relatedIdentifiers.relatedIdentifierType:DOI+AND+relatedIdentifiers.resourceTypeGeneral:JournalArticle+AND+relatedIdentifiers.relatedIdentifier:*)
    
    1 Work
Part 2: Data discovery based on chemical properties such as NMR, IR or X-ray spectroscopy
1. Find all Datasets associated with Chemical structure representation and NMR Media types, NMR as a Subject and the title or description term “Pyrazol”
  - ?query=(media.media_type:chemical/x-cdxml+OR+media.media_type:chemical/x-mdl-molfile)+AND+(media.media_type:application/zip+OR+media.media_type:chemical/x-mnova)+AND+(subjects.subjectScheme:*NMR*)+AND+(titles.title:*pyrazol*+OR+descriptions.description:*pyrazol*)
    150 datasets
2. Find all Datasets associated with Chemical structure representation and NMR Media types, NMR Nuclei as a Subject, for 13C and the title or description term “Pyrazol”
  - ?query=(media.media_type:application/zip+OR+media.media_type:chemical/x-mnova)+AND+(subjects.subjectScheme:*NMR_Nucleus)+AND+(subjects.subject:13C)+AND+(titles.title:*pyrazol*+OR+descriptions.description:*pyrazol*)
    41 datasets
3. Find all Datasets associated with Chemical structure representation and NMR Media types, NMR as a Subject, for HMBC Experiments and the title or description term “Pyrazol”
  - ?query=(media.media_type:application/zip+OR+media.media_type:chemical/x-mnova)+AND+(subjects.subjectScheme:*NMR_Expt)+AND+(subjects.subject:HMBC)+AND+(titles.title:*pyrazol*+OR+descriptions.description:*pyrazol*)
    26 datasets
4. Find all Datasets associated with Chemical structure representation and NMR Media types, NMR as a Subject, using solvent “CD₃OD” and the title or description term “Pyrazol”
  - ?query=(media.media_type:application/zip+OR+media.media_type:chemical/x-mnova)+AND+(subjects.subjectScheme:*NMR_Solvent)+AND+(subjects.subject:*CD3OD)+AND+(titles.title:*pyrazol*+OR+descriptions.description:*pyrazol*)
    22 datasets
5. Find all Datasets associated with NMR Media types, NMR as a Subject and InChIKey : OZEYXLXJQKVGCZ-UHFFFAOYSA-L
  - ?query=(media.media_type:application/zip+OR+media.media_type:chemical/x-mnova)+AND+(subjects.subjectScheme:*NMR*)+AND+((subjects.subjectScheme:inchikey)+AND+(subjects.subject:OZEYXLXJQKVGCZ-UHFFFAOYSA-L))
    5 datasets
6. Find all Datasets associated with NMR Media types, NMR as a Subject and the molecular formula component of the full InChI : InChI=1S/2C18H16N2O3.2C2H6O.Ca/c2*1-23-15-9-7-13 etc
  - ?query=(media.media_type:application/zip+OR+media.media_type:chemical/x-mnova)+AND+(subjects.subjectScheme:*NMR*)+AND+((subjects.subjectScheme:inchikey)+AND+(subjects.subject:InChI=1S/2C18H16N2O3.2C2H6O.Ca*))
    5 datasets
7. Find all Datasets associated with Chemical structure representation Media types, IR as a Subject and the title or description term “Pyrazol”
  - ?query=media.media_type:chemical/x-cdxml+AND+(subjects.subjectScheme:*IFD.IR*)+AND+(titles.title:*pyrazol*+OR+descriptions.description:*pyrazol*)
    36 datasets
8. Find all Datasets associated with a Chemical structure representation and Crystal structure Media types, XRAY as a Subject and the title or description term “Pyrazol”
  - ?query=media.media_type:chemical/x-cif+AND+(subjects.subjectScheme:*IFD.XRAY*)+AND+(titles.title:*pyrazol*+OR+descriptions.description:*pyrazol*)
    38 datasets
Part 3: Data discovery based on chemical properties such as Computational modelling
1. Find all Datasets associated with Chemical structure representation and Computation Media types, COMP as a Subject and the title or description term “Pyrazol”
  - ?query=(media.media_type:chemical/x-gaussian-log+OR+media.media_type:chemical/x-gaussian-checkpoint)+AND+(subjects.subjectScheme:*IFD.Comp*)+AND+(titles.title:*pyrazol*+OR+descriptions.description:*pyrazol*)
    4 datasets
2. Find all Datasets associated with Computation Media types and the subject KIE for Hydrogen isotopes.
  - Visual search:
    ?query=(media.media_type:chemical/x-gaussian-log+OR+media.media_type:chemical/x-gaussian-checkpoint)+AND+media.media_type:text/plain+AND+(titles.title:*Endo*+OR+descriptions.description:*Endo*+OR+titles.title:*Exo*+OR+descriptions.description:*Exo*)+AND+(subjects.subjectScheme:*KIE*)+AND+subjects.subject:1H/2H
    17 datasets
  - API Search:
    https://api.datacite.org/dois/?query=(media.media_type:chemical/x-gaussian-log+OR+media.media_type:chemical/x-gaussian-checkpoint)+AND+media.media_type:text/plain+AND+(titles.title:*Endo*+OR+descriptions.description:*Endo*+OR+titles.title:*Exo*+OR+descriptions.description:*Exo*)+AND+(subjects.subjectScheme:*KIE*)+AND+subjects.subject:1H/2H
  - Command line search:
    curl "https://api.datacite.org/dois/?query=(media.media_type:chemical/x-gaussian-log+OR+media.media_type:chemical/x-gaussian-checkpoint)+AND+media.media_type:text/plain+AND+(titles.title:*Endo*+OR+descriptions.description:*Endo*+OR+titles.title:*Exo*+OR+descriptions.description:*Exo*)+AND+(subjects.subjectScheme:*KIE*)+AND+subjects.subject:1H/2H"
One feature of this approach is that the results of the searches, which run across a globally aggregated metadata store, can change with time. So repeating some of the searches at defined time intervals can also give a dynamic indication of how a particular area of data is growing. Other searches are of course designed to give a single hit which probably will not change with time.
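Repeating a search at intervals and noting the hit count could be scripted along the following lines. This is a sketch under stated assumptions: the hit count is assumed to be reported under `meta.total` in the JSON returned by the API search URL shown above, and the function names and tab-separated log format are purely illustrative.

```python
import json
import urllib.request
from datetime import date

API_PREFIX = "https://api.datacite.org/dois/?query="

def total_hits(response_text: str) -> int:
    """Extract the hit count from an API response (assumed to be at meta.total)."""
    return json.loads(response_text)["meta"]["total"]

def run_search(query: str) -> int:
    """Run one search against the API (requires network access; not called here)."""
    with urllib.request.urlopen(API_PREFIX + query) as resp:
        return total_hits(resp.read().decode("utf-8"))

def log_line(label: str, total: int, when: date = None) -> str:
    """One tab-separated row: ISO date, search label, number of hits."""
    when = when or date.today()
    return f"{when.isoformat()}\t{label}\t{total}"

# Offline demonstration with a hand-written response fragment.
sample_response = '{"data": [], "meta": {"total": 17}}'
print(log_line("KIE 1H/2H", total_hits(sample_response)))
```

Appending one such row per search each time the script runs (for example from a scheduled job) would build up a simple time series showing how an area of data is growing.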

The above is based on an interpretation and implementation of the DataCite Schema, one which will eventually need to be agreed by the communities and sub-communities that might wish to use them. So beware, there may be other implementations covering similar data that would not e.g. be found by the above searches, particularly in the way the subject terms above are used. They are therefore included here purely to raise awareness of the potential that such an approach has, along with my observation that I have never attended any presentation where they have been discussed or shown. In the future, it seems likely that these JSON-based searches will themselves get automated and generated by software rather than by a human as here. When that comes, searching will never be the same again!

I also welcome suggestions for new search queries. These might either be accommodated using the existing metadata, or might require new additions to the metadata record. Please send them here as comments.
November 25, 2024

Author: Henry Rzepa
