The Radiocarbon Data were correct (!)

[ADDENDUM: Since this article was first published, Michael Kowalski has kindly made some significant corrections, for which I am very grateful. They concern the dates when critics of the radiocarbon results first raised their objections, and a numerical error on my part which I certainly shouldn’t have made. I could have made these corrections invisibly, but I don’t think that’s fair, so they are, as you will see, prominent.]

Yet another scurrilous podcast has recently been published insinuating that the radiocarbon data produced by the laboratories of the University of Arizona (USA), the University of Oxford (UK) and ETH Zurich (Switzerland) were duplicitously manipulated to produce a fraudulent date, and published with minimal review by the scientific journal Nature. As usual, the protagonist has not considered the data or the calculations himself, but relied on re-analyses of some of the basic data by others, particularly the paper published in Archaeometry in 2019, ‘Radiocarbon Dating of the Turin Shroud: New Evidence from Raw Data,’ by Tristan Casabianca, Emanuela Marinelli, Giuseppe Pernagallo and Benedetto Torrisi, of whom the first two are enthusiastic authenticists and the last two are economic statisticians. I have previously pointed out some of the flaws in these reconstructions of the statistics behind the conclusion stated in the report published in Nature in 1989 (‘Radiocarbon dating of the Shroud of Turin,’ Damon et al.), but I have not published the complete analysis showing that nothing dishonest, or even incompetent, can be demonstrated. So here it is, and anybody with a smattering of mathematical skill can check the whole thing if they want.

PART 1. From Raw Data to Nature Journal.

We must begin with the raw data submitted by the various laboratories, which varied in complexity. Oxford University, for example, sent this:

From the four sets of three dates, weighted averages (μw) and weighted errors (σw) were derived, according to the following formulae:

where xi are individual measurements and σi are individual errors.
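For those who do not have the data sheets in front of them, these are presumably the standard inverse-variance weighting formulae, which is what the description above implies:

μw = Σ(xi / σi²) ÷ Σ(1 / σi²)

σw = 1 / √( Σ(1 / σi²) )

Applied, for example, to Oxford’s three Shroud dates as they appear in Table 1 of the Nature paper (795±65, 730±45 and 745±55), they give 749.2±30.7, which Oxford, rounding to the nearest five, reported as 750±30.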

These calculations give the following dates and errors, which were printed in Nature with two discrepancies:

The figures in red are anomalies, probably simple rounding errors, though other explanations are possible. Neither of the Shroud figures (0-1) is anomalous.

The figures from Zurich are more complicated. On 20 July 1988, Willy Wölfli submitted four sheets of data, of which these are extracts including the dates and errors.

Each of the four samples was split into five pieces, of which three were tested in May and two in June. The results from June (the lower two in each box except for Box 3, where the samples disintegrated during cleaning) were noticeably lower than the results from May, and extensive review was carried out to find out why. Eventually the reason was discovered (“the ages obtained during run 2 have not been corrected for the so-called current dependent effect”), the corrections applied, and a new set of results submitted. Here they are:

The changes affect the lower two measurements in each case. The averages of these measurements, using the formulae above, and the values printed in Nature, are:

Again, there are a couple of discrepancies, but the Shroud sample (0-1) is not among them.

Finally, Arizona University submitted eight sheets of data just for the Shroud sample. They may have done the same for the other three samples, but if so, those sheets are not in the data file published by the British Museum. Their Shroud sample was divided into four sub-samples, labelled AIC(1), AIC(2), AID(1) and AID(2), and each was tested separately twice, the results headed AIC(1), AIC(1)’, AIC(2), AIC(2)’, AID(1), AID(1)’, and AID(2), AID(2)’. Each sheet of these results shows five tests, giving the proportion of radiocarbon measured for each one. These five were averaged and converted into Years BP, as seen on the fragments of each sheet here:

Each pair of BP dates was then combined, giving a weighted average and error for each of the four sub-samples:

In this case, there are no irregularities in the calculations.

The individual sheets for the other three samples are not available, but the summary data arrived as in the table below. At the time, it had not been corrected for δC13 fractionation, which was carried out by British Museum statistician Morven Leese. This is necessary because even in living material, the proportion of C14 to C12 is not quite the same as it is in the atmosphere, as atoms of different masses are not taken up at exactly the same rate. Unfortunately, because the C14 decays, it is not possible to measure what the rate of C14 uptake was unless you know the age of the sample, and it is not possible to calculate the age of the sample unless you know what the original rate of uptake was!

Fortunately, we know that where there is a reduction in the take-up of heavier atoms, the C14 is taken up at about half the rate of the C13. C13 is stable, so a measurement of the C13 of a sample is the same as it was when it was alive, and an estimation of the C14 fractionation can be derived from it. Calibration into years is based on a standard C13 fraction difference of -25‰, so if the measured value is less, then years were added to the calculated date at the rate of 8 years per 1‰. This explains the following table:

Where the δC13 is -25‰, there is no difference between the sample and the standard, so nothing is added. This occurred in Sample 1 (the Shroud) and Sample 4.
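As a minimal sketch of the adjustment exactly as described above (8 years per 1‰ by which the measured δC13 falls below the −25‰ standard), here is the rule of thumb in code. The function name and the second example value are mine, purely for illustration; the precise convention Leese applied may of course have differed.

```python
def fractionation_adjustment(delta_c13_permil, standard=-25.0, years_per_permil=8.0):
    # Years added to the calculated BP date, following the rule of thumb described
    # above: 8 years for every 1 permil that the measured dC13 falls below the
    # -25 permil standard; no change if it does not fall below it.
    shortfall = standard - delta_c13_permil
    return years_per_permil * shortfall if shortfall > 0 else 0.0

print(fractionation_adjustment(-25.0))   # 0.0 -- Samples 1 and 4: no adjustment
print(fractionation_adjustment(-26.5))   # 12.0 -- a purely hypothetical value, 1.5 permil below standard
```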

All the sub-samples of Samples 2, 3 and 4 were measured on different days, so were truly independent, and their averages were calculated in the same way as Oxford’s and Zurich’s, coming to 927±32, 1995±46 and 733±43. Morven Leese, however, was unhappy about combining the “averages of averages” of Sample 1, as she was not sure how independent each pair of values was, given that each pair was measured at the same time, on the same wheel. For that reason, she recalculated a different error, more in keeping with the first two sets, of 31 years, though exactly how this was calculated is not clear from the British Museum file.
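To see where Arizona’s “average of averages” comes from, here is a short sketch combining the four Arizona Shroud sub-sample dates as they appear in Table 1 of the Nature paper (591±30, 690±35, 606±41 and 701±33, if I have them right) with the same weighted formulae. It gives 646±17, the 17 being the error that was later replaced by 31, as discussed in Part 3 below. The variable names are mine.

```python
from math import sqrt

# Arizona's four combined Shroud sub-sample dates as printed in Table 1 of the
# Nature paper: (Years BP, quoted error).
arizona = [(591, 30), (690, 35), (606, 41), (701, 33)]

weights = [1 / err**2 for _, err in arizona]
weighted_mean = sum(date * w for (date, _), w in zip(arizona, weights)) / sum(weights)
weighted_error = 1 / sqrt(sum(weights))

print(round(weighted_mean), round(weighted_error))  # 646 17
```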

SUMMARY OF PART 1. Data published in Nature.

Here is Table 1 from ‘Radiocarbon dating of the Shroud of Turin,’ Damon et al., showing that all the derivations of the data described above are faithfully recorded, with only the four discrepancies previously mentioned:

PART 2. Assembling Single Dates.

From the calculations above, a summary table of the three laboratories’ mean dates for the four samples was produced as Table 2, the first few rows of which are:

Given that their figures were arrived at in slightly different ways by each of the laboratories, exactly how to combine them into single values for each artefact is far from settled, and can be carried out in several different, but equally valid, ways. The next two rows of Table 2 show two methods (a worked sketch follows the list):
1) Unweighted Mean: The average is simply the mean of the three values from the three labs. The error is the standard deviation of the three values divided by the square root of the number of values.
2) Weighted Mean: The average and error are calculated according to the formulae given above.
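As a concrete check of both methods, here is a short sketch using the Shroud row of Table 2 as printed in Nature (Arizona 646±31, Oxford 750±30, Zurich 676±24). It reproduces the published unweighted mean of 691±31 and a weighted mean of about 689±16. The variable names are mine, for illustration only.

```python
from math import sqrt
from statistics import mean, stdev

# Shroud row of Table 2 as printed in Nature: dates (Years BP) and quoted errors.
dates = [646, 750, 676]     # Arizona, Oxford, Zurich
errors = [31, 30, 24]

# 1) Unweighted mean: plain average; error = standard deviation / sqrt(n)
unweighted_mean = mean(dates)
unweighted_error = stdev(dates) / sqrt(len(dates))

# 2) Weighted mean: inverse-variance weighting, as in the formulae of Part 1
weights = [1 / e**2 for e in errors]
weighted_mean = sum(d * w for d, w in zip(dates, weights)) / sum(weights)
weighted_error = 1 / sqrt(sum(weights))

print(round(unweighted_mean), round(unweighted_error))  # 691 31
print(round(weighted_mean), round(weighted_error))      # 689 16
```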

After this, it was noted that the spread of the measurements of the Shroud sample was “somewhat greater than would be expected from the errors quoted.” For example, the standard deviations of each set of three results are respectively 8%, 1%, 1% and 5% of their means. To explore this further, a sophisticated chi-squared test was applied, using a method specifically designed for radiocarbon dating by G.K. Ward and S.R. Wilson, ‘Procedures for Comparing and Combining Radiocarbon Age Determinations: a Critique,’ published in Archaeometry in 1978. Their formula is:
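Reproduced here in the notation explained immediately below (this is the standard form of Ward and Wilson’s test statistic, which is what their description amounts to):

T’ = Σ (Ai − Ap)² / Si²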

T’ is the chi-square, Ai each individual value, Ap the weighted mean and Si each individual error. The results of this calculation, from the three values of each sample, are 6.35, 0.14, 1.30, 2.39. The significance of these values can now easily be found online, but in 1988 had to be looked up in mathematical tables, specifically a chi-squared distribution table such as this:

from ‘Statistical Tables for Science, Engineering and Management,’
J. Murdoch and J.A. Barnes, MacMillan, 1970

At 2 degrees of freedom (the number of values – 1), our chi-squared values have a significance of 0.05, 0.95, 0.50 and 0.30, or 5%, 95%, 50% and 30%, although the second was printed as 90% in the Nature paper.
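For anyone who would like to check this without a book of tables, here is a short sketch using, once again, the Table 2 Shroud figures quoted above (646±31, 750±30, 676±24). It reproduces the chi-squared value of about 6.35 and the ‘exact’ significance of roughly 4.2%, which a printed table can only resolve to about 5%. For 2 degrees of freedom the tail probability is simply e^(−T’/2); the variable names are mine.

```python
from math import exp

# Table 2 Shroud row as printed in Nature: dates (Years BP) and quoted errors.
dates = [646, 750, 676]     # Arizona, Oxford, Zurich
errors = [31, 30, 24]

# Weighted mean (Ap), then Ward & Wilson's statistic T' = sum of (Ai - Ap)^2 / Si^2
weights = [1 / e**2 for e in errors]
Ap = sum(d * w for d, w in zip(dates, weights)) / sum(weights)
T = sum((d - Ap)**2 / e**2 for d, e in zip(dates, errors))

# With three values there are 2 degrees of freedom, for which the chi-squared
# tail probability is exactly exp(-T'/2).
significance_percent = 100 * exp(-T / 2)

print(round(T, 2), round(significance_percent, 1))  # 6.35 4.2
```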

SUMMARY OF PART 2. Nature Calculations.

From all the above, there is no reason to suspect dishonesty or incompetence from the authors of the Nature paper or any associated statisticians. Almost all the figures have been calculated correctly, and there are reasonable explanations for the few anomalies.

PART 3. Liar, Liar, Pants on Fire!

Nevertheless, numerous authenticists have enjoyed casting aspersions on these results, and claimed – or at least implied – that they must have been deliberately falsified in order to fit a preconceived medieval provenance. Discussion usually begins with a conference in Trondheim, where a protocol for dating the Shroud was discussed, and moves to Turin, where, according to Harry Gove, a protocol was finally agreed upon. The story moves on to the rejection of STuRP, the incompetent selection of the sample site, the failure to measure the sample, the secret placing of the samples in their sealed steel canisters, and eventually the dishonesty and/or incompetence of the AMS laboratories. None of this is wholly honestly reported, in an attempt to discredit the data as it was finally published. But for the purposes of this article, none of it is relevant. From a statistician’s point of view, small pieces of the Shroud did end up in the three laboratories, were tested according to their normal procedures, and some data was obtained. This was collated as I have detailed above, checked and agreed by Morven Leese of the British Museum and Anthos Bray of the Istituto di Metrologia ‘G. Colonnetti’ in Turin, and broadly confirmed by at least two peer-reviewers, who said:

  1. “I have gone over this data in detail, and have several minor questions, but I feel that in general the data treatment has been appropriately carried out.”
  2. “The sampling strategy, the technical aspects of the measuring process, the statistical interpretation and the scientific analysis all are all in good shape.”

As the first reviewer reported, and as I have shown above, there are minor idiosyncrasies, but the overall findings are robust. Nevertheless, there have been challenges, more or less sensible, very occasionally in search of clarification, but mostly with a view to demonstrating incompetence or dishonesty. These challenges focus on two specific areas, which we shall investigate further.

The first is why, and how, an apparent error of 17 years was changed to 31 years in the case of Arizona’s collated results. Slightly counter-intuitively, the larger the error, the more easily two different values can be reconciled. After all, if Laboratory A insists on a value of 10 and Laboratory B insists on 12, then one of them must be wrong, but if A says 10±2 and B says 12±2, then there is sufficient overlap between the two ranges for them both to be considered valid. The implication drawn by authenticists is that the change was deliberately, and dishonestly, made to produce such a valid overlap in the case of the Arizona values. There is no justification for such an accusation, but I can allow that it is based on ignorance as much as malice, and that the data from Arizona released by the British Museum are insufficient to clear the matter up conclusively.

The Arizona Laboratory submitted eight results from four sub-samples of the Shroud, divided into two measurements each, but five results from five sub-samples of each of the three controls. The eight Shroud measurements came from four pairs tested on four different days, whereas the five sub-samples from each of the controls were each tested on five different days. For this reason, Morven Leese, the British Museum statistician, wondered how independent the Shroud measurements were. The pair measured on each day only differed in their positions on the little wheel of graphite pellets, as can be seen from the headings of each of the eight printouts:

Because of this, Leese was reluctant to treat the eight measurements as if they had all been separately measured, and discussed the problem with Douglas Donahue:

Unfortunately the response to this query is not among the files studied, so we have to guess what finally transpired. One possibility, which at least results in the error finally published, is that the second of each of the four pairs was simply ignored for error calculation purposes. The standard error of the four remaining values (the standard deviation of the four values divided by the square root of the number of values) is 30.548, which could have been published as 31. Other methods of ascertaining the overall error are also possible, but we don’t have enough information to decide which was actually used. This does not mean that we need suppose Leese guilty of incompetence or dishonesty.

The second accusation of fraud relies on the chi-squared value of 6.35 being given a significance of 5%, which is a significant figure in the world of statistics. Broadly, in order to reject the hypothesis that two figures are from the same population, you have to show that there is a less than 1 in 20 chance of their being so. Of course this is no more than a convention, and from time to time, extraordinary coincidences occur, but unless they are themselves investigated further, a 95% chance that two figures are different is not good support for a claim that they are the same. At first sight it seemed suspiciously fortunate that the Shroud calculations just made it to the 5% cut-off point, and positively dishonest when it was discovered that entering the chi-squared value (6.35) into an online calculator shows that the real significance is not 5% but 4.24%, which even if rounded, comes to 4% rather than 5%, below the level necessary to maintain that the samples were the same.

[Actually, the significance, using the data I used, is not 4.24 but 4.18, calculated according to the formula: SIGNIFICANCE = 100 × e^(−χ²/2). I don’t remember how I got the wrong answer, and I’m grateful for the correction.]

But this is faulty reasoning, as I have shown. In 1988, the significance was not found online, but by looking up the chi-square value in a book of tables, where any value between 5.3 and 6.7 matches a significance of 5%. Furthermore, had Morven Leese been constrained dishonestly to adjust the Arizona result to produce a more precise 5% significance, she would have given their error as 34, not 31.
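That last point is easily verified. Keeping the same Table 2 dates quoted earlier but replacing Arizona’s error of 31 with 34 brings the chi-squared statistic down to almost exactly 5.99, the 2-degrees-of-freedom value that the tables equate with precisely 5%, and the significance up to just under 5%. A quick sketch (the function name is mine, and the figures again assume the Table 2 values as printed in Nature):

```python
from math import exp

def ward_wilson(dates, errors):
    # Ward & Wilson's T' and its tail probability (%), for 2 degrees of freedom.
    weights = [1 / e**2 for e in errors]
    ap = sum(d * w for d, w in zip(dates, weights)) / sum(weights)
    t = sum((d - ap)**2 / e**2 for d, e in zip(dates, errors))
    return round(t, 2), round(100 * exp(-t / 2), 1)

print(ward_wilson([646, 750, 676], [31, 30, 24]))  # (6.35, 4.2) -- as published
print(ward_wilson([646, 750, 676], [34, 30, 24]))  # (6.01, 4.9) -- with a 34-year Arizona error
```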

With all this in mind, I’m afraid I look somewhat askance at some comments by Michael Kowalski, in an interview on The Gracious Guest podcast, broadcast on 16 September 2023. Of course he’s not a statistician himself, so is only expressing views he has read elsewhere, and he is certainly not the only proponent of the ‘Evil Scientists’ hypothesis, but since that podcast was the trigger for this post, I’m afraid he’s the fall-guy for my criticism.

1 – “If the significance level is less than 5%, then most statisticians would just simply reject the test as being invalid.”

This doesn’t really mean anything. A significance can be a valid and informative result regardless of its value. It is certainly true that if the samples had come from, say, both the side strip and the main body of the cloth, then the hypothesis that they were both cut from the same sheet of fabric would have to be rejected, but as it was known that all the samples had come from the same sheet, then some other explanation would have to be found. Morven Leese herself noted, in a preliminary report, that “the chi-squared statistics are not significant except for sample 1. This is due to the 100 years’ difference between Oxford’s result and the other two results.” At the time, it should be noted, it was not clear that Oxford’s and Arizona’s samples were taken from the two ends of the dissected strip, with Zurich in the middle. In that case, a sensible explanation was that it was “unlikely that the errors quoted by the laboratories for sample 1 fully reflect the overall scatter.” For what it’s worth, here are the 16 results given by the laboratories, with their error ranges (blue), and the overall averages of each laboratory (red), with their error ranges (green), all as quoted in the Nature paper:

It is worth pointing out one or two features. Firstly, that at around 690 Years BP, and also 730 Years BP, nine of the sixteen individual values overlap. Secondly, that the error bars on the overall averages (green) are much shorter than those of the individual values (blue). This is because it is much more likely that the true value lies where the measurements overlap than where they don’t.

2 – “It’s a round number, is 5. I wonder what it was before it was rounded. […] We can recalculate this from the other information that’s contained in that table, and when we do that, we find that it’s not 5%, but 4.18%, […] which does not round up to 5. […] I’m tempted to use the term, it’s a schoolboy error.”

This is Junior Common Room banter, not sensible analysis. Sure, anybody can use an online calculator and find the precise value, and it is 4.24% [see above], but it seems that in 1988 the easiest thing to do was to look it up. In the table above, for instance, we find that successive chi-squared values of 4.605, 5.991 and 7.378 correspond to probabilities of 10%, 5% and 2.5% respectively. As 6.35 is closer to 5.991 than it is to 7.378, it indicated a probability of 5%, although in her own notes, Leese was careful to put “<5%.” Either way, the weak correlation was both noticed and carefully discussed, and appropriate consideration given to the final conclusion.
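Those three table entries are themselves easy to confirm: at 2 degrees of freedom the chi-squared value corresponding to a tail probability p is simply −2 × ln(p), so, as a quick check:

```python
from math import log

# At 2 degrees of freedom, the chi-squared value for tail probability p is -2 x ln(p).
for p in (0.10, 0.05, 0.025):
    print(p, round(-2 * log(p), 3))
# prints: 0.1 4.605 / 0.05 5.991 / 0.025 7.378 -- matching the Murdoch & Barnes table above
```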

3 – “This was spotted immediately. There were a couple of scientists in particular, who were able to point out errors in the test report fairly shortly after it was produced. One was an Italian engineer named Ernesto Brunati […] and another was a Belgian chemist called Remi van Haelst.”

Immediately? Fairly shortly? Remi van Haelst wrote several papers from 1997 onwards, and Ernesto Brunati chimed in in 2005.

[This was poor research on my part and I’m grateful for the following correction: Several sources note that both Remi van Haelst and Ernesto Brunati were present at the Symposium Scientifique International sur le Linceul de Turin in Paris in September 1989, and presented their cases as best they were allowed. The British Society for the Turin Shroud newsletter of September 1989 also mentions Bruno Bonnet-Eymard, a conspiracy theorist whose views have not gained much traction, Dr Marie-Claire Van Oosterwyck, who thought contamination was responsible for the medieval date, Dr Eberhard Lindner, who thought the Shroud had been irradiated with neutrons, and Arnaud Upinsky. Van Haelst and Brunati, particularly, focussed on the statistical aberrations.]

Van Haelst’s first paper (‘Radiocarbon Dating The Shroud: A Critical Statistical Analysis’) rejects the statistical methods used by Morven Leese (“The British Museum did not use the classical method, but a NEW method, developed by the Australian scientists Drs. Wilson & Ward. (Archeometry 20. 1978).” – Is ten years old NEW?) and instead uses his own, derived from ‘Perry’s Chemical Engineers’ Handbook,’ first published in 1934 but re-issued several times since. After pages of calculation rejecting the British Museum’s statistics, he finally arrives at “A New Statistical Assessment for AMS Radiocarbon Dates,” and says: “Any date between those limits 504-859 can be the TRUE date,” and finally “The error range for 691±31 at 95% confidence will be: 691-(4.47×31) = 552 < 691 < 830 = 691+(4.47X31).”

In other words, Remi van Haelst utterly rejected a date of 1260-1390 in favour of 1220-1420. With friends like him…

Ernesto Brunati’s paper is in boisterously abusive Italian, and focuses on the decision to change the Arizona error of 17 to one of 31. He is so convinced that the whole dating procedure was a conspiracy of fraud that there is little point in his analysing the data, so he doesn’t.

4 – “The Oxford result was slightly wrong as well.”

No, it wasn’t. It was apparently Oxford’s practice to round all its figures to the nearest five (see their raw results at the top of this post). That being so, it is quite incorrect to average those figures to a precision greater than five. To adjust ‘750±30’ to ‘749±31’ is ignorant, not clever.

5 – “It does beg the question: how on earth did this test report ever manage to be published in what was one of the leading scientific journals at the time? […] These are fairly basic errors so one has to wonder how it got through? Because it seems to indicate that either myself and many, many other people before me who pointed out these errors – what appear to be very serious flaws – either these flaws are not as significant as we believe them to be, or perhaps the peer review process, well, that was flawed in some way itself. Maybe they didn’t scrutinise the report as carefully as possible.”

Got it in one. The flaws, as you perceive them, are trivial.

Then on we go to the idea that the Nature report was in some way rushed through: received on 5 December 1988 and published on 16 February 1989, only 73 days later.

6 – “For a peer-reviewed article, that is not a long time; that’s a few weeks. Usually it take several months; six months, sometimes more. So it appears from that that it was maybe rushed through.”

Does it? The Shroud paper appeared in Volume 337 of Nature, covering January and February 1989. The three volumes 336, 337 and 338 cover from November 1988 to April 1989 inclusive and include exactly fifty “Articles,” which are the peer-reviewed scientific meat of the Journal. The average length of time between reception and publication was 114 days, mostly (one standard deviation) between 69 and 153. 10% were quicker than the Shroud paper. Only 10% took “six months, sometimes more.” That’s not “usually.”

I’m sorry that it has become a feature of almost all authenticist criticism of the medieval case (although there are notable exceptions) not to investigate its findings analytically, but simply to assume dishonesty, ignorance, incompetence, psychological factors and other personal failings in those who present them. Not only is this an unjustified assault on some of our most celebrated scientists, but it seriously damages the credibility of the authenticist case.