Liars, Damned Liars and Statisticians

The moment the eminent scientific journal Nature published ‘Radiocarbon Dating of the Shroud of Turin’ on 16 February 1989, strenuous efforts began to discredit it. The protocol adopted, the selection of the sample site, the size and number of the sub-samples and the personal character of those involved have all been criticised, sometimes quite vehemently, and the statistical derivation of the final date from the three laboratories’ raw data has been analysed and worried to bits in an attempt to make it demonstrate dishonesty. This post will review that side of the attack, and show that, although some kind of chronological gradient along the sample now seems credible, in general the conclusions of the British Museum remain as robust as ever.

The problem, which was flagged by the Nature paper itself, was that “the spread of the measurements for sample 1 [the Shroud] is somewhat greater than would be expected from the errors quoted.” In other words, the Oxford result, of 750 years ‘BP’ ± 30, did not overlap the Arizona result, of 646 years BP ± 31. Had these samples come from different cloths, it would have had to be concluded that they were of different dates. The anomaly was quantified using a test recommended by a standard work on radiocarbon statistics, ‘Procedures for Comparing and Combining Radiocarbon Age Determinations: A Critique,’ by G.K. Ward and S.R. Wilson. This was a chi-squared test, yielding a statistic whose associated probability gave a less than 5% chance that three measurements as scattered as these could all have come from the same textile. However, as it was evident that the samples actually were from the same textile, the anomaly was hardly resolved. Ironically, had the errors been recorded as larger, say ± 60, the two results would have overlapped, and the problem would not have arisen. Consequently, the British Museum decided to treat the quoted errors as underestimates and set them aside, and instead to take the simple average of the three figures given by the three labs (Oxford: 750, Zurich: 676 and Arizona: 646), which was 691, deriving the error from the standard deviation of the three figures divided by the square root of the number of data (three), giving a value of 31. This is clearly explained in the paper.
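For anyone who wants to check the arithmetic, here is a minimal sketch in Python (my own, not part of the Nature paper’s workings), using the three lab means and errors as printed in Table 2. The exp(−χ²/2) shortcut is the exact tail probability for a chi-squared variable with two degrees of freedom.

```python
import math

# Lab means and quoted errors for the Shroud sample (Nature, Table 2)
results = {"Oxford": (750, 30), "Zurich": (676, 24), "Arizona": (646, 31)}

# Weighted mean and its internal error (the Ward & Wilson procedure)
weights = {lab: 1 / s ** 2 for lab, (_, s) in results.items()}
mu_w = sum(weights[lab] * x for lab, (x, _) in results.items()) / sum(weights.values())
sig_w = 1 / math.sqrt(sum(weights.values()))

# Chi-squared test of homogeneity (two degrees of freedom)
chi2 = sum(((x - mu_w) / s) ** 2 for x, s in results.values())
p = math.exp(-chi2 / 2)  # exact tail probability for 2 d.f.

# Unweighted mean, with the error taken from the scatter instead
xs = [x for x, _ in results.values()]
mean = sum(xs) / len(xs)
sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))
se = sd / math.sqrt(len(xs))

print(f"weighted mean {mu_w:.0f} ± {sig_w:.0f} BP")  # 689 ± 16
print(f"chi-squared {chi2:.1f}, p ≈ {p:.2f}")        # 6.4, p ≈ 0.04
print(f"unweighted mean {mean:.0f} ± {se:.0f} BP")   # 691 ± 31
```

The 4% probability is the “less than 5% chance” referred to above, and the unweighted figures are exactly the 691 ± 31 of the Nature paper.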

The three control samples’ values and errors all did overlap, so the overall error for each of them was based on the errors quoted by the laboratories. In fact both the weighted and unweighted means for all four samples were printed, for comparison. The difference between the two means, in the case of the Shroud, was two years.

At the time it was not clear in which order the samples had been cut from the original strip excised from the Shroud, so there were six possible ways in which they could have been arranged (AOZ, AZO, OAZ, OZA, ZAO, ZOA), all equally probable. When the actual arrangement became clear (Oxford, Zurich, Arizona, starting from the end of the strip nearer the end of the Shroud), there was then a prima facie case for considering that the samples might lie along a chronological gradient, with the Oxford end the oldest. Assuming that the original cloth was manufactured from homogeneous material, it seems that some form of contamination has sequentially affected it, or at least that part of it. This calls the precision of the date into question, as it is unknown which end, if either, is the least contaminated (giving the truer date), whether the contamination has led to an older or younger apparent date, or how much contamination there was. In spite of determined investigation by Joe Marino, and some very precise nuclear radiation modelling by Bob Rucker, these questions have not been resolved by examination of the samples or material derived from them.

With nothing but the data to go on, statisticians have set themselves to analyse it more thoroughly: sometimes simply to determine how it was analysed before, by the three laboratories and the British Museum, each of whom used slightly different statistical tools; sometimes in an attempt to discredit the laboratories and the British Museum as incompetent or dishonest; and sometimes in an attempt to claim that the entire enterprise and its conclusion should be abandoned as worthless, in the hope of strengthening the case for authenticity.

The data revealed itself in phases. To start with, the Nature paper presented two tables. Table One listed the individual dates each laboratory had derived for the pieces into which it had cut its sample; although each laboratory had been given the same amount of cloth, Arizona tested four pieces, Oxford three and Zurich five. Table Two gave single dates for each textile (the Shroud and three controls) from each laboratory, and combined them into one overall date per textile. A third table converted the radiocarbon dates, conventionally expressed in ‘Years Before Present’, into calendar years and ranges.

A controversy arose over Arizona’s results when it was revealed first that they had retained two small pieces of their Shroud sample (so had actually only tested about half of it, divided into four pieces), and then that they had submitted two dates for each of the four pieces they tested, which the British Museum asked them to combine into one date per piece, as the other labs had provided. This they did, as appears in Table One of the Nature paper, but although their procedure was entirely conventional, they have come under severe criticism from authenticist statisticians, who claim that a different method should have been used and that Arizona’s version amounted to dishonesty.

Finally some raw results and correspondence between the laboratories and the British Museum were elicited from the British Museum by Tristan Casabianca, who used them as a basis for further attempts at discrediting the original statisticians’ competence, and by me, to show that nothing either incompetent or underhand had occurred.

J. ANDRÉS CHRISTEN
The first statistician to comment on the Shroud data was Andrés Christen, as part of a long paper for Applied Statistics (1994) called ‘Summarising a Set of Radiocarbon Determinations: A Robust Approach.’ He was writing about determining accuracy from radiocarbon dates in general, with particular attention to the possibility of outliers, and the Shroud was his “Example 2.” After running the data through a computer program designed to find outliers, he concluded that the earliest and the latest dates of the twelve pieces sampled could be outliers, but that there was little difference in the overall result whether they were included or not. Without them, “the whole of the posterior distribution for θ (100% HPD [Highest Posterior Density] region) lies between 1270 and 1310 AD (680 – 640 BP) […] Thus it seems likely that it was made some time between 1300 and 1350 AD, just as concluded in Damon et al (1989).”

As it happens, if the two extremes are excluded, the simple mean of the ten remaining results is 688, and the standard deviation is 48. The two extremes lie fractionally outside two standard deviations from the mean, which according to some criteria means they do classify as outliers.
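That claim is easy to verify. Below is a short Python sketch of mine using the twelve per-piece dates from Table 1 of the Nature paper; the grouping by laboratory is only for legibility.

```python
import statistics

# Per-piece dates (years BP) from Table 1 of the Nature paper
arizona = [591, 690, 606, 701]
oxford = [795, 730, 745]
zurich = [733, 722, 635, 639, 679]
dates = sorted(arizona + oxford + zurich)

inner = dates[1:-1]            # drop the two extremes (591 and 795)
mean = statistics.mean(inner)  # 688
sd = statistics.stdev(inner)   # ~48

for extreme in (dates[0], dates[-1]):
    z = abs(extreme - mean) / sd
    print(f"{extreme} BP lies {z:.2f} standard deviations from the mean")
```

Both extremes come out a little over two standard deviations away (about 2.04 and 2.25), which is what ‘fractionally outside’ means here.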

REMI VAN HAELST
The first detractor to enter the fray was Remi van Haelst, an excitable Belgian statistician and convinced authenticist. His ‘Radiocarbon Dating the Shroud: A Critical Statistical Analysis’ (at shroud.com, 1997) attempted to discredit the competence of the British Museum statisticians by using different methods of analysis, but although, inevitably, the different methods give slightly different results, they really only show that there was nothing wrong with the methods used in the first place. Insisting, however, that his methods are the only correct ones, and without fully understanding the laboratories’ calculations, van Haelst writes in his conclusion that various arbitrary “corrections” were made by them and the British Museum simply to support their “95% confidence” in their final result. He wonders why the errors, so obvious to him, were not spotted by the peer-reviewers of the paper. Had he taken the trouble to understand the methods involved, he would have found his objections answered.

At the end of his paper van Haelst determines what he calls the true date: “any date between 504 and 859” BP. Had he converted that into calendar years, he would have come up with a period between about 1170 AD and 1440 AD, whose midpoint is about 1300. Although the range within which 95% confidence can be assumed is rather larger than that quoted by the Nature paper, the late medieval provenance is as secure after all his perorations as it was before.

Van Haelst wrote several versions of his analysis, culminating in “Radiocarbon Dating the Shroud of Turin: A Critical Review of the Nature Report (authored by Damon et al) with a complete unbiased statistical analysis” (at sindone.info, 2002). It is unstructured, rambling, emotional and incoherent, and littered with unnecessarily capitalised emphasis and sly ellipsis, but even so, it demonstrates no more inaccuracy in the Nature paper than his first paper had five years previously.

ERNESTO BRUNATI
Van Haelst was followed by Ernesto Brunati (‘Altro che rammendi! La datazione della Sindone è tutta un falso’ – ‘Darns indeed! The dating of the Shroud is a complete fake’ – at sindone.info, 2005). Rather than attempt to understand the methods used by the individual laboratories, Brunati calculated his own averages and errors from the data provided in Nature, and, finding them different, accused both the laboratories and the British Museum of deliberate fraud. His main bone of contention was the error of ±31 given by the Arizona laboratory for its overall mean date of 646, which Brunati insists should have been ±17. He was not to know that for all the Arizona results – as for those of the other laboratories – two errors were calculated: the standard deviation from the scatter of the results (the basis of the quoted figure), and a standard deviation from simple counting statistics – in this case 24. The larger was used as the better estimate of the real error. This was adjusted for the δ13C factor, partially by Arizona, giving 26, and later by Morven Leese of the British Museum, giving the quoted error of 31.

MARCO RIANI AND ANTHONY ATKINSON
The next analysis of the results was by Marco Riani and Anthony Atkinson, et al. (‘Regression Analysis with Partially Labelled Regressors: Carbon Dating of the Shroud of Turin,’ Statistics and Computing, 2012). They modelled nearly 400,000 possible arrangements of the 12 dates along the original sample strip, and established a strong statistical probability that there was a chronological gradient along it. In other words, the strip tested progressively younger from one end to the other, and neither the Oxford nor the Arizona results were necessarily individually erroneous simply for lying too far above or below the mean. The fact of the gradient would explain why the Oxford and Arizona results do not overlap, and obviates any chi-squared comparison, but leads to the inevitable question of what might have caused it.

GIAN MARCO RINALDI
Also in 2012, Gian Marco Rinaldi’s ‘La Statistica della Datazione della Sindone’ (‘The Statistics of the Dating of the Shroud’, at sindone.weebly.com) was published: a comprehensive analysis and explanation of all the calculations by which the Nature paper could have achieved the results it did, setting out in irrefutable detail exactly why there is nothing suspicious or underhand about its information.

TRISTAN CASABIANCA
After the release of the British Museum data, Tristan Casabianca, Emanuela Marinelli (both convinced authenticists and neither of them statisticians), and two statisticians from the University of Catania (Giuseppe Pernagallo and Benedetto Torrisi), published their own analysis, called ‘Radiocarbon Dating Of The Turin Shroud: New Evidence from Raw Data’ (Archaeometry, 2019). It is seriously biased almost from the start, but even so, the authors admit that “our statistical results do not imply that the medieval hypothesis of the age of the tested sample should be ruled out.” Quite so. However, since this paper is the most thorough in its treatment of all the data currently available to us (with the exception of Rinaldi’s, above), it will now be the subject of some searching investigations of my own.

Let us begin with Casabianca et al’s Table 1, which sets out the basis for his subsequent analyses, and which is deliberately – I’m tempted to say dishonestly – misleading.

Arizona: We note that the sole difference between Arizona Raw 1 and Arizona Raw 2 is in the adjustment of two error figures (from ±40 to ±59 and from ±37 to ±57). In fact Arizona Raw 1 was never submitted by the Arizona laboratory, but has been derived by Casabianca from eight data printouts (included in the British Museum document bundle), two of which had been modified by hand. A footnote to Arizona’s submission explains the modifications: “For these results, the average of the ratios of OxII/OxI standards was different by more than two standard deviations from its correct value. (The average differed from the correct value by 1.0%.) An additional error of ±0.5% was added quadratically to the regular standard deviation of these measurements.”
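The arithmetic of that footnote can be checked against the two amendments. In the Libby convention a radiocarbon age is −8033 ln(ratio), so a fractional error ε in the measured ratio contributes roughly 8033 × ε years to the age error; on that assumption, this little Python sketch of mine recovers the term that must have been added in quadrature.

```python
import math

# The two hand-amended Arizona errors (years BP): (before, after)
adjusted = [(40, 59), (37, 57)]

for before, after in adjusted:
    # The term that must have been added in quadrature
    added = math.sqrt(after ** 2 - before ** 2)
    # Expressed as a fractional error in the measured ratio,
    # via age = -8033 * ln(ratio)
    print(f"±{before} -> ±{after}: added ±{added:.0f} years ≈ {added / 8033:.2%}")
```

Both amendments imply the same extra term, about ±43 years, or a little over 0.5% of the measured ratio – closely consistent with the footnote’s explanation, and hardly suggestive of retrospective tuning.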

Casabianca rejects this explanation on the grounds that no similar adjustment was necessary to any of the control samples, and implies that the real reason was to make the Arizona data more compatible with that of the other laboratories – presumably at a much later date, when the other laboratories had finished their own tests. This is implausible and irresponsible. The Arizona tests were carried out between 6 May and 2 June, and the results, including the adjustments, were sent to the British Museum before 15 June. Oxford did not begin testing until July, and Zurich, although testing on 25-27 May and 5-8 July, did not send anything to the British Museum until 20 July.

Although the laboratory derived an overall date from all eight measurements taken together, the British Museum did not think that they were all truly independent of each other, and requested one date from each tested piece – as was submitted by the other laboratories. A weighted mean (µw) and error (σw) were therefore calculated for each piece using these formulae:

µw = Σ(xᵢ/σᵢ²) / Σ(1/σᵢ²)        σw = 1 / √Σ(1/σᵢ²)

where xᵢ and σᵢ are the individual measurements and errors.

This gave the values quoted in the Nature paper.
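By way of illustration (a sketch of mine, not Arizona’s own worksheet: the eight underlying measurements, which I do not reproduce here, combine pairwise in exactly the same way), here are the formulae applied in Python to Arizona’s four per-piece dates from Table 1. They return the overall weighted mean of 646 BP with an internal error of ±17 – the very figure Brunati, discussed above, insisted was the only honest one.

```python
import math

def combine(measurements):
    """Weighted mean and internal error for (value, sigma) pairs,
    per the Ward & Wilson formulae quoted above."""
    weights = [1 / s ** 2 for _, s in measurements]
    mu_w = sum(w * x for w, (x, _) in zip(weights, measurements)) / sum(weights)
    sig_w = 1 / math.sqrt(sum(weights))
    return mu_w, sig_w

# Arizona's four per-piece Shroud dates from Table 1 of the Nature paper
mu, sigma = combine([(591, 30), (690, 35), (606, 41), (701, 33)])
print(f"{mu:.0f} ± {sigma:.0f} BP")  # 646 ± 17
```

The ±17 is only the internal (counting-based) error; as explained in the Brunati section above, the scatter of the results and the δ13C correction enlarged it to the quoted ±31.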

Oxford: Oxford tested its samples on 13 and 20 July. Its submission to the British Museum listed one date for each of the pieces it tested (three pieces per sample), with its error and, “for interest”, its “counting error contribution.” For the Shroud, these dates were:
795 ± 65 (53)
730 ± 45 (30)
745 ± 55 (46)

Naturally, the British Museum took the overall quoted errors, not the “counting error contribution.” For Casabianca to record “Oxford Raw” as:
795 ± 53
730 ± 30
745 ± 46
is nothing short of scurrilous, and an obvious attempt to discredit the data presentation by the Nature paper.

Zurich: Willy Wölfli of Zurich did actually submit the data shown as ‘Zurich Raw’ in Casabianca’s paper, but once again, Table One tells less than the whole truth. His first submission (quoted by Casabianca as ‘Zurich Raw’) on 20 July was followed by a second, on 31 August, with two minor alterations, namely the changing of two data from 617 ±47 to 639 ±45, and from 595 ±46 to 679 ±51. His explanation for this was: “Inbetween [20 July and 30 August] we had enough time to go all over it again and finally discovered that (to our shame) the ages obtained during Run 2 have not been corrected for the so-called current dependent effect. This effect is known to us for many years and we try to minimise it by preparing standards and unknown samples in exactly the same way so that they should deliver about the same current. These conditions were nicely fulfilled for all samples of Run 1 but not for those of Run 2 where the unknown samples delivered about 10% higher 12C-currents than the standards. Our 13C/12C ratio measurements allow us to determine the amount of this current dependency and to evaluate the corresponding correction factor.”

Of course, these amended results were those published in the Nature paper, and it was wrong of Casabianca not to acknowledge them in Table One, even if, as with the Arizona amendments, he dismisses them with an implication of incompetence at least, or dishonesty at worst. His evidence for this, curiously, is a letter from the Arizona scientists (already dismissed as charlatans), who say that they “do not understand how such a systematic calculational error could have changed the values of their uncertainties.” They could have asked Willy Wölfli, who could have told them.

In short, Casabianca and his statisticians have distorted the ‘raw’ material released by the British Museum, and failed to establish that the versions published in Nature are incorrect. Any further analysis should refer to the Nature data, but much of the rest of the paper uses the distorted versions, in an attempt to prove that they are incompatible with the final conclusion.

As with the previous analysts – and as pointed out by the Nature paper itself – Casabianca finds that the Oxford result does not overlap the Arizona result. This is hardly news. In fact, if the gradient truly exists, it would not be surprising if the individual results for each laboratory were more dispersed than they would be if each laboratory’s sample represented a single date. A graphical representation will illustrate the idea. Here are the dates of the twelve dated pieces, with the means marked with vertical bars:

It is easy to see that two of the Arizona samples do not overlap the other two, and that of the five Zurich samples, two pairs scarcely overlap. If these were supposed to be dispersed around a single mean, we might be disposed to wonder about this, but if there were a gradient along the cloth, we would be much less concerned.

In the graph below, the dates are evenly spread across each sample, as if each sample had been cut into little bands across the original strip. Consequently the Arizona and Zurich data points do not overlap. A trend line is derived from the points. The samples for each lab were about the same size, so Zurich’s five pieces are closer together (on the vertical scale) than Oxford’s three.
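The trend line is just a least-squares fit. The sketch below (my own reconstruction, not Riani and Atkinson’s model) gives each laboratory an equal third of the strip, in the order Oxford, Zurich, Arizona, and spaces its pieces evenly within it; these positions are pure assumption, since the true cutting plan is unknown, so the fitted slope is illustrative only.

```python
import numpy as np

# Per-piece dates (years BP) from Table 1 of the Nature paper,
# arranged Oxford, Zurich, Arizona along the strip
labs = [("Oxford", [795, 730, 745]),
        ("Zurich", [733, 722, 635, 639, 679]),
        ("Arizona", [591, 690, 606, 701])]

positions, dates = [], []
for i, (lab, pieces) in enumerate(labs):
    for j, date in enumerate(pieces):
        # Hypothetical position: each lab occupies an equal third of
        # the strip, its pieces spaced evenly across that third
        positions.append(i / 3 + (j + 0.5) / (3 * len(pieces)))
        dates.append(date)

slope, intercept = np.polyfit(positions, dates, 1)
print(f"fitted trend: about {intercept:.0f} BP at the Oxford end, "
      f"falling by about {abs(slope):.0f} years along the strip")
```

On these assumptions the fit falls by roughly 150 years from the Oxford end to the Arizona end; shuffling the pieces within each laboratory’s segment changes the slope a little, but not its direction.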

Although the data points now fit the statistical generalisation more closely, we cannot honestly say that this hypothetical interpretation is much more than a pointer to further investigation. For one thing, it is very unlikely that any gradient created by contamination would be linear when analysed in detail; for another, it is very unlikely that the tested pieces truly lay at the distances along the sample modelled in the graph above; thirdly, the pieces were probably taken at different distances across the width of the sample; and finally, whatever the contamination was, it was as likely to have varied across the width of the sample as along its length. Some ingenious recreations of possible configurations have been constructed, assuming quite good precision for each measurement, showing contamination centred on a point on the Holland backing cloth near the Arizona sample area and radiating concentrically outwards towards the Oxford area, but they cannot be considered statistically significant. Some examples were given in an earlier post on this blog: The Chronological Gradient.

Having falsified Table One, by claiming or implying that the ‘raw’ data was more accurate than the ‘Nature’ data, Casabianca goes on to make a lot of unnecessary comparisons using it, which nevertheless add nothing to what has been obvious from the start: the Oxford results do not overlap the Zurich results, and the two pairs of the Arizona results do not overlap each other. In the Discussion section, Casabianca completely misunderstands Oxford’s description of its errors, implies that arbitrary corrections were made to initial error measurements, and picks out three data points for specific criticism. None of it is justified. The Oxford results submitted were:

Shroud
795 ±65 (53)
730 ±45 (30) Casabianca thinks this has been arbitrarily changed from 30 to 45 …
745 ±55 (46)
Nubia
980 ±55 (45)
915 ±55 (45)
925 ±45 (32) … but he does not comment on this.
Thebes
1955 ±70 (61)
1975 ±55 (50)
1990 ±50 (33) Casabianca thinks this has been arbitrarily changed from 33 to 50
Provence
785 ±50 (35) Casabianca thinks this has been arbitrarily changed from 35 to 50 …
710 ±40 (29)
790 ±45 (32) … but he does not comment on this.

As mentioned above, “the counting error contribution is added in brackets for interest.”

In fact Oxford explained clearly the factors which contributed to the overall error, then said that the “overall effect is to increase a combined purely statistical error by about 10 to 15 years. The final error quoted is this result rounded up to the nearest five years except when the final error is less than 40 years. […] Final errors less than 40 years have been arbitrarily increased to 40 or 45 years.” There is nothing in the results presented to suggest that this did not occur.

The conclusions at the end of Casabianca’s paper do not add to what was known and reported by the Nature paper, as supplemented later by the information as to where along the strip the laboratories’ samples came from. There are vague insinuations that gross contamination might not have been removed, an irrelevant comment from Harry Gove, and at the end of it all a demand for “a new radiocarbon dating to compute a new reliable interval.” However, nobody reading any of these statistical analyses, from the Nature paper to the present day, can realistically expect any new test to support anything other than a medieval origin for the Shroud of Turin.