Uncertainty Estimates for Routine Temperature Data Sets.

Part One.

Geoff Sherrington

Modern climate research commonly fails to give adequate recognition to three guiding principles about uncertainty.

1. Uncertainty estimation is essential to understanding.

“It is generally agreed that the usefulness of measurement results, and thus much of the information that we provide as an institution, is to a large extent determined by the quality of the statements of uncertainty that accompany them.”

2. Uncertainty estimation has two dominant parts.

“The uncertainty in the result of a measurement generally consists of several components which may be grouped into two categories according to the way in which their numerical value is estimated:

 A. those which are evaluated by statistical methods,

 B. those which are evaluated by other means.”


3. Uncertainty estimation needs to value diverse views.

“In 2009, the Obama Administration identified six principles of scientific integrity.” Two of them follow –

“Dissent. Science benefits from dissent within the scientific community to sharpen ideas and thinking. Scientists’ ability to freely voice the legitimate disagreement that improves science should not be constrained.

“Transparency in sharing science. Transparency underpins the robust generation of knowledge and promotes accountability to the American public. Federal scientists should be able to speak freely, if they wish, about their unclassified research, including to members of the press.”


This article examines how well the Australian Bureau of Meteorology, BOM, satisfies these requirements in respect of the uncertainty estimated for routine daily temperatures.

Part One deals more with the social aspects like transparency. Part Two addresses mathematics and statistics.

This article uses Australian practice, with examples dominantly involving BOM. Importantly, the conclusions apply worldwide, for there is much to repair.

The prominent, practical guide to uncertainty is from the France-based Bureau International des Poids et Mesures, BIPM, with their Guide to the Expression of Uncertainty in Measurement (GUM).

Evaluation of measurement data – Guide to the expression of uncertainty in measurement, JCGM 100:2008 (GUM 1995 with minor corrections) (bipm.org)

Several years ago, in email correspondence with BOM, I started to ask this question:

If a person seeks to know the separation of two daily temperatures in degrees C that allows a confident claim that the two temperatures are different statistically, by how much would the two values be separated?

BOM has made several attempts to answer this question. They have permitted me to quote from their correspondence on the condition that I reference the full quote, which I do here.


On March 31st, 2022, BOM sent their most recent attempt to answer the question. Here is a table with some of their text.

(Start quote) “The uncertainties, with a 95% confidence interval for each measurement technology and data usage, are listed below. Sources that have been considered in contributing to this uncertainty include, but are not limited to, field and inspection instruments, calibration traceability, measurement electronics or observer error, comparison methods, screen size and aging.

Measurement Technology | Ordinary Dry Bulb Thermometer | PRT Probe and Electronics
Isolated single measurement – no nearby station or supporting evidence | ±0.45 °C | ±0.51 °C
Typical measurement – station with 5+ years of operation | ±0.23 °C | ±0.23 °C
Typical measurement – station with 10+ years of operation and at least 5 verification checks | ±0.18 °C | ±0.16 °C
Long-term measurement – station with 30+ years of aggregated records | ±0.14 °C | ±0.11 °C
Long-term measurement – station with 100+ years of aggregated records | ±0.13 °C | ±0.09 °C

I would stress that in answer to your specific question of “If a person seeks to know the separation of two daily temperatures in degrees C that allows a confident claim that the two temperatures are different statistically by how much would the two values be separated”, the ‘Typical measurement’ Uncertainty for the appropriate measurement technology would be the most suitable value. This value is not appropriate for wider application to assess long-term climate trends, given typical measurements are more prone to measurement, random, and calibration error than verified long-term datasets.”  (End quote)
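BOM nominates the ‘Typical measurement’ uncertainty but never states the separation itself. If one assumes the two readings carry independent, normally distributed errors (my assumption, not BOM's), their standard uncertainties add in quadrature, so the 95% expanded uncertainty of the difference is √2 times the single-measurement figure. A minimal sketch using BOM's quoted ±0.23 °C value:

```python
import math

# BOM's quoted 95% expanded uncertainty for a single "typical" measurement
U_single = 0.23  # degrees C

# For the difference of two independent measurements, standard
# uncertainties combine in quadrature, so the expanded uncertainty
# of the difference is sqrt(2) times that of one reading.
U_diff = math.sqrt(2) * U_single

print(f"Minimum separation for a 95% confident difference: {U_diff:.2f} C")
# -> about 0.33 C
```

On that reading, two daily temperatures would need to differ by roughly a third of a degree before a confident claim of statistical difference could be made.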

These confidence intervals essentially cover only one of the two Parts that comprise a complete estimation of uncertainty. They are mostly of the Part A type, derived from statistical methods. They are incomplete and unfit for routine use without more attention to Part B, those evaluated by other means.

There is a significant difference in the interpretation of temperature data, especially in time series, if the uncertainty is ±0.51 °C or ±0.09 °C, to use the extreme estimates from the BOM table. It is vital to understand how the uncertainty of a single observation becomes much smaller when multiple observations are combined in some way. Is that combination a valid scientific act?
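The shrinkage implied by the table follows the familiar 1/√n rule for the standard error of a mean, which is valid only for independent, random errors. A hedged numeric illustration (the two endpoint values are from the BOM table; the independence assumption is precisely what is in question):

```python
U_single = 0.51  # BOM's isolated single measurement, 95% (degrees C)
U_target = 0.09  # BOM's 100+ year aggregated figure, 95% (degrees C)

# If errors were purely random and independent, averaging n readings
# would shrink the uncertainty by a factor of sqrt(n), so reaching
# U_target from U_single requires n = (U_single / U_target) ** 2.
n_needed = (U_single / U_target) ** 2
print(f"Independent observations needed: {n_needed:.0f}")  # about 32
```

The arithmetic is trivial; the scientific question is whether the errors being averaged really are independent and random, rather than shared and systematic.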

In the case of routine temperature measurements (the test subject of this article), Type B might include, but not be limited to, all of those effects adjusted by homogenization of time series of temperatures. In this article, we use the BOM adjustment procedures for creating the Australian Climate Observations Reference Network – Surface Air Temperature (ACORN-SAT).

ACORN-SAT commences with “raw” temperature data as input. This is then examined visually and/or statistically for breaks in an expected (smooth) pattern. Sometimes a pattern at one station is compared with the performance of other stations up to 1,200 km distant. Temperatures are adjusted, singly or in blocks or patterns, to give a smoother-looking output, more in agreement with other stations, more pleasing to the eye perhaps, but often inadequately supported by metadata documenting actual changes made in the past. Sometimes there is personal selection of when to adjust and by how much, that is, guesswork.

The BIPM guidelines have no advice on how to create uncertainty bounds for “guesses” – for good scientific reasons.



Some other relevant factors affecting BOM raw data include:

  1. Data began with the Fahrenheit scale, then moved to Celsius scale.
  2. There were periods of years when a thermometer observation was reported in whole degrees, with no places after the decimal. (“Rounding effects.”)
  3. Almost every ACORN-SAT station of the 112 or so was moved to a different location some time in its life.
  4. Some stations have had new buildings and ground surfaces like asphalt put close to them, potentially affecting their measurements. (“Urban Heat Island” effects, UHI.)
  5. Thermometers changed from liquid-in-glass to platinum resistance.
  6. Screen volumes changed over the decades, generally becoming smaller.
  7. Screens have been shown to be affected by cleaning and type of exterior finish.
  8. The recording of station metadata, noting effects with potential to affect measurements, was initially sparse and is still inadequate.
  9. Some manual observations were not taken on Sundays, the Sabbath, at some stations.
  10. And so on.


In mid-2017, BOM and New Zealand officials met and emailed to produce a report that touched on the variables just listed but concentrated on the performance of the Automatic Weather Station, AWS, lately dominant and mostly using PRT sensors.

Review_of_Bureau_of_Meteorology_Automatic_Weather_Stations.pdf (bom.gov.au)

Some email correspondence within BOM and New Zealand about this review became public through a Freedom of Information request. Relevant FOI material is here.


Here are some extracts from those emails. (Some names have been redacted. My bolds).

“While none of the temperature measurements resident in the climate database have an explicit uncertainty of measurement, the traceability chain back to the national temperature standards, and the processes used both in the Regional Instrument centre (the current name of the metrology laboratory in the Bureau) and the field inspection process suggest that the likely 95% uncertainty of a single temperature measurement is of the order of 0.5 °C. This is estimated from a combination of the field tolerance and test process uncertainties over a temperature range from -10 to +55 °C.”

“(We) should deal with the discrepancy between the BOM’s current 0.4 °C uncertainty and the 0.1 °C WMO aspirational goal.”

By reference to the table above, the PRT column offers a similar uncertainty of ±0.51 °C for “Isolated single measurement – No nearby station or supporting evidence”; also ±0.37 °C for “Typical measurement – Station with 5+ or 10+ years of operation”; also ±0.11 °C for “Long-term measurement – Station with 30+ years of aggregated records.”

One does not know why there is a further offering for AWS of ±0.09 °C for “records with 100+ years of aggregated record.” Hugh Callendar developed the first commercially successful platinum RTD in 1885, but its use in automatic weather stations seems to have started about the time of the 1957-8 International Geophysical Year. There might be no examples of 100+ years.

Recall that for some five years before mid-2022, I had been asking BOM for estimates of uncertainty, a question that remained unanswered by mid-2022. This has to be considered against the knowledge revealed in the 2017 email exchange, that “the likely 95% uncertainty of a single temperature measurement is of the order of 0.5 °C.” It is reasonable to consider that this estimate was concealed from me. One of the BOM staff who has been answering my recent emails was present and named among the email writers of the 2017 exchange.

Why did BOM fail to mention this estimate? My main question invited exactly that answer, but none came.

This brings us to the start of this article and its three governing principles, one of which is “Transparency in sharing science. Transparency underpins the robust generation of knowledge and promotes accountability to the American public. Federal scientists should be able to speak freely, if they wish, about their unclassified research, including to members of the press.”

As for America, also for Australia.

It happens that I have kept some past writings by officers of the BOM over the years. Here are some.

Recall that during Climategate, BOM’s Dr David Jones emailed colleagues on 7th September 2007:

“Fortunately in Australia our sceptics are rather scientifically incompetent. It is also easier for us in that we have a policy of providing any complainer with every single station observation when they question our data (this usually snows them) and the Australian data is in pretty good order anyway.”

David Jones had not reached an apologetic mood by June 16, 2009, when he emailed me his response to a technical question:

Geoff, your name appears very widely in letters to editors, on blogs and your repeatedly email people in BoM asking the same questions. I am well aquatinted with letters such as this one – http://www.jennifermarohasy.com/blog/archives/001281.html . You also have a long track record of putting private correspondence on public blogs. I won’t be baited.

Further, there is an email involving BOM Media and Big Boss Andrew Johnson and others in the AWS review, 24 August 2017 9:58 AM:

“I expect we will reply to this one with: The Bureau does not comment on any third-party research.”

Continuing the theme is this 2017 BOM email, in response to mine asserting, with data, that Australian heatwaves are not becoming longer, hotter or more frequent.

“The Bureau is unable to comment on unpublished scientific hypotheses or studies, and we encourage you to publish your work in a suitable journal. Through the peer reviewed literature, you can take up any criticism you have of existing methodologies and have these published in a format and forum that is accessible to other scientists. Regards, Climate Monitoring and Prediction.”

This fortress-BOM mood might have started from the top. One redacted name in that 2017 email exchange revealed that –

 “I am essentially ‘external’ as an emeritus researcher, but was head of the infrastructure/procurement/engineering/science measurement area when I retired from the Bureau in March 2016 last year.”

This person might or might not have been former BOM Director Dr Rob Vertessy. Newspapers in 2017 reported resignation comments from him that send a message.

“Vertessy’s agency was under consistent attack from climate science denialists who would claim, often through the news and opinion pages of the Australian, that the weather bureau was deliberately manipulating its climate records to make recent warming seem worse than it really was.

“From my perspective, people like this, running interference on the national weather agency, are unproductive and it’s actually dangerous,” Vertessy told me. “Every minute a BoM executive spends on this nonsense is a minute lost to managing risk and protecting the community. It is a real problem.”


Note the common media spin methods in this press article. BOM have seen a problem, framed it in their own way, and expressed emotion without denying the accusations.

At this stage of this article, I submit some words by others to indicate the scale of the problems that are emerging.

“An irreproducibility crisis afflicts a wide range of scientific and social-scientific disciplines, from public health to social psychology. Far too frequently, scientists cannot replicate claims made in published research. Many improper scientific practices contribute to this crisis, including poor applied statistical methodology, bias in data reporting, fitting the hypotheses to the data, and endemic groupthink. Far too many scientists use improper scientific practices, including outright fraud.”

National Association of Scholars (USA).

Shifting Sands. Unsound Science and Unsafe Regulation

Report #1: Keeping Count of Government Science: P-Value Plotting, P-Hacking, and PM2.5 Regulation


From that report, there is a view about the Central Limit Theorem on page 36.

“The Bell Curve and the P-Value: The Mathematical Background

“All “classical” statistical methods rely on the Central Limit Theorem, proved by Pierre-Simon Laplace in 1810.

“The theorem states that if a series of random trials are conducted, and if the results of the trials are independent and identically distributed, the resulting normalized distribution of actual results, when compared to the average, will approach an ideal­ized bell-shaped curve as the number of trials increases without limit.

“By the early twentieth century, as the industrial landscape came to be dominated by methods of mass production, the theorem found application in methods of industri­al quality control. Specifically, the p-test naturally arose in connection with the ques­tion “how likely is it that a manufactured part will depart so much from specifications that it won’t fit well enough to be used in the final assemblage of parts?” The p-test, and similar statistics, became standard components of industrial quality control.

“It is noteworthy that during the first century or so after the Central Limit Theorem had been proved by Laplace, its application was restricted to actual physical mea­surements of inanimate objects. While philosophical grounds for questioning the assumption of independent and identically distributed errors existed (i.e., we can never know for certain that two random variables are identically distributed), the assumption seemed plausible enough when discussing measurements of length, or temperatures, or barometric pressures.

“Later in the twentieth century, to make their fields of inquiry appear more “scien­tific”, the Central Limit Theorem began to be applied to human data, even though nobody can possibly believe that any two human beings—the things now being measured—are truly independent and identical. The entire statistical basis of “ob­servational social science” rests on shaky supports, because it assumes the truth of a theorem that cannot be proved applicable to the observations that social scientists make.”
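The convergence the report describes can be checked numerically. A minimal sketch in Python, using uniform draws so that the theorem's independent-and-identically-distributed conditions hold by construction:

```python
import random
import statistics

random.seed(1)

def normalized_mean(n):
    """Mean of n iid Uniform(0, 1) draws, centred at 0.5 and scaled by sqrt(n)/sigma."""
    draws = [random.random() for _ in range(n)]
    mean = sum(draws) / n
    sigma = (1 / 12) ** 0.5  # standard deviation of Uniform(0, 1)
    return (mean - 0.5) * (n ** 0.5) / sigma

# By the CLT, this distribution approaches N(0, 1) as n grows,
# even though the underlying draws are far from bell-shaped.
samples = [normalized_mean(100) for _ in range(10_000)]
print(round(statistics.mean(samples), 3), round(statistics.stdev(samples), 3))
```

The printed mean and standard deviation come out close to 0 and 1, as the theorem predicts; the point of the report's critique is that this guarantee evaporates when the draws are neither independent nor identically distributed.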

Dr David Jones emailed me on June 9, 2009, with this sentence:

“Your analogy between a 0.1C difference and a 0.1C/decade trend makes no sense either – the law of large numbers or central limit theorem tells you that random errors have a tiny effect on aggregated values.”
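That is true for random errors, but it does not extend to systematic (Part B) effects, which is the crux of this article. A hedged sketch contrasting the two cases (the 0.5 °C magnitudes echo the BOM emails; the scenario itself is illustrative, not BOM's):

```python
import random
import statistics

random.seed(42)
true_temp = 20.0  # hypothetical true daily temperature, degrees C
n = 10_000

# Random error: zero-mean noise of ~0.5 C does average away as n grows.
random_errs = [true_temp + random.gauss(0, 0.5) for _ in range(n)]

# Systematic error: a fixed 0.5 C bias (say, an uncorrected site change)
# survives averaging no matter how many readings are taken.
biased = [true_temp + 0.5 + random.gauss(0, 0.5) for _ in range(n)]

print(round(statistics.mean(random_errs) - true_temp, 2))  # near 0.0
print(round(statistics.mean(biased) - true_temp, 2))       # near 0.5
```

The law of large numbers disposes of the first kind of error but is silent on the second, which is why Part B evaluation cannot be waved away by aggregation.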


Part Two of this article takes up the mathematics and statistics relevant to CLT and LOLN.


Geoff Sherrington


Melbourne, Australia.

20th August 2022.


via Watts Up With That?


August 24, 2022 at 04:25AM
