What do you mean by “mean”: an essay on black boxes, emulators, and uncertainty

Guest post by Richard Booth, Ph.D

References:

[1] https://wattsupwiththat.com/2019/09/07/propagation-of-error-and-the-reliability-of-global-air-temperature-projections-mark-ii/

[2] https://wattsupwiththat.com/2019/10/15/why-roy-spencers-criticism-is-wrong  

A. Introduction

I suspect that we can all remember childish arguments of the form “person A: what do you mean by x, B: oh I really mean y, A: but what does y really mean, B: oh put in other words it means z, A: really (?), but in any case what do you even mean by mean?”  Then in adult discussion there can be so much interpretation of words and occasional, sometimes innocent, misdirection, that it is hard to draw a sound conclusion.  And where statistics are involved, it is not just “what do you mean by mean” (arithmetic?, geometric?, root mean square?) but “what do you mean by error”, “what do you mean by uncertainty” etc.etc.?

Starting in the late summer of 2019 there were several WUWT postings on the subject of Dr. Pat Frank’s paper [1], and they often seemed to get bogged down in these questions of meaning and understanding.  A good deal of progress was made, but some arguments were left unresolved, so in this essay I revisit some of the themes which emerged.  Here is a list of sections:

B. Black Box and Emulator Theory – 2.3 pages (A4)

C. Plausibility of New Parameters – 0.6 pages

D. Emulator Parameters – 1.5 pages

E. Error and Uncertainty – 3.2 pages

F. Uniform Uncertainty (compared to Trapezium Uncertainty) – 2.5 pages

G. Further Examples – 1.2 pages

H. The Implications for Pat Frank’s Paper – 0.6 pages

I. The Implications for GCMs – 0.5 pages

Some of those sections are quite long, but each has a summary at its end, to help readers who are short of time and/or do not wish to wade through a deal of mathematics.  The length is unfortunately necessary to develop interesting mathematics around emulators and errors and uncertainty, whilst including examples which may shed some light on the concepts.  There is enough complication in the theory that I cannot guarantee that there isn’t the odd mistake.  When referring to [1] or [2] including their comments sections, I shall refer to Dr. Frank and Dr. Roy Spencer by name, but put the various commenters under the label “Commenters”.

I am choosing this opportunity to “come out” from behind my blog name “See – Owe to Rich”.  I am Richard Booth, Ph.D., and author of “On the influence of solar cycle lengths and carbon dioxide on global temperatures”.  Published in 2018 by the Journal of Atmospheric and Solar-Terrestrial Physics (JASTP), it is a rare example of a peer-reviewed connection between solar variations and climate which is founded on solid statistics, and is available at https://doi.org/10.1016/j.jastp.2018.01.026 (paywalled)  or in publicly accessible pre-print form at https://github.com/rjbooth88/hello-climate/files/1835197/s-co2-paper-correct.docx . I retired in 2019 from the British Civil Service, and though I wasn’t working on climate science there, I decided in 2007 that as I had lukewarmer/sceptical views which were against the official government policy, alas sustained through several administrations, I should use the pseudonym on climate blogs whilst I was still in employment.

B. Black Box and Emulator Theory

Consider a general “black box”, which has been designed to estimate some quantity of interest in the past, and to predict its value in the future.  Consider also an “emulator”, which is an attempt to provide a simpler estimate of the past black box values and to predict the black box output into the future.  Last, but not least, consider reality, the actual value of the quantity of interest.

Each of these three entities can be modelled as a time series with a statistical distribution.  They are all numerical quantities (possibly multivariate) with uncertainty surrounding them, and the only successful mathematics which has been devised for analysis of such is probability and statistics.  It may be objected that reality is not statistical, because it has a particular measured value.  But that is only true after the fact, or as they say in the trade, a posteriori.  Beforehand, a priori, reality is a statistical distribution of a random variable, whether the quantity be the landing face of the die I am about to throw or the global HadCRUT4 anomaly averaged across 2020.

It may also be objected that many black boxes, for example General Circulation Models (GCMs), are not statistical, because they follow a time evolution with deterministic physical equations.  Nevertheless, the evolution depends on the initial state, and because climate is famously “chaotic”, tiny perturbations to that state lead to sizeable divergence later.  The chaotic system tends to revolve around a small number of attractors, and the breadth of orbits around each attractor can be studied by computer and matched to statistical distributions.

The most important parameters associated with a probability distribution of a continuous real variable are the mean (measure of location) and the standard deviation (measure of dispersion).  So across the 3 entities there are 6 important parameters; I shall use E[] to denote expectation or mean value, and Var[] to denote variance which is squared standard deviation.  What relationships between these 6 allow the defensible (one cannot assert “valid”) conclusion that the black box is “good”, or that the emulator is “good”? 

In general, since the purpose of an emulator is to emulate, it should do that with as high a fidelity as possible.  So a good emulator should, rather as in the Turing Test of whether a computer can pass for a human, be able to reproduce the spread/deviation/range of the black box as well as its mean/average behaviour.  Ideally one would not be able to tell the output of one from that of the other.

To make things more concrete, I shall assume that the entities are each a uniform discrete time series, in other words a set of values evenly spaced across time with a given interval, such as a day, a month, or a year.  Let:

  X(t) be the random variable for reality at integer time t;

  M(t) be the random variable for the black box Model;

  W(t) be the random variable for some emulator (White box) of the black box;

  Ri(t) be the random variable for some contributor to an entity, possibly an error term.

 Now choose a concrete time evolution of W(t) which does have some generality:

  (1)  W(t) = (1-a)W(t-1) + R1(t) + R2(t) + R3(t), where 0 ≤ a ≤ 1

The reason for the 3 R terms will become apparent in a moment.  First note that the new value W(t) is partly dependent on the old one W(t-1) and partly on random Ri(t) terms.  If a=0 then there is no decay, and a putative flap of a butterfly’s wings contributing to W(t-1) carries on undiminished to perpetuity.  In Section C I describe how the decaying case a>0 is plausible.

R1(t) is to be the component which represents changes in major causal influences, such as the sun and carbon dioxide.  R2(t) is to be a component which represents a strong contribution with observably high variance, for example the Longwave Cloud Forcing (LCF).  Some emulators might ignore this, but it could have a serious impact on how accurately the emulator follows the black box.  R3(t) is a putative component which is negatively correlated with R2(t) with coefficient -r, with the potential (dependent on exact parameters) to mitigate the high variance of R2(t).  We shall call R3(t) the “mystery component”, and its inclusion is justified in Section C.

Equation (1) can be “solved”, i.e. the recursion removed, but first we need to specify time limits.  We assume that the black box was run and calibrated against data from time 0 to the present time P, and then we are interested in future times P+1, P+2,… up to F. The solution to Equation (1) is

  (2)  W(t) = ∑_{i=0}^{t-1} (1-a)^i (R1(t-i) + R2(t-i) + R3(t-i)) + (1-a)^t W(0)

The expectation of W(t) depends on the expectations of each Rk(t), and to make further analytical progress we need to make assumptions about these.  Specifically, assume that

  (3)  E[R1(t)] = bt + c,  E[R2(t)] = d,  E[R3(t)] = 0

Then a modicum of algebra derives

  (4)  E[W(t)] = b(at + a - 1 + (1-a)^(t+1))/a^2 + (c+d)(1 - (1-a)^t)/a + (1-a)^t W(0)

In the limit as a tends to 0, we get the special case

  (5)  E[W(t)] = bt(t+1)/2 + (c+d)t + W(0)

Next we consider variance, with the following assumptions:

  (6)  Var[Rk(t)] = sk^2,  Cov[R2(t),R3(t)] = -r s2 s3,  all other covariances, within or across time, are 0, so
  (7)  Var[W(t)] = (s1^2 + s2^2 + s3^2 - 2r s2 s3)(1 - (1-a)^(2t))/(2a - a^2)

and as a tends to zero the final factor (1 - (1-a)^(2t))/(2a - a^2) tends to t (implying that the variance increases linearly with t).
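For readers who want to check the algebra, here is a short Python sketch (my own illustration, not part of the original argument, and with purely invented parameter values) which simulates Equation (1) many times over and compares the sample mean and variance at the final step with Equations (4) and (7).

import numpy as np

# Monte Carlo check of the emulator of Equation (1), under the assumptions of
# Equations (3) and (6).  All parameter values below are illustrative only.
rng = np.random.default_rng(1)
a, b, c, d = 0.1, 0.02, 0.3, 0.1     # decay and drift parameters (hypothetical)
s1, s2, s3, r = 0.2, 1.0, 1.0, 0.5   # standard deviations and anti-correlation
t_max, n_runs, W0 = 50, 20000, 0.0

W = np.full(n_runs, W0)
for t in range(1, t_max + 1):
    R1 = rng.normal(b * t + c, s1, n_runs)
    # R2 and R3 drawn jointly with correlation -r
    cov = [[s2**2, -r * s2 * s3], [-r * s2 * s3, s3**2]]
    R23 = rng.multivariate_normal([d, 0.0], cov, n_runs)
    W = (1 - a) * W + R1 + R23[:, 0] + R23[:, 1]

q = 1 - a
mean_theory = (b * (a * t_max + a - 1 + q**(t_max + 1)) / a**2
               + (c + d) * (1 - q**t_max) / a + q**t_max * W0)          # Equation (4)
var_theory = ((s1**2 + s2**2 + s3**2 - 2 * r * s2 * s3)
              * (1 - q**(2 * t_max)) / (2 * a - a**2))                  # Equation (7)
print(W.mean(), mean_theory)   # agree to Monte Carlo accuracy
print(W.var(), var_theory)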

Summary of section B:

  • A good emulator can mimic the output of the black box.
  • A fairly general iterative emulator model (1) is presented.
  • Formulae are given for expectation and variance of the emulator as a function of time t and various parameters.
  • The two extra parameters, a and R3(t), over and above those of Pat Frank’s emulator, can make a huge difference to the evolution.
  • The “magic” component R3(t) with anti-correlation -r to R2(t) can greatly reduce model error variance whilst retaining linear growth in the absence of decay.
  • Any decay rate a>0 completely changes the propagation of error variance from linear growth to convergence to a finite limit.

C. Plausibility of New Parameters

The decaying case a>0 may at first sight seem implausible.  But here is a way it could arise.  Postulate a model with 3 main variables, M(t) the temperature, F(t) the forcing, and H(t) the heat content of land and oceans.  Let

  M(t) = b + cF(t) + dH(t-1)

(Now by the Stefan-Boltzmann equation M should be related to F^(1/4), but locally it can be linearized by a binomial expansion.)  The theory here is that temperature is fed both by instantaneous radiative forcing F(t) and by previously stored heat H(t-1).  (After all, climate scientists are currently worrying about how much heat is going into the oceans.)  Next, the heat changes by an amount dependent on the change in temperature:

  H(t-1) = H(t-2) + e(M(t-1)-M(t-2)) = H(0) + e(M(t-1)-M(0))

Combining these two equations we get

  M(t) = b + cF(t) + d(H(0) + e(M(t-1)-M(0))) = f + cF(t) + (1-a)M(t-1)

where a = 1-de, f = b+dH(0)-deM(0).  This now has the same form as Equation (1); there may be some quibbles about it, but it shows a proof of concept of heat buffering leading to a decay parameter.
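As a sanity check on that algebra, here is a tiny Python sketch (entirely illustrative, with invented coefficient values) which iterates the two-equation temperature/heat system and the collapsed one-equation recursion side by side; the two trajectories coincide.

# Check that the temperature/heat-buffer pair reduces to the single decaying recursion
#   M(t) = f + c*F(t) + (1-a)*M(t-1),  with a = 1 - d*e and f = b + d*H(0) - d*e*M(0).
b, c, d, e = 2.0, 0.5, 0.3, 0.8        # hypothetical coefficients
M0, H0 = 15.0, 10.0
F = [1.0, 1.2, 1.1, 1.5, 1.4]          # some forcing series F(1)..F(5)

a = 1 - d * e
f = b + d * H0 - d * e * M0

# iterate the two-equation system
M_sys, H = [M0], H0
for Ft in F:
    M_new = b + c * Ft + d * H          # M(t) = b + c F(t) + d H(t-1)
    H = H + e * (M_new - M_sys[-1])     # H(t) = H(t-1) + e (M(t) - M(t-1))
    M_sys.append(M_new)

# iterate the collapsed one-equation form
M_one = [M0]
for Ft in F:
    M_one.append(f + c * Ft + (1 - a) * M_one[-1])

print(M_sys)
print(M_one)   # identical apart from floating-point rounding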

For the anti-correlated R3(t), consider reference [2]. Roy Spencer, who has serious scientific credentials, had written “CMIP5 models do NOT have significant global energy imbalances causing spurious temperature trends because any model systematic biases in (say) clouds are cancelled out by other model biases”.  This means that in order to maintain approximate Top Of Atmosphere (TOA) radiative balance, some approximate cancellation is forced, which is equivalent to there being an R3(t) with high anti-correlation to R2(t).  The scientific implications of this are discussed further in Section I.

Summary of Section C:

  • A decay parameter is justified by a heat reservoir.
  • Anti-correlation is justified by GCMs’ deliberate balancing of TOA radiation.

D. Emulator Parameters

Dr. Pat Frank’s emulator falls within the general model above.  The constants from his paper, 33K, 0.42, 33.3 W/m^2, and +/-4 W/m^2, the latter being from errors in LCF, combine to give 33*0.42/33.3 = 0.416 and 0.416*4 = 1.664 as used here.  So we can choose a = 0, b = 0, c+d = 0.416 F(t) where F(t) is the new GHG forcing (W/m^2) in period t, s1 = 0, s2 = 1.664, s3 = 0, and then derive

  (8)  W(t) = (c+d)t + W(0) +/- sqrt(t) s2

(I defer discussion of the meaning of the +/- sqrt(t) s2, be it uncertainty or error or something else, to Section E.  Note that F(t) has to be constant to directly use the theory here.)

But by using more general parameters it is possible to get a smaller value of the +/- term.  There are two main ways to do this – by covariance or by decay, each separately justified in Section C.

In the covariance case, choose s3 = s2 and r = 0.95 (say).  Then in this high anti-correlation case, still with a = 0, Equation (7) gives

  (9)  Var[W(t)] = 0.1 s2^2 t  (instead of s2^2 t)

In the case of decay but no anti-correlation, a > 0 and s3 = 0 (so R3(t) = 0 with probability 1).  Now, as t gets large, we have

  (10)  Var[W(t)] = (s1^2 + s2^2)/(2a - a^2)

so the variance does not increase without limit as in the a = 0 case.  But with a > 0, the mean also changes, and for large t Equation (4) implies it is

  (11)  E[W(t)] ~ bt/a + (b + c + d - b/a)/a

Now if we choose b = a(c+d) then that becomes (c+d)(t+1), which is fairly indistinguishable from the (c+d)t in Equation (8) derived from a=0, so we have derived a similar expectation but a smaller variance in Equation (10).
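To put some numbers on this, the following Python fragment (my own illustration; only the 1.664 figure is taken from above, and the decay rate a = 0.2 is an arbitrary hypothetical) compares the +/- spread at t = 80 implied by Equations (8), (9) and (10), with s1 = 0 throughout.

import math

s2 = 1.664          # per-step standard deviation from the LCF term (Section D)
t = 80

# (i) Frank-like case: a = 0, no anti-correlation -> Equation (8)
sd_frank = math.sqrt(t) * s2

# (ii) anti-correlated case: s3 = s2, r = 0.95, a = 0 -> Equation (9)
sd_anticorr = math.sqrt(0.1 * s2**2 * t)

# (iii) decaying case: a = 0.2 (hypothetical), s3 = 0 -> Equation (10) for large t
a = 0.2
sd_decay = math.sqrt(s2**2 / (2 * a - a**2))

print(round(sd_frank, 2), round(sd_anticorr, 2), round(sd_decay, 2))
# roughly 14.9, 4.7, 2.8: much the same fitted mean trajectory, very different spreads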

To streamline the notation, now let the parameters a, b, c, d, r be placed in a vector u, and let

  (12)  E[W(t)] = mw(t;u),  Var[W(t)] = sw^2(t;u)

(I am using a subscript ‘w’ for statistics relating to W(t), and ‘m’ for those relating to M(t).)  With 4 parameters (a, b, c+d, r) to set here, how should we choose the “best”?  Well, comparisons of W(t) with M(t) and X(t) can be made, the latter just in the calibration period t = 1 to t = P.  The nature of comparisons depends on whether or not just one, or many, observations of the series M(t) are available.

Case 1: Many series

With a deterministic black box, many observed series can be created if small perturbations are made to initial conditions and if the evolution of the black box output is mathematically chaotic.  In this case, a mean mm(t) and a standard deviation sm(t) can be derived from the many series.  Then curve fitting can be applied to mw(t;u) – mm(t) and sw(t;u) – sm(t) by varying u.  Something like Akaike’s Information Criterion (AIC) might be used for comparing competing models.  But in any case it should be easy to notice whether sm(t) grows like sqrt(t), as in the a=0 case, or tends to a limit, as in the a>0 case.

Case 2: One series

If chaotic evolution is not sufficient to randomize the black box, or if the black box owner cannot be persuaded to generate multiple series, there may be only one observed series m(t) of the random variable M(t).  In this case Var[M(t)] cannot be estimated unless some functional form, such as g+ht, is assumed for mm(t), when (m(t)-g-ht)^2 becomes a single-observation estimate of Var[M(t)] for each t, allowing an assumed constant variance to be estimated.  So some progress in fitting W(t;u) to m(t) may still be possible in this case.
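As an illustration of the one-series case, here is a small Python sketch (the “black box” series is synthetic, invented purely for the example) which fits g + ht by least squares and uses the squared residuals to estimate a constant Var[M(t)].

import numpy as np

# Fit m_m(t) = g + h*t to a single series m(t), then estimate a constant Var[M(t)]
# from the squared residuals.  The series here is synthetic, for illustration only.
rng = np.random.default_rng(0)
P = 100
t = np.arange(1, P + 1)
m = 0.5 + 0.02 * t + rng.normal(0, 0.15, P)   # pretend black-box output

h, g = np.polyfit(t, m, 1)                    # least-squares slope and intercept
residuals = m - (g + h * t)
var_hat = residuals.var(ddof=2)               # estimate of the (assumed constant) Var[M(t)]
print(g, h, var_hat)   # an emulator's sw^2(t;u) can now be compared with var_hat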

Pat Frank’s paper effectively uses a particular W(t;u) (see Equation (8) above) which has fitted mw(t;u) to mm(t), but ignores the variance comparison.  That is, s2 in (8) was chosen from an error term from LCF without regard to the actual variance of the black box output M(t).

Summary of section D:

  • Pat Frank’s emulator model is a special case of the models presented in Section B, where error variance is given by Equation (7).
  • More general parameters can lead to lower propagation of error variance over time (or indeed, higher).
  • Fitting emulator mean to black box mean does not discriminate between emulators with differing error variances.
  • Comparison of emulator to randomized black box runs can achieve this discrimination.

E. Error and Uncertainty

In the sections above I have made scant reference to “uncertainty”, and a lot to probability theory and error distributions.  Some previous Commenters repeated the mantra “error is not uncertainty”, and this section addresses that question.  Pat Frank and others referred to the following “bible” for measurement uncertainty:

https://www.bipm.org/utils/common/documents/jcgm/JCGM_100_2008_E.pdf ; that document is replete with references to probability theory.  It defines measurement uncertainty as a parameter which is associated with the result of a measurement and that characterizes the dispersion of the values that could reasonably be attributed to the measurand.  It acknowledges that the dispersion might be described in different ways, but gives standard deviations and confidence intervals as principal examples.  The document also says that that definition is not inconsistent with two other definitions of uncertainty, which include the difference between the measurement and the true value. 

Here I explain why they might be thought consistent, using my notation above.  Let M be the measurement, and X again be the true value to infinite precision (OK, perhaps only to within Heisenberg quantum uncertainty.)  Then the JCGM’s main definition is a parameter associated with the statistical distribution of M alone, generally called “precision”, whereas the other two definitions are respectively a function of M-X and a very high confidence interval for X.  Both of those include X, and are predicated on what is known as the “accuracy” of the measurement of M.  (The JCGM says this is unknowable, but does not consider the possibility of a different and highly accurate measurement of X.)  Now, M-X is just a shift of M by a constant, so the dispersion of M around its mean is the same as the dispersion of M-X around its mean.  So provided that uncertainty describes dispersion (most simply measured by variance) and not location, they are indeed the same.  And importantly, the statistical theory for compounding variance is the same in each case.

Where does this leave us with respect to error versus uncertainty?  Assuming that X is a single fixed value, then prior to measurement, M-X is a random variable representing the error, with some probability distribution having mean mm-X and standard deviation sm.  b = mm-X is known as the bias of the measurement, and +/-sm is described by the JCGM 2.3.1 as the “standard” uncertainty parameter.  So standard uncertainty is just the s.d. of error, and more general uncertainty is a more general description of the error distribution relative to its mean.

There are two ways of finding out about sm: by statistical analysis of multiple measurements (if possible) or by appealing to an oracle, such as the manufacturer of the measurement device, who might supply information over and beyond the standard deviation.  In both cases the output resolution of the device may have some bearing on the matter. 

However, low uncertainty is not of much use if the bias is large.  The real error statistic of interest is E[(M-X)^2] = E[((M-mm)+(mm-X))^2] = Var[M] + b^2, covering both a precision component and an accuracy component.
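A few lines of Python (illustrative numbers only, my own sketch) make the decomposition concrete: simulate a biased, noisy measurement of a fixed true value and check that the mean square error equals Var[M] + b^2.

import numpy as np

rng = np.random.default_rng(2)
X = 20.0                 # true value (fixed)
bias, sigma = 0.3, 0.5   # hypothetical measurement bias and standard deviation
M = rng.normal(X + bias, sigma, 1_000_000)

mse = np.mean((M - X) ** 2)
print(mse, sigma**2 + bias**2)   # the two agree closely, at about 0.34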

Sometimes the uncertainty/error in a measurement is not of great consequence per se, but feeds into a parameter of a mathematical model and thence into the output of that model.  This is the case with LCF feeding into radiative forcings in GCMs and then into temperature, and likewise with Pat Frank’s emulator of them.  But the theory of converting variances and covariances of input parameter errors into output error via differentiation is well established, and is given in Equation (13) of the JCGM.

To illuminate the above, we now turn to some examples, principally provided by Pat Frank and Commenters.

Example 1: The 1-foot end-to-end ruler

In this example we are given a 1-foot ruler with no gaps at the ends and no markings, and the manufacturer assures us that the true length is 12”+/-e”; originally e = 1 was chosen, but as that seems ridiculously large I shall choose e = 0.1 here.  So the end-to-end length of the ruler is in error by up to 0.1” either way, and furthermore the manufacturer assures us that any error in that interval is equally likely. I shall repeat a notation I introduced in an earlier blog comment, which is to write 12+/-_0.1 for this case, where the _ denotes a uniform probability distribution, instead of a single standard deviation for +/-.  (The standard deviation for a random variable uniform in [-a,a] is a/sqrt(3) = 0.577a, so b +/-_ a and b +/- 0.577a are loosely equivalent, except that the implicit distributions are different.  This is covered in the JCGM, where “rectangular” is used in place of “uniform”.)

Now, I want to build a model train table 10 feet long, to as high an accuracy as my budget and skill allow.  If I have only 1 ruler, it is hard to see how I can do better than get a table which is 120+/-_1.0”.  But if I buy 10 rulers (9 rulers and 1 ruler to rule them all would be apt if one of them was assured of accuracy to within a thousandth of an inch!), and I am assured by the manufacturer that they were independently machined, then by the rule of addition of independent variances, the uncertainty in the sum of the lengths is sqrt(10) times the uncertainty of each.

So using all 10 rulers placed end to end, the expected length is 120” and the standard deviation (uncertainty) gets multiplied by sqrt(10) instead of 10 for the single ruler case, an improvement by a factor of 3.16.  The value for the s.d. is 0.0577 sqrt(10) = 0.183”.   

To get the exact uncertainty distribution we would have to do what is called convolving of distributions to find the distribution of the sum_1^10 (X_i-12).  It is not a uniform distribution, but looks a little like a normal distribution, as expected from the Central Limit Theorem.  Its “support” is of course not infinite, being the interval (-1”,+1”), but the density does tail off smoothly at the edges.  (In fact, recursion shows that the probability of the total error being less than -1+x, for 0<x<0.2, is (5x)^10/10!  That ! is a factorial, and with -1+x = -0.8 it gives the small probability of 2.76e-7, a tiny chance of the error lying within 20% of its extreme possible value.)
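A quick simulation (my own sketch, not part of the original argument) reproduces the 0.183” standard deviation of the 10-ruler sum and evaluates the exact tail formula quoted above.

import numpy as np
from math import factorial

# Each ruler error is uniform on +/-0.1", and the total error of the end-to-end
# measurement is their sum.  The Monte Carlo standard deviation reproduces 0.183";
# the exact lower-tail probability P(total error < -0.8") = (5x)^10/10! at x = 0.2
# is far too small to see by simulation of this size.
rng = np.random.default_rng(3)
errors = rng.uniform(-0.1, 0.1, (1_000_000, 10)).sum(axis=1)

print(errors.std())                       # ~0.183"
print((5 * 0.2) ** 10 / factorial(10))    # 2.7557e-07, the text's 2.76e-7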

Now that seemed like a sensible use of the 10 rulers, but oddly enough it isn’t the best use.  Instead, sort them by length, and use the shortest and longest 5 times over.  We could do this even if we bought n rulers, not equal to 10.  We know by symmetry that the shortest plus longest has a mean error of 0, but calculating the variance is more tricky.

The error of the ith shortest ruler, plus 0.1, times 5, say Yi, has a Beta distribution (range from 0 to 1) with parameters (i, n+1-i).  The variance of Yi is i(n+1-i)/((n+1)^2(n+2)), which can be found at https://en.wikipedia.org/wiki/Beta_distribution .  Now

  Var[Y1 + Yn] = 2(Var[Y1] + Cov[Y1,Yn]) by symmetry.

Unfortunately that Wikipedia page does not give that covariance, but I have derived this to be

  (13)  Cov[Yi,Yj] = i(n+1-j) / [(n+2)(n+1)^2]  if i <= j, so
  (14)  Var[Y1 + Yn] = 2(n+1) / [(n+2)(n+1)^2] = 2 / [(n+2)(n+1)]

Using the two rulers 5 times multiplies the variance by 25, but removing the scaling of 5 in going from ruler to Yi cancels this.  So (14) is also the variance of the error of our final measurement.

Now take n = 10 and we get uncertainty = square root of variance = sqrt(2/132) = 0.123”, which is less than the 0.183” from using all 10 rulers.  But if we were lavish and bought 100 rulers, it would come down to sqrt(2/10302) = 0.014”.

Having discovered this trick, it would be tempting to extend it and use (Y1 + Y2 + Yn-1 + Yn)/2.  But this doesn’t help, as the variance for that is 5(n+1)/[2(n+2)(n+1)^2] = 5/[2(n+2)(n+1)], which is bigger than (14).

I confess it surprised me that it is better to use the extremal rulers rather than the mean of them all.  But I tested the mathematics both by Monte Carlo and by comparing the variance of the sum of the n sorted rulers, computed via (13), with the variance of the sum of the n unsorted rulers, and for n=10 they agreed exactly.  I think the method is effective because the variance of the extremal rulers is small: their lengths bump up against the hard limits of the uniform distribution.
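Here is a compact version of such a Monte Carlo check in Python (my own sketch): it compares laying all ten rulers end to end with using the shortest and longest five times each.

import numpy as np

# Two strategies for n = 10 rulers, each 12 +/-_ 0.1":
#   (i)  lay all ten end to end;
#   (ii) sort them and use the shortest and longest five times each.
# The standard deviations should come out near 0.183" and 0.123".
rng = np.random.default_rng(4)
n_trials, n = 200_000, 10
lengths = 12 + rng.uniform(-0.1, 0.1, (n_trials, n))

all_ten = lengths.sum(axis=1)                                  # strategy (i)
extremes = 5 * (lengths.min(axis=1) + lengths.max(axis=1))     # strategy (ii)

print(all_ten.mean(), all_ten.std())       # ~120, ~0.183
print(extremes.mean(), extremes.std())     # ~120, ~0.123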

That inference is confirmed by Monte Carlo experiments with, in addition to the uniform, a triangular and a normal distribution for Yi, still wanting a total length of 10 rulers, but having acquired n=100 of them.  The triangular has the same range as the uniform, and half the variance, and the normal has the same variance as the uniform, implying that the endpoints of the uniform represent +/-sqrt(3) standard deviations for the normal, covering 92% of its distribution.

In the following table 3 subsets of the 100 are considered, pared down from a dozen or so experiments.  Each subset is optimal, within the experiments tried, for one or more distribution (starred).  A subset a,b,c,… means that the a-th shortest and longest rulers are used, and the b-th shortest and longest etc. The fraction following the distribution is the variance of a single sample.  The decimal values are variances of the total lengths of the selected rulers then scaled up to 10 rulers.

                                Variances

dist\subset          1            1,12,23,34,45    1,34

U(0,1)      1/12     0.00479*     0.0689           0.0449

N(0,1/12)   1/12     0.781        0.1028*          0.2384

T(0,1)      1/24     0.0531       0.0353           0.0328*

We see that by far the smallest variance, 0.00479, occurs if we are guaranteed a uniform distribution, by using a single extreme pair, but that strategy isn’t optimal for the other 2 distributions.  5 well-spaced pairs are best for the normal, and quite good for the triangular, though the latter is slightly better with 2 well-spaced pairs.

Unless the manufacturer can guarantee the shape of the error distribution, the assumption that it is uniform would be quite dangerous in terms of choosing a strategy for the use of the available rulers.

Summary of Section E:

  • Uncertainty should properly be thought of as the dispersion of a distribution of random variables, possibly “hidden”, representing errors, even though that distribution might not be fully specified.
  • In the absence of clarification, a +/-u uncertainty value should be taken as one standard deviation of the error distribution.
  • The assumption, probably through ignorance, that +/-u represents a sharply bounded uniform (or “rectangular”) distribution, allows clever tricks to be played on sorted samples yielding implausibly small variances/uncertainties.
  • The very nature of errors being compounded from multiple sources supports the idea that a normal error distribution is a good approximation.

F. Uniform Uncertainty (compared to Trapezium Uncertainty)

As an interlude between examples, in this section we study further implications of a uniform uncertainty interval, most especially for a digital device.  By suitable scaling we can assume that the possible outputs are a complete range of integers, e.g. 0 to 1000.  We use Bayesian statistics to describe the problem.

Let X be a random variable for the true infinitely precise value which we attempt to measure.

Let x be the value of X actually occurring at some particular time.

Let M be our measurement, a random variable but including the possibility of zero variance.  Note that M is an integer.

Let D be the error, = M – X.

Let f(x) be a chosen (Bayesian) prior probability density function (p.d.f.) for X, P[X’=’x].

Let g(y;x) be a probability function (p.f.) for M over a range of integer y values, dependent on x, written g(y;x) = P[M=y | X’=’x]  (the PRECISION distribution).

Let c be a “constant” of proportionality, determined in each separate case by making relevant probabilities add up to 1.  Then after measurement M, the posterior probability for X taking the value x is, by Bayes’ Theorem,

  (15)  P[X’=’x | M=y]  =  P[M=y | X’=’x] P[X’=’x] / c  =  g(y;x) f(x) / c

Usually we will take f(x) = P[X ‘=’ x] to be an “uninformative” prior, i.e. uniform over a large range bound to contain x, so it has essentially no influence.  In this case,

  (16)  P[X’=’x | M=y] = g(y;x)/c, where c = ∫ g(y;x) dx  (the UNCERTAINTY distribution).

Then P[D=z | M=y] = P[X=M-z | M=y] = g(y;y-z)/c.  Now assume that g() is translation invariant, so g(y;y-z) = g(0;-z) =: c h(z) defines the function h(), with ∫ h(z)dz = 1.  Then

  (17)  P[D=z | M=y] = h(z), independent of y  (ERROR DISTRIBUTION = shifted UNCERTAINTY distribution).

In addition to this distribution of error given observation, we may also be interested in the distribution of error given the true (albeit unknown) value.  (It took me a long time to work out how to evaluate this.)

Let A be the event {D = z}, B be {M = y}, C be {X = x}.  These events have a causal linkage, which is that they can simultaneously occur if and only if z = y-x.  And when that equation holds, so z can be replaced by y-x, then given that one of the events holds, either both or none of the other two occur, and therefore they have equal probability.  It follows that:

  P[A|C] = P[B|C] = P[C|B]P[B]/P[C] = P[A|B]P[B]/P[C]

  (18)  P[D = z = y-x | X = x] = P[D = y-x | M = y] P[M = y] / P[X = x]

Of the 3 terms on the RHS, the first is h(y-x) from Equation (17), the third is f(x) from Equation (15), and the second is a new prior.  This prior must be closely related to f(), which we took to be uninformative, because M is an integer value near to X.  The upshot is that under these assumptions the LHS is proportional to h(y-x), so

  (19)  P[D = y-x | X = x] = h(y-x) / ∑_i h(i-x)

Let x’ be the nearest integer to x, and a = x’-x, lying in the interval [-1/2,1/2).  Then y-x = y+a-x’ = a+k where k is an integer.  Then the mean m and variance s2 of D given X=x are:

  (20)  m = ∑_k (a+k) h(a+k) / ∑_k h(a+k);   s^2 = ∑_k (a+k-m)^2 h(a+k) / ∑_k h(a+k)

A case of obvious interest would be an uncertainty interval which was +/-e uniform.  That would correspond to h(z) = 1/(2e) for b-e < z < b+e and 0 elsewhere, where b is the bias of the error.  We now evaluate the statistics for the case b = 0 and e ≤ 1.  The symmetry in a means that we need only consider a > 0.  -e < a+k < e implies that -e-a < k < e-a.  If e < ½ then there is an a slightly bigger than e such that no integer k is in the interval, which is impossible, so e is at least ½.  Since h(z) is constant over its range, in (20) cancellation allows us to replace h(a+k) with 1.

  (21)  If a < 1-e then only k=0 is possible, and m = a, s^2 = 0.
  (22)  If a > 1-e then k=-1 and k=0 are both possible, and m = a - ½, s^2 = ¼.

When s^2 is averaged over all a we get 2(e-½)(¼) = (2e-1)/4.

It is not plausible for e to be ½, for then s^2 would be 0 whatever the fractional part a of x was.  Since s^2 is the variance of M-X given X=x, that implies that M is completely determined by X.  That might sound reasonable, but in this example it means that as X changes from 314.499999 to 314.500000, M absolutely has to flip from 314 to 315, and that implies that the device, despite giving output resolution to an integer, actually has infinite precision, and is therefore not a real device.

For e > ½, s^2 is zero for a in an interval of width 2-2e, and non-zero in two intervals of total width 2e-1.  In these intervals for a (translating to x), it is non-deterministic as to whether the output M is 314, say, or 315.

In Equations (21) and (22) there is a disconcerting discontinuity in the expected error m, which jumps from 1-e just below a = 1-e to ½-e just above it.  This arises from the cliff edge in the uniform h(z).  More sophisticated functions h(z), such as a normal distribution, a triangular distribution, or a trapezium distribution, do not exhibit this feature; for example:

  (23)  h(z) =

{ 2(z+3/4) for -3/4<z<-1/4

{ 1 for -1/4<z<1/4

{ 2(3/4-z) for 1/4<z<3/4

For this example we find

  (24)  if 0 < a < 1/4, m = a and s^2 = 0,

           if 1/4 < a < 1/2, m = 1/2-a, s^2 = 4(a-1/4)(3/4-a) <= 1/4

Note that the discontinuity previously noted does not occur here, as m is a continuous function of a even at a = 1/4.  The averaged s^2 is 1/12, less than the 1/8 from the U[-3/4,3/4] distribution.
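For anyone wishing to reproduce these figures, here is a short Python sketch (my own illustration) which evaluates Equation (20) numerically for the uniform h on +/-3/4 and the trapezium of Equation (23): it shows the jump in m near a = 1/4 for the uniform, continuity for the trapezium, and averaged variances close to 1/8 and 1/12.

import numpy as np

def moments(h, a, kmax=5):
    # conditional mean and variance of the error given the true value, i.e.
    # Equation (20) with the sums truncated at |k| <= kmax
    k = np.arange(-kmax, kmax + 1)
    w = h(a + k)
    w = w / w.sum()
    m = np.sum((a + k) * w)
    s2 = np.sum((a + k - m) ** 2 * w)
    return m, s2

h_uniform = lambda z: np.where(np.abs(z) < 0.75, 1 / 1.5, 0.0)

def h_trapezium(z):
    z = np.abs(z)
    return np.where(z < 0.25, 1.0, np.where(z < 0.75, 2 * (0.75 - z), 0.0))

a_grid = np.linspace(-0.5, 0.5, 2001)
for name, h in (("uniform", h_uniform), ("trapezium", h_trapezium)):
    means_near_quarter = [round(moments(h, a)[0], 3) for a in (0.2, 0.24, 0.26, 0.3)]
    avg_s2 = np.mean([moments(h, a)[1] for a in a_grid])
    print(name, means_near_quarter, round(avg_s2, 4))
# expect: uniform    [0.2, 0.24, -0.24, -0.2]  ~0.125  (jump in m; average (2e-1)/4 = 1/8)
#         trapezium  [0.2, 0.24,  0.24,  0.2]  ~0.083  (m continuous; average 1/12)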

All the above is for a device with a digital output, presumed to change slowly enough to be read reliably by a human.  In the case of an analogue device, like a mercury thermometer, then a human’s reading of the device provides an added error/uncertainty.  The human’s reading error is almost certainly not uniform (we can be more confident when the reading is close to a mark than when it is not), and in any case the sum of instrument and human error is almost certainly not uniform.

Summary of section F:

  • The PRECISION distribution, of an output given the true state, induces an ERROR distribution given some assumptions on translation invariance and flat priors.
  • The range of supported values of the error distribution must exceed the output resolution width, since otherwise infinite precision is implied.
  • Even when that criterion is satisfied, the assumption of a uniform ERROR distribution leads to a discontinuity in mean error as a function of the true value.
  • A corollary is that if your car reports ambient temperature to the nearest half degree, then sometimes, even in steady conditions, its error will exceed half a degree.

G. Further Examples

Example 2: the marked 1-foot ruler

In this variant, the rulers have markings and an indeterminate length at each end.  Now multiple rulers cannot usefully be laid end to end, and the human eye must be used to judge and mark the 12” positions.  This adds human error/uncertainty to the measurement process, which varies from human to human, and from day to day.  The question of how hard a human should try in order to avoid adding significant uncertainty is considered in the next example.

Example 3: Pat Frank’s Thermometer

Pat Frank introduced the interesting example of a classical liquid-in-glass (LiG) thermometer whose resolution is (+/-)0.25K.  He claimed that everything inside that half-degree interval was a uniform blur, but went on to explain that the uncertainty was due to at least 4 things, namely the thermometer capillary is not of uniform width, the inner surface of the glass is not perfectly smooth and uniform, the liquid inside is not of constant purity, the entire thermometer body is not at constant temperature.  He did not include the fact that during calibration human error in reading the instrument may have been introduced.  So the summation of 5 or more errors implies (except in mathematically “pathological” cases) that the sum is not uniformly distributed.  In fact a normal distribution, perhaps truncated if huge errors with infinitesimal probability are unpalatable, makes much more sense.

The interesting question arises as to what the (hypothetical) manufacturers meant when they said the resolution was +/-0.25K.  Did they actually mean a 1-sigma, or perhaps a 2-sigma, interval?  For deciding how to read, record, and use the data from the instrument, that information is rather vital.

Pat went on to say that a temperature reading taken from that thermometer and written as, e.g., 25.1 C, is meaningless past the decimal point. (He didn’t say, but presumably would consider 25.5 C to be meaningful, given the half-degree uncertainty interval.)  But this isn’t true; assuming that someone cares about the accuracy of the reading, it doesn’t help to compound instrumental error with deliberate human reading error.  Suppose that the quoted +/-0.25K actually corresponds to a 2-sigma interval, the manufacturer wanting to give a reasonably firm bound; then the standard deviation is (1/2)(1/4) and the variance is ((1/2)(1/4))^2 = 1/64.  If t^2 is the error variance of the observer, then the final variance is 1/64 + t^2.

The observer should not aim for a ridiculously low t, even if achievable, and perhaps a high t is not so bad if the observations are not that important.  But beware: observations can increase in importance beyond the expectations of the observer.  For example we value temperature observations from 1870 because they tell us about the idyllic pre-industrial, pre-climate change, world!  In the present example, I would recommend trying for t^2 = 1/100, or as near as can be achieved within reason.  Note that if the observer can manage to read uniformly within +/-0.1 C, then that means t^2 = 1/300.  But if instead she reads to within +/-0.25, t^2 = 1/48 and the overall variance is multiplied by (1+64/48) = 7/3 ~ 2.33 relative to the instrument alone (a factor of about 1.53 in standard deviation), which is a significant impairment of precision.
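The arithmetic is easily checked; a minimal Python sketch, under the same assumptions as above (the quoted +/-0.25 K taken as a 2-sigma bound, uniform reading errors):

import math

instrument_var = (0.25 / 2) ** 2        # 1/64
t2_careful = 0.1 ** 2 / 3               # uniform reading within +/-0.1 C  -> 1/300
t2_sloppy  = 0.25 ** 2 / 3              # uniform reading within +/-0.25 C -> 1/48

careful = instrument_var + t2_careful
sloppy  = instrument_var + t2_sloppy
print(sloppy / instrument_var)                  # 7/3 ~ 2.33 (variance factor)
print(math.sqrt(sloppy / instrument_var))       # ~1.53 (standard-deviation factor)
print(sloppy / careful)                         # ~1.92 relative to careful reading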

The moral is that it is vital to know what uncertainty variance the manufacturer really believes to be the case, that guidelines for observers should then be appropriately framed, and that sloppiness has consequences.

Summary of Section G:

  • Again, real life examples suggest the compounding of errors, leading to approximately normal distributions.
  • Given a reference uncertainty value from an analogue device, if the observer has the skill and time and inclination then she can reduce overall uncertainty by reading to a greater precision than the reference value.

H. The Implications for Pat Frank’s Paper

The implication of Section B is that a good emulator can be run with pseudorandom numbers and give output which is similar to that of the black box.  The implication of Section E is that uncertainty analysis is really error analysis, and good headway can be made by postulating the existence of hidden random variables through which statistics can be derived.  The implication of Section D is that many emulators of GCM outputs are possible, and just because a particular one seems to fit mean values quite well does not mean that the nature of its error propagation is correct.  The only way to arbitrate between emulators would be to carry out Monte Carlo experiments with the black boxes and the emulators.  This might be expensive, but assuming that emulators have any value at all, it would increase this value.

Frank’s emulator does visibly give a decent fit to the annual means of its target, but that isn’t sufficient evidence to assert that it is a good emulator.  Frank’s paper claims that GCM projections to 2100 have an uncertainty of +/- at least 15K.  Because, via Section E, uncertainty really means a measure of dispersion, this means that Equation (1) with the equivalent of Frank’s parameters, using many examples of 80-year runs, would show an envelope where a good proportion would reach +15K or more, a good proportion would reach -15K or less, and a good proportion would not reach those bounds.  This is just the nature of random walks with square-root-of-time evolution.

But the GCM outputs represented by CMIP5 do not show this behaviour, even though, climate being chaotic, different initial conditions should lead to such variety.  Therefore Frank’s emulator is not objectively a good one.  And the reason is that, as mentioned in Section C, the GCMs have corrective mechanisms to cancel out TOA imbalances except for, presumably, those induced by the rather small increase of greenhouse gases from one iteration to the next.

However the real value in Frank’s paper is first the attention drawn to the relatively large annual errors in the radiation budget arising from long wave cloud forcing, and second the revelation through comments on it that GCMs have ways of systematically squashing these errors.

Summary of Section H:

  • Frank’s emulator is not good in regard to matching GCM output error distributions.
  • Frank’s paper has valuable data on LCF errors.
  • Thereby it has forced “GCM auto-correction” out of the woodwork.

I. The Implications for GCMs

The “systematic squashing” of the +/-4 W/m^2 annual error in LCF inside the GCMs is an issue of which I for one was unaware before Pat Frank’s paper. 

The implication of comments by Roy Spencer is that there really is something like a “magic” component R3(t) anti-correlated with R2(t), though the effect would be similar if it was anti-correlated with R2(t-1) instead, which might be plausible with a new time step doing some automatic correction of overshooting or undershooting on the old time step.  GCM experts would be able to confirm or deny that possibility.

In addition, there is the question of a decay rate a, so that only a proportion (1-a) of previous forcing carries into the next time step, as justified by the heat reservoir concept in Section C.  After all, GCMs presumably do try to model the transport of heat in ocean currents, with concomitant heat storage.

It is very disturbing that GCMs have to resort to error correction techniques to achieve approximate TOA balance.  The two advantages of doing so are that they are better able to model past temperatures, and that they do a good job in constraining the uncertainty of their output to the year 2100.  But the huge disadvantage is that it looks like a charlatan’s trick; where is the vaunted skill of these GCMs, compared with anyone picking their favourite number for climate sensitivity and drawing straight lines against log(CO2)?  In theory, an advantage of GCMs might be an ability to explain regional differences in warming.  But I have not seen any strong claims that that is so, with the current state of the science.

Summary of Section I:

  • Auto-correction of TOA radiative balance helps to keep GCMs within reasonable bounds.
  • Details of how this is done would be of great interest; the practice seems dubious at best because it highlights shortcomings in GCMs’ modelling of physical reality.
