# More Than You Ever Wanted to Know About Calibrations, Part 4 – Calibration Acceptance

1 Feb 2023

In my previous calibration blog posts I’ve discussed calibration types, curve fits, and zero points. In this post we’ll cover how to tell if your calibration is accurate. The previous posts have been a bit dense and math heavy, but the TL;DR on this one is pretty simple: **r^{2} is bad and you shouldn’t use it.**

It might seem a bit presumptuous of me to off-handedly dismiss a calibration metric that’s been in use longer than I’ve been alive, but bear with me. To start, let’s talk about what r^{2} really is. r^{2} is generally known as the coefficient of determination. It was, as far as I can tell, developed in 1921 by Sewall Wright, Senior Animal Husbandman at the USDA (the original paper is here, for those interested in historical scientific publishing). He describes his work as “…an attempt to present a method of measuring the direct influence along each separate path in such a system and thus of finding the degree to which variation of a given effect is determined by each particular cause... In cases in which the causal relations are uncertain the method can be used to find the logical consequences of any particular hypothesis in regard to them.” In short, if you think things are causally connected and build a mathematical model to explain the connection, r^{2} gives you a metric for the accuracy of that model.

For instrument calibrations this doesn’t seem to be very helpful. We don’t need a tool to tell us that instrument response is causally connected to sample concentration; that’s built into the design of the instrument. What we need is a tool that tells us how accurate our calibration model is. Fortunately, r^{2} pulls double duty as a measure of goodness of fit, which from the name seems to be what we’re after. Unfortunately, this goodness of fit measurement doesn’t actually tell us much about the accuracy of the calibration. To see why, let’s look at how r^{2} is calculated.

r^{2} is defined as the ratio of the variance of the fitted values (those calculated from the regression model) to the variance of true values, with the equation shown in Figure 1. The sum of squares in the numerator is the variance of the fitted values, and the sum of squares in the denominator is the variance of the true values.

Figure 1 – Equation for the calculation of r^{2}.
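Written out from the description above, and using the symbols defined in the next paragraph, the ratio can be expressed as follows. This is a reconstruction from the text, and *n* is my own label for the number of calibration points.

```latex
r^{2} = \frac{\sum_{i=1}^{n} (f_i - \bar{y})^{2}}{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}
```

Strictly speaking, each sum of squares would be divided by the same count to turn it into a variance, but that factor cancels in the ratio.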

In this equation, the values of *f_{i}* are the calculated values from the regression, the values of *y_{i}* are the true y values, and *ȳ* is the mean of the observed y values. For a least squares type regression this ends up being equal to the square of the Pearson correlation coefficient (usually designated r, hence why the coefficient of determination is r^{2}), and ranges in value from 0 to 1. In true math textbook fashion though, I’ll leave the determination of that as an exercise to the reader. In any case, if the variance in both is equal, that is to say the regression perfectly matches the given data, then r^{2} = 1. r^{2} values less than 1 indicate that there is a mismatch between the model and the true values, and the further from 1 the greater the disagreement.
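For readers who would rather check that equivalence numerically than prove it, here is a minimal sketch. It fits an unweighted least squares line to a handful of made-up points and computes r^{2} both ways. The data and the helper names are mine, for illustration only, and are not from any real calibration.

```python
from math import sqrt

def least_squares_line(x, y):
    """Return (slope, intercept) of the unweighted least squares line."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def r_squared(y, fitted):
    """Ratio of the fitted sum of squares to the observed sum of squares."""
    y_mean = sum(y) / len(y)
    explained = sum((fi - y_mean) ** 2 for fi in fitted)
    total = sum((yi - y_mean) ** 2 for yi in y)
    return explained / total

def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Made-up illustration data (x = concentration, y = response).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]

slope, intercept = least_squares_line(x, y)
fitted = [slope * xi + intercept for xi in x]

r2_from_sums = r_squared(y, fitted)
r2_from_pearson = pearson_r(x, y) ** 2
print(r2_from_sums, r2_from_pearson)
```

The two printed values should agree to within floating-point rounding. The agreement relies on the fit being an ordinary least squares line with an intercept; for other kinds of fit the two quantities can differ.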

So far that sounds like a good measure of calibration accuracy, but if you look closely at the equation in Figure 1 you’ll see that it deals in absolute variance, not relative variance. This means that for calibration curves that span multiple orders of magnitude, a high r^{2} value may only indicate a good fit at the high end of the curve. As far as r^{2} is concerned, 1% off at 100 ppm is the same as 100% off at 1 ppm. Also, as you may recall from the previous blog on curve fits, unweighted curves minimize total absolute error, so they are also biased towards the high end of the curve. This means that using r^{2} as a measure of calibration accuracy will point you towards unweighted curves that are less accurate overall in relative terms. Table 1 shows an example of this from an ethylene oxide calibration I did, where the best r^{2} value is associated with the calibration with the lowest total absolute error, but with very high % error on the lowest calibration points.
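To put numbers on the 100 ppm versus 1 ppm comparison, here is a minimal sketch (the function names are my own). It computes the squared residual, which is the absolute quantity a least squares fit minimizes and the one that pulls r^{2} below 1, alongside the percent error for each case.

```python
def squared_residual(true_value, fitted_value):
    """Contribution of one point to a residual sum of squares."""
    return (fitted_value - true_value) ** 2

def percent_error(true_value, fitted_value):
    """Absolute error expressed relative to the true value."""
    return abs(fitted_value - true_value) / true_value * 100.0

# (true concentration, fitted concentration), both in ppm
high_point = (100.0, 101.0)  # 1% off at 100 ppm
low_point = (1.0, 2.0)       # 100% off at 1 ppm

for label, (true_value, fitted_value) in [("high", high_point), ("low", low_point)]:
    print(label,
          squared_residual(true_value, fitted_value),
          percent_error(true_value, fitted_value))
```

Both points add the same amount to the squared error, even though one is a rounding-level miss and the other is off by a factor of two.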

| EtO calculated result | ppt | ppt | ppt | ppt | ppt | ppt | ppt | %RSE | r |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linear equal wt. (pptv) | 16.22 | 35.54 | 138.09 | 254.53 | 778.60 | 1298.84 | 2686.18 | 32.37% | 0.997388 |
| absolute error | 17.78 | 31.46 | 4.09 | 14.47 | 106.60 | 45.16 | 1.82 | total | 221.38 |
| % error | 52% | 47% | 3% | 5% | 16% | 3% | 0% | total | 127% |
| Linear 1/x (pptv) | 33.69 | 52.55 | 152.63 | 266.28 | 777.77 | 1285.52 | 2639.56 | 13.64% | 0.994124 |
| absolute error | 0.31 | 14.45 | 18.63 | 2.72 | 105.77 | 58.48 | 48.44 | total | 248.80 |
| % error | 1% | 22% | 14% | 1% | 16% | 4% | 2% | total | 59% |
| Linear 1/x | 36.06 | 54.4 | 151.75 | 262.29 | 759.81 | 1253.70 | 2570.77 | 12.70% | 0.980645 |
| absolute error | 2.06 | 12.60 | 17.75 | 6.71 | 87.81 | 90.30 | 117.23 | total | 334.46 |
| % error | 6% | 19% | 13% | 2% | 13% | 7% | 4% | total | 65% |

Table 1 – Ethylene oxide (EtO) calibration and residual error at each calibration point.
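The error rows in Table 1 can be reproduced with a few lines of code. In this sketch the nominal concentrations (34, 67, 134, 269, 672, 1344, and 2688 pptv) are not listed in the table itself. I’ve backed them out by combining each calculated result with its absolute error, so treat them as inferred. Only the first two fits are included, and the variable and function names are illustrative.

```python
# Nominal levels inferred from Table 1 (calculated result +/- absolute error), pptv.
nominal = [34.0, 67.0, 134.0, 269.0, 672.0, 1344.0, 2688.0]

# Calculated results copied from Table 1, pptv.
fits = {
    "Linear equal wt.": [16.22, 35.54, 138.09, 254.53, 778.60, 1298.84, 2686.18],
    "Linear 1/x": [33.69, 52.55, 152.63, 266.28, 777.77, 1285.52, 2639.56],
}

def absolute_errors(true_values, calculated):
    """Absolute difference between calculated and true value at each level."""
    return [abs(c - t) for t, c in zip(true_values, calculated)]

def percent_errors(true_values, calculated):
    """Absolute difference as a percentage of the true value at each level."""
    return [abs(c - t) / t * 100.0 for t, c in zip(true_values, calculated)]

summary = {}
for name, calculated in fits.items():
    total_abs = sum(absolute_errors(nominal, calculated))
    total_pct = sum(percent_errors(nominal, calculated))
    summary[name] = (total_abs, total_pct)
    print(f"{name}: total abs error {total_abs:.2f} pptv, total % error {total_pct:.0f}%")
```

With the inferred nominal levels, the totals match the table: 221.38 pptv and 127% for the equal-weight fit, 248.80 pptv and 59% for the 1/x fit. The equal-weight fit wins on total absolute error, while the 1/x fit wins on total percent error.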

If r^{2} is a flawed measure of accuracy, then what is the best way of determining if a calibration is accurate? The simplest way is to measure the residual error at each calibration point, which many methods are moving to. Recent PFAS methods such as 533, 537.1, and OTM-45 have all removed the r^{2} requirement in favor of measuring error at each calibration point. TO-15A unfortunately has kept r^{2}, simply adding a % error requirement for each calibration point on top of it. SW-846 8000D still includes r^{2} but allows for the use of % error or % relative standard error (%RSE) in its place. Figure 2 shows the equation for %RSE.

Figure 2 – Calculation for % relative standard error
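For reference, %RSE as I understand it from SW-846 8000D can be written as below. The symbol names are my own choice: *x_{i}* is the true concentration of calibration level *i*, *x'_{i}* is the concentration calculated from the curve, *n* is the number of calibration points, and *p* is the number of terms in the fitting equation (2 for a linear fit with an intercept).

```latex
\%\mathrm{RSE} = 100 \times \sqrt{\frac{\sum_{i=1}^{n} \left( \frac{x'_i - x_i}{x_i} \right)^{2}}{n - p}}
```

Because each residual is divided by its true concentration before squaring, every calibration level carries similar weight regardless of where it sits on the curve.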

Calculating error at each point and calculating %RSE both have their advantages. Error at each point gives insight into potential non-linearity at the ends of the curve, and can help flag poorly made calibration points. %RSE, meanwhile, is a single metric for the curve, so it’s a bit easier to apply as a basic acceptance criterion.
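Both checks are straightforward to script. The sketch below is one possible way to do it, not a prescribed procedure. It computes %RSE in the form I understand SW-846 8000D to use (assuming a two-term linear fit) and flags individual points against a limit. The 30% per-point limit is a placeholder I picked for illustration, so substitute whatever your method specifies. The data are the equal-weight results from Table 1, with nominal concentrations inferred from the calculated results and their absolute errors.

```python
from math import sqrt

def percent_rse(true_values, calculated, n_terms):
    """%RSE: relative residuals pooled over n - p degrees of freedom."""
    n = len(true_values)
    sum_sq = sum(((c - t) / t) ** 2 for t, c in zip(true_values, calculated))
    return 100.0 * sqrt(sum_sq / (n - n_terms))

def flag_points(true_values, calculated, limit_pct):
    """Return (true value, % error) for each point outside the limit."""
    flagged = []
    for t, c in zip(true_values, calculated):
        pct = abs(c - t) / t * 100.0
        if pct > limit_pct:
            flagged.append((t, pct))
    return flagged

# Nominal levels inferred from Table 1, pptv.
nominal = [34.0, 67.0, 134.0, 269.0, 672.0, 1344.0, 2688.0]
# "Linear equal wt." calculated results from Table 1, pptv.
equal_weight = [16.22, 35.54, 138.09, 254.53, 778.60, 1298.84, 2686.18]

POINT_LIMIT_PCT = 30.0  # hypothetical limit, for illustration only

rse = percent_rse(nominal, equal_weight, n_terms=2)
failures = flag_points(nominal, equal_weight, POINT_LIMIT_PCT)

print(f"%RSE = {rse:.2f}%")
for level, pct in failures:
    print(f"{level:g} pptv is off by {pct:.0f}%")
```

With these inputs the %RSE comes out at the 32.37% shown in Table 1, and the two lowest levels (34 and 67 pptv) are flagged. That is the kind of low-end problem a high r^{2} can hide.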

Hopefully this gives you an understanding of the limitations of r^{2} and convinces you to move away from it if possible, or at least treat it with caution if still required to use it. In my next blog post I’ll talk about precision and detection limit calculations and how calibration choices can influence them, but I’ll close this out with some comments taken from Dr. Cosma Shalizi, Associate Professor at Carnegie Mellon University whose lecture notes I came across while researching this. “At this point, you might be wondering just what R^{2} is good for — what job it does that isn’t better done by other tools. The only honest answer I can give you is that I have never found a situation where it helped at all... The tone I have taken when discussing F tests, R^{2} and correlation has been dismissive. This is deliberate, because they are grossly abused and over-used in current practice, especially by non-statisticians, and I want you to be too proud (or too ashamed) to engage in those abuses.”