
# More Than You Ever Wanted to Know About Calibrations, Part 4 – Calibration Acceptance

2 February 2023
By Jason Hoisington

In my previous calibration blog posts I’ve discussed calibration types, curve fits, and zero points. In this post we’ll cover how to tell whether your calibration is accurate. The previous posts have been a bit dense and math heavy, but the TL;DR on this one is pretty simple: r² is bad and you shouldn’t use it.

It might seem a bit presumptuous of me to off-handedly dismiss a calibration metric that’s been in use longer than I’ve been alive, but bear with me. To start, let’s talk about what r² really is. r² is generally defined as the coefficient of determination. It was, as far as I can tell, developed in 1921 by Sewall Wright, Senior Animal Husbandman at the USDA (the original paper is here, for those interested in historical scientific publishing). He describes his work as “…an attempt to present a method of measuring the direct influence along each separate path in such a system and thus of finding the degree to which variation of a given effect is determined by each particular cause... In cases in which the causal relations are uncertain the method can be used to find the logical consequences of any particular hypothesis in regard to them.” In short, if you think things are causally connected and build a mathematical model to explain it, r² will give you a metric for determining the accuracy of your model.

For instrument calibrations this doesn’t seem very helpful. We don’t need a tool to tell us that instrument response is causally connected to sample concentration; that’s built into the design of the instrument. What we need is a tool that tells us how accurate our calibration model is. Fortunately, r² pulls double duty as a measure of goodness of fit, which from the name seems to be what we’re after. Unfortunately, this goodness of fit measurement doesn’t actually tell us much about the accuracy of the calibration, due to how it’s calculated. To see why, let’s look at how r² is calculated.

r² is defined as the ratio of the variance of the fitted values (those calculated from the regression model) to the variance of the true values, with the equation shown in Figure 1:

r² = Σ(fᵢ − ȳ)² / Σ(yᵢ − ȳ)²

Figure 1 – Equation for the calculation of r². The sum of squares in the numerator is the variance of the fitted values, and the sum of squares in the denominator is the variance of the true values.

In this equation, the values fᵢ are the calculated values from the regression, the values yᵢ are the true y values, and ȳ is the mean of the observed y values. For a least squares type regression this ends up being equal to the square of the Pearson correlation coefficient (usually designated r, hence why the coefficient of determination is r²), and it ranges in value from 0 to 1. In true math textbook fashion, though, I’ll leave the proof of that as an exercise for the reader. In any case, if the variance of both is equal, that is to say the regression perfectly matches the given data, then r² = 1. r² values less than 1 indicate a mismatch between the model and the true values, and the further from 1, the greater the disagreement.
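As a minimal sketch of that calculation, here is the variance-ratio definition applied to an ordinary least squares fit of some made-up x/y data (not from this blog’s measurements), along with a check that it matches the square of the Pearson correlation coefficient:

```python
# Sketch: r² as variance of fitted values over variance of observed values,
# for an ordinary least squares fit of illustrative data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # observed (true) responses

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Ordinary least squares slope and intercept
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
f = [slope * xi + intercept for xi in x]  # fitted values

# r² = variance of fitted values / variance of observed values
ss_fit = sum((fi - y_bar) ** 2 for fi in f)
ss_tot = sum((yi - y_bar) ** 2 for yi in y)
r_squared = ss_fit / ss_tot

# Pearson r for comparison: r² should equal pearson_r ** 2
pearson_r = s_xy / (s_xx * ss_tot) ** 0.5
print(f"r² = {r_squared:.6f}, Pearson r² = {pearson_r ** 2:.6f}")
```

The two printed values agree, which is the “exercise for the reader” above in numerical form.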

So far that sounds like a good measure of calibration accuracy, but if you look closely at the equation in Figure 1 you’ll see that it deals in absolute variance, not relative variance. This means that for calibration curves spanning multiple orders of magnitude, a high r² value may only indicate a good fit at the high end of the curve. As far as r² is concerned, being 1% off at 100 ppm is the same as being 100% off at 1 ppm. Also, if you remember the previous blog on curve fits, you’ll recall that unweighted curves minimize total absolute error, so they are also biased towards the high end of the curve. This means that using r² as a measure of calibration accuracy will steer you towards unweighted curves that are overall less accurate. Table 1 shows an example of this from an ethylene oxide calibration I performed, where the best r² value belongs to the calibration with the lowest total absolute error, but with very high % error on the lowest calibration points.
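A quick sketch makes the “1% off at 100 ppm equals 100% off at 1 ppm” point concrete. It uses hypothetical concentrations and the equivalent residual form of r² (1 − SS_res/SS_tot, which equals the variance-ratio form for a least squares fit): an identical 1 ppm absolute residual produces an identical r², regardless of where on the curve it lands.

```python
# Sketch with hypothetical concentrations: r² can't tell a 100% miss at
# 1 ppm from a 1% miss at 100 ppm, because both are the same 1 ppm
# absolute residual.
def r2(true_vals, fitted):
    y_bar = sum(true_vals) / len(true_vals)
    ss_res = sum((y - f) ** 2 for y, f in zip(true_vals, fitted))
    ss_tot = sum((y - y_bar) ** 2 for y in true_vals)
    return 1 - ss_res / ss_tot

true_vals = [1.0, 10.0, 100.0]   # ppm
low_miss = [2.0, 10.0, 100.0]    # 100% error at the 1 ppm point
high_miss = [1.0, 10.0, 101.0]   # 1% error at the 100 ppm point

print(r2(true_vals, low_miss) == r2(true_vals, high_miss))  # prints True
```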

| Curve fit | Metric | Pt 1 | Pt 2 | Pt 3 | Pt 4 | Pt 5 | Pt 6 | Pt 7 | Total | %RSE | r² |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Linear, equal wt. | Calculated result (pptv) | 16.22 | 35.54 | 138.09 | 254.53 | 778.60 | 1298.84 | 2686.18 | | 32.37% | 0.997388 |
| | Absolute error (pptv) | 17.78 | 31.46 | 4.09 | 14.47 | 106.60 | 45.16 | 1.82 | 221.38 | | |
| | % error | 52% | 47% | 3% | 5% | 16% | 3% | 0% | 127% | | |
| Linear, 1/x | Calculated result (pptv) | 33.69 | 52.55 | 152.63 | 266.28 | 777.77 | 1285.52 | 2639.56 | | 13.64% | 0.994124 |
| | Absolute error (pptv) | 0.31 | 14.45 | 18.63 | 2.72 | 105.77 | 58.48 | 48.44 | 248.80 | | |
| | % error | 1% | 22% | 14% | 1% | 16% | 4% | 2% | 59% | | |
| Linear, 1/x² | Calculated result (pptv) | 36.06 | 54.40 | 151.75 | 262.29 | 759.81 | 1253.70 | 2570.77 | | 12.70% | 0.980645 |
| | Absolute error (pptv) | 2.06 | 12.60 | 17.75 | 6.71 | 87.81 | 90.30 | 117.23 | 334.46 | | |
| | % error | 6% | 19% | 13% | 2% | 13% | 7% | 4% | 65% | | |

Table 1 – Ethylene oxide (EtO) calibration and residual error at each calibration point.

If r² is a flawed measure of accuracy, then what is the best way to determine whether a calibration is accurate? The simplest way is to measure the residual error at each calibration point, which many methods are moving to. Recent PFAS methods such as 533, 537.1, and OTM-45 have all removed the r² requirement in favor of measuring error at each calibration point. TO-15A has unfortunately kept r², simply adding a % error requirement for each calibration point on top of it. SW-846 8000D still includes r² but allows the use of % error or % relative standard error (%RSE) in its place. Figure 2 shows the equation for %RSE:

%RSE = 100 × √( Σ[(xᵢ′ − xᵢ)/xᵢ]² / (n − p) )

Figure 2 – Calculation for % relative standard error, where xᵢ is the true amount at calibration point i, xᵢ′ is the calculated amount, n is the number of calibration points, and p is the number of terms in the fitting equation.
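Here is a sketch of that %RSE calculation in Python. The nominal EtO concentrations below are back-calculated from the calculated results and absolute errors in Table 1 (they are not stated directly in the table), and p = 2 for a linear fit:

```python
import math

def percent_rse(true_vals, calc_vals, n_terms=2):
    """%RSE: 100 * sqrt(sum of squared relative residuals / (n - p)),
    where p is the number of terms in the fitting equation (2 for linear)."""
    n = len(true_vals)
    ss = sum(((c - t) / t) ** 2 for t, c in zip(true_vals, calc_vals))
    return 100 * math.sqrt(ss / (n - n_terms))

# Nominal EtO levels (pptv), back-calculated from Table 1
nominal = [34.0, 67.0, 134.0, 269.0, 672.0, 1344.0, 2688.0]
# Calculated results for the linear 1/x² fit from Table 1
calc_1x2 = [36.06, 54.40, 151.75, 262.29, 759.81, 1253.70, 2570.77]

print(f"{percent_rse(nominal, calc_1x2):.2f}%")  # prints 12.70%, matching Table 1
```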

Calculating error at each point and calculating %RSE both have their advantages. Error at each point gives insight into potential non-linearity at the ends of the curve and can help flag poorly made calibration points. %RSE, meanwhile, is a single metric for the whole curve, so it’s a bit easier to apply as a basic acceptance criterion.
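A per-point % error check can be sketched in a few lines. The ±30% acceptance window below is purely illustrative (use whatever limit your method specifies), and the three-point data set is hypothetical:

```python
def point_percent_errors(true_vals, calc_vals, limit=30.0):
    """Return (% error, pass/fail) for each calibration point against an
    acceptance window of +/- limit percent."""
    out = []
    for t, c in zip(true_vals, calc_vals):
        err = 100.0 * (c - t) / t
        out.append((err, abs(err) <= limit))
    return out

# Hypothetical 3-point example: only the low point fails a +/-30% window
results = point_percent_errors([1.0, 10.0, 100.0], [1.5, 10.4, 99.0])
for err, ok in results:
    print(f"{err:+6.1f}%  {'pass' if ok else 'FAIL'}")
```

This is exactly the kind of check that flags the low-end misses in Table 1 that r² glosses over.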

Hopefully this gives you an understanding of the limitations of r² and convinces you to move away from it if possible, or at least to treat it with caution if you’re still required to use it. In my next blog post I’ll talk about precision and detection limit calculations and how calibration choices can influence them, but I’ll close with some comments from Dr. Cosma Shalizi, Associate Professor at Carnegie Mellon University, whose lecture notes I came across while researching this: “At this point, you might be wondering just what R2 is good for — what job it does that isn’t better done by other tools. The only honest answer I can give you is that I have never found a situation where it helped at all... The tone I have taken when discussing F tests, R2 and correlation has been dismissive. This is deliberate, because they are grossly abused and over-used in current practice, especially by non-statisticians, and I want you to be too proud (or too ashamed) to engage in those abuses.”
