Deming regression

In statistics, Deming regression, named after W. Edwards Deming, is an errors-in-variables model that tries to find the line of best fit for a two-dimensional dataset. It differs from simple linear regression in that it accounts for errors in observations on both the x-axis and the y-axis. It is a special case of total least squares, which allows for any number of predictors and a more complicated error structure.

Deming regression. The red lines show the error in both x and y. This differs from the traditional least squares method, which measures error parallel to the y axis. The case shown, with deviations measured perpendicularly, arises when errors in x and y have equal variances.

Deming regression is equivalent to the maximum likelihood estimation of an errors-in-variables model in which the errors for the two variables are assumed to be independent and normally distributed, and the ratio of their variances, denoted δ, is known.[1] In practice, this ratio might be estimated from related data sources; however, the regression procedure takes no account of possible errors in estimating this ratio.

Deming regression is only slightly more difficult to compute than simple linear regression. Most statistical software packages used in clinical chemistry offer Deming regression.

The model was originally introduced by Adcock (1878), who considered the case δ = 1, and then more generally by Kummell (1879) with arbitrary δ. However, their ideas remained largely unnoticed for more than 50 years, until they were revived by Koopmans (1937) and later propagated further by Deming (1943). The latter book became so popular in clinical chemistry and related fields that the method was even dubbed Deming regression in those fields.[2]

Specification

Assume that the available data (y_i, x_i) are measured observations of the "true" values (y_i*, x_i*), which lie on the regression line:

{\begin{aligned}y_{i}&=y_{i}^{*}+\varepsilon _{i},\\x_{i}&=x_{i}^{*}+\eta _{i},\end{aligned}}

where errors ε and η are independent and the ratio of their variances is assumed to be known:

\delta ={\frac {\sigma _{\varepsilon }^{2}}{\sigma _{\eta }^{2}}}.

In practice, the variances of the errors in $x$ and $y$ are often unknown, which complicates the estimation of $\delta$. Note that when the measurement method for $x$ and $y$ is the same, these variances are likely to be equal, so $\delta =1$ in this case.

We seek to find the line of "best fit"

y^{*}=\beta _{0}+\beta _{1}x^{*},

such that the weighted sum of squared residuals of the model is minimized:[3]

SSR=\sum _{i=1}^{n}{\bigg (}{\frac {\varepsilon _{i}^{2}}{\sigma _{\varepsilon }^{2}}}+{\frac {\eta _{i}^{2}}{\sigma _{\eta }^{2}}}{\bigg )}={\frac {1}{\sigma _{\varepsilon }^{2}}}\sum _{i=1}^{n}{\Big (}(y_{i}-\beta _{0}-\beta _{1}x_{i}^{*})^{2}+\delta (x_{i}-x_{i}^{*})^{2}{\Big )}\ \to \ \min _{\beta _{0},\beta _{1},x_{1}^{*},\ldots ,x_{n}^{*}}SSR

See Jensen (2007)[4] for a full derivation.
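
As a sketch of the first step of such a derivation (not a substitute for the cited one), the unknown true values $x_{i}^{*}$ can be eliminated. For fixed $\beta _{0}$ and $\beta _{1}$, setting the derivative of the $i$-th summand with respect to $x_{i}^{*}$ to zero gives

x_{i}^{*}={\frac {\delta x_{i}+\beta _{1}(y_{i}-\beta _{0})}{\beta _{1}^{2}+\delta }},

and substituting this back leaves an objective in $\beta _{0}$ and $\beta _{1}$ only:

SSR={\frac {\delta }{\sigma _{\varepsilon }^{2}}}\sum _{i=1}^{n}{\frac {(y_{i}-\beta _{0}-\beta _{1}x_{i})^{2}}{\beta _{1}^{2}+\delta }}.

For $\delta =1$ each summand is the squared perpendicular distance from the point $(x_{i},y_{i})$ to the line.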

Solution

The solution can be expressed in terms of the second-degree sample moments. That is, we first calculate the following quantities (all sums go from i = 1 to n):

{\begin{aligned}&{\overline {x}}={\frac {1}{n}}\sum x_{i},\quad {\overline {y}}={\frac {1}{n}}\sum y_{i},\\&s_{xx}={\tfrac {1}{n-1}}\sum (x_{i}-{\overline {x}})^{2},\\&s_{xy}={\tfrac {1}{n-1}}\sum (x_{i}-{\overline {x}})(y_{i}-{\overline {y}}),\\&s_{yy}={\tfrac {1}{n-1}}\sum (y_{i}-{\overline {y}})^{2}.\end{aligned}}

Finally, the least-squares estimates of the model's parameters are[5]

{\begin{aligned}&{\hat {\beta }}_{1}={\frac {s_{yy}-\delta s_{xx}+{\sqrt {(s_{yy}-\delta s_{xx})^{2}+4\delta s_{xy}^{2}}}}{2s_{xy}}},\\&{\hat {\beta }}_{0}={\overline {y}}-{\hat {\beta }}_{1}{\overline {x}},\\&{\hat {x}}_{i}^{*}=x_{i}+{\frac {{\hat {\beta }}_{1}}{{\hat {\beta }}_{1}^{2}+\delta }}(y_{i}-{\hat {\beta }}_{0}-{\hat {\beta }}_{1}x_{i}).\end{aligned}}
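
The closed-form estimates can be computed directly from the sample moments. The following is a minimal Python sketch, not a reference implementation: the function name `deming_fit` and its signature are illustrative choices, and `delta` is assumed to be supplied by the caller as the known ratio of the y-error variance to the x-error variance.

```python
import math


def deming_fit(x, y, delta=1.0):
    """Return (beta0, beta1, x_star) for a Deming regression of y on x.

    delta is the assumed-known ratio var(epsilon) / var(eta), i.e. the
    variance of the errors in y divided by the variance of the errors in x.
    The name and signature are illustrative, not taken from any library.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    # Sample means and second-degree sample moments.
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    s_yy = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    if s_xy == 0:
        raise ValueError("s_xy is zero, so the slope formula is undefined")

    # Slope, intercept, and estimated true x values.
    d = s_yy - delta * s_xx
    beta1 = (d + math.sqrt(d * d + 4 * delta * s_xy ** 2)) / (2 * s_xy)
    beta0 = y_bar - beta1 * x_bar
    x_star = [
        xi + beta1 / (beta1 ** 2 + delta) * (yi - beta0 - beta1 * xi)
        for xi, yi in zip(x, y)
    ]
    return beta0, beta1, x_star
```

For points that lie exactly on a line, such as `deming_fit([0, 1, 2, 3], [1, 3, 5, 7])`, the sketch recovers that line for any positive `delta`, and the estimated true values coincide with the observed `x`.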

Orthogonal regression

For the case of equal error variances, i.e., when $\delta =1$, Deming regression becomes orthogonal regression: it minimizes the sum of squared perpendicular distances from the data points to the regression line. In this case, denote each observation as a point z_j in the complex plane (i.e., the point (x_j, y_j) is written as z_j = x_j + iy_j, where i is the imaginary unit). Denote as Z the sum of the squared differences of the data points from the centroid (also expressed in complex coordinates), which is the point whose horizontal and vertical locations are the averages of those of the data points. Then:[6]

If Z = 0, then every line through the centroid is a line of best orthogonal fit.
If Z ≠ 0, the orthogonal regression line goes through the centroid and is parallel to the vector from the origin to ${\sqrt {Z}}$.
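
The two cases above can be sketched in Python using the built-in complex type. The function name `orthogonal_direction` and the absolute tolerance `tol` used to decide whether Z is zero are assumptions made for illustration; a scale-aware tolerance may suit real data better.

```python
import cmath


def orthogonal_direction(points, tol=1e-12):
    """Return (centroid, direction) for the orthogonal regression line.

    points is a sequence of (x, y) pairs. Both results are complex numbers.
    direction is sqrt(Z), or None when |Z| <= tol, in which case every line
    through the centroid fits equally well. tol is an illustrative choice.
    """
    zs = [complex(px, py) for px, py in points]
    centroid = sum(zs) / len(zs)
    # Z: sum of squared (complex) differences from the centroid.
    big_z = sum((z - centroid) ** 2 for z in zs)
    if abs(big_z) <= tol:
        return centroid, None
    return centroid, cmath.sqrt(big_z)
```

When the returned direction has a nonzero real part, the slope of the line is `direction.imag / direction.real`; a direction with zero real part corresponds to a vertical line, which the complex formulation handles without a special case.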

A trigonometric representation of the orthogonal regression line was given by Coolidge in 1913.[7]

Application

In the case of three non-collinear points in the plane, the triangle with these points as its vertices has a unique Steiner inellipse that is tangent to the triangle's sides at their midpoints. The major axis of this ellipse falls on the orthogonal regression line for the three vertices.[8]
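
This claim can be checked numerically. The sketch below relies on Marden's theorem, which is not stated in the text above: the foci of the Steiner inellipse are the roots of the derivative of the cubic whose roots are the vertices, taken as complex numbers. The function names are illustrative.

```python
import cmath


def steiner_foci(z1, z2, z3):
    """Foci of the Steiner inellipse of the triangle z1, z2, z3 (complex).

    Uses Marden's theorem: the foci are the roots of p'(z), where
    p(z) = (z - z1)(z - z2)(z - z3), so p'(z) = 3z^2 - 6gz + q.
    """
    g = (z1 + z2 + z3) / 3                # centroid
    q = z1 * z2 + z1 * z3 + z2 * z3       # second elementary symmetric sum
    root = cmath.sqrt(g * g - q / 3)
    return g - root, g + root


def centered_square_sum(z1, z2, z3):
    """Z: sum of squared complex differences of the vertices from the centroid."""
    g = (z1 + z2 + z3) / 3
    return sum((z - g) ** 2 for z in (z1, z2, z3))


def is_parallel(a, b, tol=1e-9):
    """True if complex numbers a and b, viewed as plane vectors, are parallel."""
    return abs((a.conjugate() * b).imag) <= tol
```

For the triangle with vertices 0, 4 and 1 + 3i, calling `is_parallel(f2 - f1, cmath.sqrt(centered_square_sum(0, 4, 1 + 3j)))` with `f1, f2 = steiner_foci(0, 4, 1 + 3j)` returns `True`: the line through the foci, which carries the major axis, has the same direction as the orthogonal regression line, and both pass through the centroid.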

Notes

[1] Linnet (1993)
[2] Cornbleet, Gochman (1979)
[3] Fuller (1987), ch. 1.3.3
[4] Jensen (2007)
[5] Glaister (2001)
[6] Minda and Phelps (2008), Theorem 2.3
[7] Coolidge (1913)
[8] Minda and Phelps (2008), Corollary 2.4

References

Adcock, R. J. (1878). "A problem in least squares". The Analyst. Annals of Mathematics. 5 (2): 53–54. doi:10.2307/2635758. JSTOR 2635758.
Coolidge, J. L. (1913). "Two geometrical applications of the mathematics of least squares". The American Mathematical Monthly. 20 (6): 187–190. doi:10.2307/2973072.
Cornbleet, P. J.; Gochman, N. (1979). "Incorrect Least–Squares Regression Coefficients". Clinical Chemistry. 25 (3): 432–438. PMID 262186.
Deming, W. E. (1943). Statistical adjustment of data. Wiley, NY (Dover Publications edition, 1985). ISBN 0-486-64685-8.
Fuller, Wayne A. (1987). Measurement error models. John Wiley & Sons, Inc. ISBN 0-471-86187-1.
Glaister, P. (2001). "Least squares revisited". The Mathematical Gazette. 85: 104–107. doi:10.2307/3620485.
Jensen, Anders Christian (2007). "Deming regression, MethComp package" (PDF).
Koopmans, T. C. (1937). Linear regression analysis of economic time series. DeErven F. Bohn, Haarlem, Netherlands.
Kummell, C. H. (1879). "Reduction of observation equations which contain more than one observed quantity". The Analyst. Annals of Mathematics. 6 (4): 97–105. doi:10.2307/2635646. JSTOR 2635646.
Linnet, K. (1993). "Evaluation of regression procedures for method comparison studies". Clinical Chemistry. 39 (3): 424–432. PMID 8448852.
Minda, D.; Phelps, S. (2008). "Triangles, ellipses, and cubic polynomials" (PDF). American Mathematical Monthly. 115 (8): 679–689. MR 2456092.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
