Influential observation

In statistics, an influential observation is an observation for a statistical calculation whose deletion from the dataset would noticeably change the result of the calculation.[1] In particular, in regression analysis an influential point is one whose deletion has a large effect on the parameter estimates.[2]

In Anscombe's quartet the two datasets on the bottom both contain influential points. All four sets are identical when examined using simple summary statistics, but vary considerably when graphed. If one point were removed, the line would look very different.

Assessment

Various methods have been proposed for measuring influence.[3][4] Assume an estimated regression $\mathbf {y} =\mathbf {X} \mathbf {b} +\mathbf {e}$ , where $\mathbf {y}$ is an n×1 column vector for the response variable, $\mathbf {X}$ is the n×k design matrix of explanatory variables (including a constant), $\mathbf {e}$ is the n×1 residual vector, and $\mathbf {b}$ is a k×1 vector of estimates of some population parameter $\mathbf {\beta } \in \mathbb {R} ^{k}$ . Also define $\mathbf {H} \equiv \mathbf {X} \left(\mathbf {X} ^{\mathsf {T}}\mathbf {X} \right)^{-1}\mathbf {X} ^{\mathsf {T}}$ , the projection matrix of $\mathbf {X}$ . Then we have the following measures of influence:

${\text{DFBETA}}_{i}\equiv \mathbf {b} -\mathbf {b} _{(-i)}={\frac {\left(\mathbf {X} ^{\mathsf {T}}\mathbf {X} \right)^{-1}\mathbf {x} _{i}^{\mathsf {T}}e_{i}}{1-h_{i\cdot }}}$ , where $\mathbf {b} _{(-i)}$ denotes the coefficients estimated with the i-th row $\mathbf {x} _{i}$ of $\mathbf {X}$ deleted, $e_{i}$ is the i-th element of $\mathbf {e}$ , and $h_{i\cdot }=\mathbf {x} _{i}\left(\mathbf {X} ^{\mathsf {T}}\mathbf {X} \right)^{-1}\mathbf {x} _{i}^{\mathsf {T}}$ denotes the i-th diagonal element of $\mathbf {H}$ . Thus DFBETA measures the difference in each parameter estimate with and without the i-th observation. There is a DFBETA for each variable and each observation (if there are N observations and k variables there are N·k DFBETAs).[5] The following table shows DFBETAs for the third dataset from Anscombe's quartet (bottom left chart in the figure):

x	y	intercept	slope
10.0	7.46	-0.005	-0.044
8.0	6.77	-0.037	0.019
13.0	12.74	-357.910	525.268
9.0	7.11	-0.033	0
11.0	7.81	0.049	-0.117
14.0	8.84	0.490	-0.667
6.0	6.08	0.027	-0.021
4.0	5.39	0.241	-0.209
12.0	8.15	0.137	-0.231
7.0	6.42	-0.020	0.013
5.0	5.73	0.105	-0.087
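As a minimal sketch of the DFBETA formula above, the following NumPy code computes the raw differences $\mathbf {b} -\mathbf {b} _{(-i)}$ for the x and y columns of the table, once with the closed-form expression and once by refitting with each row deleted. The variable names and the helper `refit_without` are illustrative assumptions, not part of any library. The numbers in the table may come from a standardized variant of the statistic (often written DFBETAS), so the raw differences computed here need not match the tabulated values.

```python
import numpy as np

# x and y columns of Anscombe's third dataset, as listed in the table above
x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

# Design matrix with a constant column (k = 2: intercept and slope)
X = np.column_stack([np.ones_like(x), x])

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                   # least-squares estimates
e = y - X @ b                           # residual vector
h = np.sum((X @ XtX_inv) * X, axis=1)   # diagonal elements of H

# Closed form: row i holds (X'X)^{-1} x_i' e_i / (1 - h_i) = b - b_(-i)
dfbeta = (X @ XtX_inv) * (e / (1.0 - h))[:, None]


def refit_without(i):
    """Hypothetical helper: least-squares fit with observation i deleted."""
    keep = np.arange(len(y)) != i
    coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    return coef


# Brute-force check: refit n times and take the difference directly
dfbeta_refit = np.array([b - refit_without(i) for i in range(len(y))])

print(np.allclose(dfbeta, dfbeta_refit))
print(dfbeta)
```

The closed form needs only one fit, which is why it is usually preferred over refitting the model N times.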

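DFFITS (difference in fits) measures the scaled change in the fitted value for an observation when that observation is deleted.
Cook's D measures the effect of removing a data point on all the parameters combined.[2]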
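The two measures above can be sketched in NumPy using their commonly stated closed forms, which are assumptions here because the text does not give them: Cook's distance as $D_{i}=e_{i}^{2}h_{i}/\left(k\,s^{2}(1-h_{i})^{2}\right)$ with $s^{2}=\mathbf {e} ^{\mathsf {T}}\mathbf {e} /(n-k)$, and DFFITS as the externally studentized residual multiplied by $\sqrt {h_{i}/(1-h_{i})}$. The data are the third Anscombe dataset from the table above, and all variable names are illustrative.

```python
import numpy as np

# Anscombe's third dataset (same x, y values as the DFBETA table)
x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

X = np.column_stack([np.ones_like(x), x])
n, k = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
h = np.sum((X @ XtX_inv) * X, axis=1)   # leverages (diagonal of H)
s2 = e @ e / (n - k)                    # residual variance estimate

# Cook's distance: combined shift in all k coefficients when row i is deleted
cooks_d = e**2 * h / (k * s2 * (1.0 - h) ** 2)

# Residual variance with row i deleted, obtained without refitting
s2_del = ((n - k) * s2 - e**2 / (1.0 - h)) / (n - k - 1)

# Externally studentized residuals and DFFITS
t = e / np.sqrt(s2_del * (1.0 - h))
dffits = t * np.sqrt(h / (1.0 - h))

for xi, yi, d, f in zip(x, y, cooks_d, dffits):
    print(f"x={xi:5.1f}  y={yi:6.2f}  CookD={d:8.3f}  DFFITS={f:10.3f}")
```

Both statistics combine the size of the residual with the leverage of the observation, so a point can score highly through either one.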

Outliers, leverage and influence

An outlier may be defined as a data point that differs significantly from other observations.[6][7] A high-leverage point is an observation made at extreme values of the independent variables.[8] Both types of atypical observations will force the regression line to be close to the point.[2] In Anscombe's quartet, the bottom right image has a point with high leverage and the bottom left image has an outlying point.
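Leverage can be illustrated with the diagonal of the projection matrix $\mathbf {H}$ defined in the Assessment section. The sketch below assumes the x-values of Anscombe's fourth dataset (ten observations at x = 8 and one at x = 19), which are recalled from the published quartet and are not listed in this article. No response values are needed, because leverage depends only on $\mathbf {X}$.

```python
import numpy as np

# Assumed x-values of Anscombe's fourth dataset: ten points at 8, one at 19
x4 = np.array([8.0] * 7 + [19.0] + [8.0] * 3)

# Design matrix with a constant column
X = np.column_stack([np.ones_like(x4), x4])

# Projection ("hat") matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T

# Leverage of each observation is the corresponding diagonal element of H
leverage = np.diag(H)

print(np.round(leverage, 3))
print(round(float(leverage.sum()), 3))   # trace of H equals k, the number of columns
```

Under these assumed values the single point at x = 19 has leverage close to 1, so the fitted line must pass essentially through it, while every other point has leverage close to 0.1.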

References

Burt, James E.; Barber, Gerald M.; Rigby, David L. (2009), Elementary Statistics for Geographers, Guilford Press, p. 513, ISBN 9781572304840.
Everitt, Brian (1998). The Cambridge Dictionary of Statistics. Cambridge, UK New York: Cambridge University Press. ISBN 0-521-59346-8.
Winner, Larry (March 25, 2002). "Influence Statistics, Outliers, and Collinearity Diagnostics".
Belsley, David A.; Kuh, Edwin; Welsch, Roy E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley Series in Probability and Mathematical Statistics. New York: John Wiley & Sons. pp. 11–16. ISBN 0-471-05856-4.
"Outliers and DFBETA" (PDF). Archived (PDF) from the original on May 11, 2013.
Grubbs, F. E. (February 1969). "Procedures for detecting outlying observations in samples". Technometrics. 11 (1): 1–21. doi:10.1080/00401706.1969.10490657. An outlying observation, or "outlier," is one that appears to deviate markedly from other members of the sample in which it occurs.
Maddala, G. S. (1992). "Outliers". Introduction to Econometrics (2nd ed.). New York: MacMillan. p. 89. ISBN 978-0-02-374545-4. An outlier is an observation that is far removed from the rest of the observations.
Everitt, B. S. (2002). Cambridge Dictionary of Statistics. Cambridge University Press. ISBN 0-521-81099-X.
