Leverage (statistics)
In statistics and in particular in regression analysis, leverage is a measure of how far away the independent variable values of an observation are from those of the other observations.
High-leverage points are those observations, if any, made at extreme or outlying values of the independent variables such that the lack of neighboring observations means that the fitted regression model will pass close to that particular observation.[1]
Interpretation
The leverage score is also known as the observation self-sensitivity or self-influence,[2] because of the equation
$$h_{ii} = \frac{\partial \hat{y}_i}{\partial y_i},$$
which states that the leverage of the i-th observation equals the partial derivative of the fitted i-th dependent value $\hat{y}_i$ with respect to the measured i-th dependent value $y_i$. This partial derivative describes the degree by which the i-th measured value influences the i-th fitted value. Note that this leverage depends on the values of the explanatory (x-) variables of all observations but not on any of the values of the dependent (y-) variables.
The equation follows directly from the computation of the fitted values via the hat matrix as $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$, where $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$ and $h_{ii}$ is its i-th diagonal element.
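The self-sensitivity interpretation can be checked numerically. The following is a minimal sketch using NumPy, assuming a simple linear model with an intercept; the data, the seed, and the helper name `fitted` are illustrative choices, not part of any library. It perturbs a single measured value and compares the resulting change in the corresponding fitted value with the diagonal entry of the hat matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: nine ordinary x-values plus one far from the rest.
x = np.append(rng.uniform(0.0, 1.0, size=9), 5.0)
X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
y = 2.0 + 3.0 * x + rng.normal(size=x.size)

# Hat matrix H = X (X^T X)^{-1} X^T; the leverages are its diagonal.
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)


def fitted(y_values):
    """Least-squares fitted values for the fixed design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y_values, rcond=None)
    return X @ beta


# Perturb only y_i and measure how much the i-th fitted value moves.
i, delta = 9, 1e-3
y_perturbed = y.copy()
y_perturbed[i] += delta
sensitivity = (fitted(y_perturbed)[i] - fitted(y)[i]) / delta

print(leverage.round(3))          # the outlying x-value has the largest leverage
print(sensitivity, leverage[i])   # the two numbers agree up to rounding error
```

Because the fitted values are linear in the measured values, the finite difference matches the leverage for any step size, and the leverages themselves are computed without reference to `y`.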
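Bounds on leverage
$$0 \leq h_{ii} \leq 1$$
Proof
First, note that $\mathbf{H}$ is an idempotent matrix: $\mathbf{H}^2 = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top = \mathbf{H}$. Also, observe that $\mathbf{H}$ is symmetric (i.e., $h_{ij} = h_{ji}$). So equating the ii element of $\mathbf{H}$ to that of $\mathbf{H}^2$, we have
$$h_{ii} = h_{ii}^2 + \sum_{j \neq i} h_{ij}^2 \geq 0$$
and
$$h_{ii} \geq h_{ii}^2 \implies h_{ii} \leq 1.$$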
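The steps of this proof can be illustrated numerically. The sketch below assumes a randomly generated full-column-rank design matrix (the dimensions and seed are arbitrary) and checks idempotency, symmetry, the diagonal identity used in the proof, and the resulting bounds.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
# Hypothetical design matrix: intercept plus two random predictors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

# H is idempotent (H @ H == H) and symmetric (H == H.T).
idempotent = np.allclose(H @ H, H)
symmetric = np.allclose(H, H.T)

# The ii element of H^2 is sum_j h_ij * h_ji, which by symmetry equals
# sum_j h_ij^2 = h_ii^2 + sum_{j != i} h_ij^2.
row_sq_sums = (H ** 2).sum(axis=1)

print(idempotent, symmetric)
print(np.allclose(h, row_sq_sums))
print(h.min(), h.max())   # both values lie in [0, 1]
print(h.sum())            # trace of H; equals p for a full-rank design
```

The final line reflects the fact that the leverages sum to the number of columns of a full-rank design matrix, which is why the sum of leverage scores is identified with the model's degrees of freedom.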
Effect on residual variance
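If we are in an ordinary least squares setting with fixed $\mathbf{X}$ and homoscedastic regression errors $\varepsilon_i$,
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \operatorname{Var}(\boldsymbol{\varepsilon}) = \sigma^2 \mathbf{I},$$
then the i-th regression residual
$$e_i = y_i - \hat{y}_i$$
has variance
$$\operatorname{Var}(e_i) = (1 - h_{ii})\,\sigma^2.$$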
In other words, an observation's leverage score determines the degree of noise in the model's misprediction of that observation, with higher leverage leading to less noise.
Proof
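First, note that $\mathbf{I} - \mathbf{H}$ is idempotent and symmetric, and $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$. This gives
$$\operatorname{Var}(\mathbf{e}) = \operatorname{Var}\big((\mathbf{I} - \mathbf{H})\mathbf{y}\big) = (\mathbf{I} - \mathbf{H})\operatorname{Var}(\mathbf{y})(\mathbf{I} - \mathbf{H})^\top = \sigma^2 (\mathbf{I} - \mathbf{H})^2 = \sigma^2 (\mathbf{I} - \mathbf{H}).$$
Thus
$$\operatorname{Var}(e_i) = (1 - h_{ii})\,\sigma^2.$$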
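The result that the i-th residual has variance $(1 - h_{ii})\sigma^2$ can also be illustrated by simulation. The following sketch assumes Gaussian errors, a fixed design with one deliberately outlying x-value, and hypothetical true coefficients; none of these choices come from the derivation itself, which only requires homoscedastic errors.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 8, 1.5
x = np.linspace(0.0, 1.0, n)
x[-1] = 4.0                                  # one high-leverage design point
X = np.column_stack([np.ones(n), x])
beta = np.array([1.0, -2.0])                 # hypothetical true coefficients

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

# Simulate many response vectors y = X beta + eps for the same fixed X.
n_rep = 50_000
eps = rng.normal(scale=sigma, size=(n_rep, n))
Y = X @ beta + eps                           # each row is one replicate of y

# H is symmetric, so Y @ H holds the fitted values of every replicate row-wise.
residuals = Y - Y @ H

empirical = residuals.var(axis=0)            # per-observation residual variance
theoretical = sigma**2 * (1.0 - h)

print(np.column_stack([h, empirical, theoretical]).round(3))
```

In the printed table, the observation with the largest leverage has the smallest residual variance, since the fitted line is pulled toward it.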
Studentized residuals
The corresponding studentized residual (the residual adjusted for its observation-specific estimated residual variance) is then
$$t_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_{ii}}},$$
where $\hat{\sigma}$ is an appropriate estimate of $\sigma$.
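As a minimal sketch of this computation, the example below divides each residual by an estimate of its standard deviation, $\hat{\sigma}\sqrt{1 - h_{ii}}$. It assumes that $\hat{\sigma}$ is the usual residual standard error $\sqrt{\mathrm{RSS}/(n - p)}$, which is one common choice (giving internally studentized residuals) and not the only possible one; the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])          # intercept plus one predictor
y = 1.0 + 0.5 * x + rng.normal(scale=0.3, size=n)
p = X.shape[1]

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y                                 # ordinary residuals

# Assumed estimate of sigma: residual standard error sqrt(RSS / (n - p)).
sigma_hat = np.sqrt(e @ e / (n - p))

# Studentized residuals: each residual scaled by its own estimated std. dev.
t = e / (sigma_hat * np.sqrt(1.0 - h))

print(t.round(2))
```

Since $\sqrt{1 - h_{ii}} \leq 1$, studentizing never shrinks a residual relative to $e_i / \hat{\sigma}$, and the adjustment is largest for high-leverage observations.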
Related concepts
Partial leverage
Modern computer packages for statistical analysis include, as part of their facilities for regression analysis, various quantitative measures for identifying influential observations. Among these measures is partial leverage, a measure of how a single variable contributes to the leverage of a datum.
Mahalanobis distance
Leverage is closely related to the Mahalanobis distance[3] (see proof: [4]).
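Specifically, for some $n \times p$ matrix $\mathbf{X}$, the squared Mahalanobis distance of some row vector $\mathbf{x}_i$ from the vector of mean $\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i$, of length $p$, and with the estimated covariance matrix $\mathbf{S}$ (the sample covariance with divisor $n - 1$) is:
$$D^2(\mathbf{x}_i) = (\mathbf{x}_i - \hat{\boldsymbol{\mu}})\,\mathbf{S}^{-1}\,(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^\top$$
This is related to the leverage $h_{ii}$ of the hat matrix of $\mathbf{X}$ after appending a column vector of 1's to it. The relationship between the two is:
$$D^2(\mathbf{x}_i) = (n - 1)\left(h_{ii} - \frac{1}{n}\right)$$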
The relationship between leverage and Mahalanobis distance enables us to decompose leverage into meaningful components so that some sources of high leverage can be investigated analytically.[5]
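The link between the two quantities, namely that the squared Mahalanobis distance equals $(n - 1)(h_{ii} - 1/n)$, can be verified numerically. The sketch below assumes randomly generated predictors and the sample covariance with divisor $n - 1$ (NumPy's default for `np.cov`); the leverages are taken from the hat matrix of the predictors with a column of ones appended.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 25, 3
X = rng.normal(size=(n, p))                   # predictors without intercept

# Squared Mahalanobis distance of each row from the vector of column means,
# using the sample covariance matrix with divisor n - 1.
mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)                   # columns are variables
centered = X - mu
d2 = np.einsum("ij,jk,ik->i", centered, np.linalg.inv(S), centered)

# Leverage from the hat matrix of X with a column of ones appended.
X1 = np.column_stack([np.ones(n), X])
h = np.diag(X1 @ np.linalg.solve(X1.T @ X1, X1.T))

print(np.allclose(d2, (n - 1) * (h - 1.0 / n)))
```

Rearranged, the same identity reads $h_{ii} = 1/n + D^2(\mathbf{x}_i)/(n - 1)$, so every observation carries a baseline leverage of $1/n$ from the intercept plus a term that grows with its distance from the centre of the predictors.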
Software implementations
Many programs and statistics packages, such as R and Python, include implementations of leverage.
Language/Program | Function | Notes |
---|---|---|
R | hat(x, intercept = TRUE) or hatvalues(model, ...) | See |
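For Python, one possible way to compute leverages with NumPy alone is through a reduced QR decomposition, which avoids forming the full $n \times n$ hat matrix. The helper name `leverage_scores` below is hypothetical and is not a function of any library; the sketch assumes a design matrix with full column rank that already contains an intercept column if one is wanted.

```python
import numpy as np


def leverage_scores(X):
    """Diagonal of the hat matrix for a full-column-rank design matrix X.

    With the reduced QR decomposition X = Q R, the hat matrix is Q Q^T,
    so h_ii is the squared Euclidean norm of the i-th row of Q.
    """
    Q, _ = np.linalg.qr(X)          # default mode gives Q with shape (n, p)
    return (Q ** 2).sum(axis=1)


# Usage on hypothetical data: intercept plus two random predictors.
rng = np.random.default_rng(5)
X = np.column_stack([np.ones(12), rng.normal(size=(12, 2))])
h = leverage_scores(X)

# Cross-check against the explicit hat matrix.
H = X @ np.linalg.solve(X.T @ X, X.T)
print(np.allclose(h, np.diag(H)))
```

The QR route needs only an $n \times p$ matrix in memory, which matters when the number of observations is large.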
See also
- Projection matrix – whose main diagonal entries are the leverages of the observations
- Mahalanobis distance – a (scaled) measure of leverage of a datum
- Cook's distance – a measure of changes in regression coefficients when an observation is deleted
- DFFITS
- Outlier – observations with extreme Y values
- Degrees of freedom (statistics) – the sum of leverage scores
References
- Everitt, B. S. (2002). Cambridge Dictionary of Statistics. Cambridge University Press. ISBN 0-521-81099-X.
- Cardinali, C. (June 2013). "Data Assimilation: Observation influence diagnostic of a data assimilation system" (PDF).
- Weiner, Irving B.; Schinka, John A.; Velicer, Wayne F. (23 October 2012). Handbook of Psychology, Research Methods in Psychology. John Wiley & Sons. ISBN 978-1-118-28203-8.
- Prove the relation between Mahalanobis distance and Leverage?
- Kim, M. G. (2004). "Sources of high leverage in linear regression model". Journal of Applied Mathematics and Computing. 16: 509–513.