Fisher's method

In statistics, Fisher's method,[1][2] also known as Fisher's combined probability test, is a technique for data fusion or "meta-analysis" (analysis of analyses). It was developed by and named for Ronald Fisher. In its basic form, it is used to combine the results from several independent tests bearing upon the same overall hypothesis (H₀).

[Figure] Under Fisher's method, two small p-values P₁ and P₂ combine to form a smaller p-value. In the plot, the yellow-green boundary defines the region where the meta-analysis p-value is below 0.05. For example, if both p-values are around 0.10, or if one is around 0.04 and one is around 0.25, the meta-analysis p-value is around 0.05.

Application to independent test statistics

Fisher's method combines extreme value probabilities from each test, commonly known as "p-values", into one test statistic (X²) using the formula

X_{2k}^{2}\sim -2\sum _{i=1}^{k}\ln(p_{i}),

where p_i is the p-value for the i-th hypothesis test. When the p-values tend to be small, the test statistic X² will be large, which suggests that the null hypothesis is false for at least one test.

When all the null hypotheses are true, and the p_i (or their corresponding test statistics) are independent, X² has a chi-squared distribution with 2k degrees of freedom, where k is the number of tests being combined. This fact can be used to determine the p-value for X².
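
The calculation can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `fisher_combined_pvalue` is hypothetical, the p-values are assumed to be independent and to lie in (0, 1], and the upper tail of the chi-squared distribution is evaluated with the closed form that exists when the degrees of freedom are even.

```python
import math

def fisher_combined_pvalue(p_values):
    """Return (X2, combined p-value) for independent p-values in (0, 1]."""
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    # Test statistic: X2 = -2 * sum(ln p_i)
    x2 = -2.0 * sum(math.log(p) for p in p_values)
    # Upper tail of a chi-squared distribution with 2k (even) degrees of
    # freedom: exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = x2 / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return x2, math.exp(-half) * total

x2, p = fisher_combined_pvalue([0.10, 0.10])
print(round(x2, 2), round(p, 3))  # prints: 9.21 0.056
```

Two p-values of 0.10 therefore combine to a meta-analysis p-value of about 0.056, and the pair 0.04 and 0.25 gives the same result because only the product of the p-values enters the statistic.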

The distribution of X² is a chi-squared distribution for the following reason: under the null hypothesis for test i, the p-value p_i follows a uniform distribution on the interval [0,1]. The negative natural logarithm of a uniformly distributed value follows an exponential distribution. Scaling a value that follows an exponential distribution by a factor of two yields a quantity that follows a chi-squared distribution with two degrees of freedom. Finally, the sum of k independent chi-squared values, each with two degrees of freedom, follows a chi-squared distribution with 2k degrees of freedom.
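
The first two steps of this argument can be checked directly from tail probabilities. The following is a short worked derivation, assuming only that p_i is uniform on [0,1] under the null hypothesis:

```latex
\begin{aligned}
\Pr(-2\ln p_i > x) &= \Pr\!\left(p_i < e^{-x/2}\right) = e^{-x/2}, \qquad x \ge 0, \\
\Pr(\chi^2_2 > x) &= e^{-x/2}.
\end{aligned}
```

The two tail probabilities agree for every x ≥ 0, so −2 ln(p_i) has the same distribution as a chi-squared variable with two degrees of freedom.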

Limitations of independence assumption

Dependence among statistical tests is generally positive, which means that the p-value of X² is too small (anti-conservative) if the dependency is not taken into account. Thus, if Fisher's method for independent tests is applied in a dependent setting and the p-value is not small enough to reject the null hypothesis, that conclusion remains valid even though the dependence was not accounted for, because correcting for positive dependence would only make the p-value larger. However, if positive dependence is not accounted for and the meta-analysis p-value is found to be small, the evidence against the null hypothesis is generally overstated. In that case, comparing the over-small p-value from Fisher's X² against a reduced significance level, the mean false discovery rate $\alpha (k+1)/(2k)$ for k independent or positively correlated tests, may suffice to control alpha.

Extension to dependent test statistics

In cases where the tests are not independent, the null distribution of X² is more complicated. A common strategy is to approximate the null distribution with a scaled χ²-distribution random variable. Different approaches may be used depending on whether or not the covariance between the different p-values is known.
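
One way to choose the scale factor and degrees of freedom of such an approximation is to match the first two moments of X², which is the idea used in Brown's method. The sketch below is illustrative only: the function name `scaled_chi2_parameters` is hypothetical, and the pairwise covariances of the −2 ln(p_i) terms are assumed to be supplied by the caller rather than estimated here.

```python
def scaled_chi2_parameters(k, covariances):
    """Moment-match X2 to c * chi-squared(f) and return (c, f).

    k           -- number of p-values being combined
    covariances -- assumed-known cov(-2 ln p_i, -2 ln p_j), one per pair i < j
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    mean = 2.0 * k                            # E[X2]; unaffected by dependence
    var = 4.0 * k + 2.0 * sum(covariances)    # Var[X2] including pairwise terms
    if var <= 0.0:
        raise ValueError("implied variance must be positive")
    c = var / (2.0 * mean)                    # scale factor
    f = 2.0 * mean ** 2 / var                 # effective degrees of freedom
    return c, f

print(scaled_chi2_parameters(3, [0.0, 0.0, 0.0]))  # prints: (1.0, 6.0)
```

With all covariances equal to zero the result reduces to c = 1 and f = 2k, which recovers the independent case. Positive covariances give c > 1 and f < 2k, so the same X² yields a larger p-value.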

Brown's method[3] can be used to combine dependent p-values whose underlying test statistics have a multivariate normal distribution with a known covariance matrix. Kost's method[4] extends Brown's to allow one to combine p-values when the covariance matrix is known only up to a scalar multiplicative factor.

The harmonic mean p-value offers an alternative to Fisher's method for combining p-values when the dependency structure is unknown but the tests cannot be assumed to be independent.[5][6]
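
For illustration, the unweighted harmonic mean of a set of p-values can be computed as below. The function name `harmonic_mean_pvalue` is hypothetical and equal weights are assumed. The sketch computes only the raw statistic; how to interpret and calibrate it as a significance level is discussed in the cited references and is not shown here.

```python
def harmonic_mean_pvalue(p_values):
    """Return the equal-weight harmonic mean of p-values in (0, 1]."""
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    # Harmonic mean: k divided by the sum of reciprocals
    return k / sum(1.0 / p for p in p_values)

print(harmonic_mean_pvalue([0.01, 0.20, 0.50]))
```

The harmonic mean is dominated by the smallest p-values, so a single very small p-value pulls the statistic down more than it would an arithmetic mean.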

Interpretation

Fisher's method is typically applied to a collection of independent test statistics, usually from separate studies having the same null hypothesis. The meta-analysis null hypothesis is that all of the separate null hypotheses are true. The meta-analysis alternative hypothesis is that at least one of the separate alternative hypotheses is true.

In some settings, it makes sense to consider the possibility of "heterogeneity", in which the null hypothesis holds in some studies but not in others, or where different alternative hypotheses may hold in different studies. A common reason for the latter form of heterogeneity is that effect sizes may differ among populations. For example, consider a collection of medical studies looking at the effect of a high-glucose diet on the risk of developing type II diabetes. Due to genetic or environmental factors, the true risk associated with a given level of glucose consumption may be greater in some human populations than in others.

In other settings, the alternative hypothesis is either universally false, or universally true – there is no possibility of it holding in some settings but not in others. For example, consider several experiments designed to test a particular physical law. Any discrepancies among the results from separate studies or experiments must be due to chance, possibly driven by differences in power.

In the case of a meta-analysis using two-sided tests, it is possible to reject the meta-analysis null hypothesis even when the individual studies show strong effects in differing directions. In this case, we are rejecting the hypothesis that the null hypothesis is true in every study, but this does not imply that there is a uniform alternative hypothesis that holds across all studies. Thus, two-sided meta-analysis is particularly sensitive to heterogeneity in the alternative hypotheses. One-sided meta-analysis can detect heterogeneity in the effect magnitudes, but focuses on a single, pre-specified effect direction.

Relation to Stouffer's Z-score method

[Figure] The relationship between Fisher's method and Stouffer's method can be understood from the relationship between z and −log(p).

A closely related approach to Fisher's method is Stouffer's Z, based on Z-scores rather than p-values, allowing incorporation of study weights. It is named for the sociologist Samuel A. Stouffer.[7] If we let Z_i = Φ⁻¹(1 − p_i), where Φ is the standard normal cumulative distribution function, then

Z\sim {\frac {\sum _{i=1}^{k}Z_{i}}{\sqrt {k}}},

is a Z-score for the overall meta-analysis. This Z-score is appropriate for one-sided right-tailed p-values; minor modifications can be made if two-sided or left-tailed p-values are being analyzed. Specifically, if two-sided p-values are being analyzed, p_i/2 is used in place of p_i, and if left-tailed p-values are being analyzed, 1 − p_i is used.[8]
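
The unweighted Z-score can be sketched with Python's standard library, which provides the standard normal distribution through `statistics.NormalDist`. The function name `stouffer_z` is hypothetical, and the inputs are assumed to be independent, one-sided right-tailed p-values strictly between 0 and 1.

```python
import math
from statistics import NormalDist

def stouffer_z(p_values):
    """Return (Z, combined p-value) for right-tailed p-values in (0, 1)."""
    if not p_values:
        raise ValueError("need at least one p-value")
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Z_i = inverse normal CDF evaluated at 1 - p_i
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    # Sum of k independent standard normals, rescaled to unit variance
    z = sum(z_scores) / math.sqrt(len(z_scores))
    # Right-tailed p-value of the combined Z-score
    return z, 1.0 - std_normal.cdf(z)

z, p = stouffer_z([0.10, 0.10])
print(round(z, 3), round(p, 4))
```

A single p-value passes through unchanged, and a set of p-values all equal to 0.5 gives Z = 0 and a combined p-value of 0.5.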

Since Fisher's method is based on the average of −log(p_i) values, and the Z-score method is based on the average of the Z_i values, the relationship between these two approaches follows from the relationship between z and −log(p) = −log(1−Φ(z)). For the normal distribution, these two values are not perfectly linearly related, but they follow a highly linear relationship over the range of Z-values most often observed, from 1 to 5. As a result, the power of the Z-score method is nearly identical to the power of Fisher's method.

One advantage of the Z-score approach is that it is straightforward to introduce weights.[9][10] If the i-th Z-score is weighted by w_i, then the meta-analysis Z-score is

Z\sim {\frac {\sum _{i=1}^{k}w_{i}Z_{i}}{\sqrt {\sum _{i=1}^{k}w_{i}^{2}}}},

which follows a standard normal distribution under the null hypothesis. While weighted versions of Fisher's statistic can be derived, the null distribution becomes a weighted sum of independent chi-squared statistics, which is less convenient to work with.
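
The weighted statistic can be sketched in the same way. The function name `weighted_stouffer_z` is hypothetical, the weights are assumed to be chosen by the analyst before seeing the data, and the p-values are assumed to be independent and right-tailed.

```python
import math
from statistics import NormalDist

def weighted_stouffer_z(p_values, weights):
    """Return (Z, combined p-value) for weighted right-tailed p-values."""
    if len(p_values) != len(weights) or not p_values:
        raise ValueError("need one weight per p-value")
    norm_sq = sum(w * w for w in weights)
    if norm_sq == 0.0:
        raise ValueError("at least one weight must be non-zero")
    std_normal = NormalDist()
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    # Weighted sum divided by the square root of the sum of squared weights,
    # which keeps the variance equal to 1 under the null hypothesis
    z = sum(w * zi for w, zi in zip(weights, z_scores)) / math.sqrt(norm_sq)
    return z, 1.0 - std_normal.cdf(z)

z, p = weighted_stouffer_z([0.01, 0.20], [2.0, 1.0])
print(round(z, 3), round(p, 4))
```

Multiplying every weight by the same positive constant leaves Z unchanged, so only the relative sizes of the weights matter.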

References

Fisher, R.A. (1925). Statistical Methods for Research Workers. Oliver and Boyd (Edinburgh). ISBN 0-05-002170-2.
Fisher, R.A. (1948). "Questions and answers #14". The American Statistician. 2 (5): 30–31. doi:10.2307/2681650. JSTOR 2681650.
Brown, M. (1975). "A method for combining non-independent, one-sided tests of significance". Biometrics. 31 (4): 987–992. doi:10.2307/2529826.
Kost, J.; McDermott, M. (2002). "Combining dependent P-values". Statistics & Probability Letters. 60 (2): 183–190. doi:10.1016/S0167-7152(02)00310-3.
Good, I.J. (1958). "Significance tests in parallel and in series". Journal of the American Statistical Association. 53 (284): 799–813. doi:10.1080/01621459.1958.10501480. JSTOR 2281953.
Wilson, D.J. (2019). "The harmonic mean p-value for combining dependent tests". Proceedings of the National Academy of Sciences USA. 116 (4): 1195–1200. doi:10.1073/pnas.1814092116. PMC 6347718.
Stouffer, S.A.; Suchman, E.A.; DeVinney, L.C.; Star, S.A.; Williams, R.M. Jr. (1949). The American Soldier, Vol.1: Adjustment during Army Life. Princeton University Press, Princeton.
"Testing two-tailed p-values using Stouffer's approach". stats.stackexchange.com. Retrieved 2015-09-14.
Mosteller, F.; Bush, R.R. (1954). "Selected quantitative techniques". In Lindzey, G. (ed.). Handbook of Social Psychology, Vol. 1. Addison-Wesley, Cambridge, Mass. pp. 289–334.
Liptak, T. (1958). "On the combination of independent tests". Magyar Tud. Akad. Mat. Kutato Int. Kozl. 3: 171–197.