Previous research has shown that treating dependent effect sizes as independent inflates the variance of the mean effect size and introduces bias by giving studies with more effect sizes more weight in the meta-analysis. This article summarizes the different approaches to handling dependence that have been advocated by methodologists, some of which are more feasible to implement with education research studies than others. A case study using effect sizes from a recent meta-analysis of reading interventions is presented to compare the results obtained from different approaches to dealing with dependence. Overall, mean effect sizes and variance estimates were found to be similar, but estimates of indexes of heterogeneity varied. Meta-analysts are advised to explore the effect of the method of handling dependence on the heterogeneity estimates before conducting moderator analyses and to choose the approach to dependence that is best suited to their research question and their data set.
Keywords: meta-analysis, statistical dependence, heterogeneity analysis
The inclusion of statistically dependent effect sizes in a meta-analysis can present a serious threat to the validity of the meta-analytic results. Dependence can arise in a number of ways. One common way that dependence presents itself occurs when a study included in a meta-analysis uses more than one outcome measure, such as a reading intervention study that measures both reading fluency and reading comprehension. The resulting effect sizes are dependent because the same participants were measured more than once. Dependence also commonly occurs when a study's research design includes two treatment groups compared with the same control group. Because the same control group participants are included in each treatment/control comparison, the resulting effect sizes are statistically dependent. Failure to resolve or model dependence results in artificially reduced estimates of variance, which in turn inflates Type I error (Borenstein, Hedges, Higgins, & Rothstein, 2009a). Treating dependent effect sizes as if they were independent also gives more weight in the meta-analysis to studies that have multiple measures or more than two groups. Statistical dependence must be resolved in a way that allows each study to contribute a single independent effect size to the meta-analysis or modeled using methodological techniques designed to handle dependence to avoid these threats to the validity of the meta-analytic results.
Prevalence of the Problem of Dependence in Meta-Analyses in Education Research
Education research studies commonly yield a set of dependent effect sizes. For example, Edmonds et al. (2009) extracted 78 effect sizes from 21 studies of interventions for struggling readers, an average of nearly four per study across multiple measures and multiple dependent comparisons. Tran, Sanchez, Arellano, and Swanson (2011) calculated 107 effect sizes from multiple measures across 13 response-to-instruction studies, meaning that an average of eight outcome measures had been used in these studies. In their meta-analysis on the effectiveness of Reading Recovery, D'Agostino and Murphy (2004) calculated 1,379 effect sizes across the multiple outcomes, group comparisons, and testing occasions in the 36 studies that met their inclusion criteria, for an average of approximately 38 effect sizes per study. In a review of education meta-analyses published since 2000, Ahn, Ames, and Myers (2012) found that 37.5% of the 56 meta-analyses included in their report averaged three or more effect sizes per study. The average number of effect sizes per study across all 56 meta-analyses was 3.71. Just 7 of the 56 meta-analytic reports stated that dependence of effect sizes was not an issue in their data set.
Statistical Methods for Handling Dependence From Multiple Outcomes
Much has been written by prominent researchers about how to resolve dependence of effect sizes in a meta-analysis when faced with multiple outcomes. Some methods are more complex and challenging to implement with education research studies than others. On the less complex end of the spectrum, Card (2012) recommended choosing between two straightforward methods of resolving dependence. The first is to select a single outcome to include based on the focus of the meta-analysis. He cautioned that this approach is appropriate only when the meta-analyst can make a strong case for including one outcome over others. A second option, and one that is frequently implemented in education meta-analyses, is to aggregate all measures by computing an average effect size. Although computing an average effect across measures within a study is easy to do, the result may not be the best measure of the effect of the study. This approach effectively punishes studies for attempting to measure the impact of their treatment across a broad array of measures. For example, researchers testing a reading fluency intervention might be interested in knowing if their intervention has any effect on reading comprehension. Such a study conceivably could result in a large effect of 0.80 on a measure of reading fluency and a small effect of 0.20 on a measure of reading comprehension. If these measures are averaged for inclusion in a meta-analysis that is focused broadly on the effect of reading interventions on reading skills, the resulting effect size of 0.50 would not accurately represent the effectiveness of this study's intervention.
Reflecting on this problem, Marín-Martínez and Sánchez-Meca (1999) cautioned meta-analysts to consider whether or not effect sizes within a study are homogenous before averaging them to resolve dependence. If effects within studies are not homogenous, another approach to resolving dependence should be implemented. Cooper (1998) suggested a variation on simply averaging all outcomes. In his shifting-unit-of-analysis approach, effect sizes within studies are combined based on the variables of interest in the meta-analysis to provide a single estimate of the overall effect to include in the meta-analysis. Cooper stated that this approach minimizes violations of the assumption of independence of the effect sizes while preserving as much of the data as possible. However, using this approach can result in running multiple meta-analyses for each outcome type, with some analyses having a small number of studies and little power as a result.
More complex approaches to dealing with dependence from multiple outcomes involve accounting for the correlation between measures when computing a summary effect size across multiple dependent outcomes. As Borenstein et al. (2009b) pointed out, averaging effect sizes across measures makes an implicit assumption that the correlation between measures is 1.0—meaning that each outcome essentially duplicates the information provided by other outcomes. When meta-analysts ignore dependence and include effect sizes from all measures as if the effects were independent, the assumed correlation between measures is 0—meaning that each outcome contributes information that is unrelated to any other outcome. According to Borenstein et al., when making either of these assumptions about the correlation between measures, the result is an incorrect estimate of the variance of the composite effect size that the study contributes to the meta-analysis. Assuming a correlation of 1.0 results in an overestimate of the variance of the composite effect size because all the information provided by the outcomes is redundant. Assuming a correlation of 0 results in an underestimate of the variance for the composite effect size because each effect size is seen as contributing independent information. A larger estimate of the variance results in a larger confidence interval around the effect size and an increased likelihood of finding that the effect size is not significantly different from zero (a Type II error). The opposite is true when an inaccurately small estimate of the variance is calculated, resulting in an inflation of the Type I error rate.
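The argument above can be made concrete with the general formula for the variance of a mean of correlated estimates. The notation here is an assumption introduced for illustration and is not taken from the sources cited: $Y_i$ and $V_i$ are the effect size and sampling variance for outcome $i$ within a study, $m$ is the number of outcomes, and $r_{ij}$ is the correlation between outcomes $i$ and $j$.

```latex
\bar{Y} = \frac{1}{m}\sum_{i=1}^{m} Y_i ,
\qquad
V_{\bar{Y}} = \left(\frac{1}{m}\right)^{2}
\left( \sum_{i=1}^{m} V_i + \sum_{i \neq j} r_{ij}\sqrt{V_i V_j} \right)

% Special case: two outcomes with equal variance V and correlation r
V_{\bar{Y}} = \frac{V(1 + r)}{2}
% r = 0 gives V/2 (the smallest variance); r = 1 gives V (the largest variance)
```

In the two-outcome case, any true correlation between 0 and 1 gives a variance between $V/2$ and $V$, which is why the two extreme assumptions bracket the correct value.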
When the correlation between outcomes is known, the dependence can be accounted for mathematically when computing a mean effect for a study. Rosenthal and Rubin (1986); Raudenbush, Becker, and Kalaian (1988); Gleser and Olkin (1994); and Borenstein et al. (2009b) provided equations for calculating an effect size for a study with multiple outcomes that include the correlations between the outcomes. More complex approaches incorporate the correlation between measures into multivariate models for conducting meta-analysis. Kalaian and Raudenbush (1996) described and illustrated the use of multivariate multilevel modeling to conduct meta-analysis in a way that models dependency in effects within studies. In their example, they meta-analyzed studies of the impact of coaching on performance on the Scholastic Aptitude Test (SAT) math and verbal subtests. Given that the correlation between these subtests has been reported by the developers of the SAT, Kalaian and Raudenbush were able to compute the covariance matrix needed for implementing their modeling technique. The structural equation modeling (SEM) approach to meta-analysis proposed by Cheung (2010) also requires that the correlations between multiple measures within a study are known.
In her discussion of multivariate meta-analysis, Becker (2000) acknowledged that in many cases the meta-analyst does not know the correlations between multiple measures used in a particular study. She suggested consulting previous studies or manuals from test publishers to impute a correlation. Theoretically, such an approach makes sense. However, it is often impractical or impossible for a meta-analyst working with education research studies to implement any of these suggestions. Researcher-designed measures are commonly used in education research, and the correlations between such measures are not routinely reported. When a study measures outcomes using standardized tests, the correlations between them might be available from test publishers or in the research literature, but the extent to which these correlations generalize beyond the normative sample to a special population (such as students with learning disabilities) is rarely documented.
When it is not possible to locate the correlation from these sources, Becker (2000) and Borenstein et al. (2009b) suggested conducting sensitivity analyses to determine a possible range of correlations between measures. Conducting sensitivity analyses can be a workable solution when a small number of measures are involved and only a few studies use multiple measures. However, when more than two or three measures are used in multiple studies to be included in the meta-analysis, conducting sensitivity analyses for every pair of outcomes quickly becomes so laborious and time-consuming that it is not feasible, especially because computer programs to conduct sensitivity analysis are not available. In these instances, averaging outcomes with an assumed correlation of 1.0 and inflating Type II error is considered the more conservative approach.
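For a single study, a sensitivity analysis of this kind can be sketched in a few lines of Python. This is a minimal illustration, not a procedure from the sources cited. It assumes one common correlation r for every pair of outcomes and a normal-theory 95% confidence interval, and the function names, effect sizes, and variances are all hypothetical.

```python
import math

def composite_variance(variances, r):
    """Variance of the mean of correlated effect sizes, assuming the same
    correlation r between every pair of outcomes (an assumption)."""
    m = len(variances)
    cross = sum(
        math.sqrt(vi * vj)
        for i, vi in enumerate(variances)
        for j, vj in enumerate(variances)
        if i != j
    )
    return (sum(variances) + r * cross) / m ** 2

def sensitivity(effects, variances, r_values):
    """Return (r, mean effect, SE, CI lower, CI upper) for each assumed r."""
    mean_effect = sum(effects) / len(effects)
    rows = []
    for r in r_values:
        se = math.sqrt(composite_variance(variances, r))
        rows.append((r, mean_effect, se,
                     mean_effect - 1.96 * se, mean_effect + 1.96 * se))
    return rows

# Hypothetical study: two outcomes with made-up effects and variances.
for r, est, se, lo, hi in sensitivity([0.80, 0.20], [0.06, 0.06],
                                      [0.0, 0.25, 0.5, 0.75, 1.0]):
    print(f"r={r:.2f}  mean={est:.2f}  SE={se:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
```

The mean effect does not change across rows; only the standard error and confidence interval widen as the assumed correlation rises. Repeating this for every pair of outcomes in every study is what makes the approach laborious at scale.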
Statistical Methods for Handling Dependence From Multiple Group Comparisons
Many of the same researchers who have suggested methods for dealing with dependence when including studies with multiple outcomes also have described methods for dealing with dependence from multiple group comparisons within studies. Gleser and Olkin (1994) provided equations for a matrix of effect sizes that come from a set of studies where multiple treatments are compared with a no-treatment control group. They assumed that the corpus of studies that the meta-analyst has gathered includes a common and defined set of treatments (such as several types of diet or exercise routines), with some studies including perhaps two of these treatments compared with a no-treatment control group and others including three or four or more. In this scenario, regression models can be fit that account for the dependence in the group comparisons within studies. This approach works well in fields where treatments are standardized or come from a common set of treatments, such as medicine. Within education research, it is rare that the same treatments are present across studies, making it impossible to construct the type of matrix needed to implement Gleser and Olkin's approach.
Borenstein et al. (2009c) proposed a way of dealing with the dependence inherent in multiple group comparisons that is more easily applied to education research. First, they advised meta-analysts to consider if their interest is in comparing the effects of two specific treatments or in computing a combined overall effect of treatment compared with the control group. If one's interest is in comparing treatments, and two treatment groups are compared with a single control group in a given study, an effect size can be computed from the information provided for the two treatments that indicates the benefit of one treatment over the other. In this case, effect sizes from treatment–control comparisons are not included in the meta-analysis, eliminating the dependence from the shared control group. This approach makes sense only if the two treatments are present in a similar enough form across the corpus of studies to allow for similar contrasts across the meta-analysis.
If one's interest is in the overall effect of different types of treatment compared with a control group, calculating a combined effect size and its variance for studies in which multiple treatments are compared with the same control group is a straightforward process as long as the number of participants in each treatment group and the control group is known. The correlation between the effect size for the first treatment group versus the control group and the effect size for the second treatment group versus the control group can be calculated based on the number of participants in each group. A combined weighted mean effect size can be computed that gives more weight to an effect from a treatment with a larger sample size than to another treatment in the same study with a smaller sample size. The variance of this combined effect can be computed in a manner that takes into account the proportion of all study participants that are shared members of the control group. For example, if 50 participants are in one treatment group, 50 participants are in a second treatment group, and 50 participants are in the control group, the proportion of shared participants in the comparison of each treatment group with the control group is 0.50 because 50% of the participants in each comparison are the same. More simply, in cases where means, standard deviations, and sample sizes are available for all treatment groups and the control group, the meta-analyst can create a combined mean simply by calculating a weighted mean and standard deviation for a study with all treatment conditions combined and using this mean and standard deviation with the mean and standard deviation of the control group to calculate a standardized mean difference effect size.
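The simpler option described last, merging the treatment groups before computing one effect size, might look like the following Python sketch. It uses the standard formula for merging two groups' summary statistics into the mean and standard deviation of the pooled sample; treating that formula as the intended "weighted mean and standard deviation" is an assumption, and the function names and numbers are hypothetical.

```python
import math

def combine_groups(n1, m1, s1, n2, m2, s2):
    """Merge two groups' summary statistics into the n, mean, and SD that
    would be obtained from the pooled raw data."""
    n = n1 + n2
    mean = (n1 * m1 + n2 * m2) / n
    # Within-group sums of squares plus the between-group component.
    ss = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2
          + (n1 * n2 / n) * (m1 - m2) ** 2)
    return n, mean, math.sqrt(ss / (n - 1))

def standardized_mean_difference(n_t, m_t, s_t, n_c, m_c, s_c):
    """Cohen's d using the pooled within-group standard deviation."""
    pooled_sd = math.sqrt(((n_t - 1) * s_t ** 2 + (n_c - 1) * s_c ** 2)
                          / (n_t + n_c - 2))
    return (m_t - m_c) / pooled_sd

# Hypothetical study: two treatment arms compared with one control group.
n_t, m_t, s_t = combine_groups(50, 105.0, 15.0, 50, 101.0, 14.0)
d = standardized_mean_difference(n_t, m_t, s_t, 50, 98.0, 15.0)
print(n_t, round(m_t, 2), round(s_t, 2), round(d, 3))
```

Because the control group enters the calculation only once, the study contributes a single independent effect size, at the cost of blending the two treatments.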
Borenstein et al.'s (2009c) approach to computing a combined, weighted mean effect is easier to apply to the types of research methodologies typically found in education research reports than Gleser and Olkin's (1994) approach. It is a sound means of preserving the statistical independence of effect sizes in a meta-analysis. However, independence comes at the cost of losing information about the unique effect of each treatment. Averaging the effects of treatment may not represent the intent of a study's researchers when they designed a multiple treatment versus control study. Additionally, when there are vast differences in the effectiveness of the treatments, this approach handicaps the most effective treatment in a study by averaging it with less effective treatments. When there are many studies with multiple dependent comparisons in a meta-analysis, the overall mean effect will be reduced by the presence of weaker and stronger treatments homogenized into a middling studywise effect size.
New Approaches to Dealing With Dependence From Multiple Outcomes and Comparisons
Robust Variance Estimation
Hedges, Tipton, and Johnson (2010) proposed a new approach to dealing with dependence that can be applied no matter the source or sources of dependence in a data set of effect sizes. Known as robust variance estimation (RVE), it overcomes the need to include the known correlations between measures in order to include all effect sizes from all measures and all group comparisons in the meta-analysis. Instead of modeling dependence as is done in multivariate approaches to meta-analysis that require known correlations, RVE mathematically adjusts the standard errors of the effect sizes to account for the dependence (Hedges et al., 2010; Tanner-Smith & Tipton, 2013). An intraclass correlation (ρ) that represents the within-study correlation between effects must be specified when implementing RVE to estimate the effect size weights, but because RVE is not affected very much by the choice of weights, it does not matter if the correlation is precise (Hedges et al., 2010; Tanner-Smith & Tipton, 2013). Because the same ρ is applied to all dependent effect sizes within each study in the meta-analysis, sensitivity analysis with a range of values for ρ can be conducted quite easily to determine how the correlation that is chosen affects the resulting estimates of the mean effect and its variance. Dependence from multiple sources, including multiple measures and multiple group comparisons, can be accommodated simultaneously (Tanner-Smith & Tipton, 2013). RVE is reasonably easy to implement with syntax for several popular statistical software packages provided by Tanner-Smith and Tipton and available from the Peabody Research Institute (n.d.).
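The core idea of RVE can be sketched for the simplest case, an intercept-only model that estimates one overall mean effect. This is a simplified illustration and not the syntax distributed by Tanner-Smith and Tipton. It assumes correlated-effects weights in which every effect in a study receives the weight 1 / (k × (mean within-study variance + τ²)), it takes the between-study variance τ² as a user-supplied value rather than estimating it from ρ, and it applies no small-sample correction. The function name and data are hypothetical.

```python
from collections import defaultdict

def rve_mean_effect(study_ids, effects, variances, tau_sq=0.0):
    """Weighted mean effect with a cluster-robust (sandwich) variance,
    treating each study as one cluster of dependent effect sizes."""
    by_study = defaultdict(list)
    for sid, t, v in zip(study_ids, effects, variances):
        by_study[sid].append((t, v))

    # One weight per study, shared by all of that study's effect sizes.
    weight = {}
    for sid, rows in by_study.items():
        k = len(rows)
        mean_v = sum(v for _, v in rows) / k
        weight[sid] = 1.0 / (k * (mean_v + tau_sq))

    sum_w = sum(weight[sid] * len(rows) for sid, rows in by_study.items())
    mean = sum(weight[sid] * t
               for sid, rows in by_study.items() for t, _ in rows) / sum_w

    # Robust variance: weighted residuals are summed within each study
    # before squaring, so within-study dependence is not assumed away.
    meat = sum((weight[sid] * sum(t - mean for t, _ in rows)) ** 2
               for sid, rows in by_study.items())
    return mean, meat / sum_w ** 2

# Hypothetical data: study "A" contributes two dependent effect sizes.
mean, var = rve_mean_effect(["A", "A", "B", "C"],
                            [0.2, 0.4, 0.5, 0.1],
                            [0.04, 0.04, 0.04, 0.04])
print(round(mean, 3), round(var ** 0.5, 3))
```

With these weights, a study's total weight does not grow with the number of effect sizes it reports, so studies with many measures are not given extra influence on the mean.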
There are some important limitations to consider when implementing RVE. Because the math involved in RVE relies on the central limit theorem, simulation studies have shown that a minimum of 10 independent studies are needed to estimate a reliable main effect and a minimum of 40 independent studies are needed to estimate a meta-regression coefficient (Hedges et al., 2010; Tanner-Smith & Tipton, 2013). RVE can be used only in meta-regression. If a meta-analysis involves categorical moderators with more than two levels, the dummy-coding of variables required to analyze all pairwise comparisons can be cumbersome to implement in currently available statistical software. Additionally, because the degrees of freedom used to test the statistical significance of the meta-regression coefficients are equal to the number of independent studies minus the number of parameters estimated, meta-analyses with a small number of studies will be restricted in the number of covariates that can be included (Tanner-Smith & Tipton, 2013). Tanner-Smith and Tipton's simulation studies indicated that a minimum of 40 studies with an average of at least five effect sizes per study are needed to estimate a meta-regression coefficient. When fewer studies are included, they found that the confidence interval for the coefficient tends to be too narrow, meaning that the p value for the estimate will be inaccurate. Nevertheless, RVE is a mathematically sound method for modeling dependence that should be strongly considered by education meta-analysts when their data sets meet its requirements.
Three-Level Meta-Analysis
Konstantopoulos (2011) proposed three-level meta-analysis as an extension of the use of two-level random-effects models in meta-analysis. In two-level models, Level 2 variance represents between-study differences in effect size estimates, with the assumption that all studies are contributing an independent effect size. Three-level meta-analysis allows for clustering of dependent effect sizes within studies at Level 2; between-study effects are then estimated at Level 3. Cheung (2013) described how three-level meta-analysis can be used to pool dependent effect sizes within each study, modeling the within-study dependence at Level 2 and the between-study mean effect size and variance at Level 3. This approach to dependence can be applied when the correlations between the dependent effect sizes are not known, as is usually the case when multiple measures are used in a study. Unlike RVE, three-level meta-analysis provides estimates of both the Level 2 (within study) and Level 3 (between study) variance so that meta-analysts can determine where the variation in effects is the greatest. Covariates can be included in the three-level model at both Level 2 and Level 3 to attempt to explain the variance present at each level.
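The structure described above can be written compactly. The following is a sketch in commonly used notation, which is an assumption rather than a reproduction of either author's equations: $y_{ij}$ is effect size $i$ in study $j$ with known sampling variance $v_{ij}$, and $\beta_0$ is the overall mean effect.

```latex
y_{ij} = \beta_0 + u_{(2)ij} + u_{(3)j} + e_{ij}

e_{ij} \sim N(0,\, v_{ij}), \qquad
u_{(2)ij} \sim N(0,\, \tau^2_{(2)}), \qquad
u_{(3)j} \sim N(0,\, \tau^2_{(3)})

% Two effect sizes from the same study share u_{(3)j}, so
\operatorname{Cov}(y_{ij},\, y_{i'j}) = \tau^2_{(3)} \quad (i \neq i')
```

Here $\tau^2_{(2)}$ is the within-study (Level 2) heterogeneity and $\tau^2_{(3)}$ is the between-study (Level 3) heterogeneity. The shared term $u_{(3)j}$ is what captures dependence among effect sizes from one study without requiring known correlations between measures.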
Cheung (2013) described how to use SEM to conduct a three-level meta-analysis. Some advantages of the SEM approach include its ability to handle missing data on covariates and to provide a means for empirical comparison of the two-level and three-level models to determine which model best fits the data. Cheung provided syntax and a package for running three-level meta-analysis in R, making it easier for other meta-analysts to implement his approach. Like RVE, three-level meta-analysis is a promising solution to the problem of dependence in meta-analysis. However, as Cheung noted, additional studies are needed to demonstrate the strengths and potential limitations of both approaches to dependence because neither technique has been used widely in published research.
How Education Researchers Handle Dependence in Meta-Analysis
Drawing from the methods described above, education researchers have implemented a variety of means of handling dependence from multiple measures and/or multiple group comparisons when conducting a meta-analysis. In their meta-analysis of the effect of writing instruction on reading, Graham and Hebert (2011) resolved the dependence from multiple measures using Cooper's (1998) shifting-unit-of-analysis approach. They separated measures by construct (e.g., reading comprehension, reading fluency) and meta-analyzed effect sizes for each construct separately. When studies included multiple measures of a single construct, they included the average of the effects in their meta-analysis. Graham and Hebert's approach yielded multiple sets of independent effects that they meta-analyzed separately. This approach also can be implemented when studies provide multiple treatment comparisons by conducting separate meta-analyses for each type of treatment.
The advantage of this approach is that it allows the meta-analyst to retain all of the information from each study while preserving statistical independence. However, to do so the meta-analyst must run multiple analyses and cannot draw conclusions about the overall effect from the corpus of studies. Additionally, dividing the corpus of studies into groups by measure type and/or treatment type can result in a significant reduction in power. Nevertheless, this approach remains popular with meta-analysts and has been implemented in a number of other recent meta-analyses in education (e.g., Flynn, Zheng, & Swanson, 2012; Gersten et al., 2009; Tran et al., 2011). In their review of 56 education meta-analyses, Ahn et al. (2012) found that 26.8% of the meta-analyses in their data set used the shifting-unit-of-analysis approach to resolve dependence.
Another common approach to handling dependence in meta-analysis is to select a single measure and/or group comparison that seems to best represent the study's primary research question. Graham and Hebert (2011) took this approach to resolving the statistical dependence in studies that had multiple group comparisons, and Chambers (2004) used it in a meta-analysis of the effects of computers in classrooms. In their meta-analysis of reading comprehension instruction for students with learning disabilities, Berkeley, Scruggs, and Mastropieri (2010) implemented a hybrid of this approach and the approach described above, selecting a single outcome measure from each study that best represented the research question while conducting separate meta-analyses for different types of measures and for measures of treatment effect, maintenance effect, and generalization effect. This approach was used in 14.3% of the 56 education meta-analyses reviewed by Ahn et al. (2012).
The main advantage of this method of resolving dependence is that it contributes the effect size that conveys the central finding of the study to the meta-analysis. When meta-analysts select a single outcome or group comparison for the meta-analysis, studies that include additional outcomes or comparisons in an attempt to measure the effects of their intervention more broadly or compare it with other types of treatment do not have the effect size of their primary outcome or comparison of interest reduced by averaging it with smaller effects from tertiary outcomes or weaker treatments. However, in large-scale or multicomponent interventions, researchers often expect to see effects of treatment on multiple types of measures or are interested in determining which of several treatments is most effective. In these cases, it can be difficult for the meta-analyst to pick a single measure or group comparison that will best represent the study in the meta-analysis, especially if the study's authors are not clear in describing the outcome or comparison they view as most central to the purpose of their study.
Ahn et al. (2012) documented the use of other approaches to dealing with dependence in the 56 education meta-analyses they reviewed. The approach most commonly used in these meta-analyses was averaging or weighted averaging of the dependent effect sizes within studies. This approach was implemented in 42.9% of the meta-analyses. They also found that a multivariate approach was used in 7.1% of the meta-analyses. A combination of approaches was used in 12.5% of the meta-analyses. In 32.2% of the meta-analyses, researchers either failed to mention whether dependence was an issue in their data set or mentioned it but did not report how they handled it.
Because Hedges et al.'s (2010) RVE approach is a relatively new technique for dealing with dependence, published examples of its use are few in number. Wilson, Tanner-Smith, Lipsey, Steinka-Fry, and Morrison (2011) used RVE to account for dependence in their meta-analysis of high school dropout prevention programs that included 504 effect sizes from 317 independent samples and 152 studies. Uttal et al. (2013) implemented RVE in a meta-analysis that included 1,038 effect sizes from 206 studies that assessed the effect of training programs on spatial skills. Outside of educational research, RVE has been implemented in meta-analyses on the effectiveness of outpatient substance abuse treatment for adolescents (Tanner-Smith, Wilson, & Lipsey, 2013), the relationship between social goals and aggressive behavior in youth (Samson, Ojanen, & Hollo, 2012), and the effect of mindfulness-based stress reduction on physical and mental health in adults (de Vibe, Bjørndal, Tipton, Hammerstrøm, & Kowalski, 2012). No published examples of the use of three-level meta-analysis to handle dependence were found in the educational research literature. Both Konstantopoulos (2011) and Cheung (2013) illustrated the use of three-level meta-analysis with extant data sets. Van den Noortgate, López-López, Marín-Martínez, and Sánchez-Meca (2013) used simulated data sets in their exploration of three-level meta-analysis as a method for handling dependence.
A Case Study in Methods of Dealing With Dependence
To better understand the impact of the choices education meta-analysts face when dealing with multiple measures and multiple group comparisons within studies, different methods of handling dependence were implemented using a set of effect sizes from a meta-analytic study by Scammacca, Roberts, Vaughn, and Stuebing (in press) of reading interventions for struggling readers in Grades 4 to 12. We chose to use an extant set of effect sizes from a recent meta-analysis rather than a simulated data set because we believed that a real-world data set can better emulate the types and nature of dependence that typically exist in studies that education researchers struggle to meta-analyze. In doing so, we acknowledge that simulation studies make an important contribution to the knowledge base and are a necessary next step to the work we present here.
The Scammacca et al. (in press) report involved separate and combined analyses of effect sizes from research published between 1980 and 2004 and between 2005 and 2011. For this report, only effect sizes from the 2005 to 2011 group of 50 studies were used. This more recent group contained many more instances of studies with more than two groups (k = 17) and with multiple measures (k = 43) than the earlier group of studies. Because of its higher proportion of these more complex research designs, the set of 50 studies is more representative than the older set of the kinds of studies that would cause a meta-analyst to confront the issues addressed here. See the appendix for the effect size data used in this case study.
This case study sought to answer the following research questions:
Research Question 1: How do different approaches to dealing with dependence in data from multiple outcomes within studies affect meta-analytic estimates of mean effect size, variance, and indexes of heterogeneity?
Research Question 2: How do different approaches to dealing with dependence in data from multiple group comparisons within studies affect meta-analytic estimates of mean effect size, variance, and indexes of heterogeneity?
The approaches to handling dependence in this case study include those implemented in other meta-analytic studies that involved education data and others chosen to illustrate alternative means of estimating the overall effect from a study with multiple dependent effects. Additionally, meta-analyses were attempted with all outcomes and all groups as independent for comparison purposes.
Procurement of Corpus of Studies
The studies used in Scammacca et al. (in press) were located through a computer search of ERIC and PsycINFO using descriptors related to reading, learning difficulties/disabilities, and reading intervention; a search of abstracts from other published research syntheses and meta-analyses and reference lists in seminal studies; and a hand search of major journals in which previous intervention studies were published. Studies were included in the meta-analysis if (a) participants were English-speaking struggling readers in Grades 4 to 12 (age 9–21), (b) the study's research design was a multiple-group experimental or quasi-experimental treatment-comparison or multiple-treatment comparison design, (c) the intervention provided any type of reading instruction, (d) data were reported for at least one dependent measure that assessed one or more reading constructs, and (e) sufficient data for calculating effect sizes and standard errors were provided.
Studies that met criteria were coded using a code sheet that included elements specified in the What Works Clearinghouse Design and Implementation Assessment Device (Institute of Education Sciences, 2008) and used in previous research (Scammacca et al., 2007). Researchers with doctorate degrees and doctoral students with experience coding studies for other meta-analyses and research syntheses completed the code sheets. All coders had completed training on how to complete the code sheet and had reached a high level of reliability with others coding the same article independently. Every study was independently coded by two raters. When discrepancies were found between coders, they reviewed the article together and discussed the coding until consensus was reached.
Effect Size Calculation
Effect sizes were calculated using the Hedges (1981) procedure for unbiased effect sizes for Cohen's d (this statistic is also known as Hedges's g). Hedges's g was calculated using the posttest means and standard deviations for treatment and comparison (or multiple treatment) groups when such data were provided. In some cases, Cohen's d effect sizes were reported and means and standard deviations were not available. For these effects, Cohen's d for posttest mean differences between groups and the treatment and comparison group sample sizes were used to calculate Hedges's g. For each effect, estimates of Hedges's g were weighted by the inverse of the variance to account for variations in precision based on sample size in the studies. All effects were computed using the Comprehensive Meta Analysis (Version 2.2.064) software (Borenstein, Hedges, Higgins, & Rothstein, 2011). Effects were coded for all measures and pairwise group comparisons between treatment and control groups or different treatment groups when no control group was included in the study. The 36 research reports yielded 50 independent studies with a total of 366 effect sizes, an average of about 7 effect sizes per study. At this point, researchers in the original study were faced with the dilemma of how to combine multiple effect sizes from multiple measures and multiple dependent group comparisons to best estimate the mean effect of reading intervention.
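The conversion from a reported Cohen's d to Hedges's g, together with its inverse-variance weight, can be sketched as follows. This uses the widely used approximation to the small-sample correction factor and the usual large-sample variance of d. Whether the software used in the study applies exactly these expressions is an assumption, and the function name and input values are hypothetical.

```python
def hedges_g_from_d(d, n_t, n_c):
    """Convert Cohen's d to Hedges's g; return g, its variance, and the
    inverse-variance weight."""
    df = n_t + n_c - 2
    j = 1 - 3 / (4 * df - 1)             # approximate small-sample correction
    var_d = (n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c))
    g = j * d
    var_g = j ** 2 * var_d
    return g, var_g, 1 / var_g

# Hypothetical study: reported d = 0.50 with 20 students per group.
g, var_g, weight = hedges_g_from_d(0.50, 20, 20)
print(round(g, 3), round(var_g, 4), round(weight, 2))
```

The correction shrinks d slightly toward zero, and the shrinkage becomes negligible as sample sizes grow. Larger studies yield smaller variances and therefore larger weights.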
Calculating Mean Effects From Studies With Multiple Measures
Nearly all studies provided data on multiple outcome measures. Scammacca et al. (in press) averaged the effect sizes from multiple measures within each pairwise group comparison using the procedure recommended by Card (2012), and included the average effect size and the average of its standard error in the meta-analysis. Five other approaches to computing a mean effect across multiple measures within a single independent group comparison were conducted for the present report:
The measure that yielded the highest effect size was selected for each independent group comparison.
A measure was selected at random using a random number generator for each independent group comparison.
A measure was selected for each independent group comparison that seemed to best represent the primary focus of the study's intervention.
Measures were analyzed separately based on the type of reading skill measured (fluency, vocabulary, spelling, reading comprehension, word, and word fluency) for each independent group comparison and mean effects were calculated for each skill.
All measures were treated as independent estimates of effects for each independent group comparison.
All five approaches were used in meta-analyses to calculate a mean effect and its standard error across all studies for all measures included in the research reports and for norm-referenced, standardized measures only. Because researcher-designed measures tend to have lower reliability than standardized measures, repeating the meta-analyses with only standardized measures allows researchers to investigate the effects of different approaches to dealing with multiple measures while constraining some of the influence of measurement error.
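Three of the approaches listed above are mechanical enough to sketch in code. The following Python example is a minimal illustration, assuming each group comparison is represented as a list of (effect size, standard error) pairs; the function name, method labels, and data are hypothetical. Selection by primary research focus requires researcher judgment and is not shown.

```python
import random
from statistics import mean


def summarize_measures(effects, method, rng=None):
    """Reduce several (g, se) pairs from one group comparison to one pair.

    method: "mean"    -> average g and average se (mean of measures)
            "highest" -> the pair with the largest g
            "random"  -> one pair chosen at random
    """
    if method == "mean":
        return mean(g for g, _ in effects), mean(se for _, se in effects)
    if method == "highest":
        return max(effects, key=lambda pair: pair[0])
    if method == "random":
        return (rng or random).choice(effects)
    raise ValueError(f"unknown method: {method}")


# Hypothetical comparison with three outcome measures
comparison = [(0.20, 0.10), (0.60, 0.30), (0.40, 0.20)]
by_mean = summarize_measures(comparison, "mean")
by_highest = summarize_measures(comparison, "highest")
by_random = summarize_measures(comparison, "random", rng=random.Random(1))
```

Passing a seeded generator makes the random selection reproducible, which matters if the analysis is to be rerun or audited.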
Calculating Mean Effects From Studies With Multiple Dependent Groups
Seventeen of the research reports contained more than one dependent treatment-control or multiple-treatment group comparison. In Scammacca et al. (in press), the procedure recommended by Borenstein et al. (2009c) was implemented for comparisons that involved dependent groups. This procedure involves computing a combined weighted mean effect size and its standard error in a manner that reflects the degree of dependence in the data. Four other approaches to computing a mean effect across multiple dependent group comparisons were completed for this report:
The group comparison that yielded the highest mean effect size across measures included in the study was selected and included in the meta-analysis.
A group comparison was selected at random using a random number generator and its mean effect size across measures was included in the meta-analysis.
A group comparison was selected that seemed to best represent the primary focus of the study's intervention and its mean effect size across measures was included in the meta-analysis.
All group comparisons were treated as independent and each mean effect size across measures was included in the meta-analysis.
Meta-analyses were then conducted on the resulting data using all types of measures and using standardized measures only, for the reason stated above. In each of the different analyses for the multiple dependent group comparisons, the average of the effect sizes for all measures involving the group comparison of interest was used to hold constant the effect of multiple measures while examining different approaches to the problem of multiple dependent group comparisons. In a similar way, the effect of multiple dependent group comparisons was held constant in the analyses involving multiple measures. In these analyses, the Borenstein et al. (2009c) method of combining multiple dependent comparisons was implemented. Results for the RVE approach and three-level meta-analysis are reported separately.
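To make concrete what it means for a combined standard error to reflect the degree of dependence, the sketch below computes an unweighted composite of correlated effect sizes under an assumed common correlation r. This is a simplified illustration rather than the exact procedure used in the original study; the function name and the value r = 0.5 are assumptions, and the two effects are the Study 1 comparisons from the appendix table.

```python
import math
from itertools import combinations


def composite_effect(effects, ses, r):
    """Mean of correlated effects and its standard error.

    Var(mean) = (1/m)^2 * (sum of variances + 2 * r * sum of se_i * se_j),
    where r is an assumed common correlation between the effect sizes.
    """
    m = len(effects)
    var_sum = sum(se ** 2 for se in ses)
    cov_sum = sum(2 * r * a * b for a, b in combinations(ses, 2))
    return sum(effects) / m, math.sqrt(var_sum + cov_sum) / m


# Study 1: T1 vs. C and T2 vs. C share one control group
gs, ses = [1.39, 0.92], [0.35, 0.33]
g_dep, se_dep = composite_effect(gs, ses, r=0.5)  # r = 0.5 is assumed
g_ind, se_ind = composite_effect(gs, ses, r=0.0)  # dependence ignored
```

Setting r to zero treats the comparisons as independent and yields a smaller standard error for the same mean, which is the understatement of variance that the dependence-aware calculation is meant to avoid.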
For all the methods of dealing with multiple measures and multiple dependent group comparisons, a random-effects model was used to analyze effect sizes. This model allows for generalizations to be made beyond the studies included in the analysis to the population of studies from which they come. Mean effect size statistics and their standard errors were computed and heterogeneity of variance was evaluated using the Q statistic, the I2 statistic, and the tau-squared statistic. For all but the RVE and three-level meta-analysis approaches, the meta-analyses were conducted in Comprehensive Meta Analysis (Version 2.2.064) software (Borenstein et al., 2011). For the RVE approach, unrestricted, intercept-only meta-regression models were run in SPSS using a macro provided by Tanner-Smith and Tipton (2013) and Peabody Research Institute (n.d.). Sensitivity analysis with a range of values for ρ was conducted to determine the effect of varying intraclass correlations on estimates of the mean effect size, the Q statistic, and the tau-squared statistic. For three-level meta-analysis, Cheung's (2013) R syntax for the metaSEM package he authored was used. Finally, meta-regression was conducted using number of measures and number of groups as predictors of effect size in mixed-effects models using unrestricted maximum likelihood estimation.
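The relationship among the three heterogeneity indexes can be seen in a minimal random-effects sketch. The Python example below assumes the DerSimonian-Laird method-of-moments estimator for tau-squared, which may or may not match the settings used in the original analyses; the function name and the three effect sizes are hypothetical.

```python
import math


def random_effects(effects, variances):
    """Random-effects pooling with Q, I-squared (%), and tau-squared."""
    w = [1 / v for v in variances]
    sw = sum(w)
    fixed_mean = sum(wi * g for wi, g in zip(w, effects)) / sw
    # Q: weighted sum of squared deviations from the fixed-effect mean
    q = sum(wi * (g - fixed_mean) ** 2 for wi, g in zip(w, effects))
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)  # between-study variance estimate
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    # Re-weight with tau-squared added to each sampling variance
    w_star = [1 / (v + tau2) for v in variances]
    mean_re = sum(wi * g for wi, g in zip(w_star, effects)) / sum(w_star)
    se_re = math.sqrt(1 / sum(w_star))
    return {"mean": mean_re, "se": se_re, "Q": q, "I2": i2, "tau2": tau2}


# Hypothetical independent effect sizes with equal sampling variances
result = random_effects([0.1, 0.5, 0.9], [0.04, 0.04, 0.04])
```

Because Q grows with the number of effect sizes while I2 and tau-squared are scaled, adding dependent effects as if they were independent can move Q substantially while the other two indexes change less.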
Approaches to Handling Dependence From Multiple Measures
The meta-analyses that implemented different methods of resolving the dependence resulting from having multiple measures within a study produced some points of similarity and some differences across the methods used when considering all types of measures. See Table 1 for results for all types of measures. The mean effect size and variance when using the mean of measures method were nearly identical to the mean effect size when all measures were treated as independent and when a measure was selected based on the primary research question. Using the highest effect size produced a much larger mean effect, as would be expected, and a slightly larger variance. Random selection of an effect size also produced a somewhat larger estimate of the mean effect and a slightly larger variance. Estimates of heterogeneity varied widely depending on the method used to resolve dependence from multiple measures. Treating all measures as independent, using the highest effect size, and randomly selecting an effect size resulted in the largest values across all three indexes of heterogeneity.
Table 1. Meta-analytic results for various methods of summarizing multiple measures within a study for all types of measures
When all types of measures were meta-analyzed by the type of reading skill tested (the shifting-unit-of-analysis approach), reading comprehension measures, word measures, and word fluency measures produced mean effect sizes and variances similar to those from the mean of measures, all measures independent, and select-by-research-question approaches. Results differed for measures of fluency, vocabulary, and spelling skills, perhaps due to the small number of studies that included vocabulary and spelling measures or due to true differences in the effectiveness of reading interventions on these reading skills. The results for indexes of heterogeneity also differed depending on the domain of reading skills analyzed, with fluency measures showing the most heterogeneity and spelling measures the least.
Results of the meta-analyses that included standardized measures only are shown in Table 2. As with the analyses of all measures, the mean of measures approach resulted in a similar mean effect size and variance as was obtained when all measures were treated as independent or when measures were selected based on the study's primary research question. Random selection of a measure resulted in a similar mean effect size and variance as well, whereas choosing the measure with the highest effect size resulted in the largest mean effect but a similar variance to other approaches. The Q and I2 measures of heterogeneity again had large values for the independent approach, the random approach, the highest effect size approach, and the select-by-research-question approach. In the analyses by reading skill, fluency and vocabulary measures again showed much smaller effects than other domains. The I2 index of heterogeneity had large values for reading comprehension and small values for other domains, likely due to the large number of studies that included a standardized measure of reading comprehension.
Table 2. Meta-analytic results for various methods of summarizing multiple measures within a study for standardized measures
Approaches to Handling Dependence From Multiple Group Comparisons
Results from the five approaches used to deal with the dependence from multiple group comparisons are shown in Tables 3 (all types of measures) and 4 (standardized measures only). Mean effect sizes and variances were very similar across all approaches for both sets of analyses. Selecting the group comparison with the highest effect size resulted in the largest mean effect size for both all types of measures and standardized measures only, but not by much. The difference was especially small in the analysis of standardized measures. Interestingly, the variance did not increase when all group comparisons were treated as independent. However, treating all group comparisons as independent did result in very large estimates on the Q index of heterogeneity. Tau-squared and I2 estimates were less affected.
Table 3. Meta-analytic results for various methods of summarizing multiple comparisons within a study for all types of measures
Table 4. Meta-analytic results for various methods of summarizing multiple comparisons within a study for standardized measures
The Robust Variance Estimation Approach to Handling Dependence
Results from the meta-analyses that implemented RVE are shown in Tables 5 (all types of measures) and 6 (standardized measures only). Results are reported to the fifth decimal place to show that very little change occurred in the mean effect size and measures of heterogeneity based on varying the intraclass correlation ρ. Compared with the results presented above for other approaches to dealing with dependence, the RVE approach produced estimates of the mean effect size and standard error that were very similar to those found when using the mean of measures approach and the weighted mean for group comparisons approach when looking both at the meta-analysis of all types of measures and only at standardized measures. The Q statistic for the meta-analysis of standardized measures was just slightly larger using the RVE approach than in other methods used to handle multiple dependent group comparisons, but the increase was enough to indicate the presence of statistically significant heterogeneity. The estimates of heterogeneity generally were larger than those obtained with other methods of dealing with dependence but less than those obtained when dependence was ignored.
Table 5. Meta-analytic results for summarizing multiple outcomes and multiple comparisons within a study using robust variance estimation for all types of measures at different values of ρ
Table 6. Meta-analytic results for summarizing multiple outcomes and multiple comparisons within a study using robust variance estimation for standardized measures at different values of ρ
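The logic of the RVE results above can be illustrated with a minimal intercept-only sketch. The Python example below assumes the correlated-effects weighting in which every effect in a study receives the weight 1 / (k × (mean sampling variance + tau-squared)), and it uses a cluster-robust (sandwich) standard error without small-sample corrections. Tau-squared is passed in as a given value: its estimation, which is where ρ enters, is omitted. The function name and data are hypothetical, and this is not the SPSS macro used in the case study.

```python
import math


def rve_mean(studies, tau2):
    """Intercept-only robust variance estimate of the mean effect.

    studies: list of studies, each a list of (g, sampling variance) pairs.
    tau2: between-study variance, treated as known for this sketch.
    """
    weights = []
    for effs in studies:
        k = len(effs)
        v_bar = sum(v for _, v in effs) / k
        weights.append(1 / (k * (v_bar + tau2)))  # same weight within a study
    total_w = sum(w * len(effs) for w, effs in zip(weights, studies))
    mean_g = sum(w * sum(g for g, _ in effs)
                 for w, effs in zip(weights, studies)) / total_w
    # Sandwich estimator: residuals are summed within a study before squaring,
    # so within-study dependence is absorbed without knowing the correlations.
    meat = sum((w * sum(g - mean_g for g, _ in effs)) ** 2
               for w, effs in zip(weights, studies))
    return mean_g, math.sqrt(meat) / total_w


# Hypothetical data: one study with two dependent effects, one with a single effect
est, se = rve_mean([[(0.2, 0.04), (0.4, 0.04)], [(0.6, 0.04)]], tau2=0.0)
```

Under this weighting each study contributes roughly equally for a given precision regardless of how many effect sizes it reports, which is one reason the point estimate tends to resemble the one from averaging measures within studies.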
Handling Dependence With Three-Level Meta-Analysis
Results from the three-level meta-analysis using all types of outcome measures were similar to those obtained using RVE. The estimate of the mean effect was 0.27 with a standard error of 0.05 (95% confidence interval = 0.18, 0.37). The tau-squared estimate of variance was 0.10 (SE = 0.01) at Level 2 (within studies) and 0.07 (SE = 0.02) at Level 3 (between studies), meaning that more within-study than between-study variation was present. In three-level meta-analysis, I2 is calculated based on the Q statistic; thus, it is on a different scale and is interpreted differently than the I2 statistics that have been presented previously in this article. The Level 2 I2 and Level 3 I2 values were 0.48 and 0.35, respectively, meaning that 48% of the variation in effect sizes was due to within-study factors and 35% of the variation was due to between-study factors. These findings suggest that Level 2 covariates should be included in the model to account for the within-study variation before between-study covariates are considered.
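One common way to express level-specific I2 values is as each variance component's share of the total variance, where the total also includes a typical within-study sampling variance. The Python sketch below assumes that definition; the typical sampling variance of 0.04 is a hypothetical value that is not reported in this article, so the output is not expected to reproduce the reported 0.48 and 0.35 exactly.

```python
def level_i2(tau2_level2, tau2_level3, typical_v):
    """Share of total variance attributed to Level 2 and Level 3.

    typical_v is a 'typical' sampling variance of the effect sizes;
    how it is computed varies by method and is an input here.
    """
    total = tau2_level2 + tau2_level3 + typical_v
    return tau2_level2 / total, tau2_level3 / total


# Variance components from the text; typical_v = 0.04 is assumed
i2_level2, i2_level3 = level_i2(0.10, 0.07, typical_v=0.04)
```

Whatever remains after the two shares are summed is the proportion attributed to sampling error.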
A three-level meta-analysis using effect sizes from standardized measures only was attempted; it failed to converge on an optimal solution. When restricted maximum likelihood estimation was used to examine the variance components, a solution was reached that estimated tau-squared at Level 2 (within studies) as 1 × 10⁻¹⁰ and at Level 3 (between studies) as 0.02. Therefore, it seems likely that the very small value for tau-squared at Level 2 caused the model to fail to converge on an optimal solution.
Meta-Regression on Number of Measures and Number of Groups
To evaluate the relationship between effect size and number of measures and number of groups in a study, meta-regression was conducted using these variables as predictors of effect size in mixed-effects models using unrestricted maximum likelihood estimation. Meta-regression was run predicting the effect size using all types of measures and standardized measures only. The number of measures used in a study was not a statistically significant predictor of effect size when considering all types of outcome measures (β = 0.00, SE = 0.01, Q-model = 0.06, df = 1, p = .805, T2 = 0.03) or only standardized outcome measures (β = −0.01, SE = 0.01, Q-model = 0.43, df = 1, p = .513, T2 = 0.00). See Figures 1 and 2 for scatterplots of effect sizes by number of measures.
Figure 1. Scatterplot of effect size by number of measures used in a study.
Figure 2. Scatterplot of effect size by number of standardized measures used in a study.
The number of groups used in a study was not a statistically significant predictor of effect size when considering all types of outcome measures (β = 0.00, SE = 0.05, Q-model = 0.00, df = 1, p = .999, T2 = 0.03). However, number of groups was a statistically significant predictor of effect size when considering only standardized measures (β = −0.06, SE = 0.03, Q-model = 5.04, df = 1, p = .024, T2 = 0.00), with effect sizes from standardized measures decreasing as the number of groups increased. See Figures 3 and 4 for scatterplots of effect sizes by number of groups.
Figure 3. Scatterplot of effect sizes by number of groups in a study using all types of measures.
Figure 4. Scatterplot of effect sizes by number of groups in a study using standardized measures only.
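The meta-regressions reported in this section each relate effect size to a single study-level count. A minimal sketch of such a model is weighted least squares with weights 1 / (sampling variance + tau-squared), where Q-model for one predictor equals the squared ratio of the slope to its standard error. In the Python example below, tau-squared is supplied rather than estimated by maximum likelihood, and the function name and data are hypothetical.

```python
import math


def meta_regression(effects, variances, x, tau2):
    """Single-predictor mixed-effects meta-regression via weighted least squares."""
    w = [1 / (v + tau2) for v in variances]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar)
              for wi, xi, yi in zip(w, x, effects))
    slope = sxy / sxx
    se = math.sqrt(1 / sxx)       # standard error of the slope
    q_model = (slope / se) ** 2   # chi-square with 1 degree of freedom
    return slope, se, q_model


# Hypothetical studies: effect size falls as the number of groups rises
slope, se_slope, q_model = meta_regression(
    effects=[0.6, 0.4, 0.2], variances=[0.04, 0.04, 0.04],
    x=[1, 2, 3], tau2=0.0)
```

Comparing Q-model with a chi-square distribution on 1 degree of freedom gives the p value for the predictor.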
The case study presented here was conducted to demonstrate the effects of different methods of dealing with dependence from multiple measures and multiple group comparisons within studies on meta-analytic results. Results indicated that most approaches to handling dependence produced similar estimates of the mean effect and variance for this set of effect sizes. The mean effect and variance were especially similar when only standardized measures were included in the analyses. The expected increase in the variance of the mean effect was not observed in the case study data when all measures or all group comparisons were included in the analysis as if they were independent effect sizes.
These findings are not what would be expected based on previous research. In their simulation study of dependence from multiple group comparisons, Kim and Becker (2010) found that the variance of the mean effect increased as the proportion of studies in the meta-analysis that contained dependent comparisons increased. They found that the variance estimate was at least somewhat inflated when as few as 20% of the studies in the meta-analysis included dependent comparisons. In the present case study, 34% of the studies had dependent group comparisons. However, Kim and Becker also noted that variance estimates were most inflated when treatment groups were larger than control groups, which was not generally the case in the studies included in the present case study. Additionally, Kim and Becker's simulations involved a set of 10 studies with 12 and 15 effect sizes representing 20% and 50% dependence. In the present case study, 50 studies with 92 effect sizes were included in the analysis with multiple dependent group comparisons. It may be that the larger number of effect sizes in the case study contributed to the difference in findings. Additional simulation studies with a larger set of effect sizes and additional scenarios of dependence are needed to determine under what circumstances and to what extent dependence inflates variance estimates.
Based on a single set of effects from reading intervention studies, the case study presented here cannot provide definitive guidance on the best way to resolve dependence resulting from multiple measures or multiple group comparisons within studies. Indeed, there may not be only one best way to resolve dependence, given that data sets of effect sizes can differ widely in the degree and nature of the dependence present in them. Additionally, the choice of method in dealing with dependence must take into account the overall purpose and research questions behind the meta-analysis. Despite being unable to offer definitive guidance, the case study presented here raises some important issues for meta-analysts of education research to consider as they deal with dependence. Furthermore, it draws attention to ways in which primary researchers can assist meta-analysts by providing the information needed to make the best decision about how to handle the multiple dependent effects from their studies.
Implications for Education Meta-Analysts
Consider the Effect of Your Method of Dealing With Dependence When Conducting Moderator Analyses
The greatest differences between the various methods of dealing with dependence in the case study were seen in the indexes of heterogeneity. For the methods of handling multiple measures in the meta-analysis of all measures, Q values varied widely, ranging as high as 363.04. A good deal of variation also was seen in the Q values in the meta-analysis for standardized measures only, though the range was smaller. Q values were especially large when all group comparisons were treated as independent, providing another reason why this approach to dealing with dependence should be avoided.
Meta-analysts who find large Q values likely will want to find meaningful moderator variables within their set of studies that explain the heterogeneity. If a moderator variable is confounded with the approach taken to deal with dependence, the moderator analysis could falsely attribute significant differences to a characteristic of the studies in the meta-analysis when in fact the heterogeneity being explained is due to the method used to deal with dependence. This false finding could occur, for example, if grade level is used as a moderator variable and multiple measures were more commonly administered to students in upper grades than to students in lower grades. Conversely, when the indexes of heterogeneity are inflated by the method chosen to deal with dependence and that method is not confounded with any moderator variable, meta-analysts may be unable to find moderators that explain the heterogeneity, without realizing that its actual source is the method used to deal with dependence.
Therefore, it is critical for meta-analysts to consider and account for the impact of their method of dealing with dependence when their meta-analysis results in a large Q statistic. If possible, running the meta-analysis using only standardized, norm-referenced measures instead of researcher-developed measures also can be helpful in detecting whether the size of the Q statistic is due to variance in measurement rather than meaningful differences between studies. Additionally, looking beyond the often-reported Q statistic and evaluating I2 and tau-squared as measures of heterogeneity is important. Simulation studies are needed to model the impact of different approaches to dealing with dependence on estimates of heterogeneity and determine under what circumstances these estimates are artificially inflated or constrained.
Match Your Method of Dealing With Dependence to Your Research Question and Your Data Set
When the correlations between measures are known or can be obtained, a multivariate approach using multilevel modeling or SEM is the best approach for handling dependence in meta-analytic data sets. Because these correlations are rarely available in education meta-analyses, research methodologists have recommended a variety of approaches to dependence for use when correlations are not known. Given the lack of guidance currently available on a single optimal way to deal with dependence from multiple measures or multiple group comparisons when correlations are not available, meta-analysts of education research would do best to choose an approach that is suited to the data available from their set of studies and the questions they hope to answer through their meta-analyses.
When an overall estimate of the effect of treatment is more central to the purpose of a meta-analysis and/or treatments and measures are sufficiently similar to one another, the RVE, three-level meta-analysis, and mean of measures approaches are the best options. The RVE and three-level meta-analysis approaches are particularly well suited to meta-analyses with a large number of studies and when continuous moderators or dichotomous categorical moderators are of interest. Because three-level meta-analysis provides variance estimates at both the within-study and between-study levels and allows for covariates to be introduced at both levels, it is ideal for meta-analyses where researchers are interested in exploring sources of systematic variance at the within-study level. However, given the newness of RVE and three-level meta-analysis and the small number of published meta-analyses that have used these techniques, more research is needed to explore their benefits and limitations as solutions to the problem of dependence in education meta-analyses before they can be considered the optimal methods.
When the meta-analyst's research questions are addressed to the mean effect of particular domains of treatment or measurement and obtaining an overall estimate of effect across domains is not important, the shifting-unit-of-analysis approach works well if domains of treatment or measurement can be sorted cleanly into categories and enough independent effect sizes are available in each domain to allow for sufficient power. When the research question driving the meta-analysis is clearly and at least somewhat narrowly defined, selecting a single measure and/or group comparison that is best aligned with the purpose of the meta-analysis is a reasonable approach to dependence. Intentional selection of a single measure or comparison also is warranted when the authors of the studies in the meta-analysis define causal models in a way that makes clear which measures their treatments should affect most directly or which treatment in a multiple-comparison study is most central to their hypothesis. In data sets where a great deal of dependence is present, multiple approaches to resolving dependence might be attempted and the range of the mean effect, variance, and indexes of heterogeneity for each approach reported.
Practice Full Disclosure
The American Psychological Association's Meta-Analysis Reporting Standards (American Psychological Association, 2008) recommended that meta-analysts describe the method used to arrive at a single independent effect size for studies that contain multiple dependent effect sizes. However, it seems that this recommendation is not routinely followed. Ahn et al. (2012) reported that 32.2% of the education meta-analyses they reviewed failed to disclose any information about the dependence present in their corpus of studies or how it was handled. Similarly, in a review of a random sample of 100 meta-analyses in psychology and related disciplines, Dieckmann, Malle, and Bodner (2009) found that information on dependence was missing from 34% of the reports they reviewed. Given the importance of maintaining statistical independence in a meta-analysis, failure to report the extent and type of dependence present in one's corpus of studies and how that dependence was resolved is inexcusable and raises questions about the validity of the meta-analytic results. Additionally, meta-analysts should describe briefly why a particular approach to resolving dependence was chosen and what attempts were made to determine how the mean effect size, variance, and estimates of heterogeneity were affected by the chosen approach.
Implications for Primary Researchers
Clearly Specify All Aspects of Your Causal Model
Given the complexity present in studies where dependence of effect sizes occurs, primary researchers can assist meta-analysts who will be working with these effect sizes by clearly stating the way in which the measures and/or multiple treatment groups are related in their theoretical conceptualization of the effect of their treatment. Readers and future meta-analysts will benefit from knowing which measures a study's researchers view as a primary indicator of the effectiveness of the treatment and which are secondary or tertiary indicators. This information helps meta-analysts who choose to deal with dependence by focusing only on primary indicators to know which measure to include.
When a study introduces dependence from multiple group comparisons, researchers can help meta-analysts by carefully describing the treatment provided to all groups (including details on what, if any, treatment the control group receives, especially if control group members are receiving a business-as-usual treatment provided by their school). Complete descriptions of all groups allow meta-analysts to select group comparisons that align with their research questions and to choose moderator variables to use in attempting to explain heterogeneity. Additionally, primary researchers who include multiple treatment groups should explain why different variations of treatments are being provided, how the treatments differ, and which outcomes are considered primary for each treatment. This information is helpful to meta-analysts who are aggregating independent effects across studies based on similar treatment characteristics or who are interested in including independent effects from certain types of treatment only. Finally, primary researchers might consider whether their causal model would be best represented in future meta-analyses if separate control groups were provided for each treatment group. When researchers are interested in determining the distinct effect of two or more different treatments compared with a control condition, the cost of creating separate control groups would be warranted. Doing so preserves independence while allowing the effect size for each treatment–control comparison to be included rather than averaged.
Provide All Relevant Statistics Needed to Deal With Dependence
Another way that primary researchers can assist meta-analysts in dealing with dependence is by providing all the necessary statistical information needed to allow meta-analysts to implement multivariate meta-analytic methods that model dependence. Researchers should provide the correlations between all measures used so that meta-analysts can create the covariance matrices needed for meta-analytic multilevel modeling and multivariate SEM. A simple table of measures and their correlations based on all participants in the study's sample is all that is needed. Additionally, primary researchers with multiple dependent group comparisons should be sure to include the initial sample size and the sample size after attrition for all treatment and comparison groups so that meta-analysts can use this information to calculate a sample-weighted effect size for all treatments versus the shared control group.
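To show why a table of correlations is sufficient, the sketch below builds the sampling covariance matrix for several effect sizes from one study. It assumes the common approximation that the covariance between two effect sizes equals the correlation between their measures multiplied by the two standard errors; the function name, standard errors, and correlation are hypothetical.

```python
def covariance_matrix(ses, corr):
    """Approximate sampling covariance matrix of effect sizes in one study.

    ses:  standard errors of the effect sizes, one per measure
    corr: square matrix of correlations between the measures
    """
    n = len(ses)
    return [[corr[i][j] * ses[i] * ses[j] for j in range(n)]
            for i in range(n)]


# Hypothetical study with two outcome measures correlated at 0.5
cov = covariance_matrix([0.3, 0.4], [[1.0, 0.5], [0.5, 1.0]])
```

The diagonal holds the usual sampling variances, and the off-diagonal entries are what cannot be computed when primary studies omit the correlations.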
With primary researchers increasingly designing more complex, large-scale studies at the request of grant providers, statistical dependence of the resulting effect sizes has become a significant issue for meta-analysts in education research. All the approaches available to meta-analysts to deal with dependence that were described in this report and demonstrated in the case study have benefits and limitations. At the present time, selecting a method for dealing with dependence is one of many choices a researcher must make when conducting a meta-analysis. Further research is needed to test these approaches with simulated and nonsimulated data to determine the conditions under which each approach is best implemented and to provide better guidance in selecting the best approach for a given set of dependent effect sizes. While waiting for this guidance to become available, the best way forward for education meta-analysts is to weigh carefully the advantages and disadvantages of each approach and to provide as much information as possible on the chosen approach so that readers can consider this information when interpreting the meta-analytic results.
This research was supported by Grant P50 HD052117 from the Eunice Kennedy Shriver National Institute of Child Health and Human Development and by the Institute of Education Sciences, U.S. Department of Education, through Grant R305F100013 to The University of Texas at Austin as part of the Reading for Understanding Research Initiative. The content is solely the responsibility of the authors and does not necessarily represent the official views of the Eunice Kennedy Shriver National Institute of Child Health and Human Development, the National Institutes of Health, the Institute of Education Sciences, or the U.S. Department of Education.
Effect Size Data Used in this Case Study
|Study|Independent subgroup|Comparison^a|Outcome^b|Hedges's g|SE|
|---|---|---|---|---|---|
|1| |T1 vs. C|A|1.39|0.35|
| | |T2 vs. C|A|0.92|0.33|
|2| |T vs. C|A|0.85|0.30|
|3|A|T1 vs. T2|A|0.06|0.32|
| | |T1 vs. T3|A|0.41|0.32|
| | |T1 vs. T4|A|0.15|0.32|
| | |T1 vs. T5|A|0.20|0.32|
| | |T2 vs. T3|A|0.34|0.32|
| | |T2 vs. T4|A|0.21|0.32|
| | |T2 vs. T5|A|0.24|0.32|
| | |T3 vs. T4|A|0.52|0.32|
A meta-analysis is a statistical analysis that combines the results of multiple scientific studies.
The basic tenet behind meta-analyses is that there is a common truth behind all conceptually similar scientific studies, but which has been measured with a certain error within individual studies. The aim then is to use approaches from statistics to derive a pooled estimate closest to the unknown common truth based on how this error is perceived. In essence, all existing methods yield a weighted average from the results of the individual studies and what differs is the manner in which these weights are allocated and also the manner in which the uncertainty is computed around the point estimate thus generated. In addition to providing an estimate of the unknown common truth, meta-analysis has the capacity to contrast results from different studies and identify patterns among study results, sources of disagreement among those results, or other interesting relationships that may come to light in the context of multiple studies.
A key benefit of this approach is the aggregation of information leading to a higher statistical power and more robust point estimate than is possible from the measure derived from any individual study. However, in performing a meta-analysis, an investigator must make choices which can affect the results, including deciding how to search for studies, selecting studies based on a set of objective criteria, dealing with incomplete data, analyzing the data, and accounting for or choosing not to account for publication bias.
Meta-analyses are often, but not always, important components of a systematic review procedure. For instance, a meta-analysis may be conducted on several clinical trials of a medical treatment, in an effort to obtain a better understanding of how well the treatment works. Here it is convenient to follow the terminology used by the Cochrane Collaboration, and use "meta-analysis" to refer to statistical methods of combining evidence, leaving other aspects of 'research synthesis' or 'evidence synthesis', such as combining information from qualitative studies, for the more general context of systematic reviews.
The historical roots of meta-analysis can be traced back to 17th century studies of astronomy, while a paper published in 1904 by the statistician Karl Pearson in the British Medical Journal which collated data from several studies of typhoid inoculation is seen as the first time a meta-analytic approach was used to aggregate the outcomes of multiple clinical studies. The first meta-analysis of all conceptually identical experiments concerning a particular research issue, and conducted by independent researchers, has been identified as the 1940 book-length publication Extrasensory Perception After Sixty Years, authored by Duke University psychologists J. G. Pratt, J. B. Rhine, and associates. This encompassed a review of 145 reports on ESP experiments published from 1882 to 1939, and included an estimate of the influence of unpublished papers on the overall effect (the file-drawer problem). Although meta-analysis is widely used in epidemiology and evidence-based medicine today, a meta-analysis of a medical treatment was not published until 1955. In the 1970s, more sophisticated analytical techniques were introduced in educational research, starting with the work of Gene V. Glass, Frank L. Schmidt and John E. Hunter.
The term "meta-analysis" was coined by Gene V. Glass, who was the first modern statistician to formalize the use of the term meta-analysis. He states "my major interest currently is in what we have come to call ...the meta-analysis of research. The term is a bit grand, but it is precise and apt ... Meta-analysis refers to the analysis of analyses". Although this led to him being widely recognized as the modern founder of the method, the methodology behind what he termed "meta-analysis" predates his work by several decades. The statistical theory surrounding meta-analysis was greatly advanced by the work of Nambury S. Raju, Larry V. Hedges, Harris Cooper, Ingram Olkin, John E. Hunter, Jacob Cohen, Thomas C. Chalmers, Robert Rosenthal, Frank L. Schmidt, and Douglas G. Bonett.
Conceptually, a meta-analysis uses a statistical approach to combine the results from multiple studies in an effort to increase power (over individual studies), improve estimates of the size of the effect and/or to resolve uncertainty when reports disagree. A meta-analysis is a statistical overview of the results from one or more systematic reviews. Basically, it produces a weighted average of the included study results and this approach has several advantages:
- Results can be generalized to a larger population.
- The precision and accuracy of estimates can be improved as more data is used. This, in turn, may increase the statistical power to detect an effect.
- Inconsistency of results across studies can be quantified and analyzed. For instance, does inconsistency arise from sampling error, or are study results (partially) influenced by between-study heterogeneity?
- Hypothesis testing can be applied to summary estimates.
- Moderators can be included to explain variation between studies.
- The presence of publication bias can be investigated.
A meta-analysis of several small studies does not predict the results of a single large study. Some have argued that a weakness of the method is that sources of bias are not controlled by the method: a good meta-analysis cannot correct for poor design and/or bias in the original studies. This would mean that only methodologically sound studies should be included in a meta-analysis, a practice called 'best evidence synthesis'. Other meta-analysts would include weaker studies, and add a study-level predictor variable that reflects the methodological quality of the studies to examine the effect of study quality on the effect size. However, others have argued that a better approach is to preserve information about the variance in the study sample, casting as wide a net as possible, and that methodological selection criteria introduce unwanted subjectivity, defeating the purpose of the approach.
Publication bias: the file drawer problem
Another potential pitfall is the reliance on the available body of published studies, which may create exaggerated outcomes due to publication bias, as studies which show negative results or insignificant results are less likely to be published. For example, pharmaceutical companies have been known to hide negative studies and researchers may have overlooked unpublished studies such as dissertation studies or conference abstracts that did not reach publication. This is not easily solved, as one cannot know how many studies have gone unreported.
This file drawer problem (characterized by negative or non-significant results being tucked away in a cabinet), can result in a biased distribution of effect sizes thus creating a serious base rate fallacy, in which the significance of the published studies is overestimated, as other studies were either not submitted for publication or were rejected. This should be seriously considered when interpreting the outcomes of a meta-analysis.
The distribution of effect sizes can be visualized with a funnel plot which (in its most common version) is a scatter plot of standard error versus the effect size. It makes use of the fact that the smaller studies (thus larger standard errors) have more scatter of the magnitude of effect (being less precise) while the larger studies have less scatter and form the tip of the funnel. If many negative studies were not published, the remaining positive studies give rise to a funnel plot in which the base is skewed to one side (asymmetry of the funnel plot). In contrast, when there is no publication bias, the effect of the smaller studies has no reason to be skewed to one side and so a symmetric funnel plot results. This also means that if no publication bias is present, there would be no relationship between standard error and effect size. A negative or positive relation between standard error and effect size would imply that smaller studies that found effects in one direction only were more likely to be published and/or to be submitted for publication.
Apart from the visual funnel plot, statistical methods for detecting publication bias have also been proposed. These are controversial because they typically have low power for detecting bias, yet may also produce false positives under some circumstances. For instance, small study effects (biased smaller studies), wherein methodological differences between smaller and larger studies exist, may cause asymmetry in effect sizes that resembles publication bias. However, small study effects may be just as problematic for the interpretation of meta-analyses, and the onus is on meta-analytic authors to investigate potential sources of bias.
A Tandem Method for analyzing publication bias has been suggested for cutting down false positive error problems. This Tandem Method consists of three stages. Firstly, one calculates Orwin's fail-safe N, to check how many studies would have to be added in order to reduce the mean effect size to a trivial size. If this number of studies is larger than the number of studies used in the meta-analysis, it is taken as a sign that there is no publication bias, as in that case one needs a lot of studies to reduce the effect size. Secondly, one can do an Egger's regression test, which tests whether the funnel plot is symmetrical. As mentioned before, a symmetrical funnel plot is a sign that there is no publication bias, as the effect size and sample size are not dependent. Thirdly, one can apply the trim-and-fill method, which imputes data if the funnel plot is asymmetrical.
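The first two stages described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, Orwin's fail-safe N is computed on the unweighted mean effect with missing studies assumed to average zero by default, and Egger's test is written as an ordinary least-squares regression of the standardized effect (effect divided by its standard error) on precision (one over the standard error), with the intercept as the asymmetry measure. Converting the t statistic to a p-value requires a t distribution and is left out.

```python
import math


def orwin_failsafe_n(effects, trivial_effect, missing_mean=0.0):
    """Studies averaging `missing_mean` needed to pull the unweighted
    mean effect down to `trivial_effect` (assumed criterion value)."""
    k = len(effects)
    mean_effect = sum(effects) / k
    return k * (mean_effect - trivial_effect) / (trivial_effect - missing_mean)


def egger_intercept(effects, std_errors):
    """Regress effect/se on 1/se by ordinary least squares.

    Returns (intercept, se_of_intercept, t_statistic, degrees_of_freedom).
    An intercept far from zero suggests funnel-plot asymmetry.
    """
    z = [y / s for y, s in zip(effects, std_errors)]   # standardized effects
    x = [1.0 / s for s in std_errors]                  # precisions
    k = len(z)
    x_bar = sum(x) / k
    z_bar = sum(z) / k
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
    slope = sxz / sxx
    intercept = z_bar - slope * x_bar
    resid_ss = sum((zi - intercept - slope * xi) ** 2 for xi, zi in zip(x, z))
    df = k - 2
    sigma2 = resid_ss / df                             # residual variance
    se_int = math.sqrt(sigma2 * (1.0 / k + x_bar ** 2 / sxx))
    return intercept, se_int, intercept / se_int, df
```

In the Tandem Method as described, the fail-safe N would be compared with the number of included studies, and the Egger intercept would be tested against zero.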
The problem of publication bias is not trivial as it is suggested that 25% of meta-analyses in the psychological sciences may have suffered from publication bias. However, low power of existing tests and problems with the visual appearance of the funnel plot remain an issue, and estimates of publication bias may remain lower than what truly exists.
Most discussions of publication bias focus on journal practices favoring publication of statistically significant findings. However, questionable research practices, such as reworking statistical models until significance is achieved, may also favor statistically significant findings in support of researchers' hypotheses.
Problems related to studies not reporting non-statistically significant effects
It is not uncommon for studies not to report effects that fail to reach statistical significance. For example, they may simply say that the groups did not show statistically significant differences, without reporting any other information (e.g. a statistic or p-value). Exclusion of these studies would lead to a situation similar to publication bias, but their inclusion (assuming null effects) would also bias the meta-analysis. MetaNSUE, a method created by Joaquim Radua, has been shown to allow researchers to include these studies without bias. Its steps are as follows:
Problems related to the statistical approach
Another weakness is that it has not been determined whether the statistically most accurate method for combining results is the fixed, IVhet, random or quality effect model, though criticism of the random effects model is mounting because of the perception that the new random effects (used in meta-analysis) are essentially formal devices to facilitate smoothing or shrinkage, and prediction may be impossible or ill-advised. The main problem with the random effects approach is that it uses the classic statistical thought of generating a "compromise estimator" that makes the weights close to the naturally weighted estimator if heterogeneity across studies is large but close to the inverse variance weighted estimator if the between-study heterogeneity is small. However, what has been ignored is the distinction between the model we choose to analyze a given dataset and the mechanism by which the data came into being. A random effect can be present in either of these roles, but the two roles are quite distinct. There is no reason to think the analysis model and data-generation mechanism (model) are similar in form, but many sub-fields of statistics have developed the habit of assuming, for theory and simulations, that the data-generation mechanism (model) is identical to the analysis model we choose (or would like others to choose). As a hypothesized mechanism for producing the data, the random effect model for meta-analysis is implausible, and it is more appropriate to think of this model as a superficial description and something we choose as an analytical tool. Even so, this choice may not work for meta-analysis, because the study effects are a fixed feature of the respective meta-analysis and the probability distribution is only a descriptive tool.
Problems arising from agenda-driven bias
The most severe fault in meta-analysis often occurs when the person or persons doing the meta-analysis have an economic, social, or political agenda such as the passage or defeat of legislation. People with these types of agendas may be more likely to abuse meta-analysis due to personal bias. For example, researchers favorable to the author's agenda are likely to have their studies cherry-picked while those not favorable will be ignored or labeled as "not credible". In addition, the favored authors may themselves be biased or paid to produce results that support their overall political, social, or economic goals in ways such as selecting small favorable data sets and not incorporating larger unfavorable data sets. The influence of such biases on the results of a meta-analysis is possible because the methodology of meta-analysis is highly malleable.
A 2011 study done to disclose possible conflicts of interests in underlying research studies used for medical meta-analyses reviewed 29 meta-analyses and found that conflicts of interests in the studies underlying the meta-analyses were rarely disclosed. The 29 meta-analyses included 11 from general medicine journals, 15 from specialty medicine journals, and three from the Cochrane Database of Systematic Reviews. The 29 meta-analyses reviewed a total of 509 randomized controlled trials (RCTs). Of these, 318 RCTs reported funding sources, with 219 (69%) receiving funding from industry. Of the 509 RCTs, 132 reported author conflict of interest disclosures, with 91 studies (69%) disclosing one or more authors having financial ties to industry. The information was, however, seldom reflected in the meta-analyses. Only two (7%) reported RCT funding sources and none reported RCT author-industry ties. The authors concluded "without acknowledgment of COI due to industry funding or author industry financial ties from RCTs included in meta-analyses, readers' understanding and appraisal of the evidence from the meta-analysis may be compromised."
For example, in 1998, a US federal judge found that the United States Environmental Protection Agency had abused the meta-analysis process to produce a study claiming cancer risks to non-smokers from environmental tobacco smoke (ETS) with the intent to influence policy makers to pass smoke-free–workplace laws. The judge found that:
EPA's study selection is disturbing. First, there is evidence in the record supporting the accusation that EPA "cherry picked" its data. Without criteria for pooling studies into a meta-analysis, the court cannot determine whether the exclusion of studies likely to disprove EPA's a priori hypothesis was coincidence or intentional. Second, EPA's excluding nearly half of the available studies directly conflicts with EPA's purported purpose for analyzing the epidemiological studies and conflicts with EPA's Risk Assessment Guidelines. See ETS Risk Assessment at 4-29 ("These data should also be examined in the interest of weighing all the available evidence, as recommended by EPA's carcinogen risk assessment guidelines (U.S. EPA, 1986a) (emphasis added)). Third, EPA's selective use of data conflicts with the Radon Research Act. The Act states EPA's program shall "gather data and information on all aspects of indoor air quality" (Radon Research Act § 403(a)(1)) (emphasis added).
As a result of the abuse, the court vacated Chapters 1–6 of and the Appendices to EPA's "Respiratory Health Effects of Passive Smoking: Lung Cancer and other Disorders".
Steps in a meta-analysis
A meta-analysis is usually preceded by a systematic review, as this allows identification and critical appraisal of all the relevant evidence (thereby limiting the risk of bias in summary estimates). The general steps are then as follows:
- Formulation of the research question, e.g. using the PICO model (Population, Intervention, Comparison, Outcome).
- Search of literature
- Selection of studies ('incorporation criteria')
- Based on quality criteria, e.g. the requirement of randomization and blinding in a clinical trial
- Selection of specific studies on a well-specified subject, e.g. the treatment of breast cancer.
- Decide whether unpublished studies are included to avoid publication bias (file drawer problem)
- Decide which dependent variables or summary measures are allowed. For instance, when considering a meta-analysis of published (aggregate) data:
- Selection of a meta-analysis model, e.g. fixed effect or random effects meta-analysis.
- Examine sources of between-study heterogeneity, e.g. using subgroup analysis or meta-regression.
Formal guidance for the conduct and reporting of meta-analyses is provided by the Cochrane Handbook.
For reporting guidelines, see the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement.
Methods and assumptions
In general, two types of evidence can be distinguished when performing a meta-analysis: individual participant data (IPD), and aggregate data (AD). The aggregate data can be direct or indirect.
AD is more commonly available (e.g. from the literature) and typically represents summary estimates such as odds ratios or relative risks. This can be directly synthesized across conceptually similar studies using several approaches (see below). On the other hand, indirect aggregate data measures the effect of two treatments that were each compared against a similar control group in a meta-analysis. For example, if treatment A and treatment B were directly compared vs placebo in separate meta-analyses, we can use these two pooled results to get an estimate of the effects of A vs B in an indirect comparison as effect A vs Placebo minus effect B vs Placebo.
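The indirect comparison described above amounts to a subtraction on an additive scale. The following minimal sketch assumes that both pooled effects are expressed on such a scale (for example log odds ratios), that the two meta-analyses are independent so their variances add, and that a normal approximation is adequate for the interval; the function name and the default critical value are illustrative choices, not part of any particular package.

```python
import math


def indirect_comparison(effect_a, se_a, effect_b, se_b, z=1.96):
    """Indirect estimate of A vs B from A vs placebo and B vs placebo.

    Inputs are pooled effects on an additive scale and their standard
    errors. Independence of the two pooled estimates is assumed.
    """
    diff = effect_a - effect_b                    # (A vs P) - (B vs P)
    se = math.sqrt(se_a ** 2 + se_b ** 2)         # variances add
    return diff, se, (diff - z * se, diff + z * se)
```

Because the variances add, the indirect estimate is always less precise than either of the two direct estimates it is built from.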
IPD evidence represents raw data as collected by the study centers. This distinction has raised the need for different meta-analytic methods when evidence synthesis is desired, and has led to the development of one-stage and two-stage methods.  In one-stage methods the IPD from all studies are modeled simultaneously whilst accounting for the clustering of participants within studies. Two-stage methods first compute summary statistics for AD from each study and then calculate overall statistics as a weighted average of the study statistics. By reducing IPD to AD, two-stage methods can also be applied when IPD is available; this makes them an appealing choice when performing a meta-analysis. Although it is conventionally believed that one-stage and two-stage methods yield similar results, recent studies have shown that they may occasionally lead to different conclusions.
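A two-stage analysis can be illustrated with a short sketch. It assumes, purely for illustration, that each study supplies raw outcomes for a treated and a control group, that the study-level summary is a difference in means with its sampling variance, and that the second stage is an inverse-variance weighted average; the function names are hypothetical.

```python
from statistics import mean, variance


def study_summary(treated, control):
    """Stage 1: reduce one study's raw data to (effect, variance)."""
    effect = mean(treated) - mean(control)
    var = variance(treated) / len(treated) + variance(control) / len(control)
    return effect, var


def two_stage(studies):
    """Stage 2: inverse-variance weighted average of study summaries.

    `studies` is a list of (treated_values, control_values) pairs.
    Returns (pooled_effect, variance_of_pooled_effect).
    """
    summaries = [study_summary(t, c) for t, c in studies]
    weights = [1.0 / v for _, v in summaries]
    total = sum(weights)
    pooled = sum(w * e for w, (e, _) in zip(weights, summaries)) / total
    return pooled, 1.0 / total
```

A one-stage analysis would instead fit a single model to all participants at once with a term for study membership, which is beyond the scope of this sketch.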
Statistical models for aggregate data
Direct evidence: Models incorporating study effects only
Fixed effects model
The fixed effect model provides a weighted average of a series of study estimates. The inverse of the estimates' variance is commonly used as study weight, so that larger studies tend to contribute more than smaller studies to the weighted average. Consequently, when studies within a meta-analysis are dominated by a very large study, the findings from smaller studies are practically ignored. Most importantly, the fixed effects model assumes that all included studies investigate the same population, use the same variable and outcome definitions, etc. This assumption is typically unrealistic as research is often prone to several sources of heterogeneity; e.g. treatment effects may differ according to locale, dosage levels, study conditions, ...
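The inverse-variance weighting described above can be written compactly. The sketch below is illustrative only; the function name is hypothetical and the inputs are assumed to be study effect estimates with their sampling variances.

```python
def fixed_effect(effects, variances):
    """Inverse-variance weighted average under the fixed effect model.

    Returns (pooled_effect, variance_of_pooled_effect).
    """
    weights = [1.0 / v for v in variances]        # precise studies weigh more
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    return pooled, 1.0 / total
```

With effects 0.2 and 0.6 and variances 0.01 and 0.04, the weights are 100 and 25, so the pooled estimate (0.28) sits much closer to the more precise study.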
Random effects model
A common model used to synthesize heterogeneous research is the random effects model of meta-analysis. This is simply the weighted average of the effect sizes of a group of studies. The weight that is applied in this process of weighted averaging with a random effects meta-analysis is achieved in two steps:
- Step 1: Inverse variance weighting
- Step 2: Un-weighting of this inverse variance weighting by applying a random effects variance component (REVC) that is simply derived from the extent of variability of the effect sizes of the underlying studies.
This means that the greater this variability in effect sizes (otherwise known as heterogeneity), the greater the un-weighting and this can reach a point when the random effects meta-analysis result becomes simply the un-weighted average effect size across the studies. At the other extreme, when all effect sizes are similar (or variability does not exceed sampling error), no REVC is applied and the random effects meta-analysis defaults to simply a fixed effect meta-analysis (only inverse variance weighting).
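The two limiting cases described above can be checked numerically. In this minimal sketch the random effects variance component is passed in as a known value named `tau2` (an assumption of the example; in practice it has to be estimated), and each study's weight is the inverse of its sampling variance plus `tau2`.

```python
def random_effects(effects, variances, tau2):
    """Weighted average with weights 1 / (v_i + tau2).

    tau2 = 0 reproduces the fixed effect estimate; a very large tau2
    makes the weights nearly equal, approaching the unweighted mean.
    """
    weights = [1.0 / (v + tau2) for v in variances]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    return pooled, 1.0 / total


effects, variances = [0.2, 0.6], [0.01, 0.04]
no_heterogeneity, _ = random_effects(effects, variances, tau2=0.0)
extreme_heterogeneity, _ = random_effects(effects, variances, tau2=1e9)
```

Here `no_heterogeneity` equals the inverse-variance (fixed effect) estimate, while `extreme_heterogeneity` is essentially the arithmetic mean of the two effects.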
The extent of this reversal is solely dependent on two factors:
- Heterogeneity of precision
- Heterogeneity of effect size
Since neither of these factors automatically indicates a faulty larger study or more reliable smaller studies, the re-distribution of weights under this model will not bear a relationship to what these studies actually might offer. Indeed, it has been demonstrated that redistribution of weights is simply in one direction from larger to smaller studies as heterogeneity increases until eventually all studies have equal weight and no more redistribution is possible. Another issue with the random effects model is that the most commonly used confidence intervals generally do not retain their coverage probability above the specified nominal level and thus substantially underestimate the statistical error and are potentially overconfident in their conclusions. Several fixes have been suggested but the debate continues on. A further concern is that the average treatment effect can sometimes be even less conservative compared to the fixed effect model and therefore misleading in practice. One interpretational fix that has been suggested is to create a prediction interval around the random effects estimate to portray the range of possible effects in practice. However, an assumption behind the calculation of such a prediction interval is that trials are considered more or less homogeneous entities and that included patient populations and comparator treatments should be considered exchangeable and this is usually unattainable in practice.
The most widely used method to estimate between studies variance (REVC) is the DerSimonian-Laird (DL) approach. Several advanced iterative (and computationally expensive) techniques for computing the between studies variance exist (such as maximum likelihood, profile likelihood and restricted maximum likelihood methods) and random effects models using these methods can be run in Stata with the metaan command. The metaan command must be distinguished from the classic metan (single "a") command in Stata that uses the DL estimator. These advanced methods have also been implemented in a free and easy to use Microsoft Excel add-on, MetaEasy. However, a comparison between these advanced methods and the DL method of computing the between studies variance demonstrated that there is little to gain and DL is quite adequate in most scenarios.
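For illustration, the DerSimonian-Laird estimator is a method-of-moments calculation that needs no iteration. The sketch below follows the commonly stated form, in which Cochran's Q is compared with its degrees of freedom and negative estimates are truncated at zero; the function name is hypothetical.

```python
def dersimonian_laird_tau2(effects, variances):
    """Method-of-moments estimate of the between-study variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / total
    # Cochran's Q: weighted squared deviations from the fixed effect mean
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = total - sum(w ** 2 for w in weights) / total
    return max(0.0, (q - df) / c)                 # truncate at zero
```

The truncation at zero is the mechanism behind the "false homogeneity" problem in small meta-analyses: whenever Q falls below its degrees of freedom, the estimate is exactly zero.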
However, most meta-analyses include between 2 and 4 studies and such a sample is more often than not inadequate to accurately estimate heterogeneity. Thus it appears that in small meta-analyses, an incorrect zero between study variance estimate is obtained, leading to a false homogeneity assumption. Overall, it appears that heterogeneity is being consistently underestimated in meta-analyses and sensitivity analyses in which high heterogeneity levels are assumed could be informative. These random effects models and software packages mentioned above relate to study-aggregate meta-analyses and researchers wishing to conduct individual patient data (IPD) meta-analyses need to consider mixed-effects modelling approaches.
Doi & Barendregt, working in collaboration with Khan, Thalib and Williams (from the University of Queensland, University of Southern Queensland and Kuwait University), have created an inverse variance quasi-likelihood based alternative (IVhet) to the random effects (RE) model, for which details are available online. This was incorporated into MetaXL version 2.0, a free Microsoft Excel add-in for meta-analysis produced by Epigear International Pty Ltd, and made available on 5 April 2014. The authors state that a clear advantage of this model is that it resolves the two main problems of the random effects model. The first advantage of the IVhet model is that coverage remains at the nominal (usually 95%) level for the confidence interval, unlike the random effects model, whose coverage drops with increasing heterogeneity. The second advantage is that the IVhet model maintains the inverse variance weights of individual studies, unlike the RE model, which gives small studies more weight (and therefore larger studies less) with increasing heterogeneity. When heterogeneity becomes large, the individual study weights under the RE model become equal and thus the RE model returns an arithmetic mean rather than a weighted average. This side-effect of the RE model does not occur with the IVhet model, which thus differs from the RE model estimate in two respects: pooled estimates will favor larger trials (as opposed to penalizing larger trials in the RE model) and will have a confidence interval that remains within the nominal coverage under uncertainty (heterogeneity). Doi & Barendregt suggest that while the RE model provides an alternative method of pooling the study data, their simulation results demonstrate that using a more specified probability model with untenable assumptions, as with the RE model, does not necessarily provide better results.
The latter study also reports that the IVhet model resolves the problems related to underestimation of the statistical error, poor coverage of the confidence interval and increased MSE seen with the random effects model, and the authors conclude that researchers should henceforth abandon use of the random effects model in meta-analysis. While their data is compelling, the ramifications (in terms of the magnitude of spuriously positive results within the Cochrane database) are huge, and thus accepting this conclusion requires careful independent confirmation. The availability of free software (MetaXL) that runs the IVhet model (and all other models for comparison) facilitates this for the research community.
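The idea of keeping inverse-variance weights while widening the interval can be sketched as follows. This is a hedged illustration rather than the MetaXL implementation: it assumes the IVhet point estimate is the ordinary inverse-variance weighted mean and that its variance takes the quasi-likelihood form in which each study's normalized weight is squared and multiplied by its sampling variance plus the between-study variance. The between-study variance is passed in as `tau2`, and both the function name and that variance formula should be checked against the authors' published description before use.

```python
def ivhet(effects, variances, tau2):
    """Sketch of an IVhet-style estimate (assumed formula, see lead-in).

    Point estimate: inverse-variance weighted mean (unchanged by tau2).
    Variance: sum of (normalized weight)^2 * (v_i + tau2).
    """
    raw = [1.0 / v for v in variances]
    total = sum(raw)
    norm = [w / total for w in raw]               # weights sum to 1
    pooled = sum(w * y for w, y in zip(norm, effects))
    var = sum(w ** 2 * (v + tau2) for w, v in zip(norm, variances))
    return pooled, var
```

Under this form, heterogeneity inflates the variance but never shifts weight from larger to smaller studies; with `tau2` equal to zero the result coincides with the fixed effect model.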
Direct evidence: Models incorporating additional information
Quality effects model
Doi and Thalib originally introduced the quality effects model. They introduced a new approach to adjustment for inter-study variability by incorporating the contribution of variance due to a relevant component (quality) in addition to the contribution of variance due to random error that is used in any fixed effects meta-analysis model to generate weights for each study. The strength of the quality effects meta-analysis is that it allows available methodological evidence to be used over subjective random effects, and thereby helps to close the damaging gap which has opened up between methodology and statistics in clinical research. To do this, a synthetic bias variance is computed based on quality information to adjust inverse variance weights, and the quality adjusted weight of the ith study is introduced. These adjusted weights are then used in meta-analysis. In other words, if study i is of good quality and other studies are of poor quality, a proportion of their quality adjusted weights is mathematically redistributed to study i, giving it more weight towards the overall effect size. As studies become increasingly similar in terms of quality, re-distribution becomes progressively less and ceases when all studies are of equal quality (in the case of equal quality, the quality effects model defaults to the IVhet model; see previous section). A recent evaluation of the quality effects model (with some updates) demonstrates that despite the subjectivity of quality assessment, the performance (MSE and true variance under simulation) is superior to that achievable with the random effects model. This model thus replaces the untenable interpretations that abound in the literature, and software is available to explore this method further.
Indirect evidence: Network meta-analysis methods
Indirect comparison meta-analysis methods (also called network meta-analyses, in particular when multiple treatments are assessed simultaneously) generally use two main methodologies. The first is the Bucher method, which is a single or repeated comparison of a closed loop of three treatments such that one of them is common to the two studies and forms the node where the loop begins and ends. Therefore, multiple two-by-two comparisons (3-treatment loops) are needed to compare multiple treatments. This methodology requires that trials with more than two arms have only two arms selected, as independent pair-wise comparisons are required. The alternative methodology uses complex statistical modelling to include the multiple-arm trials and comparisons simultaneously between all competing treatments. These have been executed using Bayesian methods, mixed linear models and meta-regression approaches.
Specifying a Bayesian network meta-analysis model involves writing a directed acyclic graph (DAG) model for general-purpose Markov chain Monte Carlo (MCMC) software such as WinBUGS. In addition, prior distributions have to be specified for a number of the parameters, and the data have to be supplied in a specific format. Together, the DAG, priors, and data form a Bayesian hierarchical model. To complicate matters further, because of the nature of MCMC estimation, overdispersed starting values have to be chosen for a number of independent chains so that convergence can be assessed. Currently, there is no software that automatically generates such models, although there are some tools to aid in the process. The complexity of the Bayesian approach has limited usage of this methodology. Methodology for automation of this method has been suggested but requires that arm-level outcome data are available, and this is usually unavailable. Great claims are sometimes made for the inherent ability of the Bayesian framework to handle network meta-analysis and its greater flexibility. However, this choice of implementation of framework for inference, Bayesian or frequentist, may be less important than other choices regarding the modeling of effects (see discussion on models above).
Frequentist multivariate framework
On the other hand, the frequentist multivariate methods involve approximations and assumptions that are not stated explicitly or verified when the methods are applied (see discussion on meta-analysis models above). For example, the mvmeta package for Stata enables network meta-analysis in a frequentist framework. However, if there is no common comparator in the network, then this has to be handled by augmenting the dataset with fictional arms with high variance, which is not very objective and requires a decision as to what constitutes a sufficiently high variance. The other issue is use of the random effects model in both this frequentist framework and the Bayesian framework. Senn advises analysts to be cautious about interpreting the 'random effects' analysis since only one random effect is allowed for but one could envisage many. Senn goes on to say that it is rather naïve, even in the case where only two treatments are being compared, to assume that random-effects analysis accounts for all uncertainty about the way effects can vary from trial to trial. Newer models of meta-analysis such as those discussed above would certainly help alleviate this situation and have been implemented in the next framework.
Generalized pairwise modelling framework
An approach that has been tried since the late 1990s is the implementation of the multiple three-treatment closed-loop analysis. This has not been popular because the process rapidly becomes overwhelming as network complexity increases. Development in this area was then abandoned in favor of the Bayesian and multivariate frequentist methods which emerged as alternatives. Very recently, automation of the three-treatment closed-loop method has been developed for complex networks by some researchers as a way to make this methodology available to the mainstream research community. This proposal does restrict each trial to two interventions, but also introduces a workaround for multiple-arm trials: a different fixed control node can be selected in different runs. It also utilizes robust meta-analysis methods so that many of the problems highlighted above are avoided. Further research around this framework is required to determine whether it is indeed superior to the Bayesian or multivariate frequentist frameworks. Researchers willing to try this out have access to this framework through free software.
Another form of additional information comes from the intended setting. If the target setting for applying the meta-analysis results is known, then it may be possible to use data from the setting to tailor the results, thus producing a ‘tailored meta-analysis’. This has been used in test accuracy meta-analyses, where empirical knowledge of the test positive rate and the prevalence have been used to derive a region in Receiver Operating Characteristic (ROC) space known as an ‘applicable region’. Studies are then selected for the target setting based on comparison with this region and aggregated to produce a summary estimate which is tailored to the target setting.
Validation of meta-analysis results
The meta-analysis estimate represents a weighted average across studies, and when there is heterogeneity the summary estimate may not be representative of individual studies. Qualitative appraisal of the primary studies using established tools can uncover potential biases, but does not quantify the aggregate effect of these biases on the summary estimate. Although the meta-analysis result could be compared with an independent prospective primary study, such external validation is often impractical. This has led to the development of methods that exploit a form of leave-one-out cross validation, sometimes referred to as internal-external cross validation (IOCV). Here each of the k included studies in turn is omitted and compared with the summary estimate derived from aggregating the remaining k − 1 studies. A general validation statistic, Vn, based on IOCV has been developed to measure the statistical validity of meta-analysis results. For test accuracy and prediction, particularly when there are multivariate effects, other approaches which seek to estimate the prediction error have also been proposed.
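The leave-one-out procedure described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the remaining k − 1 studies are pooled with a fixed-effect inverse-variance average, the effect sizes and variances are made-up numbers, and the function names (`pool`, `leave_one_out`) are hypothetical. It does not compute the Vn statistic, whose definition is not given here; it reports a simple standardized difference between each omitted study and the pooled estimate of the others.

```python
import math


def pool(effects, variances):
    """Fixed-effect inverse-variance pooled estimate and its variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    return pooled, 1.0 / total


def leave_one_out(effects, variances):
    """Omit each study in turn and compare it with the pooled rest.

    Returns one tuple per study:
    (index, pooled estimate of the other studies, difference, z-score).
    """
    effects, variances = list(effects), list(variances)
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")
    results = []
    for i in range(len(effects)):
        rest_effects = effects[:i] + effects[i + 1:]
        rest_variances = variances[:i] + variances[i + 1:]
        pooled, pooled_var = pool(rest_effects, rest_variances)
        diff = effects[i] - pooled
        # The omitted study is independent of the remaining studies,
        # so the variance of the difference is the sum of both variances.
        z = diff / math.sqrt(variances[i] + pooled_var)
        results.append((i, pooled, diff, z))
    return results


# Illustrative (made-up) effect sizes and sampling variances.
example_effects = [0.2, 0.5, 0.8]
example_variances = [0.04, 0.01, 0.04]
for index, pooled, diff, z in leave_one_out(example_effects, example_variances):
    print(f"study {index}: pooled rest = {pooled:.3f}, diff = {diff:.3f}, z = {z:.2f}")
```

Large absolute z-scores flag studies that the remaining evidence predicts poorly; a random-effects pooling step could be substituted in `pool` when heterogeneity is expected.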
Applications in modern science
Modern statistical meta-analysis does more than just combine the effect sizes of a set of studies using a weighted average. It can test if the outcomes of studies show more variation than the variation that is expected because of the sampling of different numbers of research participants. Additionally, study characteristics such as measurement instrument used, population sampled, or aspects of the studies' design can be coded and used to reduce variance of the estimator (see statistical models above). Thus some methodological weaknesses in studies can be corrected statistically. Other uses of meta-analytic methods include the development of clinical prediction models, where meta-analysis may be used to combine data from different research centers, or even to aggregate existing prediction models.
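One common way to test whether study outcomes vary more than sampling error alone would predict is Cochran's Q, often reported together with the I² index. The sketch below is a minimal illustration, assuming each study supplies an effect size and its sampling variance; the data are made-up and the function name `heterogeneity` is hypothetical.

```python
def heterogeneity(effects, variances):
    """Cochran's Q, its degrees of freedom, and the I-squared index.

    Q is the weighted sum of squared deviations of the study effects from
    the fixed-effect (inverse-variance) pooled estimate. If sampling error
    were the only source of variation, Q would approximately follow a
    chi-square distribution with k - 1 degrees of freedom.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I-squared: share of total variation attributed to heterogeneity,
    # truncated at zero when Q is smaller than its degrees of freedom.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i_squared


# Illustrative (made-up) effect sizes and sampling variances.
q, df, i_squared = heterogeneity([0.2, 0.5, 0.8], [0.04, 0.01, 0.04])
print(f"Q = {q:.2f} on {df} df, I^2 = {100 * i_squared:.1f}%")
```

Q is compared with a chi-square distribution on k − 1 degrees of freedom; a value well above k − 1 suggests variation beyond sampling error.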
Meta-analysis can be done with single-subject designs as well as group research designs. This is important because much research has been done with single-subject research designs. Considerable dispute exists over the most appropriate meta-analytic technique for single-subject research.
Meta-analysis leads to a shift of emphasis from single studies to multiple studies. It emphasizes the practical importance of the effect size instead of the statistical significance of individual studies. This shift in thinking has been termed "meta-analytic thinking". The results of a meta-analysis are often shown in a forest plot.
Results from studies are combined using different approaches. One approach frequently used in meta-analysis in health care research is termed 'inverse variance method'. The average effect size across all studies is computed as a weighted mean, whereby the weights are equal to the inverse variance of each study's effect estimator. Larger studies and studies with less random variation are given greater weight than smaller studies. Other common approaches include the Mantel–Haenszel method and the Peto method.
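The inverse variance method can be written in a few lines of Python. This is a minimal sketch of the fixed-effect version only, assuming each study reports an effect estimate and its sampling variance; the numbers are made-up and the function name `inverse_variance_pool` is hypothetical.

```python
import math


def inverse_variance_pool(effects, variances):
    """Weighted mean effect with weights equal to 1 / variance.

    Returns the pooled estimate and its standard error, which under the
    fixed-effect model is sqrt(1 / sum of weights).
    """
    if len(effects) != len(variances) or not effects:
        raise ValueError("effects and variances must be non-empty and equal in length")
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total_weight
    standard_error = math.sqrt(1.0 / total_weight)
    return pooled, standard_error


# Illustrative (made-up) data: the middle study has the smallest variance,
# so it receives the largest weight (100 versus 25 for each of the others).
pooled, se = inverse_variance_pool([0.2, 0.5, 0.8], [0.04, 0.01, 0.04])
print(f"pooled effect = {pooled:.3f}, standard error = {se:.3f}")
```

Because the weight is the reciprocal of the variance, a study with one quarter of the variance counts four times as much toward the pooled estimate.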
Seed-based d mapping (formerly signed differential mapping, SDM) is a statistical technique for meta-analyzing studies on differences in brain activity or structure which used neuroimaging techniques such as fMRI, VBM or PET.
Different high-throughput techniques such as microarrays have been used to understand gene expression. MicroRNA expression profiles have been used to identify differentially expressed microRNAs in a particular cell or tissue type or disease condition, or to check the effect of a treatment. A meta-analysis of such expression profiles was performed to derive novel conclusions and to validate the known findings.
- Greenland S, O'Rourke K (2008). "Meta-Analysis". In Rothman KJ, Greenland S, Lash T (eds.). Modern Epidemiology (3rd ed.). Lippincott Williams and Wilkins. p. 652.
- Walker E, Hernandez AV, Kattan MW (2008). "Meta-analysis: Its strengths and limitations". Cleve Clin J Med. 75 (6): 431–9. PMID 18595551.
- Glossary at Cochrane Collaboration
- Plackett RL (1958). "Studies in the History of Probability and Statistics: VII. The Principle of the Arithmetic Mean". Biometrika. 45 (1–2): 133. doi:10.1093/biomet/45.1-2.130. Retrieved 29 May 2016.
- Pearson K (1904). "Report on certain enteric fever inoculation statistics". BMJ. 2 (2288): 1243–1246. doi:10.1136/bmj.2.2288.1243. PMC 2355479. PMID 20761760.
- Nordmann AJ, Kasenda B, Briel M (Mar 9, 2012). "Meta-analyses: what they can and cannot do". Swiss Medical Weekly. 142: w13518. doi:10.4414/smw.2012.13518. PMID 22407741.
- O'Rourke K (2007-12-01). "An historical perspective on meta-analysis: dealing quantitatively with varying study results". J R Soc Med. 100 (12): 579–582. doi:10.1258/jrsm.100.12.579. PMC 2121629. PMID 18065712.
- Pratt JG, Rhine JB, Smith BM, Stuart CE, Greenwood JA (1940). Extra-Sensory Perception after Sixty Years: A Critical Appraisal of the Research in Extra-Sensory Perception. New York: Henry Holt.
- Glass GV (1976). "Primary, secondary, and meta-analysis of research". Educational Researcher. 5 (10): 3–8. doi:10.3102/0013189X005010003.
- Cochran WG (1937). "Problems Arising in the Analysis of a Series of Similar Experiments". Journal of the Royal Statistical Society. 4: 102–118. doi:10.2307/2984123.
- Cochran WG, Carroll SP (1953). "A Sampling Investigation of the Efficiency of Weighting Inversely as the Estimated Variance". Biometrics. 9: 447–459. doi:10.2307/3001436.
- LeLorier J, Grégoire G, Benhaddad A, Lapierre J, Derderian F (1997). "Discrepancies between Meta-Analyses and Subsequent Large Randomized, Controlled Trials". New England Journal of Medicine. 337 (8): 536–542. doi:10.1056/NEJM199708213370806. PMID 9262498.
- Slavin RE (1986). "Best-Evidence Synthesis: An Alternative to Meta-Analytic and Traditional Reviews". Educational Researcher. 15 (9): 5–9. doi:10.3102/0013189X015009005.
- Hunter JE, Schmidt FL, Jackson GB (1982). Meta-analysis: Cumulating research findings across studies. Beverly Hills, California: Sage.
- Glass GV, McGaw B, Smith ML (1981). Meta-analysis in social research. Beverly Hills, California: Sage.
- Rosenthal R (1979). "The "File Drawer Problem" and the Tolerance for Null Results". Psychological Bulletin. 86 (3): 638–641. doi:10.1037/0033-2909.86.3.638.
- Hunter JE, Schmidt FL (1990). Methods of Meta-Analysis: Correcting Error and Bias in Research Findings. Newbury Park, California; London; New Delhi: SAGE Publications.
- Light RJ, Pillemer DB (1984). Summing up: The science of reviewing research. Cambridge, Massachusetts: Harvard University Press.
- Ioannidis JP, Trikalinos TA (2007). "The appropriateness of asymmetry tests for publication bias in meta-analyses: a large survey". CMAJ. 176 (8): 1091–6. doi:10.1503/cmaj.060410. PMC 1839799. PMID 17420491.
- Ferguson CJ, Brannick MT (2012). "Publication bias in psychological science: prevalence, methods for identifying and controlling, and implications for the use of meta-analyses". Psychol Methods. 17 (1): 120–8. doi:10.1037/a0024445. PMID 21787082.
- Simmons JP, Nelson LD, Simonsohn U (2011). "False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant". Psychol Sci. 22 (11): 1359–66. doi:10.1177/0956797611417632. PMID 22006061.