The Garden of Forking Paths
Published psychological research can appear to be a streamlined and logical process, but behind the scenes, it is filled with complex decisions. From generating a research question to cleaning data and interpreting results, researchers face what has been called a "garden of forking paths" [1]. Each decision made—such as which variables to control for or how to handle outliers—creates an exponentially growing number of alternative analytical pathways that are often hidden in the final publication.
These "researcher degrees of freedom" [2] can lead to an over-reporting of statistically significant results and an under-reporting of non-significant ones, whether intentionally or not. Practices like "p-hacking"—selectively reporting analyses that yield significant results—or "HARKing"—forming a hypothesis after the results are known [3]—can exploit this flexibility. Such issues are thought to be a major contributor to the replication crisis in psychology, where a large proportion of published findings cannot be reproduced, likely because the reported effects were false positives to begin with [4].
While Open Science principles like pre-registration and data sharing aim to curb these problems, they have practical limitations and don't solve a more fundamental issue: different researchers making perfectly justifiable and transparent choices can still arrive at different conclusions. This was demonstrated in crowdsourced studies where dozens of analytical teams were given the same dataset and research question but produced a wide variety of results [5]. This suggests that a degree of idiosyncratic variability is simply inherent to the scientific process.
This context led to the development of analytical approaches like multiverse analysis (MVA). Instead of choosing just one analytical path, a multiverse analysis addresses the problem by running every possible combination of a set of justifiable choices to create a vast landscape of results. By inspecting this “multiverse” of outcomes, researchers can assess whether a hypothesised effect is stable and robust or whether it only appears under a narrow set of analytical conditions. This project implements the multiverse approach, building on both a previous crowdsourced analysis and a partial multiverse analysis of the same dataset, to provide novel insights into the results themselves and into the strengths, weaknesses, and potential of multiverse analysis in general.
What are we looking at?
- Across all possible models, how much of the variation in red cards can be explained by the included predictors (i.e., what is the marginal variance explained)?
- When we group the models by which predictor they contain, what is the average performance for each group?
- Focusing on the models that include player skin tone, how do the odds ratios and their statistical significance compare to the findings from previous crowdsourced and multiverse analyses of this dataset?
Key takeaways
- A consistently significant link was found between darker skin tone and receiving a red card after statistical issues like overfitting were corrected.
- However, the models that show this effect have very low explanatory power, meaning they don't do a good job of predicting the outcome in reality.
- This highlights that a finding can be statistically significant without being practically meaningful.
What is a Multiverse Analysis?
A multiverse analysis (MVA) is a computational strategy designed to address the problems of "researcher degrees of freedom" and the "garden of forking paths". Instead of hiding the countless analytical choices a researcher can make, MVA embraces them by running every possible combination of a set of justifiable choices. This creates a "multiverse" of results, allowing us to inspect patterns across the findings.
Previous Analyses of this Dataset
The dataset used in this project has been analyzed before:
- The original crowdsourced study showed that 29 different expert teams produced a wide range of results, highlighting the problem of idiosyncratic researcher choices.
- Previous multiverse analyses also explored this dataset but focused on different aspects, such as a more constrained set of variables or comparing different types of statistical models. This project builds on that work by conducting a more comprehensive covariate multiverse with a novel focus on model performance.
Original Data
The data comes from a sports statistics company and includes 146,028 unique interactions between 2,053 professional football players and 3,147 referees from the top European leagues in the 2012-2013 season. A crucial data cleaning step removed "contaminating" referees (those not part of the target season).
Codebook (Selected Variables)
Original Variables
- playerShort: Unique player ID
- club: Player's club
- birthday: Player's date of birth
- height/weight: Player's height and weight
- position: Player's on-field position
- games: Number of games in the player-referee dyad
- redCards: Number of red cards given by the referee to the player
- rater1/rater2: Skin tone ratings from two independent raters
- refCountry: Unique ID for the referee's country
- meanIAT: Mean implicit bias score for the referee's country
Modified Variables
- avrate: The average of the two skin tone ratings.
- age: Player's age as of a specific date during the season.
- position (collapsed): Player positions grouped into four main categories (Goalkeeper, Back, Middle, Front).
- redcard (dichotomous): Red card outcome converted to a binary (Yes/No) variable.
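The sketch below illustrates how these modifications could be coded; it is not the original preprocessing script. The reference date for age, the birthday format, and the mapping of detailed position labels onto the four groups are assumptions for illustration, and the cleaned data object is assumed to be called redcard (as in Snippet 2 below); only the column names come from the codebook.

# Hypothetical sketch of the variable modifications described above
library(tidyverse)   # dplyr/stringr for wrangling
library(lubridate)   # date handling

redcard <- redcard %>%
  mutate(
    # Average of the two independent skin tone ratings
    avrate = (rater1 + rater2) / 2,
    # Age at an assumed mid-season reference date (birthday format also assumed)
    age = as.numeric(as.Date("2013-01-01") - dmy(birthday)) / 365.25,
    # Collapse detailed positions into four broad groups (mapping assumed for illustration)
    position = case_when(
      position == "Goalkeeper"          ~ "Goalkeeper",
      str_detect(position, "Back|back") ~ "Back",
      str_detect(position, "Mid")       ~ "Middle",
      TRUE                              ~ "Front"
    ),
    # Dichotomise the red card count into a binary outcome, as used in the models below
    redCards = as.integer(redCards > 0)
  )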
Multiverse coding
The analysis was conducted in R (version 4.1.2), mainly utilising tidyverse (data wrangling) and lme4 (statistical analyses). Multilevel logistic regression modelling was selected to analyse the data, as it was one of the preferred approaches in the crowdsourced project and its properties were suitable for the intended model.
Snippet 1: Generating All Possible Models
To create the multiverse, we first defined a list of 10 potential covariates. The R code below then uses the expand.grid() function to generate every possible combination of these covariates, resulting in a master list of 1,023 unique model formulas to be tested.
# Create list of potential covariates (NA means the covariate is excluded)
covariates_list <- list(avrate = c(NA, 'avrate'),
                        position = c(NA, 'position'),
                        yellowCards = c(NA, 'yellowCards'),
                        height = c(NA, 'height'),
                        weight = c(NA, 'weight'),
                        club = c(NA, 'club'),
                        age = c(NA, 'age'),
                        meanIAT = c(NA, 'meanIAT'),
                        refCountry = c(NA, 'refCountry'),
                        victories = c(NA, 'victories'))

# Create a grid of all possible combinations (2^10 = 1,024 rows)
covariate_grid <- expand.grid(covariates_list, stringsAsFactors = FALSE)

# Remove the empty first row (no covariates), leaving 1,023 combinations
covariate_grid <- covariate_grid[-1, ]

# Collapse the non-missing covariates in each row into a fixed-effects formula string
covariate_grid$formula <- apply(covariate_grid, 1,
                                function(row) paste(na.omit(row), collapse = ' + '))
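As a quick sanity check (not part of the original snippet), the grid can be inspected before any models are run:

# Sanity checks on the generated grid
nrow(covariate_grid)            # expected: 1023 (2^10 combinations minus the empty one)
head(covariate_grid$formula, 3) # preview the first few fixed-effects formula strings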
Snippet 2: Running the Analysis
With the 1,023 unique model formulas generated, the next step was to run a multilevel logistic regression for each one. The R code below shows the for loop that iterates through every formula and fits the model using the glmer() function.
# Load lme4 for multilevel (mixed-effects) logistic regression
library(lme4)

# Starting the loop
for(i in 1:nrow(covariate_grid)) {
  # Reset the error flag at the start of each iteration
  skip_to_next <- FALSE
  # Each row of covariate_grid supplies the fixed-effects part of the formula
  output <- tryCatch(glmer(data = redcard,
                           formula = as.formula(paste('redCards ~',
                                                      covariate_grid[i, 'formula'],
                                                      '+ (1 | playerShort) + (1 | refNum)')),
                           family = binomial(link = "logit"),
                           control = glmerControl(optimizer = "bobyqa"),
                           nAGQ = 0),
                     error = function(e) {
                       # Flag the iteration to be skipped if the model fails
                       skip_to_next <<- TRUE
                     })
  if(skip_to_next) { next }
  # ... code to extract results (R2, ORs, etc.) would follow ...
}
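To make the elided extraction step concrete, here is a minimal sketch of how the marginal pseudo-R² and the odds ratios with 95% confidence intervals could be pulled from each fitted model. It assumes the MuMIn package and a pre-created list called results; both are assumptions for illustration, as the original extraction code is not shown.

# Hypothetical sketch of result extraction, run inside the loop after a successful fit
library(MuMIn)   # r.squaredGLMM(); would normally be loaded once at the top of the script

r2    <- r.squaredGLMM(output)            # matrix with columns R2m (marginal) and R2c (conditional)
betas <- fixef(output)                    # fixed-effect estimates on the log-odds scale
cis   <- confint(output, method = "Wald") # fast Wald confidence intervals

results[[i]] <- data.frame(model = i,
                           term  = names(betas),
                           OR    = exp(betas),                # odds ratios
                           lower = exp(cis[names(betas), 1]), # lower 95% bound
                           upper = exp(cis[names(betas), 2]), # upper 95% bound
                           R2m   = r2[1, "R2m"])              # marginal variance explained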
A Single Universe: Model 129
To make the process concrete, the code from Snippet 2 dynamically builds the regression formula. For model 129, this formula becomes:
redCards ~ avrate + meanIAT + (1 | playerShort) + (1 | refNum)
This tells the model to predict red cards from skin tone (`avrate`) and the referee's country's implicit bias (`meanIAT`), while accounting for the nested data structure. Below is the simplified fixed-effects output for this specific model.
| Term | Odds Ratio | Lower CI | Upper CI | p-value |
|---|---|---|---|---|
| avrate (Skin Tone) | 1.21 | 0.97 | 1.52 | 0.092 |
Note: Full table including the intercept and other parameters is omitted for clarity.
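For reference, the same universe can be fitted directly outside the loop; the following one-off call is a sketch equivalent to a single iteration of Snippet 2, assuming the data object redcard from the earlier snippets.

# One-off fit equivalent to "model 129" from the multiverse loop
m129 <- glmer(redCards ~ avrate + meanIAT + (1 | playerShort) + (1 | refNum),
              data = redcard,
              family = binomial(link = "logit"),
              control = glmerControl(optimizer = "bobyqa"),
              nAGQ = 0)

exp(fixef(m129))  # odds ratios for the intercept, avrate, and meanIAT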
Breaking Down the Function
- `family = binomial(link = "logit")`: This specifies a logistic regression, the correct model for a binary outcome (a player either received a red card or did not).
- `control = glmerControl(optimizer = "bobyqa")`: This instructs the model to use the `bobyqa` algorithm to find the best-fitting solution. This optimizer is well-suited for complex models because it finds the optimal result efficiently without requiring computationally intensive derivatives. [6]
- `nAGQ = 0`: This is a setting that manages the trade-off between speed and precision. For this project, a faster, less precise estimation method was chosen to save computational power, a practical necessity for running over 1,000 models. [6]
- `error = function(e)` and `if(skip_to_next)`: This is a crucial error handling mechanism. If a model fails to compute, this code "catches" the error and tells the loop to simply skip to the next model, preventing the entire analysis from crashing.
Figure 1: Full Multiverse Performance
This scatter plot displays the marginal variance explained (R²M) by all 1,023 models in the multiverse. The plot reveals two instances where the variance explained increases noticeably; these are caused by the inclusion of the players' club and referees' country covariates, which was interpreted as evidence of overfitting.
Figure 2: Multiverse Performance Without Overfitting
This plot shows the marginal variance explained (R²M) for the subset of 256 models after removing the two covariates suspected of causing overfitting. This visualization demonstrates that, with overfitting eliminated, the models explain a modest amount of variance (less than 6%).
Figure 3: Performance of Skin-Tone-Inclusive Models
This plot is a subset of the full multiverse, displaying the performance for the 512 models that specifically include averaged players' skin tone rating as a predictor.
Figure 4: Performance of Skin-Tone-Inclusive Models Without Overfitting
This plot displays the 128 models from Figure 3 after the two overfitting covariates have been removed, confirming that the remaining models have modest predictive power.
Figure 5: Average Model Performance by Covariate (Full)
This visualization uses violin plots to show the distribution of model performance grouped by each of the ten covariates. The plots for Country and Club are dramatically different, clustering at much higher values.
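A minimal sketch of how this kind of grouped comparison could be drawn is shown below. It assumes a data frame model_summaries with one row per model, a marginal R² column R2m, and one logical column per covariate flagging inclusion; these names are hypothetical, as the original plotting code is not shown.

# Hypothetical sketch of the grouped violin plots (Figure 5 / Figure 6 style)
library(tidyverse)

covariate_names <- names(covariates_list)   # the ten covariates from Snippet 1

model_summaries %>%
  pivot_longer(cols = all_of(covariate_names),
               names_to = "covariate", values_to = "included") %>%
  filter(included) %>%                      # keep only the models containing each covariate
  ggplot(aes(x = covariate, y = R2m)) +
  geom_violin() +
  labs(x = "Covariate included in the model",
       y = "Marginal variance explained (R²M)")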
Figure 6: Average Model Performance by Covariate (Without Overfitting)
This is the same visualization as Figure 5 but based only on the models where players' club and referees' country have been excluded. No single covariate dramatically outperforms the others.
Figure 7: Distribution of Skin Tone ORs (Full)
This density plot shows the bimodal (two-peaked) distribution of odds ratios (ORs) for the effect of skin tone from all 512 models that included it.
Figure 8: Skin Tone ORs with Confidence Intervals (Full)
This plot displays each of the 512 odds ratios as a point, ordered by size, with shaded areas representing their 95% confidence intervals. 46.5% of the estimates were statistically significant (blue).
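The sketch below shows how a plot of this kind could be built; it assumes a data frame skin_tone_results with one row per model and columns OR, lower, and upper for the skin tone odds ratio and its 95% confidence bounds (all hypothetical names, since the original plotting code is not shown).

# Hypothetical sketch of the ordered odds-ratio plot (Figure 8 / Figure 10 style)
library(tidyverse)

skin_tone_results %>%
  arrange(OR) %>%
  mutate(rank = row_number(),                     # order models by the size of the OR
         significant = lower > 1 | upper < 1) %>% # CI excluding 1 = statistically significant
  ggplot(aes(x = rank, y = OR)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3) +
  geom_point(aes(colour = significant)) +
  geom_hline(yintercept = 1, linetype = "dashed") + # OR = 1 marks "no effect"
  labs(x = "Models ordered by odds ratio",
       y = "Odds ratio for skin tone (avrate)",
       colour = "95% CI excludes 1")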
Figure 9: Distribution of Skin Tone ORs (Without Overfitting)
This density plot shows the distribution for the 128 models remaining after removing overfitting. The distribution is much more compact and the median OR is higher at 1.33.
Figure 10: Skin Tone ORs with Confidence Intervals (Without Overfitting)
This plot shows the odds ratios for the 128 "cleaned" models. After eliminating overfitting, 92.2% of the estimates for the effect of skin tone are statistically significant (blue).
Statistical Significance vs. Practical Meaning
The multiverse analysis shows a consistently significant link between darker skin tone and red cards after correcting for overfitting. However, the very low explanatory power (marginal R²) of these models suggests that this effect, while statistically real, is practically weak. It's a small piece of a much larger puzzle that the models don't capture.
The core conclusion is that researchers and the public should be cautious. A statistically significant finding doesn't automatically mean it's a powerful or important real-world effect. Looking at model performance is just as crucial.
What This Analysis Doesn't Cover
- Computational Constraints: Due to a lack of sufficient computational power, not all possible covariates or their interactions (like Body Mass Index) could be included in the multiverse.
- Model Fit Statistics: Logistic regression models don't have a standard R² value. The "pseudo-R²" used is a helpful approximation but isn't a perfect measure of model fit.
MVA as an Evolutionary Tool
Multiverse Analysis is a powerful exploratory tool for checking the stability of findings, but it isn't a magic bullet. Its results are only as good as the initial choices made by the researcher. This highlights the need for the scientific community to establish "principles of good practice" to guide new users.
Ultimately, MVA represents an "evolution rather than revolution" in data analysis. Its focus on the distribution of effects, rather than a single p-value, encourages a more nuanced understanding of results and helps move the field beyond a simple "significant/not-significant" mindset.
References
1. Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. Department of Statistics, Columbia University.
2. Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2016). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. In A. E. Kazdin (Ed.), Methodological issues and strategies in clinical research (pp. 547-555). American Psychological Association.
3. Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2(3), 196-217.
4. Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251).
5. Silberzahn, R., Uhlmann, E. L., Martin, D. P., et al. (2018). Many analysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods and Practices in Psychological Science, 1(3), 337-356.
6. Bates, D., Maechler, M., Bolker, B., & Walker, S. (2025). lme4: Linear mixed-effects models using Eigen and S4. R package version 1.1-37. https://cran.r-project.org/web/packages/lme4/lme4.pdf