The Garden of Forking Paths
Published psychological research can appear to be a streamlined and logical process, but behind the scenes, it is filled with complex decisions. From generating a research question to cleaning data and interpreting results, researchers face what has been called a "garden of forking paths" [1]. Each decision made—such as which variables to control for or how to handle outliers—creates an exponentially growing number of alternative analytical pathways that are often hidden in the final publication.
These "researcher degrees of freedom" [2] can lead to an over-reporting of statistically significant results and an under-reporting of non-significant ones, whether intentionally or not. Practices like "p-hacking"—selectively reporting analyses that yield significant results—or "HARKing"—forming a hypothesis after the results are known [3]—can exploit this flexibility. Such issues are thought to be a major contributor to the replication crisis in psychology, where a large proportion of published findings cannot be reproduced, likely because the reported effects were false positives to begin with [4].
While Open Science principles like pre-registration and data sharing aim to curb these problems, they have practical limitations and don't solve a more fundamental issue: different researchers making perfectly justifiable and transparent choices can still arrive at different conclusions. This was demonstrated in crowdsourced studies where dozens of analytical teams were given the same dataset and research question but produced a wide variety of results [5]. This suggests that a degree of idiosyncratic variability is simply inherent to the scientific process.
This context led to the development of analytical approaches like multiverse analysis (MVA). Instead of choosing just one analytical path, a multiverse analysis addresses the problem by running every possible combination of a set of justifiable choices to create a vast landscape of results. By inspecting this “multiverse” of outcomes, researchers can assess whether a hypothesised effect is stable and robust or whether it only appears under a narrow set of analytical conditions. This project implements the multiverse approach, building on both a previous crowdsourced analysis and a partial multiverse analysis of the same dataset, to provide novel insights into the results themselves and into the strengths, weaknesses, and potential of multiverse analysis in general.
What are we looking at?
- Across all possible models, how much of the variation in red cards can be explained by the included predictors (i.e., what is the marginal variance explained)?
- When we group the models by which predictor they contain, what is the average performance for each group?
- Focusing on the models that include player skin tone, how do the odds ratios and their statistical significance compare to the findings from previous crowdsourced and multiverse analyses of this dataset?
Key takeaways
- A consistently significant link was found between darker skin tone and receiving a red card after statistical issues like overfitting were corrected.
- However, the models that show this effect have very low explanatory power, meaning they don't do a good job of predicting the outcome in reality.
- This highlights that a finding can be statistically significant without being practically meaningful.
What is a Multiverse Analysis?
A multiverse analysis (MVA) is a computational strategy designed to address the problems of "researcher degrees of freedom" and the "garden of forking paths". Instead of hiding the countless analytical choices a researcher can make, MVA embraces them by running every possible combination of a set of justifiable choices. This creates a "multiverse" of results, allowing us to inspect patterns across the findings.
Previous Analyses of this Dataset
The dataset used in this project has been analyzed before:
- The original crowdsourced study showed that 29 different expert teams produced a wide range of results, highlighting the problem of idiosyncratic researcher choices.
- Previous multiverse analyses also explored this dataset but focused on different aspects, such as a more constrained set of variables or comparing different types of statistical models. This project builds on that work by conducting a more comprehensive covariate multiverse with a novel focus on model performance.
Original Data
The data comes from a sports statistics company and includes 146,028 unique interactions between 2,053 professional football players and 3,147 referees from the top European leagues in the 2012-2013 season. A crucial data cleaning step removed "contaminating" referees (those not part of the target season).
Codebook (Selected Variables)
Original Variables
- playerShort: Unique player ID
- club: Player's club
- birthday: Player's date of birth
- height/weight: Player's height and weight
- position: Player's on-field position
- games: Number of games in the player-referee dyad
- redCards: Number of red cards given by the referee to the player
- rater1/rater2: Skin tone ratings from two independent raters
- refCountry: Unique ID for the referee's country
- meanIAT: Mean implicit bias score for the referee's country
Modified Variables
- avrate: The average of the two skin tone ratings.
- age: Player's age as of a specific date during the season.
- position (collapsed): Player positions grouped into four main categories (Goalkeeper, Back, Middle, Front).
- redcard (dichotomous): Red card outcome converted to a binary (Yes/No) variable.
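The sketch below illustrates how these modifications could be coded; it is not the original preprocessing script. The reference date for age, the birthday format, and the mapping of detailed position labels onto the four groups are assumptions for illustration, and the cleaned data object is assumed to be called redcard (as in Snippet 2 below); only the column names come from the codebook.

# Hypothetical sketch of the variable modifications described above
library(tidyverse)   # dplyr/stringr for wrangling
library(lubridate)   # date handling

redcard <- redcard %>%
  mutate(
    # Average of the two independent skin tone ratings
    avrate = (rater1 + rater2) / 2,
    # Age at an assumed mid-season reference date (birthday format also assumed)
    age = as.numeric(as.Date("2013-01-01") - dmy(birthday)) / 365.25,
    # Collapse detailed positions into four broad groups (mapping assumed for illustration)
    position = case_when(
      position == "Goalkeeper"          ~ "Goalkeeper",
      str_detect(position, "Back|back") ~ "Back",
      str_detect(position, "Mid")       ~ "Middle",
      TRUE                              ~ "Front"
    ),
    # Dichotomise the red card count into a binary outcome, as used in the models below
    redCards = as.integer(redCards > 0)
  )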
Multiverse coding
The analysis was conducted in R (version 4.1.2), mainly utilising tidyverse (data wrangling) and lme4 (statistical analyses). Multilevel logistic regression modelling was selected to analyse the data, as it was one of the preferred approaches in the crowdsourced project and its properties were suitable for the intended model.
Snippet 1: Generating All Possible Models
To create the multiverse, we first defined a list of 10 potential covariates. The R code below then uses the expand.grid() function to generate every possible combination of these covariates, resulting in a master list of 1,023 unique model formulas to be tested.
# Create list of potential covariates (NA means the covariate is excluded)
covariates_list <- list(avrate = c(NA, 'avrate'),
                        position = c(NA, 'position'),
                        yellowCards = c(NA, 'yellowCards'),
                        height = c(NA, 'height'),
                        weight = c(NA, 'weight'),
                        club = c(NA, 'club'),
                        age = c(NA, 'age'),
                        meanIAT = c(NA, 'meanIAT'),
                        refCountry = c(NA, 'refCountry'),
                        victories = c(NA, 'victories'))

# Create a grid of all possible combinations (2^10 = 1,024 rows)
covariate_grid <- expand.grid(covariates_list, stringsAsFactors = FALSE)

# Remove the empty first row (no covariates), leaving 1,023 combinations
covariate_grid <- covariate_grid[-1, ]

# Collapse the non-missing covariates in each row into a fixed-effects formula string
covariate_grid$formula <- apply(covariate_grid, 1,
                                function(row) paste(na.omit(row), collapse = ' + '))
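As a quick sanity check (not part of the original snippet), the grid can be inspected before any models are run:

# Sanity checks on the generated grid
nrow(covariate_grid)            # expected: 1023 (2^10 combinations minus the empty one)
head(covariate_grid$formula, 3) # preview the first few fixed-effects formula strings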
Snippet 2: Running the Analysis
With the 1,023 unique model formulas generated, the next step was to run a multilevel logistic regression for each one. The R code below shows the for loop that iterates through every formula and fits the model using the glmer() function.
# Load lme4 for multilevel (mixed-effects) logistic regression
library(lme4)

# Starting the loop
for(i in 1:nrow(covariate_grid)) {
  # Reset the error flag at the start of each iteration
  skip_to_next <- FALSE
  # Each row of covariate_grid supplies the fixed-effects part of the formula
  output <- tryCatch(glmer(data = redcard,
                           formula = as.formula(paste('redCards ~',
                                                      covariate_grid[i, 'formula'],
                                                      '+ (1 | playerShort) + (1 | refNum)')),
                           family = binomial(link = "logit"),
                           control = glmerControl(optimizer = "bobyqa"),
                           nAGQ = 0),
                     error = function(e) {
                       # Flag the iteration to be skipped if the model fails
                       skip_to_next <<- TRUE
                     })
  if(skip_to_next) { next }
  # ... code to extract results (R2, ORs, etc.) would follow ...
}
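To make the elided extraction step concrete, here is a minimal sketch of how the marginal pseudo-R² and the odds ratios with 95% confidence intervals could be pulled from each fitted model. It assumes the MuMIn package and a pre-created list called results; both are assumptions for illustration, as the original extraction code is not shown.

# Hypothetical sketch of result extraction, run inside the loop after a successful fit
library(MuMIn)   # r.squaredGLMM(); would normally be loaded once at the top of the script

r2    <- r.squaredGLMM(output)            # matrix with columns R2m (marginal) and R2c (conditional)
betas <- fixef(output)                    # fixed-effect estimates on the log-odds scale
cis   <- confint(output, method = "Wald") # fast Wald confidence intervals

results[[i]] <- data.frame(model = i,
                           term  = names(betas),
                           OR    = exp(betas),                # odds ratios
                           lower = exp(cis[names(betas), 1]), # lower 95% bound
                           upper = exp(cis[names(betas), 2]), # upper 95% bound
                           R2m   = r2[1, "R2m"])              # marginal variance explained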
A Single Universe: Model 129
To make the process concrete, the code from Snippet 2 dynamically builds the regression formula. For model 129, this formula becomes:
redCards ~ avrate + meanIAT + (1 | playerShort) + (1 | refNum)
This tells the model to predict red cards from skin tone (`avrate`) and the referee's country's implicit bias (`meanIAT`), while accounting for the nested data structure. Below is the simplified fixed-effects output for this specific model.
| Term | Odds Ratio | Lower CI | Upper CI | p-value |
|---|---|---|---|---|
| avrate (Skin Tone) | 1.21 | 0.97 | 1.52 | 0.092 |
Note: Full table including the intercept and other parameters is omitted for clarity.
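For reference, the same universe can be fitted directly outside the loop; the following one-off call is a sketch equivalent to a single iteration of Snippet 2, assuming the data object redcard from the earlier snippets.

# One-off fit equivalent to "model 129" from the multiverse loop
m129 <- glmer(redCards ~ avrate + meanIAT + (1 | playerShort) + (1 | refNum),
              data = redcard,
              family = binomial(link = "logit"),
              control = glmerControl(optimizer = "bobyqa"),
              nAGQ = 0)

exp(fixef(m129))  # odds ratios for the intercept, avrate, and meanIAT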
Breaking Down the Function
- `family = binomial(link = "logit")`: This specifies a logistic regression, the correct model for a binary outcome (a player either received a red card or did not).
- `control = glmerControl(optimizer = "bobyqa")`: This instructs the model to use the `bobyqa` algorithm to find the best-fitting solution. This optimizer is well-suited for complex models because it finds the optimal result efficiently without requiring computationally intensive derivatives. [6]
- `nAGQ = 0`: This is a setting that manages the trade-off between speed and precision. For this project, a faster, less precise estimation method was chosen to save computational power, a practical necessity for running over 1,000 models. [6]
- `error = function(e)` and `if(skip_to_next)`: This is a crucial error handling mechanism. If a model fails to compute, this code "catches" the error and tells the loop to simply skip to the next model, preventing the entire analysis from crashing.
Figure 1: Full Multiverse Performance
This scatter plot displays the marginal variance explained (R²M) by all 1,023 models in the multiverse. The plot reveals two instances where the variance explained increases noticeably; these are caused by the inclusion of the players' club and referees' country covariates, which was interpreted as evidence of overfitting.
Figure 2: Multiverse Performance Without Overfitting
This plot shows the marginal variance explained (R²M) for the subset of 256 models after removing the two covariates suspected of causing overfitting. This visualization demonstrates that, with overfitting eliminated, the models explain a modest amount of variance (less than 6%).
Figure 3: Performance of Skin-Tone-Inclusive Models
This plot is a subset of the full multiverse, displaying the performance for the 512 models that specifically include averaged players' skin tone rating as a predictor.
Figure 4: Performance of Skin-Tone-Inclusive Models Without Overfitting
This plot displays the 128 models from Figure 3 after the two overfitting covariates have been removed, confirming that the remaining models have modest predictive power.
Figure 5: Average Model Performance by Covariate (Full)
This visualization uses violin plots to show the distribution of model performance grouped by each of the ten covariates. The plots for Country and Club are dramatically different, clustering at much higher values.
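A minimal sketch of how this kind of grouped comparison could be drawn is shown below. It assumes a data frame model_summaries with one row per model, a marginal R² column R2m, and one logical column per covariate flagging inclusion; these names are hypothetical, as the original plotting code is not shown.

# Hypothetical sketch of the grouped violin plots (Figure 5 / Figure 6 style)
library(tidyverse)

covariate_names <- names(covariates_list)   # the ten covariates from Snippet 1

model_summaries %>%
  pivot_longer(cols = all_of(covariate_names),
               names_to = "covariate", values_to = "included") %>%
  filter(included) %>%                      # keep only the models containing each covariate
  ggplot(aes(x = covariate, y = R2m)) +
  geom_violin() +
  labs(x = "Covariate included in the model",
       y = "Marginal variance explained (R²M)")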
Figure 6: Average Model Performance by Covariate (Without Overfitting)
This is the same visualization as Figure 5 but based only on the models where players' club and referees' country have been excluded. No single covariate dramatically outperforms the others.
Figure 7: Distribution of Skin Tone ORs (Full)
This density plot shows the bimodal (two-peaked) distribution of odds ratios (ORs) for the effect of skin tone from all 512 models that included it.
Figure 8: Skin Tone ORs with Confidence Intervals (Full)
This plot displays each of the 512 odds ratios as a point, ordered by size, with shaded areas representing their 95% confidence intervals. 46.5% of the estimates were statistically significant (blue).
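The sketch below shows how a plot of this kind could be built; it assumes a data frame skin_tone_results with one row per model and columns OR, lower, and upper for the skin tone odds ratio and its 95% confidence bounds (all hypothetical names, since the original plotting code is not shown).

# Hypothetical sketch of the ordered odds-ratio plot (Figure 8 / Figure 10 style)
library(tidyverse)

skin_tone_results %>%
  arrange(OR) %>%
  mutate(rank = row_number(),                     # order models by the size of the OR
         significant = lower > 1 | upper < 1) %>% # CI excluding 1 = statistically significant
  ggplot(aes(x = rank, y = OR)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.3) +
  geom_point(aes(colour = significant)) +
  geom_hline(yintercept = 1, linetype = "dashed") + # OR = 1 marks "no effect"
  labs(x = "Models ordered by odds ratio",
       y = "Odds ratio for skin tone (avrate)",
       colour = "95% CI excludes 1")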
Figure 9: Distribution of Skin Tone ORs (Without Overfitting)
This density plot shows the distribution for the 128 models remaining after removing overfitting. The distribution is much more compact and the median OR is higher at 1.33.
Figure 10: Skin Tone ORs with Confidence Intervals (Without Overfitting)
This plot shows the odds ratios for the 128 "cleaned" models. After eliminating overfitting, 92.2% of the estimates for the effect of skin tone are statistically significant (blue).
Statistical Significance vs. Practical Meaning
The multiverse analysis shows a consistently significant link between darker skin tone and red cards after correcting for overfitting. However, the very low explanatory power (marginal R²) of these models suggests that this effect, while statistically real, is practically weak. It's a small piece of a much larger puzzle that the models don't capture.
The core conclusion is that researchers and the public should be cautious. A statistically significant finding doesn't automatically mean it's a powerful or important real-world effect. Looking at model performance is just as crucial.
What This Analysis Doesn't Cover
- Computational Constraints: Due to a lack of sufficient computational power, not all possible covariates or their interactions (like Body Mass Index) could be included in the multiverse.
- Model Fit Statistics: Logistic regression models don't have a standard R² value. The "pseudo-R²" used is a helpful approximation but isn't a perfect measure of model fit.
MVA as an Evolutionary Tool
Multiverse Analysis is a powerful exploratory tool for checking the stability of findings, but it isn't a magic bullet. Its results are only as good as the initial choices made by the researcher. This highlights the need for the scientific community to establish "principles of good practice" to guide new users.
Ultimately, MVA represents an "evolution rather than revolution" in data analysis. Its focus on the distribution of effects, rather than a single p-value, encourages a more nuanced understanding of results and helps move the field beyond a simple "significant/not-significant" mindset.
References
1. Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. Department of Statistics, Columbia University.
2. Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2016). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. In A. E. Kazdin (Ed.), Methodological issues and strategies in clinical research (pp. 547-555). American Psychological Association.
3. Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2(3), 196-217.
4. Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251).
5. Silberzahn, R., Uhlmann, E. L., Martin, D. P., et al. (2018). Many analysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods and Practices in Psychological Science, 1(3), 337-356.
6. Bates, D., Maechler, M., Bolker, B., & Walker, S. (2025). lme4: Linear mixed-effects models using Eigen and S4. R package version 1.1-37. https://cran.r-project.org/web/packages/lme4/lme4.pdf