Noah Brier | October 17, 2019

Why is this interesting? - The Replication Crisis Edition

On the Stanford Prison Experiment, the "Marshmallow Test", and the trouble in psychology

Today’s the 150th edition of WITI. As always, if you’ve got any feedback for us please just hit reply. Also, while I’ve got you, sorry for the early email yesterday, got a little jumpy on the send button. - Noah (NRB)

Noah here. Near the top of the list of most famous psychology experiments has to be Philip Zimbardo’s Stanford Prison Experiment (SPE). Performed in 1971, the study set up a fake prison in a basement at Stanford University and assigned some students to be guards and others to be prisoners. Each “prisoner” was searched, stripped, and deloused before being assigned a uniform, number, and stocking cap to simulate having their hair shaved off. Things quickly began to unravel. Within 36 hours the first student had to be released because they were suffering an “acute emotional disturbance” and the whole thing ended prematurely after the researchers found the guards escalating abuse and outsiders objected to the treatment of the “inmates” in the experiment.

It has since become a symbol of the terrible things humans do to one another. “People put in positions of authority, like prison guards, sometimes abuse that authority, and in startlingly cruel ways,” explained the New York Times review of the 2015 film about the experiment starring Billy Crudup. The only problem? The experiment is looking more and more like it wasn’t an experiment at all. Here’s a quick outline of all the issues from a new paper by Thibault Le Texier, who I found by way of his excellent interview on the Rationally Speaking podcast. Le Texier went through the SPE archive at Stanford, which was donated by Zimbardo in 2011. In doing so he found seven large issues:

(1) in designing the SPE, Zimbardo borrowed several key elements from a student experiment conducted 3 months before, (2) the guards knew what results Zimbardo wanted to achieve and how to achieve them, (3) the guards were asked to play a specific part but were not informed that they were subjects, (4) the prisoners could not leave of their own will and were subjected to harsh conditions designed by the experimenters, (5) the participants were almost never completely immersed in the unrealistic prison situation, (6) the collection and the reporting of the data were incomplete and biased, and (7) the conclusions of the SPE had been written in advance according to non-academic aims.

Why is this interesting?

This is part of a much broader story of a “replication crisis” in psychology: The challenge of being completely unable to reproduce the results from some of the discipline’s most famous experiments. Here’s how FiveThirtyEight described it a year ago:

The replication crisis arose from a series of events that began around 2011, the year that social scientists Uri Simonsohn, Leif Nelson and Joseph Simmons published a paper, “False-Positive Psychology,” that used then-standard methods to show that simply listening to the Beatles song “When I’m Sixty-Four” could make someone younger. It was an absurd finding, and that was the point. The paper highlighted the dangers of p-hacking — adjusting the parameters of an analysis until you get a statistically significant p-value (a difficult-to-understand number often misused to imply a finding couldn’t have happened by chance) — and other subtle or not-so-subtle ways that researchers could tip the scales to produce a favorable result.
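To make the p-hacking idea a bit more concrete, here is a minimal sketch (mine, not from the paper or from FiveThirtyEight) that simulates studies where there is no real effect at all. The "honest" researcher measures one outcome; the "hacked" version measures five unrelated outcomes and reports whichever one clears p < 0.05. The function names, the group size of 20, and the choice of a simple z-test with known variance are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, fmean


def p_value(group_a, group_b):
    """Two-sided z-test p-value for a difference in means.

    Assumes both groups have the same size and a known variance of 1,
    which holds here only because we generate the data ourselves.
    """
    n = len(group_a)
    z = (fmean(group_a) - fmean(group_b)) / (2 / n) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_outcomes, n_per_group=20, n_studies=2000,
                        alpha=0.05, seed=0):
    """Share of pure-noise studies that still report a 'significant' result.

    Each study measures n_outcomes unrelated outcomes and counts as a hit
    if ANY of them has p < alpha (the p-hacking move being illustrated).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        p_values = []
        for _ in range(n_outcomes):
            # Both groups come from the same distribution: no true effect.
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(p_value(a, b))
        if min(p_values) < alpha:
            hits += 1
    return hits / n_studies


honest = false_positive_rate(n_outcomes=1)
hacked = false_positive_rate(n_outcomes=5)
print(f"one outcome:   {honest:.1%} of null studies look significant")
print(f"five outcomes: {hacked:.1%} of null studies look significant")
```

In theory the single-outcome rate should hover around 5%, while picking the best of five independent outcomes should land near 1 - 0.95^5, or about 23%. Nothing about the underlying data changed; only the number of chances the researcher gave themselves.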

In addition to the SPE (which, in some ways, isn’t even a question of replication as much as whether it was ever a scientific study at all), the list of famous experiments that haven’t held up is full of well-known studies. According to Atlantic writer Ed Yong, “There’s social priming, where subliminal exposures can influence our behavior. And ego depletion, the idea that we have a limited supply of willpower that can be exhausted. And the facial-feedback hypothesis, which simply says that smiling makes us feel happier.” There’s also the “marshmallow test”, which asked children to wait 15 minutes to eat a marshmallow in order to get a second marshmallow. Those who were able to wait longer did better on standardized tests and had fewer behavioral problems. The problem was that the study was done at Stanford and most of the children had parents who were professors and, as a result, pretty well off. When researchers eventually controlled for those factors they found the study’s results didn’t hold up. In other words, kids who had a stable household were more likely to be able to delay gratification, and once you sort things that way there doesn’t seem to be much long-term difference between the kids who waited and those who didn't.

The challenge, of course, is that for lots of these studies the cultural impact has already been made. Speaking of just the marshmallow study:

Yet their findings have been interpreted to be a prescription by school districts and policy wonks. “If you’re a policy maker and you are not talking about core psychological traits like delayed gratification skills, then you’re just dancing around with proxy issues,” the New York Times’s David Brooks wrote in 2006. It’s not hard to find studies on interventions to increase delaying gratification in schools or examples of schools adopting these lessons into their curricula. Sesame Street’s Cookie Monster has even been used to teach the lesson.

As we know well, once that cultural imprint has been made it can be massively difficult to erase. In the meantime, it will be interesting to see what other dominoes fall (“growth mindset” and “grit” aren’t looking too hot) and it’s probably worth trying to use what statistician Andrew Gelman calls the “time-reversal heuristic”. “Imagine the two studies in reverse order: First a large and careful study that finds nothing of interest, then a small noisy replication whose authors fish around in the data and find an unexpected statistically significant result. The idea is to remove the ‘research incumbency effect’ and to consider each study on its own merits.” (NRB)

Product of the Day:

[Unfortunately this product appears to have been discontinued. Sorry!] Long haul flying is painful without noise cancelling headphones. But my new hack both on a plane and living in noisy New York are these Bose sleep pods. They fit comfortably in your ear like earplugs and play ambient sounds of your choosing via an app. If you turn up the volume it cuts out a lot of noise (ahem, crying babies on 15-hour flights), just as it does street noise on your commute. For those that need to find a pocket of calm in travel or on the commute, I really enjoy this product. (CJN)

Thanks for reading,

Noah (NRB) & Colin (CJN)

PS - Noah here. I’ve started a new company and we are looking for our first engineer and designer to join the team. If you are one of those or know anyone who is great, please share. Dinner’s on me at a restaurant of your choice if you help us find someone.


Why is this interesting? is a daily email from Noah Brier & Colin Nagy (and friends!) about interesting things. If you’ve enjoyed this edition, please consider forwarding it to a friend. If you’re reading it for the first time, consider subscribing (it’s free!).

© WITI Industries, LLC.