---
title: "Reproducible research fail"
description: One of the mistakes I made in data collection.
date: 2015-10-01
categories:
- Data Science
- Science
tags:
- Psychology
---

In most of the psychology subdisciplines under the umbrella of “cognitive psychology” (e.g., language, memory, perception), researchers use programs to collect data from participants (cf. social psychology, which often relies on surveys instead). These are usually simple programs that display words or pictures and record responses; if you’ve ever taken an introductory psychology course, you were surely made to sit through a few of them. Although there are a few tools that allow psychologists to create experiments without writing their own code, most of us (at least in the departments with which I’ve been affiliated) program our own studies.

The majority of psych grad students start their Ph.D.s with little programming experience, so it’s not surprising that coding errors sometimes affect research. As a first-year grad student who’d previously worked as a programmer, I was determined to do better. Naturally, I made a ton of mistakes, and I want to talk about one of them: a five-year-old mistake I’m dealing with *today*.

Like many mistakes, I made this one while trying to avoid another. I had noticed that experiments commonly failed to record details about trials that later turned out to be important. For instance, a researcher could run a [visual search](http://www.scholarpedia.org/article/Visual_search) experiment and not save the locations and identities of the randomly selected “distractors”, but later be unable to check for an effect of [crowding](http://www.scholarpedia.org/article/Visual_search#Spatial_layout.2C_density.2C_crowding). It was fairly common to fail to record response times while looking for an effect on task accuracy, and then be unable to show that the observed effect wasn’t due to a speed-accuracy tradeoff.

I decided that I’d definitely record everything. This wasn’t itself a mistake.

Since I programmed my experiments in Python using an object-oriented design -- all of the data necessary to display a trial was encapsulated in instances of a Trial class -- I decided that the best way to save *everything* was to serialize these objects using Python’s pickle module. This way, if I added members to Trial, I didn’t have to remember to explicitly include them in the experiment’s output. I also smugly knew that I didn’t have to worry about rounding errors, since everything was stored at machine precision (because *that* matters).

That’s not quite where I went wrong.

The big mistake was using this approach while failing to follow best practices for reproducible research. It’s now incredibly difficult to unpickle the data from my studies because the half-dozen modules necessary to run my code have all been updated since I wrote these programs. I didn’t even record the version numbers of anything. I’ve had to write a bunch of hacks and manually install a few old modules just to get my data back.

Today it’s a lot easier to do things the right way. If you’re programming in Python, you can use the [Anaconda distribution](https://www.continuum.io/why-anaconda) to create environments that keep their own copies of your code’s dependencies. These won’t get updated with the rest of the system, so you should be able to go back and run things later. A language-agnostic approach could use [Docker images](https://www.docker.com/), or go a step further and run each experiment in its own virtual machine (although care should be taken to ensure adequate system performance).
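Pinning an environment helps, but even the cheapest insurance -- writing down which versions you used -- would have saved me here. Below is a minimal sketch (not my original code; the dependency list and output file name are made up) of recording the interpreter and package versions next to the experiment’s data:

```python
import importlib
import json
import platform
import sys

# Hypothetical dependency list; use whatever the experiment imports.
DEPENDENCIES = ["numpy", "pygame"]

def snapshot_environment(path="environment.json"):
    """Record interpreter and dependency versions to a JSON file."""
    info = {
        "python": sys.version,
        "platform": platform.platform(),
    }
    for name in DEPENDENCIES:
        try:
            module = importlib.import_module(name)
            info[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            info[name] = "not installed"
    with open(path, "w") as f:
        json.dump(info, f, indent=2)

if __name__ == "__main__":
    snapshot_environment()
```

A file like this won’t rebuild the environment for you the way conda or Docker can, but it turns “manually install a few old modules” from guesswork into a lookup.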
I do feel like I took things too far by pickling my Python objects. Even if I had used Anaconda, I’d have committed myself either to performing all my analyses in Python or to an intermediate step: writing a script to export my output (giving myself another chance to introduce a coding error). Using a generic output format (e.g., a simple CSV file) affords more flexibility in choosing analysis tools, and it also better supports data sharing.

I still think it’s important to record “everything”, but there are better ways to do it. An approach I began to use later was to write separate programs for generating trials and displaying them (a minimal sketch of this setup appears at the end of this post). The first program handles counterbalancing and all the logic supporting randomness; it then creates a CSV for each participant. The second program simply reads these CSVs and dutifully displays trials based *only* on the information they contain, ensuring that no aspect of a trial (e.g., the color of a distractor item) can be forgotten.

The display program records responses from participants and combines them with the trial info to create simple output files for analysis. To further protect against data loss, it also records, with timestamps, a log of every event that occurs during the experiment: the experiment start time, keypresses and other input events, changes to the display, and anything else that happens. Between the input CSVs and this log file, it’s possible to recreate exactly what happened during the course of the study -- even if the information wasn’t in the “simple” output files. I make sure that output is written to disk frequently so that nothing is lost in case of a system crash. This approach also makes it easy to restart at a specific point, which is useful for long studies and for projects using fMRI (the scanner makes it easy to have false starts).

My position at the FAA doesn’t involve programming many studies. We do most of our work on fairly complicated simulator configurations (I have yet to run a study that didn’t include a diagram describing the servers involved), and there are plenty of good programmers around who are here specifically to keep these running. I hope this lesson is useful for anybody else who collects data from people, whether in the context of a psychology study or user testing.
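As promised above, here’s a minimal sketch of the generate-then-display split. Everything in it is hypothetical -- the field names, the stimuli, the file layout -- and a real experiment would do proper counterbalancing and use an actual display library, but the shape is the same: one function writes fully specified trials to a per-participant CSV, and another reads them back and logs every event with a timestamp, flushing to disk immediately.

```python
import csv
import random
import time

# Hypothetical trial structure; a real study would have many more fields.
FIELDS = ["trial", "target_word", "distractor_word", "distractor_color"]

def generate_trials(participant_id, n_trials=10):
    """Write a fully specified trial list for one participant to a CSV."""
    path = "trials_%s.csv" % participant_id
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for i in range(n_trials):
            writer.writerow({
                "trial": i,
                "target_word": random.choice(["cat", "dog", "bird"]),
                "distractor_word": random.choice(["pen", "cup", "box"]),
                "distractor_color": random.choice(["red", "green", "blue"]),
            })
    return path

def run_trials(path, log_path="events.log"):
    """Display trials using *only* what the CSV contains, logging every
    event with a timestamp and flushing to disk immediately."""
    with open(path, newline="") as f, open(log_path, "a") as log:
        def log_event(message):
            log.write("%.3f\t%s\n" % (time.time(), message))
            log.flush()  # survive a crash mid-session

        log_event("experiment start")
        for row in csv.DictReader(f):
            # A real display program would draw the stimuli and record
            # responses here; this stand-in just logs each trial.
            log_event("show trial %(trial)s: %(target_word)s vs. "
                      "%(distractor_word)s (%(distractor_color)s)" % row)

if __name__ == "__main__":
    run_trials(generate_trials("p001"))
```

Because the display step consumes only the CSV, auditing a session or re-running an analysis never depends on re-executing the randomization logic.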