---
title: "The Art of Data Science"
description: This book is a great companion to your boring stats class.
date: 2015-09-09
categories:
  - Data Science
---

Data Science is hot. Fortunately, the recently released ebook [The Art of Data
Science (A Guide for Anyone Who Works with
Data)](https://leanpub.com/artofdatascience) doesn't waste space on trendy
technologies, and focuses instead on the enduring fundamentals of data
analysis. I can forgive the authors for choosing a topical title, however,
because few people are better qualified to capitalize on the trend than Johns
Hopkins professors (and expert data-handlers) [Roger
Peng](http://www.biostat.jhsph.edu/~rpeng/) and [Elizabeth
Matsui](https://twitter.com/eliza68).

No exercises accompany this book, and although there are a few snatches of R
code, it's not meant to be used as a handbook for data analysis. Instead, it
teaches readers how to think like a (productive) data analyst. Peng and Matsui
break the process of analyzing data into a list of core activities, which
begins with defining the question and ends with communicating the results.
Instead of presenting this as a linear process, they describe an "epicycle of
data analysis", a pattern of thinking and acting that is repeated in all of the
core activities. The authors explain that an analyst will often cycle through
this pattern several times during a single activity, and that the process often
sends scientists back to earlier steps. Readers learn that real-world data
analysis is like playing Chutes and Ladders on a board with no ladders.

![Less fun than doing data analysis](Cnl03.jpg)

I would recommend this book to any reader who's interested in this year's
sexiest career -- especially those annoyed that anybody would label a career
"sexy".
However, I think this book is most valuable as a companion to
introductory statistics classes, as it fills in key gaps left by traditional
statistics curricula. This is particularly true for students, such as those at
the beginning of a postgraduate education, who expect to analyze real data by
the time the course has finished. Stats teachers have to cover a lot of ground
in classes that begin with stories about ball-filled urns (some lucky students
get actual candy) and end with ANOVA tables. In the end, few future scientists
finish the term with much more than a list of specific tests that can be
matched to familiar situations. Although most students gain additional training
during their graduate student apprenticeship, an education in data analysis
competes for time with the acquisition of domain knowledge and more specific
skills. This often leads to a cargo-cult approach to data analysis among people
who are specifically trained to seek out truth. I believe that reading this
book will help counteract these shortcomings.

My favorite chapters discuss the process of building models of data, and I
really appreciate that the authors describe the difference between modeling for
inference (i.e., statistical modeling) and modeling for prediction (i.e.,
machine learning). As a grad student, I had completed my program's required
stats classes before I *really* understood the relationship between "running a
(statistical) test" and "fitting a model"; learning this lesson earlier would
have made me better at both. I also like that the book encourages analysts to
build visualizations of their data early and often. This indispensable process
has become much easier in recent years as new libraries in several programming
languages streamline the process of tidying and plotting complicated data.
The only section that I would like to see expanded is the chapter on
communication, which might have benefited from a brief discussion of building
visualizations for an audience. I suppose that this omission may have been
intentional; there are so many new developments in dataviz, such as a shift
toward dynamic and interactive visualizations, that it would be difficult to
write something that won't soon be out of date.

If you're new to data analysis, or interested in integrating a more formal (but
still flexible) framework into your practice, I think you'll find The Art of
Data Science to be worth a look. It is a quick read and available at a
pay-what-you-like price, so even poor grad students have no excuse not to read
it.