www.eamoncaddigan.net

Content and configuration for https://www.eamoncaddigan.net
git clone https://git.eamoncaddigan.net/www.eamoncaddigan.net.git

index.md (4331B)


---
title: "The Art of Data Science"
description: This book is a great companion to your boring stats class.
date: 2015-09-09
categories:
- Data Science
---

Data Science is hot. Fortunately, the recently released ebook [The Art of Data
Science (A Guide for Anyone Who Works with
Data)](https://leanpub.com/artofdatascience) doesn't waste space on trendy
technologies, and focuses instead on the enduring fundamentals of data
analysis. I can forgive the authors for choosing a topical title, however,
because few people are better qualified to capitalize on the trend than Johns
Hopkins professors (and expert data-handlers) [Roger
Peng](http://www.biostat.jhsph.edu/~rpeng/) and [Elizabeth
Matsui](https://twitter.com/eliza68).

No exercises accompany this book, and although there are a few snatches of R
code, it's not meant to be used as a handbook for data analysis. Instead, it
teaches readers how to think like a (productive) data analyst. Peng and Matsui
break the process of analyzing data into a list of core activities, which
begins with defining the question and ends with communicating the results.
Instead of presenting this as a linear process, they describe an "epicycle of
data analysis", a pattern of thinking and acting that is repeated in all of the
core activities. The authors explain that an analyst will often cycle through
this pattern several times during a single activity, and that the process often
sends scientists back to earlier steps. Readers learn that real-world data
analysis is like playing Chutes and Ladders on a board with no ladders.

![Less fun than doing data analysis](Cnl03.jpg)

I would recommend this book to any reader who's interested in this year's
sexiest career -- especially those annoyed that anybody would label a career
"sexy". However, I think this book is most valuable as a companion to
introductory statistics classes, as it fills in key gaps left by traditional
statistics curricula. This is particularly true for students, such as those at
the beginning of a postgraduate education, who expect to analyze real data by
the time the course has finished. Stats teachers have to cover a lot of ground
in classes that begin with stories about ball-filled urns (some lucky students
get actual candy) and end with ANOVA tables. In the end, few future scientists
finish the term with much more than a list of specific tests that can be
matched to familiar situations. Although most students gain additional training
during their graduate student apprenticeship, an education in data analysis
competes for time with the acquisition of domain knowledge and more specific
skills. This often leads to a cargo-cult approach to data analysis among people
who are specifically trained to seek out truth. I believe that reading this
book will help counteract these shortcomings.

My favorite chapters discuss the process of building models of data, and I
really appreciate that the authors describe the difference between modeling for
inference (i.e., statistical modeling) and modeling for prediction (i.e.,
machine learning). As a grad student, I had completed my program's required
stats classes before I *really* understood the relationship between "running a
(statistical) test" and "fitting a model"; learning this lesson earlier would
have made me better at both. I also like that the book encourages analysts to
build visualizations of their data early and often. This indispensable process
has become much easier in recent years as new libraries in several programming
languages streamline the process of tidying and plotting complicated data. The
only section that I would like to see expanded is the chapter on communication,
which might have benefited from a brief discussion of building visualizations
for an audience. I suppose that this omission may have been intentional; there
are so many new developments in dataviz, such as a shift toward dynamic and
interactive visualizations, that it would be difficult to write something that
won't soon be out of date.
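
The link between "running a test" and "fitting a model" mentioned above can be
made concrete with a small sketch. It is not taken from the book: the data are
made up, and the function and variable names are hypothetical. It assumes the
classic equal-variance (Student's) two-sample t-test, and compares its
statistic with the slope t statistic from a least-squares line fitted to the
same data with group membership coded as 0 or 1.

```python
import math
from statistics import mean


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic (equal variances assumed)."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (mb - ma) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def slope_t_statistic(x, y):
    """t statistic for the slope of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # slope = difference in group means
    b0 = my - b1 * mx                   # intercept = mean of the 0-coded group
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(rss / (n - 2) / sxx)
    return b1 / se_b1


# Made-up measurements for two groups (illustrative only).
control = [4.1, 5.0, 4.7, 5.3, 4.4]
treated = [5.9, 6.2, 5.1, 6.8, 6.0]

# Code group membership as a 0/1 predictor and stack the outcomes.
group = [0] * len(control) + [1] * len(treated)
outcome = control + treated

t_test = pooled_t_statistic(control, treated)
t_model = slope_t_statistic(group, outcome)
print(t_test, t_model, math.isclose(t_test, t_model))
```

Under these assumptions the two statistics agree up to floating-point
rounding: the "test" is the same computation as asking whether the slope of
the fitted model differs from zero.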

If you're new to data analysis, or interested in integrating a more formal (but
still flexible) framework into your practice, I think you'll find The Art of
Data Science to be worth a look. It is a quick read and available at a
pay-what-you-like price, so even poor grad students have no excuse not to read
it.