**A Brief Review of the Book, R for Data Science**

**Dale Bowman and Lih-Yuan Deng***

*Department of Mathematical Sciences, University of Memphis, USA*

**Submission:** February 21, 2018; **Published:** May 25, 2018

**\*Corresponding author:** Lih-Yuan Deng, Department of Mathematical Sciences, University of Memphis, Memphis, Tennessee, USA, Tel: 901-678-3134; Fax: 901-678-2480; Email: [email protected]

**How to cite this article:** Dale B, Lih-Yuan D. A Brief Review of the Book, R for Data Science. Biostat Biometrics Open Acc J. 2018; 7(1): 555704.
DOI: 10.19080/BBOAJ.2018.07.555704

**Abstract**

This article reviews a recent textbook that introduces a data scientist to the *tidyverse*, an environment in R that provides useful packages for importing data, *tidying* data, transforming data, visualization, modeling and communication of results. All of these tools are needed for managing a data science project.

**Keywords:** Tidyverse; Tidy data principles; *pipe*; *tibble*; ggplot2; Communicate results; Data science; Exploration; Important packages; Environment; Data handling easier; Visualization; Data manipulation; Existing variables; Outliers; Data analysis; Graphical techniques; Traditional data frame; Operating system; Strings; Writing functions; Loops; R’s modeling functions

**Introduction**

*R for Data Science: Import, Tidy, Transform, Visualize, and Model Data* is a recent text written by Hadley Wickham, developer of several widely used R packages including *ggplot2*, and Garrett Grolemund. The text was published by O’Reilly in 2016 and is available online at [1], with source code for the examples at [2]. It is divided into five parts illustrating the four-step model of the tools needed for a data science project:

I. Import, tidy, and transform the data,

II. Visualize and explore the data,

III. Model the data and

IV. Communicate results.

The book is not organized in the order in which a data science project would be completed, but rather in the order that the authors believe is the best way to learn the individual components. The first part of the book covers the basic tools needed for exploration of data, including *ggplot2* for data visualization, data transformations with *dplyr*, and basic concepts of exploratory data analysis. The second part of the text covers importing the data, *tidying* the data, and transforming the data as needed. This process is termed wrangling, and the authors give a nice introduction to the problems that can arise in this stage, which is the initial one in the data science process. In the third part, the authors take the reader through some of the basic programming concepts needed to successfully complete the four-step model. The basics of modeling data for exploration are described in the fourth part of the text, with an emphasis on linear models. The final part of the book focuses on communicating results to decision makers and collaborators using R Markdown.

**Discussion**

The textbook is designed to provide the reader with a solid foundation in some important packages in the *tidyverse* environment. Tidy data principles provide powerful tools for making data handling easier and more efficient. Tidy data has been formatted so that

a) Each column contains a single variable,

b) Each row contains a single observation, and

c) Each value has its own cell.

The basic *tidyverse* packages that are useful for working with data include *ggplot2* for visualization, *dplyr* for data manipulation, *tidyr* for data *tidying*, *readr* for fast data import, *purrr* for functional programming, and *tibble* for a modern data frame.
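To make the three tidy data conditions concrete, the following is a minimal sketch using *tidyr*. The table, its column names, and its values are hypothetical and chosen only for illustration. It assumes tidyr 1.0.0 or later for `pivot_longer()`; the book itself uses the older `gather()` function for the same task.

```r
library(tibble)
library(tidyr)

# Hypothetical untidy table: the "year" variable is spread across
# two column names, so a row holds two observations, not one.
untidy <- tibble(
  country = c("A", "B"),
  `1999`  = c(10, 30),
  `2000`  = c(20, 40)
)

# Move the year columns into a single "year" variable so that each
# row is one (country, year) observation and each value has one cell.
tidy <- pivot_longer(
  untidy,
  cols      = -country,
  names_to  = "year",
  values_to = "cases"
)

print(tidy)
```

After reshaping, the table has one column per variable (`country`, `year`, `cases`) and one row per observation.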

The first part of the textbook focuses on exploration of an already wrangled data set. The graphics package in the *tidyverse*, *ggplot2*, is investigated in some detail through applications to several data sets. The key functions of the data manipulation package, *dplyr*, allow the user to pick observations by value, reorder observations, pick variables by their name, create new variables as functions of existing variables, and summarize data by certain attributes. The final major component of this first part of the book takes the reader through the steps of exploratory data analysis, including examining variation, outliers, relationships between variables, and patterns in data distributions, primarily using graphical techniques. The idea of the *pipe* (borrowed from the Unix shell), one of the key components for working in the *tidyverse*, is introduced.
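The five *dplyr* operations described above, chained with the *pipe*, might look like the following sketch. It uses the `mtcars` data set that ships with base R rather than any data set from the book, and the derived column name `wt_kg` is an assumption made for illustration.

```r
library(dplyr)

result <- mtcars %>%
  filter(cyl %in% c(4, 8)) %>%       # pick observations by value
  arrange(desc(mpg)) %>%             # reorder observations
  select(mpg, cyl, wt) %>%           # pick variables by name
  mutate(wt_kg = wt * 453.6) %>%     # new variable; wt is in 1000 lbs
  group_by(cyl) %>%                  # summarize by an attribute
  summarise(
    mean_mpg   = mean(mpg),
    mean_wt_kg = mean(wt_kg),
    n          = n()
  )

print(result)
```

Each step receives the output of the previous one as its first argument, so the code reads top to bottom in the order the operations are applied.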

The second part of the book, on data wrangling, introduces the concept of a *tibble*, how to create one, and how it differs from a traditional data frame. From there the *readr* package is discussed for importing rectangular data into R. The advantages of the *readr* functions over their traditional counterparts (read.csv(), for example) include speed, independence from the operating system, and the creation of *tibble*s instead of data frames. Following the import process, this part of the text moves on to *tidying* the data using functions in the *tidyr* package. Since most of the data a scientist will encounter will not be tidy, the text works through many different data structures that might occur and provides functions for *tidying* the data. Principles of how to work with relational data tables are covered using functions in *dplyr*. Techniques for working with strings, factors, and date/time variables are also discussed [3].
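As a small illustration of the difference between the two import routes, the sketch below writes a hypothetical three-column CSV file to a temporary location and reads it back with both base R and *readr*. The file contents are invented for the example.

```r
library(readr)
library(tibble)

# Write a small, made-up CSV file to a temporary path.
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,ann,9.5", "2,bob,7.0"), path)

base_df <- read.csv(path)   # base R: returns a data.frame
tidy_df <- read_csv(path)   # readr: returns a tibble

# One behavioural difference: single-column subsetting with `[`.
base_col <- base_df[, "name"]   # drops to a plain vector
tidy_col <- tidy_df[, "name"]   # stays a one-column tibble

print(is_tibble(base_df))
print(is_tibble(tidy_df))
```

The subsetting difference is one example of the more predictable behaviour of a *tibble*: `[` always returns another *tibble*, while a data frame may silently return a vector.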

The third part of the text includes more on using the *pipe* and other good programming practices, including writing functions for code that will appear in several places in a program, the use of vectors in R as the component variables in a *tibble*, working with lists in R, and iteration using for and while loops and the functions in *purrr*. The material is covered thoroughly and is accessible to those without a programming background.
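The relationship between an explicit loop and a *purrr* function can be sketched as follows. The example computes column means of the built-in `mtcars` data set twice; the variable names are chosen for illustration only.

```r
library(purrr)

# Explicit for loop: preallocate the output, then fill it in.
means_loop <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  means_loop[[i]] <- mean(mtcars[[i]])
}

# The same iteration expressed with purrr: apply mean() to each
# column and return a double vector (named by column).
means_map <- map_dbl(mtcars, mean)

print(means_map)
```

Both versions produce the same numbers; the `map_dbl()` form removes the bookkeeping and guarantees the type of the result.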

Modeling is introduced in the fourth part of the book, with a focus on the linear model as the next step in the data science project. Emphasis is on modeling for hypothesis generation rather than confirmatory analysis. The *modelr* package is used in conjunction with base R’s modeling functions so that they work with the *pipe*. The modeling process is first introduced on some simple simulated data to illustrate the components. Later, models are built using real data. Methods of handling many models using the *purrr* and *broom* packages are discussed. Although the treatment is not comprehensive, many valuable references on the subject are provided.
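A minimal sketch of how *modelr* lets a base R model fit into a *pipe* is shown below. The simulated data, the seed, and the coefficients are assumptions made for this example and are not taken from the book.

```r
library(dplyr)
library(modelr)

# Simulate a simple linear relationship with noise (made-up values).
set.seed(1)
sim <- tibble::tibble(
  x = 1:30,
  y = 2 + 0.5 * x + rnorm(30)
)

# Fit with base R's lm(), then use modelr helpers inside a pipe to
# attach fitted values ("pred") and residuals ("resid") as columns.
fit <- lm(y ~ x, data = sim)

sim_aug <- sim %>%
  add_predictions(fit) %>%
  add_residuals(fit)

print(head(sim_aug))
```

Because the predictions and residuals come back as ordinary columns, they can be passed straight to *ggplot2* or *dplyr* for the exploratory checks the book emphasizes.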

The final part of the book focuses on communicating results. R Markdown is introduced to integrate code, graphs, and text into reports for decision makers and for fellow analysts. The exploratory graphics discussed in the first part are transformed into expository graphics to help decision makers better understand the data. The reader can also learn how to produce other output, such as websites, presentations, and dashboards, using R Markdown [4].

**Conclusion**

In summary, R for Data Science is a very useful text for introducing data scientists to the principles required to work in the *tidyverse*. The advantages of *tibble*s over data frames, *readr* over base R input functions, the *pipe* over traditional programming, and *tidyr* and *dplyr* for data manipulation and *tidying* are clearly specified. The use of *ggplot2* to produce high quality graphics is well covered. The authors do not presume any previous programming experience in R or extensive statistical knowledge, which makes this book suitable for introducing the *tidyverse* principles to most data scientists. However, the book does not overreach and indeed, as pointed out in the preface, it is not intended to cover all topics in data science. In particular, none of the data sets used throughout the text are big data sets and all are rectangular. The book is RStudio-specific, although RStudio is not a requirement for working in the *tidyverse*. Overall, it is a well-written, gentle introduction to the *tidyverse* for programmers and non-programmers alike.