
The Power of dplyr in R - part 3


Today I would like to present the pipe operator, which simplifies our code and makes it more readable. As we have seen, all of the dplyr functions take a data frame (or tibble) as the first argument. dplyr provides the %>% operator from magrittr, which chains functions so that x %>% f(y) turns into f(x, y). The result of one step is therefore "piped" into the next step. We will use the pipe operator in the examples that follow.
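
As a minimal sketch of the equivalence described above, the piped and the nested version below should give the same result. The names piped and nested are only illustrative, and select() is used here just to have a second step in the chain.

```r
library(dplyr)

# x %>% f(y) is rewritten as f(x, y), so a chain of pipes
# is the same as a set of nested function calls
piped  <- mtcars %>% select(mpg, cyl) %>% head(3)
nested <- head(select(mtcars, mpg, cyl), 3)

# Both versions return the same data frame
identical(piped, nested)
```

The piped version reads from left to right in the order the steps are applied, while the nested version has to be read from the inside out.
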
We will also focus on grouping, ordering and summarising functions. As before, I will continue using the mtcars dataset, which ships with base R.

count() # counts the unique values of one or more variables
n() # returns the number of observations in the current group
n_distinct() # returns the number of unique values found in a category
group_by() # groups by a column, which allows grouped operations in the "split-apply-combine" concept

library(dplyr)
data("mtcars")
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

mtcars %>% group_by(cyl) %>% count()
# A tibble: 3 x 2
# Groups:   cyl [3]
    cyl     n
  <dbl> <int>
1     4    11
2     6     7
3     8    14
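
n() and n_distinct() are usually called inside summarise(). The sketch below is one possible way to use them on mtcars; the column names cars and gears and the object name cars_per_cyl are my own choices for illustration.

```r
library(dplyr)

# count(cyl) is roughly equivalent to group_by(cyl) %>% summarise(n = n())
mtcars %>% count(cyl)

cars_per_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    cars  = n(),              # number of rows (cars) in each cyl group
    gears = n_distinct(gear)  # number of different gear values in each cyl group
  )
cars_per_cyl
```
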

ungroup() # removes the grouping
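
A small sketch of what ungroup() does, assuming we inspect the grouping with group_vars() from dplyr. The object names grouped and flat are only illustrative.

```r
library(dplyr)

grouped <- mtcars %>% group_by(cyl)
group_vars(grouped)   # "cyl"

flat <- grouped %>% ungroup()
group_vars(flat)      # character(0), no grouping left

# After ungroup() the whole data set is treated as one group again
flat %>% summarise(cars = n())
```
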
summarise() # aggregates based on groups and creates summary statistics for a given column (summarize() is a synonym); without group_by() the whole data frame is one group
mtcars %>% summarise(Avg.qsec = mean(qsec))
  Avg.qsec
1 17.84875
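
The example above summarises the whole data frame. To aggregate per group, group_by() can be placed before summarise(). This is only a sketch; the column names Cars, Min.qsec and Max.qsec and the object name qsec_by_cyl are my own additions.

```r
library(dplyr)

qsec_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    Cars     = n(),         # number of cars in the group
    Avg.qsec = mean(qsec),  # average quarter-mile time per group
    Min.qsec = min(qsec),
    Max.qsec = max(qsec)
  )
qsec_by_cyl
```

The result has one row per value of cyl instead of a single row for the whole data set.
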

arrange() # reorders the rows
mtcars %>% arrange(carb) # ascending order by default; only the first rows are shown
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1

mtcars %>% arrange(desc(cyl)) # desc() arranges the rows in descending order; only the first rows are shown
                    mpg cyl  disp  hp drat   wt  qsec vs am gear carb
Hornet Sportabout  18.7   8 360.0 175 3.15 3.44 17.02  0  0    3    2
Duster 360         14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4
Merc 450SE         16.4   8 275.8 180 3.07 4.07 17.40  0  0    3    3
Merc 450SL         17.3   8 275.8 180 3.07 3.73 17.60  0  0    3    3
Merc 450SLC        15.2   8 275.8 180 3.07 3.78 18.00  0  0    3    3
Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.25 17.98  0  0    3    4
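
arrange() also accepts several columns, and desc() can be applied to each of them separately. A short sketch, where the object name sorted is only illustrative:

```r
library(dplyr)

# Sort by cyl ascending, then by mpg descending within each cyl value
sorted <- mtcars %>% arrange(cyl, desc(mpg))
head(sorted, 3)
```

Later columns are only used to break ties in the earlier ones.
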
 
rename() # changes a column name, using the syntax new_name = old_name
mtcars %>% rename(cylinder = cyl) # renames column cyl; only a few rows are shown below
                     mpg cylinder  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0        6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0        6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8        4 108.0  93 3.85 2.320 18.61  1  1    4    1
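
Note that rename() returns a new data frame and leaves the original untouched, so the result has to be assigned if we want to keep it. A small sketch, in which the new names cylinder and horsepower and the object name renamed are just examples:

```r
library(dplyr)

# Several columns can be renamed in one call: new_name = old_name
renamed <- mtcars %>% rename(cylinder = cyl, horsepower = hp)
names(renamed)

# The original data frame still has the old column names
"cyl" %in% names(mtcars)
```
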

distinct() # finds unique values. Combine distinct() with select() to explore the unique values of a variable.
mtcars %>% select(cyl) %>% distinct()
                  cyl
Mazda RX4           6
Datsun 710          4
Hornet Sportabout   8

distinct() also helps to deal with duplicates, in other words observations that appear in a dataset multiple times, either across all variables or across a pre-defined subset of variables. Below is an example that looks for distinct combinations of selected columns. With .keep_all = TRUE the output keeps all variables, taking the first row of each distinct combination.

mtcars %>% distinct(cyl, gear, .keep_all = TRUE)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Toyota Corona     21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Porsche 914-2     26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Ford Pantera L    15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino      19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
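
For comparison, here is a sketch of the same call without .keep_all, which keeps only the selected columns. The object name combos is only illustrative.

```r
library(dplyr)

# Without .keep_all only the columns passed to distinct() are returned
combos <- mtcars %>% distinct(cyl, gear)
names(combos)
nrow(combos)                  # 8 distinct cyl/gear combinations

# Number of rows that repeat a combination already seen
nrow(mtcars) - nrow(combos)   # 24
```
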

cumsum() # returns a vector whose elements are the cumulative sums of the input
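
cumsum() comes from base R and works well inside mutate(). The sketch below is one possible use on mtcars; the column name cum.hp and the object name running are my own choices.

```r
library(dplyr)

cumsum(c(1, 2, 3, 4))   # 1 3 6 10

running <- mtcars %>%
  arrange(cyl) %>%
  group_by(cyl) %>%
  mutate(cum.hp = cumsum(hp)) %>%  # running total of hp, restarted for each cyl group
  ungroup() %>%
  select(cyl, hp, cum.hp)
head(running)
```

Because the data is grouped by cyl, the last value of cum.hp in each group equals the total hp of that group.
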
