
The Power of dplyr in R - part 1



dplyr is one of the packages in the tidyverse, a collection of R packages that work together to help you get clean, tidy data. I started exploring it while learning data pre-processing and data aggregation. It turned out to be an efficient, fast, and easy-to-use tool, so many people, including me, use it very often. It helps with manipulating data frames: queries, sorting, summary statistics, joining tables, and more.

My math teacher used to say that when you are trying to solve a problem, the path you choose to reach the goal matters. It is up to us to choose the most efficient tool so that the whole process goes smoothly. This is why the dplyr package is worth learning! It not only lets you get your tasks done, it lets you do them quickly and easily. Pay attention to the type of data you pass to dplyr: it can be a tibble or a data.frame.

I will use the mtcars dataset, which ships with base R, to show how you can transform a dataset with a few simple commands. The data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). It is a data frame with 32 observations on 11 (numeric) variables:

[, 1] mpg Miles/(US) gallon 
[, 2] cyl Number of cylinders 
[, 3] disp Displacement (cu.in.) 
[, 4] hp Gross horsepower 
[, 5] drat Rear axle ratio 
[, 6] wt Weight (1000 lbs) 
[, 7] qsec 1/4 mile time 
[, 8] vs Engine (0 = V-shaped, 1 = straight) 
[, 9] am Transmission (0 = automatic, 1 = manual) 
[,10] gear Number of forward gears 
[,11] carb Number of carburetors 

Let's jump into R, load the dplyr library, display the first six rows of our dataset, and select some columns.

library(dplyr)
data("mtcars") 
head(mtcars) 
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1   

The head() function displays the first six rows by default, and I will use it several times in this article. dplyr also has the glimpse() function, which is useful when you are still getting to know the structure of the data you are dealing with.

glimpse()   display the structure of the data and its data types
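As a quick illustration of the point above, here is a minimal sketch that calls glimpse() on mtcars. The str() call from base R is added only for comparison and is not part of dplyr; the exact layout of the printed output depends on your package version, so it is not reproduced here.

```r
library(dplyr)

# glimpse() prints one line per column: its name, its type,
# and as many of the first values as fit on the screen
glimpse(mtcars)

# base R's str() gives a similar overview, shown here for comparison
str(mtcars)

# glimpse() returns its input invisibly, so the data is unchanged
cars_after <- glimpse(mtcars)
```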

Let me now introduce the select() function, which lets us pick the variables we want.
 
select()   select a subset of columns (pick variables)

head(select(.data=mtcars,cyl,vs))  #select columns cyl and vs
                   cyl vs
Mazda RX4           6  0
Mazda RX4 Wag       6  0
Datsun 710          4  1
Hornet 4 Drive      6  1
Hornet Sportabout   8  0
Valiant             6  1

Use the "-" operator to select all columns except a specific one, for example:
head(select(.data=mtcars,-mpg)) #select all columns except mpg
                  cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant             6  225 105 2.76 3.460 20.22  1  0    3    1
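The "-" operator is not limited to a single column. The sketch below, which goes slightly beyond the example above, drops several named columns at once and then drops a whole range; the object names no_mpg_cyl and no_range are only illustrative.

```r
library(dplyr)

# drop two named columns at once by wrapping them in c()
no_mpg_cyl <- select(mtcars, -c(mpg, cyl))
names(no_mpg_cyl)

# drop a whole range of columns; the parentheses around the range matter
no_range <- select(mtcars, -(drat:vs))
names(no_range)
```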

head(select(.data=mtcars,drat:vs)) #select range of columns by name
                 drat    wt  qsec  vs
Mazda RX4         3.90 2.620 16.46  0
Mazda RX4 Wag     3.90 2.875 17.02  0
Datsun 710        3.85 2.320 18.61  1
Hornet 4 Drive    3.08 3.215 19.44  1
Hornet Sportabout 3.15 3.440 17.02  0
Valiant           2.76 3.460 20.22  1

head(select(.data=mtcars,starts_with("c"))) #select columns whose names start with the string "c"
                  cyl carb
Mazda RX4           6    4
Mazda RX4 Wag       6    4
Datsun 710          4    1
Hornet 4 Drive      6    1
Hornet Sportabout   8    2
Valiant             6    1

Remember that you can select columns based on specific criteria with the help of helper functions such as:
starts_with()  name starts with a character string
ends_with()  name ends with a character string
contains()  name contains a character string
matches()  name matches a regular expression
one_of()  name is one of a given group of names (recent versions recommend any_of() or all_of() instead)
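Only starts_with() is demonstrated above, so here is a short sketch of the remaining helpers on mtcars. The search strings, the regular expression, and the object names are chosen purely for illustration.

```r
library(dplyr)

# columns whose names end with "p"
ends_p <- select(mtcars, ends_with("p"))
names(ends_p)        # "disp" "hp"

# columns whose names contain "ar"
has_ar <- select(mtcars, contains("ar"))
names(has_ar)        # "gear" "carb"

# columns whose names match a regular expression:
# here, names that are exactly two characters long
two_chars <- select(mtcars, matches("^.{2}$"))
names(two_chars)     # "hp" "wt" "vs" "am"

# columns whose names appear in a character vector
# (any_of() / all_of() are the recommended replacements in recent versions)
wanted <- c("mpg", "hp", "wt")
picked <- select(mtcars, one_of(wanted))
names(picked)        # "mpg" "hp" "wt"
```

Passing the names as a character vector, as in the last call, is handy when the list of columns is built elsewhere in your script.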
