The Power of dplyr in R

The Power of dplyr in R - part 1

dplyr is one of the packages in the Tidyverse, a collection of R packages that work together to produce clean and tidy data. I started exploring it while learning data pre-processing and data aggregation. It turns out to be an efficient, fast and easy-to-use tool, so a lot of people, including me, use it very often. It helps with manipulating data frames: queries, sorting, summary statistics, joining tables and more.

My maths teacher used to say that when you are trying to solve a problem, it matters which way you choose to reach the goal. It is up to us to choose the most efficient tool so that the whole process goes smoothly. This is why the dplyr package is worth learning! It lets you not only do your tasks, but do them in quite an easy and fast way. Pay attention to the kind of data you pass to dplyr: it can be a tibble or a data.frame.

I will use the mtcars dataset, which is included with base R, to show how you can transform a dataset using simple commands. The data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). It is a data frame with 32 observations on 11 (numeric) variables, where:

[, 1] mpg Miles/(US) gallon

[, 2] cyl Number of cylinders

[, 3] disp Displacement (cu.in.)

[, 4] hp Gross horsepower

[, 5] drat Rear axle ratio

[, 6] wt Weight (1000 lbs)

[, 7] qsec 1/4 mile time

[, 8] vs Engine (0 = V-shaped, 1 = straight)

[, 9] am Transmission (0 = automatic, 1 = manual)

[,10] gear Number of forward gears

[,11] carb Number of carburetors

Let's jump into R, load the dplyr library, display the first six rows of our dataset and select some columns.

library(dplyr)

data("mtcars")

head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
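The earlier remark about tibble versus data.frame can be checked directly. The lines below are a small sketch, assuming a reasonably recent dplyr (which re-exports as_tibble() from the tibble package); the object name cars_tbl and the column name "model" are my own illustrative choices, not part of mtcars.

```r
library(dplyr)

# mtcars ships with base R as a plain data.frame
class(mtcars)

# as_tibble() drops row names by default, so the car models are kept
# here in a new column with the illustrative name "model"
cars_tbl <- as_tibble(mtcars, rownames = "model")
class(cars_tbl)

# select() accepts both kinds of input and returns the kind it was given
class(select(mtcars, cyl, vs))
class(select(cars_tbl, model, cyl, vs))
```

The practical difference to watch for is the row names: a tibble does not keep them, so if you need the car models later, store them in a column as above.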

The head() function displays the first six rows, and I will use it several times in this article. dplyr also provides the glimpse() function, which is useful when you are still getting to know the structure of the data you are dealing with.

glimpse() displays the structure of the data, including data types
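As a quick sketch of how glimpse() can be used on our dataset (I leave the printed output out here, since its exact layout depends on the width of your console):

```r
library(dplyr)

# Dimensions first: 32 rows (cars) and 11 columns (variables)
dim(mtcars)

# glimpse() prints one line per column: its name, its type
# and as many of the first values as fit on the line
glimpse(mtcars)
```

Compared with head(), the table is turned on its side: columns run down the page, so wide datasets stay readable.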

Let me now introduce the select() function, which lets us pick the variables we want.

select() selects a subset of columns (picks variables)

head(select(.data=mtcars,cyl,vs)) #select columns cyl and vs

                  cyl vs
Mazda RX4           6  0
Mazda RX4 Wag       6  0
Datsun 710          4  1
Hornet 4 Drive      6  1
Hornet Sportabout   8  0
Valiant             6  1
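select() can also reorder and rename columns while picking them. This is a short sketch of that behaviour; the new names cylinders and engine are hypothetical labels I made up for the example.

```r
library(dplyr)

# Columns come back in the order they are listed,
# not in their original order in the data frame
head(select(mtcars, vs, cyl))

# new_name = old_name renames a column while selecting it;
# "cylinders" and "engine" are illustrative names only
head(select(mtcars, cylinders = cyl, engine = vs))
```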

Use the "-" operator to select all columns except a specific one, for example:

head(select(.data=mtcars,-mpg)) #select all columns except mpg

                  cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant             6  225 105 2.76 3.460 20.22  1  0    3    1
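The "-" operator is not limited to one column. As a small sketch, you can negate a vector of names or a whole range of columns:

```r
library(dplyr)

# Drop two columns at once by negating a vector of names
head(select(mtcars, -c(mpg, cyl)))

# Drop every column from disp to qsec (inclusive)
head(select(mtcars, -(disp:qsec)))
```

The first call keeps the nine columns from disp to carb; the second keeps mpg, cyl, vs, am, gear and carb.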

head(select(.data=mtcars,drat:vs)) #select range of columns by name

                  drat    wt  qsec vs
Mazda RX4         3.90 2.620 16.46  0
Mazda RX4 Wag     3.90 2.875 17.02  0
Datsun 710        3.85 2.320 18.61  1
Hornet 4 Drive    3.08 3.215 19.44  1
Hornet Sportabout 3.15 3.440 17.02  0
Valiant           2.76 3.460 20.22  1
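A range can also be given by column position instead of by name. In mtcars, drat to vs are columns 5 to 8 (see the variable list at the top), so the sketch below should pick the same four columns both ways; by_name and by_position are just illustrative object names.

```r
library(dplyr)

by_name     <- select(mtcars, drat:vs)
by_position <- select(mtcars, 5:8)   # drat, wt, qsec and vs are columns 5 to 8

# Compare the column names picked by each approach
identical(names(by_name), names(by_position))
```

Selecting by name is usually the safer habit, because positions change silently when columns are added or removed.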

head(select(.data=mtcars,starts_with("c"))) #select columns whose names start with the character string "c"

                  cyl carb
Mazda RX4           6    4
Mazda RX4 Wag       6    4
Datsun 710          4    1
Hornet 4 Drive      6    1
Hornet Sportabout   8    2
Valiant             6    1

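Remember that you can select columns based on specific criteria with the help of dplyr helper functions such as:

starts_with() selects columns whose names start with a given character string

ends_with() selects columns whose names end with a given character string

contains() selects columns whose names contain a given character string

matches() selects columns whose names match a regular expression

one_of() selects columns whose names are in a given group of names (superseded by any_of() and all_of() in current releases)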
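To see the helpers listed above in action, here is a minimal sketch on mtcars. I only print the column names to keep it short. The last line uses any_of(), which is the newer counterpart of one_of() and assumes dplyr 1.0.0 or later; "not_a_column" is a made-up name included to show that missing names are simply skipped.

```r
library(dplyr)

names(select(mtcars, ends_with("t")))      # drat, wt
names(select(mtcars, contains("ar")))      # gear, carb
names(select(mtcars, matches("^[dq]")))    # disp, drat, qsec

# one_of() takes the wanted names as character strings
names(select(mtcars, one_of(c("mpg", "hp"))))   # mpg, hp

# any_of() does the same job and ignores names that do not exist
names(select(mtcars, any_of(c("mpg", "hp", "not_a_column"))))   # mpg, hp
```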

My journey through the data science - by Karolina M'Goma
