The Power of dplyr in R

The Power of dplyr in R - part 3

Today I would like to present the pipe operator, which simplifies our code and makes it more readable. As we have seen, all of the dplyr functions take a data frame (or tibble) as the first argument. dplyr provides the %>% operator from magrittr, which chains functions so that x %>% f(y) turns into f(x, y). The result of one step is therefore “piped” into the next step. We will use the pipe operator in the examples that follow.
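
To make the x %>% f(y) rule concrete, here is a minimal sketch that writes the same two steps once as a nested call and once as a pipeline. The object names nested and piped are made up for this illustration.

```r
library(dplyr)

# Nested call: has to be read from the inside out
nested <- head(select(mtcars, mpg, cyl), 3)

# Piped call: reads from left to right, one step per line
piped <- mtcars %>%
  select(mpg, cyl) %>%
  head(3)

# Both versions run exactly the same function calls
identical(nested, piped)
```

The pipeline does not change what is computed, only the order in which the steps are written down.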

Additionally, we will focus on grouping, ordering and summarising functions. As before, I will continue using the mtcars dataset, which is included with R.

count() # counts the unique values of one or more variables

n() # returns the number of rows in the current group

n_distinct() # returns the number of unique values found in a category

group_by() # groups the data by one or more columns, so that later operations follow the “split-apply-combine” concept

library(dplyr)

data("mtcars")

head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

mtcars %>% group_by(cyl) %>% count()

# A tibble: 3 x 2
# Groups:   cyl [3]
    cyl     n
1     4    11
2     6     7
3     8    14
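
As a side note, count() can also take the variables directly, without a separate group_by() step. The sketch below shows this for one and for two variables; the names by_cyl and by_cyl_gear are made up for this illustration. As far as I know, the result is then not grouped, unlike the group_by() version above.

```r
library(dplyr)

# count() accepts the variable to count by directly
by_cyl <- mtcars %>% count(cyl)

# Several variables can be counted at once:
# one row per combination of cyl and gear that occurs in the data
by_cyl_gear <- mtcars %>% count(cyl, gear)

by_cyl
by_cyl_gear
```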

ungroup() # removes the grouping
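
Why ungroup() matters is easiest to see with a function that behaves differently on grouped data. The following is a small sketch, assuming we want to attach the average mpg to every row with mutate(); the names grouped, per_group, overall and avg_mpg are made up for this illustration.

```r
library(dplyr)

grouped <- mtcars %>% group_by(cyl)

# While the data is grouped, mean(mpg) is computed separately for each cyl value
per_group <- grouped %>% mutate(avg_mpg = mean(mpg))

# After ungroup(), mean(mpg) is computed over the whole table
overall <- grouped %>%
  ungroup() %>%
  mutate(avg_mpg = mean(mpg))

n_distinct(per_group$avg_mpg)  # one average per cyl group
n_distinct(overall$avg_mpg)    # a single average shared by all rows
```

Forgetting to ungroup is a common source of surprising results, so it is a good habit to call ungroup() once the grouped step is done.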

summarize() # aggregates the data (per group, if the data is grouped) and creates summary statistics for a given column; summarise() is an equivalent spelling

mtcars %>% summarise(Avg.qsec = mean(qsec))

  Avg.qsec
1 17.84875
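
The example above summarises the whole table. To show the "based on groups" part, here is a sketch that combines group_by() with summarise() and also uses n() and n_distinct() from the list at the top. The column names n_cars, n_gears and avg_qsec are my own choice for this illustration.

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    n_cars   = n(),               # number of rows in each group
    n_gears  = n_distinct(gear),  # number of distinct gear values in each group
    avg_qsec = mean(qsec)         # mean of qsec in each group
  )

by_cyl
```

The result has one row per value of cyl, so a table of 32 cars is reduced to three rows.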

arrange() # reorders the rows

mtcars %>% arrange(carb)

                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1

mtcars %>% arrange(desc(cyl)) # desc() arranges the rows in descending order

                    mpg cyl  disp  hp drat   wt  qsec vs am gear carb
Hornet Sportabout  18.7   8 360.0 175 3.15 3.44 17.02  0  0    3    2
Duster 360         14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4
Merc 450SE         16.4   8 275.8 180 3.07 4.07 17.40  0  0    3    3
Merc 450SL         17.3   8 275.8 180 3.07 3.73 17.60  0  0    3    3
Merc 450SLC        15.2   8 275.8 180 3.07 3.78 18.00  0  0    3    3
Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.25 17.98  0  0    3    4
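
arrange() also accepts more than one column, in which case the later columns break ties in the earlier ones. A minimal sketch, where sorted is a name made up for this illustration:

```r
library(dplyr)

# Sort by cyl in descending order, then by mpg in ascending order
# within each cyl value
sorted <- mtcars %>% arrange(desc(cyl), mpg)

head(sorted, 3)
```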

rename() # changes the column names

mtcars %>% rename(cylinder = cyl) # renames column cyl to cylinder; only the first few rows are shown below

               mpg cylinder  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0        6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0        6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8        4 108.0  93 3.85 2.320 18.61  1  1    4    1
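
Two details are worth a short sketch: the syntax is new_name = old_name, and rename() returns a modified copy instead of changing mtcars itself, so the result has to be assigned if we want to keep it. The names cars_renamed and horsepower are made up for this illustration.

```r
library(dplyr)

# Several columns can be renamed in one call: new_name = old_name
cars_renamed <- mtcars %>% rename(cylinder = cyl, horsepower = hp)

# The new names exist in the copy
c("cylinder", "horsepower") %in% names(cars_renamed)

# The original data frame still has its old column names
c("cyl", "hp") %in% names(mtcars)
```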

distinct() # finds unique values. Combine distinct() with select() to explore the unique values of a variable.

mtcars %>% select(cyl) %>% distinct()

                  cyl
Mazda RX4           6
Datsun 710          4
Hornet Sportabout   8

distinct() also helps to deal with duplicates, in other words observations that are present in a dataset multiple times: it keeps only the rows that are unique across all variables or across a pre-defined subset of variables. Below is an example of looking for distinct values of selected columns. With .keep_all = TRUE the output features all variables, not only the selected ones, and the first row found for each combination is kept.

mtcars %>% distinct(cyl, gear, .keep_all = TRUE)

                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Toyota Corona     21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Porsche 914-2     26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Ford Pantera L    15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino      19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
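
Since mtcars itself contains no fully duplicated rows, the sketch below first builds a small table that does, and then removes the duplicates with distinct() applied to all variables. The object cars_dup is hypothetical and exists only for this illustration.

```r
library(dplyr)

# Append the first two rows again to create two exact duplicates
# (cars_dup is a made-up object for this illustration)
cars_dup <- rbind(mtcars, mtcars[1:2, ])

# Number of rows before and after dropping exact duplicates
nrow(cars_dup)
nrow(distinct(cars_dup))

# With no columns named, distinct() compares rows across all variables
deduplicated <- cars_dup %>% distinct()
```

Comparing the two row counts is a quick way to check whether a dataset contains repeated observations at all.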

cumsum() # returns a vector whose elements are the cumulative sums
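
cumsum() comes from base R, but it fits well inside a dplyr pipeline. Here is a minimal sketch, assuming we want a running total of hp in the original row order; the names running and cum_hp are made up for this illustration.

```r
library(dplyr)

# Keep only hp and add a running total of it, in the original row order
running <- mtcars %>%
  select(hp) %>%
  mutate(cum_hp = cumsum(hp))

head(running, 3)

# The last cumulative value equals the plain sum of the column
tail(running$cum_hp, 1) == sum(mtcars$hp)
```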

My journey through the data science - by Karolina M'Goma
