Joining observation units with dplyr

Today I would like to show examples of the different ways you can join data frames. Let's define and display them first. In the first data.frame I will collect some information about certain people: their name, height (stored in the column high) and nationality.

df1 <- data.frame(name = c("Ania", "Marek", "Kamil", "Joanna", "Patrice"), high = c(178, 190, 175, 168, 175), nationality = c("polish", "polish", "polish", "polish", "french"))
df1

     name high nationality
1    Ania  178      polish
2   Marek  190      polish
3   Kamil  175      polish
4  Joanna  168      polish
5 Patrice  175      french

In the second data.frame I will put observations about another group, containing their name and weight. What do these two data.frames have in common? Both contain a column with the person's name and, what is more, some people, like Ania and Patrice, are described in both data.frames.

df2 <- data.frame(name = c("Ania", "Julia", "Patrice", "Jim"), weight = c(67, 55, 75, 75))
df2

     name weight
1    Ania     67
2   Julia     55
3 Patrice     75
4     Jim     75

Let's now look at the different ways of joining these two data.frames:

library(dplyr)

inner_join(df1,df2) 

inner_join() returns only the observations that are present in both df1 and df2. The result contains all columns from both df1 and df2.

Joining, by = "name"
     name high nationality weight
1    Ania  178      polish     67
2 Patrice  175      french     75
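The "Joining, by" message appears because dplyr had to guess the key from the column names the two tables share. A minimal sketch of naming the key yourself is shown here. The renamed table df2_person and its column person are hypothetical and only illustrate the case where the key has a different name in each table.

```r
library(dplyr)

# Same data frames as in the post
df1 <- data.frame(name = c("Ania", "Marek", "Kamil", "Joanna", "Patrice"),
                  high = c(178, 190, 175, 168, 175),
                  nationality = c("polish", "polish", "polish", "polish", "french"))
df2 <- data.frame(name = c("Ania", "Julia", "Patrice", "Jim"),
                  weight = c(67, 55, 75, 75))

# Naming the key explicitly avoids the "Joining, by" message
by_name <- inner_join(df1, df2, by = "name")

# Hypothetical variant: the key column is called `person` in the second table
df2_person <- rename(df2, person = name)
by_person <- inner_join(df1, df2_person, by = c("name" = "person"))

by_name
by_person
```

Both calls should give the same two rows. In the second call the key column keeps the name it has in the first table.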

left_join(df1,df2)

left_join() returns every observation of df1, with the columns of both df1 and df2. Observations of df1 that have no match in df2 get NA in the columns coming from df2.

Joining, by = "name"
     name high nationality weight
1    Ania  178      polish     67
2   Marek  190      polish     NA
3   Kamil  175      polish     NA
4  Joanna  168      polish     NA
5 Patrice  175      french     75

right_join(df1,df2)

right_join() returns every observation of df2, with the columns of both df1 and df2. Observations of df2 that have no match in df1 get NA in the columns coming from df1.

Joining, by = "name"
     name high nationality weight
1    Ania  178      polish     67
2 Patrice  175      french     75
3   Julia   NA        <NA>     55
4     Jim   NA        <NA>     75
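A right join is a left join with the two tables swapped. The sketch below checks this on our two data.frames. The helper names right_sorted and left_sorted are mine, and the sorting step is there only because the two calls may return rows and columns in a different order.

```r
library(dplyr)

# Same data frames as in the post
df1 <- data.frame(name = c("Ania", "Marek", "Kamil", "Joanna", "Patrice"),
                  high = c(178, 190, 175, 168, 175),
                  nationality = c("polish", "polish", "polish", "polish", "french"))
df2 <- data.frame(name = c("Ania", "Julia", "Patrice", "Jim"),
                  weight = c(67, 55, 75, 75))

right <- right_join(df1, df2, by = "name")

# Swapping the arguments and using left_join() keeps the same observations
left_swapped <- left_join(df2, df1, by = "name")

# Put both results into the same column and row order before comparing
right_sorted <- right %>%
  select(name, high, nationality, weight) %>%
  arrange(name)
left_sorted <- left_swapped %>%
  select(name, high, nationality, weight) %>%
  arrange(name)

all.equal(right_sorted, left_sorted)
```

Which of the two you write is mostly a matter of which table you want to read first in the call.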

full_join(df1,df2)

full_join() returns all observations present in df1 or df2, with the columns of both. Wherever an observation is missing from one of the tables, the columns of that table are filled with NA.

Joining, by = "name"
     name high nationality weight
1    Ania  178      polish     67
2   Marek  190      polish     NA
3   Kamil  175      polish     NA
4  Joanna  168      polish     NA
5 Patrice  175      french     75
6   Julia   NA        <NA>     55
7     Jim   NA        <NA>     75

anti_join(df1,df2)

anti_join() returns the rows of df1 that have no match in df2. Only the columns of df1 are kept.

Joining, by = "name"
    name high nationality
1  Marek  190      polish
2  Kamil  175      polish
3 Joanna  168      polish
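anti_join() is not symmetric, so swapping the arguments answers a different question: who is in df2 but not in df1? The sketch below runs it in both directions and checks how the pieces add up to the full join. The names only_df1, only_df2, both and everyone are mine, and the row-count check relies on each name appearing at most once per table, which is true for this data.

```r
library(dplyr)

# Same data frames as in the post
df1 <- data.frame(name = c("Ania", "Marek", "Kamil", "Joanna", "Patrice"),
                  high = c(178, 190, 175, 168, 175),
                  nationality = c("polish", "polish", "polish", "polish", "french"))
df2 <- data.frame(name = c("Ania", "Julia", "Patrice", "Jim"),
                  weight = c(67, 55, 75, 75))

only_df1 <- anti_join(df1, df2, by = "name")  # people described only in df1
only_df2 <- anti_join(df2, df1, by = "name")  # people described only in df2
both     <- inner_join(df1, df2, by = "name") # people described in both
everyone <- full_join(df1, df2, by = "name")

# With unique keys, the three pieces together cover the full join
nrow(everyone) == nrow(both) + nrow(only_df1) + nrow(only_df2)
```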

semi_join(df1,df2)

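semi_join() returns the rows of df1 that have a match in df2. It is the opposite of anti_join(): only the columns of df1 are kept, and nothing from df2 is added.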

Joining, by = "name"
     name high nationality
1    Ania  178      polish
2 Patrice  175      french
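On this data semi_join() looks like inner_join() without the weight column, but the two differ when a key is repeated in the second table. The sketch below uses a hypothetical table df2_repeated, in which Ania was weighed twice, to show the difference.

```r
library(dplyr)

# Same first data frame as in the post
df1 <- data.frame(name = c("Ania", "Marek", "Kamil", "Joanna", "Patrice"),
                  high = c(178, 190, 175, 168, 175),
                  nationality = c("polish", "polish", "polish", "polish", "french"))

# Hypothetical second table: Ania appears twice
df2_repeated <- data.frame(name = c("Ania", "Ania", "Patrice"),
                           weight = c(67, 68, 75))

# inner_join() returns one row per matching pair, so Ania shows up twice
inner_dup <- inner_join(df1, df2_repeated, by = "name")

# semi_join() only filters df1, so each row of df1 appears at most once
semi_dup <- semi_join(df1, df2_repeated, by = "name")

nrow(inner_dup)
nrow(semi_dup)
```

This makes semi_join() a safe choice when the second table is only used to decide which rows of the first one to keep.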

My journey through the data science - by Karolina M'Goma
