Joining observation units with dplyr

Today I would like to show examples of the different ways you can join data frames. Let's define and display them first. In the first data.frame I will collect some information about a few people: their name, height (stored in the column high) and nationality.

df1 <- data.frame(name = c("Ania", "Marek", "Kamil", "Joanna", "Patrice"), high = c(178, 190, 175, 168, 175), nationality = c("polish", "polish", "polish", "polish", "french"))
df1
     name high nationality
1    Ania   178      polish
2   Marek   190      polish
3   Kamil   175      polish
4  Joanna   168      polish
5 Patrice   175      french

In the second data.frame I will put observations about another group, containing their name and weight. What do these two data frames have in common? Both contain a column with the person's name, and some people, such as Ania and Patrice, are described in both data frames.

df2 <- data.frame(name = c("Ania", "Julia", "Patrice", "Jim"), weight = c(67, 55, 75, 75))
df2
     name weight
1    Ania     67
2   Julia     55
3 Patrice     75
4     Jim     75

Let's now see the different ways of joining these two data frames:
library(dplyr)
inner_join(df1,df2) 

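inner_join() returns only the observations that are present in both df1 and df2. All columns from both df1 and df2 are kept. Because no by argument is given, dplyr joins on the only column the two data frames share, name, and reports this in a message.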

Joining, by = "name"
     name high nationality weight
1    Ania   178      polish     67
2 Patrice   175      french     75
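
If you prefer not to rely on the automatically detected key, you can name it yourself with the by argument. The sketch below rebuilds the two data frames from this post so that it runs on its own; the object name both is only my choice for illustration.

```r
library(dplyr)

# The two example data frames from this post
df1 <- data.frame(name = c("Ania", "Marek", "Kamil", "Joanna", "Patrice"),
                  high = c(178, 190, 175, 168, 175),
                  nationality = c("polish", "polish", "polish", "polish", "french"))
df2 <- data.frame(name = c("Ania", "Julia", "Patrice", "Jim"),
                  weight = c(67, 55, 75, 75))

# Naming the key explicitly suppresses the "Joining, by = ..." message
both <- inner_join(df1, df2, by = "name")
both
```

Stating the key explicitly also protects you if the data frames later gain another column with a shared name, which would otherwise silently become part of the join key.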

left_join(df1,df2) 
left_join() returns every row of df1, with the columns of both df1 and df2. Rows of df1 that have no match in df2 get NA in the columns coming from df2.

Joining, by = "name"
     name high nationality weight
1    Ania   178      polish     67
2   Marek   190      polish     NA
3   Kamil   175      polish     NA
4  Joanna   168      polish     NA
5 Patrice   175      french     75

right_join(df1,df2) 

right_join() returns every row of df2, with the columns of both df1 and df2. Rows of df2 that have no match in df1 get NA in the columns coming from df1.

Joining, by = "name"
     name high nationality weight
1    Ania   178      polish     67
2 Patrice   175      french     75
3   Julia    NA        <NA>     55
4     Jim    NA        <NA>     75
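
A right join can also be read as a left join with the two data frames swapped. Here is a minimal sketch of that idea, assuming the same df1 and df2 as above; via_right and via_left are hypothetical names used only for this comparison.

```r
library(dplyr)

df1 <- data.frame(name = c("Ania", "Marek", "Kamil", "Joanna", "Patrice"),
                  high = c(178, 190, 175, 168, 175),
                  nationality = c("polish", "polish", "polish", "polish", "french"))
df2 <- data.frame(name = c("Ania", "Julia", "Patrice", "Jim"),
                  weight = c(67, 55, 75, 75))

via_right <- right_join(df1, df2, by = "name")
via_left  <- left_join(df2, df1, by = "name")

# Same people and same columns in both results; only the ordering differs
same_people  <- setequal(via_right$name, via_left$name)
same_columns <- setequal(names(via_right), names(via_left))
same_people
same_columns
```

The column order follows the first argument, so via_left starts with name and weight, while via_right starts with the columns of df1.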

full_join(df1,df2) 

full_join() returns every observation present in df1 or df2, with the columns of both. Values that are missing on either side are filled with NA.

Joining, by = "name"
     name high nationality weight
1    Ania   178      polish     67
2   Marek   190      polish     NA
3   Kamil   175      polish     NA
4  Joanna   168      polish     NA
5 Patrice   175      french     75
6   Julia    NA        <NA>     55
7     Jim    NA        <NA>     75
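
A quick way to check a full join is to count rows. As a sketch, assuming each name appears at most once in each data frame (which is true for this example), the full join should have as many rows as df1 and df2 together, minus the rows they share. The names n_full and n_expected are just illustrative.

```r
library(dplyr)

df1 <- data.frame(name = c("Ania", "Marek", "Kamil", "Joanna", "Patrice"),
                  high = c(178, 190, 175, 168, 175),
                  nationality = c("polish", "polish", "polish", "polish", "french"))
df2 <- data.frame(name = c("Ania", "Julia", "Patrice", "Jim"),
                  weight = c(67, 55, 75, 75))

n_full <- nrow(full_join(df1, df2, by = "name"))

# Rows in df1 plus rows in df2, minus the people counted twice
n_expected <- nrow(df1) + nrow(df2) - nrow(inner_join(df1, df2, by = "name"))

n_full == n_expected
```

With duplicated names the count would no longer add up this way, because each matching pair of rows produces its own row in the result.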

anti_join(df1,df2) 
anti_join() returns the rows of df1 that have no match in df2. Only the columns of df1 are kept.

Joining, by = "name"
    name high nationality
1  Marek   190      polish
2  Kamil   175      polish
3 Joanna   168      polish
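
The order of the arguments matters for anti_join(). A small sketch, again assuming the df1 and df2 from this post, shows the reverse direction: the people who are in df2 but not in df1. The name only_df2 is hypothetical.

```r
library(dplyr)

df1 <- data.frame(name = c("Ania", "Marek", "Kamil", "Joanna", "Patrice"),
                  high = c(178, 190, 175, 168, 175),
                  nationality = c("polish", "polish", "polish", "polish", "french"))
df2 <- data.frame(name = c("Ania", "Julia", "Patrice", "Jim"),
                  weight = c(67, 55, 75, 75))

# Rows of df2 whose name does not appear in df1; only df2's columns are kept
only_df2 <- anti_join(df2, df1, by = "name")
only_df2
```

This is handy for spotting which keys would end up with NA values after a left or right join.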

semi_join(df1,df2)

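semi_join() returns the rows of df1 that have a match in df2. Unlike inner_join(), it keeps only the columns of df1, so df2 is used purely as a filter.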

Joining, by = "name"
     name high nationality
1    Ania   178      polish
2 Patrice   175      french
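
One way to think about semi_join() is as a filter on df1. The sketch below, which assumes a single key column as in this post, compares it with filter() and %in%; the names via_semi and via_filter are only for illustration.

```r
library(dplyr)

df1 <- data.frame(name = c("Ania", "Marek", "Kamil", "Joanna", "Patrice"),
                  high = c(178, 190, 175, 168, 175),
                  nationality = c("polish", "polish", "polish", "polish", "french"))
df2 <- data.frame(name = c("Ania", "Julia", "Patrice", "Jim"),
                  weight = c(67, 55, 75, 75))

via_semi   <- semi_join(df1, df2, by = "name")

# Keep the rows of df1 whose name also occurs in df2
via_filter <- filter(df1, name %in% df2$name)

identical(via_semi$name, via_filter$name)
```

With several key columns the filter() version gets awkward quickly, which is where semi_join() is more convenient.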
