
ggplot2 visualizations - examples - on the subject of COVID-19 again!

I have prepared examples of two charts, a multiseries line chart and a scatterplot, to illustrate how ggplot2 works. I have also added some formatting elements to show how we can improve the look of our charts.

We need the libraries below to create our graphs. Install them if you don't have them yet.

library(readr) #data import tool, part of the Tidyverse.
library(dplyr) #perfect package for data manipulation, queries and much more, part of the Tidyverse.
library(ggplot2) #the subject of this post, package for data visualizations, part of the Tidyverse.
library(RColorBrewer) #package which is helpful while we are choosing the colors.
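
If some of these packages are missing, one possible way to install only the missing ones is sketched below. The names pkgs and to_install are just illustrative variables, not part of any package.

```r
# Packages used in this post.
pkgs <- c("readr", "dplyr", "ggplot2", "RColorBrewer")

# Keep only the packages that cannot be loaded, i.e. are not installed yet.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
to_install <- pkgs[!installed]

# Install what is missing (does nothing if everything is already there).
if (length(to_install) > 0) install.packages(to_install)
```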

I downloaded the data file "COVID_data_2021_01_19" from the website https://shiny.rstudio.com/gallery/covid19-tracker. Thanks to the readr package I imported the dataset into R and transformed it a bit. I used dplyr to pick the observations I wanted to visualize and stored them in the "covid2" data.frame.
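
The import step itself is not shown in this post. A minimal sketch of how it could look with readr, assuming the download is a CSV file, is given here. To keep the example self-contained it writes a two-row stand-in file to a temporary path; the object name covid_toy and the reduced set of columns are assumptions made for illustration.

```r
library(readr)

# Stand-in for the downloaded file: two rows and three columns,
# written to a temporary CSV so the example runs on its own.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "country,date,cumulative_cases",
  "France,2020-02-04,6",
  "UK,2020-02-25,34"
), path)

# read_csv() guesses column types; col_types makes the date parsing explicit.
covid_toy <- read_csv(path, col_types = cols(
  country = col_character(),
  date = col_date(format = "%Y-%m-%d"),
  cumulative_cases = col_double()
))

str(covid_toy)
```

With the real file, path would point to wherever the download was saved.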

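covid <- COVID_data_2021_01_19 # the imported dataset
covid$country <- as.factor(covid$country) # make sure this variable is a factor
covid$date <- as.Date(covid$date) # make sure the date has class Date (ISO dates like 2020-02-04 need no format string)
covid <- as.data.frame(covid) # store the data set as a data.frame
# %in% keeps every row of the chosen countries; == would recycle the vector and silently drop rows
covid2 <- covid %>% filter(country %in% c("France", "Poland", "UK", "Germany", "Spain", "Italy"))
head(covid2)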

The first six rows of my data.frame look as follows:

  country       date cumulative_cases new_cases_past_week cumulative_deaths new_
1  France 2020-02-04                6                   2                 0    0
2      UK 2020-02-25               34                  15                 0    0
3  France 2020-03-03              212                 198                 4    3
4      UK 2020-03-03              189                 155                 0    0
5  France 2020-03-10             1783                1571                33   29
6   Italy 2020-03-24            69176               37670              6820 4317
  cumulative_cases_per_million new_cases_per_million_past_week cumul
1                          0.1                             0.0   0.0
2                          0.5                             0.2   0.0
3                          3.2                             3.0   0.1
4                          2.8                             2.3   0.0
5                         27.3                            24.1   0.5
6                       1144.1                           623.0 112.8
  new_deaths_per_million_past_week
1                              0.0
2                              0.0
3                              0.0
4                              0.0
5                              0.4
6                             71.4

The covid2 data.frame contains the countries I have chosen: France, Poland, UK, Germany, Spain and Italy. The variable "country" is categorical. The remaining variables are numeric, except for "date", which holds dates.
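
A side note on picking several countries at once: inside filter(), `==` and `%in%` behave very differently when the right-hand side is a vector. The sketch below uses a small made-up data frame (toy, with a hypothetical day column) to show the difference.

```r
library(dplyr)

# Made-up data: two countries observed on two days.
toy <- data.frame(
  country = c("France", "UK", "France", "UK"),
  day = c(1, 1, 2, 2)
)

# `==` recycles c("UK", "France") along the column: row 1 is compared
# with "UK", row 2 with "France", row 3 with "UK", and so on.
by_equal <- toy %>% filter(country == c("UK", "France"))

# `%in%` tests every row against the whole set of countries.
by_in <- toy %>% filter(country %in% c("UK", "France"))

nrow(by_equal) # 0
nrow(by_in)    # 4
```

With `==`, rows survive only when their position happens to line up with the recycled vector, so a filtered data set with unexpected gaps between dates is a typical symptom.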

Multiseries line chart

ggplot(data = covid2, aes(x = date, y = cumulative_cases, color = country)) +
  geom_line(size = 1.25) + # line chart with thicker lines
  theme_gray(base_size = 12) +
  ggtitle("Number of covid 19 cases") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) + # centred, bold title
  xlab("Month") +
  ylab("Number of cases") +
  theme(panel.background = element_rect(fill = "cornsilk")) +
  guides(color = guide_legend(title = "Country", label.position = "right", reverse = TRUE))
                                                                           
The result of this code looks like this:


Let me now explain what each of the ggplot2 layers means. We start by defining the data we are going to use, which is the covid2 data.frame. Then we define the aesthetics: the date goes on the x-axis, the numeric variable "cumulative_cases" goes on the y-axis, and the categorical variable "country" is shown by colour. Next we need to specify what type of chart we want. I did it with geom_line(), which returns a line chart. If we want a different type of chart, we use a different geom function. For better visibility I set the line size to 1.25.
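
To make the idea of layers more concrete, here is a small sketch on made-up data (the toy data frame and its values are invented for illustration). The data and aesthetics are defined once, and only the geom decides the chart type.

```r
library(ggplot2)

# Made-up data: two countries, two dates each.
toy <- data.frame(
  date = as.Date(c("2020-03-01", "2020-03-08", "2020-03-01", "2020-03-08")),
  cumulative_cases = c(10, 20, 5, 15),
  country = c("A", "A", "B", "B")
)

# Data and aesthetics only: this draws empty axes, no geom yet.
base <- ggplot(data = toy, aes(x = date, y = cumulative_cases, color = country))

# The same base with two different geoms gives two different chart types.
line_chart <- base + geom_line()
point_chart <- base + geom_point()

line_chart
```

Printing point_chart instead shows the same data as points, without touching the data or the aesthetics.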

The remaining functions help me format the chart. With ggtitle() I defined the main title, and with theme() I made this title bold and centred it above the chart. I added the names of the x and y axes with xlab() and ylab(). With a second theme() call I changed the colour of the panel background to "cornsilk". Finally, I formatted the legend with guides().
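
The charts in this post keep the default line colours. As one possible way to choose them with the help of RColorBrewer, ggplot2 offers scale_color_brewer(). The sketch below applies it to made-up data; the palette "Set1" is only an example choice, not the one used for the charts above.

```r
library(ggplot2)

# Made-up data, values invented for illustration only.
toy <- data.frame(
  date = as.Date(c("2020-03-01", "2020-03-08", "2020-03-01", "2020-03-08")),
  cumulative_cases = c(10, 20, 5, 15),
  country = c("A", "A", "B", "B")
)

p <- ggplot(toy, aes(x = date, y = cumulative_cases, color = country)) +
  geom_line() +
  scale_color_brewer(palette = "Set1") # qualitative ColorBrewer palette

# Names of all palettes that ship with RColorBrewer.
rownames(RColorBrewer::brewer.pal.info)

p
```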

Scatterplot

ggplot(data = covid2,
       aes(x = cumulative_deaths, y = cumulative_cases, color = country)) +
  geom_point() +
  geom_smooth(method = "lm") + # linear regression line, one per country
  ggtitle("Correlation between number of deaths vs. number of cases per country") +
  xlab("Number of deaths") +
  ylab("Number of cases") +
  theme_gray(base_size = 12) +
  theme(plot.title = element_text(hjust = 0.5), title = element_text(face = "bold")) +
  theme(panel.background = element_rect(fill = "cornsilk")) +
  guides(color = guide_legend(title = "Country", label.position = "left", reverse = TRUE))

The result of this code looks like this:


Let me now explain what each of the ggplot2 layers means. We start by defining the data we are going to use, which is the covid2 data.frame. Then we define the aesthetics: the numeric variable "cumulative_deaths" goes on the x-axis, the numeric variable "cumulative_cases" goes on the y-axis, and the categorical variable "country" is shown by colour. Next we need to specify what type of chart we want. I did it with geom_point(), which returns a scatterplot. With geom_smooth(method="lm") I add a linear regression line; because colour is mapped to "country", one line is fitted per country.
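
To show what the regression layer does, here is a sketch on made-up data (the toy data frame, its values and the country labels "A" and "B" are invented). It builds the same kind of plot and then fits the line for one country directly with lm(), which is the model that method = "lm" uses.

```r
library(ggplot2)

# Made-up data: four observations for each of two countries.
toy <- data.frame(
  cumulative_deaths = c(1, 2, 3, 4, 1, 2, 3, 4),
  cumulative_cases = c(10, 21, 29, 41, 5, 9, 16, 20),
  country = rep(c("A", "B"), each = 4)
)

# se = FALSE hides the confidence band, so only the fitted lines are drawn.
p <- ggplot(toy, aes(x = cumulative_deaths, y = cumulative_cases, color = country)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# The same straight-line fit outside ggplot2, for country "A" only.
fit_a <- lm(cumulative_cases ~ cumulative_deaths, data = subset(toy, country == "A"))
coef(fit_a) # intercept and slope of the line drawn for "A"

p
```

Leaving out se = FALSE, as in the chart above, adds a shaded confidence band around each line.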

The remaining functions help me format the chart. With ggtitle() I defined the main title. I added the names of the x and y axes with xlab() and ylab(). With theme() I made the title bold and centred it above the chart, and I changed the colour of the panel background to "cornsilk". Finally, I formatted the legend with guides().
