
ggplot2 for data visualization


When I started my adventure with R, I immediately noticed that everybody was talking about ggplot2 and its importance. Type "ggplot2" into Google and check for yourself: you will find lots of professional, polished graphs, articles, blog posts, and other great material. I was so impressed that I even tried to start learning R with ggplot2. I soon understood that I needed some basics first, and that it is better to take your time if you are starting from zero.

Before jumping into the structure of ggplot2, I will share some tips I find useful.

  • First, remember that exploring your data involves several steps. Most of the time you collect data first, then do some pre-processing and exploration, then modelling & analysis, and only after that comes visualization. Of course, graphs can also help you interpret the situation correctly in the earlier steps, but it is important to have prepared, cleaned, and summarized your data set beforehand. Therefore it is crucial to learn those skills first.
  • Secondly, be aware of what types of variables are in your dataset, because this determines which graphs are appropriate. Are you analysing categorical data, numeric data, or time-series data? Do you want to create a graph of one variable, two, or maybe more?
  • Thirdly, check what basic plot types exist and what they show. The scatterplot, histogram, boxplot, bar chart, dot plot, pie chart, bubble chart, heatmap, line chart, and step chart are the basics. You have to know what each of them represents in order to choose the proper one.
  • Next, be aware of the plotting systems available. You can do most visualisations in base R, lattice, and ggplot2, but each plotting system has its pros and cons, and it is helpful to know them (see the sketch after this list).
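
As a quick illustration of how two of these systems differ, here is the same histogram drawn in base R and in ggplot2, using the mtcars dataset that ships with R (the variable choice is purely for illustration):

# Base R: a single function call, quick but less flexible
hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")

# ggplot2: a layered grammar that is easier to extend and format
library(ggplot2)
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)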

Coming back to ggplot2: it is one of the core packages of the tidyverse collection, so in order to use it you either install the whole tidyverse or just ggplot2 on its own.

install.packages("tidyverse")
install.packages("ggplot2")

The general setup of a ggplot2 call

ggplot(data, aes()) + geom_type() + geom_type2(optional) + theme(optional) + ...

Most of the time you start with ggplot(), supplying a dataset and an aesthetic mapping (with aes()). After that you add layers, such as geom_point() if you want a scatterplot, for example. Through layers you can add many elements to your graph, format them, and include annotations. The important rule is not to put in too much at once: use just enough elements to keep the graph readable. More doesn't mean better in this case.
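
To make this concrete, here is a minimal sketch of the layered structure using the built-in mtcars dataset (the variables and labels are chosen purely for illustration):

library(ggplot2)

# data + aesthetic mapping, then a point layer, then optional tweaks
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Fuel efficiency vs. weight",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon") +
  theme_minimal()

Each element after a + adds or modifies a layer, so you can build the plot up step by step and stop as soon as it communicates what you need.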

Remember that ggplot2 does a great job when you want to show what the data is saying, and it takes only a couple of lines of code, so it is a good investment to learn it. Next time I will try to prove this by showing some examples of using ggplot2.
