
Building a frequency word cloud in R, or in other words what the French government recommends in the time of Corona

A word cloud is a text mining method to visualize textual data. As a result we will see the most frequently used words in the text we are analysing. The packages we are going to use are the following: tm, which is a text mining package; SnowballC, which is a text stemming package; wordcloud, which allows us to generate the cloud image; and RColorBrewer, for choosing the colour palettes.

You can install them first with the install.packages() command shown below. This step is not necessary if for some reason you were using them before and they are already on your computer.
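The full command, run once, looks like this:

install.packages(c("tm", "SnowballC", "wordcloud", "RColorBrewer")) # one-time installation of all four packages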

To build a frequency word cloud I will use a text which I found on the French government's site about COVID-19 and the official recommendations the government gives. The URL of this site is https://www.gouvernement.fr/en/coronavirus-covid-19

Because news about corona changes rapidly and I don't know how long this information will remain available, I copied the text into a file and named it "covid.csv". The steps of building the frequency word cloud will therefore be presented based on a .csv file. Let's get started!
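For reference, here is a minimal sketch of how such a file could be prepared; the sample lines are taken from the page, and the only real requirement is a single column named TEXT, which the code below relies on:

# Minimal sketch: save the page text as a one-column CSV (the TEXT column name is what read.csv below expects)
lines <- c("This information is valid from 28 December 2020.",
           "RESTRICTIONS AND REQUIREMENTS IN METROPOLITAN FRANCE")
write.csv(data.frame(TEXT = lines), "covid.csv", row.names = FALSE)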

library(tm) # Load text mining package
library(SnowballC) # Load SnowballC package for text stemming
library(wordcloud) # Load the wordcloud package
library(RColorBrewer) # Load color palettes package
corona <- read.csv("covid.csv") # read the text

In the first step we need to define a corpus of text for the word cloud analysis. The VectorSource() function from the "tm" package will help us do it. Importantly, the corpus must contain character vectors, so we need to make sure we are dealing with a vector containing only the text. It can be done as follows:

corona <- as.character(corona$TEXT) # convert to characters and create a vector containing only the text
head(corona, 3) # check the first three entries

[1] "This information is valid from 28 December 2020."                       
[2] "On this website you can find information and guidance from the French Government regarding the current outbreak of coronavirus disease COVID-19 in France."
[3] "RESTRICTIONS AND REQUIREMENTS IN METROPOLITAN FRANCE"  

Now we will build the corpus. Note that VectorSource() creates a corpus only from character vectors.

covid <- Corpus(VectorSource(corona)) # convert the text vector into a corpus

If we want to check the result we can use a command like this:

inspect(covid) # inspect the entries in the corpus; I will show only the first few lines of the output here.

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 59
 [1] This information is valid from 28 December 2020.
 [2] On this website you can find information and guidance from the French Government regarding the current outbreak of coronavirus disease COVID-19 in France.
 [3] RESTRICTIONS AND REQUIREMENTS IN METROPOLITAN FRANCE
 [4] A nightly curfew is currently in force in metropolitan France. Between 8PM and 6AM you may only leave your residence for the following reasons and with an exemption certificate:

To check the first entry we can use the command:

covid[[1]]$content # Inspect first entry in the corpus
[1] "This information is valid from 28 December 2020."

We have built the corpus; now, as a second step, I am going to transform this text into structured text data. We are going to use the tm_map() function from "tm". In most cases it is necessary to:

  • convert to lower case (the content_transformer(tolower) function),
  • remove punctuation (the removePunctuation() function),
  • remove numbers (the removeNumbers() function),
  • remove stopwords (words which don't add much meaning to the sentence, e.g. a, the, he, have; we will use removeWords()),
  • reduce words to their stems (as a result we get the root of the word, e.g. loving, lovingly, loved, lover and lovely all reduce to love). The stemDocument() function performs this step; however, I have decided not to use it on this text, because stemDocument() in R cuts some words too much. There are ways to avoid this problem, but that is not the subject of this post.
  • strip whitespace (the stripWhitespace() function)

Transformation of the text by using tm_map()

# Convert terms to lower case

covid <- tm_map(covid, content_transformer(tolower))

# Remove punctuation

covid <- tm_map(covid, removePunctuation)

# Remove numbers

covid<- tm_map(covid,removeNumbers)

# Remove stop words from corpus

covid <- tm_map(covid, removeWords, stopwords("english"))

# Reduce terms to stems in the corpus. As I said before, I decided not to use this step, but here is an example of how to run it:

covid <- tm_map(covid, stemDocument, "english")

# Strip whitespace from corpus

covid <- tm_map(covid, stripWhitespace)

After executing each line I checked the results by displaying the content, e.g.

# Inspect first entry in the corpus

covid[[1]]$content

Note! There is also a function PlainTextDocument() which converts a corpus into a plain text document; however, after using it, when it was time to create the word cloud I got an error:

 "Error in simple_triplet_matrix(i, j, v, nrow = length(terms), ncol = length(corpus),  : 
  'i, j' invalid". 

The word cloud worked after removing this line.
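For reference, the removed line was presumably of this form (shown here only so you know what to avoid):

# covid <- tm_map(covid, PlainTextDocument) # this conversion triggered the simple_triplet_matrix error above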

The next step is to build a term-document matrix, which is a table containing the frequency of the words. We will use the TermDocumentMatrix() function from the tm package.

tdm <- TermDocumentMatrix(covid) # build the term-document matrix
m <- as.matrix(tdm) # convert it to a plain matrix
v <- sort(rowSums(m), decreasing = TRUE) # total frequency of each word, sorted
df <- data.frame(word = names(v), freq = v) # word/frequency table for plotting

Now let's check what the table looks like by using head(), or display all the content with df:

head(df, 15)
                  word freq
information information    5
france           france    5
must               must    5
open               open    5
remain           remain    5
can                 can    4
coronavirus coronavirus    4
covid             covid    4
may                 may    4
work               work    4
public           public    4
will               will    4
virus             virus    4
find               find    3
french           french    3
df # display the whole table

I will not display the whole table here; however, when I did, I noticed that two strange tokens were counted as words, so I decided to delete them.

covid <- tm_map(covid, removeWords, c("one’", "€")) # remove two symbols which were counted as words
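One caveat: removeWords() changes the corpus, not the frequency table we already built, so the term-document matrix and df have to be rebuilt before plotting for the removal to take effect:

tdm <- TermDocumentMatrix(covid) # rebuild the matrix from the cleaned corpus
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
df <- data.frame(word = names(v), freq = v) # df now reflects the removed tokens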

Finally we are ready to create a word cloud with the wordcloud() function.

set.seed(100) # for reproducibility
wordcloud(
  words = df$word,
  freq = df$freq, min.freq = 1,
  max.words = 50, random.order = FALSE, rot.per = 0.3,
  colors = brewer.pal(8, "Dark2"), scale = c(3, 0.45))

Let me explain the parameters: max.words is the maximum number of words in the cloud, rot.per is the proportion of words plotted vertically, and colors selects the colour palette from the RColorBrewer package.
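If you are not sure which palette to pick, RColorBrewer can show you all of them at once:

display.brewer.all() # plot every available RColorBrewer palette
brewer.pal(8, "Dark2") # the vector of 8 hex colours used in the cloud above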

Let's also see what the same input looks like when using the wordcloud2 package. First you need to install it on your computer.

require(devtools) # install_github() comes from devtools
install_github("lchiffon/wordcloud2") # install wordcloud2 from GitHub
library(wordcloud2)
set.seed(100) # for reproducibility

Now it's time to build wordcloud2.

wordcloud2(data = df, color = "random-light", backgroundColor = "black")

Note that df is a data frame with the words in one column and their frequencies in the other.
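wordcloud2() also takes a few cosmetic parameters; as a small illustrative variation (the size and shape values here are arbitrary choices, not from the original post):

wordcloud2(data = df, size = 0.7, shape = "pentagon") # smaller text, pentagon-shaped layout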

We can go further and check how the words which appear most frequently are correlated with each other. For example, if we want to find words which occur at least four times, we can use the command below:

findFreqTerms(tdm, lowfreq = 4)

[1] "information" "can"         "coronavirus" "covid"       "france"      "may"        
 [7] "work"        "must"        "public"      "open"        "remain"      "will"       
[13] "virus"      

Now we can analyse the correlation between frequent terms by using the findAssocs() function.

findAssocs(tdm, terms = "must", corlimit = 0.5)
$must
transport      work 
     0.62      0.52 

findAssocs(tdm, terms = "can", corlimit = 0.5)
$can
protect  french 
   0.69    0.55 

As we can see, "must" was used often when discussing transport and work conditions, while "can" is correlated with protection advice.
