
Building a frequency word cloud in R, or in other words what the French government recommends in the time of Corona

A word cloud is a text mining method to visualize textual data. As a result we will see the most frequently used words in the text we are analysing. The packages we are going to use are the following: tm, which is a text mining package; SnowballC, which is a text stemming package; wordcloud, which allows us to generate the cloud image; and RColorBrewer, for choosing the colour palettes.

You can install them first with the install.packages() command shown below. This step is not necessary if for some reason you were using them before and they are already on your computer.
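The full command, run once, looks like this:

install.packages(c("tm", "SnowballC", "wordcloud", "RColorBrewer")) # one-time installation of all four packages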

To build a frequency word cloud I will use a text which I found on the French government's site about COVID-19 and the official recommendations the government gives. The URL of this site is https://www.gouvernement.fr/en/coronavirus-covid-19

Because news about corona changes rapidly and I don't know how long this information will remain available, I copied the text into a file and named it "covid.csv". The steps of building the frequency word cloud will therefore be presented based on a .csv file. Let's get started!
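For reference, here is a minimal sketch of how such a file could be prepared; the sample lines are taken from the page, and the only real requirement is a single column named TEXT, which the code below relies on:

# Minimal sketch: save the page text as a one-column CSV (the TEXT column name is what read.csv below expects)
lines <- c("This information is valid from 28 December 2020.",
           "RESTRICTIONS AND REQUIREMENTS IN METROPOLITAN FRANCE")
write.csv(data.frame(TEXT = lines), "covid.csv", row.names = FALSE)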

library(tm) # Load text mining package
library(SnowballC) # Load SnowballC package for text stemming
library(wordcloud) # Load the wordcloud package
library(RColorBrewer) # Load color palettes package
corona <- read.csv("covid.csv") # read the text

In the first step we need to define a corpus of text for the word cloud analysis. The VectorSource() function from the "tm" package will help us do it. Importantly, the corpus must contain character vectors, so we need to make sure we are dealing with a vector containing only the text. It can be done as follows:

corona <- as.character(corona$TEXT) # convert to characters and create a vector containing only the text
head(corona, 3) # check the first three entries

[1] "This information is valid from 28 December 2020."                       
[2] "On this website you can find information and guidance from the French Government regarding the current outbreak of coronavirus disease COVID-19 in France."
[3] "RESTRICTIONS AND REQUIREMENTS IN METROPOLITAN FRANCE"  

Now we will build the corpus. Note that VectorSource() creates a corpus only from character vectors.

covid <- Corpus(VectorSource(corona)) # convert the text vector into a corpus

If we want to check the result we can use a command like this:

inspect(covid) # inspect the entries in the corpus; I will show only the first few lines of the output here.

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 59
 [1] This information is valid from 28 December 2020.
 [2] On this website you can find information and guidance from the French Government regarding the current outbreak of coronavirus disease COVID-19 in France.
 [3] RESTRICTIONS AND REQUIREMENTS IN METROPOLITAN FRANCE
 [4] A nightly curfew is currently in force in metropolitan France. Between 8PM and 6AM you may only leave your residence for the following reasons and with an exemption certificate:

To check the first entry we can use the command:

covid[[1]]$content # Inspect first entry in the corpus
[1] "This information is valid from 28 December 2020."

We have built the corpus; now, as a second step, I am going to transform this text into structured text data. We are going to use the tm_map() function from "tm". In most cases it is necessary to:

  • convert to lower case (the content_transformer(tolower) function),
  • remove punctuation (the removePunctuation() function),
  • remove numbers (the removeNumbers() function),
  • remove stopwords (words which don't add much meaning to the sentence, e.g. a, the, he, have; we will use removeWords()),
  • reduce words to their stems (as a result we get the root of the word, e.g. loving, lovingly, loved, lover and lovely all reduce to love). The stemDocument() function performs this step; however, I have decided not to use it on this text, because stemDocument() in R cuts some words too much. There are ways to avoid this problem, but that is not the subject of this post.
  • strip whitespace (the stripWhitespace() function)

Transformation of the text by using tm_map()

# Convert terms to lower case

covid <- tm_map(covid, content_transformer(tolower))

# Remove punctuation

covid <- tm_map(covid, removePunctuation)

# Remove numbers

covid<- tm_map(covid,removeNumbers)

# Remove stop words from corpus

covid <- tm_map(covid, removeWords, stopwords("english"))

# Reduce terms to stems in the corpus. As I said before, I decided not to use this step, but here is an example of how to run it:

covid <- tm_map(covid, stemDocument, "english")

# Strip whitespace from corpus

covid <- tm_map(covid, stripWhitespace)

After executing each line I checked the results by displaying the content, e.g.

# Inspect first entry in the corpus

covid[[1]]$content

Note! There is also a function PlainTextDocument() which converts a corpus into a plain text document; however, after using it, when it was time to create the word cloud I got an error:

 "Error in simple_triplet_matrix(i, j, v, nrow = length(terms), ncol = length(corpus),  : 
  'i, j' invalid". 

The word cloud worked after removing this line.
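For reference, the removed line was presumably of this form (shown here only so you know what to avoid):

# covid <- tm_map(covid, PlainTextDocument) # this conversion triggered the simple_triplet_matrix error above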

The next step is to build a term-document matrix, which is a table containing the frequency of the words. We will use the TermDocumentMatrix() function from the tm package.

tdm <- TermDocumentMatrix(covid) # build the term-document matrix
m <- as.matrix(tdm) # convert it to a plain matrix
v <- sort(rowSums(m), decreasing = TRUE) # total frequency of each word, sorted
df <- data.frame(word = names(v), freq = v) # word/frequency table for plotting

Now let's check what the table looks like by using head(), or display all the content with df:

head(df, 15)
                  word freq
information information    5
france           france    5
must               must    5
open               open    5
remain           remain    5
can                 can    4
coronavirus coronavirus    4
covid             covid    4
may                 may    4
work               work    4
public           public    4
will               will    4
virus             virus    4
find               find    3
french           french    3
df # display the whole table

I will not display the whole table here; however, when I did, I noticed that two strange tokens were counted as words, so I decided to delete them.

covid <- tm_map(covid, removeWords, c("one’", "€")) # remove two symbols which were counted as words
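One caveat: removeWords() changes the corpus, not the frequency table we already built, so the term-document matrix and df have to be rebuilt before plotting for the removal to take effect:

tdm <- TermDocumentMatrix(covid) # rebuild the matrix from the cleaned corpus
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE)
df <- data.frame(word = names(v), freq = v) # df now reflects the removed tokens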

Finally we are ready to create a word cloud with the wordcloud() function.

set.seed(100) # for reproducibility
wordcloud(
  words = df$word,
  freq = df$freq, min.freq = 1,
  max.words = 50, random.order = FALSE, rot.per = 0.3,
  colors = brewer.pal(8, "Dark2"), scale = c(3, 0.45))

Let me explain the parameters: max.words is the maximum number of words in the cloud, rot.per is the proportion of words plotted vertically, and colors selects the colour palette from the RColorBrewer package.
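If you are not sure which palette to pick, RColorBrewer can show you all of them at once:

display.brewer.all() # plot every available RColorBrewer palette
brewer.pal(8, "Dark2") # the vector of 8 hex colours used in the cloud above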

Let's also see what the same input looks like when using the wordcloud2 package. First you need to install it on your computer.

require(devtools) # install_github() comes from devtools
install_github("lchiffon/wordcloud2") # install wordcloud2 from GitHub
library(wordcloud2)
set.seed(100) # for reproducibility

Now it's time to build wordcloud2.

wordcloud2(data = df, color = "random-light", backgroundColor = "black")

Note that df is a data frame with the words in one column and their frequencies in the other.
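wordcloud2() also takes a few cosmetic parameters; as a small illustrative variation (the size and shape values here are arbitrary choices, not from the original post):

wordcloud2(data = df, size = 0.7, shape = "pentagon") # smaller text, pentagon-shaped layout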

We can go further and check how the words which appear most frequently are correlated with each other. For example, if we want to find words which occur at least four times, we can use the command below:

findFreqTerms(tdm, lowfreq = 4)

[1] "information" "can"         "coronavirus" "covid"       "france"      "may"        
 [7] "work"        "must"        "public"      "open"        "remain"      "will"       
[13] "virus"      

Now we can analyse the correlation between frequent terms by using the findAssocs() function.

findAssocs(tdm, terms = "must", corlimit = 0.5)
$must
transport      work 
     0.62      0.52 

findAssocs(tdm, terms = "can", corlimit = 0.5)
$can
protect  french 
   0.69    0.55 

As we can see, "must" was used often when discussing transport and work conditions, while "can" is correlated with protection advice.
