Building a frequency word cloud in R, or in other words: what the French government recommends in the time of Corona
A word cloud is a text mining method to visualize textual data. As a result we will see the most frequently used words in the text we are analysing. The packages we are going to use are the following: tm, which is a text mining package; SnowballC, which is a text stemming package; wordcloud, which allows us to generate the cloud image; and RColorBrewer for choosing the colour palettes.
You can install them first by using the command install.packages(c("tm", "SnowballC", "wordcloud", "RColorBrewer")). This step is not necessary if, for some reason, you have used them before and they are already on your computer.
To build a frequency word cloud I will use a text which I found on the French government's site about COVID-19 and the official recommendations the government is giving. The URL of this site is https://www.gouvernement.fr/en/coronavirus-covid-19
Because news about corona changes rapidly and I don't know how long this information will be available, I copied the text into a file and named it "covid.csv". Therefore the steps of building the frequency word cloud will be presented based on a .csv file. Let's get started!
library(tm) # Load tm, the text mining package
library(SnowballC) # Load SnowballC package for text stemming
library(wordcloud) # Load the wordcloud package
library(RColorBrewer) # Load color palettes package
In the first step we need to define a corpus of text for the word cloud analysis. The VectorSource() function from the "tm" package will help us do it. What is important is that the corpus contains character vectors, therefore we need to make sure we are dealing with a vector containing only the text. It can be done as follows:
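Here is a minimal sketch of loading the file; I assume covid.csv holds a single column of text with a header row, so adjust read.csv() to your file's layout.

covid_file <- read.csv("covid.csv", stringsAsFactors = FALSE) # read the saved text
corona <- as.character(covid_file[[1]]) # keep the first (only) column as a character vector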
head(corona, 3) # check the first three entries
[3] "RESTRICTIONS AND REQUIREMENTS IN METROPOLITAN FRANCE"
Now we will build the corpus. Note that VectorSource() accepts only a character vector, and each element of the vector becomes a separate document.
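A minimal sketch of the call (Corpus() and VectorSource() both come from tm; inspect() prints a summary together with the documents):

covid <- Corpus(VectorSource(corona)) # each element of the vector becomes one document
inspect(covid) # show the corpus summary and its contents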
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 59

[1] This information is valid from 28 December 2020.
[2] On this website you can find information and guidance from the French Government regarding the current outbreak of coronavirus disease COVID-19 in France.
[3] RESTRICTIONS AND REQUIREMENTS IN METROPOLITAN FRANCE
[4] A nightly curfew is currently in force in metropolitan France. Between 8PM and 6AM you may only leave your residence for the following reasons and with an exemption certificate:
To check just the first entry we can use, for example, the following command:
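as.character(covid[[1]]) # one possible accessor; inspect(covid[1]) works as well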
We have built the corpus; now, in the second step, I am going to transform this text into structured text data. We are going to use the tm_map() function from "tm". In most cases it is necessary to:
- convert to lower case (content_transformer(tolower) function),
- remove punctuation (removePunctuation() function),
- remove numbers (removeNumbers() function),
- remove stopwords (words which don't add much meaning to the sentence, e.g. a, the, he, have; we will use removeWords()),
- reduce words to their stems (as a result we get the root of the word, e.g. from loving, lovingly, loved, lover and lovely we can get love). The stemDocument() function performs this step; however, I have decided not to use it for this text, because stemDocument() in R cuts some words too aggressively. There are ways to avoid this problem, but that is not the subject of this post.
- strip some whitespace (stripWhitespace() function)
Transformation of the text using tm_map():
# Convert terms to lower case
covid <- tm_map(covid, content_transformer(tolower))
# Remove punctuation
covid <- tm_map(covid, removePunctuation)
# Remove numbers
covid <- tm_map(covid, removeNumbers)
# Remove stop words from corpus
covid <- tm_map(covid, removeWords, stopwords("english"))
# Reduce terms to stems in corpus. As I said before, I decided not to use it, but this is how it would look:
# covid <- tm_map(covid, stemDocument)
# Strip whitespace from corpus
covid <- tm_map(covid, stripWhitespace)
After executing each line I checked the results by displaying the corpus content, e.g.:
# Inspect the first entry in the corpus
as.character(covid[[1]])
Note! There is also a function PlainTextDocument(), which converts a corpus to plain text documents; however, after using it, when it was time to create the word cloud I got an error:
"'i, j' invalid"
The word cloud worked after removing this line.
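For reference, the removed line probably looked roughly like this (a sketch, since the exact call is not shown above):

# covid <- tm_map(covid, PlainTextDocument) # later caused the "'i, j' invalid" error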
The next step is to build a term-document matrix, which is a table containing the frequency of the words. We will use the TermDocumentMatrix() function from the tm package.
tdm <- TermDocumentMatrix(covid) # build the term-document matrix
m <- as.matrix(tdm)
v <- sort(rowSums(m), decreasing = TRUE) # total frequency of each word, sorted
df <- data.frame(word = names(v), freq = v) # word/frequency table
Now let's check what the table looks like by using head(), or just display all the content:
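head(df, 15) # 15 is my guess at the number of rows shown below; pick any n you like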
                   word freq
information information    5
france           france    5
must               must    5
open               open    5
remain           remain    5
can                 can    4
coronavirus coronavirus    4
covid             covid    4
may                 may    4
work               work    4
public           public    4
will               will    4
virus             virus    4
find               find    3
french           french    3
I will not display the whole table here; however, when I did, I noticed that two weird words had been counted, so I decided to delete them.
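In case it is useful, here is a sketch of how such rows can be dropped; the two terms below are placeholders, not the actual words I removed:

df <- df[!(df$word %in% c("weirdword1", "weirdword2")), ] # drop rows by word (placeholder terms)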
Finally, we are ready to create the word cloud with the wordcloud() function.
wordcloud(
  words = df$word,
  freq = df$freq,
  min.freq = 1,         # include words occurring at least once
  max.words = 50,       # limit on the number of words in the cloud
  random.order = FALSE, # plot the most frequent words first
  rot.per = 0.3,        # fraction of words drawn vertically
  colors = brewer.pal(8, "Dark2"),
  scale = c(3, 0.45)    # range of font sizes
)
Let me explain: max.words is the limit on the number of words in the cloud, rot.per is the percentage of vertical text, and we also choose the colours from the RColorBrewer package.
Let's also see how the same input looks when using the wordcloud2 package. First you need to install it on your computer: devtools::install_github("lchiffon/wordcloud2")
set.seed(100) # for reproducibility
library(wordcloud2)
Now it's time to build the word cloud with wordcloud2.
Note that df is a data.frame with the words in one column and their frequencies in the other.
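A minimal call looks like this (the size value is an illustrative choice of mine, not from the original analysis):

wordcloud2(data = df, size = 0.7) # draw the interactive word cloud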
Now we can analyse the correlation between frequent terms by using the findAssocs() function.
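For example, a sketch (the term "france" and the correlation limit 0.25 are illustrative choices):

findAssocs(tdm, terms = "france", corlimit = 0.25) # terms correlated with "france" at r >= 0.25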

