
Text Analytics: Creating Word Cloud

In a word cloud, words that appear most frequently in the documents are plotted bigger and bolder than less frequently occurring words. The size, and optionally the color, of each word is determined by the frequency of that word in the documents. The R package 'wordcloud' provides functions that generate such a word cloud.
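The steps below use a few packages; a minimal setup, assuming tm, wordcloud, and RColorBrewer are already installed (for example via install.packages()):

library(tm)            # corpus creation and text transformations
library(wordcloud)     # word cloud plotting
library(RColorBrewer)  # color palettes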



A collection of documents is converted into a character vector.

sentence <- c("sentence1", "sentence2", "sentenceN")

We will use the gsub() function to remove punctuation, control characters, and numbers from these sentences.


sentence <- gsub('[[:punct:]]', ' ', sentence)

sentence <- gsub('[[:cntrl:]]', ' ', sentence)

sentence <- gsub('\\d+', ' ', sentence)
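As a small illustration with a made-up sentence (the exact spacing of the result depends on the input), the three substitutions replace punctuation, control characters, and digits with spaces:

example <- "Hello, world!\t2nd edition."
example <- gsub('[[:punct:]]', ' ', example)   # punctuation -> space
example <- gsub('[[:cntrl:]]', ' ', example)   # tab and other control characters -> space
example <- gsub('\\d+', ' ', example)          # runs of digits -> space
# example now contains only letters and spaces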



The stopwords() function from the text mining package 'tm' provides lists of English stop words. We will use the SMART list to remove stop words from the sentences later.


stopwords <-  stopwords("SMART")

Let us convert our sentence vector into a corpus.

corpus <- Corpus(VectorSource(sentence))


The tm package provides several predefined transformations that can be applied to the text using the tm_map() function; the getTransformations() function will list their names (an example call is shown after the transformations below). The tm package also provides the content_transformer() function to wrap our own text processing functions. For example, to convert the text to lower case, we can use:

corpus <- tm_map(corpus, content_transformer(tolower))

and to remove punctuation and stop words, we can use the predefined functions removePunctuation and removeWords:

corpus <- tm_map(corpus, removePunctuation)

corpus <- tm_map(corpus, removeWords, stopwords)
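To see which predefined transformations your installed version of tm provides, you can call getTransformations(); the exact list can vary by version:

getTransformations()
# typically something like: "removeNumbers" "removePunctuation" "removeWords" "stemDocument" "stripWhitespace"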


Next, we create a term-document matrix using the TermDocumentMatrix() function. This function accepts a list of control options, which we set by creating a list as shown below. Here, minWordLength discards all words whose length is less than 4 characters (newer versions of tm may expect the wordLengths option, e.g. wordLengths=c(4, Inf), instead).

control <- list(weighting=weightTf, minWordLength=4, removeNumbers=TRUE)
tdm <- TermDocumentMatrix(corpus, control=control)

To reduce the size of the tdm object for generating the word cloud, we can remove all words that appear in fewer than 5% of the documents (sentences).

tdm <- removeSparseTerms(tdm, 0.95)

If you display the tdm object as a matrix, it will print the document ids as column names and the words as row names, with the number of times each word occurs in a particular document as the values.
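One way to take a quick look (optional, just to check the intermediate result):

inspect(tdm)          # prints a summary and a sample of the term-document matrix
m <- as.matrix(tdm)   # full matrix: rows are words, columns are document ids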

Now, we sum the counts of each word across all documents to get the overall frequency of that word.



word_count <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)

word_name_frequency <- data.frame(word=names(word_count), freq=word_count)
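A quick check of the top of the frequency table (optional):

head(word_name_frequency, 10)   # the ten most frequent words and their counts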

Create a palette using the brewer.pal() function from the RColorBrewer package. Several palettes are available; for example, we are using the OrRd palette to generate 7 different colors. The function display.brewer.pal(7, "OrRd") will show all the colors of the palette used in drawing the word cloud. Remove any color from the palette that is the same as the background color.

palette <- brewer.pal(7, "OrRd")
palette <- palette[-1]   # drop the lightest shade, which blends into the white background
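To preview the palette mentioned above before plotting:

display.brewer.pal(7, "OrRd")   # shows the 7 colors of the OrRd palette in a plot window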

Open a png file and plot the word cloud. The wordcloud() function takes many arguments, but I am using the default values for most of them. wordcloud() provides arguments to control the number of words plotted, the size of the words, and the minimum frequency a word needs in order to be plotted; a sketch using some of these arguments follows the basic call below.

png(filename='wordcloud.png')


wordcloud(word_name_frequency$word,
          word_name_frequency$freq,
          random.order=FALSE,
          colors=palette)
dev.off()
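As a sketch of the extra controls mentioned above (the values chosen here are only illustrative), max.words caps how many words are drawn, min.freq drops rare words, and scale sets the size range between the biggest and smallest words:

wordcloud(word_name_frequency$word,
          word_name_frequency$freq,
          max.words=100,       # plot at most 100 words
          min.freq=3,          # skip words occurring fewer than 3 times
          scale=c(4, 0.5),     # size range from largest to smallest word
          random.order=FALSE,
          colors=palette)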


