Using Scikit-Learn and NLTK for Sentiment Analysis
Sentiment analysis is a way of categorizing text into subgroups based on the opinion or sentiment expressed in the text. For example, we may want to categorize people's reviews or comments about a movie to determine how many liked the movie and how many didn't.
In supervised sentiment analysis, we have training data that is already categorized into different groups, for example, into 'positive' or 'negative' sentiments. We use this training data to teach our model what makes a text belong to a specific group. By text I mean a sentence or a paragraph. Using these labeled sentences, we are going to build a model.
So, let us say we have the following training text:
training_positive = list()
training_positive.append("bromwell high is a nice cartoon comedy perfect for family")
training_positive.append("homelessness or houselessness as george carlin stated as nice movie")
training_negative = list()
training_negative.append("story of a man who has unnatural feelings with plain")
training_negative.append("airport starts as a brand new luxury plane with ludicrous")
We need to convert the above text into a vector representation. To do that, we need to determine which feature words represent the positive training text and which represent the negative training text. We can use a heuristic to pick the feature words. For example, one heuristic could be to pick the top 1% of the most frequent words in the positive and negative training text combined. Let us say those most frequent words are:
feature_words = ['nice', 'perfect', 'plain', 'ludicrous', 'is', 'the']
The NLTK Python library provides a list of stop words in English. We use this list to trim our feature words list.
import nltk

stopwords = nltk.corpus.stopwords.words('english') # may require a one-time nltk.download('stopwords')
After removing stop words, the updated feature words list looks like this:
feature_words = ['nice', 'perfect', 'plain', 'ludicrous']
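For completeness, here is one way this frequency-based selection could be implemented. This is only a sketch: the pick_feature_words helper and the 1% cutoff are illustrative choices of mine, and a realistically large corpus is assumed.

from collections import Counter
import nltk

def pick_feature_words(texts, fraction=0.01):
    # Tokenize all training text into lowercase words.
    words = [w for text in texts for w in text.lower().split()]
    # Drop English stop words before counting frequencies.
    stop = set(nltk.corpus.stopwords.words('english'))
    counts = Counter(w for w in words if w not in stop)
    # Keep the top fraction of the most frequent remaining words.
    top_n = max(1, int(len(counts) * fraction))
    return [w for w, _ in counts.most_common(top_n)]

feature_words = pick_feature_words(training_positive + training_negative)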
These feature words form our feature space. Now, each positive or negative training text will be converted into a vector based on the presence of the feature words. For example, the first sentence in the positive training text contains 'nice' and 'perfect', whereas the second sentence contains only 'nice' from the feature words list. So,
pos_feature_vec = list()
pos_feature_vec.append([1,1,0,0]) # 'nice' and 'perfect' are found in the first positive sentence
pos_feature_vec.append([1,0,0,0]) # only 'nice' is found in the second positive sentence
Similarly, we convert the negative training text into feature vectors.
neg_feature_vec = list()
neg_feature_vec.append([0,0,1,0]) # only 'plain' is found in the first negative sentence
neg_feature_vec.append([0,0,0,1]) # only 'ludicrous' is found in the second negative sentence
vec_list = pos_feature_vec + neg_feature_vec
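Writing these vectors by hand does not scale, so in practice we would generate them with a small helper. The following is a sketch; text_to_feature_vector is my own name for it:

def text_to_feature_vector(text, feature_words):
    # Mark 1 if the feature word occurs in the text, 0 otherwise.
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in feature_words]

pos_feature_vec = [text_to_feature_vector(t, feature_words) for t in training_positive]
neg_feature_vec = [text_to_feature_vector(t, feature_words) for t in training_negative]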
Now, once all the positive and negative training text is converted into feature vectors, we are ready to train our model. We can use a Bernoulli naive Bayes model or logistic regression. To build a model, we need to label each positive and negative sentence. Let us say we label each positive training text as 'positive' and each negative training text as 'negative'.
label = ['positive', 'positive', 'negative', 'negative'] # one label per vector, in the same order as vec_list
import sklearn.naive_bayes

bernoulli_model = sklearn.naive_bayes.BernoulliNB(alpha=1.0, binarize=None) # binarize=None since our vectors are already 0/1
bernoulli_model.fit(vec_list, label)
To predict using this model, create a feature vector for the new text and pass it to the model's predict method. The result will be 'positive' or 'negative'.
result = bernoulli_model.predict([test_pos_vec]) # predict expects a list of vectors
A similar model can be built with logistic regression using sklearn.linear_model.LogisticRegression().
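Only the classifier object changes; the feature vectors and labels stay the same. A minimal sketch, reusing the illustrative text_to_feature_vector helper from above:

import sklearn.linear_model

logistic_model = sklearn.linear_model.LogisticRegression()
logistic_model.fit(vec_list, label)
result = logistic_model.predict([text_to_feature_vector("a nice and perfect movie", feature_words)])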
Using Neural Networks for Sentiment Analysis
The gensim Python package provides a neural network implementation for converting labeled sentences into feature vectors. First, we convert each sentence into a 'LabeledSentence' object. Note that LabeledSentence expects a list of words rather than a raw string, so we split each sentence. Let us convert our positive training sentences into 'LabeledSentence' objects:
from gensim.models.doc2vec import Doc2Vec, LabeledSentence

label_pos_0 = LabeledSentence(training_positive[0].split(), ["TRAIN_POS_0"])
label_pos_1 = LabeledSentence(training_positive[1].split(), ["TRAIN_POS_1"])
Similarly, we convert the negative sentiment text into labeled objects.
label_neg_0 = LabeledSentence(training_negative[0].split(), ["TRAIN_NEG_0"])
label_neg_1 = LabeledSentence(training_negative[1].split(), ["TRAIN_NEG_1"])
sentences = [label_pos_0, label_pos_1, label_neg_0, label_neg_1]
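With more than a handful of training sentences, it is easier to build this list in a loop. The following sketch keeps the same tag naming scheme:

sentences = []
for i, text in enumerate(training_positive):
    sentences.append(LabeledSentence(text.split(), ["TRAIN_POS_%d" % i]))
for i, text in enumerate(training_negative):
    sentences.append(LabeledSentence(text.split(), ["TRAIN_NEG_%d" % i]))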
We initialize the Doc2Vec model. I initialize it with a feature vector 'size' of 100 and a context window of 10, which is how far to look to the left and right of a word in the sentence to determine its context. 'min_count' is the minimum frequency below which a word is ignored.
model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=4)
model.build_vocab(sentences)
Next comes training the model. To allow for faster convergence of the stochastic gradient descent in the neural network, we will do multiple passes (epochs) over our training data, shuffling it on each epoch.
I have picked 5 epochs to train the model. We could pick a higher or lower number depending on how much time it takes to train the model.
import random

for i in range(5):
    random.shuffle(sentences)
    model.train(sentences)
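Note that the code above follows the older gensim API. In recent gensim releases, LabeledSentence is replaced by TaggedDocument, 'size' is renamed 'vector_size' (gensim 4.x), per-document vectors live under model.dv instead of model.docvecs, and train() requires explicit counts. A rough equivalent, assuming gensim 4.x, would be:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(text.split(), ["TRAIN_POS_%d" % i])
        for i, text in enumerate(training_positive)]
docs += [TaggedDocument(text.split(), ["TRAIN_NEG_%d" % i])
         for i, text in enumerate(training_negative)]

model = Doc2Vec(min_count=1, window=10, vector_size=100, sample=1e-4, negative=5, workers=4)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=5) # epochs here replaces the manual loop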
Now, we can get the vector for each labeled sentence using the docvecs property of the model. So, to get the vector for 'TRAIN_POS_0', do:
train_pos_vec0 = model.docvecs["TRAIN_POS_0"]
Similarly, we can look up the vectors for the other labeled sentences. The size of each vector is 100, as we specified while building the model. The value of each dimension is a real number. The vector for train_pos_vec0 might look like this:
print(train_pos_vec0)
[-1.92423627e-01, -7.73756266e-01, 4.07620847e-01,
-5.17467201e-01, 6.34539545e-01, -2.01772735e-01, ...]
train_pos_vec1 = model.docvecs["TRAIN_POS_1"]
train_neg_vec0 = model.docvecs["TRAIN_NEG_0"]
train_neg_vec1 = model.docvecs["TRAIN_NEG_1"]
vec_list = [train_pos_vec0, train_pos_vec1, train_neg_vec0, train_neg_vec1]
Next, we build a classifier from this training data. Since the Doc2Vec vectors are real-valued rather than binary, a Gaussian naive Bayes model fits them better than the Bernoulli one:
naive_bayes_gauss_model = sklearn.naive_bayes.GaussianNB()
naive_bayes_gauss_model.fit(vec_list, label) # label was defined above
Prediction can be done using the 'infer_vector' function of the Doc2Vec model by passing the tokenized text:
predict_vector = model.infer_vector("This movie was very good but it is not good to watch with family".split())
And then we use the classifier to predict the label or classification group this vector belongs to:
naive_bayes_gauss_model.predict([predict_vector]) # wrap predict_vector in a list before passing
The output should be similar to:
array(['negative'],
      dtype='|S8')