Naive Bayes classifier bases decision only on a-priori probabilities

I'm trying to classify tweets according to their sentiment into three categories (Buy, Hold, Sell). I'm using R and the package e1071.

I have two data frames: one training set and one set of new tweets whose sentiment needs to be predicted.

trainingset data frame:

**text** | **sentiment**
*this stock is a good buy* | Buy
*markets crash in tokyo* | Sell
*everybody excited about new products* | Hold


Now I want to train the model using the tweet text trainingset[,2] and the sentiment category trainingset[,4].

library(e1071)
classifier <- naiveBayes(trainingset[, 2], as.factor(trainingset[, 4]), laplace = 1)

Looking into the elements of classifier with


I find that the conditional probabilities are calculated. There are different probabilities for every tweet concerning Buy, Hold, and Sell. So far so good.

However when I predict the training set with:

predict(classifier, trainingset[,2], type="raw")

I get a classification that is based only on the a-priori probabilities, which means every tweet is classified as Hold (because "Hold" had the largest share among the sentiments). So every tweet has the same probabilities for Buy, Hold, and Sell:


**Id** | **Buy** | **Hold** | **Sell**
1 | 0.25 | 0.5 | 0.25
2 | 0.25 | 0.5 | 0.25
3 | 0.25 | 0.5 | 0.25
.. | .. | .. | ..
N | 0.25 | 0.5 | 0.25


Any ideas what I'm doing wrong? Appreciate your help!

Mar 23, 2022 in Machine Learning by Dev

1 answer to this question.

You seem to have trained the model with complete tweets as inputs, whereas you presumably want to use the individual words as input features.

This is how it is used:

## S3 method for class 'formula'
naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass)
## Default S3 method:
naiveBayes(x, y, laplace = 0, ...)

## S3 method for class 'naiveBayes'
predict(object, newdata,
  type = c("class", "raw"), threshold = 0.001, ...)

Arguments:

 x: A numeric matrix, or a data frame of categorical and/or numeric variables.

 y: Class vector.

(Taken from the R documentation)
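To make the two `type` options of `predict` concrete, here is a minimal base R sketch. It does not call e1071 at all: `posterior` is a hypothetical matrix shaped like the result of `type = "raw"` (values copied from the table in the question), and `predicted` is an illustrative name. The class prediction corresponds to picking, for each row, the class with the highest probability.

```r
# Hypothetical posterior matrix, shaped like the result of predict(..., type = "raw"):
# one row per tweet, one column per class (values copied from the question)
posterior <- matrix(
  c(0.25, 0.5, 0.25,
    0.25, 0.5, 0.25),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("Buy", "Hold", "Sell"))
)

# type = "class" corresponds to choosing the most probable class in each row
predicted <- factor(
  colnames(posterior)[max.col(posterior, ties.method = "first")],
  levels = colnames(posterior)
)
print(predicted)
```

If every row of the raw output is identical, as in the question, every tweet necessarily ends up in the same class.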

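Try training the Naive Bayes classifier like this:

library(e1071)
# Each complete sentence is one value of a single categorical feature
x <- c("johny likes almonds", "maria likes dogs and johny")
y <- as.factor(c("good", "bad"))
bayes <- naiveBayes(x, y)

The classifier then recognizes only these two complete sentences: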

Naive Bayes Classifier for Discrete Predictors

naiveBayes.default(x = x,y = y)

A-priori probabilities:
 bad good 
 0.5  0.5 

Conditional probabilities:
y      johny likes almonds maria likes dogs and johny
  bad                0                         1
  good               1                         0

To get a word-level classifier, run it with words as inputs:

# One word per observation, labeled with the class of the sentence it came from
x <-           c("johny", "likes", "almonds", "maria", "likes", "dogs", "and", "johny")
y <- as.factor(c("good",  "good",  "good",    "bad",   "bad",   "bad",  "bad", "bad"))
bayes <- naiveBayes(x, y)

The Output

Naive Bayes Classifier for Discrete Predictors

naiveBayes.default(x = x,y = y)

A-priori probabilities:
 bad good 
 0.625 0.375 

Conditional probabilities:
y            and    almonds     dogs     johny     likes     maria
  bad  0.2000000 0.0000000 0.2000000 0.2000000 0.2000000 0.2000000
  good 0.0000000 0.3333333 0.0000000 0.3333333 0.3333333 0.0000000
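The numbers in this output can be checked by hand. The following is a sketch in base R only, reproducing the arithmetic for the unsmoothed case (`laplace = 0`); it is not e1071's internal code, and `priors` and `conditional` are illustrative names. Note that `table` orders the word columns alphabetically, so the column order may differ from the listing above.

```r
x <- c("johny", "likes", "almonds", "maria", "likes", "dogs", "and", "johny")
y <- factor(c("good", "good", "good", "bad", "bad", "bad", "bad", "bad"))

# A-priori probabilities: share of each class among the training observations
priors <- prop.table(table(y))

# Conditional probabilities P(word | class): each row of the count table
# divided by its row total (no Laplace smoothing)
conditional <- prop.table(table(y, x), margin = 1)

print(priors)
print(round(conditional, 3))
```

Five of the eight observations are labeled "bad", which gives the prior 5/8 = 0.625, and each of the three "good" words accounts for 1/3 of its class.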

In my opinion, R is not well suited for processing NLP data in general; Python (or at the very least Java) would be a far better choice.

The strsplit function can be used to split a sentence into words.

unlist(strsplit("johny likes almonds"," "))
[1] "johny"   "likes"   "almonds"
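Putting the pieces together for tweets, one possible approach (a sketch under assumptions, not the only way to do it) is to turn every tweet into a row of word-presence features and train on that data frame. The toy data, the helper `to_features`, and the new example tweet are all hypothetical; real tweets would also need better tokenization than splitting on spaces. The model fit is skipped if e1071 is not installed.

```r
# Hypothetical toy data standing in for the real training set
train <- data.frame(
  text = c("this stock is a good buy",
           "markets crash in tokyo",
           "everybody excited about new products"),
  sentiment = factor(c("Buy", "Sell", "Hold")),
  stringsAsFactors = FALSE
)

# Split each tweet into lower-case words and collect the vocabulary
tokens <- strsplit(tolower(train$text), " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# One column per vocabulary word; each cell records whether the tweet contains it.
# Fixed factor levels keep training and new data aligned.
to_features <- function(token_list, vocab) {
  cols <- lapply(vocab, function(w) {
    present <- vapply(token_list, function(tk) w %in% tk, logical(1))
    factor(ifelse(present, "yes", "no"), levels = c("no", "yes"))
  })
  names(cols) <- vocab
  data.frame(cols, check.names = FALSE)
}

features <- to_features(tokens, vocab)

if (requireNamespace("e1071", quietly = TRUE)) {
  classifier <- e1071::naiveBayes(features, train$sentiment, laplace = 1)

  # New tweets must be encoded with the same vocabulary and column names
  new_tokens <- strsplit(tolower("new products are a good buy"), " ", fixed = TRUE)
  print(predict(classifier, to_features(new_tokens, vocab), type = "raw"))
}
```

Because the new data has the same column names as the training features, the prediction can draw on the per-word conditional tables. Words outside the training vocabulary (such as "are" here) are simply ignored.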
answered Mar 25, 2022 by Nandini

