Naive Bayes Classifier for bag of vectorized sentences

Question

Summary: How to train a Naive Bayes Classifier on a bag of vectorized sentences?

Example:

X_train[0] = [[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1]]
y_train[0] = 1

X_train[1] = [[0, 0, 0, 0], [0, 1, 0, 1], [1, 0, 0, 1], [0, 1, 0, 1]]
y_train[1] = 0

1) Context of the project: perform sentiment analysis on a batch of tweets to perform market prediction

I am working on sentiment analysis for stock market classification. As I am new to these techniques, I tried to replicate one from this article: http://cs229.stanford.edu/proj2015/029_report.pdf

But I am facing a big issue with it. Here are the main steps of the article that I have carried out so far:

I collected a large number of tweets over 4 months (7 million)
I cleaned them
I grouped them into 1-hour intervals
I created a target that tells whether the price of Bitcoin has gone down or up one hour later (0 = down; 1 = up)
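The grouping and labeling steps above can be sketched with pandas. This is only a minimal illustration on toy data: the column names `timestamp` and `text`, the `prices` series, and the exact definition of "up" are assumptions, not taken from the article.

```python
import pandas as pd

# Toy data; the column names "timestamp" and "text" are assumptions.
tweets = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-01-01 10:05", "2022-01-01 10:40",
        "2022-01-01 11:15", "2022-01-01 12:30",
    ]),
    "text": ["i like bitcoin", "nice portfolio", "bitcoin is falling", "buy now"],
})

# Hypothetical hourly Bitcoin prices, indexed by the start of each hour.
prices = pd.Series(
    [100.0, 105.0, 103.0, 104.0],
    index=pd.to_datetime([
        "2022-01-01 10:00", "2022-01-01 11:00",
        "2022-01-01 12:00", "2022-01-01 13:00",
    ]),
    name="price",
)

# Group the tweets into 1-hour buckets: one row per hour, holding a list of tweets.
tweets["hour"] = tweets["timestamp"].dt.floor("60min")
hourly = tweets.groupby("hour")["text"].apply(list).rename("Tweet").to_frame()

# Target: 1 if the price one hour later is higher than at the start of the hour, else 0.
# The last price has no "next" value, so its hour should be dropped in real data.
went_up = (prices.shift(-1) > prices).astype(int)
hourly["target"] = went_up.reindex(hourly.index)
```

Each row of `hourly` then holds a variable-length list of tweets in `Tweet` and a single 0/1 value in `target`, which is the layout the training loop further down iterates over.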

What I need to do next is train my Bernoulli Naive Bayes model on this data. To do this, the article says to vectorize each tweet (bag of words, with values 0 or 1). I did this with the CountVectorizer class from sklearn.
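As a rough sketch of that vectorization step, the snippet below turns the tweets of one hour into 0/1 vectors. The use of `binary=True` is an assumption (the question only says the values are 0 or 1), and the sample tweets are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tweets from a single 1-hour interval (toy data).
tweets_one_hour = ["I like Bitcoin", "Nice portfolio", "Bitcoin Bitcoin to the moon"]

# binary=True records presence/absence (0 or 1) instead of word counts.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(tweets_one_hour)  # sparse matrix, one row per tweet

vocab = vectorizer.vocabulary_  # maps each word to its column index
dense = X.toarray()             # dense 2-D array of shape (n_tweets, vocabulary_size)
```

Note that the result for one hour is already a 2-D array: one row per tweet, one column per vocabulary word. With the default tokenizer, single-character tokens such as "I" are dropped.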

2 ) The issue: the dimension of the inputs doesn't match Naive Bayes standards

But then I run into an issue when I try to fit the Bernoulli Naive Bayes model following the article's method.

So, one observation is shaped this way:

input shape (one observation): (nb_tweets_on_this_1hour_interval, vocabulary_size = 10 000)

one_observation_input = [
    [0, 1, 0, ....., 0, 0],  # Tweet 1 vectorized
    ....,
    [1, 0, ....., 1, 0]      # Tweet N vectorized
]  # All of the values are 0 or 1

output shape (one observation): (1,)
one_observation_output = [0]  # Can only be 0 or 1

When I try to fit my sklearn Bernoulli Naive Bayes model with this type of value, I get this error:

>>> ValueError: Found array with dim 3. Estimator expected <= 2.

Indeed, the model expects each observation to be a single binary vector, shaped this way:

input: (nb_features,)
ex: [0, 0, 1, 0, ...., 1, 0, 1]

while I am giving it a list of binary vectors (one per tweet) for each observation!
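The shape mismatch can be reproduced on a tiny example. This is a minimal sketch using the toy values from the example at the top of the question; the exact wording of the error message may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# One hour with 3 tweets over a 4-word vocabulary.
X_hour = np.array([[0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1]])
y_hour = 1

bnb = BernoulliNB()
batch = np.array([X_hour])  # shape (1, 3, 4): one sample made of 3 vectors

try:
    bnb.partial_fit(batch, [y_hour], classes=[0, 1])
    error_message = None
except ValueError as exc:
    # scikit-learn estimators only accept 2-D input: (n_samples, n_features).
    error_message = str(exc)

print(batch.ndim, error_message)
```

The estimator sees three dimensions (samples, tweets, words) where it expects two (samples, features), so the hourly "bag" has to be reduced to rows of a 2-D matrix one way or another.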

3 ) What I have tried

So far, I have tried several things to resolve this:

Associating the hourly label with every tweet, but the results are not good since the tweets are really noisy.
Flattening the inputs so the shape of one input is (nb_tweets_on_this_1hour_interval * vocabulary_size,). But the model cannot train, because the number of tweets per hour is not constant.
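A third option, not among the attempts above, would be to collapse each hour into a single fixed-length vector, for example by marking a word as present if it appears in any tweet of that hour. The sketch below assumes this aggregation is acceptable for the task; it is not necessarily what the article does, and the toy arrays are made up.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Two hourly intervals with different numbers of tweets (toy data, 4-word vocabulary).
hours = [
    np.array([[0, 1, 0, 0], [1, 0, 0, 0]]),
    np.array([[0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 1]]),
]
targets = [1, 0]

# Logical OR over the tweet axis: 1 if the word appears in any tweet of the hour.
X = np.array([h.max(axis=0) for h in hours])  # shape (n_hours, vocabulary_size)
y = np.array(targets)

bnb = BernoulliNB()
bnb.fit(X, y)
pred = bnb.predict(X)
```

Every hour becomes one row regardless of how many tweets it contains, so the input is 2-D and each row has exactly one label. The trade-off is that word frequency within the hour is lost; summing the rows instead and using a count-based model such as MultinomialNB would be another variant to consider.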

4 ) Conclusion

I don't know whether the error comes from my misunderstanding of the article or of Naive Bayes models.

How can I efficiently train a Naive Bayes classifier on a bag of tweets?

Here is my training code :

from sklearn.naive_bayes import BernoulliNB

bnb = BernoulliNB()
uniqueY = [0, 1]  # The 2 classes I want to classify the tweets with. This is needed for partial_fit.

# I have to use a for loop to partially fit my Bernoulli Naive Bayes classifier to avoid out-of-memory issues.
for _index, row in train_df.iterrows():
    # row["Tweet"] contains all the (cleaned) tweets over a 1-hour interval, like this:
    # ["I like Bitcoin", "Nice portfolio", ...., "I am the last tweet of the interval"]
    X_train = vectorizer.transform(row["Tweet"]).toarray()
    # X_train contains all of the row["Tweet"] tweets vectorized with a bag-of-words algorithm,
    # which returns this kind of data: [[0, 1, 0, ....., 0, 0], ...., [1, 0, ....., 1, 0]]
    y_train = row["target"]  # Target is 0 if the market goes down after the tweets and 1 if it goes up
    bnb.partial_fit([X_train], [y_train], uniqueY)

I use partial_fit to avoid out-of-memory issues.

Rahul · Answer 1 · Apr 11, 2022

The error comes from wrapping X_train in a list, which adds an extra dimension. In your code:

bnb.partial_fit([X_train], [y_train], uniqueY)  # the brackets around X_train are causing your error

BernoulliNB expects an array with at most two dimensions, and putting X_train in square brackets makes it three-dimensional.

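Passing X_train directly removes the dimension error. Note that partial_fit also expects one label per row of X_train, so the hourly label has to be repeated for each tweet:

bnb.partial_fit(X_train, [y_train] * len(X_train), uniqueY)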
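For reference, here is a self-contained sketch of the whole loop with 2-D input. The toy `train_df`, the `binary=True` setting, and the choice to keep the matrix sparse instead of calling `.toarray()` are assumptions made for illustration. Each tweet receives the label of its hour, which is the per-tweet labeling the question already describes as noisy, so this addresses the shape error rather than the modeling question.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy stand-in for train_df: one row per hour, with a list of tweets and a target.
train_df = pd.DataFrame({
    "Tweet": [
        ["I like Bitcoin", "Nice portfolio", "Bitcoin to the moon"],
        ["Bitcoin is crashing", "Selling everything"],
    ],
    "target": [1, 0],
})

# Fit the vocabulary once, up front, so every batch has the same columns.
all_tweets = [t for tweets in train_df["Tweet"] for t in tweets]
vectorizer = CountVectorizer(binary=True)
vectorizer.fit(all_tweets)

bnb = BernoulliNB()
classes = [0, 1]  # must be given on the first call to partial_fit
for _index, row in train_df.iterrows():
    X_train = vectorizer.transform(row["Tweet"])         # 2-D sparse: (n_tweets, vocab)
    y_train = np.full(X_train.shape[0], row["target"])   # one label per tweet
    bnb.partial_fit(X_train, y_train, classes=classes)

pred = bnb.predict(vectorizer.transform(["Bitcoin to the moon"]))
```

BernoulliNB accepts sparse matrices, so skipping `.toarray()` may also help with the memory pressure that motivated `partial_fit` in the first place.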
