How does one go about feature extraction for training labelled tweets for sentiment analysis?

Posted on 16 Aug 2022.

Answers (2)

manpreet · Best Answer · 2 years ago

I am currently working on a sentiment analysis project where the end goal is to try and predict who wins the Nigerian 2019 presidential election. My goal is to create a model that analyses tweets and detects the tone of a tweet (whether positive, negative or neutral). This model can in turn be used to predict who wins. Here's what I have done so far.

  1. I have a program running on a server streaming tweets daily.
  2. I have manually labelled about 5000 tweets into different categories (positive, negative and neutral).
  3. I previously used TextBlob to determine the sentiment of each tweet, but it was not accurate enough.
  4. I was then advised to train a model on the way Nigerians speak on Twitter, which can sometimes differ a little from standard English.

From my research, I found out I have to perform feature extraction before I can train the model with my labelled data. My problem is that I don't understand how to go about this feature extraction. I have read material about what people use (bag of words, bigrams, unigrams, part-of-speech tags, etc.), but I don't understand the explanations I've found so far, how it all comes together, or which technique to use. Could anyone please explain these in layman's terms, or point me to resources that explain them in layman's terms? Thank you.

manpreet · 2 years ago

Feature extraction is an important step when dealing with natural language, because the text you've collected isn't in a form a computer can understand. If you have a tweet that goes something like

I do not like the views of @Candidate1 on #Topic1. Too conservative!! I can't stand it!

then we can't just feed this into a learning algorithm. We need to convert it into a proper format, so we perform pre-processing on our data.

To start, we might want to tokenize the tweet. Tokenizing is a form of segmentation: splitting the text into pieces. If you're familiar with Python, ready-made libraries (like NLTK) can help. Depending on how your tokenizer is built, it could transform the previous tweet into something like

['I', 'do', 'not', 'like', 'the', 'views', 'of', '@', 'Candidate1', 'on', '#', 'Topic1', '.', 'Too', 'conservative', '!', '!', 'I', 'can', "'", 't', 'stand', 'it', '!']

I manually segmented the tweet with the rule that words separated by a space should be tokens and punctuation should be tokenized separately. The way you build a tokenizer (or adjust the settings of a ready-made one) will determine the output from a tweet. Notice how '@' and 'Candidate1' are separated? A tokenizer for regular text might not be able to identify that this is a social media entity -- a user mention. If you adjust your tokenizer to account for social media entities and contractions (like "can't"), you could produce a list of tokens like this

['I', 'do', 'not', 'like', 'the', 'views', 'of', '@Candidate1', 'on', '#Topic1', '.', 'Too', 'conservative', '!', '!', 'I', "can't", 'stand', 'it', '!']
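The segmentation above can be sketched in Python. NLTK ships a TweetTokenizer built for exactly this; the version below instead uses a small hand-rolled regex so it runs with no dependencies — the pattern (mentions/hashtags first, then contractions, then words, then single punctuation marks) is my own simplification, not NLTK's actual rules.

```python
import re

# Order of alternatives matters: try @mentions/#hashtags first, then
# contractions like "can't", then plain words, then single punctuation.
TOKEN_RE = re.compile(r"[@#]\w+|\w+'\w+|\w+|[^\w\s]")

def tokenize(tweet):
    return TOKEN_RE.findall(tweet)

tweet = ("I do not like the views of @Candidate1 on #Topic1. "
         "Too conservative!! I can't stand it!")
print(tokenize(tweet))
# ['I', 'do', 'not', 'like', 'the', 'views', 'of', '@Candidate1', 'on',
#  '#Topic1', '.', 'Too', 'conservative', '!', '!', 'I', "can't",
#  'stand', 'it', '!']
```

For a real project you'd likely prefer `nltk.tokenize.TweetTokenizer`, which also handles emoticons and repeated characters.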

Now, you mentioned bigrams and unigrams. An n-gram (e.g. a 1-gram is a unigram) is just a contiguous sequence of n tokens. So what we produced a while ago was just a list of unigrams. If you want bigrams, you'd take 2 tokens at a time. A bigram list for the tweet would look like

['I do', 'do not', 'not like', 'like the', 'the views', 'views of', 'of @Candidate1', '@Candidate1 on', 'on #Topic1', '#Topic1 .', '. Too', 'Too conservative', 'conservative !', '! !', '! I', "I can't", "can't stand", 'stand it', 'it !']
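Generating n-grams is a one-liner over the token list. A minimal sketch (NLTK also has `nltk.ngrams`, which yields tuples rather than joined strings):

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list and join each window.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['I', 'do', 'not', 'like', 'it']
print(ngrams(tokens, 2))  # ['I do', 'do not', 'not like', 'like it']
print(ngrams(tokens, 3))  # ['I do not', 'do not like', 'not like it']
```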

Notice how words repeat? Imagine what a 3-gram or 5-gram would look like.

Before anything else, why would we use bigrams over unigrams, or n-grams of even higher n? Well, the higher the n, the more word order you're able to capture, and sometimes order is an important factor in learning. Playing around with how you represent your data might reveal important features.

Now that we have our text tokenized, we can start extracting features! We can turn our example tweet, and other text samples, into a Bag-of-Words (BOW) model. Think of a BOW as a table whose column headers are the words/terms and whose rows are your text samples/tweets. A cell then contains how many times a term occurs in a given sample. Starting with raw counts, the example tweet would give you something like

tweet1: {
   'I': 2,
   'do': 1,
   'not': 1,
   'like': 1,
   'the': 1,
   ...
   '!': 3,
   ... 
}

I don't want to write it all out manually, but I hope you get the picture. 'I' is 2 because it appears twice in the sample, and '!' appears three times, so its value is 3. You'll find a lot of 1 values, especially in tweets, because there isn't room for much to be written.
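Counting terms for one tweet is exactly what Python's `collections.Counter` does:

```python
from collections import Counter

tokens = ['I', 'do', 'not', 'like', 'the', 'views', 'of', '@Candidate1',
          'on', '#Topic1', '.', 'Too', 'conservative', '!', '!', 'I',
          "can't", 'stand', 'it', '!']

counts = Counter(tokens)
print(counts['I'])  # 2
print(counts['!'])  # 3
```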

You'd do this for each of your tweets and you'll come up with something like

       | 'I' | '!' | '@Candidate1' | ...
tweet1 |  2  |  3  |       1       | ...
tweet2 |  0  |  0  |       0       | ...
...
tweetn |     |     |               |
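Building that table amounts to fixing a vocabulary and producing one count vector per tweet. A minimal sketch with a made-up two-tweet corpus (in practice, scikit-learn's `CountVectorizer` builds this document-term matrix for you):

```python
from collections import Counter

# Hypothetical pre-tokenized tweets.
tweets_tokens = [
    ['I', 'do', 'not', 'like', 'it', '!'],  # tweet1
    ['@Candidate1', 'has', 'my', 'vote'],   # tweet2
]

# Vocabulary: every distinct token across all tweets, in a fixed order.
vocab = sorted({tok for tokens in tweets_tokens for tok in tokens})

# One count vector per tweet, aligned to the vocabulary columns.
matrix = []
for tokens in tweets_tokens:
    counts = Counter(tokens)
    matrix.append([counts[term] for term in vocab])

print(vocab)
print(matrix)
```

Each row of `matrix` is the feature vector for one tweet; these rows, paired with your manual labels, are what you'd feed to a classifier.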