Full Lecture W1


Representing Text as a Vector


Positive and negative frequency counts


Feature extraction with frequencies

Sum the frequencies of only the unique words in the tweet: each word's count is added once, no matter how many times the word appears.

Given a corpus with positive and negative tweets as follows:


You have to encode each tweet as a vector. Previously, this vector was of dimension V (the vocabulary size). Now, as you will see in the upcoming videos, you will represent it with a vector of dimension 3. To do so, you have to create a dictionary that maps each (word, class) pair, where the class is positive or negative, to the number of times that word appeared in tweets of that class.
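The dictionary described above can be sketched as follows. This is a simplified illustration: the tweets, labels, and the `build_freqs` helper name are made-up examples, and real preprocessing in the course also lowercases, stems, and strips stopwords and punctuation, while here tweets are only lowercased and split on whitespace.

```python
from collections import defaultdict

def build_freqs(tweets, labels):
    """Map each (word, label) pair to the number of times that word
    appears in tweets of that class (1 = positive, 0 = negative)."""
    freqs = defaultdict(int)
    for tweet, label in zip(tweets, labels):
        for word in tweet.lower().split():
            freqs[(word, label)] += 1
    return dict(freqs)

# Toy two-tweet corpus, for illustration only.
tweets = ["I am happy because I am learning NLP",
          "I am sad I am not learning NLP"]
labels = [1, 0]

freqs = build_freqs(tweets, labels)
print(freqs[("happy", 1)])  # 1
print(freqs[("i", 0)])      # 2
```

Each lookup `freqs[(word, class)]` then gives the count of that word in the corresponding class.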


In the past two videos, we called this dictionary `freqs`. In the table above, you can see how words like "happy" and "sad" tend to take clear sides, while words like "I" and "am" tend to be more neutral. Given this dictionary and the tweet "I am sad, I am not learning NLP", you can create a vector corresponding to the positive feature as follows:


To encode the negative feature, you do the same thing with the negative frequencies.
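The full feature extraction can be sketched as below. The function name `extract_features` and the tokenization (lowercase, whitespace split, punctuation stripped) are assumptions for illustration; the `freqs` values are taken from the example table in the lecture.

```python
def extract_features(tweet, freqs):
    """Encode a tweet as [bias, sum of positive freqs, sum of negative freqs],
    counting each unique word's frequency only once."""
    words = {w.strip(",.!?") for w in tweet.lower().split()}  # unique words
    pos = sum(freqs.get((w, 1), 0) for w in words)
    neg = sum(freqs.get((w, 0), 0) for w in words)
    return [1, pos, neg]

# Frequency table from the lecture example (1 = positive, 0 = negative).
freqs = {("i", 1): 3, ("i", 0): 3, ("am", 1): 3, ("am", 0): 3,
         ("happy", 1): 2, ("because", 1): 1,
         ("learning", 1): 1, ("learning", 0): 1,
         ("nlp", 1): 1, ("nlp", 0): 1,
         ("sad", 0): 2, ("not", 0): 1}

print(extract_features("I am sad, I am not learning NLP", freqs))
# [1, 8, 11]
```

Here the positive feature is 3 + 3 + 1 + 1 = 8 (from "i", "am", "learning", "nlp") and the negative feature is 3 + 3 + 2 + 1 + 1 + 1 = 11, so the tweet is represented by the 3-dimensional vector [1, 8, 11].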