Sentiment Analysis of My Personal Diary

Last Friday, my roommate Greg had the great idea of doing sentiment analysis (the process of determining/categorizing emotions, attitudes, and opinions in text) on my diary. I've kept a personal diary since October 12, 2012, and have written almost every day since then. As of the end of September 2017, my diary's word count is a little over 500,000. That makes for a lot of interesting personal data, so I spent most of my weekend and some of my week building a Python program that uses NLTK and TextBlob to do some analysis on my diary. This post shares some of the code that I wrote and some of the more interesting insights that I got.

I spent a while digging around for the right library, and ended up using TextBlob (which is built on top of NLTK) because I didn't want to spend the time to train a classifier. TextBlob's classifier is trained on a big dataset of movie reviews from nltk.corpus, which was good enough for me, although I may revisit this in the future.

What is the average sentiment of my diary?

The first thing I wanted to do was just read in the diary and get the average sentiment. The sentiment property in TextBlob returns a namedtuple of the form Sentiment(polarity, subjectivity). Polarity is a float in the range [-1.0, 1.0], where -1.0 is very negative and 1.0 is very positive. Subjectivity is a float in the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.
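
For a quick sense of what those numbers look like, here's a toy sentence (not from the diary):

from textblob import TextBlob

print(TextBlob('Today was a great day and I loved the gym.').sentiment)
# prints Sentiment(polarity=..., subjectivity=...); both are well above 0 for a sentence like this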

My diary is unfortunately written in an encrypted Word doc, so I had to copy and paste it :-( into another file. Once I had the file open though, making a TextBlob object out of a string is ridiculously easy.

from textblob import TextBlob

if __name__ == '__main__':
    # read the diary in as unicode (this was Python 2, hence the decode)
    with open('diary', 'r') as d:
        raw_diary = d.read().decode('utf-8')
    diary = TextBlob(raw_diary)

And once I have the TextBlob object diary, getting the polarity is a one-liner:

polarity = diary.sentiment.polarity
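
Subjectivity comes from the same namedtuple:

subjectivity = diary.sentiment.subjectivity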

The sentiment of my diary overall, from 10/12/2012 to 09/30/2017, is:

Polarity: 0.121205516
Subjectivity: 0.554559743

which is expected but pretty disappointing. Apparently the last five years (according to my diary and TextBlob) have been pretty average in general ¯\_(ツ)_/¯.

What are the most frequent words in my diary?

The next question I was interested in answering was what the most frequent words in my diary were. This is also pretty easy to do with TextBlob, since a TextBlob object supports many of the same methods as a Python string.

def get_wf(diary):
    # map each word to (count, percentage of all words in the diary)
    wf = {}
    length = len(diary.words)
    for word in diary.words:
        wf[word] = wf.get(word, 0.0) + 1
    for word in wf:
        wf[word] = (wf[word], round(wf[word] / length * 100, 5))
    return wf

def sort_wf(diary, number, time):
    # build the frequency map and sort it by percentage, descending
    word_frequency = get_wf(diary)
    sorted_words = sorted(word_frequency.items(), key=lambda x: x[1][1], reverse=True)

    # keep the top `number` words; the .csv-writing part of the full script is omitted here
    top_words = []
    for word, (tf, percentage) in sorted_words[:number]:
        top_words.append((word, tf, str(percentage) + '%'))
    return top_words

get_wf takes in a diary object and returns a map of word to (frequency, percentage). sort_wf builds that map, sorts it by percentage, and keeps the top number of words (also writing them to a .csv). The first time I ran the code, all the results were words like "the", "and", "yes", etc., so I removed filler words like that using NLTK's set of stopwords:

import re
from nltk.corpus import stopwords

def clean_words(raw_diary):
    diary = TextBlob(raw_diary)
    stop = set(stopwords.words('english'))
    cleaned_diary = []
    for word in diary.words:
        # drop stopwords, strip punctuation, and title-case what's left
        if word.lower() not in stop:
            cleaned_diary.append(re.sub(r'\W+', '', word.lower()).title())
    return TextBlob(' '.join(cleaned_diary))
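
Putting the pieces together, the overall top-50 list comes from something like this (the 'overall' tag is just an illustrative value for the time argument):

cleaned_diary = clean_words(raw_diary)
top_words = sort_wf(cleaned_diary, 50, 'overall')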

Using Excel, I graphed the results:

[Figure: WF Overall, Top 50]

Some of the more interesting results include:

  • League at #52
  • Mom at #62
  • Gym at #73
  • and finally, shit and fucking at #56 and #88 respectively

Unfortunately, most of the data here are still filler words, so I reran the code, this time using TextBlob's noun_phrases property to get just the noun phrases out of the text.

def get_wf(diary):
    wf = {}
    length = len(diary.words)
    # same as before, but count noun phrases instead of individual words
    for word in diary.noun_phrases:
    ...

(this took over 3 hours to run, so I only ran it once)

[Figure: WF Overall, Top 50 Nouns]

Some of these pictures are

[Image: senor chang.jpg]

so again I'll just share some parts of the data I found interesting:

  • long day at #21 :-(
  • league w[ith] at #42, validating core gamer status
  • went gym at #66
  • good day at #73 :-)

What is the average sentiment year by year? Month by month?

My next idea was to split the data up to graph how polarity changes over time. I split the data with some simple string searches based on how I structured my diary. All the entries for a given month are headed by [Month Year], e.g. March 2014, so to split by year I could find January [year] and split on that index.

def split_by_year(raw_diary):
    # base_year (the diary's first year) and num_years are module-level constants
    diary_by_year = []
    for i in range(1, num_years):
        year = base_year + i
        # everything before the 'January <year>' header belongs to the previous year
        idx = raw_diary.find('January ' + str(year))
        diary_by_year.append((str(year - 1), clean_words(raw_diary[:idx])))
        raw_diary = raw_diary[idx:]
    # whatever is left is the final year
    diary_by_year.append((str(base_year + num_years - 1), clean_words(raw_diary)))
    print('Number of years analyzed: ' + str(len(diary_by_year)))
    assert len(diary_by_year) == num_years
    return diary_by_year

To split by month, I searched on [month] [year] instead:

def split_by_month(raw_diary):
    # get_month_year(i) builds the '[Month Year]' header i months after the diary's start;
    # start_month, end_month, and num_years are module-level constants
    diary_by_month = []
    num_months = (num_years - 1) * 12 + (end_month - start_month)
    for i in range(1, num_months + 1):
        month_year = get_month_year(i - 1)
        # everything before the next month's header belongs to the current month
        idx = raw_diary.find(get_month_year(i))
        diary_by_month.append((month_year, clean_words(raw_diary[:idx])))
        raw_diary = raw_diary[idx:]
    diary_by_month.append((get_month_year(num_months), clean_words(raw_diary)))
    print('Number of months analyzed: ' + str(len(diary_by_month)))
    return diary_by_month

Once the data was split by month and by year, it was pretty straightforward to get the sentiment for each month/year.
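
Getting the polarity for each chunk is then just a loop over the split data; a minimal sketch:

def analyze_polarity(diary_chunks):
    # diary_chunks: a list of (label, TextBlob) pairs from split_by_year or split_by_month
    return [(label, text.sentiment.polarity) for label, text in diary_chunks]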

[Figure: Polarity Year by Year]

In general polarity seems to be trending upwards from 2012 to 2017, with the exception of 2016 to 2017. The marginal increase from 2013 to 2014 makes sense, since that was my first year of college, and the drastic drop from 2016 to 2017 can probably be accounted for by how much I enjoyed 2016 and how turbulent 2017 was, what with my PVNS and the surgery and some other major life changes.

Some parts of this data that I found interesting:

  • I started the diary on 10/12/2012, and October 2012 had an average polarity of 0.03645, which makes it one of the worst months since I started writing my diary. This aligns with the fact that I started writing the diary as an outlet for my teenage sadboy-ness.
  • One of the sharpest increases and subsequent decreases in the graph is October 2012 to December 2012, which correlates with: girl trouble -> getting over girl trouble -> more girl trouble.
  • There is a pretty deep plunge from July 2013 to August 2013, which is about when I moved to NY and started at Columbia. This makes sense since moving was stressful and starting at Columbia was tough.
  • There is another dive from November 2013 to December 2013, which I suspect is probably also from girl trouble.
  • The graph looks like a checkmark from December 2014 to about August 2015, which nicely captures my most painful semester at Columbia, when I struggled with depression and had trouble leaving my bed/room. It is very gratifying to see in the data how polarity started going up in March 2015, and how the months after are higher in polarity than the months before.
  • The graph roughly peaks from June 2016 to August 2016, which is when I was doing my internship at Riot. I really liked my internship and I had a lot of fun; it's cool to see that reflected in the data.
  • The other interesting drop (I think one of the most drastic ones) is April 2017, which is when I got my surgery and spent a week in the hospital then two weeks at home in deep pain.

What are the frequencies of specific words month to month, and how do they change?

The next thing I wanted to see was the frequency of specific words month to month. The results are in the appendix, because I didn't personally find them that insightful and I didn't do any analysis beyond eyeballing to see if there was any correlation. But here's the short code anyway:

def get_wf_for_word(word, wfs):
    # wfs is a list of (month label, word frequency map) pairs, one per month
    wf_word = []
    for wf in wfs:
        idx = wf[0]
        text = wf[1]
        # (0, 0.0) if the word never appears that month
        tf, percentage = text.get(word, (0, 0.0))
        wf_word.append((idx, word, tf, percentage))
    return wf_word
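
For example, to track how often 'Tired' shows up each month (clean_words title-cases everything, so the keys are capitalized), it would be something like:

wfs = [(month, get_wf(text)) for month, text in split_by_month(raw_diary)]
tired_by_month = get_wf_for_word('Tired', wfs)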

What is the average sentiment of every month of the year?

I had code to get the polarity month by month, so it was relatively easy to extend it to also get the polarity of each month of the year, averaged across the various months.

def analyze_sentiment_month(diary_by_month):
    # running (polarity sum, number of months seen) for each calendar month;
    # `months` is a module-level dict of month names
    monthly_sentiments = {}
    for month in months.values():
        monthly_sentiments[month] = (0.0, 0)
    for diary in diary_by_month:
        month_year = diary[0]
        text = diary[1]
        polarity = text.sentiment.polarity
        for month in months.values():
            if month in month_year:
                prev_sentiment = monthly_sentiments[month][0]
                prev_num_months = monthly_sentiments[month][1]
                monthly_sentiments[month] = (prev_sentiment + polarity, prev_num_months + 1)
                break
    # average each calendar month's polarity over the years it appears
    monthly_sentiments_aggregated = {}
    for k in monthly_sentiments.keys():
        polarity_sum, num_months = monthly_sentiments[k]
        monthly_sentiments_aggregated[k] = polarity_sum / num_months
    return monthly_sentiments_aggregated

[Figure: Polarity by Month, Aggregated]

I liked this a lot because it supports my previously ungrounded hunch that my worst months of the year were February, April, October, and December, and my best months were the summer months and November. If I had to guess, it's probably because the seasons are changing in April and October, it's cold in February and December, and that's also roughly midterm season. It would be cool to see if that changes now that I'm in LA and working. The summer months' higher polarity is probably because I've liked every internship I've done, and the weather tends to be nicer. November has also generally been an easy month in between midterms and finals for me at Columbia.

Which words become more frequent or less frequent month to month?

The last thing I tried to do was to find words that became used more or less often month to month. I calculated this by getting the word frequency map of every month, computing each word's change in frequency relative to the previous month's map, and returning the 10 greatest increases and the 10 greatest decreases.

import numpy as np

def get_new_words(wfs):
    # wfs is a list of (month label, word frequency map) pairs, in chronological order
    new_words = []
    for i in range(1, len(wfs)):
        idx = wfs[i][0]
        prev_wf = wfs[i-1][1]
        current_wf = wfs[i][1]
        word_diff = {}
        for word in current_wf.keys():
            # change in (count, percentage) vs. the previous month; missing words count as (0, 0.0)
            (diff_freq, diff_percentage) = np.subtract(current_wf[word], prev_wf.get(word, (0, 0.0)))
            word_diff[word] = (diff_freq, diff_percentage)
        new_words.append((idx, sorted(word_diff.items(), key=lambda x: x[1][1], reverse=True)))

    # build a printable report of each month's biggest movers
    # (what the full script does with `contents` is omitted here)
    contents = ''
    for nw in new_words:
        print(nw[0])
        contents += nw[0] + '\n'
        for (word, score) in nw[1][:10]:    # 10 greatest increases
            freq, percentage = score
            contents += word + ': ' + str(freq) + ' (' + str(percentage) + '%)\n'
        for (word, score) in nw[1][-10:]:   # 10 greatest decreases
            freq, percentage = score
            contents += word + ': ' + str(freq) + ' (' + str(percentage) + '%)\n'
    return new_words

Some of that data is pretty personal and there's a lot of it (20 words each for about 60 months total) so I picked some of the more interesting stuff to highlight:

  • Math in January 2013 when I started getting into math
  • Lift in July 2013 when I started working out more seriously
  • The rise and then fall in frequency of my prom date from May 2013 to June 2013
  • Frank, Yoon, Kat in August 2013 when I started making new friends at Columbia
  • Putnam in December 2013 when I was really into the Putnam exam
  • Tess in February 2014 when we became friends
  • Yao, Specs in July 2015 when I was working at OTC Markets as an intern and designing a product
  • Pebbles in November 2016 when I was catsitting the best boy ever
  • Fallout in December 2016, when I was playing the game a lot, and then dropping in January 2017
  • Read in February 2017 (the highest increase for that month) when I started reading again
  • Bus decreased in August 2017 when I moved to LA (rip public transportation)
  • Isaac in September 2017, the coworker I paired with the most in the first month of work

Conclusion:

This was a good project and it was cool to have a way to validate my memories with quantitative data from my past. It's pretty easy to add new features, and I would love any ideas or suggestions for additional analysis I can do.

I'd like to close the post with my favorite insight out of all of this data. In December 2013, when I was having some girl trouble, the polarity was -0.021784, the only negative polarity across all the months in my diary. In April 2017, when I had my surgery, was in the hospital for a week, was stuck at home for two weeks, was high out of my mind on morphine and oxycodone, and was just generally not having a good time, the polarity was 0.037985. What this means is that, at least from the perspective of my diary, comparing teenage angst against a ridiculous amount of physical suffering, I was more sad about girls than about literal tumors in my shoulder.

Appendix:

The full code is available on my GitHub here, but fair warning: I wrote it all in a few days and some of it is probably a little gnarly.

If you'd like to see the data in non graphical form, let me know and I would be happy to send you some of the tables. They're really long and Squarespace doesn't support tables for some reason (???) so I didn't include them in this post.

Some of the word frequencies for specific words that I collected:

[Figure: WF Tired vs Sad]