Sentiment Analysis of My Personal Diary

Last Friday, my roommate Greg had the great idea of doing sentiment analysis (the process of determining/categorizing emotions, attitudes, and opinions in text) on my diary. I've kept a personal diary since October 12, 2012, and have written almost every day since. As of the end of September 2017, my diary's word count is a little over 500,000. That makes for a lot of interesting personal data, so I spent most of my weekend and some of my week building a Python program that uses NLTK and TextBlob to do some analysis on my diary. This post shares some of the code that I wrote and some of the more interesting insights that I got.

I spent a while digging around for the right library, and ended up using TextBlob (which is built on top of NLTK) because I didn't want to spend the time to train a classifier. TextBlob's classifier is trained on a big dataset of movie reviews from nltk.corpus, which was good enough for me, although I may revisit this in the future.

What is the average sentiment of my diary?

The first thing I wanted to do was to just read in the diary and get the average sentiment. The sentiment property in TextBlob returns a tuple of the form (polarity, subjectivity).  Polarity is a float within the range [-1.0, 1.0], where -1.0 is very negative and 1.0 is very positive. Subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective. 

My diary is unfortunately written in an encrypted Word doc, so I had to copy and paste it into another file :-(. Once I had the file open though, making a TextBlob object out of a string is ridiculously easy.

from textblob import TextBlob

if __name__ == '__main__':
    with open('diary', 'r', encoding='utf-8') as d:
        raw_diary = d.read()
    diary = TextBlob(raw_diary)

and once I have the TextBlob object diary, getting the polarity is a one liner:

polarity = diary.sentiment.polarity

The sentiment of my diary overall, from 10/12/2012 to 09/30/2017, is:

Polarity: 0.121205516
Subjectivity: 0.554559743

which is expected but pretty disappointing. Apparently the last 5 years (according to my diary and TextBlob) have been pretty average in general ¯\_(ツ)_/¯.

What are the most frequent words in my diary?

The next question I was interested in answering was what the most frequent words in my diary were. This is also pretty easy to do with TextBlob, since a TextBlob object has the same methods as a python string. 

def get_wf(diary):
    wf = {}
    length = len(diary.words)
    for word in diary.words:
        wf[word] = wf[word]+1 if word in wf else 1.0
    for word in wf:
        wf[word] = (wf[word], round(wf[word]/length*100,5))
    return wf

def sort_wf(diary, number, time):
    word_frequency = get_wf(diary)

    sorted_words = sorted(word_frequency.items(), key=lambda x: x[1][1], reverse=True)
    # 'time' labels the output csv (e.g. 'overall' or a month/year)
    with open('wf_' + time + '.csv', 'w') as f:
        for word, (tf, percentage) in sorted_words[:number]:
            f.write(word + ',' + str(tf) + ',' + str(percentage) + '%\n')
    return sorted_words[:number]

get_wf takes in a diary object and returns a map of word to (frequency, percentage). sort_wf takes that map, sorts it by percentage, and gets the top number of words (also writing it to a .csv). The first time I ran the code all the results were words like "the", "and", "yes," etc., so I removed filler words like that using NLTK's set of stopwords:
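As a sketch of what get_wf computes, here is a stdlib-only version (the name word_frequencies and the sample words are made up for illustration) that builds the same word -> (count, percentage) map from a plain list of words instead of a TextBlob:

```python
# Stdlib-only sketch of get_wf: map each word to (count, percentage of total).
def word_frequencies(words):
    wf = {}
    for word in words:
        wf[word] = wf.get(word, 0) + 1
    total = len(words)
    return {w: (c, round(c / total * 100, 5)) for w, c in wf.items()}

freqs = word_frequencies(['gym', 'league', 'gym', 'mom'])
# 'gym' appears 2 times out of 4 words, so freqs['gym'] == (2, 50.0)
```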

import re

from nltk.corpus import stopwords
from textblob import TextBlob

def clean_words(raw_diary):
    diary = TextBlob(raw_diary)
    stop = set(stopwords.words('english'))
    cleaned_diary = []
    for word in diary.words:
        if word.lower() not in stop:
            cleaned_diary.append(re.sub(r'\W+', '', word.lower()).title())
    return TextBlob(' '.join(cleaned_diary))
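To illustrate the filtering, here is a toy version with a tiny hardcoded stop set (the clean helper and STOP set are invented for the example; the real code uses NLTK's much larger English stopword list):

```python
import re

# Toy stopword filter: drop stop words, strip non-word chars, Title Case.
STOP = {'i', 'to', 'the', 'and', 'a'}

def clean(text):
    kept = []
    for word in text.split():
        if word.lower() not in STOP:
            kept.append(re.sub(r'\W+', '', word.lower()).title())
    return ' '.join(kept)

print(clean('I went to the gym and played league'))  # Went Gym Played League
```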

Using Excel, I graphed the results:

WF Overall, Top 50

Some of the more interesting results include:

  • League at #52
  • Mom at #62
  • Gym at #73
  • and finally, shit and fucking at #56 and #88 respectively

Unfortunately, most of the data here are still filler words, so I reran the code, this time using TextBlob's noun_phrases method to just get the noun phrases in a TextBlob. 

def get_wf(diary):
    wf = {}
    length = len(diary.words)
    for phrase in diary.noun_phrases:
        wf[phrase] = wf[phrase]+1 if phrase in wf else 1.0
    # ...the rest is unchanged from the word-frequency version above

(this took over 3 hours to run, so I only ran it once)

WF Overall Top 50 Nouns.png

Some of these pictures are

senor chang.jpg

so again I'll just share some parts of the data I found interesting:

  • long day at #21 :-(
  • league w[ith] at #42, validating core gamer status
  • went gym at #66
  • good day at #73 :-)

What is the average sentiment year by year? Month by month?

My next idea was to split the data up to graph how polarity changes over time. I split the data using some regex based on how I structured my diary. All entries for each month are headed by [Month Year], e.g. March 2014, so for splitting by year, I could do a find on January [year] and split based on that index.

# base_year (2012) and num_years are module level constants
def split_by_year(raw_diary):
    diary_by_year = []
    for i in range(1, num_years):
        year = base_year + i    # the first January after this year's entries
        idx = raw_diary.find('January ' + str(year))
        diary_by_year.append( (str(year-1), clean_words(raw_diary[:idx])) )
        raw_diary = raw_diary[idx:]
    diary_by_year.append( (str(base_year+num_years-1), clean_words(raw_diary)) )
    print('Number of years analyzed: ' + str(len(diary_by_year)))
    assert len(diary_by_year) == num_years
    return diary_by_year

For months, I did a find on [month] [year] instead:

# get_month_year(i) returns the '[Month Year]' header i months after the diary's start
def split_by_month(raw_diary):
    diary_by_month = []
    num_months = (num_years-1)*12 + (end_month-start_month)
    for i in range(1,num_months+1):
        month_year = get_month_year(i-1)
        idx = raw_diary.find(get_month_year(i))
        diary_by_month.append((month_year, clean_words(raw_diary[:idx])))
        raw_diary = raw_diary[idx:]
    diary_by_month.append( (get_month_year(num_months), clean_words(raw_diary)) )
    print('Number of months analyzed: ' + str(len(diary_by_month)))
    return diary_by_month
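The find-and-slice mechanic both splitters rely on can be shown on a made-up three-month mini diary (the strings and headers below are invented for illustration):

```python
# Split a toy "diary" at its month headers using str.find, as in split_by_month.
raw = 'October 2012 sad stuff November 2012 better stuff December 2012 ok'
headers = ['November 2012', 'December 2012']

chunks = []
for header in headers:
    idx = raw.find(header)           # start of the next month's entries
    chunks.append(raw[:idx].strip()) # everything before it is the current month
    raw = raw[idx:]
chunks.append(raw.strip())           # whatever remains is the last month
# chunks == ['October 2012 sad stuff', 'November 2012 better stuff', 'December 2012 ok']
```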

Once the data was split by month and by year, it was pretty straightforward to get the sentiment for each month/year.

Polarity Year by Year.png

In general polarity seems to be trending upwards from 2012 to 2017, with the exception of 2016 to 2017. The marginal increase from 2013 to 2014 makes sense, since that was my first year of college, and the drastic drop from 2016 to 2017 can probably be accounted for by how much I enjoyed 2016 and how turbulent 2017 was, what with my PVNS and the surgery and some other major life changes.

Some parts of this data that I found interesting:

  • I started the diary 10/12/2012, and October 2012 had an average polarity of 0.03645, which makes October one of the worst months since I've started writing my diary. This aligns with the fact that I started writing the diary as an outlet for my teenage sadboy-ness.
  • One of the sharpest rise-and-fall patterns in the graph is October 2012 to December 2012, which correlates with: girl trouble -> getting over girl trouble -> more girl trouble.
  • There is a pretty deep plunge from July 2013 to August 2013, which is about when I moved to NY and started at Columbia. This makes sense since moving was stressful and starting at Columbia was tough.
  • There is another dive from November 2013 to December 2013, which I suspect is probably also from girl trouble.
  • The graph looks like a checkmark from December 2014 to about August 2015, which captures nicely my most painful semester at Columbia, when I struggled with depression and had trouble leaving my bed/room. It is very gratifying to see in the data how polarity started going up in March 2015, and the months after are higher in polarity than the months before.
  • The graph roughly peaks from June 2016 to August 2016, which is when I was doing my internship at Riot. I really liked my internship and I had a lot of fun; it's cool to see that reflected in the data.
  • The other interesting drop (I think one of the most drastic ones) is April 2017, which is when I got my surgery and spent a week in the hospital then two weeks at home in deep pain.

What are the frequencies of specific words month to month, and how do they change?

The next thing I wanted to see was the frequency of specific words month to month. The results are in the appendix, because I didn't personally find them that insightful and I didn't do any analysis beyond eyeballing to see if there was any correlation. But here's the short code anyways:

def get_wf_for_word(word, wfs):
    wf_word = []
    for wf in wfs:
        idx = wf[0]
        text = wf[1]
        tf, percentage = text.get(word, (0, 0.0))
        wf_word.append( (idx, word, tf, percentage) )
    return wf_word

What is the average sentiment of every month of the year?

I had code to get the polarity month by month, so it was relatively easy to extend it to also get the polarity of each calendar month, averaged across the years.

# months is a module level dict mapping month numbers to names ('January', ...)
def analyze_sentiment_month(diary_by_month):
    monthly_sentiments = {}
    for month in months.values():
        monthly_sentiments[month] = (0.0,0)
    for diary in diary_by_month:
        month_year = diary[0]
        text = diary[1]
        sentiment = text.sentiment
        polarity = sentiment.polarity
        for month in months.values():
            if month in month_year:
                prev_sentiment = monthly_sentiments[month][0]
                prev_num_months = monthly_sentiments[month][1]
                monthly_sentiments[month] = (prev_sentiment+polarity, prev_num_months+1)
    monthly_sentiments_aggregated = {}
    for k in monthly_sentiments.keys():
        sentiment = monthly_sentiments[k]
        polarity = sentiment[0]/sentiment[1]
        monthly_sentiments_aggregated[k] = polarity
    return monthly_sentiments_aggregated

Polarity by Month Aggregate.png

I liked this a lot because it supports my previously ungrounded hunch that my worst months of the year were February, April, October, and December, and my best months were the summer months and November. If I had to guess, I think it's probably because the seasons are changing in April and October, it's cold in February and December, and that's also roughly midterm season. It would be cool to see if that changes now that I'm in LA and working. The summer months' higher polarity is probably because I've liked every internship I've done, and the weather tends to be nicer. November has also generally been an easy month in between midterms & finals for me at Columbia.
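The aggregation step can be sketched without TextBlob; the (month_year, polarity) pairs below are invented for the example, standing in for the values the post computes from the diary:

```python
from collections import defaultdict

# Average polarity per calendar month across years, on made-up samples.
samples = [('January 2013', -0.02), ('January 2014', 0.06), ('April 2017', 0.04)]

totals = defaultdict(lambda: [0.0, 0])   # month -> [polarity sum, count]
for month_year, polarity in samples:
    month = month_year.split()[0]
    totals[month][0] += polarity
    totals[month][1] += 1

averages = {m: s / n for m, (s, n) in totals.items()}
# averages['January'] is (-0.02 + 0.06) / 2, approximately 0.02
```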

Which words become more frequent or less frequent month to month?

The last thing I tried to do was to find words that became used more often or less often month to month. I calculated this by getting the word frequency maps of every month, and for each word in each month's word frequency map, I got changes in frequency relative to last month's word frequency map, and returned the 10 greatest increases and the 10 greatest decreases.

import numpy as np

def get_new_words(wfs):
    new_words = []
    for i in range(1, len(wfs)):
        idx = wfs[i][0]
        prev_wf = wfs[i-1][1]
        current_wf = wfs[i][1]
        word_diff = {}
        for word in current_wf.keys():
            (diff_freq, diff_percentage) = np.subtract(current_wf[word], prev_wf.get(word, (0, 0.0)))
            word_diff[word] = (diff_freq, diff_percentage)
        new_words.append( (idx, sorted(word_diff.items(), key=lambda x: x[1][1], reverse=True)) )
    for (idx, diffs) in new_words:
        print(idx)
        # top 10 increases followed by top 10 decreases
        for (word, (freq, percentage)) in diffs[:10] + diffs[-10:]:
            print(word, freq, str(percentage) + '%')
    return new_words
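The delta computation can be sketched on two made-up frequency maps (no numpy needed here, since each value is just a (count, percentage) pair):

```python
# Month-over-month frequency deltas, mirroring what get_new_words computes.
prev = {'gym': (4, 0.4), 'league': (10, 1.0)}   # hypothetical last month
curr = {'gym': (8, 0.8), 'read': (5, 0.5)}      # hypothetical this month

diffs = {}
for word, (freq, pct) in curr.items():
    prev_freq, prev_pct = prev.get(word, (0, 0.0))  # absent words count as zero
    diffs[word] = (freq - prev_freq, round(pct - prev_pct, 5))

ranked = sorted(diffs.items(), key=lambda x: x[1][1], reverse=True)
# 'read' is brand new this month, so it tops the list with (+5, +0.5)
```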

Some of that data is pretty personal and there's a lot of it (20 words each for about 60 months total) so I picked some of the more interesting stuff to highlight:

  • Math in January 2013 when I started getting into math
  • Lift in July 2013 when I started working out more seriously
  • The rise and then fall in frequency of my prom date from May 2013 to June 2013
  • Frank, Yoon, Kat in August 2013 when I started making new friends at Columbia
  • Putnam in December 2013 when I was really into the Putnam exam
  • Tess in February 2014 when we became friends
  • Yao, Specs in July 2015 when I was working at OTC Markets as an intern and designing a product
  • Pebbles in November 2016 when I was catsitting the best boy ever
  • Fallout in December 2016, and then dropping in January 2017, when I was playing the game a lot
  • Read in February 2017 (the highest increase for that month) when I started reading again
  • Bus decreased in August 2017 when I moved to LA (rip public transportation)
  • Isaac in September 2017, the coworker I paired with the most in the first month of work


This was a good project and it was cool to have a way to validate my memories with quantitative data from my past. It's pretty easy to add new features and I would love any ideas or suggestions for additional analysis I can do.

I'd like to close the post with my favorite insight out of all of this data. In December 2013, when I was having some girl trouble, the polarity was -0.021784, the only negative polarity across all the months in my diary. In April 2017, when I had my surgery, was in the hospital for a week, stuck at home for 2 weeks, was high out of my mind from morphine and oxycodone, and just generally not having a good time, the polarity was 0.037985. What this means is that at least from the perspective of my diary, comparing teenage angst against a ridiculous amount of physical suffering, I was more sad about girls than literal tumors in my shoulder. 


The full code is available on my github here, but fair warning I wrote it all in a few days and some of it is probably a little gnarly.

If you'd like to see the data in non graphical form, let me know and I would be happy to send you some of the tables. They're really long and Squarespace doesn't support tables for some reason (???) so I didn't include them in this post.

Some of the word frequencies for specific words that I collected:

WF Tired vs Sad.png

Justin on Joel on Software

I just started reading the book Joel on Software and found myself mentally commenting on, replying to, and summarizing many of the chapters. I thought these might be worth documenting, so I decided to write them down in a new blog post. I was hesitant about the title Justin on Joel on Software since Joel is a much more accomplished and experienced programmer and I'm a noob, but the parallelism was too attractive.

Before I get into it, some main takeaways I got from the book:

  • Know your space, know your product, know your customers! Understand why what you're making has value and who benefits and would pay for that value
  • Always focus on maximizing value. That was an important lesson for me this summer especially since sometimes I get stuck on trying to fix or do something that's not that important
  • A way to do that is to view your decisions in terms of economics (i.e. customers reached & pleased) and let that drive decision making

The chapter by chapter

Back to Basics
A fine argument for understanding lower levels of abstraction. There is a big benefit to understanding the foundation, because when shit goes wrong on a higher level of abstraction, the solution often lies in the more basic levels. Abstractions will fail someday, and when they do, you'd better know what is being abstracted! I particularly liked his description of Shlemiel the painter algorithms:

Shlemiel gets a job as a street painter, painting the dotted lines down the middle of the road. On the first day he takes a can of paint out to the road and finishes 300 yards of the road. "That's pretty good!" says his boss, "you're a fast worker!" and pays him a kopeck.

The next day Shlemiel only gets 150 yards done. "Well, that's not nearly as good as yesterday, but you're still a fast worker. 150 yards is respectable," and pays him a kopeck.

The next day Shlemiel paints 30 yards of the road. "Only 30!" shouts his boss. "That's unacceptable! On the first day you did ten times that much work! What's going on?" "I can't help it," says Shlemiel. "Every day I get farther and farther away from the paint can!"
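In code, the classic instance of a Shlemiel the painter algorithm is building a string with repeated concatenation. A sketch (CPython happens to optimize += on strings in some cases, but the quadratic pattern is real for other immutable accumulators):

```python
# Shlemiel-style string building: each concatenation re-copies everything
# written so far, so total work grows quadratically with input size.
def shlemiel_concat(parts):
    s = ''
    for p in parts:
        s = s + p   # re-walks all previous characters, like the paint can
    return s

def join_concat(parts):
    return ''.join(parts)   # linear: one pass, one allocation
```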

The Joel Test: 12 Steps to Better Code
Gives a cool litmus test for what good looks like: a quick 12 yes/no questions to see how your team is doing. One of his tests is "does your team fix bugs immediately?" I'm not too sure a yes is always good; isn't it sometimes OK to have small bugs (especially if they're documented) while you build out important features? I think there is a complex push/pull between debugging and developing new stuff that doesn't lie completely on the debugging side.

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
I have really nothing to add except you should read it & learn it! Provides some interesting history on ASCII, Unicode, and encodings. Some common myths:

  • ASCII is not plain text
  • Unicode is not just 16 bits
  • There is no such thing as plain text!!!!

Painless Functional Specifications
I actually love writing specs; I like writing and I make shitty versions of specs for stuff I'm working on anyways. I find it helpful in figuring out what I want to do and the scope. Joel provides some fine arguments for writing specs, and delegates the task to PMs or to programmers. It is a really similar idea to RFCs at Riot.

Even though I like specs, I am still a little iffy on them. I agree that it's helpful to write them and have plans laid out, but it seems like such a huge investment in time and effort. I wrote 4 RFCs this summer and it took me a few days with a lot of help, and we did them AFTER we worked on the stuff for a few sprints. I will say though that by writing the RFCs I learned stuff about the direction and higher level plan for our project that I hadn't really thought about when I was working on it. Two points I found interesting:

  • 1 author for each spec, 1 point person
  • No (or a bare) template lowers the barrier to writing a spec (ugh, so many categories to fill out). I kinda disagree cause I feel like a barrier is also "ugh, what goes into a spec" and starting on a blank page is tough.

Evidence Based Scheduling
Schedules are hard. One of the weaknesses I profess in interviews is estimating (I actually talked about this at my Riot interviews). It seems like I am always off, and one of my troubles with storyboarding and making agile work is having good estimates of work. He provides a few suggestions, including short scheduling (nothing longer than 16 hours to do) and careful scheduling, to think about the smaller pieces that need to be done. I do try to do both, but one big issue I always run into is that I don't know what tasks are involved and stuff always just pops up.

But to resolve this, he makes a super interesting suggestion for another type of scheduling: evidence based scheduling. You track elapsed time on each story you take and your initial estimate, then record this past historical data to estimate future with a Monte Carlo simulation. The system assumes that the more consistent & accurate your estimates have been in the past, the more consistent & accurate they will be in the future, since distractions and mistakes happen in every software cycle with some probability (taken into account by the simulation). I am really excited to give this a try soon!!
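A toy sketch of the sampling core of that idea (all the history and estimates below are made up; Joel's real system tracks per-developer history and produces full ship-date distributions, not just a median):

```python
import random

# Toy evidence-based scheduling: velocity = estimate / actual from past
# stories; sample it to turn new estimates into simulated total hours.
past_velocities = [0.5, 0.8, 0.9, 1.0, 1.2]   # hypothetical history
new_estimates = [4, 8, 2]                     # hour estimates for upcoming tasks

def simulate_total_hours(n=1000, seed=0):
    rng = random.Random(seed)
    outcomes = sorted(
        sum(est / rng.choice(past_velocities) for est in new_estimates)
        for _ in range(n)
    )
    return outcomes[n // 2]   # median simulated total
```

The neat property is that a sloppy estimator's scattered velocities automatically widen the simulated range, while a consistent estimator's tight velocities narrow it.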

Daily Builds Are Your Friend
Introduces the idea of the REPL (read-eval-print loop), modernized to an edit-compile-test loop, and highlights an engineering proverb I learned from Kyle this summer:

The tighter the loop, the faster the programmer

The time saved is exponential, because the more you can hold in your brain, the faster you can work, and the less frustrated you will be. Waiting for stuff to run and compile is equivalent to the rage of waiting for a web page to load. While I get the benefits of automatic daily and complete builds, I think the technology has advanced and people are now investing time and effort into continuous integration, tightening the loop even further. You can read more about Docker/Jenkins at Riot on our tech blog!

Hard Assed Bug Fixin'
I hate shipping shitty code, I love cleaning stuff up, and I love having neat code, so sometimes I have to fight the urge to debug and refactor. The problem with this is that it takes ages, is often unnecessary work, and psychologically ties you to your work (which is never good). Instead of fixing bugs immediately, he suggests instead that your team document non critical bugs and evaluate what to fix based on user value (he says economic, but there are other similar metrics). Here is a quote I like from the article:
"Fixing bugs is only important when the value of having the bug fixed exceeds the cost of fixing it."

Five Worlds
The five worlds are: shrinkwrap, internal, embedded, games, and throwaway. These different types of software require different solutions, so you should tailor advice and approach keeping in mind the different types of products you are making. This is a more general way of saying think about your users and your product space.

Don't Let Architecture Astronauts Scare You
Architecture astronauts are what Joel calls people on such a high level of abstraction that they run out of air to breathe. The emphasis is on usability and what the user wants, not just a lofty abstract ideal that doesn't solve a useful problem. Architectures and new toys are only useful because they solve problems!!

Craftsmanship
This is a cool explanation of what excellence looks like in software. The last 1% might take 500% of the effort, so Joel defines craftsmanship as software written robustly. Often the effort is not worth it except for shrinkwrap software with lots of users. This was my experience too; often the last bit of work is the hardest (spinning up the shell of a vaguely functional product is not as hard).

Three Wrong Ideas from CS
Interesting ideas but not sure if they're still really relevant. I've never really thought about these things (local resources vs resources on the network, anti aliased text, and searching).

Guerrilla Guide to Interviewing
Parts I agree with:

  • A bad hire is worse than losing a good hire
  • Good hires are the most important thing to growth and excellence
  • Don't hire maybes
  • Look for people who are smart & gets things done
  • Reduce bias before interviews (don't talk to anyone before, try to not form opinions from their resume)
  • Give an opinion about the interviewee immediately
  • Ask open ended questions
  • Try to hire generalists, especially since tech changes so much

Parts I disagree with:

  • I don't think the type of question asked at interviews is good; it forces studying something irrelevant to the job. I get that understanding people's ability to think problems through is helpful but I just hate balancing a binary tree and have never done it outside an interview
  • What is "smart?" I think aptitude is probably more important, the ability to learn and adapt

Incentive Pay Considered Harmful
Super good article, lots that I agree with in here. Traditional reviews are not very helpful, and they don't accurately capture real skills and ignore other valuable ones. They are also very demotivating:
"For them, a positive review makes them feel like they are doing good work in order to get the positive review… as if they were Pavlovian dogs working for a treat, instead of professionals who actually care about the quality of the work that they do."
Most people end up being disappointed by their reviews, so reviews lower morale (which is horrible).

I think this can be improved by:

  • Better criteria to capture range of skills 
  • Review performance more often (i.e. 1:1s) so quarterly performance reviews are not surprises
  • Performance review to improve and identify strengths

Most of these ideas are explained in a Riot tech blog post by Mike Seavers :-)

Top Five (Wrong) Reasons You Don't Have Testers
Goooooood piece. Unfortunately I can't say I have much experience with testers; I've never worked with any. The arguments he provides for having testers are:

  • There will always be bugs
  • It is cheaper to catch them before
  • Tough to fix bugs after shipping
  • Damaging to let customers be your testers

He also suggests some interesting ways to get good testers.

Human Task Switches Considered Harmful
Gives a cool graph of task switching for CPUs to make an analogy with humans. Assigning 2 jobs severely slows someone down, so it is better to work sequentially. The worst thing to do for productivity is to task switch, and it gets worse the bigger the cost of task switching.
My takeaways:

  • Hiring new people is brutal; it's a big time sink to not only train new people but also to switch from your main task to help. Sorry team.
  • 1 owner per 1 story is good- a one person team doesn't have to task switch
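Joel's arithmetic for the chapter's example is easy to reproduce. A toy version with two 10-day tasks, not even counting any switching overhead:

```python
# Two 10-day tasks. Sequentially, task A ships on day 10 and task B on day 20.
# Interleaving day by day, A ships on day 19 and B on day 20 - nothing ships
# earlier, and the average completion day gets strictly worse.
sequential = [10, 20]
interleaved = [19, 20]
print('sequential average ship day:', sum(sequential) / 2)    # 15.0
print('interleaved average ship day:', sum(interleaved) / 2)  # 19.5
```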

Things You Should Never Do, Part 1
The thing you should never do: scrap old code and rewrite it from scratch! Often a rewrite suffers from the same mistakes, plus the added negative of the time it takes to reach feature parity. It is easy to want to raze it all to the ground and restart, especially since as programmers we often trust ourselves more. This is a fine argument for the careful refactoring of League instead of completely scrapping it, especially because the time spent redesigning and rewriting is time we aren't shipping new stuff for the players.

The Iceberg Secret, Revealed
Most people focus on UI, even though UI is maybe 10% of the work. This is an interesting problem, and I've actually noticed it at work too. UI improvements are much easier to tangibly see than 2 weeks of solid coding and improvement under the hood. One sprint, for our demo, we rushed to make some UI improvements in a day and made our demo look much more impressive despite the 2 weeks of work we put into adding functionality. This brings up the interesting question of how to properly capture progress and show progress. Perhaps a way is to measure by numbers (story items completed)? 

I think it is also noteworthy that the problem is mitigated greatly when those you report to and demo to are technically inclined.

The Law of Leaky Abstractions
The article is pretty similar to Back to Basics. Abstractions leak, and when they do, to fix them, you have to understand the underlying complexity being abstracted. Abstractions and tools save us time working, but they don't save us time learning. This is a fine argument for understanding pointers. Another good example is git: I know the commands and I can commit stuff, but when things go wrong I have no idea how to fix them because I don't really have any idea how git works.

Lord Palmerston on Programming
Leaky abstractions mean that we live with a hockey stick learning curve: you can learn 90% of what you use day to day with a week of learning, but the other 10% might take you a couple of years of catching up. This is what separates really experienced programmers from novices: the base of knowledge to draw upon and build upon. Because of this, Joel suggests that every team have at least one domain expert. It is true that new tools pop up all the time, but when I was struggling with some javascript and unfamiliar APIs this summer, my coworkers (equally inexperienced with those APIs) knew what to look for and had an intuition for learning. Perhaps that experience and knowledge is what most separates a senior engineer from a new grad? The difference feels like doing surgery by sawing everything open and learning as you go, vs using a scalpel (traipsing through the chrome sources is sawing the patient open to inspect the innards).

Measurement
Strict metrics to measure success often fail, because people will work to those metrics instead of toward actual progress. I think that instead of just applying strict and formal measurements, they can be accompanied by direct supervision to really understand the value that each person is bringing to the team and to the company. Then measurements become a way of understanding what good looks like instead of a rigid unsupervised bonus tool. The downside is, of course, you'll need a lot of managers.

What is the Work of Dogs in this Country?
This is an article about the idea of eating your own dog food, or trying your own code as a user. Before a demo I did this summer, my dev loops were so tight that I wasn't following the workflow the way a real user would use the app, so when I demo'd, the workflow was buggy. It is super embarrassing to think you've tested comprehensively, go to a demo, and have it fail immediately, all because you never tested creating a new page and instead edited an old one... eat your own dog food.

Getting Things Done When You're Only a Grunt  
You can improve process even when you're a grunt!!! Things can always be made better by you and good process can be demonstrated by you to influence your team. This is important on bad teams, but also important on good teams (although you will have more impact on a good team). Process is very important and should be contributed to by all members of the team. Also a good lesson on how demonstrating by example is much stronger than speaking.

Two Stories
Joel shares two stories from two different software companies and highlights the importance of ownership and responsibility. Hire smart people and then empower them to make the right decisions! Even as an intern I had a lot of decision-making power and it was very motivating. I think management works best as the ultimate support: the less micromanaged programmers are, the more effective they are.

Big Macs versus the Naked Chef
Joel uses restaurants as an analogy for scalability problems in software companies. Big Macs are easy to make everywhere, so McDonalds pop up like weeds. On the other hand, high end restaurants don't scale well. It is a pretty good piece, a good explanation of why scaling is difficult and why there are so many useless methodologies out there. Highlights the importance of hiring the best and being careful- hiring programmers to scale your company is tough! 

Nothing is as Simple as It Seems
Always try to reduce risk, and one of the biggest risks is scheduling risk. Design will help you figure out the schedule and figure out the things that are not as simple as they seem. Don't rely on your gut or your first impressions unless you're very experienced, and even then design will help you understand what you need to do and in what pieces. Of course there is the danger of over-engineering and over-designing wasting a lot of time, but I think in general incremental and careful design will always help you write better code faster.

In Defense of Not Invented Here Syndrome
Joel argues here that it is not always best to look for third-party solutions, and sometimes it is OK to have not invented here syndrome and build stuff in house. I agree that certain things you have to build in house, and that includes your core product, especially since some solutions require custom work for which nothing exists open source or third-party, and adapting would be a lot of work for something less flexible and powerful. However, I am skeptical that there is never a skeleton or base you can build on and contribute to, and oftentimes I find that people default to building things themselves instead of seeing what has been done already.

Strategy Letter 1: Ben & Jerry's vs Amazon
He compares two different company strategies, Amazon and Ben & Jerry's. Ben & Jerry's is slower, has more competition, less risk, a weaker network effect, and culture is important. On the other hand, Amazon is fast, has no competition, burns through cash, carries more risk, and cannot maintain culture. This article is great insight again into understanding your space. This is super important when starting a company but also important to keep in mind when you're joining a new one! Riot is big but definitely feels more like a Ben & Jerry's: we care a lot about culture and there is heavy competition in our genre and in gaming.

Strategy Letter II: Chicken and Egg Problem
I love these articles, they are so insightful. He presents the chicken and egg problem of merchants and customers, software and users for platforms. It is hard to drive users to new platforms without software written for the platform, but it is hard to drive programmers to write software for the platforms without users. His solution is backward compatibility to artificially create the chickens/eggs (or as Joel calls it, bring your own truck of chickens and eggs).

Strategy Letter III: Let Me Go Back!
Another great one. This one addresses barriers to entry for customers, which need to be reduced to encourage switching. He focuses on one surprising barrier in particular: the stealth lock-in.
It is a bad idea to try to lock in customers early, because early on they are non-existent; lock-in just locks out potential customers. An important part of driving growth is to make it easy to switch by making it easy to go back. Worry about retaining customers once you have them.

Strategy Letter IV: Bloatware and the 80/20 Myth
The big idea: big programs aren't that bad, space is cheap, and you shouldn't trim features just to avoid "bloat." I'm not sure about this one, though; performance and memory usage are pretty important to League so various potatoes can play it, and for many mobile apps a slim binary matters to a lot of users. I think maybe the right approach is a happy medium where you treat size as just another feature and don't let it dominate necessary features. For example, notepad is smaller than Sublime, but I sure as hell won't program in notepad.

Strategy Letter V: The Economics of Open Source
This article is mostly about open source economics and why companies support open source. The underlying economic principle: when the price of a product's complements goes down, demand for the product goes up, and the company makes more profit. Open source commoditizes the complement, driving its price down and thereby increasing demand for the actual product. As an example, the complements of League skins are esports and the game itself; they are free to watch and play, so demand for in-game content goes up. This makes open source interesting to think about in terms beyond the idealistic "open source is awesome and we should free the software." There are also strong economic incentives I hadn't thought of before, such as working from a base already built by others, building a brand and attracting talent by contributing to the software community, and getting "free help."

How Microsoft Lost the API War
This is the kind of stuff I'd love to read about in a book. The article details what the Windows API is, why it's so important to Microsoft, how it lost the API war, and the consequences. Super interesting material, and ends with a convincing argument that the new API is HTML.

Perhaps the best support for his arguments is that, as a new programmer, my last brush with rich client applications on Windows was at the hedge fund I worked at during my freshman year. Since then I've mainly worked with web-based APIs, and to be honest I kind of see VB and .NET as outdated (I don't personally know any programmers who are experts in those areas).

agile Agile

In today's blog post I'd like to talk about the Agile methodology, specifically about good Agile, bad Agile, and agile Agile. In particular, I will champion Agile as a flexible and agile tool and condemn Agile as a cult and a religion. 

What is Agile?

The Agile methodology is a set of principles commonly used for software development. Under the Agile methodology, requirements and solutions evolve through adaptive planning, early delivery, and continuous iteration with the goal of flexible and fast building of software. The philosophy of the Agile methodology is described here. The motivation of Agile is that in a world where requirements are unpredictably evolutionary and estimates, plans, and predictions are almost invariably inaccurate, it is more efficient and makes more sense to iterate frequently and adapt to changes as they come. There are a lot of different processes and tools under the Agile methodology, but they generally promote and facilitate communication, collaboration, and adaptability throughout the software development life-cycle.

The term "agile" was coined in 2001 with the creation of the Agile Manifesto, and since then the Agile methodology has been hailed and evangelized by many people who bring it up as a lightweight alternative to previous development philosophies such as Waterfall and Cowboy coding. There are now Agile conferences and trainings and coaches, and I find that "Agile," just like all popular business/development ideas, has become a bit of a meme, joining the hallowed hall of business buzzwords along with "synergy," "organic," and "growth hack."

Good Agile

Memes aside, I think there are a lot of great things about Agile. Both engineering teams at my last two internships (OTC Markets and Riot Games) used Agile heavily to develop their software, and for good reason.

N.B. this is not one of the reasons

Emphasis on Delivering Value

One of the biggest pros of Agile is its focus on delivering user value. A good analogy is League of Legends. The primary focus of any player in League is to destroy the enemy nexus. CS-ing helps, killing enemies helps, taking objectives helps, but you win by destroying the nexus. Similarly, in my opinion, the primary focus of any engineer is to deliver user value in some way. Working with new technologies helps, using different languages helps, and learning theory helps, but ultimately the goal is to serve some kind of user value.

Product development in the Agile way is split into "sprints," set periods in which specific work is completed. At my last two internships, each sprint was a two-week period of development. Each sprint begins with a planning period, during which the scope of the work for the sprint is determined, and ends with a review and a retro. During the sprint, the development team works on "user stories," small tasks the team commits to during the planning period. This entire process is very focused on user value, from planning to execution to review. Each user story describes and provides some kind of user value, and at the end of each sprint something tangible should be delivered to the user. As an engineering newb, I found this super helpful. I tend to get distracted by interesting problems or fancy new technology that are not always the most relevant or useful, so by focusing on user value through user stories in sprints, I can best ship, deliver, and improve a useful product.

Small, Quick Iterations

I heard this from one of my coworkers at Riot, Kyle (source here). There was an experiment in which two groups of people were tasked with making the best ceramic pot they could. One group (group A) was asked to do one iteration and take as much time as they needed. The other group (group B) was asked to do multiple iterations under a short time frame for each pot. It turns out that at the end of the experiment, those who had made multiple pots had much nicer pots than those who had dedicated all their time to one ("waterfalled" it, if I may). The same idea, I think, extends to software. Software is (relatively) cheap and (relatively) easy to write, so the cost of iteration tends to be low. In that case, the more iterations a software development team can churn out, the better and more accurate the final product will be, since it becomes easier to evaluate, reassess, and learn from past prototypes. Agile processes lend themselves to these benefits because of their segmented, shorter nature (e.g. sprints, user stories, etc.).

Fast and Flexible Improvements

Because Agile is so focused on delivering value to the customer and promotes quick iterations, a third big benefit of the Agile methodology is that it becomes easier to determine what is important to iterate on and improve. In a world where requirements are often incorrect, unclear, uncertain, and very mutable, the tighter the feedback loop the faster the improvement. By putting out a minimum viable product (MVP) and then iterating upon it, the development team can figure out what features to add or to improve. To extend the ceramic pots analogy, imagine you're asked to make some open-top ceramic container. An Agile-y way to go about it is by first making a really shitty bowl

fig. 1: a shitty bowl

learning that you actually want a pot, and then making a shitty pot

fig. 2: a shitty pot

learning what a nicely shaped pot is and then making a pot with a better shape

fig. 3: a better shaped pot

wanting decorations and then making a well-shaped pot that's nicely decorated.

fig. 4: a nicely shaped, better decorated pot

By using quick iterations and a tight feedback loop, you can rapidly prototype ceramic pots and end up with a nice result.

Bad Agile

So if Agile is so awesome, why doesn't everyone use it? Well, Agile is a tool, and just like any tool, there is the good and there is the bad, and there is, unfortunately, also a lot of bad Agile.

Agile Hammer, Stiff Nails

The first point is that when all you have is an Agile hammer, everything tends to look a lot like nails. Unfortunately, not all problems are appropriate for the Agile methodology. For example, one of Agile's tenets is to fail fast, because by failing quickly, we can learn from our iterations and prototypes and improve upon them. However, in the case of rocket launches, it is both cost-prohibitive and morally wrong to fail fast, as a lot of money and people's lives are at stake. So Agile may be inappropriate for anything that cannot be iterated upon due to a high cost per iteration.

Another example is building a bridge- it is pretty unlikely, once the design is finalized, that the requirements of the bridge will be changing drastically midway through construction. In cases where the requirements are very well known, I think Agile is less appropriate. It is true that this is of no fault of Agile, but it remains something to consider nonetheless.

Housekeeping and Spring Cleaning

Another con of Agile is that, because of its emphasis on user value, some things get put aside and live on the backlog permanently. Things that do not directly provide user value in the form of new or improved features are often de-prioritized (in my experience) in the Agile process. As a result, there isn't a lot of time dedicated to housekeeping, and prototypes are often treated as products. This is of course heavily dependent on the team, its members, and the company, but it seems that because things like refactoring or cleaning up code are not direct user value, they are often tossed on the backlog as nice-to-dos.

In addition, because of an emphasis on providing user value immediately, things that would provide longer term value are sometimes ignored.

It is difficult in a short sprint to work on long-term goals, and easy to postpone them when the team is driven to do the tasks that would most immediately make an impact. For the same reason, I find it easier to convince myself to do the dishes and take out the trash than to clean out the closet and do my spring cleaning.

Educated Estimates and Bad Guesses

An important aspect of Agile is the ability to estimate progress and continually iterate. This is obviously very difficult because of unexpected complexities. Everyone knows how it feels to take on a one-day piece of work and have it grow to take a week instead, like decapitating a garden snake and finding out it's really a hydra. Work is really hard to estimate, and oftentimes accurate estimates are only possible after the work is finished. In addition, it's really hard to measure motivation and energy. People don't always operate in two-week sprints; sometimes people are less motivated or have less energy during a sprint, and less work gets delivered.

Disciplined Team, Disciplined Agile

I think, however, that the biggest negative of Agile is that it takes an intensely focused and disciplined team to use it properly. Because Agile relies so heavily on every team member, in every sprint, to determine what to iterate on and how, it demands discipline and motivation. In a world of continuous iteration and improvement, there is a ton of pressure to always deliver, and it's very easy, in my opinion, to slack and lose discipline. Agile relies on a good product owner to figure out business direction, a good team of engineers to solve relevant problems and estimate reasonable work, and a good development manager to maintain and oversee sprint rituals and Agile processes. Most importantly, in order to continue to improve a product, you need a team that is motivated and invested in the product's success, and that is not the easiest thing to have.

agile Agile

However, ultimately Agile is just a tool. Just like the product it is used to develop, the Agile process itself should be agile, iterated on and improved. Taking a hammer and trying to staple a piece of paper will fail, and I would not blame the hammer or the paper for the failure. It is important, I think, to recognize which tools are suitable for which job and to continually tailor and improve your tools for your use case. For example, for some of the problems I brought up above, such as Housekeeping and Spring Cleaning, one idea I have seen is to introduce cleanup sprints dedicated to improving, refactoring, and clearing up some of the accumulated technical debt. Another option is to put those items in as specific user stories so they are not merely acknowledged as nice-to-dos but planned intentionally within the sprint. For Educated Estimates and Bad Guesses, appropriate planning plus retros will help determine what is feasible to take on in the future, and in a communicative team, motivation and energy levels are easy to share and discuss. All of these problems can be attributed to, and solved by, the teams using Agile; they are not necessarily reflective of the Agile methodology itself.

I think the danger comes (as it often does) with the evangelists, who live and die by Agile and have the Agile Manifesto tattooed on the inside of their eyelids. Agile is a tool, not a prescription, and definitely not a religion. The goal of Agile is to BE agile, and the philosophy and the tools are supposed to help with that. Agile is a good process for creating software, but the same principles of iteration can be applied to Agile itself: agile Agile. I find that tools tend to become unhelpful once people become rigid about them and stop treating them as flexible. So what does agile Agile look like? In my opinion, just as products are continuously iterated and improved upon, so should processes be. Take the general toolkit and essence of Agile, sift out the good Agile, remove the bad Agile, and work towards the agile Agile for your project and your team. Like company culture, if you do not curate and evolve your development process, you will have one regardless; it just might not be the one that you want.

The Cult of Vim

I start too many of my blog posts like this, but I pay too much for this site and write too little, so I guess this is a good time for my annual revisit to my blog to add some content. My past posts have been pretty philosophical in content, so in my next "series" of posts, I'd like to discuss some of the more technical, engineering-oriented things I've learned from my internship at Riot Games this summer.

Today, I'd like to talk about the cult of vim, and why I am now a flag-waving, card-carrying vim fanatic. In particular, in this post I'll share the reasons to switch to vim that I found personally compelling, the challenges of switching, the philosophy of vim, and some of my personal vim settings & config.

the official vi gang sign

A Little Bit of Background

My first exposure to vim was in Jae's Advanced Programming course (CS3157) in my sophomore year, and my god, I hated vim so much. I did most of the course assignments in vim because he introduced us to it and suggested we use it, and every minute of vim just sucked for me. I hated not being able to scroll, I hated having to type ':w' to write a file, I hated having to type 'i' to start typing, and most of all, I hated pressing the arrow keys to move through the characters of a line.

I understood that unless there was some masochistic trend amongst programmers, vim was probably not as hard to use as I thought it would be, but I was too lazy to learn C pointers and vim commands at the same time, so I just accepted that insert mode was weird and movement in vim was crappy.

After my experience with vim, I was introduced to IntelliJ at my internship at OTC Markets, and I was a pretty big fan. I could click on stuff and use my mouse and run a debugger, and I thought, wow! IDEs are awesome. In my own personal work, on recommendations from friends, I usually used Sublime Text 2, and it was pretty decent for what I wanted it to do. I thought editors were just tools for programming, and since the meat of programming seemed to me to be the technical concepts, quibbling about editors felt like putting the cart before the horse.

I started trying vim again and explored using it seriously after working with my colleague at Riot, the awesome Kyle Burton. When we were pair programming early in my internship I watched him work in vim, and I was convinced that vim was either super awesome or he was some kind of wizard (maybe both, still not sure about kbot...). I was deeply impressed by how fast he could execute commands to manipulate text from vim, and I felt like despite the steep learning curve and the bad first experience, I was slow enough in Sublime that ramping up to parity wouldn't take too much effort. Now I pretty much use vim exclusively for programming, I have a vim plugin installed for when I'm working in IntelliJ, and sometimes I even accidentally press shift-v to try to select lines in Google Docs.

Why Vim?

There are a couple of benefits of adopting vim, but in my mind, the most important one is that vim is an incredibly productive editor. There is a saying in the software engineering world that 90% of programming is thinking and 10% is actually coding, and I find that some people (me included) use that as a justification for less mastery over their tools. Since most of us spend most of our time thinking instead of typing, the logic is that optimizing for typing speed and editor speed doesn't yield that much of a gain. Personally, despite making the argument a couple of times in the past, I don't really buy that anymore. In my opinion, even if the premise is true, there are still very solid gains to be made for editing faster even in the 10% block of time a programmer is actually coding.

However, the real benefit of using a tool that you're very familiar with (and that could be vim, or any other editor) is that editing becomes second nature. If my primary focus in programming is to think about the problems, then I want to be as focused as I possibly can be, and any divergence of my thoughts or any distraction caused by thinking about my editor is *bad*. It doesn't matter how small the distraction is or how short the command is; any thinking you have to do about your editor is a distraction from your focus, and that is *bad*.

I like this comic about interrupting programmers:



And I think it also applies in your editor. The less noisy your editor is in your mind, the faster you can do things with your tools, the better it is for your focus - and that is perhaps the most compelling argument for switching to vim.

Another strong plus of vim is that vim is available everywhere. I've recently begun to appreciate this more: vim requires no download, no installation, and no setup, and almost anywhere there is a UNIX environment, you can edit with vim. That turns out to be super useful both when you move to a new machine and, more commonly, when you're working on another machine or in a VM. Even without your personal vimrc, vanilla vim is pretty decent and more than enough for most editing purposes. To me, it boils down to ease of portability; in a Java-esque manner, vim is write once, edit anywhere :p.

Challenges of Using Vim

So why don't more people use vim? Well, one of the significant challenges of moving to vim is the steep learning curve. I've been using vim every day for almost 2 months now, and I'm only starting to scratch the surface of beginning to "grok" vim. Vim is, for all intents and purposes, as far as I know, infinitely powerful, and as you learn more, there will be more and more commands and tips and tricks to learn until you can convince your friends and family that you are a vim wizard. The learning curve, in my mind, looks something like this:

poorly drawn image courtesy of paint

But the productivity of using vim compared to most editors looks something like this:

disclaimer: these images are my own subjective views, based on editors I've used before

As you can see, the learning curve is less of a curve and more of a 90-degree wall. Learning vim is super hard: your productivity plummets when you start, and it feels like you're battling your editor the entire time. For me, however, once I worked with vim more and understood the idea behind it, I started sliding down the learning curve and boosting my productivity. In the next section of the post, therefore, I'd like to discuss my understanding of the philosophy of vim and how to address some of its challenges.

Modal Editing in Vim

There's a lot that's confusing with vim, and one of the biggest things I think stumps new users of vim is the modal system of vim. Vim operates in a few different modes, but for most intents and purposes, vim has two modes: normal mode, where you move around and select text and run commands, and insert mode, where you insert new text. That is super confusing for a new user. 99% of other text editors only have insert mode, and the first time I used vim, I was very upset that I had to press 'i' to enter insert mode and I basically stayed in insert mode the entire time except when I was writing the file.

It turns out that that is a completely wrong way to think about vim.

The Philosophy of Vim

The correct way of using vim, in my opinion, is to generally stay in normal mode and only leave it for short bursts of typing in insert mode. The philosophy here is that vim commands are meant to be combined, and this makes a lot more sense once you begin grokking vim and stop seeing files as something you edit, seeing them instead as grid-like blocks of text that you can freely manipulate and command.

The "Zen" of vim is that you're speaking a language. A good way I learned to think about normal mode commands is as parts of speech: the verbs are your commands, such as 'c' (change), 'd' (delete), and 'y' (yank); the nouns are your movements, such as 'w' (word), '}' (paragraph), or 'G' (end of file); and the adjectives/adverbs are your numerical prefixes and descriptors, such as 'a' (around) and 'i' (in). vim commands are meant to be combined into "sentences" which, on a higher level, describe to vim what you want to do.
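To make the grammar concrete, here are a few "sentences" built from standard vim commands (nothing here is custom):

```vim
dw    " delete from the cursor to the start of the next word
d2w   " delete the next two words
ci"   " change the text inside the surrounding double quotes
yap   " yank (copy) the whole paragraph around the cursor
dG    " delete from the current line to the end of the file
```

Each one reads verb-first: what to do, then where or how much.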

This turns out to be an incredibly powerful idea. Think of the time you spend actually inserting new text when you program, versus the time you spend moving around, copying text, deleting characters, etc. I would wager that the latter far outweighs the former. Part of the power of vim is that with proper knowledge of the commands, you can very quickly manipulate, move, copy, remove, replace, and delete text, leaving the actual typing of new code in short bursts in insert mode. The more vim you know, the shorter and more informative your descriptions are, and the more powerful vim becomes as a tool.

Another great benefit of using vim to its full power is the '.' command. The '.' command repeats the last change, so the more succinct your last command is and the more it does, the more useful '.' becomes. For example, if your last command was 'x' (delete the character under the cursor), then '.' is the same as 'x'. But if your last change deleted several words or indented a block, '.' acts as a small "macro" for it, letting you repeat the whole edit without retyping it.
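A concrete example of '.' in action, using only standard commands: change one word, then repeat the same change elsewhere with a single keystroke.

```vim
ciwfoo<Esc>   " change the word under the cursor to 'foo'
w             " move to the start of the next word
.             " repeat the change: this word becomes 'foo' too
```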

This really only begins to scratch the surface of vim, and there's still a ton of commands and tools in vim that I don't know how to use yet. It can be daunting for someone coming into vim to be willing to take the dive and pay the high upfront cost of learning vim, so in this next section, I'd like to talk about what is maybe vim's greatest challenge- its super high learning curve.

How Do I Start Learning Vim?

When I learned about models for the atom in middle school & high school, I began by learning models that were largely incorrect, but were helpful in providing enough understanding to eventually refine my model. I think in a very similar way, learning vim is much easier when you ignore all the commands except the ones you immediately need. Instead of thinking of learning vim as learning the commands first and then trying to use every single one, remember that vim, just like any editor, is ultimately your tool. Describe to it what you want it to do, and then learn how to speak the vim language.

What is the first thing you'd like your editor to be able to do? A great place to start is the ability to move up, down, left, and right by characters. The first thing I learned in vim was to use 'h,j,k,l' instead of the arrow keys, and got familiar with keeping my hands on the home row. Once moving with 'h,j,k,l' started to annoy me (and it did pretty soon) I started to learn other movement commands such as moving by 'w' (word), 'b' (back 1 word) and '}' (paragraph). Once I got used to those movement commands and wanted to know more, I learned more powerful commands such as 'f_' (find next char), '/' (search), and 'z(t,z,b)' (keeps cursor in current position but scrolls view top, center, or bottom of the screen).
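My movement "ladder" so far looks roughly like this, from the commands I learned first to the ones I picked up later (all standard vim motions):

```vim
h j k l    " left, down, up, right, one character at a time
w  b       " forward / back one word
{  }       " back / forward one paragraph
fx         " jump to the next 'x' on the current line
/pattern   " search forward for 'pattern'
zt zz zb   " keep the cursor in place, scroll it to the top / center / bottom
```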

Let your pain drive your development, and trust that if something feels bad, there's probably a vim command for it. I found it much more effective to learn vim by describing to vim something I wanted to do, and once that proved annoying or painful, learning to describe something more powerful. In my opinion, it's better not to try to learn every single command and use macros immediately. The plum pudding model, while very wrong, is a decent way to start learning the structure of atoms, and moving by 'h,j,k,l', while very inefficient, is a decent way to start learning vim.


Vim is also great because it's super personalizable. It's easy to edit your .vimrc to suit your personal habits and needs, and just as good chefs have custom tools, good vim-mers have custom .vimrc files. I've only started customizing my .vimrc and adding plugins, but here is a link to my .vimrc on github, and my vim cheatsheet (although I'm sure there are better ones on the internet). I won't go through my .vimrc here because there's nothing too fancy there, and the file is commented pretty heavily.

Some Final Thoughts

I've only started using vim really recently, so there's still a ton left for me to learn. My next goal is to learn how to use tabs and buffers more efficiently, since I still think of them in terms of files instead of blocks of text. I would be more than happy to hear any comments, suggestions, or lessons on making my vim-ming more efficient, any awesome plugins or .vimrc lines I've been missing out on, or thoughts on the philosophy of vim. I'd also love to see if more people could benefit from vim, so if you're thinking of taking the leap and climbing the curve, please let me know! I'd be happy to discuss vim with you anytime.

I would like to finish this post with a vim koan:

No ultimate difference

One day a monk visited Master Wq, and inquired, “Master, how will my code be different when I have mastered Vim?” Master Wq answered, “Before Vim: declare, define, process, print. After Vim: declare, define, process, print.”

p.s. I typed this post in vim.