Ten years in software

Last updated October 31st, 2016

I have been working professionally as a Software Engineer for the past 10 years. In that time, I've learned a huge
amount, gained a bit of confidence, and largely ignored the social nature of our field. I haven't given back to the
community and now feel like it's a good time to change that. I've been very lucky in my career thus far and want to
share the broad lessons that I've learned along the way.

This is part four of a series of pieces reflecting on my career.

When I started at IMVU, it had a large, complex, and tech-debt-ridden codebase. I say that not to shame the company or
its codebase, but to acknowledge that the company had been operating at a loss for the five years prior to my joining in 2010.
Unwieldy codebases happen and, as far as I can tell, are perfectly natural. That being said, as a newcomer I found it
difficult to reason about and difficult to extend. Comprehending the code and its original intent was hard, and making
deliberate changes often required many false starts. Initially the code was daunting; I was fearful and avoided
making changes to deep, core areas of the codebase.

Looking around, the engineers who were the most effective were able to do amazing things, changing large swaths of the
core codebase without causing breakage. I asked Chad how he did it, and he would quote Dune,
saying "fear is the mind-killer." The most effective engineers wouldn't avoid making changes to the scary,
heavily used areas. Instead, they would tackle them head on, taking the time to question the assumptions behind
the APIs and proceeding in a direction they felt was right.

There was a pattern in their work: they were not afraid to make mistakes, were not afraid to break things, and were very
quick to challenge and verify their assumptions. Need to know if something is used? Delete it. Need to know how an
untested component works? Write a test to reverse-engineer its behavior.
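To make that second move concrete, here's a minimal sketch in Python of what's sometimes called a characterization test. The `legacy_discount` function and its rules are invented for illustration, not anything from IMVU's codebase. The point is to treat the legacy code as opaque, probe it, and record what it actually does today as assertions:

```python
import unittest

# Hypothetical stand-in for an opaque, untested legacy function.
# Pretend we don't know its rules yet and are probing it with tests.
def legacy_discount(price, loyalty_years):
    if loyalty_years >= 5:
        return price * 0.8
    if loyalty_years >= 1:
        return price * 0.9
    return price

class TestLegacyDiscount(unittest.TestCase):
    """Characterization tests: assert what the code does today,
    not what we wish it did, so we can refactor it safely later."""

    def test_new_customers_pay_full_price(self):
        self.assertAlmostEqual(legacy_discount(100.0, 0), 100.0)

    def test_loyal_customers_get_ten_percent_off(self):
        self.assertAlmostEqual(legacy_discount(100.0, 1), 90.0)

    def test_veteran_customers_get_twenty_percent_off(self):
        self.assertAlmostEqual(legacy_discount(100.0, 5), 80.0)

if __name__ == "__main__":
    unittest.main()
```

If an assertion fails, you've just learned something true about the system; either way, the scary component ends up pinned down by executable documentation.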

This quick and fearless approach did not stand alone with Chad and the other lead engineers. It was surrounded by a
culture which emphasized and supported reliability and clear communication. The organization reinforced this
through its processes and patterns, and the engineering organization favored architectures which helped foster this type of
effective engineer.

IMVU had a mailing list called "change@" to which developers would send a message after they had performed any
notable change to how the product or developer experience worked. It was at the discretion of the maker of the change to
decide whether or not to send an email to this list. Aside from the details of what the change was and why they believed
it was a good thing, there was a standard postscript included in every single email, which answered a few
questions, notably:

I wrote/DID NOT write automated tests

I verified/DID NOT verify my change locally (in a dev environment)

I verified/DID NOT verify my change in production

I showed/DID NOT show my change to someone/anyone

These emails served both as a stream of changes for our support teams and as a checklist of practices that helped
reinforce overall stability. There was no shame in making a change without writing tests, but it certainly helped to
publicly acknowledge that tests weren't written. When a change couldn't easily be verified in production, it helped to
call out to others to be on the lookout for potential problems.

We discussed failures openly and without hesitation. Whenever there was an outage, system failure, or other surprise, we
would research what happened and identify all of the contributing factors of the failure. The terms we used for these
"post-mortems" were "root cause analysis" or "five whys" (terms Toyota used to deal with failures on its production
lines), but the process was never a search for someone to blame. We would create a timeline of events, looking for
problems which could have been identified earlier or prevented by automated means or manual processes. After
these meetings, we would create two sets of follow-up tasks:

Ones that must be done, which would have helped identify, diagnose, or reduce the impact of the outage

Ones that won't be done, which helped us acknowledge the limits of what we were capable of doing at the time

And of course, in full disclosure: there was always a growing backlog of these remediation items, even ones which
were marked as must be done. However, we always strove to make progress against them.

In my first year at IMVU, I was on a team which handled the payment backend and anti-fraud systems. I made a small
change to our payments backend which was intended to enable experimental behavior for staff members while keeping normal
behavior for our customers. But I screwed up. Here are the main events:

I got a boolean expression backwards, shipping the experimental bits to all customers and leaving the old
behavior for our staff.

I didn't send a change email notifying everyone of my change.

I verified that payments worked for myself (which they did, because I was a staff member).

I didn't monitor the effect on our customers.

I didn't verify that the experimental code path was behaving properly.

To make matters worse, the bug occurred after we accepted payment but before we delivered the
purchased goods. We took money without giving anything in return. I broke things pretty badly.

It took an hour or so to identify and fix the regression, and much more time to identify and deliver the purchased goods
to the original purchasers. When I realized what I had done, my heart sank. I did some mental math, comparing the
measured loss in revenue to my salary. My mind was racing, and I was honestly fearful for my job.

Thankfully, once we corrected the problem, I wasn't blamed for the failure (aside from a stern "don't do that").
Instead, I was strangely praised as someone who had managed to overcome our "immune system" of automated alerts, graphs,
and charts.

My manager took me aside to talk about what happened. We talked through the problems, and he asked me what I could have done
better. He told me straight up that I didn't have anything to worry about: it was fine for me to be feeling pretty low,
but I also had an opportunity to learn from my mistakes and to champion better practices of communication, testing, and
monitoring for others. Going through the post-mortem led to a number of changes: at the high level, how we communicated
about changes to sensitive areas; operationally, how we monitored and marked successful transactions; and at the low level,
how we named our functions to make them less prone to double negatives in booleans, and how we could identify code which
was executed only by customers and not by staff.
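To illustrate that last, low-level point, here's a small Python sketch of the class of bug and the naming fix. All of the names are invented for illustration; this is not the actual IMVU code. A negated helper forces the reader through a double negative, which is exactly how a condition gets written backwards:

```python
# Hypothetical sketch of the bug class; names are invented for illustration.

class User:
    def __init__(self, is_staff: bool):
        self.is_staff = is_staff

# Bug-prone: the negated helper makes the branch below read as
# "if not not-staff", which is easy to write (and review) backwards.
def is_not_staff(user: User) -> bool:
    return not user.is_staff

def payment_flow_v1(user: User) -> str:
    if is_not_staff(user):  # inverted by accident
        return "experimental_flow"
    return "standard_flow"

# The inverted version ships the experiment to customers:
assert payment_flow_v1(User(is_staff=False)) == "experimental_flow"  # oops

# Safer: name the positive condition the branch actually cares about.
def should_use_experimental_flow(user: User) -> bool:
    return user.is_staff  # the experiment was meant to be staff-only

def payment_flow_v2(user: User) -> str:
    if should_use_experimental_flow(user):
        return "experimental_flow"
    return "standard_flow"

# A test pinning down who gets which flow would have caught the inversion:
assert payment_flow_v2(User(is_staff=False)) == "standard_flow"
assert payment_flow_v2(User(is_staff=True)) == "experimental_flow"
```

Naming the positive condition means a reviewer reads the intent directly instead of unwinding a double negative in their head.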

I had my tail between my legs for a few weeks, but despite my guilt I had also made things better for everyone. I forced the
organization to improve its defenses, and I became a better engineer, both in how I communicated with others
and in how I wrote and verified code.

Do you want to learn more? Was something confusing? Was something insightful? In the NYC area
and want to grab a coffee? Feel free to drop me an email at
sufian@gmail.com or send a tweet my way
@sufianrhazi

Disclaimer: Unless stated otherwise, the above words are my own and do not represent the
opinions of any person or business but myself.