Something Similar2017-01-23T22:59:13-08:00http://www.somethingsimilar.com/Jeff Hodgesjeff@somethingsimilar.comA couple of phone calls to Congresspeople2017-01-23T22:59:13-08:00http://somethingsimilar.com/2017/01/23/a-couple-of-phone-calls-to-congresspeople/<p>I&rsquo;ve just heard that one of my senators, Senator Feinstein, has voted to confirm
Representative Pompeo for CIA Director. Rep. Pompeo has left the door wide open
for him to
<a href="http://www.thedailybeast.com/articles/2017/01/20/trump-cia-pick-leaves-door-open-to-waterboarding-more-spying-on-americans.html">bring back torture</a>.</p>
<p>I am embarrassed that I didn&rsquo;t reach out as a constituent to make it clear what I
expected of her. I&rsquo;m angry that she’s let us down. I couldn’t imagine that she
would ever vote to support any part of the Trump agenda. Sen. Feinstein doesn&rsquo;t
seem to understand the moment we are in, so it&rsquo;s up to us to let her know.</p>
<p>Whatever she got for that vote, it was a bad trade. The worst of all possible
options is that she did it for free. We have to resist compromising ourselves by
supporting the Trump agenda of deceit, bigotry, and sexism or we will find
justice impossible to achieve.</p>
<p>Join me in calling your members of Congress. Below is what I’m going to say to
her and the other members of Congress that represent me when their offices
open. Maybe these scripts will help you with your phone calls.</p>
<p>There’s some handy sites out there that’ll
<a href="http://act.commoncause.org/site/PageServer?pagename=sunlight_advocacy_list_page">help you find your representatives and their contact numbers</a>.</p>
<p>For Senator Dianne Feinstein:</p>
<blockquote>
<p>Hi, I’m Jeff Hodges. I’m a Democrat living in the Mission of San Francisco.</p>
<p>Yesterday, I heard that Senator Feinstein had voted to confirm Representative
Pompeo as the new CIA director. As someone who voted for her in 2012, I’m
embarrassed and angry that she would vote to confirm any of Trump’s cabinet
picks, but most especially a man who could not commit to writing a stance
against torture.</p>
<p>Voting to confirm any of the Trump cabinet picks is a confirmation of the
Trump agenda. An agenda of deceit, bigotry, and sexism is an agenda we must
resist, and I hope all of Senator Feinstein’s future votes reflect
that. Bipartisanship is not compatible with resistance. Thank you for your
time.</p>
</blockquote>
<p>For Senator Kamala Harris:</p>
<blockquote>
<p>Hi, I’m Jeff Hodges. I’m a Democrat living in the Mission of San Francisco.</p>
<p>I wanted to thank Senator Harris for her vote against Representative Pompeo and
the rest of the Trump cabinet picks. I’m grateful that Senator Harris was able
to understand the moment we were in. That’s exactly the kind of resistance
to the Trump agenda I had hoped she would bring. Thank you very much for your
time.</p>
</blockquote>
<p>For Representative Nancy Pelosi:</p>
<blockquote>
<p>Hi, I’m Jeff Hodges. I’m a Democrat living in the Mission of San Francisco.</p>
<p>I just got off the phone with the offices of Senator Feinstein and Senator
Harris. Those were two very different phone calls. I&rsquo;m calling today because I
wanted Representative Pelosi to know how important it is to me that her
colleagues in the Senate not vote to confirm the Trump picks, and that all
of the Democrats in Congress continue to vote against the Trump agenda in all
of its forms. Thank you for your time.</p>
</blockquote>
<p>I can&rsquo;t believe this is the world we&rsquo;re living in.</p>
Code to Read When Learning Go2013-12-26T20:15:00-08:00http://somethingsimilar.com/2013/12/26/code-to-read-when-learning-go/<p>When folks ask me how they should learn Go, I usually say:</p>
<blockquote>
<p>Run through the activities in <a href="http://tour.golang.org/#1">A Tour of Go</a>, then read <a href="http://golang.org/doc/effective_go.html">Effective
Go</a> (and maybe check out <a href="https://gobyexample.com/">Go By Example</a>), and then
read some code.</p>
</blockquote>
<p>This seems to work well for them. But the last part could use some examples of
good, clear, idiomatic code to read. Let me do my part there.</p>
<p>Now, I&rsquo;m holding these up as someone who already knows Go, didn&rsquo;t use all of
these to learn Go, and, perhaps worse, came to Go already knowing how to
program. But even with these caveats, these are codebases that are useful to
learn from.</p>
<p><a href="https://github.com/bmizerany/pat">pat</a> (<a href="http://godoc.org/github.com/bmizerany/pat">godoc</a>) - Pat is a cute, small library that provides
Sinatra-like HTTP handling in Go on top of the already fairly easy to use
<a href="http://golang.org/pkg/net/http/"><code>net/http</code></a> Handlers. It&rsquo;s a breeze to read. Feature-ful Go projects
often have just a handful of files in them, gathering power from the way they
combine the standard library in a particularly convenient and narrower fashion
than the original API. Pat is a great example of that (and Go&rsquo;s stdlib being
good enough for production is to be commended). It integrates well with other
code you&rsquo;ll use simply by relying on the interfaces in the <code>net/http</code> library.</p>
<p><a href="https://code.google.com/p/codesearch/">codesearch</a> (<a href="http://godoc.org/code.google.com/p/codesearch/">godoc</a>) - To see how a larger
project can fit together, I recommend the codesearch project. Its use of the
regexp package&rsquo;s <a href="http://golang.org/pkg/regexp/syntax/">abstract syntax tree API</a> to implement
regexp search is wonderful. For maximum understanding, pair this code with
Russ Cox&rsquo;s articles on <a href="http://swtch.com/~rsc/regexp/regexp4.html">how to implement searching with regular expressions in
queries</a>. I don&rsquo;t want to undersell that article. It describes how
Google&rsquo;s (now defunct) Code Search solved that problem and how codesearch, the
Go implementation Russ wrote, works. The whole series of articles on regular
expressions is good, but this one article is especially fun. My workplace
shipped this project internally and it&rsquo;s pretty much Solved&trade; our code
search problems.</p>
<p><a href="https://github.com/golang/groupcache">groupcache</a> (<a href="http://godoc.org/github.com/golang/groupcache">godoc</a>) - Groupcache is another
larger codebase for those interested in that, and contains plenty of goodies
within. It&rsquo;s part of the <a href="http://dl.google.com">dl.google.com</a>, Blogger, and Google Code
architecture, and is, in many ways, a &ldquo;smarter memcached&rdquo; as a library. The
<a href="https://github.com/golang/groupcache/#readme">README</a> details its design. From its ownership model
designed to avoid <a href="http://en.wikipedia.org/wiki/Thundering_herd_problem">thundering herds</a>, to the
<a href="http://godoc.org/github.com/golang/groupcache/singleflight">singleflight</a> implementation of request collapsing, to the
<a href="http://godoc.org/github.com/golang/groupcache#Stats">interface for extracting stats</a>, groupcache is a maturely
designed piece of software. Not just good code, but wise distributed system
design is inside.</p>
<p><a href="http://golang.org/pkg/net/http/">net/http</a> - The net/http library is very Go-y, and used everywhere
in the Go community. It&rsquo;s been hardened for production use at scale and HTTP
1.x is a very hard protocol to get right, so it&rsquo;s not always the easiest
read. That said, it&rsquo;s worth learning how such a common building block was
designed. It also provides capabilities like <a href="http://golang.org/pkg/net/http/pprof/">net/http/pprof</a>, which
takes Go&rsquo;s built-in profiler and makes it available over the web. (Reading
<a href="http://blog.golang.org/profiling-go-programs">Profiling Go Programs</a> is really nice, by the by. Be sure to check
out the goroutine blocking profiler.)</p>
<p>There are so many more projects I could talk about. Especially ones using the
Go AST tooling (like in <a href="http://godoc.org">godoc.org</a>&rsquo;s <a href="https://github.com/garyburd/gddo">gddo</a> codebase), or the
surprisingly clear crypto libraries (like <a href="http://golang.org/pkg/crypto/subtle/"><code>crypto/subtle</code></a>) but time
and space won&rsquo;t permit it. Poke your nose around, search for projects on <a href="http://go-search.org/">Go
Search</a>, and write some code. One of Go&rsquo;s major selling points is
its readability, even to those inexperienced in it, and just digging in is
worth your time.</p>
<p>(And, if you do nothing else, read <a href="http://golang.org/doc/effective_go.html">Effective Go</a>.)</p>
A Story About a Photograph2013-12-09T08:15:00-08:00http://somethingsimilar.com/2013/12/09/a-story-about-a-photograph/<p><a href="https://twitter.com/j4cob">Jacob</a> and I shipped <a href="https://blog.twitter.com/2013/forward-secrecy-at-twitter-0">Forward Secrecy</a> at Twitter. A little
while later, the New York Times asked to take some photos of the two of us for
the <a href="http://bits.blogs.nytimes.com/2013/11/22/twitter-toughening-its-security-to-thwart-government-snoops/?_r=0">couple</a> of <a href="http://www.nytimes.com/2013/12/05/technology/internet-firms-step-up-efforts-to-stop-spying.html?pagewanted=1&amp;_r=1">pieces</a> they were writing about
what Twitter and other companies were doing in response to the Snowden
revelations.</p>
<p>When the photographer came, he asked us to work next to each other in a way
that &ldquo;didn&rsquo;t look too fake&rdquo;. Alright, authenticity is important, so Jacob and
I started poking at this DH parameter thing we&rsquo;d been meaning to investigate.</p>
<p>Ten minutes later, the photographer says &ldquo;alright, we&rsquo;re done over here. Let&rsquo;s
go to the next spot.&rdquo; But we don&rsquo;t look up, and Jacob says &ldquo;wait, one more
run&rdquo; while I give a distracted &ldquo;one minute&rdquo; wave.</p>
<p>I think it was real enough for him.</p>
<p><img class="img-responsive" src="/images/jsha_and_me_pfs_roofdeck.jpg" alt="Two engineers on a roof deck in front of a keyboard. One bites his lip in concentration with hands on the keyboard while the other points at the laptop's screen"></p>
<p>I kind of like the shot better in <a href="https://twitter.com/jmhodges/status/408723656689205248">greyscale</a>. Credit to <a href="http://www.noahbergerphoto.com/">Noah
Berger</a> for the shot.</p>
Notes on Distributed Systems for Young Bloods2013-01-14T08:15:00-08:00http://somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/<p>I&rsquo;ve been thinking about the lessons distributed systems engineers learn on
the job. A great deal of our instruction is through scars made by mistakes
made in production traffic. These scars are useful reminders, sure, but it&rsquo;d
be better to have more engineers with the full count of their fingers.</p>
<p>New systems engineers will find the <a href="http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing">Fallacies of Distributed
Computing</a> and the <a href="http://codahale.com/you-cant-sacrifice-partition-tolerance/">CAP theorem</a> as part of their
self-education. But these are abstract pieces without the direct, actionable
advice the inexperienced engineer needs to start moving<sup class="footnote-ref" id="fnref:1"><a href="#fn:1">1</a></sup>. It&rsquo;s surprising
how little context new engineers are given when they start out.</p>
<p>Below is a list of some lessons I&rsquo;ve learned as a distributed systems engineer
that are worth being told to a new engineer. Some are subtle, and some are
surprising, but none are controversial. This list is for the new distributed
systems engineer to guide their thinking about the field they are taking
on. It&rsquo;s not comprehensive, but it&rsquo;s a good beginning.</p>
<p>The worst characteristic of this list is that it focuses on technical problems
with little discussion of social problems an engineer may run into. Since
distributed systems require more machines and more capital, their engineers
tend to work with more teams and larger organizations. The social stuff is
usually the hardest part of any software developer&rsquo;s job, and, perhaps,
especially so with distributed systems development.</p>
<p>Our background, education, and experience bias us towards a technical solution
even when a social solution would be more efficient, and more pleasing. Let&rsquo;s
try to correct for that. People are less finicky than computers, even if their
interface is a little less standardized.</p>
<p>Alright, here we go.</p>
<p><a name="fail" href="#fail">#</a> <strong>Distributed systems are different because they fail often.</strong> When asked what
separates distributed systems from other fields of software engineering, the
new engineer often cites latency, believing that&rsquo;s what makes distributed
computation hard.</p>
<p>But they&rsquo;re wrong. What sets distributed systems engineering apart is the
probability of failure and, worse, the probability of partial failure. If a
well-formed mutex unlock fails with an error, we can assume the process is
unstable and crash it. But the failure of a distributed mutex&rsquo;s unlock must be
built into the lock protocol.</p>
<p>Systems engineers that haven&rsquo;t worked in distributed computation will come up
with ideas like &ldquo;well, it&rsquo;ll just send the write to both machines&rdquo; or &ldquo;it&rsquo;ll
just keep retrying the write until it succeeds&rdquo;. These engineers haven&rsquo;t
completely accepted (though they usually intellectually recognize) that
networked systems fail more than systems that exist on only a single machine
and that failures tend to be partial instead of total. One of the writes may
succeed while the other fails, and so now how do we get a consistent view of
the data? These partial failures are much harder to reason about.</p>
<p>Switches go down, garbage collection pauses make leaders &ldquo;disappear&rdquo;,
socket writes seem to succeed but have actually failed on the other machine, a
slow disk drive on one machine causes a communication protocol in the whole
cluster to crawl, and so on. Reading from local memory is simply more stable
than reading across a few switches.</p>
<p>Design for failure. <a href="#fail">#</a></p>
<p><a name="robustdist" href="#robustdist">#</a> <strong>Writing robust distributed systems costs more than writing robust
single-machine systems.</strong> Creating a robust distributed solution requires more
money than a single-machine solution because there are failures that only
occur with many machines. Virtual machine and cloud technology make
distributed systems engineering cheaper but not as cheap as being able to
design, implement, and test on a computer you already own. And there are
failure conditions that are difficult to replicate on a single
machine. Whether it&rsquo;s because they only occur on dataset sizes much larger
than can be fit on a shared machine, or in the network conditions found in
datacenters, distributed systems tend to need actual, not simulated,
distribution to flush out their bugs. Simulation is, of course, very useful. <a href="#robustdist">#</a></p>
<p><a name="robustoss" href="#robustoss">#</a> <strong>Robust, open source distributed systems are much less common than robust,
single-machine systems.</strong> The cost of running many machines for long periods
of time is a burden on open source communities. Hobbyists and dilettantes are
the engines of open source software and they do not have the financial
resources available to explore or fix many of the problems a distributed
system will have. Hobbyists write open source code for fun in their free time
and with machines they already own. It&rsquo;s much harder to find open source
developers who are willing to spin up, maintain, and pay for a bunch of
machines.</p>
<p>Some of this slack has been taken up by engineers working for corporate
entities. However, the priorities of their organization may not be in line
with the priorities of your organization.</p>
<p>While some in the open source community are aware of this problem, it&rsquo;s not
yet solved. This is hard. <a href="#robustoss">#</a></p>
<p><a name="coord" href="#coord">#</a> <strong>Coordination is very hard.</strong> Avoid coordinating machines wherever
possible. This is often described as &ldquo;horizontal scalability&rdquo;. The real trick
of horizontal scalability is independence &ndash; being able to get data to
machines such that communication and consensus between those machines is kept
to a minimum. Every time two machines have to agree on something, the service
becomes harder to implement. Information has an upper limit to the speed it can
travel, and networked communication is flakier than you think, and your idea
of what constitutes consensus is probably wrong. Learning about the <a href="http://en.wikipedia.org/wiki/Two_Generals%27_Problem">Two
Generals</a> and <a href="http://en.wikipedia.org/wiki/Byzantine_Generals%27_Problem">Byzantine Generals</a> problems is useful
here. (Oh, and Paxos really is <a href="http://research.google.com/pubs/pub33002.html">very hard to implement</a>; that&rsquo;s not
grumpy old engineers thinking they know better than you.) <a href="#coord">#</a></p>
<p><a name="memory" href="#memory">#</a> <strong>If you can fit your problem in memory, it&rsquo;s probably trivial.</strong> To a
distributed systems engineer, problems that are local to one machine are
easy. Figuring out how to process data quickly is harder when the data is a
few switches away instead of a few pointer dereferences away. In a distributed
system, the well-worn efficiency tricks documented since the beginning of
computer science no longer apply. Plenty of literature and implementations are
available for algorithms that run on a single machine because the majority of
computation has been done on singular, uncoordinated machines. Significantly
fewer exist for distributed systems. <a href="#memory">#</a></p>
<p><a name="slow" href="#slow">#</a> <strong>&ldquo;It&rsquo;s slow&rdquo; is the hardest problem you&rsquo;ll ever debug.</strong> &ldquo;It&rsquo;s slow&rdquo; might
mean that one or more of the systems involved in performing a user
request is slow. It might mean that one or more of the parts of a pipeline of
transformations across many machines is slow. &ldquo;It&rsquo;s slow&rdquo; is hard, in part,
because the problem statement doesn&rsquo;t provide many clues to the location of the
flaw. Partial failures, ones that don&rsquo;t show up on the graphs you usually look
at, are lurking in a dark corner. And, until the degradation becomes very
obvious, you won&rsquo;t receive as many resources (time, money, and tooling) to
solve it. <a href="http://research.google.com/pubs/pub36356.html">Dapper</a> and <a href="http://engineering.twitter.com/2012/06/distributed-systems-tracing-with-zipkin.html">Zipkin</a> were built for a reason. <a href="#slow">#</a></p>
<p><a name="backpressure" href="#backpressure">#</a> <strong>Implement backpressure throughout your system.</strong> Backpressure is the
signaling of failure from a serving system to the requesting system and how
the requesting system handles those failures to prevent overloading itself and
the serving system. Designing for backpressure means bounding resource
use during times of overload and times of system failure. This is one
of the basic building blocks of creating a robust distributed system.</p>
<p>Implementations of backpressure usually involve either dropping new
messages on the floor, or shipping errors back to users (and incrementing
a metric in both cases) when a resource becomes limited or failures
occur. Timeouts and exponential back-offs on connections and requests to other
systems are also essential.</p>
<p>Without backpressure mechanisms in place, cascading failure or unintentional
message loss become likely. When a system is not able to handle the failures
of another, it tends to emit failures to another system that depends on it. <a href="#backpressure">#</a></p>
<p><a name="partial" href="#partial">#</a> <strong>Find ways to be partially available.</strong> Partial availability is being able to
return some results even when parts of your system are failing.</p>
<p>Search is an ideal case to explore here. Search systems trade-off between how
good their results are and how long they will keep a user waiting. A typical
search system sets a time limit on how long it will search its documents, and,
if that time limit expires before all of its documents are searched, it will
return whatever results it has gathered. This makes search easier to scale in
the face of intermittent slowdowns and errors, because those failures are
treated the same as not being able to search all of their documents. The
system allows for partial results to be returned to the user and its
resilience is increased.</p>
<p>And consider a private messaging feature in a web application. At some point, no
matter what you do, enough storage machines for private messaging will be down
at the same time that your users will notice. So what kind of partial failure do
we want in this system?</p>
<p>This takes some thought. People are generally more okay with private messaging
being down for them (and maybe some other users) than they are with all users
having some of their messages go missing. If the service is overloaded or one of
its machines is down, failing out just a small fraction of the userbase is
preferable to missing data for a larger fraction. And, on top of that choice, we
probably don&rsquo;t want an unrelated feature, like public image upload, to be
affected just because private messaging is having a problem. How much work are
we willing to do to keep those failure domains separate?</p>
<p>Being able to recognize these kinds of trade-offs in partial availability is
good to have in your toolbox. <a href="#partial">#</a></p>
<p><a name="metrics" href="#metrics">#</a> <strong>Metrics are the only way to get your job done.</strong> Exposing metrics (such as
latency percentiles, increasing counters on certain actions, rates of change)
is the only way to cross the gap from what you believe your system does in
production and what it actually is doing. Knowing how the system&rsquo;s behavior on
day 20 is different from its behavior on day 15 is the difference between
successful engineering and failed shamanism. Of course, metrics are necessary
to understand problems and behavior, but they are not sufficient to know what
to do next.</p>
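<p>As a tiny illustration of the stdlib&rsquo;s support for this, a counter published with <code>expvar</code> shows up at <code>/debug/vars</code> when the process serves HTTP. The counter name here is made up:</p>

```go
package main

import (
	"expvar"
	"fmt"
)

// requests is published under /debug/vars when this process serves
// HTTP; scraping it over time turns belief into measurement.
var requests = expvar.NewInt("requests_total")

func handleRequest() {
	requests.Add(1)
	// ... real work ...
}

func main() {
	for i := 0; i < 3; i++ {
		handleRequest()
	}
	fmt.Println(requests.Value()) // 3
}
```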
<p>A diversion into logging. Log files are good to have, but they tend to
lie. For example, it&rsquo;s very common for the logging of a few error classes to
take up a large proportion of the space in a log file but, in actuality, occur
in a very low proportion of requests. Because logging successes is redundant
in most cases (and would blow out the disk in most cases) and because
engineers often guess wrong on which kinds of error classes are useful to see,
log files get filled up with all sorts of odd bits and bobs. Prefer logging as
if someone who has not seen the code will be reading the logs.</p>
<p>I&rsquo;ve seen a good number of outages extended by another engineer (or myself)
over-emphasizing something odd we saw in the log without first checking it
against the metrics. I&rsquo;ve also seen another engineer (or myself)
Sherlock-Holmes&rsquo;ing an entire set of failed behaviors from a handful of log
lines. But note: a) we remember those successes because they are so very rare
and b) you&rsquo;re not Sherlock unless the metrics or the experiments back up the
story. <a href="#metrics">#</a></p>
<p><a name="percentiles" href="#percentiles">#</a> <strong>Use percentiles, not averages.</strong> Percentiles (50th, 99th, 99.9th, 99.99th)
are more accurate and informative than averages in the vast majority of
distributed systems. Using a mean assumes that the metric under evaluation
follows a bell curve but, in practice, this describes very few metrics an
engineer cares about. &ldquo;Average latency&rdquo; is a commonly reported metric, but
I&rsquo;ve never once seen a distributed system whose latency followed a bell
curve. If the metric doesn&rsquo;t follow a bell curve, the average is meaningless
and leads to incorrect decisions and understanding. Avoid the trap by talking
in percentiles. Default to percentiles, and you&rsquo;ll better understand how users
really see your system. <a href="#percentiles">#</a></p>
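<p>A toy nearest-rank percentile over a latency sample makes the point. Production systems use histograms or sampling reservoirs instead, but the contrast with the mean is the same:</p>

```go
package main

import (
	"fmt"
	"sort"
)

// percentile computes the p-th percentile (0 < p <= 100) of samples
// with the nearest-rank method.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	rank := int(p/100*float64(len(s))+0.5) - 1
	if rank < 0 {
		rank = 0
	}
	return s[rank]
}

func main() {
	// Latencies (ms) with one slow outlier: the mean is about 110ms,
	// a number that describes no request anyone actually experienced.
	lat := []float64{10, 12, 11, 13, 10, 12, 11, 10, 12, 1000}
	fmt.Println(percentile(lat, 50)) // 11
	fmt.Println(percentile(lat, 99)) // 1000
}
```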
<p><a name="capacity" href="#capacity">#</a> <strong>Learn to estimate your capacity.</strong> You&rsquo;ll learn how many seconds are in a
day because of this. Knowing how many machines you need to perform a task is
the difference between a long-lasting system, and one that needs to be
replaced 3 months into its job. Or, worse, needs to be replaced before you
finish productionizing it.</p>
<p>Consider tweets. How many tweet ids can you fit in memory on a common machine?
Well, a typical machine at the end of 2012 has 24 GB of memory; you&rsquo;ll need an
overhead of 4-5 GB for the OS, another couple, at least, to handle requests,
and a tweet id is 8 bytes. This is the kind of back of the envelope
calculation you&rsquo;ll find yourself doing. Jeff Dean&rsquo;s <a href="http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf">Numbers Everyone Should
Know</a> slide is a good expectation-setter. <a href="#capacity">#</a></p>
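<p>That calculation, written out (all figures are the rough 2012-era assumptions from above):</p>

```go
package main

import "fmt"

// idsInMemory is the back-of-the-envelope arithmetic from above: usable
// memory divided by the size of each id.
func idsInMemory(totalGB, overheadGB, bytesPerID float64) float64 {
	const gb = 1 << 30
	return (totalGB - overheadGB) * gb / bytesPerID
}

func main() {
	// 24 GB machine, ~5 GB for the OS, ~2 GB for serving requests,
	// 8 bytes per tweet id.
	ids := idsInMemory(24, 5+2, 8)
	fmt.Printf("about %.1f billion tweet ids\n", ids/1e9) // about 2.3 billion tweet ids
}
```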
<p><a name="flags" href="#flags">#</a> <strong>Feature flags are how infrastructure is rolled out.</strong> &ldquo;Feature flags&rdquo; are a
common way product engineers roll out new features in a system. Feature flags
are typically associated with frontend A/B testing where they are used to show
a new design or feature to only some of the userbase. But they are a powerful
way of replacing infrastructure as well.</p>
<p>Too many projects have failed because they went for the &ldquo;big cutover&rdquo; or a
series of &ldquo;big cutovers&rdquo; that were then forced into rollbacks by bugs found too
late. By using feature flags instead, you&rsquo;ll gain confidence in your project and
mitigate the costs of failure.</p>
<p>Suppose you&rsquo;re going from a single database to a service that hides the details
of a new storage solution. Using a feature flag, you can slowly ramp up writes
to the new service in parallel to the writes to the old database to make sure
its write path is correct and fast enough. After the write path is at 100% and
backfilling into the service&rsquo;s datastore is complete, you can use a separate
feature flag to start reading from the service, without using the data in user
responses, to check for performance problems. Another feature flag can be used
to perform comparison checks on read of the data from the old system and the new
one. And one final flag can be used to slowly ramp up the &ldquo;real&rdquo; reads from the
new system.</p>
<p>By breaking up the deployment into multiple steps and affording yourself quick
and partial reactions with feature flags, you make it easier to find bugs and
performance problems as they occur during ramp up instead of at a &ldquo;big bang&rdquo;
release time. If an issue occurs, you can just tamp the feature flag setting
back down to a lower (perhaps, zero) setting immediately. Adjusting the rates
lets you debug and experiment at different amounts of traffic knowing that any
problem you hit isn&rsquo;t a total disaster. With feature flags, you can also choose
other migration strategies, like moving requests over on a per-user basis, that
provide better insight into the new system. And when your new service is still
being prototyped, you can use flags at a low setting to have your new system
consume fewer resources.</p>
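<p>A common implementation is a deterministic percentage bucket keyed on the user id. This is one possible shape, not any particular system&rsquo;s:</p>

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// rampedOn buckets a user by hashing their id against the current ramp
// percentage. Hashing the id, rather than rolling dice per request,
// keeps each user's experience stable as the percentage moves.
func rampedOn(userID string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return h.Sum32()%100 < percent
}

func main() {
	// At 10%, roughly a tenth of users hit the new write path. If it
	// misbehaves, set the percentage back to 0; no deploy required.
	for _, u := range []string{"alice", "bob", "carol"} {
		if rampedOn(u, 10) {
			fmt.Println(u, "-> new storage service")
		} else {
			fmt.Println(u, "-> old database")
		}
	}
}
```

<p>In a real system the percentage would come from a config store you can change at runtime, which is what makes the &ldquo;tamp it back down immediately&rdquo; move possible.</p>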
<p>Now, feature flags sound like a terrible mess of conditionals to a classically
trained developer or a new engineer with well-intentioned training. And the use
of feature flags means accepting that having multiple versions of infrastructure
and data is a norm, not a rarity. This is a deep lesson. What works well for
single-machine systems sometimes falters in the face of distributed problems.</p>
<p>Feature flags are best understood as a trade-off, trading local complexity (in
the code, in one system) for global simplicity and resilience. <a href="#flags">#</a></p>
<p><a name="idspace" href="#idspace">#</a> <strong>Choose id spaces wisely.</strong> The space of ids you choose for your system will
shape your system.</p>
<p>The more ids required to get to a piece of data, the more options you have in
partitioning the data. The fewer ids required to get a piece of data, the
easier it is to consume your system&rsquo;s output.</p>
<p>Consider version 1 of the Twitter API. All operations to get, create, and
delete tweets were done with respect to a single numeric id for each
tweet. The tweet id is a simple 64-bit number that is not connected to any
other piece of data. As the number of tweets goes up, it becomes clear that
user tweet timelines and the timelines of other users&rsquo; subscriptions
could be efficiently constructed if all of the tweets by the same user were
stored on the same machine.</p>
<p>But the public API requires every tweet be addressable by just the tweet
id. To partition tweets by user, a lookup service would have to be
constructed &ndash; one that knows which user owns which tweet id. Doable, if
necessary, but with a non-trivial cost.</p>
<p>An alternative API could have required the user id in any tweet lookup and,
initially, simply used the tweet id for storage until user-partitioned storage
came online. Another alternative would have included the user id in the tweet
id itself at the cost of tweet ids no longer being k-sortable and numeric.</p>
<p>Watch out for what kind of information you encode in your ids, explicitly and
implicitly. Clients may use the structure of your ids to de-anonymize private
data, crawl your system in unexpected ways (auto-incrementing ids are a
typical sore point), or perform a <a href="https://www.owasp.org/index.php/Top_10_2010-A4-Insecure_Direct_Object_References">host of other attacks</a>. <a href="#idspace">#</a></p>
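<p>As a sketch of the &ldquo;user id in the tweet id&rdquo; alternative, a composite id might pack the owner into the high bits. This is illustrative, not Twitter&rsquo;s actual scheme:</p>

```go
package main

import "fmt"

// makeID packs the owning user into the high 32 bits and a per-user
// sequence number into the low 32. Given only the id, the system can
// route to the right partition -- at the cost of ids no longer being
// globally time-sortable, and of leaking the owner to anyone who can
// read the id.
func makeID(userID, seq uint32) uint64 {
	return uint64(userID)<<32 | uint64(seq)
}

// userOf recovers the partition key from a composite id.
func userOf(id uint64) uint32 {
	return uint32(id >> 32)
}

func main() {
	id := makeID(42, 7)
	fmt.Println(userOf(id)) // 42
}
```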
<p><a name="dataloc" href="#dataloc">#</a> <strong>Exploit data-locality.</strong> The closer the processing and caching of your data
is kept to its persistent storage, the more efficient your processing, and the
easier it will be to keep your caching consistent and fast. Networks have more
failures and more latency than pointer dereferences and <code>fread(3)</code>.</p>
<p>Of course, data-locality means being nearby in space, but it also means nearby
in time. If multiple users are making the same expensive request at nearly the
same time, perhaps their requests can be joined into one. If multiple instances
of requests for the same kind of data are made near to one another, they could
be joined into one larger request. Doing so often affords lower communication
overhead and easier fault management. <a href="#dataloc">#</a></p>
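<p>The &ldquo;join duplicate requests into one&rdquo; idea is what groupcache&rsquo;s singleflight package does. A stripped-down sketch of the mechanism:</p>

```go
package main

import (
	"fmt"
	"sync"
)

// call tracks one in-flight request; latecomers wait on it instead of
// issuing a duplicate.
type call struct {
	wg  sync.WaitGroup
	val string
}

// Group collapses concurrent requests for the same key.
type Group struct {
	mu sync.Mutex
	m  map[string]*call
}

// Do runs fn at most once per key at a time; concurrent callers for
// the same key block and share the single result.
func (g *Group) Do(key string, fn func() string) string {
	g.mu.Lock()
	if g.m == nil {
		g.m = make(map[string]*call)
	}
	if c, ok := g.m[key]; ok {
		g.mu.Unlock()
		c.wg.Wait() // someone else is already fetching; share their answer
		return c.val
	}
	c := new(call)
	c.wg.Add(1)
	g.m[key] = c
	g.mu.Unlock()

	c.val = fn()
	c.wg.Done()

	g.mu.Lock()
	delete(g.m, key)
	g.mu.Unlock()
	return c.val
}

func main() {
	var g Group
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Concurrent callers for "user:42" share an in-flight fetch.
			fmt.Println(g.Do("user:42", func() string { return "profile" }))
		}()
	}
	wg.Wait()
}
```

<p>The real singleflight adds error returns and a few other niceties, but this is the whole trick: one expensive fetch amortized over every caller who arrives while it&rsquo;s in flight.</p>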
<p><a name="cached" href="#cached">#</a> <strong>Writing cached data back to persistent storage is bad.</strong> This happens in
more systems than you&rsquo;d think. Especially ones originally designed by people
less experienced in distributed systems. Many systems you&rsquo;ll inherit will have
this flaw. If the implementers talk about &ldquo;Russian-doll caching&rdquo;, you have a
large chance of hitting highly visible bugs. This entry could have been left
out of the list, but I have a special hate in my heart for it. A common
presentation of this flaw is user information (e.g. screennames, emails, and
hashed passwords) mysteriously reverting to a previous value. <a href="#cached">#</a></p>
<p><a name="domore" href="#domore">#</a> <strong>Computers can do more than you think they can.</strong> In the field today, there&rsquo;s
plenty of misinformation about what a machine is capable of from practitioners
who do not have a great deal of experience.</p>
<p>At the end of 2012, a light web server had 6 or more processors, 24 GB of
memory and more disk space than you can use. A relatively complex <a href="http://en.wikipedia.org/wiki/Create,_read,_update_and_delete">CRUD</a>
application in a modern language runtime on a single machine is trivially
capable of doing thousands of requests per second within a few hundred
milliseconds. And that&rsquo;s a deep lower bound. In terms of operational ability,
hundreds of requests per second per machine is not something to brag about in
most cases.</p>
<p>Greater performance is not hard to come by, especially if you are willing to
profile your application and introduce efficiencies based on your
measurements. <a href="#domore">#</a></p>
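<p>In Go, one low-ceremony way to get those measurements is the standard
library&rsquo;s <code>runtime/pprof</code>; <code>busyWork</code> here is just a
stand-in for whatever code path you suspect is slow:</p>
<pre><code>package main

import (
    &quot;bytes&quot;
    &quot;fmt&quot;
    &quot;runtime/pprof&quot;
)

// busyWork is a stand-in for the code you want to measure.
func busyWork() int {
    sum := 0
    for i := 0; i &lt; 5000000; i++ {
        sum += i % 7
    }
    return sum
}

func main() {
    var buf bytes.Buffer
    if err := pprof.StartCPUProfile(&amp;buf); err != nil {
        panic(err)
    }
    busyWork()
    pprof.StopCPUProfile()

    // buf now holds a profile you can write to disk and inspect
    // with `go tool pprof`.
    fmt.Println(&quot;got profile:&quot;, buf.Len() &gt; 0)
}
</code></pre>
<p>For long-running servers, importing <code>net/http/pprof</code> exposes the
same profiles over HTTP.</p>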
<p><a name="cap" href="#cap">#</a> <strong>Use the CAP theorem to critique systems.</strong> The CAP theorem isn&rsquo;t something
you can build a system out of. It&rsquo;s not a theorem you can take as a first
principle and derive a working system from. It&rsquo;s much too general in its
purview, and the space of possible solutions too broad.</p>
<p>However, it is well-suited for critiquing a distributed system design, and
understanding what trade-offs need to be made. Taking a system design and
iterating through the constraints CAP puts on its subsystems will leave you
with a better design at the end. For homework, apply the CAP theorem&rsquo;s
constraints to a real world implementation of Russian-doll caching.</p>
<p>One last note: Out of C, A, and P, you <a href="http://codahale.com/you-cant-sacrifice-partition-tolerance/">can&rsquo;t choose CA</a>. <a href="#cap">#</a></p>
<p><a name="services" href="#services">#</a> <strong>Extract services.</strong> &ldquo;Service&rdquo; here means &ldquo;a distributed system that
incorporates higher-level logic than a storage system and typically has a
request-response style API&rdquo;. Be on the lookout for code changes that would be
easier to do if the code existed in a separate service instead of in your
system.</p>
<p>An extracted service provides the benefits of encapsulation typically
associated with creating libraries. However, extracting out a service improves
on creating libraries by allowing for changes to be deployed faster and easier
than upgrading the libraries in its client systems. (Of course, if the
extracted service is hard to deploy, the client systems are the ones that
become easier to deploy.) This ease is owed to the fewer code and operational
dependencies in the smaller, extracted service. The strict boundary it creates
also makes it harder to &ldquo;take shortcuts&rdquo; that a library allows for. These
shortcuts almost always make it harder to migrate the internals or the client
systems to new versions.</p>
<p>The coordination costs of using a service are also much lower than those of a shared
library when there are multiple client systems. Upgrading a library, even with
no API changes needed, requires coordinating deploys of each client
system. This gets harder when data corruption is possible if the deploys are
performed out of order (and it&rsquo;s hard to predict whether that will
happen). Upgrading a library also has a higher social coordination cost than
deploying a service if the client systems have different maintainers. Getting
others aware of and willing to upgrade is surprisingly difficult because their
priorities may not align with yours.</p>
<p>The canonical service use case is to hide a storage layer that will be
undergoing changes. The extracted service has an API that is more convenient,
and reduced in surface area compared to the storage layer it fronts. By
extracting a service, the client systems don&rsquo;t have to know about the
complexities of the slow migration to a new storage system or format and only
the new service has to be evaluated for bugs that will certainly be found with
the new storage layout.</p>
<p>There are a great deal of operational and social issues to consider when doing
this. I cannot do them justice here. Another article will have to be written. <a href="#services">#</a></p>
<p><em>Much love to my reviewers <a href="https://twitter.com/dehora">Bill de hÓra</a>, <a href="https://twitter.com/coda">Coda Hale</a>, <a href="https://twitter.com/jdmaturen">JD
Maturen</a>, <a href="https://twitter.com/nora">Micaela McDonald</a>, and <a href="https://twitter.com/tnm">Ted Nyman</a>. Your
insight and care were invaluable.</em></p>
<p><strong>Update</strong> (2016-08-15): I&rsquo;ve added permalinks for each section and cleaned up
some text in the sections on coordination, data-locality, feature flags, and
backpressure.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn:1">Of course, Rotem-Gal-Oz&rsquo;s <a href="http://www.rgoarchitects.com/Files/fallacies.pdf">take on the fallacies</a> is very good.
<a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li>
</ol>
</div>
<h1><a href="http://somethingsimilar.com/2012/07/23/monorail/">Monorail (n., jargon)</a></h1>
<p><em>2012-07-23</em></p>
<p><strong>monorail</strong> <em>(n., jargon)</em> - A <a href="http://www.laputan.org/mud/mud.html#Abstract">Big Ball of
Mud</a> codebase that began as a
web application (esp. one written in a dynamic language) but grew into other
responsibilities beyond serving HTTP traffic to users. A monorail is created
when a webapp does not have its responsibilities moved to other services as
the demands of the product grow (e.g. offline processing, managing complex
data storage, user authentication). Coined by <a href="http://twitter.com/jeanpaul">Jean-Paul
Cozzatti</a> in 2010, &ldquo;monorail&rdquo; is a portmanteau of
the words &ldquo;monolithic&rdquo; and &ldquo;Ruby on Rails&rdquo;.</p>
<p>While it&rsquo;s difficult to determine the exact point that a webapp becomes a
&ldquo;monorail&rdquo;, a webapp is certainly a monorail when it begins to write
to and read from a distributed queue.</p>
<blockquote>
<p><em>&ldquo;Don&rsquo;t put that in the monorail.&rdquo;</em></p>
<p><em>&ldquo;It&rsquo;ll save us weeks of developer time if we do it in the monorail.&rdquo;</em></p>
<p><em>&ldquo;Nobody owned anything in the monorail because everyone&rsquo;s code touched
everyone else&rsquo;s.&rdquo;</em></p>
</blockquote>
<h1><a href="http://somethingsimilar.com/2012/05/24/finding-go.crypto-and-go.net/">Finding go.crypto and go.net</a></h1>
<p><em>2012-05-24</em></p>
<p>It&rsquo;s kind of a pain in the ass to find the go.crypto and go.net packages and
there&rsquo;s wonderful goodies in both of them. To ease that, I&rsquo;m writing this up
with some pointers to them and some small discussion on a few of my favorite
libraries within them.</p>
<h2 id="go-crypto">go.crypto</h2>
<p>The <code>go.crypto</code> source code is <a href="http://code.google.com/p/go/source/checkout?repo=crypto">available in the go project</a>. Note
the drop down that lets you pick it or the other &ldquo;subrepos&rdquo; of the Go project
proper.</p>
<p>To install one of the libraries (let&rsquo;s call it $LIB) in <code>go.crypto</code>, run:</p>
<pre><code>go get code.google.com/p/go.crypto/$LIB
</code></pre>
<p>All of the libraries in <code>go.crypto</code> and <code>go.net</code> are written entirely in
Go. The documentation for all of the <code>go.crypto</code> libraries is <a href="http://gopkgdoc.appspot.com/pkg/code.google.com/p/go.crypto">available on
gopkgdoc</a>.</p>
<h3 id="crypto-bcrypt">crypto/bcrypt</h3>
<p>I&rsquo;m going down my favorites alphabetically and that just so happens to mean
that the library I wrote is first. Fancy that.</p>
<p><code>crypto/bcrypt</code> is an implementation of the <code>bcrypt</code> algorithm. <code>bcrypt</code>
is an easy-to-use and very secure means of hashing passwords (and other
secrets) such that they cannot be reversed and are very difficult to brute
force. Additionally, <code>bcrypt</code> allows you to specify how difficult the hashed
password should be to brute force (and, therefore, how difficult it is to hash
later when, say, a user logs in). It also provides a means of migrating your
data to a more difficult cost as Moore&rsquo;s law takes hold by embedding the cost
you specified as part of the generated hash.</p>
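<p>Because the cost is embedded in the hash, a login path can detect and
upgrade weak hashes while the plaintext is briefly in hand.
<code>rehashIfWeak</code> below is a hypothetical helper sketching the idea;
<code>bcrypt.Cost</code> is the package function that reads the cost back out
of a generated hash:</p>
<pre><code>import (
    &quot;code.google.com/p/go.crypto/bcrypt&quot;
)

// rehashIfWeak returns a replacement hash when the stored hash's
// embedded cost has fallen below targetCost, and the original otherwise.
func rehashIfWeak(hashed, password []byte, targetCost int) ([]byte, error) {
    cost, err := bcrypt.Cost(hashed)
    if err != nil {
        return nil, err
    }
    if cost &gt;= targetCost {
        return hashed, nil // already strong enough
    }
    return bcrypt.GenerateFromPassword(password, targetCost)
}
</code></pre>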
<p>Using <code>crypto/bcrypt</code> is straightforward. To generate a new hash from a user&rsquo;s
password to be stored in a database:</p>
<pre><code>import (
    &quot;code.google.com/p/go.crypto/bcrypt&quot;
)

func hashMyPassword(password []byte) ([]byte, error) {
    // bcrypt.DefaultCost can be substituted for any number between
    // bcrypt.MinCost and bcrypt.MaxCost, inclusive.
    return bcrypt.GenerateFromPassword(password, bcrypt.DefaultCost)
}
</code></pre>
<p>When checking whether a <code>bcrypt</code> hash matches a password given by a user,
you MUST use <code>bcrypt.CompareHashAndPassword</code>, which is cryptographically
secure (using the lovely <a href="http://golang.org/pkg/crypto/subtle"><code>crypto/subtle</code> package</a> in the standard
library).</p>
<p><code>bcrypt.CompareHashAndPassword</code> returns nil when the passwords match, and an
error otherwise. This is a little odd, but you&rsquo;ll only use it in one or two
places in your system.</p>
<p>You MUST NOT use <code>bytes.Equal</code> to compare the returned <code>[]byte</code> with what is
in your database. Using the naive equality check will make your service
<a href="http://codahale.com/a-lesson-in-timing-attacks/">susceptible to timing attacks</a>.</p>
<p>Example usage:</p>
<pre><code>import (
    &quot;code.google.com/p/go.crypto/bcrypt&quot;
)

func isCorrectPassword(user *User, password []byte) bool {
    hashedPassword := user.HashedPassword()
    return bcrypt.CompareHashAndPassword(hashedPassword, password) == nil
}
</code></pre>
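<p>The <code>crypto/subtle</code> primitive underneath is worth knowing on its
own for the times you compare raw secrets (API tokens, MACs) directly rather
than bcrypt hashes. <code>subtle.ConstantTimeCompare</code> examines every
byte regardless of where the first mismatch occurs, returning 1 for equal
inputs and 0 otherwise:</p>
<pre><code>package main

import (
    &quot;crypto/subtle&quot;
    &quot;fmt&quot;
)

func main() {
    a := []byte(&quot;secret-token&quot;)
    b := []byte(&quot;secret-token&quot;)
    c := []byte(&quot;secret-tokex&quot;)

    // Timing reveals nothing about where (or whether) the inputs differ.
    fmt.Println(subtle.ConstantTimeCompare(a, b))
    fmt.Println(subtle.ConstantTimeCompare(a, c))
}
</code></pre>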
<p>Installation of <code>crypto/bcrypt</code>:</p>
<pre><code>go get code.google.com/p/go.crypto/bcrypt
</code></pre>
<p>The <a href="http://gopkgdoc.appspot.com/pkg/code.google.com/p/go.crypto/bcrypt">API documentation</a> of <code>crypto/bcrypt</code> is available on
gopkgdoc.</p>
<h3 id="crypto-ssh">crypto/ssh</h3>
<p><code>crypto/ssh</code> is an SSH client and server library. This API is too large to give
a great set of examples for, but I&rsquo;ll give the basics of its networking code.</p>
<p>The <code>crypto/ssh</code> package makes great use of the <code>net.Dial</code>, <code>net.Conn</code>, and
<code>net.Listener</code> patterns of building up network connections.</p>
<p>Making an SSH connection to a server and using it is easy:</p>
<pre><code>import (
    &quot;code.google.com/p/go.crypto/ssh&quot;
    &quot;net&quot;
    &quot;net/http&quot;
)

func tunnelToPrivateServer() (*http.Response, error) {
    config := &amp;ssh.ClientConfig{...}
    client, err := ssh.Dial(&quot;tcp&quot;, &quot;example.com:22&quot;, config)
    if err != nil {
        return nil, err
    }
    defer client.Close()

    // client is a *ssh.ClientConn that satisfies the net.Conn interface,
    // but it can also be used to tunnel to private resources.
    tr := &amp;http.Transport{
        Dial: func(network, addr string) (net.Conn, error) {
            return client.Dial(network, addr)
        },
    }
    httpClient := &amp;http.Client{Transport: tr}
    return httpClient.Get(&quot;http://private.example.com/secrets.txt&quot;)
}
</code></pre>
<p>Setting up an SSH client terminal or server is slightly more complicated and I
defer to the helpful examples in the documentation for <a href="http://go.pkgdoc.org/code.google.com/p/go.crypto/ssh#Dial">ssh.Dial</a>
and <a href="http://go.pkgdoc.org/code.google.com/p/go.crypto/ssh#Listen">ssh.Listen</a>.</p>
<p>Installation of <code>crypto/ssh</code>:</p>
<pre><code>go get code.google.com/p/go.crypto/ssh
</code></pre>
<p>The <a href="http://go.pkgdoc.org/code.google.com/p/go.crypto/ssh">API documentation</a> of <code>crypto/ssh</code> is available on gopkgdoc.</p>
<h2 id="go-net">go.net</h2>
<p>The <code>go.net</code> source code is <a href="http://code.google.com/p/go/source/checkout?repo=net">available in the go project</a>.</p>
<p>To install one of the libraries (let&rsquo;s call it $LIB) in <code>go.net</code>, run:</p>
<pre><code>go get code.google.com/p/go.net/$LIB
</code></pre>
<p>The documentation for all of the <code>go.net</code> libraries is <a href="http://gopkgdoc.appspot.com/pkg/code.google.com/p/go.net">available on
gopkgdoc</a>.</p>
<h3 id="net-spdy">net/spdy</h3>
<p><code>net/spdy</code> is a library implementing the SPDY protocol. As of this writing, it
implements version 2 of the protocol. Unfortunately, I&rsquo;ve not had the chance
to play with this library yet, so I&rsquo;m going to skip making an example. I&rsquo;d
recommend watching that space.</p>
<p>Installation of <code>net/spdy</code>:</p>
<pre><code>go get code.google.com/p/go.net/spdy
</code></pre>
<p>The <a href="http://go.pkgdoc.org/code.google.com/p/go.net/spdy">API documentation</a> of <code>net/spdy</code> is available on gopkgdoc.</p>