Adventures in the High-Tech Underbelly


One technique we explored in my team’s language work is something we call “fail-fast.” The idea is that a program fails as quickly as possible after a bug has been detected. This idea isn’t really all that novel, but the ruthless application of it, perhaps, was.

There are several sources of fail-fast in our system:

Contract violation.

Runtime assertion violation.

Null dereference.

Out of memory.

Stack overflow.

Divide by zero.

Arithmetic overflow.

The funny thing is that if you look at 90% of the exceptions thrown in .NET, they are due to these circumstances.
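To make the idea concrete, here is a small sketch of a fail-fast contract check. The research language builds contracts into the type system; since it isn't public, this is plain Java, with an invented `Contract` helper and `Runtime.halt` standing in for true fail-fast (no unwinding, no finally blocks):

```java
// Hypothetical Contract helper, for illustration only. On violation it
// terminates immediately, preserving the crash state for debugging rather
// than letting the stack unwind.
final class Contract {
    static void requires(boolean condition, String message) {
        if (!condition) {
            System.err.println("Contract violation: " + message);
            Runtime.getRuntime().halt(1);
        }
    }
}

final class Account {
    private long balanceCents;

    Account(long initialCents) {
        Contract.requires(initialCents >= 0, "initial balance must be non-negative");
        balanceCents = initialCents;
    }

    void withdraw(long amountCents) {
        Contract.requires(amountCents > 0, "amount must be positive");
        Contract.requires(amountCents <= balanceCents, "insufficient funds");
        balanceCents -= amountCents;
    }

    long balance() { return balanceCents; }
}
```

A caller who passes a negative amount doesn't get a catchable exception to swallow; the process dies at the exact point the bug was detected.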

In my experience, developers usually end up doing one of two things in response to such a failure condition:

Catch and proceed, usually masking a bug and making it harder to detect later on.

Let the process crash, after the stack has unwound and finally blocks have run, potentially losing important debugging state.

I suppose there’s a third, which is legitimately catch and handle the exception, but it’s so rare I don’t even want to list it. These are usually programming errors that should be caught as early as possible. The “catch and ignore” discipline is an enormous broken window.
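Here is a contrived Java sketch of the "catch and proceed" broken window, with hypothetical names. The masked version quietly produces a wrong answer; the fail-fast version surfaces the bug at the earliest point of detection:

```java
import java.util.HashMap;
import java.util.Map;

final class PriceLookup {
    private final Map<String, Integer> prices = new HashMap<>();

    void add(String sku, int cents) { prices.put(sku, cents); }

    // "Catch and proceed": a missing SKU triggers a NullPointerException on
    // unboxing, which is swallowed. The order total is now silently wrong.
    int totalMasked(String[] skus) {
        int total = 0;
        for (String sku : skus) {
            try {
                total += prices.get(sku);
            } catch (NullPointerException e) {
                // Swallowed: the bug will resurface later as bogus output.
            }
        }
        return total;
    }

    // Fail-fast: surface the programming error immediately.
    int totalFailFast(String[] skus) {
        int total = 0;
        for (String sku : skus) {
            Integer price = prices.get(sku);
            if (price == null) {
                throw new IllegalStateException("unknown SKU: " + sku);
            }
            total += price;
        }
        return total;
    }
}
```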

Exceptions have their place. But it’s really that 10% scenario, where things operate on IO, data, and/or are restartable (e.g., parsers).

As we applied fail-fast to existing code-bases, sure enough, we found lots of bugs. This includes not just exception-based mishaps, but also return-code-based ones. One program we ported was a speech server. It had a routine that had been swallowing HRESULTs for several years, but nobody noticed. Sadly this meant Taiwanese customers saw an 80% error rate. Fail-fast put it in our faces.

You might question putting arithmetic overflows in this category. Yes, we use checked arithmetic by default. Interestingly, this was the most common source of stress failures our team saw. (Thanks largely to fail-fast, but also to the safe concurrency model, which eliminated race conditions and deadlocks…but I digress.) How annoying, you might say? No way! Most of the time, a developer really didn’t expect overflow to be possible, and the silent wrap-around would have produced bogus results. In fact, a common source of security exploits these days comes from triggering integer overflows. Better to throw it in the developer’s face, and let them decide whether to opt into unchecked arithmetic.
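The difference is easy to demonstrate. C# has `checked`/`unchecked` blocks for this; in Java, ordinary `*` wraps silently while `Math.multiplyExact` throws, so this sketch uses the latter to play the role of checked-by-default:

```java
final class Checked {
    // Unchecked: multiplication silently wraps around on overflow,
    // producing a bogus (and potentially exploitable) result.
    static int unchecked(int a, int b) {
        return a * b;
    }

    // Checked: throws ArithmeticException instead of wrapping.
    static int checkedMul(int a, int b) {
        return Math.multiplyExact(a, b);
    }
}
```

A classic exploit shape is computing a buffer size as `count * elementSize`: wrap-around yields a tiny allocation that later code happily overruns.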

Out of memory is another case that sits right on the edge. Modern architectures tend to tolerate failure (e.g., being restartable, journaling state, etc.), rather than going way out of their way to avoid it, so OOM hardening tends to become rarer and rarer with time. Hence, the default of fail-fast is actually the right one for most code. But for some services — like the kernel itself — this may be inappropriate. How we handled this is a blog post of its own.

We are actively investigating applying the fail-fast discipline to C# and .NET more broadly. For that, please stay tuned. However, even in the absence of broad platform support, the discipline is an easy one to adopt in your codebase today.

No matter how smart of a leader you are, you’re going to be wrong sometimes. Often even. And you won’t always have the best ideas.

There are three reasons for this.

First, the numbers are against you: no single person can out-think an entire team.

Second, the people in your group have way more data than you possibly can. That’s not to say every person has more data — I’m continually amazed at how much data can be scoured and retained by some of the best leaders I’ve worked with — but statistically speaking, you’ll have blind spots, and certain people will significantly outmatch you on depth every time. Memories and time being finite, and all.

Third, there are people with better ideas, who are smarter than you, in your group, anyway.

It’s critical to create an environment where the best ideas are heard, discussed, and ultimately able to grow into things that are much bigger. Your company’s next big success could be sitting right under your nose, and in an environment where those ideas have no voice, you lose doubly: first, you don’t capitalize on the idea; and, second, the person is likely to take their ideas elsewhere. Smart and creative people need outlets.

I also think it’s imperative to have an open mind about ideas. If you weren’t the one to have the idea, one possibility is that it’s just a bad idea. Unfortunately, too many people assume this. More likely, there’s some aspect of the thinking behind it that you don’t truly appreciate. Thinking about problems from different angles is important. One trap I see time and time again is eager dismissal of an idea based on your biases: past experience (which may be less relevant today than it was before), an iron grip on the strategic direction of a group (like obsessing over “making money” when what you really need is to make some key “architectural” investments that lay a foundation which pays off later), and so on.

In some groups, the dictator model can “work.” I put work in quotes because these tend to be conservative and constrained environments where keeping innovation low is desirable. I can’t say it’s anywhere I’d want to work, though.

The worst thing leaders can do is slide into the dictator model. It can happen subtly and innocuously, and worse, slowly over time. Naturally, as a leader, you want to delegate decision-making. But it’s critical to keep a pulse on the organization to ensure that decisions and ideas aren’t all being generated in some top-down manner. If they are, it’s a sign that some deeper cultural problem is afoot. Too many high-level managers care more about the “what” than the “how”; how a team works — collaborating, generating ideas, engineering culture — is even more important than what it is building, because a healthy culture not only ensures success in the current project, but also equips the team to tackle future projects with agility.

Don’t confuse this with other models that appear dictatorial in nature, like Steve Jobs at Apple, or Bill Gates in the early days of Microsoft. That model can work well: it puts one person on the lookout to ensure that everything the company does is the best it can be, is consistent, and delivers in line with the organization’s strategy. Steve Jobs was incredibly open to ideas, and in fact that fueled Apple in a big way, even though he was quick to tear up the bad ones (and, I’m sure, a few good ones along the way). Having bottom-up innovation doesn’t mean anybody does whatever they want, but it does mean every idea gets its fair day in court. Google and Facebook get this.

I work in a team where the microkernel is developed in close partnership with the backend code-generator. Where the language is developed in close partnership with the class libraries. Where it’s just as common for a language designer to comment on the best use of the language style in the implementation of the filesystem, as it is for such an exchange in the context of the web browser, as it is for a device driver developer to help influence the overall async programming model’s design to better suit his or her scenarios.

In fact, the developers I love most are those who will go make a change to the language, ensure the IDE support works great, plumb the change through to the backend code-generator to ensure optimal code-quality, whip up a few performance tests along the way, deploy the changes to the class libraries so that they optimally use them (and on up through to the consumers of those libraries in applications), write the user-facing documentation for how to use this new language feature, … and beyond. All in a few days’ work.

It takes real guts. The best programmers are absolutely fearless.

I call this “codevelopment.” The idea is that you’re designing and building the system as a whole, and ensuring each part works well with all other parts as you go. It’s a special case of “eat your own dogfood.” Codevelopment is a key part of a startup culture. Big companies can afford to over-compartmentalize responsibilities, but little companies usually can’t. (And those that go overboard doing so don’t last long).

Obviously my present situation is a bit unique. Not everybody works on the development platform and operating system and everything in between, all at once. But there’s more opportunity for this than one might think; in fact, it’s everywhere. It can be a website or app’s UI vs. business logic, hardware vs. software, the engineering system vs. the product code, operations vs. testing vs. development, etc. Most people have sacred lines that they won’t cross. It saddens me when these lines are driven by organizational boundaries, when engineers should be knocking the lines down and collaborating across them.

A great example of wildly successful codevelopment is Apple’s products. They have always developed the hardware in conjunction with the software, focusing on the end-to-end user experience. Most companies disperse these responsibilities out across disparate organizations (or even separate companies!) without any one person really in charge of the end-to-end thing. And it shows.

A good test is: if you’re designing some system, or implementing a feature, do you ever hit an edge where you think you could come up with some great solution, but intentionally don’t because you think “person X is supposed to do this,” “that team over there would never accept it,” etc.? Or worse, “that’s not my job?” A special case of the latter is “I’m an X developer, and that is a Y component” (example for X: compiler, example for Y: networking). What an incredible opportunity to learn more about Y, and collaborate closely with some new colleagues, that is all-too-often missed! It’s almost like intentionally dumbing oneself down. I am aware of Conway’s Law — and teams exist for a reason (to lump together closely related work) — but the reality is the organization almost always lags behind the technology. Technology direction should shape the organization, not vice versa. Communication structures need to be put in place to facilitate this.

The technology suffers in a compartmentalized world too.

Thinking in terms of a series of black boxes stitched together leaves opportunities on the floor, whether economies of scale or opportunities for innovation, particularly if nobody is responsible for looking end-to-end across those boundaries to ensure they make sense. Abstractions afford a degree of independence, but I regularly step back and wonder, “what is this abstraction costing me? what is it gaining me? is the tradeoff right?” Abstractions are great, so long as they are in the right place, and not organizationally motivated. The biggest sin is to knowingly create a lesser quality solution, simply for fear of crossing or dealing with the engineers on the other side of such an abstraction boundary.

Codevelopment is just as much about building the right architecture, as it is validating that architecture and its implementation as you go. If you are forced to think about – and even suffer the consequences of – the resulting code quality of a language feature you just wrote, and you are forced to see it in action and get feedback from its target audience by actually integrating the feature in some “real world” code, you are less apt to sweep problems under the rug. Especially if you have the right measures in place. I’ve been guilty numerous times in the past of hacking together some cool feature, and then moving on to the next one, only to find out (usually right before shipping to customers) that it didn’t work in the real world. I keep focusing on developer platform examples, but obviously these ideas extend well beyond developer platforms.

Another way to think about codevelopment is as a kind of “pre-flighting” for your changes. If you worked at Facebook, you wouldn’t just flip the switch to 100% instantly on some new timeline view, right? You’d want to do some A/B testing, make sure on your own that the change is going in the right direction, that its performance meets your expectations, etc., and to only commit once you’ve had sufficient telemetric validation.
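A toy sketch of what percentage-based rollout looks like in code. This is nothing like Facebook's actual infrastructure, and the names are invented; the one property worth noticing is that bucketing is a pure function of the user and feature, so a given user gets a stable experience as the percentage ramps up:

```java
final class Rollout {
    // Deterministically bucket a user into [0, 100) for a given feature,
    // and enable the feature if the bucket falls under the rollout percent.
    static boolean enabled(String userId, String feature, int percent) {
        int bucket = Math.floorMod((userId + ":" + feature).hashCode(), 100);
        return bucket < percent;  // same inputs always land in the same bucket
    }
}
```

Ramping from 1% to 5% to 50% then just means changing `percent` while telemetry confirms the change is healthy.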

Now obviously there is a limit to what’s reasonable, even just considering pure engineering costs. You’d be surprised at how cost effective codevelopment can be, however, even if there’s some ramp-up along the way. By the time you hit one of these boundaries, you’ve built up a ton of momentum and context. You probably even have an idea of how to just do what needs to get done. If you stop and offload that to another group, then you need to transfer all that momentum and context, which takes considerable time and energy. Clearly the equation doesn’t always work out in codevelopment’s favor, depending on the complexity of the code on the other side of the boundary and skill of the engineer involved, but it’s worth stopping to think. It’s just software, after all.

After doing codevelopment at scale for the past five years, frankly, I couldn’t imagine going back. Always be on the lookout for opportunities to build your system on your system.

One of the first questions I ask when joining a new team is: Where do code reviews happen?

The answer, and experience of joining in on such reviews, instantly tells you a lot about a team:

Is there engineering rigor?

How open is the team, e.g. is it more of a “closed team” or “everyone is welcome” kind of place?

Does the team culture embrace technical debate and discussion?

Who are the kickass programmers on the team?

What is the work ethic of the team, e.g. do people check in around the clock, including on weekends, or is it strictly 9-5?

Related, is the team productive?

What are we actually building, and how do developers spend their time? Are we moving in a consistent direction?

Do people have their own comfort zones, or do developers collaborate across the whole codebase? What are the specific zones?

How much energy is spent on writing new code, versus fixing existing code (bugs)?

Is the environment more of a prototype and learn as you go one, or do checkins always come with buttoned up design specs?

Do developers pay attention to things like performance when writing their code, e.g. do they often cite the results of measurements?

How are our engineering tools, especially around code reviews and code sharing, and are they working well?

How well do individuals communicate their ideas, e.g. are checkin notes terse and unintelligible, or articulate and thoughtful?

And so very much more.

As a technical leader and manager, code reviews and checkins are the heartbeat of your team. Reading them religiously — although admittedly time consuming — is an absolute requirement for truly understanding what the team is doing and its strengths and weaknesses. If you’re joining a new team, it puts you right at the heart of the engineering dialogue, and in a position to start fixing whatever is broken (opening it up, encouraging debate, changing technical direction, etc). When it comes time for calibration meetings, you’ll already have a deep awareness of who’s actually getting stuff done, and who is writing the quality code. You’ll see the technical leaders very visibly, including who is really setting the pace for the rest of the team (coding output, good feedback, work ethic, etc).

And from time to time, you may even find the opportunity to offer up a small suggestion yourself. Some might see this as micromanagement, however I’ve found that developers sincerely appreciate when their boss or boss’s boss or whatever really cares enough to take the time to understand their work at this level of detail.

There’s also the converse of this, which is that it helps to keep the team on its toes.

I even recommend that other developers on the team go out of their way to read code reviews and checkins in areas totally unrelated to their day jobs. This helps for the same reason it helps managers: you can learn from others, understand how the team operates and the expected level of output amongst different peer groups (or even those more senior than you), pick up tips and tricks, and so on. Even though I manage a group focused on a developer platform, you bet I go out of my way to read changes to device drivers, kernel, filesystem, networking, browser, etc. I always learn something new.

Now, of course, reading the code doesn’t tell you everything. It won’t give you a complete picture of the design and architecture. It won’t tell you who is going out of their way to collaborate with the team and helping others with their designs behind closed doors. It won’t always tell you who is not a team player (although more often than not it will). It won’t tell you whose code is most effective when it lands in the hands of customers. All of these things are critical, and must not become blind spots, so you’ll need to rely on other data points to supplement the code-oriented perspective.

But I really do believe that reading code is the most effective way to understand the inner workings of your team at a very intimate level. And hey, as the hair gets pointier over time, at least you can fantasize that it is you who is writing it. So much code, so little time!

What was meant to be an innocent blog post to ease into some open community dialogue has turned into, um, quite a bit more.

As is hopefully clear from my bio, the language I describe below is a research effort, nothing more, nothing less. Think of me as an MSR guy publishing a paper, it’s just on my blog instead of appearing in PLDI proceedings. I’m simply not talented enough to get such papers accepted.

I do expect to write more in the months ahead, but all in the spirit of opening up collaboration with the community, not because of any “deeper meaning” or “clues” that something might be afoot. Too much speculation!

I love to see the enthusiasm, so please keep the technical dialogue coming. The other speculation could die silently and I’d be a happy man.

—

My team has been designing and implementing a set of “systems programming” extensions to C# over the past 4 years. At long last, I’ll begin sharing our experiences in a series of blog posts.

The first question is, “Why a new language?” I will readily admit that the world already has a plethora of them.

I usually explain it as follows. If you were to draw a spectrum of popular languages, with axes being “Safety & Productivity” and “Performance,” you might draw it something like this:

(Please take this with a grain of salt. I understand that Safety != Productivity (though they certainly go hand-in-hand — having seen how much time and energy is typically spent with safety bugs, lint tools, etc.), that there are many kinds of safety, etc.)

Well, I claim there are really two broad quadrants dominating our language community today.

In the upper-left, you’ve got garbage collected languages that place a premium on developer productivity. Over the past few years, JavaScript performance has improved dramatically, thanks to Google leading the way and showing what is possible. Recently, folks have done the same with PHP. It’s clear that there’s a whole family of dynamically typed languages that are now giving languages like C# and Java a run for their money. The choice is now less about performance, and more about whether you want a static type system.

This does mean that languages like C# are increasingly suffering from the Law of the Excluded Middle. The middle’s a bad place to be.

In the lower-right, you’ve got pedal-to-the-metal performance. Let’s be honest, most programmers wouldn’t place C# and Java in the same quadrant, and I agree. I’ve seen many people run away from garbage collection back to C++, with a sour taste permeating their mouths. (To be fair, this is only partly due to garbage collection itself; it’s largely due to poor design patterns, frameworks, and a lost opportunity to do better in the language.) Java is closer than C# thanks to the excellent work in HotSpot-like VMs which employ code pitching and stack allocation. But still, most hard-core systems programmers still choose C++ over C# and Java because of the performance advantages. Despite C++11 inching closer to languages like C# and Java in the areas of productivity and safety, it’s an explicit non-goal to add guaranteed type-safety to C++. You encounter the unsafety far less these days, but I am a firm believer that, as with pregnancy, “you can’t be half-safe.” Its presence means you must always plan for the worst case, and use tools to recover safety after-the-fact, rather than having it in the type system.

Our top-level goal was to explore whether you really have to choose between these quadrants. In other words, is there a sweet spot somewhere in the top-right? After multiple years of work, including applying this to an enormous codebase, I believe the answer is “Yes!”

The result should be seen more as a set of extensions to C# — with minimal breaking changes — than as a completely new language.

The next question is, “Why base it on C#?” Type-safety is a non-negotiable aspect of our desired language, and C# represents a pretty darn good “modern type-safe C++” canvas on which to begin painting. It is closer to what we want than, say, Java, particularly because of the presence of modern features like lambdas and delegates. There are other candidate languages in this space, too, these days, most notably D, Rust, and Go. But when we began, these languages had either not surfaced yet, or had not yet invested significantly in our intended areas of focus. And hey, my team works at Microsoft, where there is ample C# talent and community just an arm’s length away, particularly in our customer-base. I am eager to collaborate with experts in these other language communities, of course, and have already shared ideas with some key people. The good news is that our lineage stems from similar origins in C, C++, Haskell, and deep type-systems work in the areas of regions, linearity, and the like.

Finally, you might wonder, “Why not base it on C++?” As we’ve progressed, I do have to admit that I often wonder whether we should have started with C++, and worked backwards to carve out a “safe subset” of the language. We often find ourselves “tossing C# and C++ in a blender to see what comes out,” and I will admit at times C# has held us back. Particularly when you start thinking about RAII, deterministic destruction, references, etc. Generics versus templates is a blog post of subtleties in its own right. I do expect to take our learnings and explore this avenue at some point, largely for two reasons: (1) it will ease portability for a larger number of developers (there’s a lot more C++ on Earth than C#), and (2) I dream of standardizing the ideas, so that the OSS community also does not need to make the difficult “safe/productive vs. performant” decision. But for the initial project goals, I am happy to have begun with C#, not the least reason for which is the rich .NET frameworks that we could use as a blueprint (noting that they needed to change pretty heavily to satisfy our goals).

I’ve given a few glimpses into this work over the years (see here and here, for example). In the months to come, I will start sharing more details. My goal is to eventually open source this thing, but before we can do that we need to button up a few aspects of the language and, more importantly, move to the Roslyn code-base so the C# relationship is more elegant. Hopefully in 2014.

At a high level, I classify the language features into six primary categories:

1) Lifetime understanding. C++ has RAII, deterministic destruction, and efficient allocation of objects. C# and Java both coax developers into relying too heavily on the GC heap, and offer only “loose” support for deterministic destruction via IDisposable. Part of what my team does is regularly convert C# programs to this new language, and it’s not uncommon for us to encounter 30-50% of time spent in GC. For servers, this kills throughput; for clients, it degrades the experience, by injecting latency into the interaction. We’ve stolen a page from C++ — in areas like rvalue references, move semantics, destruction, references / borrowing — and yet retained the necessary elements of safety, and merged them with ideas from functional languages. This allows us to aggressively stack allocate objects, deterministically destruct, and more.
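The nearest mainstream analogs to this are C#’s `using` statement and Java’s try-with-resources, sketched here in Java. The difference is that these are opt-in and “loose,” whereas the research language reportedly makes destruction first-class; the `ScopedBuffer` type below is invented for illustration:

```java
// Deterministic cleanup at scope exit, rather than whenever the GC runs.
final class ScopedBuffer implements AutoCloseable {
    static int liveCount = 0;          // exposed only so the example is observable
    private final byte[] data;

    ScopedBuffer(int size) {
        data = new byte[size];
        liveCount++;
    }

    byte[] data() { return data; }

    @Override
    public void close() {              // runs deterministically at the closing brace
        liveCount--;
    }
}
```

Usage: `try (ScopedBuffer b = new ScopedBuffer(4096)) { /* use b.data() */ }` — the cleanup runs at the end of the block, deterministically, not at some future GC pause.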

2) Side-effects understanding. This is the evolution of what we published in OOPSLA 2012, giving you elements of C++ const (but again with safety), along with first class immutability and isolation.

3) Async programming at scale. The community has been ’round and ’round on this one, namely whether to use continuation-passing or lightweight blocking coroutines. This includes C# but also pretty much every other language on the planet. The key innovation here is a composable type-system that is agnostic to the execution model, and can map efficiently to either one. It would be arrogant to claim we’ve got the one right way to expose this stuff, but having experience with many other approaches, I love where we landed.
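The two camps look like this in Java terms (`CompletableFuture` playing the continuation-passing role; blocking-style code simply waits for the result). This only illustrates the surface difference between the models, not how the research language's type system abstracts over them:

```java
import java.util.concurrent.CompletableFuture;

final class AsyncStyles {
    static CompletableFuture<Integer> fetchLength(String s) {
        return CompletableFuture.supplyAsync(s::length);
    }

    // Continuation-passing: the "rest of the function" becomes a lambda
    // chained onto the future.
    static CompletableFuture<Integer> doubledCps(String s) {
        return fetchLength(s).thenApply(n -> n * 2);
    }

    // Lightweight-blocking style: straight-line code that waits for the
    // result (with virtual threads, the wait need not pin an OS thread).
    static int doubledBlocking(String s) {
        return fetchLength(s).join() * 2;
    }
}
```

The claim in the post is that one composable type can compile efficiently to either shape, so callers don't have to commit to a camp.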

4) Type-safe systems programming. It’s commonly claimed that with type-safety comes an inherent loss of performance. It is true that bounds checking is non-negotiable, and that we prefer overflow checking by default. It’s surprising what a good optimizing compiler can do here, versus JIT compiling. (And one only needs to casually audit some recent security bulletins to see why these features have merit.) Other areas include allowing you to do more without allocating. Like having lambda-based APIs that can be called with zero allocations (rather than the usual two: one for the delegate, one for the display class that holds the captured variables). And being able to easily carve out sub-arrays and sub-strings without allocating.
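A sketch of the sub-array idea: a slice "view" over an existing array that reads through without copying (C# later shipped a similar concept as `Span<T>`/`Memory<T>`; this Java `IntSlice` type is invented for illustration, and unlike the research language's version, the view object itself still allocates):

```java
final class IntSlice {
    private final int[] backing;
    private final int offset, length;

    IntSlice(int[] backing, int offset, int length) {
        // Bounds checking is non-negotiable: a slice can never escape
        // the region of the array it was carved from.
        if (offset < 0 || length < 0 || offset + length > backing.length)
            throw new IndexOutOfBoundsException();
        this.backing = backing;
        this.offset = offset;
        this.length = length;
    }

    int get(int i) {
        if (i < 0 || i >= length) throw new IndexOutOfBoundsException();
        return backing[offset + i];   // no copy: reads through to the original
    }

    int length() { return length; }

    // Sub-slicing composes without copying either.
    IntSlice sub(int start, int len) {
        return new IntSlice(backing, offset + start, len);
    }
}
```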

5) Modern error model. This is another one that the community disagrees about. We have picked what I believe to be the sweet spot: contracts everywhere (preconditions, postconditions, invariants, assertions, etc), fail-fast as the default policy, exceptions for the rare dynamic failure (parsing, I/O, etc), and typed exceptions only when you absolutely need rich exceptions. All integrated into the type system in a 1st class way, so that you get all the proper subtyping behavior necessary to make it safe and sound.

6) Modern frameworks. This is a catch-all bucket that covers things like async LINQ, improved enumerator support that competes with C++ iterators in performance and doesn’t demand double-interface dispatch to extract elements, etc. To be entirely honest, this is the area where we have the biggest list of “designed but not yet implemented features”, spanning things like void-as-a-1st-class-type, non-null types, traits, 1st class effect typing, and more. I expect us to have a handful in our mid-2014 checkpoint, but not very many.

Assuming there’s interest, I am eager to hear what you think, get feedback on the overall idea (as well as the specifics), and also find out what aspects folks would like to hear more about. I am excited to share, however the reality is that I won’t have a ton of time to write in the months ahead; we still have an enormous amount of work to do (oh, we’re hiring ;-)). But I’d sure love for y’all to help me prioritize what to share and in what order. Ultimately, I eagerly await the day when we can share real code. In the meantime, Happy Hacking!

I updated my blog software over the weekend to something worthy of the year 2013.

Aside from culling boatloads of spam comments that have accumulated over the years, all of the important content has been preserved.

I also added some permalink redirection goo so that old hyperlinks continue to work. That includes my old RSS URL, so hopefully those with feed readers won’t notice a hiccup in service. Although to be honest I have no idea whether, all of a sudden, almost a decade of posts will appear to have been newly posted yet again. I apologize for any disruption in the event that there is any.

As part of the transition, I’ve also begun using http://joeduffyblog.com as my hostname. Thanks to the permalink redirection, most of the old URLs should still get you to the right place. Feel free to update your RSS feeds to the new link at http://joeduffyblog.com/feed … or not, as the old RSS link will continue working indefinitely.

If you happen upon a broken link or something that doesn’t seem right, please do let me know. I am hosting this thing along with my mail servers myself, so we might hit some bumps.

And yes, this does mean that I intend to blog a whole lot more in the months to come.

What I am about to say admittedly flies in the face of common wisdom. But I grow more convinced of it by the day. Put simply, there ought not to be a distinction between software research and software engineering.

I’ll admit that I’ve seen Microsoft struggle with this at times, and that this is partly my motivation for writing this essay. An artificial firewall often exists between research and product organizations, a historical legacy more than anything else, having been the way that many of our industry’s pioneers have worked (see Xerox PARC, Bell Labs, IBM Research, etc). This divide has a significant cultural impact, however. And although I have seen the barriers being broken down in most successful organizations over time, we still have a long way to go. The reality is that the most successful research happening in the industry right now is happening in the context of real products, engineering, and measurements.

The cultural problem manifests in different ways, but the end result is the same: research that isn’t as impactful as it could be, and products that do not reach their full innovation potential.

One pet peeve of mine is the term “tech transfer.” This very phrase makes me cringe. It implies that someone has built a technology that must then be “transferred” into a different context. Instead of doing this, I would like to see well-engineered research being done directly within real products, in collaboration with real product engineers. Those doing research can run tests, measure things, and see whether – in an actual product setting – the idea worked well or not. By deferring this so-called “transfer”, in contrast, the research is always a mere approximation of what is possible. It may or may not actually work in practice.

Often product groups attempt to integrate so-called “incubation” efforts within their team, however it is seldom effective. The idea is to take research and morph it into a real product. I actually think that by doing joint research and engineering, we can fail faster on the ideas that sounded good on paper but didn’t quite pan out, while ensuring that the good ideas come to life quicker and with higher quality and confidence.

A common argument against this model is that “researchers have different skillsets than engineers.” It’s an easy thing to say, and almost believable, however I really couldn’t disagree more.

This mindset contributes to the cultural divide. I’ll be rude for a moment, and depict what happens in the extreme of total separation between research and engineering. Should that happen, people on the product side of things see researchers as living in an ivory tower, where ideas – though they make for interesting papers – never work out in practice. And people on the research side naturally prefer that they can more rapidly prototype ideas and publish papers, so that they can more quickly move to the next iteration of the idea. They can sometimes view the engineers as lacking brilliant ideas, or at least not recognizing the importance of what was written in papers. Unfortunately, though this characterization is oozing with cynicism, both parties may actually be correct! Because the research is done outside of a real product, the ideas need some “interpretation” in order to work. And of course it’s usually in the best interest of the engineers to stick to their own (admittedly less ambitious) ideas, given that they are more pragmatic and naturally constrained by the realities of the codebase they are working in.

Back in the age of think-tank software research organizations – such as Xerox PARC and Bell Labs – there truly was a large intellectual horsepower divide between the research and engineering groups. This was intentional. And so the split made sense. These days, however, the Microsofts and Googles of the world have just as many bright engineers with research-worthy qualifications (PhDs from MIT, CMU, Harvard, UCLA, etc.) as they have working in the research-oriented groups. The line is blurrier than ever.

Now, I do divide computer science research activities broadly into two categories. The first is theoretical computer science, and the second is applied computer science. I actually do agree that they require two very different skills. The former is mathematics. The latter isn’t really science per se; rather, it’s really about engineering. I also understand that the time horizon for the former is often unknowable, and does not necessarily require facing the realities of software engineering. It’s about creating elegant solutions to mathematical problems that can form the basis of software engineering in the future. There is often no code required – at most just a theoretical model of it. And yet this work is obviously incredibly important for the long-term advancement of our industry, just as theoretical mathematics is important to all industries known to mankind. This is the kind of science that has led to modern processor architecture, natural language processing, machine learning, and more, and is undoubtedly what will pave the way for enormous breakthroughs in the future like quantum computing.

But if you’re doing applied research and aren’t actively writing code, and as a result aren’t testing your ideas in a real-world setting, I seriously doubt the research is any good. It certainly isn’t going to advance the state of the art very quickly. Best case, the ideas really are brilliant, and – often a few years down the road – someone will discover and implement them. Worst case, and perhaps more likely, the paper will get filed under the “interesting ideas” bucket but never really change the industry. This is clearly not the most expedient way to impact the world, compared to just building the thing for real.

Coding, put simply, is the great equalizer.

Academia is, of course, a little different from industry, as there are frequently no “software assets” of long-lasting value within a particular university, and therefore certainly no easy way to directly measure the success of those assets. But I still think it is critical to engineer real systems when doing academic research. For academics, there are options. You can partner with a software company or contribute to open source, for example. Both offer a glimpse into real systems which will help to validate, refine, and measure the worth of a good idea as realized in practice.

I absolutely adore the story of how Thad Starner managed to walk this line perfectly. While researching wearable computing, he partnered with a company with ample resources (Google) to build a truly innovative product that was years ahead of the competition (Google Glass).

As you read on, I hope you agree that this dichotomy is beginning to make a bit less sense…

Now I love the idea of writing papers. Doing so is critical for sharing knowledge and growing our industry as a whole. We do far less of this than other industries, and as a result I believe the rate of advancement is slower than it could be. And as a company, I believe that Microsoft engineering groups do a very poor job of sharing their valuable learnings, whereas our research groups do an amazing job. I truly believe the usefulness of those papers would grow by an order of magnitude, however, if they covered this research in a true product setting. I believe that sharing information and sharing code is essential to the future growth of our industry, as it helps us all collectively learn from one another and enables us to better stand on the shoulders of giants. And if an idea fails in practice, we should understand why.

A lot of research organizations value code and building real things, but still keep the group separate from the engineering groups. The building-real-things part is a step in the right direction, but the tragedy is that most of the time such research ends with a “prototype”; at best, some number of months (or years!) later, the product team will have had a chance to incorporate those results. Perhaps it happened friction-free, but in most cases, changes in course are needed, new learnings are discovered, etc. What great additions these would make to the paper.

And, man, how painful is it to realize that you could have delivered real customer value and become a true technological trendsetter, but instead sat idle – in the worst case never delivering the idea beyond a paper, and in the best case delaying the delivery and thus handing your competitors an easy head start and a blueprint for cloning the idea. Even if you disagree with everything I say above, I doubt anybody would argue with me that the pace of innovation can be so much greater when research and engineering teams work more closely with one another.

The good news is that I see a very forceful trend in the opposite direction of the classical views here. With online services and an ecosystem where innovation is being delivered at an increasingly rapid pace, I do believe that mastering this lesson really will be a “life or death” thing for many companies.

Next time somebody says the word “research”, I encourage you to stop and ponder the distinction from engineering they are really trying to draw. Most likely, I assert, you will find that it’s unnecessary. And that by questioning it, you may find a creative way to get that innovative idea into the hands of real human beings faster.

I am naturally drawn to teams that work at an insane pace. The momentum, and persistent drive to increase that momentum, generates amazing results. And it’s crazy fun.

In such environments, however, I’ve found one thing to be a constant struggle for everybody on the team — leaders, managers, and individual doers alike: remembering to take the necessary time to do the right thing. This sounds obvious, but it’s very easy to lose sight of this important principle when deadlines loom, customers and managers and shareholders demand, and the overall team is running ahead at a breakneck pace.

A nice phrase I learned from a past manager of mine was, “sometimes you need to slow down to speed up.”

By taking shortcuts today, though attractive in that they help meet that next closest deadline, you almost always pay for them down the road. You might subsequently become mired in bugs because quality was compromised from the outset. You may create a platform that others build upon, only to realize later that the architecture is wrong and in need of revamping, incurring a ripple effect on an entire software stack. You may realize that your whole system performs poorly under load, such that just when your startup was beginning to skyrocket to success, users instead flee due to the poor experience. The manifestation differs, but the root cause is the same.

The level of quality you need for a project is very specific to your technology and business. I’ll admit that working on systems software demands different quality standards than web software, for example. And the quality demands change as a project matures, when the focus shifts from writing reams of new code to modifying existing code… although the early phases are in fact the most challenging: this is when the most critical cultural traits are not yet set but are developing, when things have the highest risk of getting set off in the wrong direction, and is when you are most likely to scrimp on quality due to the need to make rapid progress on a broad set of problems all at once.

So how do you ensure people end up doing the right thing? Well, I’d be lying if I didn’t say it is a real challenge.

As a leader, it is important to create a culture where individuals get rewarded for doing the right thing. Nothing beats having a team full of folks who “self-police” using a shared set of demanding principles.

To achieve this, leaders need to be consistent, demanding, and hyper-aware of what’s going on around them. You need to be able to recognize quality versus junk, so that you can reward the right people. You need to set up a culture where critical feedback is “okay” and “expected” when shortcuts are being taken. I’ve made my beliefs pretty evident in prior articles, but I simply don’t believe you can do this right in the early days without being highly technical yourself. As a team grows, your attention to technical detail may get stretched thin, in which case you need to scale by adding new technical leaders who share, recognize, and maintain or advance these cultural traits.

You also can’t punish people for getting less done than they could have if they took those shortcuts. Many cultures reward those who hammer out large quantities of poorly written code. You get what you reward.

In fact, you must do the opposite, by making an example out of the people who check in crappy code.

Facebook has this slogan, “move fast and break things.” It may seem that what I’m saying above is at odds with that famous slogan. Indeed they are somewhat contradictory, but paradoxically they are also highly complementary. Although you need to slow down to do the right thing, you do also need to keep moving fast. If that seems impossible, it’s not; but it sure is difficult to find the right balance.

I have a belief that I’m almost embarrassed to admit: I believe that most people are incredibly lazy. I think most quality compromise stems from an inherent laziness that leads to details being glossed over, even if they are consciously recognized as needing attention. The best developers maintain this almost supernatural drive that comes from somewhere deep within, and they use this drive to stave off the laziness. If you’re moving fast and writing a lot of code, strive to utilize every ounce of intellectual horsepower you can muster — sustained, for the entire time you are writing code. Even if that’s for 16 hours straight. If at any moment a thought occurs that might save you time down the road, stop, ponder it, course correct on the fly. This is a way of “slowing down to speed up” in which you can still be moving fast. Many lazier people let these fleeting thoughts go without exploring them fully. They will consciously do the wrong thing because doing the right thing takes more time.

I’ve developed odd habits over the years. As a compile runs, I literally pore over every modified line of code, wondering if there’s a better way to do it. If I see something, I push it on the stack and make sure to come back to it. By the time I’ve actually committed some new code — regardless of whether it’s 10,000 lines of freshly written code, or a 10-line modification to some existing stuff — chances are that I’ve read each line of code at least three times. I don’t allow any detail I notice to slip through the cracks. And my mind obsesses over all aspects of my work, even during “off times” (e.g., eating dinner, walking down the hallway, etc.). Each of these opportunities represents a chance to slow down, reflect, and course correct.

Do I still miss things? Sure I do. But that’s why it’s so critical to have a team around you that shares the same principles and will help to identify anything I’ve missed.

Another practice I encourage on my team is fixing broken windows. I’m sure folks are aware of the so-called broken windows theory, where neighborhoods in which broken windows are tolerated tend to accumulate more and more broken windows with time. It happens in code, too. If people are discouraged from stopping to fix the broken windows, you will end up with lots of them. And guess what, each broken window actually slows you down. As more and more accumulate, it can become a real chore to get anything meaningful done. I guarantee you will not be able to move very fast if too many broken windows pile up and start needing attention. Slowing down to fix them incrementally, as soon as they are noticed, speeds you up down the road.

Building a quality-focused team isn’t easy. But creating a culture that slows down to do the right thing, while simultaneously moving fast, provides an enormous competitive advantage. It’s not as common as you might think.

I mentioned a few months back that my team had collaborated with MSR to publish a paper at OOPSLA about some novel aspects of our programming language (see here and here).

I was excited when Jonathan over at InfoQ asked to interview me about this work. We had a fun back and forth, and I hope the result helps to clarify some of the design goals and decisions we made along the way.

It’s really hard to build a great team. It can take years of hard work and an enormous amount of patience.

The reality is that there’s only a finite (read: small) number of truly amazing software developers in the world, especially compared to the opportunities and exciting projects available to them.

And yet, great teams are fueled first and foremost by great people. I often liken this to the aphorism “a rising tide lifts all boats.”

The original meaning of the phrase of course had nothing to do with software. It was the notion that focusing on growth of the overall economy’s GDP will necessarily have a positive impact on the incomes of individuals within that economy. Now, of course, it’s not always true, and I’m no theoretical economist, but the basic idea is, in spirit, an intuitively interesting one.

Applying this thinking to teams, it implies you should always strive to hire better and better people. That by doing so, the overall quality of the team will rise. Hiring better and better people has a nonlinear impact on the culture, because a team is not just a disjoint set of nodes, but is instead a fully connected graph of individuals who have conversations and collaborate together. A greater overall quality of the team means richer connections and more powerful, higher quality innovation and software. It means your chance of truly changing the world has grown nonlinearly as well.

I strive to only hire people who are better than me, and better than people already on the team, in some interesting dimension. As soon as you let your high standards drop even an ounce, the average drops and there is a cumulative snowballing effect. The connections grow weaker, and a nonlinear drop in quality and innovation will occur. This is my nightmare scenario because it can go downhill very quickly.

This applies to an entire company as well as to individual teams – including what can happen should the tide lower. The brain drain begins as a slow drip, and can turn into a torrential downpour in an instant. It often starts from the top, because culture and hiring start from the top.

Now, I will be the first to admit that raising the tide is hard. Damn hard, in fact. I have another phrase, which is “always be on.” That incredible engineer you worked with ten years ago just might be the missing piece in the puzzle today, and a good way to lift the boats. Opportunities come and go when you least suspect them, and you want those people to want to join your team. I have several individuals that literally took years of effort to recruit. And the wait was well worth it. This advice applies to individual contributors as much as it does to managers. You never know if in a few years, you’ll be leading a team, kicking off your own startup, or even just helping to make your own team a better place.

And as a leader you owe all of this to your existing team. By lifting the boats, your entire team benefits. They grow, learn new things, and reach new heights in their own careers.

Despite being hard work, this all pays off in the end. There is very little I find more satisfying in life than building and growing a great team, seeing the year-over-year improvements, and creating amazing things together. Perhaps even more than coding. (gasp)