In his autobiography Rare Air: Michael on Michael, Michael Jordan talks a lot about teammates and teamwork. One page I distinctly remember has a picture of Jordan on the court in his red home jersey, bent over with his hands on his knees, while Scottie Pippen stands next to him with a hand placed on Jordan’s sweaty, shaved head. It was taken by acclaimed photographer Walter Iooss Jr. during the heated NBA playoffs in the early 90s, and it tells a thousand words.

There’s a weariness in their body language; they are exhausted, or very close to it. But there is also a sense of determination and connectedness, a feeling between the two players that they are going to fight through the challenges in front of them together, as a team. Jordan obviously carried the team, but he was never alone; it was as a team that they won the championship that year and for so many years after.

Another passage in the book quotes Michael talking about the pick-up games he’d play during practice or on the blacktop. He always played to win, no matter the stakes, no matter the players; such was his competitive drive. He said, and I paraphrase, “Give me a teammate with talent or give me a teammate with heart, and I will choose the teammate with heart every time.”

Talent will take one only so far, and relying on talent alone can often lead to excess ego. Someone with heart is usually someone who takes on any role for the team to succeed, someone who admits they don’t have all the answers, learns from others, and is hungry enough to be coached. Someone with heart has an internal fortitude to give that something extra when success is on the line. With heart comes determination and perseverance.

The photo of Jordan and Pippen embodies this sentiment. The Chicago Bulls were smart and lucky enough to build such a team around Jordan.

Building a startup requires people of similar character, especially at the founder level and with early hires. In order for a startup to grow and flourish, you need teammates willing to take on multiple roles, put in long, hard hours, and absorb and learn new things.

I don’t mean to belittle talent. Talent is important, critical for success, and talented people are hard to find. But talented people with the determination and perseverance to run through walls with you are even more difficult to find. They are what separates the wheat from the chaff.

Data used to be the realm of scientists, researchers, and mathematicians in government institutions, universities or massive corporations (think banks). Even in the latter case, and still today, data is often buried under layers of infrastructure and protocols.

The internet and increased bandwidth/accessibility over the past decade gave birth to the democratization of data. Okay, democratization is a big word, but you get the point. Data is being generated everywhere in various forms and is being collected and analysed by anyone who desires to do so. Even governments are opening up their troves of data to the public with the goal of greater transparency and efficacy.

In order to build 24/7/365 web-scale applications on top of this data, we need to think about data in a disciplined way. The system does not stand alone; data is part of the system. Testing and measuring the system’s response to different types of data and/or increased loads of data requires a framework that informs the engineering team.

To that end, here are eight parameters that should be accounted for when building systems that deal with high volume data. These are things to think about up and down the full stack, not just the “back-end” or the “front-end”.

1. Total data size. This is the amount of data the system reads and writes while operating. The proportion of data read to data written varies among systems. A site like NYTimes.com ostensibly performs many more reads (serving articles to each visitor) than writes, while an email application may need to handle more writes when accounting for spam. Some systems can generate massive amounts of secondary output such as log files. The growth of these outputs should also be taken into account.
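A back-of-envelope estimate is often enough to start. A minimal sketch, using entirely hypothetical numbers, that counts primary writes plus the secondary log output they generate:

```python
# Back-of-envelope estimate of daily data growth, including secondary
# output such as log files. All figures below are illustrative assumptions.

def estimated_daily_bytes(writes_per_day, avg_write_bytes, log_bytes_per_write):
    """Primary data written per day, plus the log output those writes generate."""
    primary = writes_per_day * avg_write_bytes
    logs = writes_per_day * log_bytes_per_write
    return primary + logs

# 1M writes/day of 2 KB records, each producing ~500 bytes of log lines:
total = estimated_daily_bytes(1_000_000, 2_000, 500)  # 2.5 GB/day
```

Running this kind of estimate for reads, writes, and logs separately makes the read/write proportion explicit before any hardware is provisioned.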

2. Growth and change. How fast is data added to the system? Can older data be edited? Can data be removed? If so, how often do these types of transactions occur? Knowing how fast your data grows and changes is critical to understanding system performance and tackling scalability. Set up an internal testing component that simulates increasing data loads and run your system against this to see how it responds to different operations.
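The internal testing component described above can start as something very small. A minimal sketch that times a representative operation against synthetic datasets of increasing size (the operation and sizes here are stand-ins):

```python
import random
import time

def simulate(n_records, op):
    """Build a synthetic dataset of n_records and time one operation on it."""
    data = [random.random() for _ in range(n_records)]
    start = time.perf_counter()
    op(data)
    return time.perf_counter() - start

# Double the dataset and watch how a representative operation scales.
timings = {n: simulate(n, sorted) for n in (10_000, 20_000, 40_000)}
```

Swapping in real operations (inserts, edits, deletes) at realistic rates turns this into the load simulator the paragraph calls for.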

3. Temporal axis. Almost all data has some temporal nature, whether it’s a publication date, location timestamp, or time of last access. The system often needs to handle data differently depending on where the data lies along the time axis. For applications such as email and to an even greater extent on Twitter, older data is less frequently accessed. Conversely, users of Facebook very often look at pictures taken months or years ago. This is even more true with Timeline. Data must be stored with these access patterns in mind.
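One common way to act on these access patterns is hot/cold tiering by age. A minimal sketch, with a hypothetical 30-day cutoff:

```python
import time

HOT_WINDOW = 30 * 24 * 3600  # assumption: records newer than ~30 days stay "hot"

def tier_for(record_ts, now=None):
    """Route a record to fast ('hot') or cheap ('cold') storage by its age."""
    now = time.time() if now is None else now
    return "hot" if now - record_ts <= HOT_WINDOW else "cold"
```

An email archive might use exactly this shape, while a Facebook-style timeline would need a less aggressive cutoff, since old photos stay in demand.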

4. Network vs Disk vs RAM vs CPU. I’ve stated these in a specific order, and you should know why. To borrow and mutate a common phrase: know thy system. One of these four components will be your bottleneck as the amount of data grows. A bottleneck at the CPU indicates that your algorithms and processing are slow (perhaps you’re doing a lot of compression, or maybe you’ve nested one too many for loops). Conversely, you may have an “I/O bottleneck”, where the system spends most of its time fetching data from disk or over the network. In this case maybe you’ve missed an index on a database table, or maybe you need to think about distributing your data to increase read/write throughput. Think about whether you should be bringing the computation (“CPU”) to the data rather than trying to fetch large amounts of data to a processing component.
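The “bring computation to the data” point can be made concrete with a toy comparison. In this sketch the list stands in for data living on another node, and the returned row count stands in for bytes crossing the network:

```python
# Two ways to answer "how many records match?" over a remote dataset.
# `dataset` is a stand-in for data living on another machine; the first
# element of each return value counts rows shipped across the "network".

dataset = list(range(100_000))

def fetch_then_count(pred):
    rows = list(dataset)                      # whole dataset crosses the wire
    return len(rows), sum(1 for r in rows if pred(r))

def count_at_source(pred):
    return 1, sum(1 for r in dataset if pred(r))  # only the answer travels
```

Both functions compute the same answer, but shipping the predicate instead of the data turns the transfer cost from O(n) rows into a single result.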

5. Throughput. Throughput is defined by two numbers: request size and request rate. Request size is the amount of data the application reads or writes in a given call or user transaction. Request rate is how many transactions occur each second or each minute. Request rate is in turn determined by the number of concurrent users and the distribution of activity levels across those users (from power users to casual users).
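These two numbers multiply directly into a capacity target. A minimal sketch, with hypothetical user counts and request sizes:

```python
def required_throughput(concurrent_users, requests_per_user_per_sec, avg_request_bytes):
    """Aggregate bytes/second the system must sustain. Inputs are assumptions
    you would replace with measured figures for your own application."""
    request_rate = concurrent_users * requests_per_user_per_sec
    return request_rate * avg_request_bytes

# e.g. 10k concurrent users, 0.5 requests/sec each, 8 KB per request
bytes_per_sec = required_throughput(10_000, 0.5, 8_000)
```

Splitting users into tiers (power vs casual) and summing the tiers gives a more honest rate than a single average.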

6. Consistency. This becomes more of an issue when dealing with distributed data. One machine or storage unit may be written to with the most up-to-date data, but the other storage units need to be updated as well. Different storage systems tackle this problem in different ways. MySQL Cluster uses two-phase commit, which has long been in use but gives up some availability (MySQL Cluster assumes some level of redundancy). The recent NoSQL movement sacrifices consistency for availability. The strategy used depends heavily on the application and use cases. At an ATM, you expect your checking account to update immediately after withdrawing cash. In other cases, such as updating your Twitter profile, the consistency requirement is not as strong (although it does make for a frustrating experience when you still see your old profile several minutes after editing it).
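The shape of two-phase commit is easy to see in miniature. A toy sketch (not MySQL Cluster’s actual implementation): every replica votes in a prepare phase, and the write is applied only on a unanimous yes.

```python
class Replica:
    """Toy replica: stages a value in phase 1, applies it in phase 2."""
    def __init__(self):
        self.value = None
        self.staged = None

    def prepare(self, value):
        self.staged = value
        return True            # vote yes; a real replica could vote no

    def commit(self):
        self.value = self.staged

    def abort(self):
        self.staged = None

def two_phase_commit(replicas, value):
    """Phase 1: collect votes. Phase 2: commit only on unanimity, else abort."""
    if all(r.prepare(value) for r in replicas):
        for r in replicas:
            r.commit()
        return True
    for r in replicas:
        r.abort()
    return False
```

The availability cost is visible even here: while any replica is unreachable during phase 1, no write can complete, which is exactly the trade the NoSQL systems mentioned above decline to make.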

7. Latency. How fast does the application need to respond to a user’s action? When performing a friend search on Facebook I expect the friend to be returned within a second or two. When buying tickets to a show online, however, I would expect the transaction to take a few seconds longer; my tolerance for waiting is higher there. Your application should optimize for frequently accessed data, either by indexing the data or caching the data.
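Caching frequently accessed data can be as simple as memoizing the lookup. A minimal sketch in which `fetch_profile` is a hypothetical stand-in for a slow database read:

```python
import functools

CALLS = {"count": 0}  # instrumentation so we can see cache hits vs misses

@functools.lru_cache(maxsize=1024)
def fetch_profile(user_id):
    """Hypothetical slow lookup; only cache misses reach the 'database'."""
    CALLS["count"] += 1
    return {"id": user_id, "name": f"user-{user_id}"}
```

After the first call for a given user, repeat requests are served from memory, trading RAM for latency on the hot path.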

8. Physical and logical proximity. Data that is requested together should be stored together. When I visit a friend’s Facebook page, all of her pictures need not be loaded as well. Information that appears on a friend’s profile page such as name, profile pic, age, location, statuses should be stored close to each other so that these bits can be retrieved more efficiently. Sharding is one strategy used to keep commonly accessed pieces of data close together while simultaneously using the advantages of a distributed system.
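The co-location idea can be sketched with hash-based shard routing: all of one user’s profile fields map to the same shard, so a profile-page request touches one machine. The shard count and field names here are illustrative assumptions.

```python
import zlib

N_SHARDS = 4
shards = [dict() for _ in range(N_SHARDS)]  # stand-ins for separate machines

def shard_for(key):
    """Deterministically map a key to a shard; everything keyed by the same
    user id always lands in the same place."""
    return zlib.crc32(key.encode()) % N_SHARDS

def put(user_id, field, value):
    shards[shard_for(user_id)][(user_id, field)] = value

def get(user_id, field):
    return shards[shard_for(user_id)][(user_id, field)]
```

Because routing is by user id rather than by field, name, profile pic, age, and statuses for one user are retrieved together, while the user population as a whole still spreads across all shards.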

“Big data” in and of itself is not useful. Bryce from OATV wrote a great post about “big data” and its shortcomings. However, it’s still up to us to build the systems that can deal with this type of data, and it’s up to us to then figure out what data is useful and what isn’t. To be hamstrung by your own system is not a place you want to be.

Engineer and test with discipline and you will be headed in the right direction.

…Life can be much broader once you discover one simple fact — and that is everything around you that you call life was made up by people that were no smarter than you. And you can change it, you can influence it, you can build your own things that other people can use…

I recently read this post over at Seth Godin’s blog and I liked its message so much I posted it above. I think years of schooling (almost two decades including college) train people to compete by continuously comparing and judging themselves against their classmates and peers. Even grades are very often, if not always, set on a curve. Teachers and professors set the goals (the curriculum, the assignments, the tests and exams) and you compete against your classmates to achieve the best results.

As a student you are given orders and are expected to meet them; there is little leeway in terms of setting your own measures of achievement.

In the workplace, and for the rest of life, this mindset doesn’t work. It’s up to you to create your own goals and drive yourself towards these goals. It’s what you do and what you work towards when no one else is watching, and when no one else is expecting anything, that will matter.