
Last week I laid some groundwork for understanding statistics that you can use every day. Calculating averages and looking at histograms to get an idea of what kind of data you're dealing with is a great first step in statistical analysis. But there are many more tools available, especially if your data approximates a normal distribution. A couple of those tools are standard deviations and confidence intervals, and they can help you understand the characteristics of your data in more detail.

It's important to remember that these tools work best when used with data sets that have a normal distribution because the equations of the tools describe properties of the normal distribution. The more your data set's distribution differs from a normal distribution, the less descriptive power these tools will have for your data. With that cautionary note out of the way, let's tackle some statistics.

The standard deviation is a measure of how tightly clustered a data set's samples are around the mean. As more samples occur closer to the mean, the standard deviation goes down. If the samples are more spread out, it goes up. The following graph shows what different normal distributions look like with standard deviations of 1, 2, and 0.5:

The above description of standard deviation nearly tells you how to calculate it. You could memorize a formula, but it's good to understand the reasoning behind the tools you use so that you can derive the formula even if you can't remember it. Here's how I would build up the formula for the standard deviation:

From the description, we know that we need to start with the mean of the data set and then find the distance (i.e. difference) of every sample from the mean.

We don't care if the difference is positive or negative, so we'll square each difference to remove its sign.

We're looking for a general value representing the spread of the data, so we can calculate the average of all of the squared differences to come to a single representative value of the data. This value, as calculated so far, is actually called the variance of the data.

We're almost there. The only problem is that the units of the variance are squared. To get back to the sample units, take the square root of the variance.

This set of operations—taking the sum of squared differences or the mean of squared differences—is extremely common in statistics. It's a fundamental tool for measuring the spread of data points or for estimating errors from calculations on the data. Keep it in mind. We'll be coming back to it in later posts.

At this point you may be asking why we squared the differences and then took the square root of the variance. Why not just take the absolute value of the differences, find the mean, and call it a day? If you think of every sample as a separate dimension of the data set, then the differences between the mean and the samples can be considered the elements of a vector. The way to find the magnitude of a vector is to calculate the square root of the sum of the squares of its elements, and that's essentially what the standard deviation formula does, with an extra division by the number of samples to turn the sum into an average.

Now that we have figured out how to calculate the standard deviation of a data set, here's what it would look like in Ruby code:
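Here's a minimal sketch of such a method, following the derivation above (the method names `mean` and `std_dev` are my own choices):

```ruby
# A sketch of the standard deviation, built up exactly as described:
# differences from the mean, squared, averaged, then square-rooted.
def mean(data)
  data.sum.to_f / data.size
end

def std_dev(data)
  m = mean(data)
  # The variance is the mean of the squared differences from the mean.
  variance = data.sum { |x| (x - m)**2 } / data.size
  Math.sqrt(variance)
end

std_dev([2, 4, 4, 4, 5, 5, 7, 9])  # => 2.0
```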

One thing to note with this calculation of the standard deviation is that it is biased slightly low because the variance is found by dividing by the number of samples. This bias is negligible for sample sizes larger than about 10, but to get a slightly less biased estimate, you can divide by the sample size minus one.

The mean and standard deviation together are very useful for describing data. If you talk about a data set as the mean plus or minus two standard deviations, you get a pretty good idea of the range of values for a particular variable. For example, if you have a set of temperature data with a mean of 25°C and a standard deviation of 2.5°C, you can say that the data has a value of 25±5°C to show that it represents a range of temperature values that mostly vary between 20°C and 30°C.

Using two standard deviations to represent the range of a data set is fairly typical. Two standard deviations happen to cover slightly more than 95% of a data set with a normal distribution, so it's a good way of showing the typical values that the data are expected to take on. One standard deviation covers about 68% of the normal distribution. Three standard deviations cover 99.7%, nearly the entire normal distribution, so it's correspondingly much less likely to see values outside of three standard deviations. The following graph shows what one and two standard deviations look like on the normal distribution:
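These coverage numbers can be checked directly. For a normal distribution, the fraction of samples within k standard deviations of the mean is erf(k/√2), and Ruby's Math module happens to include erf:

```ruby
# Fraction of a normal distribution within +/- k standard deviations.
coverage = ->(k) { Math.erf(k / Math.sqrt(2)) }

coverage.call(1)  # => ~0.683
coverage.call(2)  # => ~0.954
coverage.call(3)  # => ~0.997
```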

Another way to use the standard deviation is to describe how confident we are that the mean of the data lies within a certain range of values. Our confidence will also depend on the size of the data set for which we're calculating the mean. This makes sense. If you have 10,000 samples you should be much more confident of where the average is than if you only have 100 samples to work with. The range of values that should include the mean to a certain confidence level is referred to as a confidence interval.

To calculate a confidence interval, find the number of standard deviations that gives you the desired confidence level, multiply the standard deviation by that number, and divide by the square root of the number of samples in the data set. A 95% confidence level is commonly used, and that corresponds to 2 standard deviations (well, actually 1.96, but 2 is easier to remember for quick-and-dirty calculations). Adding to our example Ruby code, we can use the following method for calculating a 95% confidence interval:
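Here's a self-contained sketch of that method (the mean and standard deviation helpers are repeated so the snippet runs on its own; the method name is my own choice):

```ruby
def mean(data)
  data.sum.to_f / data.size
end

def std_dev(data)
  m = mean(data)
  Math.sqrt(data.sum { |x| (x - m)**2 } / data.size)
end

# 95% confidence interval for the mean: 1.96 standard deviations,
# scaled down by the square root of the sample size.
def confidence_interval_95(data)
  half_width = 1.96 * std_dev(data) / Math.sqrt(data.size)
  [mean(data) - half_width, mean(data) + half_width]
end
```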

In the previous temperature example with a standard deviation of 2.5°C, if we had 1000 samples, we could say that we are 95% confident that the average temperature lies within 25±0.16°C. Notice that, because of the square root operation, shrinking the confidence interval by a significant amount requires a dramatic increase in sample size. To cut the interval in half, you would have to take four times as many samples. To reduce the interval by an order of magnitude, you would have to increase your sample size a hundredfold!
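That square-root scaling is easy to verify numerically. Using the 1.96 z-value from above:

```ruby
# Half-width of a 95% confidence interval for a given standard
# deviation and sample size.
half_width = ->(sd, n) { 1.96 * sd / Math.sqrt(n) }

half_width.call(2.5, 1000)  # => ~0.155 (the 25±0.16°C example, unrounded)
half_width.call(2.5, 4000)  # => ~0.077, half the width for 4x the samples
```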

We should be careful here not to say that there is a 95% chance that the average temperature lies in the calculated range. It's a subtle distinction from the above phrasing, but that wording is wrong, and it starts down a slippery slope to making dubious claims about the data.

The problem with saying there's a 95% chance is that you are no longer talking about confidence. You are talking about probability, and saying there is a probability that the average temperature is within that range is not correct. The average may or may not be within this confidence interval, but it is a definite value, whether you know it or not. That means once you have a confidence interval, the true mean is either in it or not. There is no probability involved. If you don't know what the true mean is—and many times you can't know because you're only sampling from a larger population—then you have a confidence level that the mean is within the range that you calculated.

There are many more details and subtleties contained within these concepts, but with these basics you can find a lot of practical applications of standard deviations and confidence intervals when working with data. Standard deviations help describe the spread of data that follows a normal distribution, and confidence intervals give you a sense of how meaningful the mean of a data set is. Both of these tools can give you a deeper understanding of the data you're analyzing.

Programmers work with lots of data, and when you work with any amount of
data, statistics will quickly come into play. A basic working knowledge
of statistics is a good tool for any programmer to have in their
toolbox, one that you will likely use regularly once you're comfortable
with it.

In this mini-series I'll go through some of the statistical tools that I use every day, what to look for when deciding which tool to use, and how to avoid mistakes that can mess up your results. I'll start
with the basics of the basics—the humble average and a few of the most
common distributions.

The average, formally called the mean,
is the most fundamental concept in statistics. It's the hammer in the
statistical toolbox. The trick is to use it on the right nails. An
average is easy to calculate. Given a data set, all you need to do is
add up all of the samples and divide by the number of samples in the
set. Voila, you have a value that lies in the middle of your data set.

In Ruby you could calculate a mean from an array of values with a method like this:
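A sketch of such a method (the `to_f` guards against integer division when the array holds integers):

```ruby
# Sum the samples and divide by the sample count.
def mean(data)
  data.sum.to_f / data.size
end

mean([1, 2, 3, 4])  # => 2.5
```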

Even though the average is a basic tool that's super
easy to use, if you're not careful with what you use it on, it can give
misleading results. For example, you would not want to use an average to
analyze data about the growth of a web site's user base over time. The
average number of users is a meaningless number because it's strongly
related to another variable, time. Depending on the growth rate,
calculating the average number of users over a month or even a week would tell you very little about the number of users on your site. A more sophisticated tool, like a curve fit, is more
applicable in this case.

The average is best used on a
data set where the samples are independent, meaning they are not related
to each other through some dominant parameter of the system, and they are
close together relative to their magnitude. In statistics this is called
the deviation from the mean, and I'll get into deviations more in a
later post. The important thing to remember is that the smaller the
deviation is, the more meaningful the mean is as a representation of the
data set. If a data set of temperature measurements has an average of 200°C and varies from 199°C to 201°C, that average says a lot more about the temperature being measured than if the measurements varied from 100°C to 300°C.

The other important thing to consider when
using averages is the distribution of the data. The best case scenario
for using averages is if the data has a normal distribution.
This type of distribution results from a data set whose samples are
clustered around the mean value in a way that's not biased to one side
or the other. If you group samples into ranges of values and plot the
results as a histogram, you would get something like this for a normal distribution:

You can calculate the histogram bin counts in Ruby with a method like this:
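Here's a sketch of such a method: values are binned by their distance from the minimum in units of the bin width, and anything outside the range is clamped into the first or last bin. The optional min and max arguments let you override the range when outliers would stretch it.

```ruby
# Count data values into num_bins equal-width bins between min and max.
def histogram(data, num_bins, min = data.min, max = data.max)
  step = (max - min).to_f / num_bins
  bins = Array.new(num_bins, 0)
  data.each do |value|
    index = ((value - min) / step).floor
    index = 0 if index < 0                    # low outliers go in the first bin
    index = num_bins - 1 if index >= num_bins # high outliers go in the last bin
    bins[index] += 1
  end
  bins
end

histogram([1, 2, 2, 3, 3, 3, 10], 3, 1, 4)  # => [1, 2, 4]
```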

The each block runs through each data value and calculates which bin it should be counted in using the minimum value and the step size between bins. Any value that's less than the lowest bin will be counted in that bin, and any value greater than the highest bin will be counted in that bin. The bins can be automatically calculated from the minimum and maximum data values, but if there are outliers, the bins could be very spread out, making the histogram look weird. In that case, you can specify where you want the range of the histogram to be, and any outliers will be counted in the bins on either side.

Histograms
can tell you a lot about the characteristics of a data set. They will
clearly show whether the average is a meaningful representation of the
data, how spread out the samples are within the data set, and if there
are any peculiarities hidden in the data set. If your data looks like a
normal distribution, there are a lot of statistical tools at your
disposal for analyzing it.

The mean is also well-defined.
It sits right at the peak of the histogram. Half of the samples are
larger than the mean, and half of the samples are smaller than the mean.
If the histogram is narrow and tall, it shows that the distribution is
tight, and a lot of samples are very close to the mean value. The mean
is a good value to represent the data set in this case, and its usefulness
in predicting the value of future samples with the same characteristics
is high.

When you expect your data to have a normal
distribution, but it doesn't, it's common for the histogram to show one
or more additional peaks. This is called a multimodal distribution, or in the most common case of two peaks, a bimodal distribution. A histogram of such a distribution might look like this:

There
are a lot of possibilities here, and most bimodal histograms will not look
exactly like this one. The peaks could be further apart, one could be
narrower than the other, or one could be taller than the other. In any
case, the bimodal nature of the data makes an average less
characteristic of the data than normal. Depending on how far apart the
peaks are, the mean could merely be shifted from where it should be, or
it could fall in an area of the data where there are hardly any data
points at all—not a terribly useful place for a mean to be.

What this histogram is telling you is that something is affecting the data in a binary way, meaning that the parameter affecting the data takes on one of two values.
When it's one of the two values, the samples will cluster around one
mean, and when it's the other value, the samples will cluster around a
different mean. Each value of the parameter will create a separate
normal distribution, and the combination of samples at two different
values of this unknown parameter creates the bimodal distribution. In
fact, I created the above graph by adding two normal distributions
together with two different means.

If you have data
with a bimodal distribution, you can approach it in a variety of ways.
Normally, the first thing you want to do, if you don't know already, is
figure out what is causing the distribution to split. Did conditions
change while taking the data? Is there some unknown variable that you
should filter on to split the data into two sets? What can you control
better when taking the data? Can this unknown variable take on more than
two values, and if so, does that have a deterministic effect on the
data? Can you control this variable to optimize things in the system?
Exploring and answering these questions can lead to important new
insights into the behavior of the system you're analyzing.

Once
you know more about what's causing the bimodal distribution, you can
decide whether you want to eliminate the cause or use it to enhance the
system. That will depend on whether this variable has a desirable effect
on the system or not. In either case, bimodal distributions shouldn't
be feared. You can learn a lot about a system by exploring why such a
distribution exists.

One last very common distribution is a power-law distribution, also known as a Pareto distribution.
This kind of distribution shows up all the time when measuring things
with a hard one-sided limit, and it generally looks like this:

The
initial sharp decline and then gradual sloping fade away of the curve
gives the right side of this distribution the name "the long tail." Many
times there is no bounded upper limit to how large whatever is being
measured can get. Some examples of things that follow a power-law
distribution include people's incomes, internet response times, and user
scores on websites like Stackoverflow.com. Some distributions, like
people's incomes, have a peak that occurs above zero, but they all
exhibit the long tail.

The long tail does some interesting things to the mean of the distribution. A few extremely large data points can pull the mean towards them, giving the mean a potentially large bias. Instead of the mean being balanced between two equal halves of the distribution, you'll have fewer data points that are larger than the mean and more points that are smaller than it.

If you are interested in the halfway point of the distribution, you'll have to calculate the median
instead. The median is the value that occurs at the exact halfway point
when the samples are sorted. The distance between the median and the
mean can be used as a measure of how skewed the distribution is towards
the tail.
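A median method in Ruby is short. This is my own sketch, separate from the earlier example code:

```ruby
# Sort the samples and take the middle value, averaging the two middle
# values when the sample count is even.
def median(data)
  sorted = data.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

median([5, 1, 3])     # => 3
median([1, 2, 3, 4])  # => 2.5
```

Comparing this value with the mean gives the skew measure described above.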

While the mean is not a very useful
parameter of a power-law distribution, knowing that you have such a
distribution can guide your use of the data. If the goal is to minimize
whatever you're measuring—internet response times is a good example—it
can be quite fruitful to target the large outliers to see if they can be
eliminated. If those super long response times can't be reduced, they
will be a constant drag on the average response time, no matter how much
you reduce the faster response times. If instead your goal is to
increase the metric, finding some way to affect a large portion
of the samples that are below the median will have a correspondingly
large effect on the average.

To wrap up, the three main
types of distributions that tend to show up when analyzing data are the
normal, bimodal, and power-law distributions. When starting to analyze a
new data set, it's best to figure out what kind of distribution you're
dealing with before diving into more complicated statistical analysis.
If the data follows a normal distribution, the mean can be used to talk
about what a typical sample is like. With other distributions, you
should be much more careful when talking about the average value of the
data. That's enough for today. Now that the basics are covered, next time I'll get into standard deviations and confidence intervals.

There are currently about 180 million active websites on the internet. Finding what you need is going to be a challenge. Finding the website that meets your needs exactly and gives you a great experience is even harder. Organizing and finding stuff on the web has become a massive industry, with Google, Facebook, and Twitter battling for your precious time to best give you what you're looking for.

It's an extremely hard problem, and Google's method works pretty well for me, so I was intrigued when I came across this post by Roy Pessis on how Google is killing the web. He laments how hard it is to find all the awesome websites out there and to recall them when you need them:

Every week I find at least
one site that blows my mind. I get excited about how this service could
evolve into something big, it’s potential to grow into a billion dollar
business, and how it can change the face of the Internet.

But
you won’t find these great sites on the first page of Google
results—you might not find them on the first 10. As a result, these
services, some of them genuinely life-changing, get lost in the dark
recesses of the Internet. Even when you find these gems, you probably
won’t think to access them the next time you log on. Their biggest challenge is finding a large enough audience to create a habit around their product.

It's a commendable goal to want to improve the web experience and connect people with the companies that can best help them with their needs. If a service could show me the websites that would most efficiently and effectively help me do what I want right now, that would be beyond excellent.

This article really got me thinking about how the web could be better, but then a funny thing happened. I got stuck on the sheer scale of the problem. There are not one, but three main challenges to overcome—challenges that the big internet companies are attacking in various ways and doing a pretty good job of solving already. Any new solution is going to have to do better at all of these issues than the solutions that are currently out there, and that's a lot more difficult than convincing people that there should be a better way to find what they're looking for on the web.

How do you find exactly what you're looking for?

Finding the handful of websites that would best help you among the 180 million websites out there is hard enough, but to do it quickly, billions of times per day for hundreds of millions of users is shockingly difficult. Every user's idea of what they're looking for has its own context. Different websites will align better with different users' needs, even when they deal with the same topics. Finding the best match for everyone has a significant amount of irreducible complexity.

Each of the major internet companies deals with this complexity in a different way. Google attempts to match people to websites with keyword search. They index the web, find the keywords you're looking for in text and links, and return a ranked list of results. The whole process is much more complicated than that, of course, but it's a logical way to look for something in such a massive amount of information.

Facebook takes a different tack. They figure you'll be interested in the same types of things that your friends are interested in. You're likely to want to read or watch the things your friends find, create, and post, so your Facebook feed attempts to show you things from your friends' posts that are likely to interest you. This is not so much directed searching as finding what you're looking for through serendipity. You can find a lot of things you're interested in this way, but not likely what you're looking for right now.

Twitter uses yet another approach. It's similar to Facebook in that you follow other people and see a feed of their posts, but it's much more transient and you see all of the posts, as well as posts by others that respond to those posts. Choosing who to follow based on what you want to see is much more important here. If you carefully select who you follow, you'll have a well-curated feed of highly relevant links, comments, and discussions related to your interests. You do have to put in the time, and like Facebook, you probably wouldn't look to Twitter as a resource for immediate problems. But you can find a lot of valuable stuff this way over time.

Things aren't completely segmented along these lines, and each of these companies uses elements of the other approaches to help you find what you're looking for. Each of them provides a markedly different experience and makes different choices for the trade-offs involved. While none of them are perfect, they all get the job done fairly effectively, and each of them works better in certain situations.

How do you remember what you've already found?

Once you find something valuable on the web, you probably want to save it for later use. If you found it through Google, you may be able to use the same search terms to find it again the next time, if you can remember how you did it. It's even harder to find old stuff on Facebook, and it's nearly impossible on Twitter.

If you want to use the web like your desktop or tablet and store things for frequent use, then you need
to "install" the websites you use most with bookmarks or a website like delicious.com. Personally, I use Firefox bookmarks, and they work pretty well. I keep them organized in folders, and I have access to them on any device that has Firefox installed. I can see how they don't scale well, though, and with hundreds of bookmarks, I'm starting to depend more on the search feature.

I don't know how to make bookmarks scale better, but desktops and tablets suffer from the same problem with installed apps. I know people who
have installed 200+ apps on their smartphone and are in the same
predicament. They can't find what they need when they need it. They need
search. Having all of your apps on your desktop, just one click away,
doesn't help if you can't find the ones you need in the sea of apps you never use. The desktop isn't really a solved
problem. It's a different problem. Trying to make the web more like the
desktop isn't going to solve any of the web's problems.

The real problem here is that once you get past a few dozen apps or bookmarks or whatever, it's hard to remember where you put them when you need them unless you've done a great job organizing them yourself. At a certain number of things, it's easier to resort to search. The web is way past that number, so the default is search.

I find that I use search more on the desktop now because it works so well for the web. I reserve the prime real estate on my taskbar for the dozen programs I use the most, and similarly, I have fewer than a dozen pinned tabs in Firefox for my most-used websites. Keeping more things than this available at once just isn't useful.

How do the best sites get noticed?

I'm sure we've all had the same experience of finding an awesome website, and then wondering why it was so hard to find or why we didn't find it sooner. These websites should be easy to find, right? Everyone should be using them because they're so awesome! But everyone has a different idea of what makes a great website, and there are a lot of different interests out there.

The most popular websites gained their popularity over time, and lots of websites benefit from network effects. They become more useful as more people use them. Sites like Facebook, Twitter, Amazon, and Stackoverflow depend on the sheer volume of users to make the sites better. It takes time and effort to build a site from small beginnings, and a site with lots of potential is much different than a site with millions of users. Not every awesome website is going to make that transition.

Most of what I read is the best there is to read on any given
subject. The trash is few and far between, and the average equally rare.
This, of course, is made possible by the Internet. No longer are my reading choices constrained by time and especially place.

This property applies to all websites, though. It's hard to get noticed unless you're the best because people don't have time to look at much more than a few sources for any given topic. They're going to devote their precious time to the sites that are the most likely to give them good returns on their time investment. That typically means it's the popular sites that get the traffic. To get popular, sites need to have great content and great promotion strategies, or they'll get lost in the sea of other sites.

Every once in a while a new channel comes along that allows new websites to promote themselves easily and get popular, but that only works until the new channel gets saturated. Facebook and Twitter are recent examples. It may seem like app stores are a good model that could be used to promote websites because they've worked so well for smartphone and tablet apps. They've got reviews and ratings, and if you get promoted by Apple or Google, your app can really make it big. But there are still a lot of crappy apps out there and only a small number of great ones to find. iOS now has over 1.2 million apps and Android has well over 1.3 million apps. At those numbers it's not much easier to get noticed in an app store than it is on the web, no matter what the app store is like.

I would absolutely love a better web browsing experience. I think everyone would. I would love to find the best sources on any given topic or task instantaneously without any search effort. Who wouldn't? But who is judging what "best" is? My definition of best is almost guaranteed to be different than anyone else for a large selection of things. Aggregating opinions through ratings can go a long way, but what about the websites that go unnoticed that might be perfect for me? I wouldn't know unless I tried all of the options, and I don't have the time or the inclination to do that for most things.

I'm willing to give up some choice to Google or Amazon in exchange for expediency and something that satisfies my needs—something that is good enough. Taking into account the magnitude of content that is being sifted through, the current browser experience is more than good enough. I would welcome a better solution, but it's probably not going to replace the ones that are already out there. A new solution is going to have to make its own choices on the trade-offs, and it's going to have to first figure out how to organize those 180 million websites.

Firefox is my window into the Internet Collective. It's like a second level of the operating system, and I spend probably half of my time on the computer, both at work and home, in this browser. I've tried other browsers but never felt as comfortable in them as Firefox. People have their reasons why this or that browser is better and why you should switch, but that isn't going to work on me.

My reasons for using Firefox are personal. I'm too used to how it works and how to get work done in it. I can't use Chrome or Opera or Internet Explorer (can anyone?). I'm committed to Firefox, and the main reason why is that I use plenty of shortcuts that I don't have to think about. I think of what I want to do, and my fingers type the required keys automatically. Some of these shortcuts are the same in other browsers, but not all of them. Any time the shortcut key is different or the behavior is slightly different, it screws me up and slows me down.

Here they are, the best Firefox shortcuts that I use every day. Learn them, use them, and you'll be as dependent on Firefox as I am.

New Tab <CTRL-T> – This is the workhorse of all of the Firefox shortcuts. Open a new tab, switch the focus to this new tab, and have the cursor placed in the address bar, ready to type in the website you want to go to.

Close Tab <CTRL-W> – The complement of opening a new tab is closing one. If you just opened the tab that you're closing, you'll move back to the previous tab you were at. That's very useful when you're opening a tab to look something up quickly and then closing it to get back to what you were doing. If you open more than one tab in a row, you'll move to the next tab to the right as you close them, or if you're at the end of the tabs, you'll stay at the end as you close them. That's good to know when closing a bunch of tabs. Start at the furthest left tab and hit <CTRL-W> a bunch of times to close them all.

Reopen a Closed Tab <CTRL-SHIFT-T> – Shouldn't have closed that tab? Forgot something you needed a few closed tabs back? You can reopen closed tabs with this shortcut. It's like an undo for tabs! The history is surprisingly long, too.

Open Link in New Tab <CTRL-click> – When you want to read something later, after you've finished with what you're currently reading, you can hold CTRL and click a link to open the link in a new tab, but keep the focus on your current tab. If you open more than one link this way, the new tabs will all be inserted and lined up directly to the right of your current tab in the order that you opened them.

Highlight the Address Bar <CTRL-L> – Done with the site you're viewing and want to go somewhere else? Use <CTRL-L> to highlight what's in the address bar, and start typing the new address to clear the old one away and start the new one. I also use this shortcut to quickly copy an address to make a link for blog posts or emails. Type <CTRL-L-C-W> to copy the address bar contents and close the tab all in one swift motion. You can even start with a <CTRL-T>, start typing the address you're looking for, and hit enter when it pops up in the autocomplete list to load an address into the address bar for copying. Key combos are the next level of browser Kung Fu.

A is for Amazon – This isn't a shortcut that will be the same for everyone. The address bar will learn what your most frequently visited websites are, and when you type in the first few letters, it will autocomplete the website so you can hit enter to go there. For my most frequently visited websites, I just have to type one letter and <Enter>. To go to Amazon.com, all I have to do is type <CTRL-T>, a, <Enter>. Google.com is <CTRL-T>, g, <Enter>. You get the idea.

Google Search – Even faster than going to Google first, you can use the address bar to do a Google search. If you know exactly what you're searching for and don't need to use Google Suggest, then you can type <CTRL-T> or <CTRL-L>, depending on if you want a new tab or not, and type in your search (e.g. "ruby array" to look up Ruby's Array documentation). You'll be taken directly to a Google results page for your search.

Search This Page – Want to search the current page for a word or phrase? Just start typing. No need for a <CTRL-F> to start it. To find the next instance of the search phrase, hit <F3>. To go backwards through the search, hit <SHIFT-F3>. I love this feature!

Surf Through Open Tabs <CTRL-PgUp> and <CTRL-PgDn> – Move left and right, respectively, through your open tabs with these shortcuts. It's convenient when you're a few tabs away from something you need, and you don't want to reach for your mouse.

Back <Backspace> and Forward <SHIFT-Backspace> – Navigate through the history for the tab you're currently in without your mouse as well. As you've probably noticed, many commands have complements that are accessed by adding a <SHIFT> to the shortcut.

Refresh <F5> or <CTRL-R> – I use <F5> out of habit to refresh a page, but <CTRL-R> works just as well. Refreshing is a surprisingly common operation on some websites.

Open Bookmarks <CTRL-B> – If you want to browse through your bookmarks to look for something you stashed away for safe keeping, you can quickly get a hierarchical view of your bookmarks with this shortcut.

Save a Bookmark <CTRL-D> – Even faster than hitting that star in the upper right area of the icon bar, this shortcut will open a drop down menu so you can change the title of the bookmark and file it in a folder right away.

View Source <CTRL-U> – When I need to look up something in a page's source, I can quickly get to it with this shortcut, then close it with <CTRL-W>. Many other shortcuts work within the View Source window, too. In particular, you can search the source by just typing what you're looking for.

Toggle Tools <CTRL-SHIFT-I> – This shortcut opens and closes the developer tools for Firefox. They're really useful when developing or debugging web pages, and each tab of the tool window has its own shortcut. This one is the best, though, because it will open and close the tools for quick uses, while the others will only open the tools.

There are many more shortcuts for all of the features of Firefox, but these are the ones I use every day. I feel slow and inept without them, and they keep me hooked on Firefox over all other browsers. What are your favorite Firefox shortcuts?