The answer, perhaps unsurprisingly, is: yes. It’s easy to do, and it’s revealing about what I do, when I do it, and where I go.

Like many other websites, Ars Technica employs a system of voluntary user logins. These logins allow you to do things like leave comments at the bottom of every story and engage in our user forums. Each time you log in to Ars, we record the date, time, and IP address that you logged in from. This is a common practice: nearly every website maintains similar records. Typically, though, Ars keeps only one record per user: the last date, time, and IP address used. We do not keep any historical records of login data.

However, Ars lead developer Lee Aylward was kind enough to make an exception—me. For 11 days in February 2014, Ars tracked all of my logins. The working theory was that since I’m telling Ars who I am (my login name is the frequently used and obvious “cfarivar”) and loading the site multiple times per day, my logins would actually give Ars a clear idea of my actions and movements.

In turn, I sent this 11-day log along to Nicholas Weaver, a computer security researcher at the International Computer Science Institute based in Berkeley, California. It took Weaver just a short amount of time to write a Python script that converted the raw CSV data file (including Unix time notation). It would start with a line like this:

That means Ars showed I was editing a particular story for about three hours on the morning of February 14, and I was connected likely through Private Internet Access (PIA), the commercial VPN that I frequently use. Normally, for privacy reasons, I use PIA to obscure my tracks online. While I tried leaving it off for the purposes of this experiment, sometimes I left it on by accident. That turned out to be useful, allowing us to see what it looks like when online origins are obscured.
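Weaver's script itself is not reproduced here. As a rough sketch of the kind of conversion described, the Python below reads CSV rows and turns Unix timestamps into readable dates. The three-column layout (timestamp, IP address, URL path), the sample rows, and the function name are all assumptions made for illustration, not the real Ars log format.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical sample rows: Unix timestamp, IP address, URL path.
# 203.0.113.7 comes from an address range reserved for documentation.
SAMPLE_LOG = """\
1392386400,203.0.113.7,/example/story-one/
1392390000,203.0.113.7,/example/story-one/
"""


def parse_login_log(text):
    """Turn raw CSV rows into dicts with a timezone-aware UTC timestamp."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        ts, ip, url = row
        when = datetime.fromtimestamp(int(ts), tz=timezone.utc)
        records.append({"when": when, "ip": ip, "url": url})
    return records


for rec in parse_login_log(SAMPLE_LOG):
    print(rec["when"].isoformat(), rec["ip"], rec["url"])
```

Timestamps are kept in UTC here; turning them into Pacific time for a Bay Area reader would be a separate step.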

Home is where the data is

Looking at the raw data and the cleaned-up script on my own, there were a few things that seemed obvious. First, it showed when I started and ended my work day. Some days, I was logged into Ars as early as 4:14am (February 13) and was active as late as 9:30pm (February 16). But generally speaking, I was consistently online by about 7am and ended around 5pm. There was then a gap of a few hours (I knew this was for dinner), and sometimes a check-in again before calling it a night.
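The start and end of a work day fall out of such a log almost for free. A minimal sketch, assuming the timestamps have already been converted to local time: group them by calendar date and take the earliest and latest of each day. The function name and the sample values are hypothetical (only the 4:14am time echoes the log described above).

```python
from collections import defaultdict
from datetime import datetime


def daily_activity_window(timestamps):
    """Map each calendar date to its (first, last) observed timestamp."""
    by_day = defaultdict(list)
    for ts in timestamps:
        by_day[ts.date()].append(ts)
    return {day: (min(seen), max(seen)) for day, seen in sorted(by_day.items())}


# Hypothetical local-time samples for one day.
sample = [
    datetime(2014, 2, 13, 9, 30),
    datetime(2014, 2, 13, 4, 14),
    datetime(2014, 2, 13, 17, 2),
]

for day, (first, last) in daily_activity_window(sample).items():
    print(day, first.strftime("%H:%M"), last.strftime("%H:%M"))
```

Long stretches with no entries between the first and last timestamp would show up the same way, which is how a dinner break becomes visible.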

Second, the data showed physical places that I knew I visited in the Bay Area: a particular San Francisco office building, an Oakland café, and the University of California, Berkeley, campus.

But Weaver’s analysis was far cleverer than I expected.

“I assumed you worked at home, because you had a residential Comcast IP address,” Weaver said. (He’s right: like nearly all of us at Ars, I work primarily at home.)

I didn’t realize that Comcast distinguishes business from residential accounts in the hostname tied to an IP address. Anything that shows up as comcast.net is a residence, while anything that shows up as comcastbusiness.net is likely a business. (Of course, anyone can sign up for a “Business-class” account at home, like Ars editor Lee Hutchinson, but most people don’t go that route.)
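The hostname rule described above is simple enough to sketch in a few lines of Python. This is an illustration of the heuristic, not Weaver's code: the function names and sample hostnames are made up, and the reverse lookup relies on the standard library's socket.gethostbyaddr, which needs network access and will not always return a name.

```python
import socket


def classify_comcast_hostname(hostname):
    """Apply the heuristic: comcastbusiness.net vs. comcast.net."""
    host = hostname.lower().rstrip(".")
    if host == "comcastbusiness.net" or host.endswith(".comcastbusiness.net"):
        return "business"
    if host == "comcast.net" or host.endswith(".comcast.net"):
        return "residential"
    return "unknown"


def reverse_lookup(ip):
    """Return the reverse-DNS hostname for an IP, or None if there is none."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except (socket.herror, socket.gaierror):
        return None


# Hypothetical hostnames for illustration only.
print(classify_comcast_hostname("c-203-0-113-7.hsd1.ca.comcast.net"))
print(classify_comcast_hostname("203-0-113-7-static.hfc.comcastbusiness.net"))
```

As the caveat about business-class accounts at home suggests, the result is a guess about the account type, not proof of where someone is sitting.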

Apparently, the original CSV file he used also contained URL information for which article I was viewing. “I knew what you were reading,” Weaver added. “That tells me what article you were working on, if you're reading old stuff it means you're looking for links.”

(Again, he's right. If I’m pulling up the last three stories I wrote about Bitcoin, there’s a high likelihood that I'm working on a new story about Bitcoin.)

“When VPN was active I could see that you were active, but not where,” he said.

"I am person X at this location."

The precision of the IP addresses was surprising.

In one instance, on Thursday, February 6, at 9:30am, I was logged in from a particular San Francisco IP address. Looking up that IP on myip.ms turned up not only the city but one of two possible street addresses as well. The search was again correct: on that particular day at that particular hour, I was conducting an interview with Boxbee CEO Kristoph Matthews at The Hatchery, a co-working space and startup incubator at 645 Harrison Street, in San Francisco’s South of Market district.

Weaver explained that a stronger and more persistent adversary, like the NSA, would have a much longer-term and comprehensive data set. Data sets like that would include information from plenty of sites beyond Ars.

“Facebook knows if you hit any page that has a Like button on it,” he said. “Same with TweetThis, unless the site goes out of the way to mask them, then these are specifically reporting them to social networks. This is why NSA loves it, is because they can go along for the ride.

“One thing that we know that the NSA does on their non-US wiretaps is bind usernames to cookies, so if you see a request for LinkedIn or YouTube or Yahoo, these are all sites that have user ID in the clear. All you need to do is see a request, and say I don't know who this is or I know who this is, but then you look at the HTML body and look for the username. This is why the NSA went after Google ad networks; they include user identification [broadcast] in the clear: ‘I am person X at this location.’”

Despite the vast amount of data, it's just as easy to store as it is to interpret. “It works out to only a few kilobytes per person for everyone on the planet,” Weaver added. In other words, if I had the access, it'd cost just a few thousand dollars to have enough consumer-grade storage to keep data on everyone in the United States. It would comfortably fit on my desk.
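The scale is easy to check with back-of-the-envelope arithmetic. The sketch below assumes 4 kilobytes per person as a stand-in for "a few kilobytes," along with rough population figures of 320 million for the United States and 7.2 billion worldwide. All three numbers are assumptions chosen for illustration, not figures from Weaver.

```python
KB = 1000          # bytes in a kilobyte (decimal units)
TB = 1000 ** 4     # bytes in a terabyte

# Assumptions for illustration only.
BYTES_PER_PERSON = 4 * KB
US_POPULATION = 320_000_000
WORLD_POPULATION = 7_200_000_000


def total_terabytes(people, bytes_per_person=BYTES_PER_PERSON):
    """Total storage in terabytes for a given number of people."""
    return people * bytes_per_person / TB


print(f"United States: {total_terabytes(US_POPULATION):.2f} TB")
print(f"World: {total_terabytes(WORLD_POPULATION):.1f} TB")
```

The total scales linearly with the per-person figure, so a record ten times larger needs ten times the disk.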

Metadata is surveillance

There was good news from this exercise. Mainly, the digital obfuscatory tools I normally run did help mask my online trail.

Generally speaking, I run all kinds of anti-tracking software on my browser: constant private mode, Ghostery, Disconnect, and my VPN. (I also have Tor and use it occasionally.) The VPN, of course, concealed my location but not my activity: I was still clearly logged in to Ars. And Weaver said that, yes, these tools do help to thwart tracking to some degree.

“The biggest reason why the NSA thinks Tor stinks is that it's actually really hard to link user activity to people,” he said. “Because the [Tor browser] bundle operates [by default] not storing cookies and [doesn’t allow Flash]. The browser bundle is allowed to not have linkages across sessions. Every time you exit the Tor browser it looks like a new user. Normal browsers are not set with clear all cookies. The real fault lies in the architecture of the Web. The Web is designed [to allow the] business model of tracking. If you have your browser set to clear cookies every time you quit, it really helps. Tor is overkill; your single hop VPN is still bouncing all over the place.”

As many privacy activists and security researchers have long noted, free products turn their users into the product. Google and Facebook are among the biggest companies making billions of dollars by tracking their users' behavior and selling ads against that behavior. But even my work account has the potential for data mining.

“[Your Ars log] didn't tell me anything new about your site, but it does tell me about your workflow. It tells me where you go and when you're active,” Weaver concluded. “This is why everybody says metadata is surveillance.”

Promoted Comments

I don't see how anyone can justify metadata as being somehow less than the data it is associated with. It's more, much more.

I can't readily use the content of your HTTP sessions to work out when you send them, from where and what site to and frankly, it's probably not interesting to me as a snoop. Likewise the content of your phone calls isn't interesting and is simply cumbersome nonsense. I want to know about them, not their content.

The important bit is indeed the metadata, the content is just noise. Where from? Who to? When? For how long? I can build up an entire picture of your life. Your search queries, another metadata term, are also very important. I can work out those little things about you that make you unique, once I have those, you're so much easier to trace.

Through metadata a full picture of your life emerges. I can infer your doctor, and probably your medical conditions. I can work out your routine, and set alerts if you deviate from it. I can tie you in with your social contacts from SMS and phone metadata, as well as rank them in order of importance. 20 minutes talking to a wedding planner service? Congratulations. 45 minutes on the phone to an employment lawyer? Guess things aren't working out too well at work.

I can run a PageRank-like algorithm over your phone records and everyone who you called, and their contacts, and get my own database of everything that makes you, you.

I may never get your name, but I will know everything about you, your job, what's happening in your life, and all your friends, acquaintances and colleagues.