Posts categorized "Know your data"

Just recently, I made a short clip about how Grubhub is extracting lead-generation fees from unsuspecting restaurant owners by setting up a network of shadow websites (and phone numbers). Diners who thought they were ordering directly from restaurants were instead shuffled through the Grubhub toll booth. Click here to see how this works.

Now, Vice discovered yet another of Grubhub's toll booths. This time, the toll booth is set up on the Yelp superhighway. When a Yelp user clicks on the restaurant's phone number on its Yelp page, a pop-up appears with two options, one of which redirects the user to a shadow number owned by Grubhub. These numbers are not labelled "direct" and "Grubhub" but "general questions" and "deliveries and orders". So now both Yelp and Grubhub are making lead-generation money off the restaurant owner.

***

This dispute is about causality! And because causality is tough to establish, it creates a gray zone of disagreement.

Ideally, the restaurant owner pays for orders caused by Grubhub's marketing activities. By cause, we mean the orders would not have materialized without Grubhub's marketing. In the examples we saw, it's highly unlikely that Grubhub did anything to cause those orders. That's because those diners thought they were ordering directly from the restaurants - they were quietly re-routed to the shadow websites and phone numbers set up by Grubhub, sometimes without even the restaurants' knowledge.

It is the secrecy that gives the game away. If Grubhub's causal value is clear, it can form open partnerships with the restaurants without resorting to trickery.

As I pointed out in the video, the majority of the digital marketing industry relies on similar tactics. Search engines do not typically send you directly to the webpage you clicked on - you are often rerouted through the search engine's server, so that the search engine can "track" the click and use the paper trail to receive lead-generation money. Anyone passing through this toll booth is counted as "causal" but in reality, many of these users would have found their way to the webpage even if the search engine didn't exist. (Consider going directly to Macys.com instead of typing Macys into a search engine.)
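The redirect trick described above is simple enough to sketch in a few lines. This is a minimal illustration, not any search engine's actual implementation: the host `search.example` and the parameter names are hypothetical. The point is that the "link" you click is really a URL pointing at the middleman's own server, which records the click before forwarding you on.

```python
from urllib.parse import urlencode, parse_qs, urlparse

CLICK_LOG = []  # each logged click becomes billable "proof" of a referral

def tracking_url(destination: str, ad_id: str) -> str:
    """Wrap the real destination in a redirect through the middleman's
    own server (hypothetical host 'search.example')."""
    return "https://search.example/click?" + urlencode(
        {"dest": destination, "ad": ad_id}
    )

def handle_click(url: str) -> str:
    """What the redirect server does: record the click, then forward the
    user to where they wanted to go all along."""
    params = parse_qs(urlparse(url).query)
    CLICK_LOG.append({"ad": params["ad"][0], "dest": params["dest"][0]})
    return params["dest"][0]

link = tracking_url("https://www.macys.com", ad_id="macys-123")
final = handle_click(link)  # user lands on macys.com; the toll booth kept a record
```

Notice that the user ends up exactly where a direct visit would have taken them; the only thing the detour adds is the entry in the middleman's log.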

***

This is a great example of how data, algorithms and software are silently running our lives, and often to the detriment of those who don't understand what's going on. Our video series is a small effort to help you stay in front of these data-driven technologies.

The more you know, the more you can leverage its power, and avoid its harms.

The genie is out of the bottle. The press is now reporting, almost daily, new discoveries of all the unspoken things tech companies have been doing with personal data collected from our usage of their products and services.

First up, a reporter from The Washington Post wondered how much surveillance was happening on the Chrome browser. He learned that the browser loaded 11,000 trackers in just one week. Horrified, he dumped Chrome (a Google product) for Firefox.

Traditionally, the website you visit installs a cookie (a type of tracker) to remember you: when you return to a page and it automatically remembers who you are, that's the classic use of a cookie. Nowadays, almost all trackers are invited guests of the website you visit - third parties who just want to follow you around.
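The classic first-party use of a cookie can be sketched in a toy form. This is a simplified stand-in for a real web server, not actual browser or server code: the cookie is just a random token the server hands out on the first visit and looks up on every return visit.

```python
import uuid

SESSIONS = {}  # the server's memory, keyed by cookie value

def first_visit() -> tuple:
    """Server issues a fresh cookie so it can recognize this browser later."""
    cookie = str(uuid.uuid4())
    SESSIONS[cookie] = {"visits": 1}
    return f"Set-Cookie: session={cookie}", cookie

def return_visit(cookie: str) -> int:
    """Browser sends the cookie back; the server knows it has seen you before."""
    SESSIONS[cookie]["visits"] += 1
    return SESSIONS[cookie]["visits"]

header, cookie = first_visit()   # server sends the Set-Cookie header
visits = return_visit(cookie)    # browser returns the cookie; visits == 2
```

A third-party tracker works the same way, except the cookie is set by a domain other than the site you think you're visiting, so the same token follows you across every site that embeds that tracker.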

I have more to say about the surveillance industry in this video. It's not just Chrome that you have to worry about.

***

When multiple websites like Facebook and Instagram temporarily went down last week, news came out that Facebook automatically tags any photo you upload with "labels". During the downtime, the images did not load but the labels did. People saw things like "Photo may contain: two people, smiling, a beach". These are almost surely AI-created tags.

This is nothing new, actually. In the past, you could see these labels if you were on a slow internet connection.

The real concern is the disclosure that Facebook automatically tags photos with people's names, using facial recognition technology. It appears that telling Facebook you don't want to be tagged means the tagging is moved underground, behind the scenes, out of your reach. I've been calling this the "illusion of control."

***

Another instance of illusion of control came by way of Amazon, which admitted that audio recordings of Alexa users are retained even after these users delete their accounts and data.

***

Next, a new email tool called Superhuman lets users track whether the emails they send have been opened by their recipients. This feature was on by default. This type of tracking uses "pixels" rather than cookies. A pixel is a tiny transparent "image" designed to hide from human eyes. When your browser pulls the pixel from its owner's server, you have made yourself known to that server.
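A pixel tracker can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical tracking host (`track.example`) and a made-up message ID; it is not Superhuman's actual code. The sender embeds an invisible one-pixel image in the email; the mere act of fetching it tells the tracker the message was opened.

```python
from urllib.parse import urlparse, parse_qs

PIXEL_LOG = {}  # the tracker's record of which emails were opened

def pixel_tag(email_id: str) -> str:
    """The invisible 1x1 'image' embedded in the outgoing email."""
    return (f'<img src="https://track.example/open.gif?id={email_id}" '
            'width="1" height="1" style="display:none">')

def serve_pixel(request_url: str) -> str:
    """When the recipient's mail client fetches the image, the tracker
    logs the open -- no click or reply required."""
    email_id = parse_qs(urlparse(request_url).query)["id"][0]
    PIXEL_LOG[email_id] = "opened"
    return email_id

tag = pixel_tag("msg-001")
serve_pixel("https://track.example/open.gif?id=msg-001")
# PIXEL_LOG now records that msg-001 was opened
```

In a real system the server would also log the timestamp, IP address, and device from the image request, which is how senders learn not just whether but when and where you opened their email.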

This tracking feature has always been part of commercial email software. Whenever Amazon or your favorite brand sends you an email, analysts can tell when you open it or interact with it. These data allow brands to assess the effectiveness of their emails.

Remember this the next time you pretend to your friend or colleague that you haven't seen his/her email.

Another day, another story about Facebook data. Ars Technica reports that Facebook is suing a South Korean app developer, Rankwave, claiming the company misused data it received from Facebook. Rankwave created mobile apps through which it obtained Facebook user data over 10 years.

All we know about this situation comes from the Facebook press release, and it's not clear what the offense is. The article cited the violation as using the Facebook data to "create and sell advertising and marketing analytics and models". But that's how Facebook itself uses the user data - and it's the reason Facebook partners want access to user data in the first place.

One part of the press release rings very true: Facebook admits that it does not control the data once shared with third parties. Facebook lawyers demanded Rankwave do the following:

Provide a full accounting of Facebook user data in its possession;

Identify all individuals, organizations, and governmental entities to which it had sold, or otherwise distributed, Facebook user data;

Provide a full record of the access logs and permissions it had granted third parties to access the data;

Delete and destroy all Facebook user data after returning it to Facebook;

Provide Facebook with full access to all storage and related devices so that Facebook could confirm deletion and destruction of the data through an audit.

These all sound great but would any company, even Facebook, be able to deliver the above logs, reports, devices, etc.? Given how data are spread out in big networks of servers ("data clouds", "data lakes", etc.), this wishlist sounds like a fantasy.

In this new video, I talk about the data sharing ecosystem, why it is so hard to delete anything, and how companies lose control of the data. It's the flip side of speed and convenience. What is the price we're willing to pay?

With the U.K. report on Facebook, and the stern language within it, the train on regulating data sharing may finally reach the station this year. The FTC is also likely to impose a stiff fine on Facebook for violating a consent decree.

So let's learn more about this data sharing business. If you prefer a video, the gist of this post can be heard here.

***

First, let's talk about data flows and the "cloud". Data are stored in computers that are called servers. In the cloud computing model, these servers are owned - not by the companies that collect the data - but by large tech companies like Amazon, Google, Microsoft, etc. who are responsible for managing the servers. These servers are geographically dispersed and so when data enter the cloud, they get replicated and spread to many servers. The technical benefit of such replication is recoverability of the data (allowing the use of cheaper, less reliable computers) but now, the data become much harder to delete.

Data become more telling if one combines different datasets measuring different aspects of our lives. For example, an auto insurer may have data on past claims and that data help predict your future claims. But if the auto insurer is able to get data from say an automaker about your car, e.g. how fast you drive, where you drive, etc., that data combined with past claims improve the predictive power.

Thus, a data-sharing industry has been created. Companies make agreements to share data with one another. This becomes much easier in the "cloud" as those servers are already connected to one another. These agreements may include explicit payments but even if they don't, both sides must be benefiting commercially from the arrangement, or else they would not exist.

So when company A shares data with company B, the data flow from A servers to B servers. B may also use a cloud, which then means the data would be replicated yet again, and dispersed geographically onto yet another set of servers.

And company B may also share data with company C, etc., etc.

***

An inexplicable part of the consent decree between Facebook and the FTC is the requirement that Facebook monitor what happens to the data after they are shared with third parties. I just can't figure out how that is possible. It isn't even possible within Facebook: if a user demands that his/her data be deleted, it will be very hard to ensure that all copies of the data are deleted from every server, including data that might have landed in an analyst's computer. In fact, most analysts probably don't know how many replicates of the data elements are created during an analysis, or where those replicates exist!

***

The next question of general interest is all the different ways in which tech companies collect people's data without people realizing what's happening. In the video, I look at contact lists, personality tests, 2-factor authentication schemes, IoT devices, etc. in their roles as data collectors.

This is the reason why the video is called "Did you betray your friend today?"

Big Tech is in serious danger of losing our trust when it comes to our data. For the longest time, they have sold consumers on a bargain: in return for providing convenience and services, they claim our personal data and sell them to the highest bidders.

That bargain requires consumers to trust Big Tech as good custodians of the data.

A number of recent revelations have eroded that trust. And I'm not talking about Cambridge Analytica.

Here are two recent examples that have not received the attention that they deserve.

Android phones found to track users even if they turn off Location History.

The Associated Press first broke this story, which was later verified by Princeton researchers. Android users who choose to turn off Location History obviously do so intending that they not be tracked.

In the same story, it is also disclosed that iPhone users who have installed Google apps are also being tracked at all times. When these users disable location tracking for the Google app, the app stops saving locations in the folder called Location History and saves the location data in a different folder instead. (Note also: the phone itself would not know how to do this; software engineers wrote code to enable this feature.)

This revelation is highly damaging. There has always been suspicion that your phone (or TV or other devices) is spying on you even when it is turned off. There are no technical obstacles to building this capability. The only reason customers are not worried is that they trust Big Tech's claim that their phones (or other devices) do not have that capability.

Google does not deny this is happening, and in fact, argues that it is transparent in dealing with users.

***

Facebook found to have taken users' phone numbers provided for "two-factor" authentication and sold them to advertisers.

For a number of years now, Big Tech and small tech alike have bombarded us with security warnings, and made claims that "two-factor" authentication is the ultimate solution to online security issues. Setting up two factor requires the user to provide a cell phone number, which immediately removes any semblance of online anonymity (unless you get your hands on a burner phone).

It's been known in the marketing world forever that the key to unlocking a person's data is their phone number. That's because the cell phone is almost always a personal device.

I don't typically consent to two factor for fear that the phone number could then become marketing fodder. This story, by Gizmodo, confirms that my worst fear is real. Similar to Google above, Facebook confesses to co-opting those phone numbers, and even maintains that users have consented to such use.

This revelation is also highly damaging. In fact, it damages the reputation of the security industry. How are we to trust them when they tell us to use things like phone numbers, fingerprints, eye scans, etc. for security purposes when we cannot trust that such private data would not be transferred to other entities?

In the same story, it is disclosed that Facebook also harvests phone numbers for advertisers under a variety of other pretexts. For example, when you provide your phone number to receive alerts about new log-ins to your account, that number ends up on marketing lists. Further, when you upload your contact list to Facebook, you have exposed all of your friends' phone numbers, turning them into fair game for Facebook's advertising clients!

This further revelation is even more damaging. Lots of websites ask for our phone numbers under the pretext of servicing our accounts, but we can no longer trust any of them not to turn our personal data over to advertisers.

***

Judging by their subsequent pronouncements when caught red-handed, we fear that Big Tech is not sensing the importance of "trust" in their business model. If the trust goes away, these businesses will be in for a nasty surprise.

On Friday, Facebook announced that hackers had gained access to personal data of at least 50 million users (see here for example). Analysts immediately connected this incident to the Cambridge Analytica scandal. How is this data breach different from the Cambridge scandal?

How the data was breached

One should take any announcement of how a data breach occurred with a grain of salt: it's obvious that companies will not publish anything that attracts lawsuits; plus, it's unclear that there is a penalty for lying about the real reasons for a data breach. There is no external auditing, and the companies control the narratives and statistics surrounding these events.

Based on what Facebook told us, the Cambridge Analytica scandal involved Facebook partners gaming the system to obtain data about Facebook users. At worst, it was a violation of community standards, e.g. by claiming to be doing academic research. The research firm used tools provided by Facebook to obtain the data. (The firm maintains that it disclosed to users that it was using the data for non-research purposes.)

In the present case, Facebook claims that a combination of coding bugs (i.e. unintended features) enabled unknown parties to gain access to the user's entire account - and not only that, but also any accounts on third-party apps or websites that the user signs on to using Facebook.

How the breach was discovered

The current scandal demonstrates the value of the business intelligence/business analytics functions within companies. We were told that Facebook first realized that certain metrics were showing unusual trends, and upon investigation, they discovered the bugs.

This is entirely believable. That's what happens when you have good data reports: they surface anomalies, which then have to be investigated. These investigations are extremely tricky because all you know is that the trends are different; there could be a thousand reasons for the shift. The analyst's job is to establish cause and effect. Especially since the development community adopted "agile" practices, all kinds of self-imposed changes occur all the time without warning. For a site as large and complex as Facebook, it takes a huge effort just to get a list of all site changes within some specified time window!

What complicates this situation more is that the vulnerability was traced to multiple bugs, not just one. I could imagine the twists and turns and the false alarms that were generated during the investigation.

The nature of this problem is no different from an investigator trying to chase down an E. coli outbreak, which is detailed in Chapter 2 of Numbers Rule Your World (link).

The data science community is guilty of talking down the business intelligence function. There is a misperception that BI is for less skilled people doing boring things. The reality is that there is more science in BI than in so-called data science (defined here as software engineering). Science, after all, is about figuring out why things are as they are. Engineers, by contrast, use our understanding of science to change the way things are.

You've been warned

In my debut Youtube video, released a few weeks ago, I explain how Facebook collects data about you. My biggest tip about protecting yourself is to not use Facebook to log onto other websites. Convenience is the drug that lures you into the trap.

Think about those one-password services. Instead of hundreds of passwords, you have one. So a thief needs only one password to access hundreds of sites. Using Facebook to log onto other websites makes Facebook the centralized point of attack for bad players.

Note that this is not limited to Facebook. Using Google, Yahoo, Amazon, etc. to log onto other services is no different!

If it's more convenient for you, it's also more convenient for your enemies. Remember that.

I'm continuing to see important articles as our journalists shine a light on the tech companies. This Bloomberg article reveals a previously secret relationship between Google and Mastercard. It also serves as a reminder that the intense focus on Facebook is misguided, as other big tech companies engage in similar practices.

I'd recommend reading the article in full. Here I'd like to bring out a few key observations for you to chew on.

1) The payment processors have become part of the surveillance network constructed by the tech industry. In the past (when I was working in that industry almost two decades ago), payment processors played the role of toll booths - a very lucrative business collecting cash as transactions streamed through. Users' purchase data were safeguarded and remained private.

Sometime in the last 5 to 10 years, the payment processors decided that the purchase data are a goldmine. They now monetize these data through mostly secret deals with all kinds of companies. Most card users are not aware that their purchases are being disclosed to third parties.

An analogy is Gmail and other web-based email systems. Gmail in particular pioneered the concept of reading users' emails to extract information. It is now known that Gmail allows selected partners to extract all kinds of information from emails, such as shopping receipts, boarding passes, etc. We know this because of the scandal involving the Unroll.me service, in which the service provider extracted shopping information from emails under the pretense of finding mailing lists users could be unsubscribed from.

2) Next time reporters hear the tech companies talk about "anonymized" data or hiding of "personally identifiable information", they really need to press the companies for details. Look out for clever word play. It is simply not possible to get value from these datasets without access to information about each person.

In this instance, Google is merging two datasets, the clickstream (what ads people clicked on) and Mastercard purchases, with the goal of claiming that person A purchased item X after clicking on some Google ad. The Mastercard data contain your card number, and/or your name. The clickstream data contain one or more of the following: your Google "cookie", your Google user name, or your email. You can't merge the datasets unless there is a common field. In this case, it will most likely be your name or email address. How could that be "anonymous"? Good question if you're asking it. And the reporters should be pushing on this point.
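The merge described above can be sketched with toy data. Everything here is hypothetical (the records, the field names, the email addresses); the point is that the join only works because both datasets carry the same identifying field, which is why "anonymous" is the wrong word for it. The sketch also applies a 30-day matching window of the kind discussed below.

```python
# Hypothetical toy data: the join key is the email address on both sides.
clicks = [
    {"email": "a@example.com", "ad": "tv-ad-1", "day": 3},
    {"email": "b@example.com", "ad": "lipstick-ad", "day": 7},
]
purchases = [
    {"email": "a@example.com", "store": "Best Buy", "amount": 500, "day": 12},
]

def attribute(clicks, purchases, window=30):
    """Merge on the common field (email) and credit any purchase made
    within `window` days of a click to that click."""
    by_email = {}
    for c in clicks:
        by_email.setdefault(c["email"], []).append(c)
    matched = []
    for p in purchases:
        for c in by_email.get(p["email"], []):
            if 0 <= p["day"] - c["day"] <= window:
                matched.append((c["ad"], p["store"], p["amount"]))
    return matched

matched = attribute(clicks, purchases)
# matched pairs the tv ad click with the $500 Best Buy purchase 9 days later
```

Note what the match does and does not establish: it shows the same person clicked and later bought, not that the click caused the purchase, and it says nothing about what was inside that $500 transaction.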

Further, if the data are fully anonymized, the potential utility of the data is much lower, and the price marketers are willing to pay for them is much lower. If I don't have your name and address, how can I send you a piece of junk mail?

3) While the article points out the violation of privacy of this partnership, it inadvertently legitimizes the highly dubious science behind matching the clickstream data and the Mastercard purchase data. The reporter described the program as "powerful" and "potent" but should have talked to third-party experts to understand this science better. I'll just list several major problems with this approach:

As far as I know, Mastercard does not have line-item detail on our purchases. It may know that you spent $500 at Best Buy, but it doesn't know that the total includes an iPad and a TV, for example. So if the ad shows a TV, how does Google know that the Best Buy purchase included a TV? The reporter claims that Google knows that someone bought "red lipstick" with her Mastercard. I find that dubious.

Do you remember what ads you clicked on 30 days ago? 15 days ago? 5 days ago? This system counts any purchases made on Mastercard as long as 30 days ago as "caused" by an ad click.

Does Google know what these users are doing on non-Google websites? For example, does Google know that the user who saw a Google ad 10 days ago also saw a Facebook ad for the same product 3 days ago?

Does Google know anything about the users' interactions offline? For example, the user who clicked on the Google ad 10 days ago may have heard a radio ad, seen a TV ad, talked to a friend about the product, been handed a flyer at the mall, saw a piece of direct mail, etc.

The reporter suggested that Facebook, the other digital advertising behemoth, is also pursuing similar deals with payment processors. If Facebook were to adopt this same method for measuring ad performance, then both Facebook and Google would have claimed credit for the same purchase, as each would have matched that purchase to its own clickstream data. One solution to this particular issue is to merge the Facebook and Google data, which simply merges two surveillance networks into one extra-large surveillance network.

The media have finally started to write some really nice reports on data sludge. I like this Wall Street Journal article, opening the black box of the science of secretly reading your emails.

If you use a smartphone, it is very likely that you have agreed to some app's terms and conditions, which allow them to download your emails en masse from one or more of your favorite cloud email providers, such as Gmail, Yahoo! Mail, Outlook, etc.

There was already an infamous example of this that came to light last year. Unroll.me is a service that helps you unsubscribe from unwanted mailing lists. When you set up this service, it requests access to your emails. That is how it finds out which services you can be unsubscribed from. It turns out that Unroll.me isn't really about helping you reduce email clutter - in fact, its main business is mining your inbox for shopping receipts, which can be sold to businesses which - you guessed it - want to sell you more stuff, which - you guessed it - probably means you'll receive more spam, net net. Oops. That was the sound when the company's management learned their little data sludge scheme had gone public.

For the Unroll.me story, I'm linking to this commentary in VentureBeat by someone who slammed Unroll.me management for audaciously claiming that their data sludge scheme was par for the course in the tech industry. This guy actually screamed: "The analogy between Unroll Me and Google or Facebook is audacious. Not to say haughty." He went on to claim that Google and Facebook keep all their data in-house. That was false in 2017, and looks even worse in light of recent revelations about data practices at those two companies.

Google, for example, has allowed, and recently expanded access by third-party developers to Gmail emails, according to the Wall Street Journal. Just like Facebook, Google has no control over how these third parties use the data. It has some language requiring these developers to agree to certain standards but those are unenforceable, and not enforced.

The WSJ article includes quotes from various participants in this data sludge industry that are false, intentionally or not.

First, they repeatedly claim they don't "read" our emails. Let's do a thought experiment here. I want to know if you are a racist. Unbeknownst to you, I got my hands on all the emails in your Gmail account stretching back 10 years. I write a computer program to look for various keywords like the N word. The program tabulates for me how many times you used each word, which days of the week you tend to say such words, which people you use those words with, the number of variations of each such word you have in your vocabulary, how many of your friends partake in such conversations and how many times they use racist terms, etc. Based on this report, I conclude that you are a racist. I may even conclude that certain friends of yours are also racist. According to various interviewees in the WSJ article, I drew that conclusion without "reading your emails."
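The thought experiment above amounts to a few lines of code. This is a hypothetical sketch (made-up inbox, a placeholder keyword standing in for the offensive terms), but it shows exactly the sense in which a profile gets built while no human "reads" a single email.

```python
from collections import Counter

# Hypothetical inbox dump; in the scenario above, ten years of someone's mail.
emails = [
    {"to": "friend@example.com", "body": "that movie was great"},
    {"to": "friend@example.com", "body": "SLUR SLUR again"},
]

KEYWORDS = {"slur"}  # placeholder for the word list in the thought experiment

def profile(emails, keywords):
    """Tabulate keyword usage per correspondent -- no human reads anything."""
    counts = Counter()
    for e in emails:
        words = e["body"].lower().split()
        hits = sum(1 for w in words if w in keywords)
        if hits:
            counts[e["to"]] += hits
    return counts

counts = profile(emails, KEYWORDS)
# counts shows 2 keyword hits attributed to friend@example.com
```

From a tabulation like this, one can "conclude" things about the account owner and about every correspondent who appears in the counts, which is the part of the thought experiment that should worry you.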

(Lest you think the example is far-fetched, we recently heard that Facebook had tagged thousands of users with the label "treason," a segment which can be purchased by advertisers - or anyone willing to pay for this data.)

Second, the companies interviewed for the article, e.g. Return Path, basically claim that they have only had human beings read emails once or twice. That is simply a lie. You can't build any kind of predictive model without getting intimate with the data. Further, to understand how these models work, you have to review actual cases. Finally, when something unexpected happens, you have to look at the email contents to understand why.

These technologies have some possible benefits. If such benefits outweigh the potential harms, then consumers will gladly adopt them. The data industry should be much more transparent; transparency ensures that developers maximize the benefits while reducing the harms.

***

We've been tracking data sludge for years. For more, read this thread.

Facebook and Google have now been hoisted in front of the public, and rightfully reprimanded for their invasive data collection. (Notice the deafening silence of our politicians and government officials on this issue.) In reality, the entire industry has been condoning and abetting these practices. You can follow my Know Your Data series of posts all the way back to 2010, documenting what some of the big names have been doing to take away every citizen's privacy.

For the author of the article, Google has over 5 gigs of data on him while Facebook has 600 megs. (We are making the assumption that we are being shown everything.)

Some of the key takeaways:

Deleting something merely disconnects you from your data. The data still exist on the corporate servers. (You should have known this when these companies told you that you can come back and restore your old contacts, data, etc.)

The era of "anonymity" is long gone. All of the data are traced to you. Some readers may remember the days when we were told that cookies are harmless files that are not personally identifiable.

Your mistakes, bloopers, etc. are all stored. If you accidentally click on a spam ad, it is stored, and likely used by some algorithm to profile you.

Google seems to have kept not just metadata (titles of images, sent dates and recipients of emails, etc.) but all the data (photos, emails, etc.)

No one cares whether the information about you is accurate or not.

If there is one copy of this data, you can bet there are lots of copies of the data. This is called "redundancy" - you just have to keep multiple copies to recover from inevitable data losses.