Today we report yet another type of surreptitious data collection by third-party scripts that we discovered: the exfiltration of personal identifiers from websites through “login with Facebook” and other such social login APIs. Specifically, we found two types of vulnerabilities [1]:

seven third parties abuse websites’ access to Facebook user data

one third party uses its own Facebook “application” to track users around the web.

Facebook Login and other social login systems simplify the account creation process for users by decreasing the number of passwords to remember. But social login brings risks: Cambridge Analytica was found misusing user data collected by a Facebook quiz app which used the Login with Facebook feature. We’ve uncovered an additional risk: when a user grants a website access to their social media profile, they are not only trusting that website, but also third parties embedded on that site.

We found seven scripts collecting Facebook user data using the first party’s Facebook access [4]. These scripts are embedded on a total of 434 of the top 1 million sites (UPDATE: see clarification below). We detail how we discovered these scripts in Appendix 1 below. Most of them grab the user ID, and two grab additional profile information such as email and username. We are unable to determine whether first parties are aware of this particular data access [5].

The user ID collected through the Facebook API is specific to the website (or the “application” in Facebook’s terminology), which would limit the potential for cross-site tracking. But these app-scoped user IDs can be used to retrieve the global Facebook ID, user’s profile photo, and other public profile information, which can be used to identify and track users across websites and devices [6].
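The linking step described above can be made concrete. At the time of the study, the public Graph API would serve a user's profile photo when queried with an app-scoped ID; since the photo is the same across apps, two sites holding different app-scoped IDs for the same user can link them through it. The endpoint pattern below is illustrative of that era, not a guarantee about the current API:

```javascript
// Sketch: resolving an app-scoped user ID (ASID) to a public profile photo.
// The photo is identical regardless of which app's ASID is used, making it
// a cross-site linking handle. Endpoint pattern is illustrative.
function profilePhotoUrl(appScopedId) {
  return "https://graph.facebook.com/" + appScopedId + "/picture?type=large";
}
```

Two trackers that each hold a different ASID for the same user can fetch this URL and compare the resulting images (or redirect targets) to join their profiles.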

* OnAudience stopped collecting this information after we released the results of a previous study in the No Boundaries series, which showed them abusing browser autofill to collect user email addresses.
**(Added 2018-04-22; 5:55pm): The script loaded by Opentag tag manager includes a code snippet which accesses the Facebook API and sends the user ID to an endpoint of the form https://c.lytics.io/c/1299?fbstatus=[...]&fbuid=[...]&[...] This snippet appears to be a modified version of an example code snippet found on the lytics.github.io website with the description “Capturing Facebook Events”. The example code appears to provide instructions for first parties to collect Facebook user ID and login status.
^ Although we observed these scripts querying the Facebook API and saving the user’s Facebook ID, we could not verify that it is sent to their servers due to the obfuscation of their code and some limitations of our measurement methods.

While we can’t say how these trackers use the information they collect, we can examine their marketing material to understand how it may be used. OnAudience, Tealium AudienceStream, Lytics, and ProPS all offer some form of “customer data platform”, which collect data to help publishers to better monetize their users. Forter offers “identity-based fraud prevention” for e-commerce sites. Augur offers cross-device tracking and consumer recognition services. We were unable to determine the company which owns the ntvk1.ru domain.

Vulnerability 2: Tracking users around the web with the Facebook Login service

Some third parties use the Facebook Login feature to authenticate users across many websites: Disqus, a commenting widget, is a popular example. However, hidden third-party trackers can also use Facebook Login to deanonymize users for targeted advertising. This is a privacy violation, as it is unexpected and users are unaware of it. But how can a hidden tracker get the user to Login with Facebook? When the same tracker is also a first party that users visit directly. This is exactly what we found Bandsintown doing. Worse, they did so in a way that allowed any malicious site to embed Bandsintown’s iframe to identify its users.

We discovered that the iframe injected by Bandsintown would pass the user’s information to the embedding script indiscriminately. Thus, any malicious site could have used their iframe to identify visitors. We informed Bandsintown of this vulnerability and they confirmed that it is now fixed.
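The post does not reproduce Bandsintown's code, but bugs of this class typically involve an identity iframe answering `postMessage` requests without validating the requester. A hedged sketch of the vulnerable pattern and its fix (all names and origins are illustrative):

```javascript
// Sketch of the bug class: an identity iframe that replies to any embedder.
// The origin allow-list check is the fix; omitting it reproduces the flaw,
// letting any malicious embedding page request the user's identity.
function makeIdentityResponder(allowedOrigins, getUser) {
  return function onMessage(event) {
    if (!allowedOrigins.includes(event.origin)) {
      return; // the check the vulnerable iframe was effectively missing
    }
    event.source.postMessage({ user: getUser() }, event.origin);
  };
}
// In a browser, the iframe would register this with:
// window.addEventListener("message",
//   makeIdentityResponder(["https://partner.example"], lookUpCurrentUser));
```

Without the allow-list, the responder answers identity requests from any origin, which is exactly the indiscriminate behavior described above.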

Conclusion

This unintended exposure of Facebook data to third parties is not due to a bug in Facebook’s Login feature. Rather, it is due to the lack of security boundaries between the first-party and third-party scripts in today’s web. Still, there are steps Facebook and other social login providers can take to prevent abuse: API use can be audited to review how, where, and which parties are accessing social login data. Facebook could also disallow the lookup of profile picture and global Facebook IDs by app-scoped user IDs. It might also be the right time to make Anonymous Login with Facebook available following its announcement four years ago.

Clarification (2018-04-19 2:25am): In a previous version of the post we listed three sites (fiverr.com, bhphotovideo.com, and mongodb.com) which embed scripts that match the URL patterns given above. Third-party scripts may contain different contents when loaded by different sites, even though the scripts are served from the same or similar URLs. We confirmed that the Forter scripts embedded on fiverr.com and bhphotovideo.com do NOT include functionality to access Facebook data. On mongodb.com we only observed the presence of an Augur script. We have published an updated list of sites, marking the ones where we have confirmed the presence of functionality to access Facebook data.

Correction (2018-04-22; 05:55pm): In a previous version of this post, we listed a Lytics script (https://c.lytics.io/static/io.min.js) as the cause of Facebook API access. Although this script is used to send the Facebook user ID to Lytics (c.lytics.io), the code which accesses the Facebook API was served within an OpenTag script as described above. The code snippet in the OpenTag script responsible for accessing Facebook user data is likely configured by the first party, so we have removed our previous opinion that first parties are likely unaware of the data access.

Several companies stated that they do not use Facebook data for third-party tracking purposes.

[0] Steven Englehardt is currently working at Mozilla as a Privacy Engineer. He coauthored this post in his Princeton capacity, and this post doesn’t necessarily represent Mozilla’s views.

[1] We use the term “vulnerability” to refer to weaknesses arising from insecure design practices on today’s web, rather than its commonly understood sense in computer security of weaknesses arising due to software bugs.

[2] In this post we focus on websites which use Facebook Login, but the vulnerabilities we describe are likely to exist for most social login providers and on mobile devices. Indeed, we found scripts that appear to grab user identifiers from the Google Plus API and from the Russian social media site VK, but we limited our investigation to Facebook Login as it’s the most widely used social SDK on the web.

[3] When the user completes the Facebook login procedure on a website that uses Facebook’s Javascript SDK, the SDK stores an authentication token in the page. When the user navigates to a new page, the SDK automatically reestablishes the authentication token using the browser’s Facebook cookies. All third-party queries to the SDK automatically use this token.
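A hypothetical tracker exploiting this behavior might look like the following sketch. The field list and recorded values are invented for illustration, and a tiny stub stands in for the real SDK so the snippet is self-contained:

```javascript
// Stub standing in for the real Facebook SDK on a page where the user has
// already logged in; the real SDK would silently reuse the stored token.
const FB = {
  getLoginStatus: (cb) => cb({ status: "connected" }),
  api: (path, params, cb) => cb({ id: "1234567890", email: "jane@example.com" }),
};

// Hypothetical third-party tracker: no consent prompt appears, because the
// first party's authentication token is reused automatically by the SDK.
let exfiltrated = null;
FB.getLoginStatus((response) => {
  if (response.status === "connected") {
    FB.api("/me", { fields: "id,email" }, (user) => {
      // A real tracker would ship this to its own server (e.g. via an
      // image beacon); here we only record it locally.
      exfiltrated = user;
    });
  }
});
```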

[4] In order to better understand the level of integration a third party has with the first party, we categorize scripts based on their use of the first party’s Application ID (or AppId), which is provided to Facebook during the login initialization phase to identify the site. Inclusion of a site’s application ID and initialization code in the third-party library suggests a tighter integration—the first party was likely required to configure the third-party script to access the Facebook SDK on their behalf. While application IDs aren’t meant to be secrets, we take the lack of an App ID to imply loose integration—the first party may not be aware of the access. In fact, all of the scripts in this category take the same actions when embedded on a simple test page with no prior business relationship.

[5] The following could indicate the first party’s awareness of the Facebook data access:

1) the third party initiates the Facebook login process instead of passively waiting for the login to happen; 2) the third party includes the unique App ID of the website it is embedded on. The seven scripts listed above neither initiate the login process nor contain the app IDs of the websites.

Still, it is very hard to be certain about the exact relationship between the first parties and third parties.

APPENDIX:

Appendix 1 — Measurement Methods

To study the abuse of social login APIs we extended OpenWPM to simulate that the user has authenticated and given full permissions to the Facebook Login SDK on all sites. We added instrumentation to monitor the use of the Facebook SDK interface (`window.FB`). We did not otherwise inject the user’s identity into the page, so any exfiltrated personal data must have been queried from our spoofed API.

As in our previous measurements, we crawled 50,000 sites from the Alexa top 1 million in June 2017. We used the following sampling strategy: visit all of the top 15,000 sites, randomly sample 15,000 sites from the Alexa rank range [15,000, 100,000), and randomly sample 20,000 sites from the range [100,000, 1,000,000). This combination allowed us to observe the attacks on both high- and low-traffic sites. On each of these 50,000 sites we visited 6 pages: the front page and a set of 5 other pages randomly sampled from the internal links on the front page.
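The stratified sampling above can be sketched as follows. This is a toy reimplementation with the strata as parameters so it can run on a small list; the real crawl drew from the Alexa top 1 million:

```javascript
// Toy sketch of the stratified sampling described above.
// strata: [start, end, count] — take `count` sites ranked in [start, end).
function stratifiedSample(ranked, strata, rng) {
  const chosen = [];
  for (const [start, end, count] of strata) {
    const slice = ranked.slice(start, end);
    // Partial Fisher–Yates shuffle gives an unbiased sample of `count` items.
    for (let i = 0; i < Math.min(count, slice.length); i++) {
      const j = i + Math.floor(rng() * (slice.length - i));
      [slice[i], slice[j]] = [slice[j], slice[i]];
    }
    chosen.push(...slice.slice(0, count));
  }
  return chosen;
}
// The study's configuration (all of the top 15,000, then 15,000 from
// [15,000, 100,000) and 20,000 from [100,000, 1,000,000)) would be:
// stratifiedSample(alexaTop1M,
//   [[0, 15000, 15000], [15000, 100000, 15000], [100000, 1000000, 20000]],
//   Math.random);
```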

To spoof that a user is logged in, we create our own `window.FB` object and replicate the interface of version 2.8 of the Facebook SDK. The spoofed API has the following properties:

For method calls that normally return personal information, we spoof the return values as if the user is logged in and call any necessary callback function arguments.

These include `FB.api()`, `FB.init()`, `FB.getLoginStatus()`, `FB.Event.subscribe()` for the events `auth.login`, `auth.authResponseChange`, and `auth.statusChange`, and `FB.getAuthResponse()`.

For the Graph API (`FB.api`), we support most of the profile data fields supported by the real Facebook SDK. We parse the requested fields and return a data object in the same format the real graph API would return.

For method calls that don’t return personal information we simply call a no-op function and ignore any callback arguments. This helps minimize breakage if a site calls a method we don’t fully replicate.

We fire `window.fbAsyncInit` once the document has finished loading. This function is normally called by the Facebook SDK.
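A minimal sketch of such a spoofed object is shown below. This is our illustration of the approach, not the actual OpenWPM instrumentation; all profile values are fake:

```javascript
// Fake profile data returned to any script that queries the spoofed SDK.
const FAKE_USER = { id: "1234567890", name: "Jane Doe", email: "jane@example.com" };
const FAKE_AUTH = {
  status: "connected",
  authResponse: { userID: FAKE_USER.id, accessToken: "spoofed-token" },
};

const spoofedFB = {
  init: function () {},                              // no-op
  getLoginStatus: function (cb) { cb(FAKE_AUTH); },  // always "connected"
  getAuthResponse: function () { return FAKE_AUTH.authResponse; },
  api: function (path, paramsOrCb, maybeCb) {
    // Accept both FB.api(path, cb) and FB.api(path, params, cb); return
    // only the requested profile fields, as the real Graph API would.
    const cb = typeof paramsOrCb === "function" ? paramsOrCb : maybeCb;
    const params = typeof paramsOrCb === "object" ? paramsOrCb : {};
    const fields = (params.fields || "id").split(",").map((f) => f.trim());
    const out = {};
    fields.forEach((f) => { if (f in FAKE_USER) out[f] = FAKE_USER[f]; });
    cb(out);
  },
  Event: { subscribe: function () {} },              // event firing elided
};
// In the crawler, this object is injected into every frame as window.FB.
```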

The spoofed `window.FB` object is injected into every frame on every page load, regardless of the presence of a real Facebook SDK. We then monitor access to the API using OpenWPM’s Javascript call monitoring. All HTTP request and response data, including HTTP POST payloads, are examined to detect the exfiltration of any of the spoofed profile data (including data that has been hashed or encoded).

For both calls to `window.FB` and HTTP data, we store the Javascript stack trace at the time of execution. We use this stack trace to understand which APIs scripts accessed and when they were sending data back. For some scripts our instrumentation only captured the API access, but not the exfiltration. In these cases, we manually debugged the scripts to determine whether the data was only used locally or if it was obfuscated before being transmitted. We explicitly note the cases where we could not make this determination.

Appendix 2 — Third parties which access the Facebook API on behalf of first parties

No boundaries for credentials: New password leaks to Mixpanel and Session Replay Companies

February 26, 2018

In this installment of the “No Boundaries” series we show how wholesale collection of user interactions by third-party analytics and session replay scripts causes inadvertent collection of passwords.
By Steve Englehardt, Gunes Acar and Arvind Narayanan

Following the recent report that Mixpanel, a popular analytics provider, had been inadvertently collecting passwords that users typed into websites, we took a deeper look [1]. While Mixpanel characterized it as a “bug, plain and simple” — one that it had fixed — we found that:

Mixpanel continues to grab passwords on some sites, even with the patched version of its code.

The problem is not limited to Mixpanel; also affected are session replay scripts, which we revealed earlier to be scooping up various other types of sensitive information.

There is no foolproof way for these third party scripts to prevent password collection, given their intended functionality. In some cases, password collection happens due to extremely subtle interactions between code from different entities.

Overall, we think that the approach of third-party scripts collecting the entirety of web pages or form inputs, and attempting to filter out sensitive information is incompatible with user security and privacy.

Password leaks are not limited to Mixpanel

In our research we found password leaks to four different third-party analytics providers across a number of websites. The sources are numerous: several variants of a “Show Password” feature added by site owners, an unexpected interaction with an unrelated third-party script, unanticipated changes to page structure by browser extensions, and even a bug in the privacy tools of one of the analytics libraries. However, the underlying cause is the same: wholesale collection of user input data, with protection provided by a set of blacklist-based heuristics to filter password fields. We argue that this heuristic approach is bound to fail, and provide a list of examples in which it does.

This summary is provided not as an exhaustive list of all possible vulnerabilities, but rather as examples of how things can go wrong. A detailed description of each vulnerability and the vendor response is available in the Appendix.

Show Password features place passwords in unprotected fields

Many websites implement mobile-friendly password visibility toggles which make it possible to “unmask” the entered password and check it for errors. In order to implement this, the user’s password must be placed in a field that doesn’t have its “type” property set to “password”, since browsers will automatically mask any text entered into those fields. We found leaks due to several variations of this feature, including to Mixpanel (A.1), to FullStory (A.2), and to SessionCam (A.3).

Video showing the inadvertent password collection by Mixpanel:

Subtle interactions with unrelated scripts can lead to leaks

Sites may include a number of third-party scripts which alter or annotate the page in unexpected ways. In the example given in (A.4), a third-party analytics script from Adobe stored the typed password in a cookie when the password field was clicked. On the same page, a session replay script from Userreplay collected all cookies, effectively leaking the password to Userreplay.

Bugs in analytics scripts can lead to password leaks that go unnoticed

Even in cases where publisher sites take an active role to prevent leaks to third parties, things can go wrong. Passwords were leaking to FullStory (A.5) due to a bug in one of their redaction features. The feature was implemented in such a way that, when applied to a password input, the password would leak to FullStory, but would not be displayed in the resulting session recording that the publisher could later review. Thus, it would be difficult for a publisher to discover the leak.

Browser extensions can alter the page in a way that leads to leaks

In their announcement, Mixpanel explained that “[the password leak] could happen in other scenarios where browser plugins (such as the 1Password password manager) and website frameworks place sensitive data into form element attributes.” The problem is not limited to a small set of browser extensions, but rather any extension which alters the page structure. Neither the site owner nor the analytics provider can be expected to anticipate all possible structural changes an extension might perform.

As an example, we examined browser extensions which automatically make password fields visible to the user. There are a ton of such extensions, which are collectively used by 120,000+ users. We found that users of the Unmask Password and Show Password Chrome extensions would, on some sites, have their passwords leaked to Mixpanel (A.6) and FullStory (A.7) respectively. In both cases the leaks were caused by a variant of the “Show Password” feature described above.

A better look at Mixpanel’s Autotrack

The Autotrack feature allows sites to collect interaction events on a website, like clicks or form interactions, without needing to specify which elements to monitor. The automated collection of all interactions, including values entered into input fields, is the service’s main selling point: if a site owner ever wants to start monitoring a new input field, Mixpanel will already have the complete history of the various inputs provided by the visitors for this field.

Autotrack doesn’t appear to be designed to collect sensitive data. Instead, Mixpanel suggests sites can use the service to perform benign analytical tasks, like finding commonly used search terms. However, sites collect all types of sensitive data through input elements: usernames and passwords, health information, banking information, and so on. Automatically determining which fields are sensitive is a difficult task, and an incorrect classification runs the risk of scooping up sensitive user data — even if the user never submits the form [6].

Similar to the session replay scripts we studied in the past, Mixpanel implements several heuristics [7] in an attempt to automatically exclude specific types of sensitive information from collection. The rules most relevant to password fields are: remove any input field which is of the password type, or which has a “name” property (or “id” property, if “name” does not exist) that contains the substring “pass”. Mixpanel also offers sites the ability to manually exclude parts of the page, which we discuss later in the post.

Not a bug, but a broken design

Mixpanel attributed the cause of their previous password leak to a change in another third-party library. The third-party library “placed copies of the values of hidden and password fields into the input elements’ attributes, which Autotrack then inadvertently received”. Indeed, the previous version of Autotrack only filtered password fields’ “value” property, which stores the password entered by the user. The attributes of the field, which can be used to add arbitrary metadata about the password field, were left unfiltered. The fix deployed by Mixpanel changed this, filtering both the value property and all attributes from password fields. This plugs that specific hole, but as the Testbook example (A.1) shows, sites may handle sensitive data in other ways Mixpanel didn’t predict.


The majority of the leaks examined above were caused by the mobile-friendly “Show Password” option. This feature is commonly implemented by switching the “type” property of the password input field from “password” to “text” (see: 1, 2, 3). The feature can be implemented by the first party directly or by a browser extension. In fact, this is how the Unmask Password extension is implemented, and was the cause of the ECRent password leak (A.6). The third-party scripts we studied don’t filter generic text fields as strictly as password fields. This is also the primary cause of the leaks on Testbook (A.1), PropellerAds (A.2), and johnlewis.com (A.3).
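The type-switching variant can be sketched with a plain object standing in for the DOM input element; real implementations act on an `HTMLInputElement`, but the toggle logic is what matters:

```javascript
// Sketch of the common "Show Password" toggle: once type becomes "text",
// filters that key on type === "password" no longer treat the field as
// sensitive, even though it still holds the user's password.
function togglePasswordVisibility(input) {
  input.type = input.type === "password" ? "text" : "password";
  return input;
}
```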

It may be tempting to filter passwords stored in generic text fields based on other properties of the input field, such as the name, class, or id. Mixpanel’s relatively simple implementation of this filtering [8] provides a perfect case study of this mitigation: it excludes generic text fields which contain the substring “pass” in their “name” attribute (or “id”, if “name” does not exist). Using our crawl data we found that 15% of the 36,972 total password fields discovered do not match this substring filter. Indeed, the third most frequent password field name attribute, “pwd”, would be missed, as would common translations of “password”. A word cloud of the 50 most commonly missed terms is given in footnote [9].
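Our reading of that heuristic, reconstructed as code (this is our sketch, not Mixpanel's actual implementation), shows why names like "pwd" or localized labels slip through:

```javascript
// Reconstruction of the substring heuristic described above: a generic text
// field is treated as sensitive only if its name (or id, when name is
// absent) contains the substring "pass".
function looksLikePassword(field) {
  const label = field.name || field.id || "";
  return label.toLowerCase().indexOf("pass") !== -1;
}
```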

Of course Mixpanel’s heuristic could be updated to include these new fields, but that would just continue the game of whack-a-mole. There will inevitably be another password field formatted or handled in a way that this new heuristic fails to handle and user passwords will continue leaking.

Manual redaction is not a silver bullet

Mixpanel offers developers a way to manually specify fields that should be excluded from collection. Developers can simply add the class `mp-no-track` or `mp-sensitive` to an element to prevent sensitive information leaks. Indeed, this was the solution Mixpanel recommended in their response to our disclosure [3]. At first glance this might seem to mitigate the problems outlined in this post — anything missed by the automatic filtering can simply be manually filtered by the site. However, our research into session replay scripts found that companies repeatedly failed to prevent data leaks through manual redaction.

In Mixpanel’s case, the redaction feature negates the main benefit of Autotrack: collection without a manual review of the fields. We signed up for a Mixpanel account and enabled Autotrack in the dashboard [10]. During the process, we didn’t see any warnings about the risks of Autotrack, nor could we find a way to review all of the collected data. To discover the collected passwords, we needed to manually add an “event probe” to the password field or login form. This may explain why it took over 9 months for a Mixpanel user to discover the inadvertent password collection introduced back in March 2017, despite its presence on 4% of Mixpanel’s projects.

Where to go from here?

We focus on password fields in this post because they are the most constrained user input and should be the easiest to redact. Despite this, several of the major input scraping scripts are still unable to prevent password leaks. Financial, health, and other sensitive data are often collected using generic “text” fields which may have ambiguous input labels. We expect them to be even more difficult to filter in an automated way.

We show that the indiscriminate collection of form data is a security disaster waiting to happen. We’ve highlighted these risks before, and the analysis included in this post shows that the problem persists. We don’t view these issues as bugs to be fixed, but rather vulnerabilities inherent to this class of analytics scripts.

Appendix: Password leaks and disclosures

A1. Mixpanel continues to unintentionally collect passwords

The Autotrack feature allows sites to collect analytics on form interactions such as when checking out products or signing in to your account. Mixpanel Autotrack normally tries to exclude password fields from the collected data, but the filtering relies on fragile assumptions about page composition and markup.

Fig 1. Password collection by Mixpanel on testbook.com.

One of the password leaks occurs on testbook.com’s [2] login form when a user makes use of the “Show Password” feature, which causes the user’s password to be displayed in cleartext. Once the user takes a further action, such as editing the password or hiding it again, the password will be collected by Mixpanel. The collection happens regardless of whether the user ultimately submits the login form.

We reported the issue to Mixpanel. Their response can be found in [3].

Mixpanel currently lists the Autotrack feature as “on hold”, and appears to have disabled it for new projects. But sites that were already using Autotrack at the time of the incident are not affected by this change.

A2. Password collection due to a “Show password” feature: PropellerAds’ login form contains an invisible text field, which also holds a copy of the typed password. When a user wants to display the password in cleartext, the password field is replaced by the text field. FullStory’s auto-exclusion filters fail to recognize the password in the text field, causing the password to be collected by FullStory.

We reported this issue to FullStory and PropellerAds. FullStory responded promptly and said that “[they] are in touch with the customer to ensure that all inappropriate data is deleted and that they update their exclusion rules to comply with our Acceptable Use Policy.” FullStory’s complete response can be found in [4].

Fig 2. Password is collected on PropellerAds’ login form by FullStory.

A3. Johnlewis leaks

The registration page of the johnlewis.com website implements the “Show password” feature in the same way as PropellerAds: a copy of the password is always stored in an invisible text field, which replaces the password field when users want to show their password. This time, the session replay script from sessioncam.com fails to filter out the password in the text field and sends it to its servers. We reported the issue to SessionCam and johnlewis.com. The response from Sessioncam can be found in [5].

In both of these cases (johnlewis.com and propellerads.com) the password is grabbed by the session replay scripts even if the user does not make use of the Show/Hide password feature.

Fig 3. Password leaks to Sessioncam on johnlewis.com.

A4. Password leaks due to interaction with other analytics scripts.

Passwords on Capella University’s admission login page leak due to an unexpected interaction between different third-party scripts. When a user clicks on the password field, Adobe’s Analytics ActivityMap script stores the password in a cookie called “s_sq”. On the same page, a session replay script from Userreplay collects and sends all cookies, effectively causing the password to be collected by Userreplay.

Fig 4. The password stored in the cookie is leaked to Userreplay on the Capella.edu website.

A5. Password leaks due to a bug in redaction features

Fig 5: The WP Engine login page leaks passwords via FullStory’s keystroke logger. We decode the keyCode values sent to FullStory to demonstrate that they match what was typed into the password field.

In the screenshot above, we demonstrate how passwords entered on WP Engine’s login page leak to FullStory via their keystroke logger. From what we can tell, this leak is the result of WP Engine taking proactive steps to protect user data. Rather than relying on FullStory’s automatic exclusion of password fields, WP Engine added FullStory’s manual redaction tag (i.e., `fs-hide`) to the field.

FullStory’s documentation explains that fields hidden with the `fs-hide` tag will not be visible in recordings, and that “some raw input events (e.g., key and click events) are also redacted when they relate to an excluded element.” Through manual debugging, we observed that the use of `fs-hide` tag changed the way FullStory’s script classifies the password field, eventually causing it to collect the password. Following our disclosure, FullStory promptly fixed the issue and released a security notice to their customers, stating that the bug affected less than 1% of sites.
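The keyCode decoding used for Figure 5 amounts to mapping the captured codes back to characters. A simplified sketch (real `keyCode` values depend on keyboard layout and modifier keys; letter keys report their uppercase ASCII code regardless of Shift state, so case information is lost here):

```javascript
// Simplified sketch of decoding keydown keyCode values like those observed
// in the FullStory payloads: each code maps to its ASCII character.
function decodeKeyCodes(codes) {
  return codes.map((c) => String.fromCharCode(c)).join("");
}
```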

A6. Unmask Password browser extension causes leaks on ECrent

Fig 6. Mixpanel will collect passwords on ECrent registration page when the Unmask Password Chrome extension is in use. The password leaks in a base64 encoded query string to Mixpanel. ECrent is a rental platform that was in the Alexa top 10,000 sites at the time of measurement.

Mixpanel’s response to this issue was as follows: “Unfortunately, there’s little we (or the website owners who are our customers) can do to detect if the end user modifies their browser to make unexpected changes to the DOM. In this case, the solution is the same as for the concerns you reported earlier: explicitly blacklist sensitive input fields using the mechanism we provide.”

A7. Show Password causes leaks on Lenovo

Fig 7. FullStory will inadvertently collect passwords when the Show Password Chrome extension is in use. Similar to the PropellerAds example above, this extension implements the show-password functionality by swapping the current password field with a new cleartext field. The new field is not excluded from FullStory’s recordings — any further edits to the cleartext will cause the password to be collected by FullStory.

Endnotes

We thank Jonathan Mayer for his valuable feedback.

[1] Our rationale for publicizing these leaks is not to point fingers at specific first or third parties. Mixpanel, for example, handled their previous password incident quickly and with transparency. These aren’t bugs that need to be fixed, but rather insecure practices that should be stopped entirely. Even if the specific problems highlighted in this post were fixed, we suspect we’d be able to continue to find variants of the same leaks elsewhere. Thankfully these password leaks can’t be exploited publicly, since the analytics data is only available to first parties. Instead, these leaks expose users to an increased risk of data breaches, an increased potential for data access abuse, and to unclear policies regarding data retention and sharing.

[2] testbook.com is the 2360th most popular site globally according to Alexa; testbook.com mobile app has 1,000,000 – 5,000,000 downloads.

[3] Mixpanel’s complete response:

“Thank you for reporting this. Mixpanel takes security very seriously, and values the contributions of researchers to help our customers be more secure. Our Autotrack feature primarily relies upon the type of HTML input fields to determine whether it is sensitive or not. As you’ve noticed, it backstops that with a simple pattern to try to identify cases where a website is collecting sensitive information in non-password/hidden fields. Per our documentation, if a customer is collecting sensitive information in non-password fields, they should explicitly blacklist it for collection[.] ”

[4] FullStory’s responses:

“Thank you for writing in. We are in touch with the customer to ensure that all inappropriate data is deleted and that they update their exclusion rules to comply with our Acceptable Use Policy. Our Acceptable Use Policy makes clear in straightforward language that our goal is to avoid receiving sensitive data in the first place; we believe such data should never leave the end user’s device. Our engineering team is working on several techniques to do even more to help our customers avoid the sort of mistakes your research has highlighted. We welcome any and all additional effort that helps us protect our customers and their customers’ data.”

“Thanks for reaching back out and for sending over both disclosures. While a more complete reply is forthcoming, I wanted to send over a quick update. Our engineers have been actively looking into both disclosures throughout the morning. We expect to ship a code change to address the `.fs-hide` issue momentarily and we’ll be back in touch with a more detailed response to both disclosures after that code change ships. Again, we really appreciate your disclosing these issues to us.”

“I wanted to follow up and let you know that we fixed the bug associated with the .fs-hide issue and the fix is currently in production as of 4:00 PM EST this afternoon. HTML elements containing the `.fs-hide` selector will no longer record keystrokes. Further, we changed the functionality of the recording code so that it will no longer record the actual keys alongside keystroke data. Thus, any future regression will not run the risk of subtly sending keystroke data to FullStory. We have followed up with WPEngine and are working to follow up with any other customers who may have been affected by this particular issue. Thanks again for disclosing this issue to us. The level of detail in the disclosure was particularly helpful in bringing clarity to the diagnosis of the problem. Regarding the disclosure for the Show Password Chrome extension, we’re exploring mechanisms to mitigate this and other cases where sensitive fields are duplicated in the DOM.”

[5] SessionCam’s response: “Thank you for bringing this to our attention, we had been investigating this issue and are working on a fix. This fix will go live ASAP, in the meantime all affected sessions are being deleted. The pages in question are no longer live with SessionCam.”

[6] Through manual analysis, we found that Mixpanel sends user inputs on a field-by-field basis as soon as the field loses focus. This means that user data is sent to Mixpanel even when the user chooses not to submit a form.

[7] Skip input element values that appear to be Social Security Numbers. This is done by checking whether the value matches the following regular expression:

`ssnRegex = /(^\d{3}-?\d{2}-?\d{4}$)/;`

Input values that match these filters are excluded from collection by Mixpanel.
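The filter reduces to a few lines. Only the regular expression below is taken from Mixpanel’s snippet; the surrounding function is our own sketch of how such a filter would be applied:

```javascript
// Only the regular expression is taken from the Mixpanel snippet; the
// surrounding function is our own illustration of how the filter is applied.
const ssnRegex = /(^\d{3}-?\d{2}-?\d{4}$)/;

function shouldCollect(value) {
  // Values that look like Social Security Numbers are excluded from collection.
  return !ssnRegex.test(value);
}

console.log(shouldCollect('123-45-6789')); // false: looks like an SSN, skipped
console.log(shouldCollect('hello world')); // true: collected
```

Note how narrow the pattern is: it only excludes values that consist of exactly nine digits with optional dashes in the standard SSN positions.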

[8] Comparing Mixpanel’s password field detection heuristics to those of two popular password manager browser extensions (LastPass and 1Password), we found that Mixpanel’s heuristic is far less comprehensive than theirs. For instance, LastPass’s and 1Password’s heuristics consider translations of the word “password” in different languages, such as “contraseña”, “passwort”, “mot de passe” or “密码”, when detecting password fields. This is true despite the difference in incentives: if a password manager fails to detect a password field, it’s a usability problem; if Mixpanel fails to detect a password field, it’s a security problem.

[9] Wordcloud of the 50 most common password field name/id attributes that don’t include the substring “pass”, and will thus not be excluded by Mixpanel. This data was collected from a crawl of ~50K sites (~300K page visits). “Undefined” stands for password fields without any name or id attributes. Many of these words are translations of the word “password”, such as “senha” (Portuguese), “sifre” (Turkish), or “kennwort” (German).
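A heuristic closer to the password managers’ could check a list of such translations. The sketch below is our own illustration with a small sample word list, not LastPass’s or 1Password’s actual logic:

```javascript
// Our illustration of a multilingual password-field heuristic; the word list
// is a small sample, not LastPass's or 1Password's actual data.
const PASSWORD_WORDS = [
  'pass', 'contraseña', 'passwort', 'mot de passe', '密码',
  'senha', 'sifre', 'kennwort', 'wachtwoord',
];

function looksLikePasswordField(nameOrId) {
  // Fields with no name or id attribute ("undefined" in the wordcloud)
  // cannot be classified by this heuristic at all.
  const value = (nameOrId || '').toLowerCase();
  return PASSWORD_WORDS.some((word) => value.includes(word));
}
```

Even this simple substring check would have caught many of the fields in the wordcloud that Mixpanel’s `"pass"`-only filter missed.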

[10] Mixpanel’s dashboard for enabling Autotrack, before it was disabled for all new projects. Note that there are no warnings of possible sensitive data collection.

Recently we revealed that “session replay” scripts on websites record everything you do, like someone looking over your shoulder, and send it to third-party servers. This en-masse data exfiltration inevitably scoops up sensitive, personal information — in real time, as you type it. We released the data behind our findings, including a list of 8,000 sites on which we observed session-replay scripts recording user data.

As one case study of these 8,000 sites, we found health conditions and prescription data being exfiltrated from walgreens.com. These are considered Protected Health Information under HIPAA. The number of affected sites is immense; contacting all of them and quantifying the severity of the privacy problems is beyond our means. We encourage you to check out our data release and hold your favorite websites accountable.

Student data exfiltration on Gradescope

As one example, a pair of researchers at UC San Diego read our study and then noticed that Gradescope, a website they used for grading assignments, embeds FullStory, one of the session replay scripts we analyzed. We investigated, and sure enough, we found that student names and emails, student grades, and instructor comments on students were being sent to FullStory’s servers. This is considered Student Data under FERPA (US educational privacy law). Ironically, Princeton’s own Information Security course was also affected. We notified Gradescope of our findings, and they removed FullStory from their website within a few hours.

You might wonder how the companies’ privacy policies square with our finding. As best as we can tell, Gradescope’s Terms of Service actually permit this data exfiltration [1], which is a telling comment about the ineffectiveness of Terms of Service as a way of regulating privacy.

FullStory’s Terms are a different matter, and include a clause stating: “Customer agrees that it will not provide any Sensitive Data to FullStory.” We argued previously that this repudiation of responsibility by session-replay scripts puts website operators in an impossible position, because preventing data leaks might require re-engineering the site substantially, negating the core value proposition of these services, which is drag-and-drop deployment. Interestingly, Gradescope’s CEO told us that they were not aware of this requirement in FullStory’s Terms, that the clause had not existed when they first signed up for FullStory, and that they (Gradescope) had not been notified when the Terms changed. [2]

Web publishers kept in the dark

Of the four websites we highlighted in our previous post and this one (Bonobos, Walgreens, Lenovo, and Gradescope), three have removed the third-party scripts in question (all except Lenovo). As far as we can tell, no publisher (website operator) was aware of the exfiltration of sensitive data on their own sites until our study. Further, as mentioned above, Gradescope was unaware of key provisions in FullStory’s Terms of Service. This is a pattern we’ve noticed over and over again in our six years of doing web privacy research.

Worse, in many cases the publisher has no direct relationship with the offending third-party script. In Part 2 of our study we examined two third-party scripts which exploit a vulnerability in browsers’ built-in password managers to exfiltrate user identities. One web developer was unable to determine how the script was loaded and asked us for help. We pointed out that their site loaded an ad network (media-clic.com), which in turn loaded “themoneytizer.com”, which finally loaded the offending script from Audience Insights. These chains of redirects are ubiquitous on the web, and might involve half a dozen third parties. On some websites the majority of third parties have no direct relationship with the publisher.

Most of the advertising and analytics industry is premised on keeping not just users but also website operators in the dark about privacy violations. Indeed, the effort required by website operators to fully audit third parties would negate much of the benefit of offloading tasks to them. The ad tech industry creates a tremendous negative externality in terms of the privacy cost to users.

Can we turn the tables?

The silver lining is that if we can explain to web developers what third parties are doing on their sites, and empower them to take control, that might be one of the most effective ways to improve web privacy. But any such endeavor should keep in mind that web publishers everywhere are on tight budgets and may not have much privacy expertise.

To make things concrete, here’s a proposal for how to achieve this kind of impact:

Create a 1-pager summarizing the bare minimum that website operators need to know about web security, privacy, and third parties, with pointers to more information.

Create a tailored privacy report for each website based on data that is already publicly available through various sources, including our own data releases.

Build open-source tools for website operators to scan their own sites [3]. Ideally, the tool should make recommendations for privacy-protecting changes based on the known behavior of third parties.

Reach out to website operators to provide information and help make changes. This step doesn’t scale, but is crucial.

If you’re interested in working with us on this, we’d love to hear from you!

Endnotes

We are grateful to UCSD researchers Dimitar Bounov and Sorin Lerner for bringing the vulnerabilities on Gradescope.com to our attention.

[1] Gradescope’s terms of use state: “By submitting Student Data to Gradescope, you consent to allow Gradescope to provide access to Student Data to its employees and to certain third party service providers which have a legitimate need to access such information in connection with their responsibilities in providing the Service.”

[2] The Wayback Machine does not archive FullStory’s Terms page far enough back in time for us to independently verify Gradescope’s statement, nor does FullStory appear in ToSBack, the EFF’s terms-of-service tracker.

No boundaries for user identities: Web trackers exploit browser login managers

Originally published December 27, 2017 at https://freedom-to-tinker.com/2017/12/27/no-boundaries-for-user-identities-web-trackers-exploit-browser-login-managers/

In this second installment of the “No Boundaries” series, we show how a long-known vulnerability in browsers’ built-in password managers is abused by third-party scripts for tracking on more than a thousand sites.

by Gunes Acar, Steven Englehardt, and Arvind Narayanan

We show how third-party scripts exploit browsers’ built-in login managers (also called password managers) to retrieve and exfiltrate user identifiers without user awareness. To the best of our knowledge, our research is the first to show that login managers are being abused by third-party scripts for the purposes of web tracking.

The underlying vulnerability of login managers to credential theft has been known for years. Much of the past discussion has focused on password exfiltration by malicious scripts through cross-site scripting (XSS) attacks. Fortunately, we haven’t found password theft on the 50,000 sites that we analyzed. Instead, we found tracking scripts embedded by the first party abusing the same technique to extract email addresses for building tracking identifiers.

The image above shows the process. First, a user fills out a login form on the page and asks the browser to save the login. The tracking script is not present on the login page [1]. Then, the user visits another page on the same website which includes the third-party tracking script. The tracking script inserts an invisible login form, which is automatically filled in by the browser’s login manager. The third-party script retrieves the user’s email address by reading the populated form and sends the email hashes to third-party servers.

We found two scripts using this technique to extract email addresses from login managers on the websites that embed them. These addresses are then hashed and sent to one or more third-party servers. These scripts were present on 1,110 of the Alexa top 1 million sites. The process of detecting these scripts is described in our measurement methodology in Appendix 1. We provide a brief analysis of each script in the sections below.

Why does the attack work? All major browsers have built-in login managers that save and automatically fill in username and password data to make the login experience more seamless. The set of heuristics used to determine which login forms will be autofilled varies by browser, but the basic requirement is that a username and password field be available.

Login form autofilling in general doesn’t require user interaction; all of the major browsers will autofill the username (often an email address) immediately, regardless of the visibility of the form. Chrome doesn’t autofill the password field until the user clicks or touches anywhere on the page. Other browsers we tested [2] don’t require user interaction to autofill password fields.

Thus, third-party JavaScript can retrieve saved credentials by creating a form with username and password fields, which will then be autofilled by the login manager.
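The steps above can be sketched in a few lines of DOM code. This is our reconstruction of the technique, not code from either tracking script, and the function and element names are ours:

```javascript
// Our reconstruction of the invisible-form technique (not code from either
// tracking script). The password-type input is what makes the browser's
// login manager treat this as a login form and autofill it.
function injectBaitLoginForm(doc) {
  const form = doc.createElement('form');
  form.style.display = 'none'; // the user never sees the form

  const email = doc.createElement('input');
  email.type = 'text';
  email.name = 'email';

  const password = doc.createElement('input');
  password.type = 'password';
  password.name = 'password';

  form.appendChild(email);
  form.appendChild(password);
  doc.body.appendChild(form);

  // A tracking script would now poll email.value until the login manager
  // fills it in, then send (a hash of) the address to its server.
  return { form, email };
}
```

In a real browser, calling `injectBaitLoginForm(document)` on any page of a site where the user has saved a login suffices; as noted above, no user interaction is needed for the username field to be filled.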

Why collect hashes of email addresses? Email addresses are unique and persistent, and thus the hash of an email address is an excellent tracking identifier. A user’s email address will almost never change — clearing cookies, using private browsing mode, or switching devices won’t prevent tracking. The hash of an email address can be used to connect the pieces of an online profile scattered across different browsers, devices, and mobile apps. It can also serve as a link between browsing history profiles before and after cookie clears. In a previous blog post on email tracking, we described in detail why a hashed email address is not an anonymous identifier.

Scripts exploiting browser login managers

[Table: list of sites embedding scripts that abuse login managers for tracking]
“Smart Advertising Performance” and “Big Data Marketing” are the taglines used by the two companies who own the scripts that abuse login managers to extract email addresses. We have manually analyzed the scripts that contained the attack code and verified the attack steps described above. The snippets from the two scripts are given in Appendix 2.

Adthink (audienceinsights.net): After injecting an invisible form and reading the email address, the Adthink script sends MD5, SHA1, and SHA256 hashes of the email address to its server (secure.audienceinsights.net). Adthink then triggers another request containing the MD5 hash of the email to the data broker Acxiom (p-eu.acxiom-online.com).

The Adthink script contains very detailed categories for personal, financial, and physical traits, as well as intents, interests, and demographics. It is hard to say exactly how these categories are used, but they give a glimpse of what our online profiles are made of:

[Figure: the categories defined in the Adthink script, including detailed personal, financial, and physical traits, as well as intents, interests, and demographics.]

OnAudience (behavioralengine.com): The OnAudience script is most commonly present on Polish websites, including newspapers, ISPs, and online retailers. 45 of the 63 sites that contain the OnAudience script have the “.pl” country-code top-level domain.

The script reads the email address through the login manager and sends its MD5 hash back to its server. The OnAudience script also collects browser features including plugins, MIME types, screen dimensions, language, timezone information, user agent string, and OS and CPU information. The script then generates a hash based on this browser fingerprint. OnAudience claims to use only anonymous data, but hashed email addresses are not anonymous: if an attacker wants to determine whether a user is in the dataset, they can simply hash the user’s email address and search for records associated with that hash. For a more detailed discussion, see our previous blog post.

Is this attack new? This and similar attacks have been discussed in a number of browser bug reports and academic papers for at least 11 years. Much of the previous discussion focuses on the security implications of the current functionality, and on the security-usability tradeoff of autofill.

Several researchers have shown that it is possible to steal passwords from login managers through cross-site scripting (XSS) attacks [3,4,5,6,7]. Login managers and XSS are a dangerous mixture for two reasons: 1) passwords retrieved by XSS can have more devastating effects than cookie theft, as users commonly reuse passwords across different sites; 2) login managers extend the attack surface for password theft, as an XSS attack can steal passwords on any page within a site, even pages that don’t contain a login form.

How did we get here? You may wonder how a security vulnerability persisted for 11 years. That’s because from a narrow browser security perspective, there is no vulnerability, and everything is working as intended. Let us explain.

The web’s security rests on the Same Origin Policy. In this model, scripts and content from different origins (roughly, domains or websites) are treated as mutually untrusting, and the browser protects them from interfering with each other. However, if a publisher directly embeds a third-party script, rather than isolating it in an iframe, the script is treated as coming from the publisher’s origin. Thus, the publisher (and its users) entirely lose the protections of the same origin policy, and there is nothing preventing the script from exfiltrating sensitive information. Sadly, direct embedding is common — and, in fact, the default — which also explains why the vulnerabilities we exposed in our previous post were possible.

This model is a poor fit for reality. Publishers neither completely trust nor completely mistrust third parties, and thus neither of the two options (iframe sandboxing and direct embedding) is a good fit: one limits functionality and the other is a privacy nightmare. We’ve found repeatedly through our research that third parties are quite opaque about the behavior of their scripts, and at any rate, most publishers don’t have the time or technical knowhow to evaluate them. Thus, we’re stuck with this uneasy relationship between publishers and third parties for the foreseeable future.

The browser vendor’s dilemma. It is clear that the Same-Origin Policy is a poor fit for trust relationships on the web today, and that other security defenses would help. But there is another dilemma for browser vendors: should they defend against this and other similar vulnerabilities, or view it as the publisher’s fault for embedding the third party at all?

There are good arguments for both views. Currently, browser vendors seem to adopt the latter view of the login manager issue, treating it as the publisher’s burden. In general, there is no principled way to prevent third parties that are present on some pages of a site from accessing sensitive data on other pages of the same site. For example, if a user simultaneously has two tabs from the same site open, one containing a login form but no third party and the other containing the third party but no login form, then the third-party script can “reach across” browser tabs and exfiltrate the login information under certain circumstances. By embedding a third party anywhere on its site, the publisher signals that it completely trusts the third party.

Yet, in other cases, browser vendors have chosen to adopt defenses even when they are necessarily imperfect. For example, the HttpOnly cookie attribute was introduced to limit the impact of XSS attacks by blocking script access to security-critical cookies.

There is another relevant factor: our discovery means that autofill is not just a security vulnerability but also a privacy threat. While the security community strongly prefers principled solutions whenever possible, when it comes to web tracking, we have generally been willing to embrace more heuristic defenses such as blocklists.

Countermeasures. Publishers, users, and browser vendors can all take steps to prevent autofill data exfiltration. We discuss each in turn.

Publishers can isolate login forms by putting them on a separate subdomain, which prevents autofill from working on non-login pages. This has drawbacks, including increased engineering complexity. Alternatively, they could isolate third parties using frameworks like SafeFrame, which makes it easier for publisher scripts and iframed scripts to communicate, blunting the drawbacks of sandboxing. Any such technique requires additional engineering by the publisher compared to simply dropping a third-party script into the page.

Users can install ad blockers or tracking protection extensions to prevent tracking by invasive third-party scripts. The domains used to serve the two scripts (behavioralengine.com and audienceinsights.net) are blocked by the EasyPrivacy blocklist.

Now we turn to browsers. The simplest defense is to allow users to disable login autofill. For instance, the Firefox preference signon.autofillForms can be set to false to disable autofilling of credentials.

A less crude defense is to require user interaction before autofilling login forms. Browser vendors have been reluctant to do this because of the usability overhead, but given the evidence of autofill abuse in the wild, this overhead might be justifiable.

The upcoming W3C Credential Management API requires browsers to display a notification when user credentials are provided to a page [8]. Browsers may display the same notification when login information is autofilled by the built-in login managers. Displays of this type won’t directly prevent abuse, but they make attacks more visible to publishers and privacy-conscious users.

Built-in login managers have a positive effect on web security: they curtail password reuse by making it easy to use complex passwords, and they make phishing attacks harder to mount. Yet, in light of our findings, browser vendors should reconsider allowing stealthy access to autofilled login forms. More generally, for every browser feature, browser developers and standards bodies should consider how it might be abused by untrustworthy third-party scripts.

Appendix 1 – Measurement methodology

To study password manager abuse, we extended OpenWPM to simulate a user with saved login credentials and added instrumentation to monitor form access. We used Firefox’s nsILoginManager interface to add login credentials as if they had been previously stored by the user. We did not otherwise alter the functionality of the password manager or attempt to manually fill login forms. This allowed us to capture actual abuses of the browser login manager, since any exfiltrated data must have originated from the login manager.

We crawled 50,000 sites from the Alexa top 1 million, using the following sampling strategy: visit all of the top 15,000 sites, randomly sample 15,000 sites from the Alexa rank range [15,000, 100,000), and randomly sample 20,000 sites from the range [100,000, 1,000,000). This combination allowed us to observe attacks on both high- and low-traffic sites. On each of these 50,000 sites we visited six pages: the front page and a set of five other pages randomly sampled from the internal links on the front page.
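The sampling strategy can be expressed compactly. The helper below is our own illustration, not OpenWPM’s actual site-selection code:

```javascript
// Our illustration of the stratified sampling described above. `sites` is an
// array ordered by Alexa rank (index 0 = rank 1); `rand` is injectable so the
// sample is reproducible. Each stratum covers the half-open rank range
// [from, to); take === 'all' keeps every site in the range.
function stratifiedSample(sites, strata, rand = Math.random) {
  const chosen = [];
  for (const { from, to, take } of strata) {
    const pool = sites.slice(from - 1, to - 1);
    if (take === 'all') {
      chosen.push(...pool);
      continue;
    }
    // Fisher-Yates shuffle of the pool, then take the first `take` sites,
    // which samples without replacement.
    for (let i = pool.length - 1; i > 0; i--) {
      const j = Math.floor(rand() * (i + 1));
      [pool[i], pool[j]] = [pool[j], pool[i]];
    }
    chosen.push(...pool.slice(0, take));
  }
  return chosen;
}

// The crawl above corresponds roughly to:
// stratifiedSample(alexaTop1M, [
//   { from: 1,      to: 15001,   take: 'all' },   // all of the top 15,000
//   { from: 15000,  to: 100000,  take: 15000 },   // sample from [15,000, 100,000)
//   { from: 100000, to: 1000000, take: 20000 },   // sample from [100,000, 1,000,000)
// ]);
```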

The fake login credentials acted as bait, allowing us to introduce an email address and password to the page that could be collected by third parties without any additional interaction. We detected email address collection by inspecting JavaScript calls related to form creation and access, and by analyzing HTTP traffic. Specifically, we used the following instrumentation:

1. Mutation events to monitor elements inserted into the page DOM. This allowed us to detect the injection of fake login forms. When a mutation event fires, we record the current call stack and serialize the inserted HTML elements.

2. Instrumentation of HTMLInputElement to intercept access to form input fields. We log the input field value being read, to detect when the bait email address (autofilled by the built-in password manager) was sniffed.

3. Storage of HTTP request and response data, including POST payloads, to detect the exfiltration of the email address or password.

For both JavaScript (1, 2) and HTTP instrumentation (3) we store JavaScript stack traces at the time of the function call or the HTTP request. We then parse the stack trace to pin down the initiators of an HTTP request or the parties responsible for inserting or accessing a form.
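Attribution from a stack trace then amounts to pulling script URLs out of each frame. The following is a simplified sketch of our own; OpenWPM’s actual parser handles more browsers and edge cases:

```javascript
// Simplified sketch of attributing an action to scripts via its stack trace
// (our own illustration, not OpenWPM's actual parser).
function scriptsInStackTrace(stack) {
  const urls = [];
  for (const line of stack.split('\n')) {
    // Matches frames like "at fn (https://tracker.example/t.js:10:5)"
    // as well as Firefox-style "fn@https://tracker.example/t.js:10:5".
    const match = line.match(/(https?:\/\/[^\s)]+?):\d+:\d+\)?\s*$/);
    if (match && !urls.includes(match[1])) urls.push(match[1]);
  }
  return urls;
}
```

Mapping each frame back to a script URL is what lets us distinguish a form read performed by the publisher’s own code from one performed by an embedded third party.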

We then combine the instrumentation data to select scripts that:

inject an HTML element containing a password field (recall that the password field is necessary for the built-in password manager to kick in)

read the email address from the input field automatically filled by the browser’s login manager

send the email address, or a hash of it, over HTTP
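The selection step itself is a simple conjunction over per-script instrumentation records. The record field names below are our own, for illustration:

```javascript
// Our sketch of the final selection step: flag a script only if it exhibits
// all three behaviors listed above. Record field names are illustrative.
function flagSniffingScripts(records) {
  return records
    .filter((r) => r.injectedPasswordField && r.readBaitEmail && r.sentEmailOrHash)
    .map((r) => r.scriptUrl);
}
```

Requiring all three conditions together keeps benign scripts (for example, a publisher’s own form validation, which reads inputs but injects no hidden password field) out of the results.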

To verify the findings of the automated experiments, we manually analyzed sites that embed the two scripts matching these conditions. We verified that the forms inserted by the scripts were not visible. We then opened accounts on the sites that allow registration and let the browser store the login information (by clicking yes to the dialog in Figure 1). Finally, we visited another page on the site and verified that the browser’s password manager filled the invisible form injected by the scripts.

Appendix 2 – Code Snippets

No boundaries: Exfiltration of personal data by session-replay scripts

Originally published November 15, 2017 at https://freedom-to-tinker.com/2017/11/15/no-boundaries-exfiltration-of-personal-data-by-session-replay-scripts/

This is the first post in our “No Boundaries” series, in which we reveal how third-party scripts on websites have been extracting personal information in increasingly intrusive ways. [0]

by Steven Englehardt, Gunes Acar, and Arvind Narayanan

Update: we’ve released our data — the list of sites with session-replay scripts, and the sites where we’ve confirmed recording by third parties.

You may know that most websites have third-party analytics scripts that record which pages you visit and the searches you make. But lately, more and more sites use “session replay” scripts. These scripts record your keystrokes, mouse movements, and scrolling behavior, along with the entire contents of the pages you visit, and send them to third-party servers. Unlike typical analytics services that provide aggregate statistics, these scripts are intended for the recording and playback of individual browsing sessions, as if someone is looking over your shoulder.

The stated purpose of this data collection includes gathering insights into how users interact with websites and discovering broken or confusing pages. However, the extent of data collected by these services far exceeds user expectations [1]: text typed into forms is collected before the user submits the form, and precise mouse movements are saved, all without any visual indication to the user. This data can’t reasonably be expected to be kept anonymous. In fact, some companies allow publishers to explicitly link recordings to a user’s real identity.

For this study we analyzed seven of the top session replay companies (based on their relative popularity in our measurements [2]). The services studied are Yandex, FullStory, Hotjar, UserReplay, Smartlook, Clicktale, and SessionCam. We found these services in use on 482 of the Alexa top 50,000 sites.

This video shows the “co-browse” feature of one company, where the publisher can watch user sessions live.

What can go wrong? In short, a lot.

Collection of page content by third-party replay scripts may cause sensitive information such as medical conditions, credit card details and other personal information displayed on a page to leak to the third-party as part of the recording. This may expose users to identity theft, online scams, and other unwanted behavior. The same is true for the collection of user inputs during checkout and registration processes.

The replay services offer a combination of manual and automatic redaction tools that allow publishers to exclude sensitive information from recordings. However, in order for leaks to be avoided, publishers would need to diligently check and scrub all pages which display or accept user information. For dynamically generated sites, this process would involve inspecting the underlying web application’s server-side code. Further, this process would need to be repeated every time a site is updated or the web application that powers the site is changed.

A thorough redaction process is actually a requirement for several of the recording services, which explicitly forbid the collection of user data. This negates the core premise of these session replay scripts, which market themselves as plug and play. For example, Hotjar’s homepage advertises: “Set up Hotjar with one script in a matter of seconds” and Smartlook’s sign-up procedure features their script tag next to a timer with the tagline “every minute you lose is a lot of video”.

To better understand the effectiveness of these redaction practices, we set up test pages and installed replay scripts from six of the seven companies [3]. From the results of these tests, as well as an analysis of a number of live sites, we highlight four types of vulnerabilities below:

1. Passwords are included in session recordings. All of the services studied attempt to prevent password leaks by automatically excluding password input fields from recordings. However, mobile-friendly login boxes that use text inputs to store unmasked passwords are not redacted by this rule, unless the publisher manually adds redaction tags to exclude them. We found at least one website where the password entered into a registration form leaked to SessionCam, even when the form was never submitted.

2. Sensitive user inputs are redacted in a partial and imperfect way. As users interact with a site they will provide sensitive data during account creation, while making a purchase, or while searching the site. Session recording scripts can use keystroke or input element loggers to collect this data.

All of the companies studied offer some mitigation through automated redaction, but the coverage offered varies greatly by provider. UserReplay and SessionCam replace all user input with equivalent-length masking text, while FullStory, Hotjar, and Smartlook exclude specific input fields by type. We summarize the redaction of other fields in the table below.

Automated redaction is imperfect; fields are redacted by input element type or heuristics, which may not always match the implementation used by publishers. For example, FullStory redacts credit card fields with the `autocomplete` attribute set to `cc-number`, but will collect any credit card numbers included in forms without this attribute.
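One way to make redaction less dependent on markup would be to test candidate values directly, for instance with the Luhn checksum used by payment card numbers. This is a hypothetical mitigation we sketch here, not something any of the services is known to do:

```javascript
// Hypothetical value-based redaction check (our sketch, not FullStory's
// code): flag anything that passes the Luhn checksum used by payment cards,
// regardless of the field's autocomplete attribute.
function looksLikeCardNumber(value) {
  const digits = value.replace(/[\s-]/g, '');
  if (!/^\d{12,19}$/.test(digits)) return false;
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = +digits[digits.length - 1 - i];
    if (i % 2 === 1) { // double every second digit from the right
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}
```

A check like this would catch card numbers typed into plain text fields that lack the `cc-number` attribute, at the cost of occasional false positives on other digit strings.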

To supplement automated redaction, several of the session recording companies, including Smartlook, Yandex, FullStory, SessionCam, and Hotjar, allow sites to further specify input elements to be excluded from the recording. To effectively deploy these mitigations, a publisher would need to actively audit every input element to determine whether it contains personal data. This is complicated, error-prone, and costly, especially as a site or the underlying web application code changes over time. For instance, the financial services site fidelity.com has several redaction rules for Clicktale that involve nested tables and child elements referenced by their index. In the next section we further explore these challenges.

A safer approach would be to mask or redact all inputs by default, as is done by UserReplay and SessionCam, and allow whitelisting of known-safe values. Even fully masked inputs provide imperfect protection: for example, the masking used by UserReplay and Smartlook leaks the length of the user’s password.
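The length leak is easy to see with two illustrative masking functions (ours, not the vendors’ code):

```javascript
// Illustrative masking strategies (not the vendors' actual code).
// Equal-length masking preserves, and therefore leaks, the input's length.
const maskEqualLength = (value) => '*'.repeat(value.length);

// A fixed-length mask reveals nothing about the original value.
const maskFixed = () => '********';

console.log(maskEqualLength('hunter2')); // "*******": reveals a 7-character password
console.log(maskFixed('hunter2'));       // "********": length hidden
```

Password length alone meaningfully narrows the search space for an attacker who later obtains the recordings.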

3. Manual redaction of personally identifying information displayed on a page is insecure by design. Session recording companies expect sites to manually label all personally identifying information included in a rendered page. Sensitive user data has a number of avenues to end up in recordings, and small leaks over several pages can lead to a large accumulation of personal data in a single session recording.

For recordings to be completely free of personal information, a site’s web application developers would need to work with the site’s marketing and analytics teams to iteratively scrub personally identifying information from recordings as it’s discovered. Any change to the site design, such as a change in the class attribute of an element containing sensitive information or a decision to load private data into a different type of element requires a review of the redaction rules.

As a case study, we examine the pharmacy section of Walgreens.com, which embeds FullStory. Walgreens makes extensive use of manual redaction for both displayed and input data. Despite this, we find that sensitive information, including medical conditions and prescriptions, is leaked to FullStory alongside the names of users.

We do not present the above examples to point fingers at a particular website. Instead, we aim to show that the redaction process can fail even for a large publisher with strong legal incentives to protect user data. We observed similar personal information leaks on other websites, including the checkout pages of Lenovo [5]. Sites with fewer resources or less expertise are even more likely to fail.

4. Recording services may fail to protect user data. Recording services increase the exposure to data breaches, as personal data will inevitably end up in recordings. These services must handle recording data with the same security practices with which a publisher would be expected to handle user data.

We provide a specific example of how recording services can fail to do so. Once a session recording is complete, publishers can review it using a dashboard provided by the recording service. The publisher dashboards for Yandex, Hotjar, and Smartlook all deliver playbacks within an HTTP page, even for recordings of sessions on HTTPS pages. This allows an active man-in-the-middle attacker to inject a script into the playback page and extract all of the recording data. Worse yet, Yandex and Hotjar deliver the publisher's page content over HTTP: data that was previously protected by HTTPS is now vulnerable to passive network surveillance.

The vulnerabilities we highlight above are inherent to full-page session recording. That’s not to say the specific examples can’t be fixed — indeed, the publishers we examined can patch their leaks of user data and passwords. The recording services can all use HTTPS during playbacks. But as long as the security of user data relies on publishers fully redacting their sites, these underlying vulnerabilities will continue to exist.

Does tracking protection help?

Two commonly used ad-blocking lists, EasyList and EasyPrivacy, do not block FullStory, Smartlook, or UserReplay scripts. EasyPrivacy does have filter rules that block Yandex, Hotjar, ClickTale, and SessionCam.

At least one of the five companies we studied (UserReplay) allows publishers to disable data collection for users who have Do Not Track (DNT) set in their browsers. We scanned the configuration settings of UserReplay deployments on the homepages of the Alexa top 1 million sites, and found that none of the publishers chose to honor the DNT signal.
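Honoring DNT amounts to a simple gate before the recorder loads. The sketch below is illustrative only; `respectDNT`, `config`, and `navigatorLike` are hypothetical names, not UserReplay's actual configuration API:

```javascript
// Decide whether to start a session recording, given a vendor config
// object and something shaped like the browser's `navigator`.
function shouldRecord(config, navigatorLike) {
  // Publisher did not opt in to honoring DNT: always record.
  if (!config.respectDNT) return true;
  // Otherwise, skip users whose browser sends DNT: 1.
  return navigatorLike.doNotTrack !== "1";
}
```

Our scan suggests that even when a vendor offers such a switch, publishers leave it off.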

Improving user experience is a critical task for publishers. However, it shouldn't come at the expense of user privacy.

End notes:

[0] We use the term ‘exfiltrate’ in this series to refer to the third-party data collection that we study. The term ‘leakage’ is sometimes used, but we eschew it, because it suggests an accidental collection resulting from a bug. Rather, our research suggests that while not necessarily malicious, the collection of sensitive personal data by the third parties that we study is inherent in their operation and is well known to most if not all of these entities. Further, there is an element of furtiveness; these data flows are not public knowledge and neither publishers nor third parties are transparent about them.

[1] A recent analysis of the company Navistone, completed by Hill and Mattu for Gizmodo, explores how data collection prior to form submission exceeds user expectations. In this study, we show how analytics companies collect far more user data with minimal disclosure to the user. In fact, some services suggest that first-party sites simply include a disclaimer in their privacy policy or terms of service.

[2] We used OpenWPM to crawl the Alexa top 50,000 sites, visiting the homepage and 5 additional internal pages on each site. We used a two-step approach to detect analytics services that collect page content.

First, we injected a unique value into the HTML of each page and searched the page traffic for evidence of that value being sent to a third party. To detect values that may be encoded or hashed, we used a detection methodology similar to previous work on email tracking. After filtering out the recipients of these leaks, we isolated pages on which at least one third party received a large amount of data during the visit but for which we did not detect a unique ID. On these sites, we performed a follow-up crawl that injected a 200KB chunk of data into the page and checked whether we observed a corresponding increase in the size of the data sent to the third party.

We found 482 sites on which either the unique marker was leaked to a collection endpoint of one of the services, or on which we observed an increase in data collection roughly equivalent to the compressed length of the injected chunk. We believe this value is a lower bound, since many of the recording services offer the ability to sample page visits, which is compounded by our two-step methodology.

[3] One company (Clicktale) was excluded because we were unable to make the practical arrangements needed to analyze the script's functionality at scale.

[4] FullStory’s terms and conditions explicitly classify health or medical information, or any other information covered by HIPAA, as sensitive data and ask customers to “not provide any Sensitive Data to FullStory.”

[5] Lenovo.com is another example of a site which leaks user data in session recordings.

[6] We used the default scripts available to new accounts for 5 of the 6 providers. For UserReplay, we used a script taken from a live site and verified that the configuration options match the most common options found on the web.