⛅ invi.sible.link, operative reporting

This page gets updated near the beginning of every month; to understand the overall picture, consult the Project Plan.

Task list

Improve browser emulation and JavaScript sandboxing by integrating the Honeynet Project's Thug. Technically, this allows us to obtain a list of all the JavaScript functions executed, going beyond static source-code analysis

Phase:

Add a data-sharing capability to every node, and look for differences between tracking code

Phase:

In-browser visualization of the results, usable to monitor trends or visually identify anomalies

Phase:

Import the browser history of a person to map their exposure profile / support community-driven input (through GitHub files). This approach would allow a more personalized analysis that goes beyond just looking at the Alexa top 500 sites for each country

Research how to identify anomalies and tracking-related functionality based on the dynamic code analysis provided by point 1.

Phase:

Research the privacy implications of the device fingerprinting used in tracking

Phase:

Support Latin American communities running the tool, integrating their results

Phase:

Write a research report

Phase:

Work with CodingRights to disseminate the results in Latin American communities

Phase:

Researcher visualization: the difference between this and point 3 is the amount of detail provided

Phase:

Wrapping up the project and performing the last touches and cleanups

Phase:

November 2017

November, the 11th month of my fellowship, brought the exciting OTF summit in Valencia, which opened a couple of potential collaborations, and a few updates from the academics involved in privacy research have been published.

External input

Two external pieces of work confirm the usefulness of the invi.sible.link pipeline, and they will provide some useful narrative in the final phase of this project.

Session replay is the ability to replay a visitor's journey on a web site or within a web application. Replay can include the user's view (browser or screen output), user input (keyboard and mouse inputs), and logs of network events or console logs. Its main value is to help improve customer experience and to identify obstacles in conversion processes on websites.

Princeton University's research on session replay was published in November 2017. Given the concerns such a feature raises on websites intended for a targeted audience, I've extended the scorecard I'm working on to report this evidence.

As the researchers explain, they released a list of third-party trackers offering session replay services, but this list is not complete and at the moment is a mix of domain names and scripts (Wired).

Targeted attacks were the main concern behind this fellowship's product: what if a malicious script, or injected malware, is served based on the content you are visiting or on your profile background? This has been confirmed to happen, but to uncover it, a media organization had to develop a tool, run a campaign, analyze data, and find results. This is a process that does not scale, and my development goal is exactly to make these expert abilities scale for communities.

Final stage of the fellowship

Nearly all the points announced in my project plan should be delivered. The research part, in which I was supposed to look for insights into script behavior, has been limited by the campaign and partnership efforts.

TODO list for December, to conclude the fellowship

Complete the expert visualization, with the goal of investigating session replay and canvas fingerprinting. This is the bare minimum

Try to launch a collaborative analysis of the World Trade Organization meeting in Argentina: the more people adopt the social extension, the more input can be analyzed.

Complete the clinics analysis, test the bot concept, and reach out to other organizations. The goal is to test a campaign tool.

August, September and October 2017

This report, which is supposed to be monthly, is being published after three months. I collaborated with a third-party organization to experiment with an analysis based on scraping links from social media and analyzing their trackers. As I write, at the beginning of November 2017, a report is in development (not related to tracker analysis), and the opportunity opened a new branch in this research.

Social media observer and realtime analysis

One of the challenges in web analysis is the impossibility of testing all the conditions under which a user will navigate. By making common assumptions, you can end up doing a mass study but fail to spot targeted web surveillance, because the script you hope to catch will only trigger on specific content.

The state-of-the-art solutions are:

Letting experts provide their own lists (in my personal experience, it is quite hard to get people to self-profile their community and report back generic but detailed links)

Exporting your browsing history; this year a good improvement has been made by EDRi fellow Sid with Hackuna metadata, and I don't think I can explore that direction further.

Monitoring only the homepage (this is what I and WebTAP of Princeton University did), and increasing the number of tested sites. But this can't detect a specific tracker that runs only on a few selected pages of a site.

Considering how Facebook and Google are the de-facto gatekeepers of the WWW, the most appropriate way to monitor the quality of web pages is to look at what is effectively served by these platforms. But, in order to do so, you have to observe the personalized experience of users. With a browser extension it is possible to collect the links appearing in a user's timeline (or those shared by a selected community), and analyze them.
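The collection idea can be sketched in a few lines (a minimal sketch only: a real extension would walk the live DOM, and the platform domain parameter is an assumption for illustration):

```javascript
// Sketch of the collection idea: given the HTML of a timeline,
// pull out the external links so they can be submitted for analysis.
// A real browser extension would traverse the live DOM instead of
// matching against a string.
function collectLinks(html, platformDomain) {
  const hrefs = [...html.matchAll(/href="(https?:\/\/[^"]+)"/g)]
    .map(m => m[1]);
  // keep only links pointing outside the platform itself
  return hrefs.filter(u => !u.includes(platformDomain));
}
```

The filter matters: the interesting subjects are the third-party pages the community shares, not the platform's own navigation links.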

This approach is currently in use, and will produce output during the month.

LatAm realtime tracking experiment

A work in progress is the development of argentinin.tracking.exposed, analyzing the links shared by Argentinian media and political figures.

Analysis of javascript tracking behavior

The usage of Chrome with Privacy Badger has been implemented as part of the testing suite; this enables the first JavaScript fingerprinting and visual reporting.

This result can be fetched via API (e.g. https://invi.sible.link/api/v1/details/itatopex ); it contains most of the JavaScript calls usable in browser fingerprinting.

JS details visualization

The interface wants to show graphically what third parties can potentially access. Every campaign produces this output daily; an incomplete visualization can be seen in this example from the Chilean online clinics: Clinics-CL

July 2017

The last month has been used for advocacy, networking and personal time.

In July, I was in Cartagena (Colombia), then on the Río Magdalena meeting indigenous communities during their pacification process, then in Lima (Perú) meeting some local digital rights groups, then at the SHA2017 hacker camp near Amsterdam.

The most exciting developments are the conferences and the meetings I had with CodingRights in the Latin American countries.

I have been explaining tracking implications over the last years, with mixed results. I am pleased to see how an analysis including only a contextual group of sites enables compelling narratives.

The story line used with CodingRights in Cartagena followed this logic:

You, as a citizen, could have a health issue, and health insurance is necessary for it.

In the so-called quantified society, personalized services are a means enabling more customer exploitation. Policy and common knowledge do not seem ready yet to face these offers.

If an online clinic includes third-party trackers, your activity on their website is only one link away from your physical person.

Your navigation of an online clinic website could leak some patterns: a particular exam you are looking for, symptoms, prescriptions.

This information can be used by the data processor, sold through data brokers, and end up at an insurance company, then used to increase profit.

The business model of hospitals, public or private, is not ads-based; third-party trackers do not have the same justification used, for example, in the news media debate.

This simple sequence worked in explaining the necessity of ad-blockers, the responsibility of websites, and the power dynamics of data processors.

The outreach I was looking for is meeting partners who try to figure out their concrete problems, and how data brokers could exploit the context they live in.

If I had to make a list of the topics raised in the discussions, the ones I recall most often are:

Elections, for local media and political parties: for those aware of Cambridge Analytica, it is clear that political interests can be used to frame a personalized message, and this can be abused for political marketing.

Activist websites: because visiting them implicitly leaks your political interests

Gambling/pornography websites: because they can leak addictions or embarrassing details in certain societies.

I am exploring a theory: every connected human belongs to many social contexts.

Moreover, we as humans are not vulnerable in all of these environments: you can even belong to a thug gang, and nobody in your town will harm you, but you risk not finding any insurance coverage because a multinational denies your health care.

This approach tries to simplify the creation of campaigns aimed at the aspects of life in which a person is vulnerable. The content produced is designed to speak to a group of people who feel themselves at risk.

Imagine two characters: "a political opponent in Iran" and "a poker addict looking for a new job." They face different risks, your empathy for their situations is probably different, and so is the assistance they deserve. invi.sible.link is a framework: it can be used by two separate groups, one speaking Persian and the other talking to addicts, because massive web profiling can harm both of them.

The "nothing to hide" narrative is to be addressed by making many websites. The hope is that anybody will find the one that speaks to the part in which they are vulnerable. Few people feel completely safe, and they are not the target.

This approach has been confirmed and will define the cultural inheritance left after the fellowship. Speaking of which, I have to run a little bit now to catch up with the planned deliverables.

June 2017

The month has been used for research and advocacy more than development.

IACAP conference and presentation

I gave my first presentation about third-party tracker analysis; the content collected and the experience gained will be reused in future presentations. A blogpost explaining the context will be published as soon as I produce new visualizations with Tableau. I'm doing data investigation with that tool because it is much more efficient than developing my own visualizations before having understood the complexity of the data. I've written a blogpost to test a simplified communication on algorithms, profiling and political impact: profiling, algorithm surveillance and religious freedom.

General software improvements for campaign checking

A simple approach to monitoring the trend of how websites are doing has been implemented (a random example); using the last-activities interface, I and the other participants check the trends.

urlscan.io

I got in touch with and obtained a key for urlscan.io, a service which monitors website inclusions from their own infrastructure. It can be useful as a comparison. A driver to use the service has still to be implemented.

WebTAP and their publications

WebTAP is the Web Transparency and Accountability Project of Princeton University. I subscribed to get access to their data as a researcher, but I haven't yet had a chance to try it. The research team has also released some inspiring papers I had the opportunity to read this month:

PhantomJS will be unmaintained

This is not a big deal, considering the probe diversification I have in mind, but considering the capability of collecting OpenGraph data, I probably have to replace PhantomJS support with NightmareJS; in general, I look forward to integrating Thug once and for all.

Experimental visualizations

May 2017

In May 2017 the first two campaigns got released; mostly I worked as a facilitator: text editing, visualization revision, double-checking the results, and using this achievement to show a vision to partners.

Publishing results to a broad audience

The month of May hasn't led to any particular improvement, rather a stabilization of the interfaces and the workflow. The month saw the two expected releases and progress with CodingRights on our presentation in July.

Specifically, the episode of the TV show using the analysis website got 20k unique IP accesses and this spike in web traffic:

Deflect.ca offers the CDN and the technological interface to query the users.

At the moment, the experience gained has been useful to stabilize a tangible result for a broader audience. Also, minor initiatives are running, and currently I'm in Istanbul to make progress on an analysis in Turkey.

April 2017

Most of the time in April 2017 has been dedicated to separating the analysis content from the campaign content. Campaigns have to be delegated as much as possible to the campaigners (the local communities aware of the social and digital issues), and this has required some polishing on my side. An example campaign, 100% HTML and zero code, is implemented here.

Organizing documentation

With the first adoption of the technology, I've started to organize documentation and to define how the project might be integrated into other projects doing the same analysis. The README on the campaigns is the reference for them and is currently kept as the reference by whoever organizes those.

Academy and outreach

Three important events involving my research in the fellowship are going to happen between May and July.

I'm working with an investigative journalism team to explain third-party trackers and their privacy and security implications to a (very) large and non-technical audience. It might be done in the second half of May.

Big Data for the South, in Cartagena, has accepted the application by me and Joana of CodingRights, about third-party analysis in Latin America compared with other Western countries.

For the first and third points above, the potential outcome is quite large visibility for the code repository, in the hope that some open-source developer with free time takes an interest in the project.

Experiment with OpenWPM

Princeton University, after webXray, improved their technology with OpenWPM, a nicely developed tool that might represent a valid integration and extension of my analysis. It uses a different format, supports much more interaction with a non-headless browser, and is less orchestrated.

March 2017

RightsCon and the search for local supporters

In March (and the first days of April) I attended RightsCon and the International Journalism Festival. My presence there was justified by some talks I gave about algorithm accountability, and I had some meetings with teams from different countries and contexts. Discussions are proceeding further, in order to begin an analysis campaign.

The countries of interest are Iran and Turkey. Finding local supporters is getting more vital, and I'm expanding the side of the project intended to communicate the results. The goal is to split my technical analysis and graphs from the advocacy material. Having a clear separation of duties would, in theory, let me and the local supporters work toward the same goal without blocking each other. A clear separation between the technical analysis and its local declination is intended.

Human rights researchers and Internet policy analysts have been my targets to get in touch with.

Side life

I started a trip to and around Europe to meet a number of potential collaborators; I'm traveling in these months, and my update schedule is getting delayed. Third-party tracker analysis keeps raising political (for example, the Sleeping Giants) and technical interest, as highlighted below:

Research plan with CodingRights

The CodingRights team and I applied for a conference; in the next months we'll do a comparative analysis of sensitive (and less sensitive) websites across Latin American and Western countries (as a comparison). It is expected to be one of the core results of this fellowship, or at least a lasting example of the analysis method. In the meantime, analytics and deeper script analysis with Thug will be supported. Results are scheduled to be available in June 2017.

February 2017

Managing campaign-based websites

I'm realizing three campaigns with a Western audience in mind. These campaigns do not strictly fit my fellowship goals, but are useful steps for outreach and early feedback.

I realize that a certain effort goes into website selection. It requires a deep knowledge of the target audience, as does any non-generalist communication. Having a collaborator belonging to the environment you want to reach is a requirement. In contexts in which the target audience is technical, this is less important, though.

Raising awareness about third-party trackers and their implications for users' and companies' security also requires some paragraphs on remediations, alternatives and script blocking. This might be delegated to the many translated tutorials around the web.

Translation/localisation is optional but welcome. For this point, it is important to implement a technical structure that permits such customization.

At RightsCon, by the end of March 2017, I'll meet partners from Turkey, Iran and other countries to discuss how to begin some analysis campaigns.

As a technical improvement, I've extended the campaign manager to import CSV: this enables collaborators to work with a spreadsheet and GitHub without dealing with a more technically complex format (I use JSON natively). Also, it can be edited directly on GitHub, lowering the entrance barrier.
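The conversion can be sketched as a minimal CSV-to-subject-list function (a sketch only: the header names are hypothetical, and the real campaign manager may handle quoting and validation differently):

```javascript
// Minimal sketch of a CSV → subject-list converter, assuming a
// hypothetical header of "name,url,category"; the real campaign
// format may differ, and quoted commas are not handled here.
function csvToSubjects(csvText) {
  const lines = csvText.trim().split('\n');
  const header = lines[0].split(',').map(h => h.trim());
  return lines.slice(1).map(line => {
    const cells = line.split(',').map(c => c.trim());
    const subject = {};
    header.forEach((key, i) => { subject[key] = cells[i]; });
    return subject; // one JSON document per spreadsheet row
  });
}

// Example: a collaborator edits this spreadsheet export on GitHub.
const subjects = csvToSubjects(
  'name,url,category\nClinic A,https://example.com,health'
);
```

The point of the design is that collaborators only ever touch the flat CSV, while the pipeline keeps consuming its native JSON.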

Campaigns in progress and testing of the workflow

In the three campaigns I'm testing, the d3 plugin sankey is helping in the generation of a graphically appealing, scalable visualization, like:

Joint application with Coding Rights

A research paper comparing 10 South American countries and 5 Western countries is a work in progress; we'll apply to an academic conference.

Stable monitoring of the analysis

The monitoring pipeline has proved to be stable. The statistics are available here, showing the last two days, and there, showing the last 20 days (it might take a while to load the graphs).

Below you can see a strange pattern that happens only on the machine located in Washington.

Details: I'm using three boxes, in Washington, Amsterdam and Hong Kong. They share the same software and execute the same commands at the same time. This is done to reduce the differences across tests. For a specific website under investigation, different code is sent to the Washington box. This has the side effect of freezing PhantomJS and keeping it running. From the load average graph I spotted this first anomaly:

In the next months, with the integration of Thug, it will be easier to perform JavaScript inspection and investigate the reasons.

January 2017

An update on the vision

Inspired by the title Hacking the Attention Economy, it has become clear that my product cannot just be a website full of results, because a website has these limits:

one, or maybe more than one, target audience. Every target audience requires (in theory) appropriate terminology, design, and knowledge of their context.

every target needs to be notified of the existence of the invi.sible.link project, and this is stupid: invi.sible.link is a technical tool, and nobody who can't run the code would ever look at these pages

a translation process is expensive and largely useless. Does it make any sense for me to translate into Italian the investigation made with CodingRights in LatAm? No.

This is a pipeline, a series of tools that constantly process input to produce output. My output has to be:

targeted: a specific audience that cares about their own stuff, their content, their business, because the subject of the campaign would be a list of websites close to that audience

simple: you can learn how tracking works if you want; if you don't, at least you would know that someone in your digital environment is treating their users badly. This feeling has to be the goal of the communication: making websites responsible for their third-party inclusions.

like a bot: acting on notifications via social media; not mobile first, but bot first!

Therefore, in this post-prototype phase, the campaign pipeline has to be the priority. This permits experiencing from the beginning how to reach different social circles, and will force the project to keep an operating workflow despite the technical challenges to be faced later.

This approach is being experimented with in February, for the first targeted campaign. The goal of such a campaign is to get visibility and constructive criticism, and to see the overall reaction to this kind of monitoring approach.

CodingRights campaign progress

We held a meeting at CodingRights planning the Chupadados campaign, and currently I'm running a prototypal experiment outside the fellowship scope, in order to test the infrastructure and the content production pipeline.

Long-term monitoring is working properly

The stats page has been working smoothly for a while; I'm using it to keep in check the multiple operations performed. New graphs might be added and applied to specific campaigns. Studying which kinds of graphs to use is still a work in progress, but I've already done a successful experiment in integrating rawgraphs.io.

December 2016

Web crawling and orchestration works for me

The running structure is simple and easily distributed. It involves a few components.

Vigile (central authority)

At 5 AM GMT, a command is executed; it creates a list of tasks that have to be completed. This list is derived from the list of subjects under analysis, and can be reached publicly via API:

This model has technical properties that help me with the orchestration:

the field needName specifies the need. At the moment, the only need is named basic and means: crawl with PhantomJS. This permits specialisation in distribution because, if the vantage point doesn't support that test, it can just skip to the next need.

the fields HK, AMS and Aname have boolean values, indicating whether the vantage point (specified in the request) has fulfilled the task. The value false means the VP has only received the task; true means it has completed the task and confirmed the execution.

start and end describe the window of time in which the task can be fulfilled.
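Putting those properties together, the daily task generation can be sketched as follows (the field names come from the description above; the subject shape, the ISO timestamps and the 24-hour window are assumptions for illustration):

```javascript
// Sketch: build one task per subject, as the 5 AM GMT command might do.
// needName "basic" = crawl with PhantomJS; HK/AMS/Aname are the
// per-vantage-point completion flags, false until a VP confirms.
function buildTasks(subjects, now = new Date()) {
  const start = now.toISOString();
  const end = new Date(now.getTime() + 24 * 3600 * 1000).toISOString();
  return subjects.map(subject => ({
    subject,            // the site under analysis
    needName: 'basic',  // the only need supported at the moment
    HK: false,          // Hong Kong vantage point
    AMS: false,         // Amsterdam vantage point
    Aname: false,       // Washington vantage point
    start, end          // window in which the task can be fulfilled
  }));
}
```

Because each vantage point flips only its own flag, the central authority can tell at a glance which tests are pending, claimed, or confirmed.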

Chopsticks

A quick and dirty approach: every two minutes crontab calls for tasks to be done, asking for 30 to be executed, 10 at a time; maximum time 30 seconds, and after 35 the task is killed. I'll measure performance and failure ratio later on.

It saves the results and imports them into MongoDB.

Above you can see the level of detail experimented with now. Having many descriptive fields will help find correlations, trends, and patterns.
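The fetch-and-execute cycle described for chopsticks can be sketched like this (a minimal sketch: the crawl function is just a parameter, and everything beyond the 10-at-a-time and 30-second figures is an assumption):

```javascript
// Sketch of the chopsticks loop: take a batch of tasks, run them
// 10 at a time, and give each crawl a soft 30-second timeout.
// crawlFn is whatever wraps the PhantomJS execution.
function withTimeout(promise, ms) {
  return Promise.race([
    promise,
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error('timeout')), ms))
  ]);
}

async function runBatch(tasks, crawlFn, { parallel = 10, timeoutMs = 30000 } = {}) {
  const results = [];
  for (let i = 0; i < tasks.length; i += parallel) {
    const slice = tasks.slice(i, i + parallel);
    const settled = await Promise.allSettled(
      slice.map(t => withTimeout(crawlFn(t), timeoutMs))
    );
    results.push(...settled);
  }
  return results; // in the real component, these are saved into MongoDB
}
```

Promise.allSettled keeps one hung or failed crawl from losing the rest of the batch, which matches the "measure performance and failure ratio later" attitude above.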

Exposer

Makes the results available for whoever needs them, referenced by the promiseId. It displays some basic graphs of the stored data. It was working with only one day of results; with more than one, it requires an optimization of the analytics, because the data is too big.

As part of this improvement, the component machete will be completed.

Visualisation with RAW and c3

Work in progress: integrating the RAW framework and c3, to begin with a decent visualisation of technical results.

Logical workflow of the pipeline

Decent distribution and resiliency is being achieved with the designed pipeline; here are the scheduled tasks. The next component will fetch the results and complete the pipeline.

November 2016

Components design

Defining the components that should work together to accomplish the pipeline. The goal is pretty ambitious, because the system has to operate on many vantage points and be centrally coordinated, enable the analyst to easily get results, and enable CodingRights and me to set up declined campaigns without effort. The current schema is composed of six components, each with a small dedicated task. I have not reused the prototype named Trackography-2, because of the risk of increased complexity. The reason for splitting into components is to keep the design as "simple and stupid" as possible.

Component "storyteller": the one running on the public website https://invi.sible.link; it will contain information for a technical audience and all the research tools developed this year. It will serve the results as open data, enabling third parties like CodingRights to integrate the data into their advocacy.

Component "machete": aggregates the results from the vantage points and performs analysis, correlation, and high-level functions to produce results. For example: ranking the most invasive trackers, finding correlations between the last day's results and the last month's. It will be the tool operating over the database and producing data-driven insights.

Component "vigile": will orchestrate the tests on the vantage points and the analysis of machete, and keep track of the infrastructure's performance

Component "chopstick": inherited from the Littlefork pipeline in which I worked following the directions of Christo of Tactical Tech. It is the component wrapping the execution of PhantomJS and Thug, being a specialized micro-service on the vantage point.

Component "exposer": the technical service needed to export the results from the vantage point to machete

Component "social pressure": as the name evokes, one of the key experiments of this project. A component containing the libraries and API keys to be a simple social media bot fed by machete.

Setting up the boxes infrastructure

Thanks to the OTF cloud, I easily set up four boxes to run the components, in a configuration in which, a box more or a box less, the system can continue to operate and be easily migrated, if other organizations show interest in maintaining the project after the fellowship, or just want to run their own set of tests.

I recovered the lists I was using for the previous experiments; they are nicely visualized with DataTables here:

chupadados.na.tracking.exposed

CodingRights launched at the end of November a campaign website targeting Latin American communities, named Chupadados; it is a campaign exploring different narratives to raise awareness on data surveillance, governmental and corporate, for Spanish and Portuguese audiences. The first declination of invi.sible.link will be on a selected list of Brazilian websites, all related to sexual health services. This will be an experiment in advocating to a target community outside our common audience.

Test webXray on OTF cloud

The tool webXray has many things in common with this project; I started to assess whether its code base can be reused. First, I tested webXray on the three vantage points on the OTF cloud; it worked smoothly with low effort. It is an interactive tool, therefore some of the assumptions behind its architecture might differ from my needs; still, looking at the internals:

useful: the PhantomJS script is inspired by my same source. I have spent a certain amount of time managing the network deadlocks that a website might create in the analysis pipeline; that is considered a "hard problem" that webXray and I solved differently.

not useful: the backend uses a strict SQL schema, which might work nicely for a defined research project, but in my case I prefer a document-oriented DB (Mongo). This permits more flexible usage of the data, and considering I've been using the node+mongo stack intensively over the last 12 months, I prefer not to change it now.

different: the subjects under investigation, the websites, are aggregated as the "Alexa 1 million", also because webXray does not support a fine clustering approach. Considering we address a "profiling issue", we need to account for different clusters of websites, separated by their content, target audience, and political and geographical impact. Point 4 in my task list also goes in this direction.

troublesome: one of the problems with third-party trackers is the unreliability and frequent changes of the advertising content. Any tool might hang for an infinite time, and I have not yet found a discrete "exit condition". In order to optimize resources, webXray kills the browser emulation after 20 seconds, no matter what is happening. In Trackography-2 I experimented with a smarter solution, but the deadlock being a hard problem, this has to be refined.

In my current design these blocking operations are a small engineering problem. In the examples below you'll see the effect of not managing such blocks. Without manual intervention the pipeline remains blocked forever, and spotting all the possible conditions is a complex problem.

When you see the 7 days and 7 minutes, it is because I killed the process manually. webXray solved this problem with a hardcoded time limit; I'll probably use the same, if a smarter solution keeps failing.