Who Watches the Wealthy

These sites collect signals from a public high-frequency radio band, primarily for use by air-traffic control systems and other aircraft, known as the ADS-B system. ADS-B signals can be monitored with any appropriately tuned antenna – you can make one at home (see below)!

These methods allow anyone to follow common flights with only some basic identifying information - unless the owner of the plane is powerful, wealthy, or motivated enough to have their aircraft data redacted.

If you’re a journalist, and the subject of your story has done exactly that - then what?

OCCRP has a solution.

Our data team has coordinated with ADSB-Exchange and C4ADS, which jointly collects and manages the world’s largest source of unfiltered flight data, to examine the travels of a host of oligarchs and political leaders.

Here is how it works.

The Data Trail in Five Minutes

Receivers in this co-operative network record the aircraft position, speed, altitude, and timestamp, all signed with an ‘address’ known as an ICAO address or Mode-S code, which uniquely identifies the aircraft.

The receivers have a maximum range of around four hundred kilometers and are located mostly on densely populated landmasses. This also means that planes flying over places like the Atlantic or the Sahara, for example, will sometimes leave the network’s range entirely.

Other obstacles to assembling a clear picture occur when an aircraft descends under the effective altitude for signal transmission — predominantly to land — or when overlapping signals are detected by multiple receivers.

Any attempt to piece together a picture of the travels by hand will be complex and painstaking, since one aircraft can easily generate over fifty thousand position readings over a two year period.

To assemble readings that constitute distinct flights, OCCRP’s data team has designed a small collection of data analysis algorithms which, among other things, group observations of any aircraft across multiple antennae by timestamp and infer takeoff & landing through a combination of first & last position relative to major airports and aircraft altitude at the time.

For example, if a collection shows an aircraft’s final position reading for (at least) many hours near London Heathrow at a low altitude, our algorithm assumes it’s landed there.

If we find the last reading at a high altitude above the North Atlantic off the Irish coast, with readings resuming after a couple of hours above Iceland, we stitch the two sets of positions together.

Flight data from this network should not be considered complete. In many parts of the world, including many potential regions of interest, the available receivers may be few and far apart, and the sparse data can force a ‘best guess’ methodology for projecting what airport the aircraft is departing from or arriving to.

In some cases, an aircraft’s signal is seen to disappear over an ocean and isn’t seen for weeks due either to staying grounded, out of the network’s coverage area or potentially deactivating its’ transponder.

After observations are algorithmically grouped into distinct flights, we can zoom in on interesting travels by ‘geofencing’ destinations of interest, selecting all flights by aircraft of interest whose flight endpoints fall within range of select airports.

In one example, Cyprus, long described as a gateway into Europe for Russian money, has seen many interesting visitors (each oligarch has his own color!... see below for details).

While the wealthy and powerful travel European capitals and the exclusive resorts of the ultra-rich quite frequently, insight into their travels goes well beyond Monaco and Sochi, and visits to more tropical climates may prove of greater interest.

OCCRPs’ initial analysis has centered on sunny money laundering hubs including the British Virgin Islands, Cyprus, or Panama City, as well as the considerably less sunny Isle of Man. As a brief introduction to how these destinations might be of interest, look no further than OCCRP's own reporting on two huge global operations, The Panama Papers and The Russian Laundromat.

Come for Company Formation, Stay for the Beaches

Let’s look deeper at just one example of how our tracking works.

Roman Abramovich’s expensive tastes is famous and it would be uncouth for such a man to travel on a commercial airline. As might be expected, Mr. Abramovich maintains a sleekly decorated Boeing 767, among other jets, to carry him in comfort and style.

Like most aircraft, Mr. Abramovich’s Boeing regularly, though not always, transmits its location along with various other data on the so-called ADS-B system which the usual tracking websites collect.

Curious about his travels, we can input the tail number of the 767, visible from photographs as ‘P4-MES’, into one of those sites but we would find no records of the aircraft's travels, as presumably they have been redacted.

If we checked in at the right time, we can monitor his plane in realtime on the ADSBExchange site. Many times a month, Abramovich’s Boeing flies between Moscow and London - unsurprising given his well-known business interests in both capitals. He frequents St. Petersburg and New York as well and makes regular appearances in Monaco, Sochi, and the Black Forest in Germany. Recently, in pursuit of his newly acquired Israeli citizenship, he’s flown multiple times to and from Tel Aviv.

Among these more common travels, we know Mr. Abramovich’s jet visited the area of the British Virgin Islands several times in early 2017, including this trip (pictured right) from New Jersey on February 17 2017. He stayed there until February 27. Similar trips occured on January 3, and March 24, and December 22.

OCCRP alleges no illegal activity by Mr. Abramovich or any of the other billionaires detailed below, only that they are frequent fliers to certain locations.

Indeed, are billionaires not allowed to soak in long sea-breeze days on the beach along with the rest of us who can afford to step foot in these exclusive communities? One imagines these evenings, enjoying a bottle of 1780 Barbados Private Estate, watching a small army of dolphins trained for exclusive entertainment from the comfort of a glass-walled swimming pool levitating peacefully above a peach-hued bay.

It might just be of interest to see if any companies were registered during these afternoons.

Boyars in Paradise (and the Irish Sea)

OCCRP has initially examined 35 aircraft of interest flying more than 7.5 million kilometers over some 5800 flights. This analysis has uncovered trips to known money laundering resorts from a number of notable figures including:

Alexander Abramov : Russian steel magnate, chairman of Evraz Group.

Alexander Mashkevitch : Kazakh-Israeli businessman and major shareholder in Eurasian National Resources Corporation.

Or his visit to the Iranian island oil and gas outpost of Sirri on January 28 2018...or perhaps the government of Equatorial Guinea ferrying honest public servants to Singapore, Mykonos, Mytilini, Winnipeg, Los Angeles, and Rio de Janeiro might interest you.

These findings are only the beginning. We encourage interested journalists to contact data@occrp.org with research requests and feedback on our analysis. We want to hear from you! We also encourage you to reach out to ADSB-Exchange (https://www.adsbexchange.com/contact/) or C4ADS (info@c4ads.org) for more information about the flight data network and how you can help.

Food for thought:

With the help of OCCRP Research, Data Team has compiled a long list of 'interesting' aircraft from the oligarchs above to the Saudi Royals. Let us know who you're looking for.

Should we publish a weekly digest of the movements of select aircraft or travel into certain locations?

Can we identify events whose airborne attendees we could study?

If you had access to a web resource which allowed you to browse the travels of select aircraft or ownership organizations (think ‘Azerbaijan Airlines’ or other pseudo-governmental carriers), how would you use it?

The Organized Crime and Corruption Reporting Project (OCCRP) Data Team
has developed new features on OCCRP Data in the past six months and
brought together more than 200 different datasets. Its new software
is now configured to let reporters search all of those at once.

OCCRP Data, part of the Investigative Dashboard, offers journalists a shortcut to the deep web. It now has over 170 public sources and more than 100 million leads for public search – news archives, court documents, leaks and grey literature encompassing UK parliamentary inquiries, companies and procurement databases, NGO reports and even CIA rendition flights, among other choice reading. (All this is publicly available. If you’re associated with OCCRP, you’ll have access to more than 250 million items).

Uniquely, the database also contains international sanctions lists detailing persons of political or criminal relevance.

The new platform makes searching diverse types of objects, such as emails, documents and database entries from corporate or land registries into a unified user experience, with an appropriate way to display each type of data.

Here are seven tips to help you get the most out of OCCRP Data:

Browse Directly on Your Screen

OCCRP Data has emails, PDF and Word documents, contracts, old news archives, even Rudyard Kipling poems (from Wikileaks, to be fair). Its brand new interface makes it easier for you to view documents, search within them and preview them in the browser without having to download and open them, making research a faster and more seamless process.

New Search Filter Options

OCCRP Data lets you filter search results by sources, document type, as well as emails, phone numbers, addresses, entity names, countries and more on its left-hand column, after you’ve run your search.

Highlight Connections

You can explore structured data in new ways because OCCRP Data uses entity extraction on documents and emails to find phone numbers, names of people and companies, addresses, ID numbers and other key linkage details of interest. Just click on an entity and see the “Tags” option in the preview screen.

Do Bulk Comparisons

OCCRP Data is capable of cross-referencing the information on two lists; it also ranks data that closely matches and lets you compare the information. Click on a source and then click on the “Cross Reference” option to choose another source with which to do the comparison.

Monitor Search Terms, Receive Alerts

OCCRP Data now has an alerts feature that allows you to monitor a search term so when new information is added to the database you will receive a notification. Simply switch on the bell icon right next to your search query.

Language Support

OCCRP Data now supports multiple languages. The interface is translated and supports Russian and Bosnian-Serbo-Croatian. Search results on the database can also be filtered by language. The data team is working on adding other languages, such as German and Spanish.

Advanced Search Operators

You can use complex search operators to do things such as proximity searches, exact term searches, take into account spelling errors and combine queries.

Any Questions?

Anyone accessing OCCRP Data can check out the the Aleph Wiki where the data team covers its uses, function and development roadmap. Journalists and technologists alike can read the user manual or contact data@occrp.org to give us feedback.

Investigating the (often dubious) dealings of businessmen and politicians,
our reporters need access to documents and databases from all over the
world.

To make their searches better, we're developing tools that make large
amounts of data accessible with a single keystroke. We have built a set
of crawlers that combine data from governments, corporations and other
media into a search engine.

However, these crawlers need to deal with uncooperative websites in
different languages, formats and structures and they often break when
pages are updated.

After experimenting with some existing solutions, we decided to make a
tool that encapsulates our experience with web crawling. The result is a
lightweight open source framework named memorious
(GitHub).

memorious is simple and yet allows you to create and maintain a fleet
of crawlers, while not forcing too much specific process.

Schedule crawlers to run at regular intervals (or run them ad-hoc as you need).

Keep track of error messages and warnings that help admins see which crawlers
are in need of maintenance.

Lets you use familiar tools like requests, BeautifulSoup, lxml or
dataset to do the actual scraping.

Distribute scraping tasks across multiple machines.

Maintain an overview of your crawlers' status using the command line or a
web-based admin interface.

For common crawling tasks, memorious does all of the heavy lifting. One
of our most frequent objectives is to follow every link on a large website and
download all PDF files. To achieve this with memorious, all you need to write
is a YAML file
that plugs together existing components.

A web-based admin interface allows you to keep track of the status of all
of your crawlers.

Each memorious crawler is comprised of a set of different stages that call each
other in succession (or themselves, recursively). Each stage either executes a
built-in component, or a custom Python function, that may fetch, parse or store a
page just as you like it. memorious is also extensible, and contains lots of
helpers to make building your own custom crawlers as convenient as possible.

These configurable chains of operations have made our crawlers very modular, and
common parts are reused efficiently. All crawlers can benefit from automatic
cookie persistence, HTTP caching and logging.

Within OCCRP, memorious is used to feed documents and structured data into
aleph via an API, which means documents
become searchable as soon as they have been crawled. There, they will also
be sent through OCR and entity recognition. aleph aims to use these extracted
entities as bridges that link a given document to other databases and documents.

For a more detailed description of what memorious can do, see the
documentation and check out our
example project.
You can try memorious by running it locally in development mode,
and, of course, we also have a Docker setup for robust production deployment.

As we continually improve our crawler infrastructure at OCCRP, we'll be adding
features to memorious for everyone to use. Similarly, we'd love input from the data
journalism and open data communities; issues
and PRs are welcome.

“…the solitary and lucid spectator of a multiform, instantaneous and almost
intolerably precise world…” (Funes the memorious, Jorge Luis Borges)

Piwik is an awesome self-hosted analytics service. We've been relying on it for a very long time and we were always satisfied with what it brought to us. During our work, however, we have accidentally allowed the traffic from our own servers to appear in Piwik, and some general traffic to be counted more than once, therefore we have accidentally boosted our own stats.

Now, since Piwik itself is open sourced, we see no reason why we should not be able to delete the artificially inflated stats ourselves and by doing so making sure that our journalists see the stats as precisely as possible.

Since the informations for this process were not as clear as we wanted them to be, I've decided to write this blog post so we could make the job easier to anyone else who tries to do the same. To follow this tutorial, you're going to need a Piwik installation (obviously), access to the command line on the server and some SQL-fu.

Step 1: Finding the records that you want to delete

This seems like a simple thing, but it turned out to be much harder. We had the list of couple of IP addresses that we wanted to exclude from Piwik, but after about half an hour of me searching through Piwik's interface, I was not able to find a way how to see the entire traffic that originated from a specific IP address. Luckily, I stumbled upon this short post which gave me every information I needed. To see the traffic from a specific IP, you have to manually tweak the URL you are visiting to:

Bear in mind that Piwik shows 500 actions per visit as a maximum, so if the requested IP made over 500 actions in a single visit (for example, if it was a bot, or if somebody tried to scrape your website), you're only going to see the very first 500 actions that were requested by that IP.

Step 2: Finding and deleting records from the database(s)

The second step would be to find the records in the database as well. To do this for the IP you're interested in, you're going to have to convert the IP address to the HEX numeral system. Of course, everyone who finished two IT college courses should be able to convert the number to its HEX value by hand, but if you feel too lazy, just use this online tool to do so. Or use python:

print hex() # repeat for each IPv4 byte

Once you have the HEX equivalent of the IP in question, log into MySQL/MariaDB and execute the following command to get the count of rows (or: pageviews) that will be affected:

SELECT COUNT(*)
FROM piwik_log_visit AS log_visit
LEFT JOIN piwik_log_link_visit_action as log_link_visit_action
ON log_visit.idvisit = log_link_visit_action.idvisit
LEFT JOIN piwik_log_action as log_action
ON log_action.idaction = log_link_visit_action.idaction_url
WHERE log_visit.location_ip=UNHEX("\{\{ HEX_VALUE_GOES_HERE \}\}");

As you can see, Piwik stores the relevant visitor info into three separate MySQL databases: piwik_log_visit, piwik_log_link_visit_action and piwik_log_action.

If you skip one of them, you'll encounter some unexpected results. For example, initially, we've tried removing the data from piwik_log_visit and piwik_log_link_visit_action, but once we've re-computed the logs, we've noticed that the IP was still there and the visit time was still being shown, even though we have successfully deleted the actions associated with that visit.

0 Action - 42 min 59s

This is why it's important to delete the data from all three of the databases.

To delete the necessary entries from all the databases, you need to tweak the command above like this (for IP-based pruning):

DELETE log_visit, log_link_visit_action
FROM piwik_log_visit AS log_visit
LEFT JOIN piwik_log_link_visit_action as log_link_visit_action
ON log_visit.idvisit = log_link_visit_action.idvisit
LEFT JOIN piwik_log_action as log_action
ON log_action.idaction = log_link_visit_action.idaction_url
WHERE log_visit.location_ip=UNHEX("\{\{ HEX_VALUE_GOES_HERE \}\}");

You can verify that the visits/pageviews are gone from the db by using the SELECT statements again, of course.

Step 3: Re-compute the reports

If you have successfully completed the first two steps, your last step should be re-computing the reports. If you skip this step, you won't accomplish anything because the traffic will still be visible in the reports, even though the traffic has been removed from the databases.

To do so, I highly recommend you to take a careful look at Piwik's documentation. Specifically, you should pay a close attention to these two posts:

Developers have a habit (one I've been guilty of) of committing API keys and other secrets to our repositories. It’s easy to do it if you’re tired, if you’re hurried, if you’re “moving fast and breaking things”.

This, unfortunately, has been too difficult to prevent for too long. In the interest of security there’s luckily been a big push to stop this practice lately; Rails has had the Figaro gem, but recently in version 4.1 they’ve built in a “secrets.yml” file. Heroku has a config ENV screen to /attempt/ to coerce developers into keeping secret keys out of production. Apple’s iOS and its keychain helps with this on iPhones and iPad.

As far as I’ve been able to tell, Android has been terrible at this.

I’ve scoured documentation, searched for hours across StackOverflow and questioned friends who are much better at Android that I am. After taking bits and pieces, I think I’ve figure a good way to do this. I’m probably not the first, but there doesn’t seem to be a comprehensive write up of this technique anywhere, so I'm hoping these steps help fellow Android devs up their security a bit.

Note: this does not secure credentials in the wild. It will not stop someone from decompiling your ADK and pulling the string. Everything ends up in the compiled app. What it does do is keep someone from going through your Github account and copy/pasting your secrets out of it.

Note: For these steps I’m assuming you’re using Android Studio.

Recognize what needs to be kept secret.

Anything that’s unique to your deployment of the software.

If you use a key for Google Analytics, or if you keep have an encryption key that needs to be hardcoded, these should never be committed to a repository.

Don’t put these into the source code, ever, even for brief testing purposes.

Things get a bit hairy, however, when the source and destination cluster versions differ wildly. Say, like 0.90.3 and 1.7.3. And when you don't happen to have any admin access to the source cluster (only via HTTP and transport interfaces).And when the source cluster is misconfigured in just a slight but annoying manner...

What did not work

elasticdump

ElasticDump was the first and obvious thing to try, but apparently it only supports migrations between clusters running ElasticSearch 1.2 and higher. So, that's a no-go.

What did work

Please keep in mind that this procedure worked for us, but it doesn't have to work for you. Specifically, if the source is a cluster of more than one node, you might need to do some fancy shard allocation to make sure that all shards and all indices are copied over to the new node.

1. Create a new node

Why not cluster with that source ES server (running a single-node cluster) by creating a new node that we do control, and thus get the data? Getting the docker container to run an ES version 0.90.3 was just a bit of manual fiddling with the Dockerfile. Changing the versions everywhere worked well, but here's hint: up until 1.0 or so, elasticsearchran in background by default. Thus, the docker container stopped immediately after starting elasticsearch, for no apparent reason...

So if you're dealing with a pre-1.0 ES just add a -f (for "foregroud") to the command in Dockerfile to save yourself a bit of frustration.

2. Cluster with the source server

Once we have this running, it's time to cluster with the source node. What could be easier? Disable multicast discovery, enable unicast with a specified host and we're good, right?

Wrong. Remember the "misconfigured in just a slight but annoying manner" thing? The source IP server turned out to be behind a NAT and the IP that we could connect to differed from the IP the server published. Hence, our new node discovered the source node as master, but then -- based on the info gotten from it -- tried connecting to it via its internal (10.x.y.z) IP address. Which obviously did not work.

3. iptables to the rescue

As we had no way of changing the configuration of the source node, the only thing we could do was mangle the IP packets, so that packets going to 10.x.y.z would have the destination address modified to the public IP of the source node (and those coming from the public IP of the source node would get modified to have the source address of 10.x.y.z, but iptables handled that for us automagically):

6. Migrate the data to the production node

Since we were doing all this on a new node, created only to get the data off of the source ES 0.90.3 server, we needed to shunt the data off of it and into our production server (changing the index name in the process for good measure). This is where we turned back to elasticdump, and using a simple script were able to successfuly migrate the data off of the new node and onto our production ES 1.7.3 server.

Of course things could not got smooth and easy here, either. The dump kept being interrupted by an EADDRNOTAVAIL error; a quick work-around was to use the --skip command-line argument of elasticdump to skip rows that have already been migrated.

Around this time a year ago, OCCRP's Tech Team consisted of a single
webmaster in Travnik and a few loyal friends who would be called upon to
assist when new challenges arose. Since then, we've expanded to an
excellent group of people, sitting mostly in Sarajevo and doing some pretty
amazing things. This new blog is to tell you of our adventures.

We are working with a lot of different technologies, managing some
legacy systems, building new software and making sense of a lot of
structure. The core aim of all of our activities is simple: shorten the
time to journalism (TTJ).

In practice, this means a lot of stuff, from the simple-looking task of
making sure it's possible to publish stories, with whatever visuals and
interactives are needed, to serving tools for sharing and collaboration
between journalists, to constructing gadgets that allow journalists to
work more effectively, and designing and building new systems that allow
journalists to work better with large volumes of data.

On this blog we'll try to go a bit deeper into the thoughts of what
we're doing and why, explaining our thought processes and announcing
exciting progressions.