<p>OCCRP technology team (Organized Crime and Corruption Reporting Project), https://tech.occrp.org/</p>
<p>Come for the company formation. Stay for the beaches! (2018-06-11, https://tech.occrp.org/blog/2018/06/11/adsb)</p>
<h3>Who Watches the Wealthy</h3>
<p>Tracing the real-time movements of most ordinary commercial and private aircraft is straightforward through ‘tracking’ websites:</p>
<ul>
<li><a href="https://planefinder.net/">Plane Finder</a></li>
<li><a href="https://www.flightradar24.com/">Flight Radar</a></li>
<li><a href="https://flightaware.com/">Flight Aware</a></li>
</ul>
<figure>
<img src="https://www.adsbexchange.com/wp-content/uploads/Everythingoncan-225x300.jpg" align="right" alt="ADS-B 'can' receiver">
<figcaption>The bleeding edge of aircraft detection. Credit: www.adsbexchange.com</figcaption>
</figure>
<p>These sites collect signals broadcast on a public radio band, known as ADS-B, intended primarily for air-traffic control systems and other aircraft. ADS-B signals can be monitored with any appropriately tuned antenna – you can make one at home (see below)!</p>
<p>These methods allow anyone to follow common flights with only some basic identifying information - unless the owner of the plane is powerful, wealthy, or motivated enough to have their aircraft data redacted.</p>
<p>If you’re a journalist, and the subject of your story has done exactly that - then what?</p>
<p>OCCRP has a solution.</p>
<p>Our data team has coordinated with <a href="https://www.adsbexchange.com/">ADSB-Exchange</a> and <a href="http://www.c4ads.org/">C4ADS</a>, which jointly collect and manage the world’s largest source of unfiltered flight data, to examine the travels of a host of oligarchs and political leaders.</p>
<p>Here is how it works.</p>
<p>&nbsp;</p>
<h3>The Data Trail in Five Minutes</h3>
<p>Receivers in this co-operative network record an aircraft's position, speed, altitude, and a timestamp, all signed with an ‘address’, known as an ICAO address or Mode-S code, that uniquely identifies the aircraft.</p>
<figure>
<img src="/assets/images/2018-06/600_si_fl_ads-b_22_fa2_flat_rev_5-15.jpg" align="right" width="600" alt="ADS-B system illustration">
<figcaption>Credit: National Air and Space Museum, Smithsonian Institution.</figcaption>
</figure>
<p>The receivers have a maximum range of around four hundred kilometers and are located mostly on densely populated landmasses. This means that planes flying over places like the Atlantic or the Sahara will sometimes leave the network’s range entirely.</p>
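<p>That range figure is consistent with simple line-of-sight geometry: ADS-B reception is limited by the radio horizon, which the standard approximation puts at about 3.57·√h kilometers for a height of h meters at each end. A quick back-of-the-envelope sketch (the heights are illustrative assumptions):</p>

```python
import math

def radio_horizon_km(rx_height_m: float, ac_altitude_m: float) -> float:
    """Approximate line-of-sight range between a ground receiver and an
    aircraft: roughly 3.57 * sqrt(height_in_meters) for each endpoint."""
    return 3.57 * (math.sqrt(rx_height_m) + math.sqrt(ac_altitude_m))

# A rooftop antenna at 20 m and a jet cruising at 11,000 m:
print(round(radio_horizon_km(20, 11000)))  # 390
```

<p>That lines up with the roughly four hundred kilometers quoted above, and shows why a descending aircraft drops out of range so quickly.</p>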
<p>Other obstacles to assembling a clear picture arise when an aircraft descends below the effective altitude for signal reception (typically, to land) or when overlapping signals are picked up by multiple receivers.</p>
<p>Any attempt to piece together a picture of an aircraft's travels by hand is complex and painstaking: a single aircraft can easily generate over fifty thousand position readings in a two-year period.</p>
<p>To assemble readings that constitute distinct flights, OCCRP’s data team has designed a small collection of data analysis algorithms which, among other things, group observations of an aircraft across multiple antennae by timestamp and infer takeoffs &amp; landings from a combination of the first &amp; last positions relative to major airports and the aircraft's altitude at the time.</p>
<figure>
<img src="/assets/images/2018-06/48434F2016-07-01_Moscow_London.png" align="right" width="200">
</figure>
<p>For example, if an aircraft’s final position reading is near London Heathrow at a low altitude, with no further readings for (at least) many hours, our algorithm assumes it has landed there.</p>
<p>If we find the last reading at a high altitude above the North Atlantic off the Irish coast, with readings resuming after a couple of hours above Iceland, we stitch the two sets of positions together.</p>
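<p>The grouping and stitching described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not OCCRP's actual code; the function names and thresholds are assumptions:</p>

```python
from typing import List, Tuple

# One reading: (unix_timestamp, latitude, longitude, altitude_m)
Reading = Tuple[float, float, float, float]

def split_into_segments(readings: List[Reading],
                        max_gap_s: float = 3600) -> List[List[Reading]]:
    """Group time-sorted readings: a long radio silence starts a new segment."""
    segments: List[List[Reading]] = []
    for r in sorted(readings):
        if not segments or r[0] - segments[-1][-1][0] > max_gap_s:
            segments.append([])
        segments[-1].append(r)
    return segments

def stitch(segments: List[List[Reading]],
           landing_alt_m: float = 1000) -> List[List[Reading]]:
    """Merge consecutive segments when the earlier one ended at cruise
    altitude: the aircraft probably left coverage rather than landed."""
    flights: List[List[Reading]] = []
    for seg in segments:
        if flights and flights[-1][-1][3] > landing_alt_m:
            flights[-1].extend(seg)    # coverage gap mid-flight
        else:
            flights.append(list(seg))  # previous flight ended low: new flight
    return flights
```

<p>Readings that vanish at 11,000 m over the Atlantic and resume hours later over Iceland end up in one flight; readings that end at low altitude near an airport close the flight.</p>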
<figure>
<img src="/assets/images/2018-06/48434F2016-08-14_UNKNOWN_London.png" align="right" width="200">
</figure>
<p>Flight data from this network should not be considered complete. In many parts of the world, including many potential regions of interest, receivers may be few and far between, and the sparse data can force a ‘best guess’ methodology for projecting which airport the aircraft is departing from or arriving at.</p>
<p>In some cases, an aircraft’s signal disappears over an ocean and isn’t seen for weeks, whether because the plane stayed on the ground, remained outside the network’s coverage area, or potentially deactivated its transponder.</p>
<p>After observations are algorithmically grouped into distinct flights, we can zoom in on interesting travels by ‘geofencing’ destinations of interest: selecting all flights by aircraft of interest whose endpoints fall within range of selected airports.</p>
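<p>Geofencing itself is just a great-circle distance test on flight endpoints. A minimal sketch (the airport coordinates and the 50 km radius are illustrative assumptions):</p>

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 6371 * 2 * math.asin(math.sqrt(a))

AIRPORTS = {"Larnaca": (34.88, 33.62), "Paphos": (34.72, 32.49)}

def geofence(flights, airports=AIRPORTS, radius_km=50):
    """Keep flights whose first or last position falls near an airport of interest."""
    hits = []
    for flight in flights:                      # flight: list of (lat, lon)
        for lat, lon in (flight[0], flight[-1]):
            if any(haversine_km(lat, lon, alat, alon) <= radius_km
                   for alat, alon in airports.values()):
                hits.append(flight)
                break
    return hits
```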
<p>In one example, Cyprus, long described as a gateway into Europe for Russian money, has seen many interesting visitors (each oligarch has his own color!... see below for details).</p>
<figure>
<img src="/assets/images/2018-06/blog_cyprus_1.png" align="center" width="900">
</figure>
<p>&nbsp;</p>
<p>While the wealthy and powerful travel quite frequently to European capitals and the exclusive resorts of the ultra-rich, insight into their travels goes well beyond Monaco and Sochi, and visits to more tropical climates may prove of greater interest.</p>
<p>OCCRP's initial analysis has centered on sunny money laundering hubs including the British Virgin Islands, Cyprus, and Panama City, as well as the considerably less sunny Isle of Man. For a brief introduction to why these destinations might be of interest, look no further than OCCRP's own reporting on two huge global operations, <a href="https://www.occrp.org/en/panamapapers/">The Panama Papers</a> and <a href="https://www.occrp.org/en/laundromat/">The Russian Laundromat</a>.</p>
<p>&nbsp;</p>
<h3>Come for Company Formation, Stay for the Beaches</h3>
<p>Let’s look deeper at just one example of how our tracking works.</p>
<p>Roman Abramovich’s expensive tastes are famous, and it would be uncouth for such a man to travel on a commercial airline. As might be expected, Mr. Abramovich maintains a sleekly decorated Boeing 767, among other jets, to carry him in comfort and style.</p>
<p>Like most aircraft, Mr. Abramovich’s Boeing regularly, though not always, transmits its location, along with various other data, on the so-called ADS-B system, which the usual tracking websites collect.</p>
<p>Curious about his travels, we could input the tail number of the 767, visible in photographs as ‘P4-MES’, into one of those sites, but we would find no records of the aircraft's travels; presumably, they have been redacted.</p>
<figure>
<img src="/assets/images/2018-06/roma_VG.png" align="right" width="200">
</figure>
<p>If we check in at the right time, however, we can monitor his plane in real time on the ADSB-Exchange site. Many times a month, Abramovich’s Boeing flies between Moscow and London - unsurprising given his well-known business interests in both capitals. He frequents St. Petersburg and New York as well and makes regular appearances in Monaco, Sochi, and the Black Forest in Germany. Recently, in pursuit of his newly acquired Israeli citizenship, he’s flown multiple times to and from Tel Aviv.</p>
<p>Among these more common travels, we know Mr. Abramovich’s jet visited the area of the British Virgin Islands several times in early 2017, including this trip (pictured right) from New Jersey on February 17, 2017. He stayed there until February 27. Similar trips occurred on January 3, March 24, and December 22.</p>
<p>OCCRP alleges no illegal activity by Mr. Abramovich or any of the other billionaires detailed below, only that they are frequent fliers to certain locations.</p>
<p>Indeed, are billionaires not allowed to soak in long sea-breeze days on the beach along with the rest of us who can afford to set foot in these exclusive communities? One imagines these evenings, enjoying a bottle of 1780 Barbados Private Estate, watching a small army of dolphins trained for exclusive entertainment from the comfort of a glass-walled swimming pool levitating peacefully above a peach-hued bay.</p>
<p>It might just be of interest to see if any companies were registered during these afternoons.</p>
<p><a href="https://eservices.gov.vg/gazette/">British Virgin Islands Company Gazette</a></p>
<p>&nbsp;</p>
<h3>Boyars in Paradise (and the Irish Sea)</h3>
<p>OCCRP has initially examined 35 aircraft of interest flying more than 7.5 million kilometers over some 5,800 flights. This analysis has uncovered trips to known money laundering resorts by a number of notable figures, including:</p>
<ul>
<li>Alexander Abramov : Russian steel magnate, chairman of Evraz Group.</li>
<li>Alexander Mashkevitch : Kazakh-Israeli businessman and major shareholder in Eurasian National Resources Corporation.</li>
<li>Alexei Mordashov : Russian owner of Severstal.</li>
<li>Alisher Usmanov : Uzbek-Russian industrial magnate and investor.</li>
<li>Andrey Guryev : Dominant in the Russian fertilizer industry; beneficial owner of PhosAgro.</li>
<li>Dmytro Firtash : Ukrainian businessman and ally of the Yushchenko &amp; Yanukovych administrations. The US is currently seeking his extradition.</li>
<li>Farkhad Akhmedov : Azeri-Russian businessman and politician.</li>
<li>Igor Makarov : Turkmen-Russian businessman, owner of ARETI International Group.</li>
<li>Len Blavatnik : Soviet-born British-American diversified businessman. Active in American politics and longtime friend of Viktor Vekselberg.</li>
<li>Leonid Mikhelson : Owner of Russian gas company Novatek and subject to US sanctions.</li>
<li>Oleg Tinkov : Russian diversified businessman and founder of Tinkoff Bank.</li>
<li>Roman Abramovich : Russian diversified businessman, philanthropist, and owner of Chelsea FC.</li>
</ul>
<table>
<thead>
<tr>
<th> Aircraft Owner </th>
<th> Date </th>
<th> Origin </th>
<th> Destination </th>
</tr>
</thead>
<tbody>
<tr>
<td> Alexander Abramov </td>
<td> 12/5/2016 </td>
<td> Van Nuys, USA </td>
<td> Anegada, British VG </td>
</tr>
<tr>
<td> Alexander Abramov </td>
<td> 2/6/2017 </td>
<td> Houston, USA </td>
<td> Anegada, British VG </td>
</tr>
<tr>
<td> Alexander Abramov </td>
<td> 2/23/2017 </td>
<td> Miami, USA </td>
<td> Anegada, British VG </td>
</tr>
<tr>
<td> Alexander Mashkevitch </td>
<td> 12/16/2017 </td>
<td> Isle of Man </td>
<td> London, UK </td>
</tr>
<tr>
<td> Alexander Mashkevitch </td>
<td> 2/25/2018 </td>
<td> London, UK </td>
<td> Panama City, Panama </td>
</tr>
<tr>
<td> Alexander Mashkevitch </td>
<td> 4/11/2018 </td>
<td> Panama City, Panama </td>
<td> Punta Cana, Dominican Republic </td>
</tr>
<tr>
<td> Alexei Mordashov </td>
<td> 5/10/2017 </td>
<td> Isle of Man </td>
<td> London, UK </td>
</tr>
<tr>
<td> Alisher Usmanov </td>
<td> 2/20/2018 </td>
<td> Larnaca, Cyprus </td>
<td> Mulhouse, France </td>
</tr>
<tr>
<td> Andrey Guryev </td>
<td> 8/13/2016 </td>
<td> Unknown in Greece </td>
<td> Paphos, Cyprus </td>
</tr>
<tr>
<td> Dmytro Firtash </td>
<td> 3/25/2017 </td>
<td> Unknown in South Pacific </td>
<td> Auckland, NZ </td>
</tr>
<tr>
<td> Farkhad Akhmedov </td>
<td> 1/17/2017 </td>
<td> Spanish Town, British VG </td>
<td> Miami, USA </td>
</tr>
<tr>
<td> Igor Makarov </td>
<td> 7/14/2016 </td>
<td> Paphos, Cyprus </td>
<td> Unknown in France </td>
</tr>
<tr>
<td> Igor Makarov </td>
<td> 10/12/2016 </td>
<td> Unknown, North Atlantic </td>
<td> Nicosia, Cyprus </td>
</tr>
<tr>
<td> Igor Makarov </td>
<td> 10/15/2016 </td>
<td> Cairo, Egypt </td>
<td> Larnaca, Cyprus </td>
</tr>
<tr>
<td> Len Blavatnik </td>
<td> 12/29/2016 </td>
<td> San Juan, Puerto Rico </td>
<td> St. Thomas, Virgin Islands </td>
</tr>
<tr>
<td> Len Blavatnik </td>
<td> 4/9/2017 </td>
<td> Alice Town, Bahamas </td>
<td> Anegada, British VG </td>
</tr>
<tr>
<td> Leonid Mikhelson </td>
<td> 9/25/2016 </td>
<td> Akrotiri, Cyprus </td>
<td> Strasbourg Neudorf, France </td>
</tr>
<tr>
<td> Oleg Tinkov </td>
<td> 4/27/2018 </td>
<td> Florence, Italy </td>
<td> Larnaca, Cyprus </td>
</tr>
<tr>
<td> Roman Abramovich </td>
<td> 1/3/2017 </td>
<td> New Jersey, USA </td>
<td> Anegada, British VG </td>
</tr>
<tr>
<td> Roman Abramovich </td>
<td> 3/24/2017 </td>
<td> Denver, USA </td>
<td> Anegada, British VG </td>
</tr>
</tbody>
</table>
<p>Investigative curiosities from the jet-setting frontier don’t stop at money laundering. Witness Mr. Abramovich on March 9, 2018, flying (it appears) from the Crimea/Rostov-on-Don area to London, amusingly looping hundreds of kilometers all the way around Ukrainian airspace.</p>
<figure>
<img src="/assets/images/2018-06/48434F2018-03-09_Taganrog_Beccles.png" align="center" width="900">
</figure>
<p>Or his visit to the Iranian island oil and gas outpost of Sirri on January 28, 2018... Or perhaps the government of Equatorial Guinea ferrying honest public servants to Singapore, Mykonos, Mytilini, Winnipeg, Los Angeles, and Rio de Janeiro might interest you.</p>
<p>These findings are only the beginning. We encourage interested journalists to contact <a href="mailto:data@occrp.org">data@occrp.org</a> with research requests and feedback on our analysis. We want to hear from you! We also encourage you to reach out to <a href="https://www.adsbexchange.com/contact/">ADSB-Exchange</a> or <a href="mailto:info@c4ads.org">C4ADS</a> for more information about the flight data network and how you can help.</p>
<p>Food for thought:</p>
<ul>
<li>With the help of OCCRP Research, the Data Team has compiled a long list of 'interesting' aircraft, from the oligarchs above to the Saudi royals. Let us know who you're looking for.</li>
<li>Should we publish a weekly digest of the movements of select aircraft or travel into certain locations?</li>
<li>Can we identify events whose airborne attendees we could study?</li>
<li>If you had access to a web resource which allowed you to browse the travels of select aircraft or ownership organizations (think ‘Azerbaijan Airlines’ or other pseudo-governmental carriers), how would you use it?</li>
<li><a href="https://global.adsbexchange.com/VirtualRadar/desktop.html">A live version of the ADSB-Exchange tracking network is always viewable (screenshot below)</a></li>
<li>Additionally, if you would like to see historical flight data for an aircraft of interest, check out ADSB-Exchange’s <a href="https://flight-data.adsbexchange.com/">historical flight viewer</a>.</li>
</ul>
<figure>
<img src="/assets/images/2018-06/adsb_exchange_screencap.jpg" align="center" width="900">
</figure>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>Thanks to:</p>
<figure>
<a href="http://www.c4ads.org/"><img src="/assets/images/2018-06/c4logoblack-e1482209290563.png" align="center" alt="C4ADS"></a>
<a href="https://www.adsbexchange.com/"><img src="/assets/images/2018-06/ADSB-Exchange Logo.png" align="center" alt="ADSB-Exchange"></a>
</figure>
<ul>
<li>ADSB 'Can' Receiver courtesy www.adsbexchange.com</li>
<li>ADSB System Illustration courtesy The National Air and Space Museum, Smithsonian Institution</li>
<li>All aircraft-tracking maps generated with <a href="https://github.com/matplotlib/basemap">Basemap</a></li>
</ul>
<p>Journalism’s Deep Web: 7 Tips on Using OCCRP Data (2018-04-13, https://tech.occrp.org/blog/2018/04/13/relaunch-data)</p>
<div class="captioned">
<img src="/assets/images/2018-04/occrp-database.jpg" class="img-responsive">
</div>
<p><em>This article was first published <a href="https://gijn.org/2018/04/03/deep-web-journalists-7-tips-using-occrp-data/">on GIJN.org</a>.</em></p>
<p>The <a href="https://occrp.org">Organized Crime and Corruption Reporting Project</a> (OCCRP) Data Team
has developed new features on OCCRP Data in the past six months and
brought together more than 200 different datasets. Its new software
is now configured to let reporters search all of those at once.</p>
<p>OCCRP Data, part of the <a href="https://investigativedashboard.org">Investigative Dashboard</a>, offers journalists a shortcut to the deep web. It now has over 170 public sources and more than 100 million leads for public search – news archives, court documents, leaks and grey literature encompassing UK parliamentary inquiries, companies and procurement databases, NGO reports and even CIA rendition flights, among other choice reading. (All this is publicly available. If you’re associated with OCCRP, you’ll have access to more than 250 million items).</p>
<p>Uniquely, the database also contains international sanctions lists detailing persons of political or criminal relevance.</p>
<p>The new platform makes searching diverse types of objects, such as emails, documents and database entries from corporate or land registries into a unified user experience, with an appropriate way to display each type of data.</p>
<p>Here are seven tips to help you get the most out of OCCRP Data:</p>
<h2>Browse Directly on Your Screen</h2>
<p>OCCRP Data has emails, PDF and Word documents, contracts, old news archives, even <a href="https://data.occrp.org/search?q=Rudyard%20Kipling#facet%3Acollection_id=10&amp;page=1&amp;preview%3Aid=10233625&amp;preview%3Amaximised=true&amp;preview%3Atype=document">Rudyard Kipling poems</a> (from Wikileaks, to be fair). Its brand new interface makes it easier for you to view documents, search within them and preview them in the browser without having to download and open them, making research a faster and more seamless process.</p>
<div class="captioned">
<img src="/assets/images/2018-04/email-preview.png" class="img-responsive">
</div>
<h2>New Search Filter Options</h2>
<p>OCCRP Data lets you filter search results by sources, document type, as well as emails, phone numbers, addresses, entity names, countries and more on its left-hand column, after you’ve run your search.</p>
<div class="captioned">
<img src="/assets/images/2018-04/data-sources.png" class="img-responsive">
</div>
<h2>Highlight Connections</h2>
<p>You can explore structured data in new ways because OCCRP Data uses entity extraction on documents and emails to find phone numbers, names of people and companies, addresses, ID numbers and other key linkage details of interest. Just click on an entity and see the “Tags” option in the preview screen.</p>
<div class="captioned">
<img src="/assets/images/2018-04/tags.png" class="img-responsive">
</div>
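<p>Under the hood, this kind of entity extraction boils down to pattern matching and named-entity recognition over raw text. As a toy illustration of the idea (simple regexes, not Aleph's actual extraction pipeline):</p>

```python
import re

# Deliberately loose, illustrative patterns for phone- and email-shaped strings.
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_entities(text):
    """Pull phone-number- and email-shaped strings out of free text."""
    return {"phones": PHONE.findall(text),
            "emails": EMAIL.findall(text)}

doc = "Contact J. Doe at j.doe@example.com or +44 20 7946 0018."
print(extract_entities(doc))
```

<p>Real extraction also has to handle names of people and companies, which takes proper named-entity recognition rather than regexes; the principle of turning unstructured text into linkable tags is the same.</p>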
<h2>Do Bulk Comparisons</h2>
<p>OCCRP Data is capable of cross-referencing the information on two lists; it also ranks data that closely matches and lets you compare the information. Click on a source and then click on the “Cross Reference” option to choose another source with which to do the comparison.</p>
<div class="captioned">
<img src="/assets/images/2018-04/crossreference.png" class="img-responsive">
</div>
<div class="captioned">
<img src="/assets/images/2018-04/compare-datasets.png" class="img-responsive">
</div>
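<p>Cross-referencing two lists is, at its core, fuzzy matching of names. The UI handles this for you; as a toy illustration of the underlying idea (standard-library <code>difflib</code>, not OCCRP's actual matching algorithm):</p>

```python
from difflib import SequenceMatcher

def best_matches(list_a, list_b, threshold=0.8):
    """Score every name pair; keep close matches, best first."""
    scored = []
    for a in list_a:
        for b in list_b:
            ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if ratio >= threshold:
                scored.append((round(ratio, 2), a, b))
    return sorted(scored, reverse=True)

watchlist = ["ACME Holdings Ltd", "Ivan Petrov"]
registry = ["Acme Holdings Limited", "Jane Doe", "Ivan Petrov"]
for score, a, b in best_matches(watchlist, registry):
    print(score, a, "<->", b)
```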
<h2>Monitor Search Terms, Receive Alerts</h2>
<p>OCCRP Data now has an alerts feature that allows you to monitor a search term so when new information is added to the database you will receive a notification. Simply switch on the bell icon right next to your search query.</p>
<div class="captioned">
<img src="/assets/images/2018-04/alerts.png" class="img-responsive">
</div>
<h2>Language Support</h2>
<p>OCCRP Data now supports multiple languages. The interface is translated and supports Russian and Bosnian-Serbo-Croatian. Search results on the database can also be filtered by language. The data team is working on adding other languages, such as German and Spanish.</p>
<div class="captioned">
<img src="/assets/images/2018-04/bosnian.png" class="img-responsive">
</div>
<h2>Advanced Search Operators</h2>
<p>You can use <a href="https://github.com/alephdata/aleph/wiki/How-to-search">complex search operators</a> to do things such as proximity searches, exact term searches, take into account spelling errors and combine queries.</p>
<div class="captioned">
<img src="/assets/images/2018-04/complex-search-syntax.png" class="img-responsive">
</div>
<h2>Any Questions?</h2>
<p>Anyone accessing OCCRP Data can check out the <a href="https://github.com/alephdata/aleph/wiki">Aleph Wiki</a> where the data team covers its uses, function and development roadmap. Journalists and technologists alike can read the <a href="https://github.com/alephdata/aleph/wiki/User-manual">user manual</a> or contact <a href="mailto:data@occrp.org">data@occrp.org</a> to give us feedback.</p>
<p>Introducing memorious, a web crawling toolkit (2017-11-21, https://tech.occrp.org/blog/2017/11/21/memorious)</p>
<p>Investigating the (often dubious) dealings of businessmen and politicians,
our reporters need access to documents and databases from all over the
world.</p>
<p><img src="/assets/images/2017-11/funes.png" class="img-right"></p>
<p>To make their searches better, we're developing tools that make large
amounts of data accessible with a single keystroke. We have built a set
of crawlers that combine data from governments, corporations and other
media into a <a href="https://data.occrp.org">search engine</a>.</p>
<p>However, these crawlers need to deal with uncooperative websites in
different languages, formats and structures and they often break when
pages are updated.</p>
<p>After experimenting with some existing solutions, we decided to make a
tool that encapsulates our experience with web crawling. The result is a
lightweight open source framework named <code>memorious</code>
(<a href="https://github.com/alephdata/memorious">GitHub</a>).</p>
<p><code>memorious</code> is simple, yet it allows you to create and maintain a fleet
of crawlers without forcing too specific a process on you.</p>
<ul>
<li>Schedule crawlers to run at regular intervals (or run them ad-hoc as you need).</li>
<li>Keep track of error messages and warnings that help admins see which crawlers
are in need of maintenance.</li>
<li>Lets you use familiar tools like <code>requests</code>, <code>BeautifulSoup</code>, <code>lxml</code> or
<code>dataset</code> to do the actual scraping.</li>
<li>Distribute scraping tasks across multiple machines.</li>
<li>Maintain an overview of your crawlers' status using the command line or a
web-based admin interface.</li>
</ul>
<p>For common crawling tasks, <code>memorious</code> does all of the heavy lifting. One
of our most frequent objectives is to follow every link on a large website and
download all PDF files. To achieve this with <code>memorious</code>, all you need to write
<a href="https://github.com/alephdata/memorious/blob/master/example/config/simple_web_scraper.yml">is a YAML file</a>
that plugs together existing components.</p>
<div class="captioned">
<img src="/assets/images/2017-11/memorious-ui.png" class="img-responsive">
<div class="caption">
A web-based admin interface allows you to keep track of the status of all
of your crawlers.
</div>
</div>
<p>Each <code>memorious</code> crawler is composed of a set of stages that call each
other in succession (or themselves, recursively). Each stage either executes a
built-in component or a custom Python function that may fetch, parse or store a
page just as you like it. <code>memorious</code> is also extensible, and contains lots of
helpers to make building your own custom crawlers as convenient as possible.</p>
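<p>The succession-of-stages design can be mimicked in a few lines of plain Python. This is a hypothetical sketch of the control flow, not the real <code>memorious</code> API; the stage names, <code>emit</code> helper, and fake page body are all invented for illustration:</p>

```python
# A registry mapping stage names to handler functions.
PIPELINE = {}

def stage(name):
    """Register a function as a named crawler stage."""
    def register(fn):
        PIPELINE[name] = fn
        return fn
    return register

def emit(name, data):
    """Hand data to the next stage, the way chained stages call each other."""
    PIPELINE[name](data)

RESULTS = []

@stage("fetch")
def fetch(data):
    # A real crawler would download data["url"]; here we fake a page body.
    emit("parse", {**data, "body": "<a href='report.pdf'>report</a>"})

@stage("parse")
def parse(data):
    if ".pdf" in data["body"]:
        emit("store", {**data, "file": "report.pdf"})

@stage("store")
def store(data):
    RESULTS.append(data["file"])

emit("fetch", {"url": "https://example.org"})
print(RESULTS)  # ['report.pdf']
```

<p>In <code>memorious</code> the wiring between stages lives in the YAML file rather than in code, which is what makes the crawlers so modular.</p>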
<p>These configurable chains of operations have made our crawlers very modular, and
common parts are reused efficiently. All crawlers can benefit from automatic
cookie persistence, HTTP caching and logging.</p>
<p>Within OCCRP, <code>memorious</code> is used to feed documents and structured data into
<a href="https://github.com/alephdata/aleph">aleph</a> via an API, which means documents
become searchable as soon as they have been crawled. There, they will also
be sent through OCR and entity recognition. <code>aleph</code> aims to use these extracted
entities as bridges that link a given document to other databases and documents.</p>
<p>For a more detailed description of what <code>memorious</code> can do, see the
<a href="https://memorious.readthedocs.io">documentation</a> and check out our
<a href="https://github.com/alephdata/memorious/tree/master/example">example project</a>.
You can try <code>memorious</code> by running it locally in <a href="https://memorious.readthedocs.io/en/latest/installation.html#development-mode">development mode</a>,
and, of course, we also have a Docker setup for robust production deployment.</p>
<p>As we continually improve our crawler infrastructure at OCCRP, we'll be adding
features to <code>memorious</code> for everyone to use. Similarly, we'd love input from the data
journalism and open data communities; <a href="https://github.com/alephdata/memorious/issues">issues</a>
and <a href="https://github.com/alephdata/memorious">PRs</a> are welcome.</p>
<hr />
<blockquote><p>&ldquo;&hellip;the solitary and lucid spectator of a multiform, instantaneous and almost
intolerably precise world&hellip;&rdquo; (<em>Funes the memorious, Jorge Luis Borges</em>)</p></blockquote>
<p>Manually deleting visits from Piwik by their IP (or URL, or...) (2016-02-11, https://tech.occrp.org/blog/2016/02/11/deleting-piwik-visits)</p>
<p><a href="http://piwik.org/">Piwik</a> is an awesome self-hosted analytics service. We've relied on it for a very long time and have always been satisfied with what it brought us. During our work, however, we accidentally allowed traffic from our own servers to appear in Piwik, and some general traffic to be counted more than once, thereby artificially inflating our own stats.</p>
<p>Now, since Piwik itself is open source, we see no reason why we should <em>not</em> be able to delete the artificially inflated stats ourselves, ensuring that our journalists see stats that are as precise as possible.</p>
<p>Since the information about this process was not as clear as we wanted it to be, I've decided to write this blog post to make the job easier for anyone else who tries to do the same. To follow this tutorial, you're going to need a Piwik installation (obviously), access to the command line on the server, and some SQL-fu.</p>
<h2>Step 1: Finding the records that you want to delete</h2>
<p>This seems like a simple thing, but it turned out to be much harder. We had a list of a couple of IP addresses that we wanted to exclude from Piwik, but after about half an hour of searching through Piwik's interface, I was not able to find a way to see all the traffic that originated from a specific IP address. Luckily, I stumbled upon <a href="http://blog.onlineinstitute.com/traffic-analytics/how-to-search-for-ip-addresses-within-piwiks-logs/">this short post</a>, which gave me all the information I needed. To see the traffic from a specific IP, you have to manually tweak the URL you are visiting to:</p>
<pre><code>https://piwik.example.com/index.php/?module=CoreHome&amp;action=index&amp;idSite=1&amp;period=year&amp;date=2016#module=Live&amp;action=getVisitorLog&amp;idSite=1&amp;segment=visitIp=={{ IP ADDRESS GOES HERE }}
</code></pre>
<p>Bear in mind that Piwik shows 500 actions per visit as a maximum, so if the requested IP made over 500 actions in a single visit (for example, if it was a bot, or if somebody tried to scrape your website), you're only going to see the very first 500 actions that were requested by that IP.</p>
<h2>Step 2: Finding and deleting records from the database(s)</h2>
<p>The second step is to find the records in the database as well. To do this for the IP you're interested in, you'll have to convert the IP address to hex. Of course, anyone who finished two IT college courses should be able to convert the number by hand, but if you feel too lazy, just use <a href="http://www.miniwebtool.com/ip-address-to-hex-converter/">this online tool</a> to do so. Or use <code>python</code>:</p>
<pre><code class="python">ip = "203.0.113.42"  # the address in question
print("".join("%02x" % int(octet) for octet in ip.split(".")))  # cb00712a
</code></pre>
<p>Once you have the HEX equivalent of the IP in question, log into MySQL/MariaDB and execute the following command to get the count of rows (or: pageviews) that will be affected:</p>
<pre><code class="sql">SELECT COUNT(*)
FROM piwik_log_visit AS log_visit
LEFT JOIN piwik_log_link_visit_action as log_link_visit_action
ON log_visit.idvisit = log_link_visit_action.idvisit
LEFT JOIN piwik_log_action as log_action
ON log_action.idaction = log_link_visit_action.idaction_url
WHERE log_visit.location_ip=UNHEX("{{ HEX_VALUE_GOES_HERE }}");
</code></pre>
<p>As you can see, Piwik stores the relevant visitor info in three separate MySQL tables: <code>piwik_log_visit</code>, <code>piwik_log_link_visit_action</code> and <code>piwik_log_action</code>.</p>
<p>If you skip one of them, you'll encounter unexpected results. For example, we initially tried removing the data from <code>piwik_log_visit</code> and <code>piwik_log_link_visit_action</code> only; once we re-computed the logs, we noticed that the IP was still there and the visit time was still being shown, even though we had successfully deleted the actions associated with that visit.</p>
<pre><code>0 Action - 42 min 59s
</code></pre>
<p>This is why it's important to delete the data from all three tables.</p>
<p>To delete the necessary entries from all the tables, tweak the command above like this (for IP-based pruning):</p>
<pre><code class="sql">DELETE log_visit, log_link_visit_action
FROM piwik_log_visit AS log_visit
LEFT JOIN piwik_log_link_visit_action as log_link_visit_action
ON log_visit.idvisit = log_link_visit_action.idvisit
LEFT JOIN piwik_log_action as log_action
ON log_action.idaction = log_link_visit_action.idaction_url
WHERE log_visit.location_ip=UNHEX("{{ HEX_VALUE_GOES_HERE }}");
</code></pre>
<p>You can verify that the visits/pageviews are gone from the db by using the <code>SELECT</code> statements again, of course.</p>
<h2>Step 3: Re-compute the reports</h2>
<p>If you have successfully completed the first two steps, your last step is re-computing the reports. If you skip this, you won't accomplish anything: the traffic will still be visible in the reports, even though it has been removed from the tables.</p>
<p>To do so, I highly recommend taking a careful look at Piwik's documentation. Specifically, pay close attention to these two posts:</p>
<ul>
<li><a href="http://piwik.org/faq/how-to/faq_59/">How do I force the reports to be re-processed from the logs?</a></li>
<li><a href="http://piwik.org/faq/how-to/faq_155/">How do I record tracking data in the past, and tell Piwik to invalidate and re-process the past reports?</a></li>
</ul>
<p>Make sure that you invalidate data for the particular sites and dates affected, as processing time is directly dependent on this.</p>
<h2>Bonus -- what about URLs?</h2>
<p>Notice what we put after the <code>WHERE</code> keyword in step two. You can do all sorts of crazy things there. For example:</p>
<pre><code class="sql">[...] WHERE log_action.name LIKE 'example.com/wp-content/themes/%'
</code></pre>
<p>...will remove the traffic that hit the files associated with the WordPress theme you are using.</p>
<p>Keeping Your Android Secrets Out of Git (2016-01-19, https://tech.occrp.org/blog/2016/01/19/android-secrets)</p>
<p>Developers have a habit (one I've been guilty of) of committing API keys and other secrets to our repositories. It’s easy to do if you’re tired, if you’re hurried, if you’re “moving fast and breaking things”.</p>
<p>This, unfortunately, has been too difficult to prevent for too long. Luckily, in the interest of security there’s been a big push to stop this practice lately: Rails has had the <a href="https://github.com/laserlemon/figaro">Figaro</a> gem, and as of version 4.1 it has a built-in <a href="http://daniel.fone.net.nz/blog/2013/05/20/a-better-way-to-manage-the-rails-secret-token/">“secrets.yml”</a> file. Heroku has a config <a href="https://devcenter.heroku.com/articles/config-vars">ENV</a> screen to <em>attempt</em> to coerce developers into keeping secret keys out of production. Apple’s iOS and its <a href="http://www.raywenderlich.com/92667/securing-ios-data-keychain-touch-id-1password">keychain</a> help with this on iPhones and iPads.</p>
<p>As far as I’ve been able to tell, Android has been terrible at this.</p>
<p>I’ve scoured documentation, searched for hours across StackOverflow, and questioned friends who are much better at Android than I am. After pulling together bits and pieces, I think I’ve figured out a good way to do this. I’m probably not the first, but there doesn’t seem to be a comprehensive write-up of this technique anywhere, so I'm hoping these steps help fellow Android devs up their security a bit.</p>
<p>Note: this does not secure credentials in the wild. It will not stop someone from decompiling your APK and pulling the strings out; everything ends up in the compiled app. What it does do is keep someone from going through your GitHub account and copy/pasting your secrets out of it.</p>
<p><em>Note: For these steps I’m assuming you’re using Android Studio.</em></p>
<ol>
<li><p>Recognize what needs to be kept secret.</p>
<ul>
<li>Anything that’s unique to your deployment of the software.</li>
<li>If you use a key for Google Analytics, or if you have an encryption key that needs to be hardcoded, these should never be committed to a repository.</li>
<li>Don’t put these into the source code, ever, even for brief testing purposes.</li>
</ul>
</li>
<li><p>Create a Gradle file just for your keys</p>
<ul>
<li>In Android Studio’s Project Navigator expand your “Gradle Scripts” drop down.</li>
<li>Right click anywhere below the “Gradle Scripts” icon and hover over “New” and then click “File”.</li>
<li>Name this file “safe_variables.gradle” (or whatever you want, just make note of it if it’s different).</li>
</ul>
</li>
<li><p>Add this file to your .gitignore.</p>
<ul>
<li><p>We don’t want to accidentally add it to the repository, so add the following line to the bottom of your .gitignore file in the project:</p>
<p> <code>/app/safe_variables.gradle</code></p></li>
</ul>
</li>
<li><p>Commit your .gitignore file</p>
<p> <code>git commit .gitignore -m "Added secrets file to git ignore"</code></p></li>
<li>Add your keys to the new secrets file.
<ul>
<li>This example uses two build types, debug and release. The names of the keys are arbitrary and you can put in whatever you need to keep secret.</li>
</ul>
<pre><code class="groovy"> buildTypes {
     debug {
         resValue "string", "server_url", "https://dev.example.com"
         resValue "string", "hockey_key", "*********************"
     }
     release {
         resValue "string", "server_url", "https://production.example.com"
         resValue "string", "hockey_key", "*********************"
     }
 }

 // This line is only necessary if your app is using localization files for the strings.
 // There doesn't seem to be a way to add the strings to multiple languages from Gradle.
 lintOptions {
     disable 'MissingTranslation'
 }
</code></pre></li>
<li><p>Include this file into your Gradle file</p>
<ul>
<li><p>In the “build.gradle (Module: app)” file, add the following line right after the “android” block:</p>
<pre><code class="groovy"> apply from: 'safe_variables.gradle', to: android
</code></pre></li>
</ul>
</li>
<li><p>Commit your project again</p></li>
<li>Build your project
<ul>
<li>“Build” menu -> “Make Project”</li>
<li>This will automatically add the values to the compiled resources so you can reference them in your code.</li>
</ul>
</li>
<li><p>Reference your API keys and other secrets where you need them.</p>
<ul>
<li>In an Activity you can reference it like so:</li>
</ul>
<pre><code class="java"> getResources().getString(R.string.server_url)
</code></pre></li>
<li><p>In a fragment the following syntax can be used:</p>
<pre><code class="java"> getActivity().getApplicationContext().getString(R.string.server_url)
</code></pre></li>
</ol>
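<p>Before committing, it's worth a quick sanity check that the ignore entry really matches. Here is one way, recreating the layout from the steps above in a scratch directory (paths are the same hypothetical ones used throughout):</p>
<pre><code>cd "$(mktemp -d)"
mkdir -p app
printf '/app/safe_variables.gradle\n' >> .gitignore
touch app/safe_variables.gradle

# -x requires the whole line to match, leading slash included:
if grep -qx '/app/safe_variables.gradle' .gitignore; then
  echo "entry present"
fi

# With git available, this is even more direct:
# git init -q . ; git check-ignore app/safe_variables.gradle
</code></pre>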
<p>That should do it. The only thing to remember is that if you’re switching machines or adding a new teammate they’ll have to recreate steps 2 and 5 on their machine as well.</p>
<p>If there’s an easier way to do this or perhaps a way to do it without having to turn off the translations error please feel free to get in contact at <a href="mailto:cguess@gmail.com">cguess@gmail.com</a> or on Twitter at <a href="https://www.twitter.com/cguess">@cguess</a>.</p>
Migrating ElasticSearch across versions and clusters2016-01-13T00:00:00+00:00https://tech.occrp.org//blog/2016/01/13/migrating-elasticsearch<p>Migrating data between ES clusters might seem like a simple thing -- after all, there are <a href="http://tech.taskrabbit.com/blog/2014/01/06/elasticsearch-dump/">dedicated tools</a> for that. Or one could <a href="https://stackoverflow.com/questions/17884581/elasticsearch-how-to-copy-data-to-another-cluster/26295832#26295832">use logstash with a simple config</a>.</p>
<p>Things get a bit hairy, however, when the source and destination cluster versions differ wildly. Say, like <code>0.90.3</code> and <code>1.7.3</code>. And when you don't happen to have any admin access to the source cluster (only via <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-http.html"><code>HTTP</code></a> and <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-transport.html"><code>transport</code></a> interfaces). And when the source cluster is misconfigured in just a slight but annoying manner...</p>
<h2>What did not work</h2>
<ul>
<li><code>elasticdump</code></li>
</ul>
<p><a href="http://tech.taskrabbit.com/blog/2014/01/06/elasticsearch-dump/">ElasticDump</a> was the first and obvious thing to try, but apparently it only supports migrations between clusters running ElasticSearch 1.2 and higher. So, that's a no-go.</p>
<ul>
<li><code>logstash</code></li>
</ul>
<p>Apparently <a href="http://stackoverflow.com/a/26295832">one can use logstash</a> to migrate data between clusters. Unfortunately <a href="https://discuss.elastic.co/t/migrating-data-off-of-es-0-90-3-and-into-es-1-7-x/35471/">this solution did not work, either</a>.</p>
<h2>What did work</h2>
<p>Please keep in mind that this procedure worked for us, but it might not work for you. Specifically, if the source is a cluster of more than one node, you might need to do some fancy <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/allocation-filtering.html">shard allocation</a> to make sure that all shards and all indices are copied over to the new node.</p>
<h3>1. Create a new node</h3>
<p>Why not cluster with the source ES server (itself running a single-node cluster) by creating a new node that we do control, and thus get the data? Getting a <code>docker</code> container to run ES version <code>0.90.3</code> took just a bit of manual fiddling with the Dockerfile. Changing the versions everywhere worked well, but here's a hint: up until <code>1.0</code> or so, <code>elasticsearch</code> <a href="https://github.com/elastic/elasticsearch/issues/4392">ran in the background by default</a>. Thus, the docker container stopped immediately after starting <code>elasticsearch</code>, for no apparent reason...</p>
<p>So if you're dealing with a pre-<code>1.0</code> ES, just add a <code>-f</code> (for "foreground") to the command in the Dockerfile to save yourself a bit of frustration.</p>
<h3>2. Cluster with the source server</h3>
<p>Once we have this running, it's time to cluster with the source node. What could be easier? Disable <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html">multicast discovery</a>, enable unicast with a specified host and we're good, right?</p>
<p>Wrong. Remember the <em>"misconfigured in just a slight but annoying manner"</em> thing? The source server turned out to be behind a NAT, and the IP we could connect to differed from the IP the server published. Hence, our new node discovered the source node as master, but then -- based on the info received from it -- tried connecting via its internal (<code>10.x.y.z</code>) IP address. Which obviously did not work.</p>
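<p>For the record, the discovery settings in the new node's <code>elasticsearch.yml</code> looked roughly like this (the address is a placeholder for the source node's public IP; <code>9300</code> is the default transport port, and <code>cluster.name</code> has to match the source cluster's name):</p>
<pre><code>cluster.name: name-of-the-source-cluster
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["203.0.113.10:9300"]
</code></pre>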
<h3>3. iptables to the rescue</h3>
<p>As we had no way of changing the configuration of the source node, the only thing we could do was mangle the IP packets, so that packets going to <code>10.x.y.z</code> would have the <a href="http://linux-ip.net/html/nat-dnat.html">destination address modified</a> to the public IP of the source node (and those coming from the public IP of the source node would get modified to have the source address of <code>10.x.y.z</code>, but <code>iptables</code> handled that for us automagically):</p>
<pre><code>iptables -t nat -I PREROUTING -d 10.x.y.z -j DNAT --to-destination &lt;external_IP_of_the_source_node&gt;
</code></pre>
<p>We love one-liners, don't you?</p>
<h3>4. Get the data</h3>
<p>Once we had this working and <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-state.html">confirmed that the cluster now consisted of two nodes</a> (the source node and our new one), we just sat back and watched the ~9GiB of data flow in. Once that was done, it was time to down the new node, disable discovery altogether, and up it again to <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/search-count.html">verify that we now had the data in there</a>, on a node that we actually control.</p>
<h3>5. Upgrade</h3>
<p><a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/restart-upgrade.html">Full cluster restart upgrade</a> is what we had to do next, but for our single-node cluster that's just a fancy way of saying:</p>
<ul>
<li>down the node;</li>
<li>upgrade ES to whatever version needed (<code>1.7.3</code> in our case);</li>
<li>up the node;</li>
<li>verify everything is AOK.</li>
</ul>
<h3>6. Migrate the data to the production node</h3>
<p>Since we were doing all this on a new node, created only to get the data off of the source ES <code>0.90.3</code> server, we needed to shunt the data off of it and into our production server (changing the index name in the process for good measure). This is where we turned back to <code>elasticdump</code>, and using a simple script were able to successfully migrate the data off of the new node and onto our production ES <code>1.7.3</code> server.</p>
<p>Of course, things could not go smoothly and easily here, either. The dump kept being interrupted by <a href="https://github.com/taskrabbit/elasticsearch-dump/issues/138">an EADDRNOTAVAIL error</a>; a quick workaround was to use the <code>--skip</code> command-line argument of <code>elasticdump</code> to skip rows that had already been migrated.</p>
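<p>For the curious, the "simple script" boiled down to something like the following sketch. Hosts and index names here are made up; after a crash you bump <code>SKIP</code> to the number of rows the previous run managed to copy:</p>
<pre><code>SRC="http://newnode:9200/old_index"       # the node we pulled the data onto
DST="http://production:9200/new_index"    # our production 1.7.3 server
SKIP=0    # raise this after each EADDRNOTAVAIL crash to resume where it died

CMD="elasticdump --input=$SRC --output=$DST --limit=1000 --skip=$SKIP"
echo "$CMD"
# $CMD    # uncomment to run the migration for real
</code></pre>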
First Post2016-01-13T00:00:00+00:00https://tech.occrp.org//blog/2016/01/13/hello-world<p>Around this time a year ago, OCCRP's Tech Team consisted of a single
webmaster in Travnik and a few loyal friends who would be called upon to
assist when new challenges arose. Since then, we've expanded to an
excellent group of people, sitting mostly in Sarajevo and doing some pretty
amazing things. This new blog is to tell you of our adventures.</p>
<p>We are working with a lot of different technologies, managing some
legacy systems, building new software and making sense of a lot of
structure. The core aim of all of our activities is simple: shorten the
time to journalism (TTJ).</p>
<p>In practice, this means a lot of stuff, from the simple-looking task of
making sure it's possible to publish stories, with whatever visuals and
interactives are needed, to serving tools for sharing and collaboration
between journalists, to constructing gadgets that allow journalists to
work more effectively, and designing and building new systems that allow
journalists to work better with large volumes of data.</p>
<p>On this blog we'll try to go a bit deeper into what
we're doing and why, explaining our thought processes and announcing
exciting developments.</p>
<p>But first, some resources:</p>
<ul>
<li><a href="https://occrp.org">OCCRP's website</a></li>
<li><a href="https://github.com/occrp">OCCRP on Github</a></li>
<li><a href="https://investigativedashboard.org">Investigative Dashboard</a></li>
</ul>
<p>That's all for now. Watch this space!</p>