Dark Data, BigQuery, and Ransomware!

Dark Data, BigQuery, Ransomware and Pie Charts are some of the terms that were discussed at the BigDataExpo 2016 conference I attended last week. It is a free conference held at the Jaarbeurs Utrecht conference center near Amsterdam in the Netherlands. It drew a few thousand visitors, who came to see presentations delivered in English, Dutch, or sometimes even a mix of both! As the event’s tagline -- "Develop your Big Data strategy" -- suggests, the conference was targeted more towards business people and less towards developers.

There were a few presentations I was interested in seeing, and the one that interested me most was titled "More insights with visualization". Unfortunately, a lot of other attendees were interested as well. The room was packed when I arrived, and it was impossible for me to get in. The presenter was prof.dr.ir. Jack van Wijk, who is the professor of Visualization at the TU/e (Eindhoven University of Technology). He is also the leader of the TU/e visualization group, which has been the source of two commercial spin-offs: MagnaView and SynerScope. At the SynerScope booth I saw a nice demo where the tool was used to gain insights from all kinds of data: "structured and unstructured as well as Dark Data (data which is stored and never analyzed because of technological and cost restraints)".

Querying 1 PB (PetaByte) in under 3 minutes

The presentation I was most impressed by was a talk by two people from Google. They noted that it took years after Google's initial paper on MapReduce before Hadoop took off and was widely adopted by others. To prevent a technology gap like that from happening again, Google is hoping to use the Google Cloud Platform to make its technology available to others immediately.

During their talk they also showed two SQL queries executed with BigQuery on 1 PetaByte of data. A slide put 1 PB in perspective (1 PB = 1024 TeraByte, 1 TB = 1024 GiB): it is the amount of data that would take 27 years to download with your phone over the 4G network. To show off, they executed a deliberately naive "SELECT * WHERE (X = Y...)" query, which ran in 2 minutes and 40 seconds. Another query matched the regular expression /G.*o.*o.*s/ against all book titles in another large table containing books. That query touched 4.4 TeraByte and ran in 26 seconds. I tried the same regex on a laptop (Intel Core i7-6700HQ, 16 GiB DDR4, 512 GiB NVMe SSD) against a similar 1 GiB file, which took 2.7 seconds. By comparison, scanning 4.4 TB at that rate would take more than three hours on that one machine. Roughly 460 to 470 laptops, depending on how you round, would have been needed to match the 26-second time on BigQuery. In fact, it would take even more than that, as there is also overhead involved in scaling to multiple machines.
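The back-of-the-envelope comparison above can be reproduced with a few lines of Python. This is only a rough sketch: it assumes the conversion quoted from the slide (1 TB = 1024 GiB) and perfectly linear scaling with no coordination overhead. The function names and the two sample titles are made up for illustration and have nothing to do with the actual BigQuery table. Depending on how the intermediate values are rounded, the laptop count lands somewhere in the 460s.

```python
import math
import re

# Figures quoted in the text above.
GIB_PER_TB = 1024             # conversion used on the slide: 1 TB = 1024 GiB
LAPTOP_SECONDS_PER_GIB = 2.7  # one laptop, one 1 GiB file
BIGQUERY_TB_SCANNED = 4.4
BIGQUERY_SECONDS = 26

# The regular expression from the talk (case-sensitive, as written).
TITLE_PATTERN = re.compile(r"G.*o.*o.*s")


def laptop_seconds(tb_scanned):
    """Time for a single laptop to scan tb_scanned, assuming linear scaling."""
    return tb_scanned * GIB_PER_TB * LAPTOP_SECONDS_PER_GIB


def laptops_needed(tb_scanned, target_seconds):
    """Laptops needed to hit target_seconds, ignoring all distribution overhead."""
    return math.ceil(laptop_seconds(tb_scanned) / target_seconds)


# Illustrative sample titles (an assumption, not data from the talk).
for title in ["Good Omens", "Great Expectations"]:
    print(title, "->", bool(TITLE_PATTERN.search(title)))

hours = laptop_seconds(BIGQUERY_TB_SCANNED) / 3600
print(f"Single laptop: {hours:.1f} hours")
print("Laptops to match BigQuery:",
      laptops_needed(BIGQUERY_TB_SCANNED, BIGQUERY_SECONDS))
```

The estimate is a lower bound in practice: splitting the file, shipping the pieces to each machine, and collecting the results all add time that this calculation ignores.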

Some Interesting Trivia!

From Google:

20% of searches via mobile are voice searches. You can search through your own Google Photos for locations or characteristics like "El Capitan Mountains", and it will find photos you've taken there even though you never tagged anything, thanks to Machine Learning.

From Microsoft:

The Next Rembrandt is a convincing 3D-printed painting created by an algorithm that learned how to paint like Rembrandt using actual Rembrandt paintings as its training data. The materials (paint and frame) cost only €60. However, that doesn’t include the cost of the two-year research effort that led to the creation of the painting.

From Sophos:

Regarding Ransomware (malicious software that locks you out and demands money to let you back in): "To pay or not to pay? That is the question." The non-nuanced answer is yes, but preferably make off-site backup(s) so you don't need to.

From LinkedIn:

Talent seems to be spread "democratically", or evenly, across the world, but the ability to apply that talent isn't. LinkedIn’s mission is to remove this "friction".

From Cisco:

An estimated 37 billion new things will be added to the "Internet of Everything" by the year 2020. Today, there are 1.5 million additions per week, including cars and vending machines.

On Security:

Every minute, 2.5 million cyber threats occur, and it takes an average of 200 days(!) for companies to discover they have been hacked.

“There are two kinds of organizations: those that have been hacked and those that don't know they've been hacked!”