Recently we have seen a significant rise in the creation of untruthful and falsified data.

Data veracity (the problem of false or inaccurate data) is often overlooked, yet it may be as important as the three V's of big data: volume, velocity and variety. Here data may be falsified intentionally, negligently or mistakenly. Data veracity should be distinguished from data quality, usually defined as the reliability and application efficiency of data, and sometimes used to describe incomplete, uncertain or imprecise data.

Traditional data warehouse / business intelligence (DW/BI) architecture assumes certain and precise data, achieved only by spending unreasonably large amounts of human capital on data preparation, ETL/ELT and master data management. Yet the big data revolution forces us to rethink traditional DW/BI architecture so that it can accept massive amounts of both structured and unstructured data at great velocity. By its nature, unstructured data contains a significant amount of false as well as uncertain and imprecise data. Social media data, for example, is inherently uncertain and contains many falsehoods.

For many data science projects, about half or more of the time is spent on "data preparation" (e.g., removing duplicates, fixing partial entries, eliminating null/blank entries, concatenating data, collapsing or splitting columns, aggregating results into buckets, etc.). I suggest this is a "data quality" issue, in contrast to false or inaccurate data, which is a "data veracity" issue.
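
As a rough illustration, here is a minimal pandas sketch of a few of those preparation steps; the column names, sample values and bucket edges are purely hypothetical.

```python
import pandas as pd

# Hypothetical messy input: a duplicate row, a missing amount and a combined name column
df = pd.DataFrame({
    "name":   ["Ada Lovelace", "Ada Lovelace", "Alan Turing"],
    "amount": [120.0, 120.0, None],
})

df = df.drop_duplicates()                                             # remove duplicates
df = df.dropna(subset=["amount"])                                     # eliminate null/blank entries
df[["first", "last"]] = df["name"].str.split(" ", n=1, expand=True)   # split a column
df["bucket"] = pd.cut(df["amount"], bins=[0, 100, 200])               # aggregate results into buckets
print(df)
```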

Given the variety and velocity of big data, an organization can no longer commit the time and resources that traditional ETL/ELT and data preparation demand to make the data certain and precise for analysis. While there are tools to help automate data preparation and cleansing, they are still in the pre-industrial age.

Data veracity issues can wreck a data science project: if the data is objectively false, then any analytical results are meaningless and create an illusion of reality that may lead to bad decisions and fraud, sometimes with civil liability or even criminal consequences. As a result, I strongly advise data scientists to assign a data veracity score and ranking to specific data sets so they can avoid making decisions based on analysis of false data. See: http://bit.ly/1kXsipc
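
As a minimal sketch of what such a score might look like, the function below computes a weighted fraction of records passing simple plausibility checks; the checks and weights are purely illustrative, not an established method.

```python
def veracity_score(records, checks):
    """Return a 0-1 score: the weighted fraction of records passing each check."""
    total_weight = sum(weight for _, weight in checks)
    score = 0.0
    for check, weight in checks:
        passed = sum(1 for r in records if check(r)) / len(records)
        score += weight * passed
    return score / total_weight

# Hypothetical records and checks
records = [
    {"source": "sensor", "age": 34},
    {"source": "social_media", "age": -5},   # implausible value
]
checks = [
    (lambda r: 0 <= r["age"] <= 120, 0.6),           # plausibility check
    (lambda r: r["source"] != "social_media", 0.4),  # trusted-source heuristic
]

print(veracity_score(records, checks))   # 0.5 for this toy data
```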

I completely misunderstood Kay Ousterhout's paper Making Sense of Performance in Data Analytics Frameworks to be the comparison on the left. Through a series of e-mails, Kay Ousterhout clarified that the comparison her team drew, regarding the disk performance section of their paper, was (if I now understand correctly) what I've drawn on the right. She wrote to me:

"The 19% improvement refers to only the improvement from moving compressed, serialized data from on-disk to in-memory. When you store data in-memory natively with Spark, Spark decompresses and deserializes the data into Java objects, resulting in a much larger improvement; this deserialized and decompressed format is usually what people refer to when they say "In-Memory Spark". (The improvement for the big data benchmark from on-disk Spark to in-memory Spark is quantified here: https://amplab.cs.berkeley.edu/benchmark/). So, you could say "Disk I/O is not the bottleneck for on-disk Spark" or "On-disk Spark could only improve by 19% from optimizing disk I/O". It would also be correct to say that our results imply that Spark using flash would only be at most 19% faster than on-disk Spark (because in that case, the data would still be serialized and compressed)."

So the motivation behind their work seems to be: given that you have a Spark cluster, what hardware (disk and network) and scheduler/task restructuring changes could you make to improve performance?

Spark is still fast -- many times faster than Hadoop. But the reason it's fast is the surprising (to me at least) result of the work behind this paper: Spark is fast because the data is already deserialized and decompressed. In iterative computations, Spark avoids the serialization/deserialization and compression/decompression round-trips that Hadoop goes through -- at least for data that doesn't go through shuffles. Spark shuffles, of course, are serialized, and by default they are also compressed unless spark.shuffle.compress is set to false.
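
As a rough PySpark sketch of the reuse and shuffle behavior described above (assuming a local Spark installation; note that the deserialized Java-object format Ousterhout describes applies to Spark's JVM-side caching, whereas PySpark caches pickled Python objects, so treat this only as an illustration):

```python
from pyspark.sql import SparkSession

# spark.shuffle.compress defaults to true; shuffle output stays compressed unless it is set to false.
spark = (SparkSession.builder
         .appName("cache-and-shuffle-sketch")
         .config("spark.shuffle.compress", "true")   # explicit here only for illustration
         .getOrCreate())

rdd = spark.sparkContext.parallelize(range(1_000_000))

# cache() marks the RDD for in-memory reuse, so the second action below
# does not recompute or re-read the data.
squares = rdd.map(lambda x: x * x).cache()

print(squares.sum())     # first action materializes and caches the partitions
print(squares.count())   # second action reuses the cached partitions

# reduceByKey forces a shuffle; its output is serialized and, by default, compressed.
pairs = squares.map(lambda x: (x % 10, 1)).reduceByKey(lambda a, b: a + b)
print(pairs.collect())

spark.stop()
```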

I apologize to Kay Ousterhout et al for mischaracterizing their results.