Featured Author

Nicolas A Perez

Software engineer, IPC

I am a software engineer at IPC, an independent SUBWAY® franchisee-owned and operated purchasing cooperative, where I work on their Big Data Platform. I am very interested in Apache Spark, Hadoop, distributed systems, algorithms, and functional programming, especially in the Scala programming language.

In the past, I have done a lot of programming and engineering in C# on the .NET Framework, an environment where I feel very comfortable and knowledgeable. Past work includes payment processing systems, POS systems, and mobile systems. All of them have allowed me to grow professionally in different areas of expertise.

Author's Posts

The Apache Spark community is thriving, and they have put a lot of effort into extending Spark. Recently, we have been interested in transforming an XML dataset into something that's easier to query. Our main interest is being able to do data exploration on top of billions of transactions that we get every day. In this blog post, I'll walk you through how to use an Apache Spark package from the community to read any XML file into a DataFrame.

Logging in Apache Spark is very easy to do, since Spark offers access to a log object out of the box; only some configuration is needed. In a previous post, we looked at how to do this while identifying some problems that may arise. However, the solution presented might cause some problems when you are ready to collect the logs, since they are distributed across the entire cluster.

In this blog post, you'll learn how to do some simple yet very interesting analytics that will help you solve real problems by analyzing specific areas of a social network. A subset of a Twitter stream was the perfect choice for this demonstration...

In my last post, we explained how we could use SQL to query our data stored within Hadoop. Our engine is capable of reading CSV files from a distributed file system, auto-discovering the schema from the files, and exposing them as tables through the Hive metastore. All this was done so that we could connect standard SQL clients to our engine and explore our dataset without manually defining the schema of our files, avoiding ETL work.

An important part of any application is the underlying log system we incorporate into it. Logs are not only for debugging and traceability, but also for business intelligence. Building a robust logging system within our apps can provide great insight into the business problems we are solving.