CRUNCH is a use-case-heavy conference for people interested in building the finest data-driven businesses. No matter the size of your venture or your job description, you will find exactly what you need at the two-track CRUNCH conference. A data engineering track and a data analytics track will serve diverse business needs and levels of expertise.

If you are a Data Engineer, Data Scientist, or Product Manager, or simply interested in how to use data to develop your business, this conference is for you. No matter the size of your company or the volume of your data, come and learn from the biggest players in Big Data, draw inspiration from their practices, their successes and failures, and network with other professionals like you.

Here's a short video overview of Crunch 2015:

Speakers

Keynotes

Mike Olson

Board Chairman and Chief Strategy Officer, Cloudera

Big Data in the Real World: Technology and Use Cases

Over the last ten years, the big data ecosystem has changed significantly, and users have learned a great deal about how data can be applied to hard problems. In this talk, I will describe some of the key technologies that dominate the big data landscape today, and where they are headed. I will cover a number of interesting real-world use cases and describe how they use those technologies. The original components of Apache Hadoop -- HDFS and MapReduce -- play a role, but I will also cover newer real-time components like Apache Spark, Apache Kafka, Apache Impala and more.

Bio

Mike Olson cofounded Cloudera in 2008 and served as its CEO until 2013, when he took on his current role of chief strategy officer. As CSO, Mike is responsible for Cloudera’s product strategy, open source leadership, engineering alignment, and direct engagement with customers. Prior to Cloudera, Mike was CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine. Mike spent two years at Oracle Corporation as vice president for embedded technologies after Oracle’s acquisition of Sleepycat. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies, and Informix Software. Mike has a bachelor’s and a master’s degree in computer science from the University of California, Berkeley.

Andy Cotgreave

Technical Evangelist, Tableau

The Beautiful Science of Data Visualization

Seeing and understanding data is richer than creating a collection of queries, dashboards, and workbooks. You will see how visual and cognitive science explain what makes data visualization so deeply satisfying. Why does a collection of bars, lines, colors, and boxes become surprisingly powerful and meaningful? How does fluid interaction with data views multiply our intelligence? Three decades of research into the beautiful science of data visualization explain why history has converged at this moment, and why interactive data visualization has brought us to the verge of an exciting new revolution.

Bio

Andy Cotgreave is a visual analytics expert who has been with Tableau in various roles since 2011, ranging from product consultant to social content manager. He shares his passion for visual analysis and technology through his writing (e.g., in Computerworld, on tableau.com, and on his own blog) and by speaking at industry conferences such as SXSW and Tableau’s own events. He’s also active on Twitter as @acotgreave. Andy’s role at Tableau gives him the opportunity to work with the media, analysts, and customers across all industries to help them understand the trends in visual analytics and develop their own data-discovery skills.

Casey Stella

Principal Architect, Hortonworks

Data Preparation for Data Science: A Field Guide

Any data scientist who works with real data will tell you that the hardest part of any data science task is data preparation. From cleaning dirty data to understanding where your data is missing and how your data is shaped, the care and feeding of your data is a prime task for the working data scientist.

I will describe my experiences in the field and present an open source utility written with Apache Spark that automates some of the necessary but insufficient things I do every time I'm presented with new data. In particular, we'll talk about discovering missing values, spotting values with skewed distributions, and finding likely errors within your data.
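As a minimal, hypothetical sketch of the kinds of checks Casey describes (his actual utility is built on Apache Spark; plain Python is used here so the idea stands on its own):

```python
import math

def profile(rows, columns):
    """Per-column missing-value rate, plus a skewness estimate for
    numeric columns (a large |skewness| hints at outliers or errors)."""
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        missing = sum(1 for v in values if v is None)
        numeric = [v for v in values if isinstance(v, (int, float))]
        skew = None
        if len(numeric) >= 2:
            mean = sum(numeric) / len(numeric)
            sd = math.sqrt(sum((v - mean) ** 2 for v in numeric) / len(numeric))
            if sd > 0:
                # Fisher-Pearson moment coefficient of skewness
                skew = sum(((v - mean) / sd) ** 3 for v in numeric) / len(numeric)
        report[col] = {"missing_rate": missing / len(rows), "skewness": skew}
    return report

rows = [
    {"age": 34, "income": 40_000},
    {"age": 29, "income": 42_000},
    {"age": None, "income": 41_000},
    {"age": 31, "income": 500_000},  # a likely data-entry error
]
report = profile(rows, ["age", "income"])
print(report["age"]["missing_rate"])            # 0.25
print(round(report["income"]["skewness"], 2))   # 1.15
```

The skewed `income` column immediately flags where a likely error hides, which is exactly the kind of automated first look the talk advocates.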

Bio

I am a committer and PMC member on the Apache Metron project in the engineering team at Hortonworks. In the past, I've worked as an architect and senior engineer at a healthcare informatics startup spun out of the Cleveland Clinic, as a developer at Oracle and as a Research Geophysicist in the Oil & Gas industry.

I specialize in writing software and solving problems where there are scalability concerns due to large amounts of traffic or large amounts of data. I have a particular passion for data science problems and anything mathematical.

Ben Yoskovitz

Founding Partner, Highline BETA

Product Management: Data + Guts

Every product starts with an idea, a gut instinct that tells us there’s a problem worth solving and that we know the answer. But so often the problems we think are real turn out to be unimportant, and the solutions we provide miss the mark. In the world of product management and building products people want, guts alone aren’t enough.

That’s where data comes in.

But swing too far towards data analysis and what gets lost? Products that capture people’s attention and genuinely add value to people’s lives have a bit of heart and soul mixed into them as well.

In this talk, I will discuss how to marry guts (+ qualitative feedback) and data together in order to successfully build great products. Data tells us what is happening, qualitative feedback tells us why. Data says, “You should go look at this thing, pay attention here!” But our guts and instincts are what keep us up at night and spark inspiration.

Bio

Benjamin Yoskovitz is an entrepreneur, investor and author. He recently launched Highline BETA, a startup co-creation company. Previously he was VP Product at VarageSale and GoInstant (acq. $CRM). He’s made 15+ angel investments, and founded an accelerator, Year One Labs. Ben is the co-author of Lean Analytics (published by O’Reilly), a book that combines Lean Startups and analytics to help startups and large companies build better businesses and products faster. He’s an active blogger at http://instigatorblog.com. You can also find Ben on Twitter @byosko.

Talks dedicated to data engineers and architects

Danny Yuan

Software Engineer, Uber

Realtime Stream Processing @Uber

This talk will discuss how stream processing is used in Uber's realtime system to solve a wide range of problems, including but not limited to revealing and visualizing dynamics of Uber's marketplace, performing complex computation on geospatial temporal data, and extracting patterns from data streams. This talk will also present the architecture of the stream processing pipeline with a focus on how and why the architecture has evolved into its current form.

Bio

Danny Yuan is a software engineer at Uber. He's currently working on data systems for Uber's logistics platform. Prior to joining Uber, he worked on building Netflix's cloud platform. His work there included predictive autoscaling, a distributed tracing service, a real-time data pipeline that scaled to process hundreds of billions of events every day, and Netflix's low-latency crypto services.

Shirshanka Das

Principal Staff Software Engineer, LinkedIn

Big Data Infrastructure @ LinkedIn

Ever wonder how LinkedIn ingests, organizes and analyzes the massive amounts of data generated by the world’s largest professional network? In this talk, Shirshanka will describe LinkedIn’s Big Data Infrastructure and its evolution through the years. The talk will also cover the motivations and architecture of LinkedIn’s latest open source contributions to Big Data: Gobblin for ingest, Pinot for querying and WhereHows for metadata.

Bio

Shirshanka is a Principal Staff Software Engineer and the architect for LinkedIn’s Data & Analytics team. He was among the original authors of a variety of open and closed source projects built at LinkedIn, including Databus, Espresso, and Apache Helix. He is currently working with his team on simplifying the big data analytics space at LinkedIn through a multitude of mostly open-source projects: Pinot, a high-performance distributed OLAP engine; Gobblin, a data lifecycle management platform for Hadoop; WhereHows, a data discovery and lineage platform; and Dali, a data virtualization layer for Hadoop.

Sergii Khomenko

Lead Data Scientist, Stylight

From Data Science to Production - deploy, scale, enjoy

Data science is quite a young field. One definition of a data scientist: a person who is better at statistics than any software engineer and better at software engineering than any statistician. Hence, it's important to talk not only about best practices for feature generation and avoiding overfitting, but also about software engineering topics.

The talk is based on our experience of data science development at Stylight, an international fashion e-commerce company that operates in 15 countries worldwide. We refer to our data applications written in R, Python, and Scala, but the content is not limited to those languages and is applicable to others.

The talk consists of three main parts. The first introduces development best practices: how to structure your development, how to make deployment easy and reproducible, and how to set up continuous integration and commit-triggered deployments. The second part covers production deployment to the AWS stack, focusing in particular on the concepts of immutable infrastructure and infrastructure as code. The last part is about using serverless architecture for data applications; as an example, we introduce our outlier detection system, which scales automatically based on this approach.

Bio

Data scientist at STYLIGHT, one of the biggest fashion communities. Data analysis and visualisation hobbyist, working on problems not only during working hours but also in his free time, for fun and for personal data visualisations.

Ex-deputy director/lecturer at HP International Institute of Technology, Kiev.

Mike Elsmore

Developer Advocate, IBM Cloud Data Services

NoSQL is a lie

NoSQL is a term on the rise, and it’s a lie. NoSQL is a catch-all term, and I will point out why a catch-all means missing tools that may help solve your problems. Over the last decade NoSQL has become a term everyone in development has heard of, often imagining it as a mysterious black box where JSON goes in and wonderful data comes out. And with such bewitching names as Redis, MongoDB, and HBase, we find it difficult to differentiate between the different databases and what they do. Walking through a few popular databases and the use cases they are good at, this talk will demystify NoSQL as a term, explain why the databases evolved the way they did, describe what kinds of databases exist, and finally give some reasons you’d use them.

Bio

Mike spends his days as a Developer Advocate at IBM Cloud Data Services, using his time to share knowledge on rapid development and different databases. Most of the time he can be found in the middle of a prototype involving some combination of JavaScript, server tech, and odd APIs.

Alex Dean

Co-founder, Snowplow Analytics

Asynchronous micro-services and the unified log

The unified log enabled by Apache Kafka and Amazon Kinesis has been mostly understood as a better data processing architecture, replacing traditional data warehousing techniques. But the unified log also enables a new way of building transactional software, by enabling asynchronous micro-services. In this talk, Alexander Dean will show how event-driven micro-services designed around Kafka or Kinesis resolve many of the issues associated with traditional monolithic and synchronous micro-service based architectures.
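As a rough illustration of the pattern Alex describes, here is a toy, in-memory stand-in for a Kafka topic or Kinesis stream (illustrative only, not Snowplow's code): each micro-service consumes the same append-only log at its own pace, so producers never call consumers directly and a lagging service can catch up by replaying from its own offset.

```python
class Log:
    """A toy append-only log standing in for a Kafka/Kinesis stream."""
    def __init__(self):
        self.events = []
    def append(self, event):
        self.events.append(event)
    def read_from(self, offset):
        return self.events[offset:]

class Service:
    """An asynchronous micro-service: never invoked directly, it just
    consumes the shared log starting from its own offset."""
    def __init__(self, log, handler):
        self.log, self.handler, self.offset = log, handler, 0
    def poll(self):
        for event in self.log.read_from(self.offset):
            self.handler(event)
            self.offset += 1

log = Log()
charged, receipts = [], []
billing = Service(log, lambda e: charged.append(e["order_id"]))
mailer = Service(log, lambda e: receipts.append("receipt for %d" % e["order_id"]))

log.append({"order_id": 1})
billing.poll()   # billing is up to date; mailer lags behind
log.append({"order_id": 2})
billing.poll()
mailer.poll()    # mailer catches up independently by replaying the log
print(charged, receipts)   # [1, 2] ['receipt for 1', 'receipt for 2']
```

The point of the sketch is the decoupling: neither service knows the other exists, which is what makes the unified log an alternative to synchronous service-to-service calls.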

Dirk Duellmann

Section Leader Analysis and Design, CERN

Understanding the computing for the Large Hadron Collider at CERN

The physics community at CERN has been analysing large volumes of physics data for many decades. More recently, statistical methods and machine learning have also been applied to computing infrastructure metrics to better understand and optimise the complex, distributed computing systems used for the Large Hadron Collider. This presentation will give an overview of established and new techniques and tools for supporting these analysis activities.

Bio

Dirk Duellmann leads the Analytics and Development section of CERN's Storage group. He is responsible for the design and evolution of CERN's high-performance disk pools for physics data analysis, and he chairs the working group for Infrastructure Analytics of CERN's IT department. Previously he led the Worldwide LHC Computing Grid (WLCG) projects for persistency framework development and for distributed database deployment.

Dirk joined CERN in 1995 after receiving a PhD in High Energy Physics from the University of Hamburg. Before that, he worked at several software companies on the development of database management systems and applications.

Wouter de Bie

During the presentation we will take a look at Spotify's move of 12,000+ servers from four data centers to Google's Cloud Platform. Even though Spotify is still in the midst of the migration, we already have a ton of lessons that we can share. Obviously, we will look at the "why", "what", and "how" of this enormous migration.

Bio

Wouter started his career at an early age as a Linux consultant in the Netherlands during the dot-com era. In 2009 Wouter decided to move to Sweden for personal reasons and worked as a Ruby developer and system administrator at Delta Projects, one of Sweden’s biggest online ad serving companies, before he decided to join Spotify in 2011.
Wouter is currently working as a data architect at Spotify where he helps teams in building the next iteration of the big data platform and migrating from Spotify's on-premise infrastructure to Google's Cloud Platform.

Yash Nelapati

Behind the Scenes of Pinterest

A journey behind the scenes of Pinterest, the product and its engineering: how a company that saw an insane amount of growth went through engineering changes and evolved into a data-informed company.

Bio

Yash Nelapati is the founding engineer of Pinterest. He built the initial version of Pinterest and has scaled it to millions of users over the last five years. Over that journey he worked on various problems related to user growth, infrastructure scalability, and product design. Outside of work he spends a lot of time shooting landscapes.

Sean Braithwaite

SoundCloud

Mechanics of data pipelines

This talk focuses on how to model data pipelines as retroactive, immutable data structures. Essentially, it covers how to build data pipelines for a growing organization where different teams depend on each other's data and need to be able to re-process data when errors occur upstream.

Bio

Sean Braithwaite is a data engineer based in Berlin. For the past 8 years he's been using data for everything from data-driven art installations to real-time ad bidding. Most recently he's been responsible for scaling SoundCloud's data pipeline to handle billions of events per day.

Talks dedicated to data scientists and managers

Marton Trencseni

Data Engineer, Facebook

Data Science in Facebook Product Teams

At Facebook, “data makes decisions”. I will talk about the role of data science and data engineering in how Facebook builds and ships product: how data scientists and engineers work with their product teams, how metrics are used (and what they are), the role of dashboards and the art of building a good dashboard, and when and how to use different data sources.

Bio

Marton is a Data Engineer at Facebook and leads data engineering on the Facebook at Work product. Previously he was Director of Data at Prezi, and worked on building the data infrastructure and the analytics team. Marton speaks regularly on data infrastructure, team building and A/B testing related topics. He holds degrees in Computer Science and Physics.

Dan McKinley

Data Driven Products Now!

How do you decide what to build? There’s the rub. Many successful companies start out as a single person moved by the Muse. They gain traction on the back of this and possibly a few adjacent inspired ideas. But this very rapidly gets way out of hand. This is the story of how we learned to move beyond prioritizing by gut instinct as we scaled Etsy from a dozen people to over 700.

Bio

After starting his career in finance, Dan McKinley freaked out and moved to Brooklyn. He stumbled into a fledgling Etsy.com in 2007, and spent his first years there trying to stop overwhelming traffic from reducing the site to its constituent elements. In the long summer that followed he worked on activity feeds, search, recommendations, experimentation, and analytics. Dan worked at Stripe for a while, before moving on to co-found Skyliner.io along with Coda Hale and Marc Hedlund.

Elena Verna

VP of Growth and Analytics, SurveyMonkey

Pricing Page Optimization

The pricing page belongs to one of the most important funnels on your site and should be closely monitored and optimized. Learn what you need to know about user behavior on the pricing page so you know where to focus your A/B testing resources.

Bio

Elena is SVP of Growth for SurveyMonkey. She is responsible for acquisition, conversion, and retention metrics, which includes managing the billing platform, CRM, and SEO/SEM channels. Elena also runs the Growth Hacking, A/B testing, and Analytics teams.

Jeroen Janssens

Assistant Professor of Data Science, Tilburg University

The Polyglot Data Scientist

A polyglot is a person who knows and is able to use several languages. It’s generally good advice to stick to one programming language or one computing environment. The code will most likely be more consistent, more stable, and easier to maintain. However, sometimes, especially for exploratory data science projects, it can be more effective or efficient to mix and match. For instance, consider the situation where you want to make use of a fast machine-learning library. It turns out that this library is written in C++, but you work in R, and there are no language bindings available yet. Or consider the situation where you know how to solve a particular sub-problem in R, but your collaborator is using another language.
Jeroen Janssens discusses three approaches to become a polyglot data scientist. Jeroen first explores Beaker Notebook, which allows you to use multiple languages (Python, R, JavaScript, Julia, etc.) in one notebook. He then looks at several language-specific ways of combining programming languages (e.g., how to load R data into MATLAB, how to use a MATLAB package in Python, and how to call Python functions from R). This list of combinations is not exhaustive, but it will give you a good idea of the possibilities. Finally, Jeroen explains how to write your own reusable command-line tools and employ command-line tools directly from Python and R. The command line is language agnostic, which means that you can combine tools written in just about any language. With a few simple steps, it’s possible to turn your existing code into a command-line tool.
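The command-line glue Jeroen describes can be sketched in a few lines. In this hypothetical example (not from the talk), Python delegates numeric sorting to the standard Unix `sort` tool; R users can do the same via `system2()`, which is what makes the command line a language-agnostic meeting point.

```python
import subprocess

# Any tool that reads stdin and writes stdout can be driven from any
# host language. Here Python shells out to the Unix `sort` tool.
result = subprocess.run(
    ["sort", "-n"],          # could be a tool written in any language
    input="3\n1\n2\n",
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.splitlines())   # ['1', '2', '3']
```

Swap `["sort", "-n"]` for any command-line tool, whatever language it happens to be written in, and the host program never needs to know.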

Bio

Jeroen Janssens is an assistant professor of data science at Tilburg University. As an independent consultant and trainer, Jeroen helps organizations make sense of their data. Previously, he was a data scientist at Elsevier in Amsterdam and the startups YPlan and Outbrain in New York City. Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University. He is the author of Data Science at the Command Line, published by O’Reilly Media. He blogs at jeroenjanssens.com and tweets as @jeroenhjanssens.

Michael Hunger

Head of Developer Relations, Neo4j

Enabling the Panama Papers Investigations with Open Source Tools

The biggest leak in journalistic history has been not only mind-blowing for everyone, but also challenging for the team of journalists and developers known as the ICIJ. With more than 11M documents totaling 2.6 TB of information, it is truly impressive that a small team of 3 developers could support more than 400 journalists in a year’s worth of investigative work. This became possible through the efficient use of open source technology for scanning and extracting text and metadata from the documents. The biggest difference, though, was made by the power of a graph database to connect the people, companies, and accounts revealed in the investigation.

Especially for the non-technical journalists, the ability to unearth all those connections "was like magic". Collaborating on research, they benefited from each other’s work and saw the bigger picture grow more interesting every day.

In this talk I want to detail the process and the technologies used by the journalists for their investigative work, including Apache Solr, Apache Tika, and Neo4j. Then I will focus on their work with Neo4j, the data model they developed, and the types of queries and interactions that helped them grow their understanding. We will discuss how tools for visual graph exploration and search enable even non-technical users to benefit from working with large amounts of connected data.

Using the officially published dataset of 3.4M records, we demonstrate how new insights into your existing, disconnected data are just one graph query away.

Bio

Michael Hunger has been passionate about software development for a very long time.

For the last few years he has been working with Neo Technology on the open source Neo4j graph database filling many roles. As caretaker of the Neo4j community and ecosystem he especially loves to work with graph-related projects, users and contributors.

As a developer, Michael enjoys many aspects of programming languages, learning new things every day, participating in exciting and ambitious open source projects, and contributing to and writing software-related books and articles.

Szilard Pafka

Chief Data Scientist, Epoch

No-Bullshit Data Science

While extracting business value from data has been performed by practitioners for decades, the last several years have seen an unprecedented amount of hype in this field. This hype has created not only unrealistic expectations about results, but also glamour around the newest tools, supposedly capable of extraordinary feats. In this talk I will apply the much-needed methods of critical thinking and quantitative measurement (which data scientists are supposed to use daily in solving problems for their companies) to assess the capabilities of the most widely used software tools for data science. I will discuss two such analyses in detail: one concerning the size of datasets used for analytics, and the other regarding the performance of machine learning software used for supervised learning.

Bio

Szilard studied Physics in the 90s in Budapest and obtained a PhD using statistical methods to analyze the risk of financial portfolios. He then worked in finance, quantifying and managing market risk. A decade ago he moved to California to become the Chief Scientist of a credit card processing company, doing what is now called data science (data munging, analysis, modeling, visualization, machine learning, etc.). He is the founder/organizer of several data science meetups in Santa Monica, and he is also a visiting professor at CEU in Budapest, where he teaches data science in the Masters in Business Analytics program.

Jonathan Magnusson

Analytics Team Lead, King

Data is King

This talk will cover how data from a network of over 300 million players has helped King optimise some of the largest mobile games in the world. Key learnings will be shared, along with examples of how data has supported difficult decisions.

Bio

Jonathan leads a team of data scientists at King, helping the Candy Crush Franchise team improve some of the largest mobile games in the world. His main focus is evaluating product features through hypothesis driven testing, supporting the game teams in delivering the best quality and player experience possible.

Colin McFarland

Head of Experimentation, Skyscanner

Experimentation at Skyscanner

Colin will share Skyscanner’s philosophy for experimentation in a broader sense, before delving into intuition and bias, the common pitfalls observed scaling to hundreds of concurrent experiments, and how changing the attitude to failure is crucial to succeed. You can read about his work at Skyscanner on CodeVoyagers.com.

Bio

Colin McFarland is Head of Experimentation at Skyscanner. He leads the development of Skyscanner's in-house experimentation platform (Dr Jekyll), as well as working to foster a data-driven culture at Skyscanner. Prior to Skyscanner he initiated and scaled experimentation at Shopdirect and Rentalcars. In the past he has spoken at (amongst others) Design It Build It, Lean Conf, Conversion Conference, Spotify Design+Data, Microsoft One Analyst, and Imperial College London.

Dharmesh Desai

Technology Evangelist, Qubole

Creative Data Science Trends that Transform Marketplaces

As more and more companies use data as a resource to gain insights and competitive advantage in the marketplace, data science and its applications have been at the core of this trend, with data scientists leading the way. Besides being able to leverage technology and tools, a deep sense of curiosity and asking the right questions are other important traits of a successful initiative.

In this talk, Dash Desai, Technology Evangelist at Qubole, will briefly discuss how enterprises are using data science in creative ways. He will then highlight the top reasons why on-prem Big Data deployments fail and therefore get in the way of innovation, followed by a review of the advantages of migrating to the cloud and how it speeds up and simplifies data science applications in a cost-effective manner.

Bio

As a technology evangelist, he is passionate about evaluating new ideas and trends, and helping articulate how a technology would address a given business problem. He also enjoys writing and being hands-on to create compelling demos and applications.

As a full-stack developer, he has participated in all phases of the software development life cycle. He started his career working for global enterprises, developing applications for the city of Nashville, TN and the state of Colorado. Since then, he has been working as an engineer and solutions architect in agile environments for tech startups in the Bay Area in varying verticals such as VoIP, online gaming, digital health, NoSQL databases, and Big Data platforms.

He lives in Willow Glen-San Jose, California with his partner Claire and daughter Eva. He enjoys spending time with family and friends and his other passions are: high-altitude hiking, downhill skiing, travel, and photography.

Jeff Magnusson

Director of Data Platform, Stitch Fix

Engineers Shouldn’t Write ETL: Optimizing Data Science Teams

Often data science teams are organized in such a way that data scientists and engineers hand off work at various phases: data acquisition (ETL), research and prototyping, and productionisation and support. What happens when data scientists and engineers are instead given full autonomy and ownership of their work? This talk will explore the benefits and drawbacks of such an approach; the platform, frameworks, and infrastructure that are required to successfully support it; and the relationship between data scientists and engineers in such an organization.

Bio

Jeff Magnusson is the Director of Data Platform at Stitch Fix, where he leads a team of engineers in creating a robust, scalable, cutting-edge data platform on AWS using tools like Scala, Spark, Presto, Elasticsearch, and Kinesis. The platform supports a staff of around 80 data scientists in data acquisition, research, analysis, and deployment of algorithms and ideas that influence nearly all aspects of the business. Prior to that, Jeff was a Manager of Data Platform Architecture at Netflix. While there, he led a team whose mission was to make big data as easy and efficient to use as possible across the organization.

Miklos Peter Mader

Be a magenta stormtrooper!

You are the people who understand, use, and develop services from data streams. We are the people who believe that cool products can be built from telco data. Would you like to collaborate with us?

Bio

Miki would like to expand Magyar Telekom’s product portfolio with unique and innovative services. Miki spends his days identifying and implementing mainly non-core business opportunities. Startups, trends, universities, R&D, partners, customer needs, new technologies, and financial feasibility are the main elements of this task.

Dharmesh Desai

Technology Evangelist, Qubole

Qubole Data Service Demo

In this hands-on and interactive session, you will get an overview of Qubole Data Service (QDS)—the leading Big Data platform in the cloud. We will see how quickly you can start analyzing data stored in the cloud and run workloads using multiple Big Data processing engines.

No downloads. No installations. No data migrations.

Key session takeaways:

Learn how to create a unified Hive metastore in QDS

Learn how to run Hive, Presto, and SparkSQL queries using the unified metastore in QDS

János Zrak & Ákos Horváth

GE Healthcare

GE Health Cloud MDT Meeting

GE Health Cloud represents an ecosystem of cloud-based applications with high scalability, robust security, and solid reliability designed to offer advanced clinician collaboration, seamless device integration, and comprehensive end-user controls and data management. GE Health Cloud MDT Meeting is a web-based solution built on top of Centricity Case Exchange and is used to prepare and support the meeting(s) of a group of professionals from one or more clinical disciplines who together make decisions regarding recommended treatment of individual patients.

Workshops (5 Oct, Wed)

Lean Analytics: How Data Drives Business Success

Ben Yoskovitz, Founding Partner, Highline BETA

In this hands-on workshop, Ben Yoskovitz will take participants through a process of first identifying what makes a good metric, and then how good metrics, their measurement and use can be applied to different types of businesses. Participants will learn the basic principles of Lean Analytics, including the Lean Analytics Stages, Lean Analytics Cycle and more. Participants will be actively involved in mapping business models, discussing what metrics matter, and how data helps drive business success.

Bio

Benjamin Yoskovitz is an entrepreneur, investor and author. He recently launched Highline BETA, a startup co-creation company. Previously he was VP Product at VarageSale and GoInstant (acq. $CRM). He’s made 15+ angel investments, and founded an accelerator, Year One Labs. Ben is the co-author of Lean Analytics (published by O’Reilly), a book that combines Lean Startups and analytics to help startups and large companies build better businesses and products faster. He’s an active blogger at http://instigatorblog.com. You can also find Ben on Twitter @byosko.

Crunching Data at the Command Line

We data scientists love to create exciting data visualizations and insightful statistical models. However, before we get to that point, much effort usually goes into obtaining, scrubbing, and exploring (in other words: crunching) the required data.

The command line, although invented decades ago, is an amazing environment for performing such data science tasks. By combining small, yet powerful, command-line tools you can quickly explore your data and hack together prototypes. New tools such as parallel, jq, and csvkit allow you to use the command line for today's data challenges. Even if you're already comfortable processing data with, for example, R or Python, being able to also leverage the power of the command line can make you a more efficient data scientist.

Topics covered include:

Turning existing code, such as Python or R, into reusable command-line tools

Computing aggregate statistics

Creating data visualizations

This workshop is aimed at data scientists, data engineers, data journalists, and everyone else who has an affinity with data. We will make use of the Data Science Toolbox, a free, open-source virtual environment that contains all the necessary command-line tools. The Data Science Toolbox runs not only on Linux but also on Mac OS X and Microsoft Windows, so everybody is able to follow along. Whether you're entirely new to the command line or already dreaming in shell scripts, by the end of this workshop you will have a solid understanding of how to integrate the command line into your data science workflow.
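One of the topics above, turning existing Python code into a reusable command-line tool, can be sketched as follows. This is a hypothetical example, not part of the workshop materials: a small script that reads numbers from standard input and prints aggregate statistics, so it can slot into a Unix pipeline.

```python
# Hypothetical sketch of one workshop topic: wrapping a Python snippet as a
# command-line tool that reads from stdin, so it can participate in a Unix
# pipeline, e.g.:  seq 1 100 | python stats.py
import io
import sys

def aggregate(numbers):
    """Return simple aggregate statistics for a sequence of floats."""
    n = len(numbers)
    total = sum(numbers)
    return {"count": n, "sum": total, "mean": total / n if n else 0.0}

def main(stream=sys.stdin):
    # One number per line, blank lines skipped -- the same line-oriented
    # contract that tools like awk or csvstat expect.
    values = [float(line) for line in stream if line.strip()]
    print("count={count} sum={sum:.2f} mean={mean:.2f}".format(**aggregate(values)))

# Demo with an in-memory stream so the sketch is self-contained; in a real
# pipeline you would call main() with no arguments.
main(io.StringIO("1\n2\n3\n4\n"))  # prints: count=4 sum=10.00 mean=2.50
```

Because the tool reads from stdin and writes to stdout, it composes freely with sort, uniq, jq, csvkit, and the other tools the workshop covers.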

Bio

Jeroen Janssens is an assistant professor of data science at Tilburg University. As an independent consultant and trainer, Jeroen helps organizations make sense of their data. Previously, he was a data scientist at Elsevier in Amsterdam and the startups YPlan and Outbrain in New York City. Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University. He is the author of Data Science at the Command Line, published by O’Reilly Media. He blogs at jeroenjanssens.com and tweets as @jeroenhjanssens.

Apache Spark Essentials

(official Databricks workshop)

Zoltan C. Toth, Spark instructor, Databricks

Apache Spark Essentials will help you get productive with the core capabilities of Spark, and it also provides an overview of, and examples for, some of Spark’s more advanced features. This full-day course features hands-on technical exercises so that you can become comfortable applying Spark to your own datasets: you will gain practical experience with ETL, exploration, and analysis using real-world data.

Prerequisites:
This class doesn't require any prior Spark knowledge. Some experience with Python and some familiarity with big data or parallel-processing concepts are helpful.

A Big Data adventure in Google Cloud Platform

Csaba Kassai & Sub Szabolcs Feczak & Lajos Gathy

If you work with huge amounts of data, from either the analyst or the developer side, and you are always looking for what’s next in Big Data technologies, come join us and explore Google Cloud Platform’s comprehensive Big Data solution at Doctusoft’s full-day workshop.

Learn how to build, execute, and visualize your Big Data projects more easily and in less time with Google Cloud Platform’s four main Big Data products: BigQuery, Pub/Sub, Dataflow, and Datalab.

See how these products link to each other by following an example project that participants work on together.

Learn about real business use cases and project experiences.

Get the full picture of how these Big Data products differ from other well-known solutions, and learn which one to choose for your business needs or technological requirements.

Whether you come from a small start-up or a big multinational company, this workshop is for anyone who wants to learn first-hand how to run a Big Data project on Google Cloud Platform.

Participants should have a technology background, a basic understanding of their current business model, and be open to sharing their thoughts and questions.

The workshop is not only a brief introduction to Google’s Big Data solutions; it also covers several topics from the Google Cloud Platform Qualified Data Analyst (CPE201) certification exam.

Participants will need to bring their own laptops and have a Google account. Further information about the technical environment will be communicated after registration.

Csaba Kassai

Csaba has been a software architect at Doctusoft ‒ the only Google Cloud Platform partner in Hungary ‒ for 5 years. He has participated in several Big Data projects, solving the problems of different retail, telecommunication, and start-up companies using Google and Hadoop technologies. He has also worked for one of the biggest banks in Hungary on Big-Data-focused projects such as optimizing the query time of the transaction history database with ElasticSearch. Csaba’s main professional interests are Google’s Big Data products and their related programming languages and database technologies.

Sub Szabolcs Feczak

“Sub” started studying computer science at the age of 14 and has maintained an avid interest in the field ever since. At Google, Sub focuses on Cloud products as a Technical Solutions Engineer and a Dataflow product engagement lead, acting as the voice of customers in front of product management and software engineering. Before joining Google, Sub scaled a voice-over-IP start-up from zero to six thousand calls per month on an infrastructure he designed. Sub is interested in sustainability, global economics, and social graph analysis.

Lajos Gathy

Lajos Gathy is co-founder of Doctusoft and responsible for the company's professional activity. He graduated from Budapest University of Technology and Economics with honors in Computer Engineering and has gained national and international expertise as a developer and as a software architect. He is interested in scalable data processing, distributed software design, and cloud computing. Lajos is committed to performance optimization and code reuse. As CTO of Doctusoft, the only Google Cloud Platform (GCP) partner in Hungary today, he is first in line to adopt the latest features of GCP in various internal and partner projects. His primary goal is to make BigQuery and Cloud Dataflow popular with Doctusoft's clients.