Posts

This is the presentation for Rapid Cluster Computing with Apache Spark session I did in Oracle Week few weeks ago.

I wrote about the Oracle Week conference in a previous post so I won’t go over that again – this was my 3rd session of that week.

Although Oracle Week was for years about Oracle related products, this year they decided to open it up for other technologies as well. They had NoSQL sessions, Hadoop sessions, and even open stack sessions (including ElasticSearch and others). I was fortunate enough to be accepted to give this session which was about Apache Spark.

Apache Spark is the ad-hoc solution for every new Big Data project we encounter in the last year or so. Spark is a cluster solution which uses the Map Reduce paradigm without the need for a Hadoop cluster. It is based on handling the different map-reduce function in-memory and orchestrate everything internally. If you do have a Hadoop deployment, it can interact with it very easily using its internal master or using YARN instead.

This seminar is an introduction level for Oracle DBAs and other database developers. It’s main goal is to get to know this amazing technology. In the session we go over Spark Core, RDDs, how to develop for clusters and some behind the scenes for better understanding. Some of the presentation does require programming backgrounds (Java, Scala, Python) but I tried to cut it to the minimum.

To my surprise, this seminar had around 35 percipient (which was way more than I expected), and got 4.91/5 in the feedback. I presented a similar session in another conference back in March but this is an updated version of that slide deck.

I’ve been meaning to write about Apache Spark for quite some time now – I’ve been working with a few of my customers and I find this framework powerful, practical, and useful for a lot of big data usages. For those of you who don’t know about Apache Spark, here is a short introduction.

Apache Spark is a framework for distributed calculation and handling of big data. Like Hadoop, it uses a clustered environment in order to partition and distribute the data to multiple nodes, dividing the work between them. Unlike Hadoop, Spark is based on the concepts of in-memory calculation. Its main advantages are the ability to pipeline operations (thus breaking the initial concept of a single map-single reduce in the MapReduce framework), making the code much easier to write and run, and doing it in an in-memory architecture so things run much faster.

Hadoop and Spark can co-exist, and by using YARN – we get many benefits from that kind of environment setup.

Of course, Spark is not bulletproof and you do need to know how to work with it to achieve the best performance. As a distributed application framework, Spark is awesome – and I suggest getting to know with it as soon as possible.

I will probably make a longer post introducing it in the near future (once I’m over with all of my prior commitments).
In the meantime, here is a short explanation about how to connect from Spark SQL to Oracle Database.

Update: here is the 200 long slides presentation I made for Oracle Week 2016: it should cover most of the information new comers need to know about spark.Read more →

Cookie and Privacy Settings

How we use cookies

We may request cookies to be set on your device. We use cookies to let us know when you visit our websites, how you interact with us, to enrich your user experience, and to customize your relationship with our website.

Click on the different category headings to find out more. You can also change some of your preferences. Note that blocking some types of cookies may impact your experience on our websites and the services we are able to offer.

Essential Website Cookies

These cookies are strictly necessary to provide you with services available through our website and to use some of its features.

Because these cookies are strictly necessary to deliver the website, you cannot refuse them without impacting how our site functions. You can block or delete them by changing your browser settings and force blocking all cookies on this website.

Google Analytics Cookies

These cookies collect information that is used either in aggregate form to help us understand how our website is being used or how effective our marketing campaigns are, or to help us customize our website and application for you in order to enhance your experience.

If you do not want that we track your visist to our site you can disable tracking in your browser here:

Click to enable/disable google analytics tracking.

Other external services

We also use different external services like Google Webfonts, Google Maps and external Video providers. Since these providers may collect personal data like your IP address we allow you to block them here. Please be aware that this might heavily reduce the functionality and appearance of our site. Changes will take effect once you reload the page.

Google Webfont Settings:

Click to enable/disable google webfonts.

Google Map Settings:

Click to enable/disable google maps.

Vimeo and Youtube video embeds:

Click to enable/disable video embeds.

Privacy Policy

You can read about our cookies and privacy settings in detail on our Privacy Policy Page.