Understand how to use R and Hadoop to manage big data

This course will give you access to a virtual environment with installations of Hadoop, R and RStudio, so you can get hands-on experience with big data management. Several unique examples from statistical learning, together with related R code for map-reduce operations, will be available for testing and learning.

Those with basic knowledge of statistical learning and R will gain a better understanding of the underlying methods and of how to run them in parallel using map-reduce functions and Hadoop data storage. At the end of the course you will get access to RHadoop on a supercomputer at the University of Ljubljana.

0:25 Nearly every historical period may be said to have had sources of data that were considered big for that time. Books, documents, drawings, maps and paintings are examples of such data. Yet it is only today that we have to deal with really big data. Luckily, more and more data is digital, but expressed in different formats. Large-scale scientific instruments, social network platforms, cloud solutions and digital cultural heritage are only a few examples of sources of huge amounts of text, photo, video and audio materials that are considered big data.

0:55 But the questions related to data have not changed much: how to store and maintain it, how to understand it, and how to learn from it for an improved response in the future. These issues necessarily involve the use of high-performance computers. Distributed storage and parallel computing need to be considered to avoid loss of data and to make computations efficient.
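As a taste of what such parallel computations look like in R, here is a minimal map-reduce sketch using the `rmr2` package from the RHadoop project. This is an illustrative sketch, not course material: the local backend setting lets it run without a live Hadoop cluster, and the choice of task (summing even and odd numbers) is an assumption made for the example.

```r
# Minimal map-reduce sketch with the rmr2 package (RHadoop).
# Assumes rmr2 is installed; the "local" backend runs the job
# on the local filesystem instead of a live Hadoop cluster.
library(rmr2)
rmr.options(backend = "local")

# Store a small numeric sample in the (H)DFS.
input <- to.dfs(1:100)

# Map: key each value by its remainder mod 2.
# Reduce: sum the values collected under each key.
result <- mapreduce(
  input  = input,
  map    = function(k, v) keyval(v %% 2, v),
  reduce = function(k, vv) keyval(k, sum(vv))
)

# Retrieve the results back into the R session:
# key 0 -> 2550 (sum of evens), key 1 -> 2500 (sum of odds).
from.dfs(result)
```

On a real cluster the same `mapreduce()` call runs unchanged; only the backend option and the data location differ, which is the point of the map-reduce abstraction.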

Apply knowledge of statistical learning to instances of data provided by the educators

How to do big data management with RHadoop on a real supercomputer provided by the University of Ljubljana

Who is the course for?

This course is for people with basic experience of Linux, Bash and R who can download and run a virtual machine. You might be interested in data science, computational statistics and machine learning, and have some basic experience with them.

It will also be useful for advanced undergraduate students and first-year PhD students in data analysis, statistics or bioinformatics who wish to understand how to manage big data with Hadoop using the R programming language.

What software or tools do you need?

All the software needed to actively participate in the course is provided within the virtual machine that you download and run on your local machine. No extra software is needed.

You will need a modest local machine with 15 GB of free disk space and 2 GB of free RAM. After completing two weeks of exercises, you can get access to RStudio for big data on a real HPC cluster.