What are some of the most popular data science tools, how do you use them, and what are their features? In this course, you'll learn about Jupyter Notebooks, RStudio IDE, Apache Zeppelin, and Data Science Experience. You will learn what each tool is used for, which programming languages it can execute, and its features and limitations. With the tools hosted in the cloud on Cognitive Class Labs, you will be able to test each tool and follow instructions to run simple code in Python, R, or Scala. To end the course, you will create a final project with a Jupyter Notebook on IBM Data Science Experience and demonstrate your proficiency preparing a notebook, writing Markdown, and sharing your work with your peers.

Taught by

Polong Lin

Data Scientist

Transcription

Welcome to Zeppelin for Scala. In this video, we'll run through the Zeppelin tutorial for Scala, which reads data from a comma-separated values (.csv) file and uses Spark to convert it into a Spark DataFrame. Then we query the data with a SQL command and visualize it.

To get started, just click the Zeppelin Notebook button on the main page. The Zeppelin welcome page opens. Then click the Tutorial for Scala link. This launches the Notebook that we'll run through. The first part of the tutorial describes the Interpreter Binding settings, namely %spark (the default), %md, %angular, and %sh. These are the kernels, or interpreters, that you select to be available in your Notebook. As a user, you can simply click Save.

The first step in the tutorial downloads the bank.csv file. You'll need to import the packages that are needed for this tutorial. The second step converts the .csv file to an RDD by running the script. The third step cleans the data using map and filter, which does three things: first, it creates an RDD of tuples from the original bank text; second, it creates the schema, a case class that defines the name and type of each column; and third, it applies the schema to the RDD.

The fourth step creates the DataFrame using the toDF function. A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database, or a data frame in R or Python, but with richer optimizations under the hood. The fifth step registers the DataFrame as a temporary table. The sixth step retrieves the data: now the bank table can be easily queried with SQL commands. And finally, the seventh step visualizes the data. Some basic charts are already included in Zeppelin.

Please feel free to create your own Notebook from the Notebook menu. Getting some practice is always helpful. This brings us to the end of this video. Thanks for watching.
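The steps described above can be sketched as a single Scala paragraph. This is a sketch in the shape of Zeppelin's bundled Spark tutorial, not a verbatim copy: the bank.csv field layout (semicolon-delimited, quoted strings, a header row whose first field is "age") and the selected column indices are assumptions, and `sc` is the SparkContext that Zeppelin injects into the notebook, so this code runs inside a Zeppelin %spark paragraph rather than standalone.

```scala
import org.apache.spark.sql.SQLContext

// Step 2: read the downloaded CSV into an RDD of raw lines.
val bankText = sc.textFile("bank.csv")

// Step 3 (part of): the schema — a case class naming each column and its type.
case class Bank(age: Integer, job: String, marital: String,
                education: String, balance: Integer)

// Step 3: split each line on ";", drop the header row, strip quotes,
// and apply the schema by mapping each row into a Bank instance.
// Step 4: toDF turns the resulting RDD into a DataFrame.
val bank = bankText
  .map(s => s.split(";"))
  .filter(s => s(0) != "\"age\"")
  .map(s => Bank(
    s(0).toInt,
    s(1).replaceAll("\"", ""),
    s(2).replaceAll("\"", ""),
    s(3).replaceAll("\"", ""),
    s(5).replaceAll("\"", "").toInt))
  .toDF()

// Step 5: register the DataFrame as a temporary table named "bank",
// so later %sql paragraphs can query it.
bank.registerTempTable("bank")
```

After this paragraph runs, a separate %sql paragraph such as `select age, count(1) from bank where age < 30 group by age` queries the temporary table (step 6), and Zeppelin's built-in charts render the result (step 7).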