
This tutorial shows you how to create interactive Jupyter notebooks on DNAnexus.

One way to think of Jupyter notebooks is as digital lab notebooks. By interweaving rich markdown text, code, and graphics in a single notebook file, users can create a record of their data analysis workflow. The notebook file can then be shared with collaborators, who can follow along with the original analysis or add to the work. A thorough description of the features and benefits of Jupyter notebooks can be found in this excellent article.

Creating Your First Jupyter Notebook on DNAnexus

Initial setup

Because notebooks must currently be started from the command line, the first step is to make sure you have installed the DNAnexus SDK, dx-toolkit.

Once dx-toolkit is installed, set up SSH keys for your DNAnexus account by running:

$ dx ssh_config

Note: You only have to run dx ssh_config once, so if you have already established SSH keys for your DNAnexus account (in order to SSH into jobs or run a cloud workstation, for instance), you don't need to do this step.
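Before running the commands in this tutorial, it can help to confirm that the dx client is on your PATH. The following is a minimal sketch of such a check. The helper names have_command and check_dx are hypothetical and are not part of dx-toolkit.

```shell
#!/bin/sh
# Return success if the named command can be found on PATH.
have_command() {
  command -v "$1" >/dev/null 2>&1
}

# Report whether the dx command-line client is available.
# (Hypothetical helper for illustration; not shipped with dx-toolkit.)
check_dx() {
  if have_command dx; then
    echo "dx found: ready to run 'dx ssh_config' and 'dx notebook'"
  else
    echo "dx not found: install dx-toolkit first"
  fi
}

check_dx
```

The check only looks for the executable. It does not verify that you are logged in or that SSH keys have been configured.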

Launching a Jupyter notebook

After completing the initial setup, creating a notebook is straightforward. First, select the project in which you would like to launch your notebook.

$ dx select project-xxxx

Next, start the notebook by running:

$ dx notebook jupyter

A new worker instance will launch to host your Jupyter notebook session. Once the worker is running, your default web browser will open and display the Jupyter notebook.

Command-line options

A few command-line options are available for configuring your Jupyter notebook session. The full list of options is shown in the help output of the dx notebook command.

Each option is described briefly below:

Notebook type: The only required option is the type of notebook to launch. Two notebook types are currently available:

jupyter_notebook (can also be specified as just jupyter): This is the traditional Jupyter notebook interface.

jupyter_lab: JupyterLab is a preview Jupyter environment that provides an interface familiar to users of the RStudio environment.

notebook_files: Any files specified will be automatically downloaded to the worker running your notebook environment. If you forget to specify a file when you launch your notebook, you can always download additional files from within the Jupyter environment.

spark: This is primarily an experimental flag. If you have a need to run a notebook session on a Spark cluster, please feel free to contact us.

port: By default, the session is set up to communicate with the notebook server over port 2001. This should be fine for the vast majority of users, but if this port is in use or blocked on your network, you can specify an alternative port with this option.

snapshot: During a notebook session, you will have the ability to create snapshots of your current worker, capturing the files and modules placed on the worker since the Jupyter notebook began. By providing a snapshot at the launch of your Jupyter notebook, your notebook will begin with all of the same files and modules present when the original snapshot was taken.

timeout: By default, the notebook is configured to shut down automatically after 1 hour. If you know that you'll want to keep the session open for a longer or shorter period, you can specify that here. This option accepts the following suffixes: (s)econd, (m)inute, (h)our, (d)ay, (w)eek, (M)onth, and (y)ear. You can also change the timeout from within a running notebook session. See below for more details.

version: When you run dx notebook, you will run the most recent version of the notebook app. If you should ever need to run an older version of the app, you can specify the version number with this option.

instance_type: By default, your Jupyter notebook session will run on a mem1_ssd1_x4 instance type, but you can use this option to select any other instance type.
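As an illustration of how these options combine, the sketch below assembles a dx notebook command line and prints it without running it. Only --timeout is named as a flag in this tutorial. The --port spelling is an assumption based on the option name, and build_notebook_cmd is a hypothetical helper, so check the help output of your installed dx-toolkit before relying on it.

```shell
#!/bin/sh
# Print (but do not run) a dx notebook command line.
# ASSUMPTION: --port is spelled like the option name described above;
# only --timeout is confirmed by this tutorial.
build_notebook_cmd() {
  notebook_type=$1   # jupyter_notebook, jupyter, or jupyter_lab
  timeout=$2         # for example 4h
  port=$3            # for example 2001

  # Reject anything other than the notebook types listed in this tutorial.
  case $notebook_type in
    jupyter|jupyter_notebook|jupyter_lab) ;;
    *) echo "unknown notebook type: $notebook_type" >&2; return 1 ;;
  esac

  printf 'dx notebook --timeout %s --port %s %s\n' \
    "$timeout" "$port" "$notebook_type"
}

build_notebook_cmd jupyter_lab 4h 2001
```

Printing the command first makes it easy to review the options before launching a worker, which is billed for as long as the session runs.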

Controlling Your Dx Notebook Session

Each notebook session is configured to terminate after a fixed period of time. By default, the timeout is one hour, although that value can be changed with the --timeout command-line option. It is not unusual, however, to want to change the timeout while working inside a notebook. Users may decide to extend their session while working on a particular analysis, or they may opt to terminate the session early. Either way, they can modify the timeout value by running dx-set-timeout. For instance, to extend the timeout by a day, run:

$ dx-set-timeout 1d

The time remaining until the session terminates can be determined by running:

$ dx-get-timeout
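To reason about how long a value such as 1d or 90m lasts, the suffixes listed for the timeout option can be converted to seconds. The sketch below is one way to do that. The helper timeout_to_seconds is hypothetical, and treating a month as 30 days and a year as 365 days is an assumption, since this tutorial does not say how the platform interprets those units.

```shell
#!/bin/sh
# Convert a timeout such as 90m or 1d into seconds.
# ASSUMPTION: a month (M) is 30 days and a year (y) is 365 days.
timeout_to_seconds() {
  value=$1
  number=${value%?}          # everything except the last character
  suffix=${value#"$number"}  # the last character

  # The numeric part must be a non-empty string of digits.
  case $number in
    ''|*[!0-9]*) echo "invalid timeout: $value" >&2; return 1 ;;
  esac

  # Suffixes are case-sensitive: m is minutes, M is months.
  case $suffix in
    s) factor=1 ;;
    m) factor=60 ;;
    h) factor=3600 ;;
    d) factor=86400 ;;
    w) factor=604800 ;;
    M) factor=2592000 ;;
    y) factor=31536000 ;;
    *) echo "invalid timeout suffix: $suffix" >&2; return 1 ;;
  esac

  echo $((number * factor))
}

timeout_to_seconds 1h   # prints 3600
timeout_to_seconds 1d   # prints 86400
```

Note the difference between 5m (five minutes) and 5M (five months), which is easy to mistype.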

When a notebook session is terminated, the worker that is hosting the server is shut down and all information about the server configuration and data under analysis is lost. However, it is easy to save the current server state so that you can start another notebook session at some later point in time with the same server configuration and data in place. You can save this state by using snapshots. While in a notebook session, run:

$ dx-snapshot

The current state of the server is captured and uploaded to the project as a special "snapshot" file. When a notebook server is initialized, the app makes a note of all files installed on the system. Then, when the user creates a snapshot, the app compares the current set of files to the initial set and creates an archive of any files that are new or have been modified since the notebook session was launched. As a result, any new packages that have been installed, any data downloaded to the server, and any notebook files created will be captured in the snapshot file. When a new notebook session is created and a snapshot is provided as input, the snapshot file is extracted onto the new server instance so that all packages and data files are, once again, in place as you left them.
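The compare-and-archive idea described above can be sketched in a few lines of shell. This is an illustration only and not the app's actual implementation, which this tutorial does not show. The use of cksum manifests, the manifest helper, and the demo file names are all assumptions made for the example.

```shell
#!/bin/sh
set -eu
export LC_ALL=C  # keep sort and comm ordering consistent

# Print "checksum size path" for every file under a directory, sorted.
# (Hypothetical helper; the real app may track files differently.)
manifest() {
  (cd "$1" && find . -type f -exec cksum {} + | sort)
}

# Set up a throwaway directory that stands in for the worker's file system.
demo=$(mktemp -d)
mkdir -p "$demo/worker"
echo "preinstalled" > "$demo/worker/base.txt"
echo "v1" > "$demo/worker/config.txt"

# 1. At session start: note every file already present.
manifest "$demo/worker" > "$demo/initial.manifest"

# 2. During the session: one file is added and one is modified.
echo "analysis" > "$demo/worker/notebook.ipynb"
echo "v2" > "$demo/worker/config.txt"

# 3. At snapshot time: keep manifest lines that were not in the initial
#    manifest, which correspond to new or modified files.
manifest "$demo/worker" > "$demo/current.manifest"
comm -13 "$demo/initial.manifest" "$demo/current.manifest" \
  | cut -d' ' -f3- | sort > "$demo/changed.list"

# 4. Archive only those files.
tar -czf "$demo/snapshot.tar.gz" -C "$demo/worker" -T "$demo/changed.list"

cat "$demo/changed.list"
```

In this demo the archive holds only config.txt and notebook.ipynb, while the untouched base.txt is left out. Restoring would amount to extracting the archive over a fresh copy of the initial file system.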

By default, the snapshot file is named after the current date and time. Users can also specify a name and provide additional options to the dx-snapshot command. To find out more, while on a worker, run:

$ dx-snapshot -h

Available Kernels and Packages

Jupyter notebooks can be used with many different languages through the use of different Jupyter "kernels". The current Jupyter notebook app on the DNAnexus platform has four kernels installed:
1. Python 2
2. Python 3
3. bash
4. R

Users select which kernel to use for a given notebook once inside the notebook session. Within a single session, users can create many separate notebooks and choose a different kernel for each one. The image below shows an example of selecting which kernel to use for a notebook in a) the standard Jupyter notebook environment or b) the new JupyterLab environment.

Each Jupyter notebook server comes pre-installed with a number of popular data packages:

numpy

scipy

pandas

matplotlib

seaborn
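One quick way to confirm that these packages are present is to try importing each one from a bash notebook. The sketch below assumes the Python 3 interpreter is available as python3, which this tutorial does not state. The helper check_packages is hypothetical.

```shell
#!/bin/sh
# Report whether each named Python package can be imported.
# ASSUMPTION: the Python 3 interpreter is on PATH as python3.
check_packages() {
  for pkg in "$@"; do
    if python3 -c "import $pkg" >/dev/null 2>&1; then
      echo "$pkg: available"
    else
      echo "$pkg: missing"
    fi
  done
}

check_packages numpy scipy pandas matplotlib seaborn
```

The same helper can be rerun after installing additional packages to verify that they import cleanly before taking a snapshot.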

Additional packages can easily be installed during a notebook session. For instance, users can start a bash notebook and install new packages via apt-get, pip, CRAN, git, etc. By creating a snapshot and providing it as input, users can then start subsequent sessions with the new packages pre-installed.