
Limit Notebook Resource Consumption by Culling Kernels

There’s no denying that data analytics is the next frontier on the computational landscape. Companies are scrambling to establish teams of data scientists to better understand their clientele and how best to evolve product solutions with the ebb and flow of today’s business ecosystem. With Apache Hadoop and Apache Spark entrenched as the analytic engines of choice, and a trial-and-error model used to iteratively build analytic applications, Jupyter Notebooks have proven to be the tool of choice for data scientists.

Inherent in Jupyter Notebooks is the ability to dynamically adjust and tune analytic functions, enabling the data scientist to accomplish their task with quick turnaround. Eventually, the notebook content evolves into a series of calls against the back-end cluster, each of which can take a significant amount of time while consuming valuable resources across the cluster. These calls are made to the notebook’s kernel, a library or application that implements the Jupyter Message Protocol and is essentially the programmatic interface exposed in a notebook’s cell. This behavior, coupled with the fact that data scientists are human and often fail to explicitly shut down the kernel, can lead to an unbounded consumption of resources that, heretofore, could not be reclaimed without administrative intervention.

To programmatically address this human behavior, Jupyter Notebook has been extended with the ability to cull idle kernels, where a kernel is considered idle if the notebook has performed no activity against it. Given the long-running nature of data analytic calls, the idle timeout period after which it is safe to assume culling can take place is typically on the order of 12 or 24 hours.

Culling Configuration

Off by default, idle kernel culling is enabled by setting --MappingKernelManager.cull_idle_timeout to a positive value representing the number of seconds a kernel must remain idle before it is culled (default: 0; recommended: 43200, i.e., 12 hours). Positive values less than 300 (5 minutes) will be adjusted to 300.
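For example, assuming a standard Notebook installation that includes this feature, idle culling could be enabled at launch like so:

```shell
# Cull kernels that have been idle for 12 hours (43200 seconds).
# Positive values below 300 are adjusted up to 300 by the server.
jupyter notebook --MappingKernelManager.cull_idle_timeout=43200
```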

The interval at which kernels are checked against their idle timeouts is also configurable, via --MappingKernelManager.cull_interval. If not set, or set to a non-positive value, the default of 300 seconds (5 minutes) is used.
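Rather than passing these options on the command line each time, they can be set in the Notebook configuration file (typically jupyter_notebook_config.py). A minimal fragment might look like:

```python
# jupyter_notebook_config.py
c = get_config()

# Cull kernels idle for 12 hours, checking every 5 minutes.
c.MappingKernelManager.cull_idle_timeout = 43200
c.MappingKernelManager.cull_interval = 300
```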

Given the long-running nature of data analytic calls, what if a given cell’s execution happens to take longer than the cull idle timeout period? That is, what if the kernel remains in a busy state for the duration of the culling period? This case is addressed by the --MappingKernelManager.cull_busy parameter (default: False): by default, kernels busy executing a long-running cell will not be culled.

Another use case for Jupyter Notebooks is a sort of “kiosk” configuration, where users wish to leave their notebooks in a connected state. Here, a notebook has an associated browser tab open to it and, although its kernel may be idle, that kernel shouldn’t necessarily be culled, while other idle kernels with no browser connection should be. This behavior is controlled by --MappingKernelManager.cull_connected (default: False): when False, idle kernels with open connections are exempt from culling; set it to True to cull idle kernels regardless of their connections.
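Taken together, the three parameters determine whether a given kernel is culled. The following is an illustrative sketch of that decision, not the actual Notebook implementation; the function name and arguments are hypothetical:

```python
def should_cull(idle_seconds, busy, connected,
                cull_idle_timeout=43200, cull_busy=False, cull_connected=False):
    """Sketch of the culling decision described above (hypothetical helper)."""
    if cull_idle_timeout <= 0:
        return False          # culling is disabled entirely (the default)
    if idle_seconds < cull_idle_timeout:
        return False          # kernel hasn't been idle long enough
    if busy and not cull_busy:
        return False          # long-running cell: leave the kernel alone
    if connected and not cull_connected:
        return False          # "kiosk" notebook with an open browser tab
    return True
```

For instance, with the defaults above, a kernel idle for 13 hours with no open browser connection would be culled, while the same kernel with a connected browser tab would survive unless cull_connected were set to True.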

Availability

This functionality has been added to the next (post-5.0) release of Jupyter Notebook. As a result, you’ll need to clone the repository, build Notebook from the master branch, and install it.
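The steps are roughly as follows, assuming pip and the JavaScript build prerequisites described in the Notebook repository are available:

```shell
# Build and install Jupyter Notebook from master
git clone https://github.com/jupyter/notebook.git
cd notebook
pip install -e .
```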

Co-existence with Jupyter Kernel Gateway

To leverage this functionality from Jupyter Kernel Gateway, it’s best to install Jupyter Kernel Gateway first, then install your build of Jupyter Notebook; otherwise, Jupyter Kernel Gateway will pull in the previous release of Notebook. Use of an older Notebook is evident when starting Jupyter Kernel Gateway produces error messages indicating that the new MappingKernelManager properties are not recognized.
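In other words, the install order matters; the local path below is a placeholder for wherever you cloned and built Notebook:

```shell
# Install the Kernel Gateway first (pulls in the released Notebook)...
pip install jupyter_kernel_gateway

# ...then overlay it with your local build of Notebook from master
pip install -e /path/to/notebook
```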


Spark Technology Center

The Apache Software Foundation has no affiliation with and does not endorse or review the materials provided on this website, which is managed by IBM. Apache®, Apache Spark™, and Spark™ are trademarks of the Apache Software Foundation in the United States and/or other countries.