Set up Fabric for Deep Learning to work in a private cloud environment, where your data is protected in your own data center

As companies collect massive amounts of data, they want to use artificial intelligence (AI) and the collected data to improve the user experience of their products. Providing an easy-to-use AI environment lowers the barrier to entry and lets both developers and data scientists focus on what they do best: analyzing data and defining and training cutting-edge neural network models (with automation) over these large data sets.

Fabric for Deep Learning (FfDL) is an open source collaboration platform for running deep learning workloads in private or public Kubernetes-based clouds. Leveraging the power of Kubernetes, FfDL provides a scalable, resilient, and fault-tolerant deep-learning framework by combining the right software, drivers, compute, memory, network, and storage resources.

IBM Cloud Private is an integrated environment for managing containers that includes the container orchestrator Kubernetes, a private image registry, a management console, and monitoring frameworks, all running within your data center. Many types of solutions can benefit from deploying IBM Cloud Private on premises for data privacy, data protection, and full control over the environment.

Together, IBM Cloud Private and FfDL provide a solution that combines the flexibility, ease of use, and economics of a cloud service with the power of deep learning.

Deploy the daemon set from the master node to install the driver on each worker node:

# Launch the daemonset
kubectl create -f driver-installer.yaml

After the daemon set is running, verify that the driver is installed on each worker node:

# Verify the driver is installed
kubectl describe ds nvidia-driver-installer -n kube-system
# ssh to your worker node(s) and run the following command on each node
/home/kubernetes/bin/nvidia/bin/nvidia-smi

If you don't have a storage class or NFS storage available in your environment, you can set up an NFS server to export a shared directory and mount it on all worker nodes so that pods running on different nodes can all access it. Run the following commands on the master node:
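The exact commands depend on your Linux distribution; the following is a minimal sketch for Ubuntu, where the export path /nfs/ffdl and the mount point /mnt/ffdl are example values to replace with your own:

# Install the NFS server packages (Ubuntu; use the equivalent package manager on other distributions)
sudo apt-get install -y nfs-kernel-server
# Create the directory to export (example path)
sudo mkdir -p /nfs/ffdl
# Export the directory; tighten the client specification to your cluster subnet
echo "/nfs/ffdl *(rw,sync,no_root_squash)" | sudo tee -a /etc/exports
sudo exportfs -ra

Then mount the exported directory on each worker node (replace your-master-node-ip with the address of your NFS server):

# Install the NFS client packages and mount the share on a worker node
sudo apt-get install -y nfs-common
sudo mkdir -p /mnt/ffdl
sudo mount -t nfs your-master-node-ip:/nfs/ffdl /mnt/ffdl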

Step 6. Verify the installation by running a Jupyter Notebook

To verify the proper operation of FfDL, you set up a Jupyter Notebook to run the code on FfDL.

Prepare a Dockerfile for Jupyter Notebook.

Create a Dockerfile on the master node. It installs PyTorch and downloads a Jupyter Notebook that uses the torchtext package; you can also replace this notebook with your own Jupyter Notebook file. A sketch follows.
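The tutorial's exact Dockerfile is not reproduced here; the following is a minimal sketch, where the base image, package versions, and the notebook download URL are assumptions to adapt to your setup (time4fun matches the password used later in this tutorial):

# Minimal sketch of a Jupyter Notebook image with PyTorch and torchtext
FROM python:3.6
RUN pip install torch torchtext jupyter
WORKDIR /notebooks
# Hypothetical URL; replace with the location of your own notebook file
ADD https://example.com/word_language_model_and_torchtext.ipynb /notebooks/
EXPOSE 8888
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--allow-root", "--NotebookApp.token=time4fun"]

Then build the image and expose it as a NodePort service named jupyter-notebook (the service name matches the command in the next step; the image tag and port are assumptions):

# Build the image on the master node
docker build -t jupyter-notebook:v1 .
# Create a deployment and expose it outside the cluster
kubectl run jupyter-notebook --image=jupyter-notebook:v1 --port=8888
kubectl expose deployment jupyter-notebook --type=NodePort --port=8888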

Use the following command to get the port number for the Jupyter Notebook service:

# Get the NodePort of the jupyter-notebook service
kubectl get svc jupyter-notebook -o jsonpath='{.spec.ports[0].nodePort}'

Now, open http://your-master-node-ip-address:your-jupyter-notebook-service-port in a browser and log in with time4fun as the password. Then, you can double-click word_language_model_and_torchtext.ipynb to open the notebook.

If there is no error, your FfDL environment is ready.

Summary

Now you know how to get FfDL running on an IBM Cloud Private cluster. Try it in your own environment to explore the power and flexibility of running deep learning workloads in your private cloud.