FAQ

General

Review these high-level questions and answers to get started with the Intel AI DevCloud and AI.

How can I improve deep learning performance when my framework is running in the background?

In some situations, deep learning code with default settings does not take advantage of the full compute capability of the underlying machine on which it runs, especially when the code runs on Intel® Xeon® Scalable processors. Intel created optimization techniques that enable optimal CPU performance for popular frameworks like Caffe* and TensorFlow*. For more information, see Tips to Improve Performance for Popular Deep Learning Frameworks on CPUs.

Who can request access to the Intel AI DevCloud?

Developers, data scientists, professors, students, start-ups, and others who are members of Intel® AI Academy are eligible to request access.

How do I become a member of the Intel® AI Developer Program?

You can join by requesting access or become a member by registering here.

What happens once I have received access?

Once you gain access, you will log on to a Linux*-based head node of a batch farm. There, you can stage your code and data, compile it, and then submit jobs to a queue. Once the queued job completes, your results will be in your home ($HOME) directory.

Jobs are scheduled on Intel® Xeon® Scalable processors.

Each processor has 24 cores with two-way hyperthreading.

Each processor has access to 96 GB of on-platform RAM (DDR4).

Only one job will run on any processor at a time.

You will get 200 GB of file storage quota.

Your home directory is not visible to other users.

Note: Once your access period expires, your home directory on the cluster will be deleted.

I don’t live in the United States. Can I still get access to the Intel AI DevCloud?

The Intel AI DevCloud is available to all members of the Intel AI Developer Program, and is accessible from any country.

How much do I have to pay to use the Intel AI DevCloud?

Nothing. It is free to members of the Intel® AI Developer Program. Joining the Intel AI Developer Program is also free.

Is CUDA* installed on the Intel AI DevCloud?

No. The Intel AI DevCloud consists of high-performing Intel® Xeon® Gold 6128 processors but does not include CUDA*.

Why do I get permission denied when I try to run a pip install <Package_Name>?

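Permission is denied because, by default, pip tries to install into the system-wide Python* package directory, which regular users cannot write to. To install a package for your own account instead, pass the --user parameter to pip: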

Example: pip install numpy --user

The other option is to create a new Conda* environment, activate it, and then perform the pip* or Conda installations in that environment. Multiple versions of the same package in the default Python environment can create problems. To avoid these problems, create a separate Conda environment for each activity.

Executing Jobs

The following answers help developers execute jobs on the Intel AI DevCloud. Type the commands shown at the Linux* command line.

How do I check whether I’m on login node or compute node?

Check the prompt of your terminal.

If it shows [uxxxx@c009 ~]$, you are on the login node.

If it shows something like [uxxxx@c009-n0xx ~]$, you are on the compute node.
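The same check can be scripted. The following Python* sketch is not part of the Intel AI DevCloud tooling: node_kind is a hypothetical helper, and it assumes that compute-node hostnames end in -n followed by digits, as in the example prompts above.

```python
import re
import socket

# Assumption: compute-node hostnames end in "-n" plus digits (for example
# "c009-n012"), mirroring the example prompts. Adjust if your prompts differ.
COMPUTE_NODE_PATTERN = re.compile(r"-n\d+$")


def node_kind(hostname):
    """Return "compute" or "login" for the given hostname."""
    short_name = hostname.split(".")[0]  # drop any domain suffix
    return "compute" if COMPUTE_NODE_PATTERN.search(short_name) else "login"


if __name__ == "__main__":
    # Classify the machine this script is running on.
    print(node_kind(socket.gethostname()))
```

A script could call node_kind before starting a heavy task and exit early when it reports the login node.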

How do I check the logs of my running job?

You can check the logs of your running job as follows:

For output logs, use qpeek -o <JOB_ID>

For error logs, use qpeek -e <JOB_ID>

How do I check the logs of my completed job?

Check your completed job logs as follows:

If you gave a name when submitting a job using qsub, the log files will be <JOB_NAME>.o<JOB_ID> and <JOB_NAME>.e<JOB_ID>

If you did not give a name and you submitted a job as qsub <JOB_SCRIPT>, the log files are <JOB_SCRIPT>.o<JOB_ID> and <JOB_SCRIPT>.e<JOB_ID>

If you did not give a name and you submitted a job as <COMMAND> | qsub, the log files are STDIN.o<JOB_ID> and STDIN.e<JOB_ID>
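The three naming rules above can be written as a small lookup. This is a minimal Python* sketch: log_file_names is a hypothetical helper, and it assumes the scheduler names logs after the script's file name rather than its full path.

```python
import os


def log_file_names(job_id, job_name=None, job_script=None):
    """Return (output_log, error_log) for a completed job.

    job_name   -- name given at submission, if any
    job_script -- script path passed to qsub, if any
    With neither, the job is assumed to have been piped to qsub (STDIN).
    """
    if job_name:
        base = job_name
    elif job_script:
        # Assumption: only the file name is used, not the directory part.
        base = os.path.basename(job_script)
    else:
        base = "STDIN"
    return f"{base}.o{job_id}", f"{base}.e{job_id}"
```

For example, log_file_names("1234", job_name="train") gives the pair train.o1234 and train.e1234.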

How do I set total wall clock time to the maximum on the Intel AI DevCloud?

If qpeek doesn't return any results, it is possible that the job is complete and qpeek did not get a chance to peek.

Alternatively, run qsub with the "-k oe" option:

qsub -k oe my_script

Standard output and standard error are written to your home directory. You can check them at any time while jobs are running.

How do I increase the wall clock time?

Add a directive such as the following to your job script, or pass the -l option to qsub on the command line:

#PBS -l walltime=<10:30>,mem=320kb

echo sleep 1000 | qsub -l walltime=<00:30:00>
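A job script with a walltime directive can also be generated programmatically. The following Python* sketch is one possible approach, not an official tool. make_job_script and walltime_to_seconds are hypothetical helpers, the bash shebang line is an assumption, and the 24-hour check reflects the maximum wall clock time described elsewhere in this FAQ.

```python
MAX_WALLTIME_SECONDS = 24 * 60 * 60  # the 24-hour maximum described in this FAQ


def walltime_to_seconds(walltime):
    """Convert an "HH:MM:SS" string to seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def make_job_script(commands, walltime="24:00:00"):
    """Return the text of a job script that requests the given walltime."""
    if walltime_to_seconds(walltime) > MAX_WALLTIME_SECONDS:
        raise ValueError("walltime exceeds the 24-hour maximum")
    # Assumption: the job runs under bash.
    lines = ["#!/bin/bash", f"#PBS -l walltime={walltime}"]
    lines.extend(commands)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "train.py" is a placeholder for your own program.
    print(make_job_script(["python train.py"], walltime="12:00:00"))
```

Write the returned text to a file and submit that file with qsub <JOB_SCRIPT>.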

How do I get the full information about a job?

In the command line, type:

qstat -f <JOBID>

How do I delete a job?

In the command line, type:

qdel <JOBID>

How do I find the architecture and features of the compute nodes available to me?

Run the following command on the login node:

pbsnodes

How do I log in to a compute node?

In the command line, type:

qsub -I

How do I get details of the nodes?

In the command line, type:

pbsnodes -a

Why do I get a memory error while running commands?

In most cases, a memory error is caused by trying to run compute-intensive tasks on the login node. In such cases, log in to the compute node using qsub -I and execute your commands there.

My job takes more than 24 hours to run, but the Intel® AI DevCloud has a maximum wall clock time of 24 hours. How do I run my job in this case?

The maximum wall clock time is set to 24 hours to ensure fair utilization of cluster resources by all. However, you can work around this limit as follows:

Save the model at regular intervals or at least once before the wall time expires. At the end of 24 hours, submit a new job that will load the last saved model, and then start training from there.

Saving the model at regular intervals has additional benefits: you get copies of your trained model early and can evaluate them on test data. This helps you understand how the model is performing and make any changes early, instead of waiting for a long job to complete.
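The save-and-resume pattern can be sketched in plain Python*. This is an illustration only, under stated assumptions: the "model" is a single number, the training step is a stand-in, and load_checkpoint, save_checkpoint, and train are hypothetical names. A real job would use its framework's own save and load calls.

```python
import json
import os


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "weights": 0.0}


def save_checkpoint(state, path):
    """Write to a temporary file, then rename it over the old checkpoint,
    so an interrupted write does not corrupt the previous checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def train(total_steps, checkpoint_path, steps_this_job, save_every=10):
    """Run at most steps_this_job steps, resuming from the last checkpoint."""
    state = load_checkpoint(checkpoint_path)
    last_step = min(total_steps, state["step"] + steps_this_job)
    while state["step"] < last_step:
        state["weights"] += 0.5  # stand-in for one real training step
        state["step"] += 1
        # Save at regular intervals and once more before this job ends.
        if state["step"] % save_every == 0 or state["step"] == last_step:
            save_checkpoint(state, checkpoint_path)
    return state
```

Each submitted job calls train with the same checkpoint path. The first job starts from step 0, and every later job continues from the step stored in the checkpoint until total_steps is reached.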

Login Node versus Compute Node

What is the difference between login node and compute node?

The login node uses a lightweight general-purpose processor. The compute node uses an Intel® Xeon® Gold 6128 processor that is capable of handling heavy workloads. All tasks that need extensive memory and compute resources must be run on a compute node, not on the login node.

How are the login and compute nodes placed on the Intel® AI DevCloud?

The following diagram illustrates the overall architecture of Intel® AI DevCloud:

Why can’t I run compute-intensive tasks on the login node?

The login node is lightweight and not capable of handling heavy workloads. It is primarily intended for storing your data; hence, a limit is enforced on both memory and compute on login nodes. As a result, a memory error is thrown if you try to run any heavy tasks on the login node.

How do I check to see whether I’m on login node or compute node?

You are on the compute node if the prompt shows n0xx, as shown in the following image:

If the prompt does not display n0xx, you are on the login node, as illustrated in the following image:

How do I run compute-intensive jobs?

You can run jobs on the compute node with any of the following options:

Submit a job using qsub <JOB_SCRIPT> from the login node. The job waits in the queue until the scheduler picks it up and finally runs it on the compute node.

Run the job in interactive mode using qsub -I. This creates a job with default settings and provides you with a terminal from the compute node. You can directly run the commands there.

Use JupyterHub*. To do this, navigate to Colfax Research. Sign in, and then start the server. Create a new notebook and run the code from there.

Use qsub from JupyterHub. To do this, navigate to Colfax Research. Log in, and then start the server. Create a new notebook. Submit the job using the qsub command. Details are available in the Welcome.ipynb file in the home folder of Intel AI DevCloud.

Run directly from the JupyterHub terminal. To do this, navigate to Colfax Research. Log in, and then start the server. Start a new terminal, which is a compute node where you can directly run the commands of a compute-intensive job. (For more details on starting a new terminal, see JupyterHub Terminal Versus SSH Terminal).

Use qsub from the JupyterHub terminal. To do this, submit the job using “qsub <JOB_SCRIPT>” from the JupyterHub terminal. This submits the job, and the job waits in the queue until the scheduler picks it up and executes it on a compute node.

What is the difference between batch qsub mode and interactive qsub mode for job submission?

In batch qsub mode, jobs are created and submitted using the command “qsub <JOB_SCRIPT>” from the login node. <JOB_SCRIPT> contains the job commands.

In interactive qsub mode, a job is created using the command “qsub -I”.

After executing this command you get a new terminal from the compute node allocated for your job. You can then run job commands directly in the terminal.
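The two modes differ only in the arguments passed to qsub. As a rough illustration, the following Python* sketch builds the command line for either mode without running it. qsub_command is a hypothetical helper. The -I, -l walltime=, and -k oe options are the ones mentioned in this FAQ, and my_script is a placeholder.

```python
import shlex


def qsub_command(job_script=None, interactive=False, walltime=None, keep_logs=False):
    """Return the qsub argument list for batch or interactive mode."""
    if interactive == bool(job_script):
        raise ValueError("pass exactly one of job_script or interactive=True")
    args = ["qsub"]
    if interactive:
        args.append("-I")  # interactive mode: terminal on a compute node
    if walltime:
        args += ["-l", f"walltime={walltime}"]
    if keep_logs:
        args += ["-k", "oe"]  # keep standard output and error in the home directory
    if job_script:
        args.append(job_script)  # batch mode: the script holds the job commands
    return args


if __name__ == "__main__":
    for args in (qsub_command(job_script="my_script"),
                 qsub_command(interactive=True, walltime="02:00:00")):
        print(" ".join(shlex.quote(a) for a in args))
```

Returning a list rather than a single string keeps each argument separate, so the result can be passed to subprocess.run unchanged on a machine where qsub is installed.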

How do I enter interactive qsub mode?

When using SSH (secure shell) to access the Intel AI DevCloud with a PuTTY*/Linux* terminal, enter the login node first. To run a job in interactive qsub mode, type qsub -I. This creates a new job and provides a terminal from the compute node allocated for this job. See the following image:

When should I use batch qsub mode and when should I use interactive qsub mode?

Use batch qsub mode when you have tested, working code and need to run it and store the results.

Use interactive qsub mode when you are still developing the code and expect errors. You can fix errors as they occur, just as if you were working on a local machine.

How do I check job status and logs in qsub mode versus interactive qsub mode?

In batch qsub mode, you can submit the job, work on other things, and then come back later to check for updates. Check your job status with the qstat command, and check logs using the log files or the qpeek command.

In interactive qsub mode, once the terminal has expired, the logs written while the job or command was running are deleted.

Adjust GPU-Specific Configurations for a CPU

Can I run TensorFlow* with graphics processing unit (GPU) support on the Intel® AI DevCloud?

The Intel® AI DevCloud is a cluster of high-performance Intel® Xeon® Scalable processors (CPUs). Code written for a GPU-specific environment must be converted before it can run on these CPUs.

How do I convert GPU-specific configurations for a CPU?

You can do this in one of two ways:

While building the framework, change the configuration settings from GPU to CPU.

Comment out or change the code snippets that are written specifically for GPUs.

How do I change the code snippet from GPU to CPU in the deep learning frameworks?

For C++:

Remove the following lines of code in the file detection_loss_layer.cpp (folder: src/caffe/layers):

#ifdef CPU_ONLY
STUB_GPU(DetectionLossLayer);
#endif

For PyTorch*:

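Remove the occurrences of .cuda, or guard them with a check for CUDA* availability, to switch from a GPU to a CPU.

For example:

import torch

# Use the GPU only when CUDA is available; otherwise fall back to the CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")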

To comment or change the code snippets that are written specifically for GPUs, see some of the examples on GitHub for Caffe*-Yolo*.

What are the configuration changes that I need to set while building Caffe* in CPU mode?

JupyterHub* Terminal versus SSH Terminal

How do I log in to the JupyterHub terminal?

Enter your user ID and password in the Login page. Your user ID is available in the Intel® AI DevCloud welcome email. The password is the unique user ID (UUID) also included in the welcome email.

The image below shows the home directory contents that are displayed after signing in.

To open a new terminal, in the right corner, select New, and then select Terminal. The Jupyter Notebook* terminal appears on the compute node.

How do I log in to the SSH terminal?

Follow the instructions at the link in the Intel AI DevCloud welcome email (https://access.colfaxresearch.com/?uuid=xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx). If the sign-in is successful, an SSH terminal appears, as shown below:

What is the difference between a JupyterHub terminal and an SSH terminal?

The differences are:

1. Execution time

A JupyterHub* terminal has a maximum session time of four hours. After that time, the session is terminated without leaving a log.

An SSH terminal gives a default wall clock time of six hours for jobs submitted using qsub. These default settings can be altered to provide a maximum of 24 hours wall clock time.

2. Job logs

A JupyterHub terminal does not currently provide a qpeek tool, which is used to get near real-time job logs. Also, job logs may not be properly written due to session expiry.

An SSH terminal does provide a qpeek tool. You can check near real-time job logs of a running job with qpeek -o <JOB_ID> (output logs) and qpeek -e <JOB_ID> (error logs).

3. Logged-in terminal

The JupyterHub terminal signs in to the compute node directly.

The SSH terminal signs in to the login node directly. Enter qsub -I to go to the compute node.

I submitted a job through a JupyterHub terminal. Why can’t I see the job logs?

A JupyterHub session expires at the end of four hours. The remaining time is displayed in the top right of the session page. Jobs longer than four hours are stopped at the end of the session time, and logs are not properly written. For jobs longer than four hours, we suggest you use qsub mode with a PuTTY* or Linux* SSH terminal.