Introduction

The Message Passing Interface (MPI) is a standardized message-passing library interface designed for distributed-memory programming. MPI is widely used in high-performance computing (HPC) because most large HPC systems are built on distributed-memory architectures.

Python* is a modern, powerful interpreted language that supports modules and packages, and it can be extended with code written in C/C++. While HPC applications are usually written in C or Fortran for performance, Python's simplicity and modularity make it well suited for quickly prototyping a proof of concept and for rapid application development.

The MPI for Python* (mpi4py*) package provides Python bindings for the MPI standard. mpi4py closely follows MPI syntax and semantics while allowing Python objects to be communicated, so programmers can implement MPI applications in Python quickly. Note that mpi4py is object-oriented and does not expose every function in the MPI standard; however, almost all the commonly used functions are available. More information can be found in the mpi4py documentation. In mpi4py, MPI.COMM_WORLD is the predefined communicator that contains all processes.

mpi4py supports two types of communication:

Communication of generic Python objects: The methods of a communicator object have all-lowercase names (send(), recv(), bcast(), scatter(), gather(), and so on). In this type of communication, the object to be sent is passed as a parameter to the communication call and is serialized (pickled) under the hood.

Communication of buffer-like objects: The methods of a communicator object have capitalized names (Send(), Recv(), Bcast(), Scatter(), Gather(), and so on). Buffer arguments to these calls are specified using tuples. This type of communication is much faster than communicating generic Python objects because it avoids serialization; the sketch below contrasts the two styles.
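As a minimal sketch (not from the original article), the following program contrasts the two styles. It assumes NumPy is installed and that the script is launched with two ranks:

# demo.py -- hypothetical file name; contrasts the lowercase and
# capitalized mpi4py communication methods.
from __future__ import print_function
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Lowercase API: send a generic Python object (pickled under the hood).
if rank == 0:
    comm.send({"payload": [1, 2, 3]}, dest=1, tag=0)
elif rank == 1:
    obj = comm.recv(source=0, tag=0)
    print("rank 1 received:", obj)

# Capitalized API: broadcast a buffer-like object (fast path, no pickling).
buf = np.arange(4, dtype='d') if rank == 0 else np.empty(4, dtype='d')
comm.Bcast([buf, MPI.DOUBLE], root=0)
print("rank", rank, "has", buf)

Run it with, for example:

$ mpirun -n 2 python demo.py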

The Intel Distribution for Python 2018 is available for free for Python 2.7.x and 3.5.x on macOS*, Windows* 7 and later, and Linux* operating systems. It can be installed as a standalone package or as part of Intel® Parallel Studio XE 2018.

Intel Distribution for Python supports both Python 2 and Python 3, with two separate packages available: one for Python 2.7 and one for Python 3.5. In this example, the Intel Distribution for Python 2.7 for Linux (l_python2_pu_2018.1.023.tgz) is installed on an Intel® Xeon Phi™ processor 7250 @ 1.4 GHz with 68 cores and 4 hardware threads per core (272 hardware threads in total). To install, extract the package contents, run the install script, and follow the installer prompts:
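A typical sequence looks like the following (the extracted directory and installer script names are assumptions based on the package name and may differ between releases):

$ tar -xvzf l_python2_pu_2018.1.023.tgz
$ cd l_python2_pu_2018.1.023
$ ./install.sh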

After the installation completes, activate the root conda* environment of the Intel® Distribution for Python*:

$ source /opt/intel/intelpython2/bin/activate root

Parallel Computing: OpenMP* and SIMD

While multithreaded Python workloads can use the optimized thread scheduling of Intel® Threading Building Blocks (Intel® TBB), another approach is to use OpenMP* to take advantage of Intel® multi-core architecture. This section shows how to implement a multithreaded application using OpenMP and the C math library in Cython*.

Cython is a superset of the Python language that compiles to native code. Cython code looks like Python, but it additionally supports C function calls and C-style declarations of variables and class attributes. Cython is commonly used to wrap external C libraries and to speed up the execution of Python programs. It generates C extension modules, which the main Python program loads with the import statement.

For example, to generate an extension module, one writes Cython code in a .pyx file. Cython compiles the .pyx file into a .c file containing the code of a Python extension module. A C compiler in turn compiles the .c file into a shared object library (.so file) that Python can import.

One way to build Cython code is to write a distutils setup.py file (distutils is used to distribute Python modules). In the following multithreads.pyx file, the function vector_log_multiplication computes log(a)*log(b) for each entry of the arrays A and B and stores the result in the array C. Note that a parallel loop (prange) allows multiple threads to execute in parallel, the log function is imported from the C math library, and the function getnumthreads() returns the number of OpenMP threads:
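The article's exact listing is not reproduced here; the following is a minimal sketch of what multithreads.pyx might look like, using the function names given above (the array types and loop details are assumptions):

# multithreads.pyx -- minimal sketch of the module described above
from cython.parallel import prange
from libc.math cimport log
cimport openmp

def vector_log_multiplication(double[:] A, double[:] B, double[:] C):
    # Each OpenMP thread handles a chunk of the iteration space.
    cdef Py_ssize_t i, n = A.shape[0]
    for i in prange(n, nogil=True):
        C[i] = log(A[i]) * log(B[i])

def getnumthreads():
    # Maximum number of OpenMP threads available to a parallel region.
    return openmp.omp_get_max_threads()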

The setup.py file invokes the distutils build process that generates the extension modules. By default, setup.py uses the GNU Compiler Collection (GCC)* to compile the C code of the Python extension. To take advantage of Intel AVX-512 and OpenMP multithreading on the Intel Xeon Phi processor, specify the options -xMIC-AVX512 and -qopenmp in the compile and link flags, and use the Intel® C++ Compiler. For more information on how to create the setup.py file, refer to the Writing the Setup Script section of the Python documentation.
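A sketch of such a setup.py, consistent with the description above (not the article's exact listing):

# setup.py -- builds multithreads.pyx with OpenMP and AVX-512 flags
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

ext = Extension(
    "multithreads",
    sources=["multithreads.pyx"],
    extra_compile_args=["-qopenmp", "-xMIC-AVX512"],
    extra_link_args=["-qopenmp"],
)

setup(name="multithreads", ext_modules=cythonize([ext]))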

As mentioned above, this process first generates the extension code multithreads.c; the Intel compiler then compiles it to produce the dynamic shared object library multithreads.so.
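One way to point the distutils build at the Intel compiler is through environment variables (a reconstruction; distutils honors CC and LDSHARED on Linux, but the article may have configured the compiler differently):

$ CC=icc LDSHARED="icc -shared" python setup.py build_ext --inplace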

How To Write a Python Application with Hybrid MPI/OpenMP

In this section, we write an MPI application in Python that imports the mpi4py and multithreads modules. The application uses the communicator object MPI.COMM_WORLD to identify the set of processes that can communicate with each other. The MPI functions MPI.COMM_WORLD.Get_size(), MPI.COMM_WORLD.Get_rank(), MPI.COMM_WORLD.send(), and MPI.COMM_WORLD.recv() are methods of this communicator object. Note that in mpi4py there is no need to call MPI_Init() and MPI_Finalize() as in the MPI standard: they are called automatically when the module is imported and when the Python process exits, respectively.

The sample Python application first initializes two large input arrays consisting of random numbers between 1 and 2. Each MPI rank uses OpenMP threads to do the computation in parallel; each OpenMP thread in turn computes the product of two natural logarithms, c = log(a)*log(b), where a and b are random numbers between 1 and 2 (1 ≤ a, b ≤ 2). To do that, each MPI rank calls the vector_log_multiplication function defined in the multithreads.pyx file. Execution time of this function is short, about 1.5 seconds. For illustration purposes, we use the timeit utility to invoke the function 10 times, which leaves enough time to observe the number of OpenMP threads involved.

Below is the application source code, mpi_sample.py. If the running time of the program is too short, increase the value of FACTOR in the source file to lengthen the execution. In this example, the value of FACTOR was changed from 512 to 1024:
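The article's exact listing is not reproduced here; the following sketch is consistent with the description above (the array-size formula, the printed messages, and the final send()/recv() exchange are assumptions):

# mpi_sample.py -- sketch of the hybrid MPI/OpenMP application described
# above; FACTOR and the 10 timeit invocations come from the text, while
# the array size and output format are assumptions.
from __future__ import print_function
import timeit
import numpy as np
from mpi4py import MPI
from multithreads import vector_log_multiplication, getnumthreads

FACTOR = 1024                    # increase for a longer run (was 512)
N = FACTOR * 65536               # assumed array length per rank

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Random inputs in [1, 2).
a = 1.0 + np.random.rand(N)
b = 1.0 + np.random.rand(N)
c = np.empty_like(a)

if rank == 0:
    print("Running %d MPI ranks, %d OpenMP threads each" % (size, getnumthreads()))
    print("Start timing ...")

# Invoke the Cython function 10 times with the timeit utility.
elapsed = timeit.timeit(lambda: vector_log_multiplication(a, b, c), number=10)

# Non-root ranks report their timings to rank 0 via send()/recv().
if rank == 0:
    print("rank 0: %.2f seconds" % elapsed)
    for r in range(1, size):
        print("rank %d: %.2f seconds" % (r, comm.recv(source=r)))
    print("End timing.")
else:
    comm.send(elapsed, dest=0)

Run it with, for example:

$ mpirun -n 2 python mpi_sample.py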

While the Python program is running, the top command in a second terminal displays two MPI ranks (shown as two Python processes). When the main module enters the timing loop (after the message "Start timing ..."), top reports each Python process running at roughly 13,600 percent CPU, that is, about 136 threads per process. This is because, by default, all 272 hardware threads on this system are utilized by the two MPI ranks, so each rank gets 272/2 = 136 threads.

To get detailed information about MPI at run time, set the I_MPI_DEBUG environment variable to a value from 0 to 1000. The following command runs four MPI ranks with I_MPI_DEBUG set to 4; top shows each rank running 272/4 = 68 OpenMP threads:
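With the Intel® MPI Library, a command along these lines accomplishes this (a reconstruction; -genv passes an environment variable to all ranks):

$ mpirun -genv I_MPI_DEBUG 4 -n 4 python mpi_sample.py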

We can specify the number of OpenMP threads each rank uses in the parallel region by setting the OMP_NUM_THREADS environment variable. The following command starts four MPI ranks with 34 threads per rank (2 threads per core):
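For example (a reconstruction consistent with the description above):

$ mpirun -genv OMP_NUM_THREADS 34 -n 4 python mpi_sample.py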

Finally, we can force the program to allocate its memory in MCDRAM (the high-bandwidth, on-package memory of the Intel Xeon Phi processor). For example, before running the program, the "numactl --hardware" command shows that the system has two NUMA nodes: node 0 consists of all the CPUs and 96 GB of DDR4 memory, while node 1 is the 16 GB of MCDRAM:
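The commands below sketch the workflow (the output of numactl --hardware is omitted here; numactl --membind=1 binds all of the program's allocations to NUMA node 1, the MCDRAM):

$ numactl --hardware
$ mpirun -n 4 numactl --membind=1 python mpi_sample.py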

Readers can also try the above code on an Intel® Xeon® processor system with the appropriate settings; for example, on an Intel® Xeon® Scalable processor, use -xCORE-AVX512 instead of -xMIC-AVX512 and set an appropriate number of threads. Also note that the Intel Xeon Scalable processor does not have high-bandwidth MCDRAM, so the numactl step does not apply.

Conclusion

This article introduced the MPI for Python (mpi4py) package and demonstrated how to use it with the Intel Distribution for Python. It also showed how to combine OpenMP multithreading and Intel AVX-512 instructions to take full advantage of the Intel Xeon Phi processor architecture: a simple example illustrated how to write a parallel Cython function with OpenMP, compile it with the Intel compiler with the Intel AVX-512 option enabled, and integrate it with an MPI Python program.

About the Author

Loc Q Nguyen received an MBA from the University of Dallas, a master's degree in Electrical Engineering from McGill University, and a bachelor's degree in Electrical Engineering from École Polytechnique de Montréal. He is currently a software engineer in the Intel Software and Services Group. His areas of interest include computer networking, parallel computing, and computer graphics.