Kebnekaise

Kebnekaise is the latest supercomputer at HPC2N. It is named after the massif of the same name, which includes some of Sweden's highest mountain peaks (Sydtoppen and Nordtoppen). Like the massif, the supercomputer Kebnekaise is a system with many faces.

Kebnekaise was delivered by Lenovo and installed during the summer of 2016, except for the 36 nodes with the new generation of Intel Xeon Phi, also known as Intel Knights Landing (KNL), which are currently being installed and tested (March 2017). Kebnekaise was opened for general availability on November 7, 2016.

There is local scratch space on each node (about 170 GB, SSD), which is shared between the jobs currently running on that node. Connected to Kebnekaise is also our parallel file system Ransarn ("PFS"), which provides fast access to files regardless of which node a job runs on. For more information about the different filesystems available on our systems, read the Filesystems and Storage page.

All nodes run Ubuntu Xenial (16.04 LTS). Compared to Abisko, we have also changed to EasyBuild as the build system for installing software, and to a new module system, Lmod. We are still expanding the portfolio of installed software; the software page currently lists only a few of the installed packages. Please log in to Kebnekaise for a list of all available software packages.

With all the different node types of Kebnekaise, the scheduling of jobs is somewhat more complicated than on our previous systems. Different node types are "charged" differently; see the allocation policy on Kebnekaise page for details. Kebnekaise uses SLURM for job management and scheduling.

Kebnekaise in numbers

544 nodes

13 racks

17552 cores (of which 2448 are KNL cores)
  17104 cores are available to users (the rest manage the cluster)

399360 CUDA cores (80 * 4992 cores/K80)

More than 125 TB memory (20 * 3 TB + (432 + 36) * 128 GB + 36 * 192 GB)

66 switches (InfiniBand, access, and management networks)

728 TFlop/s peak performance

629 TFlop/s HPL (all parts), i.e. 86% of peak performance

HPL performance of Kebnekaise

Compute Nodes         374 TFlop/s
Large Memory Nodes     34 TFlop/s
2xGPU Nodes           129 TFlop/s
4xGPU Nodes            30 TFlop/s
KNL Nodes              62 TFlop/s
Total (all parts)     629 TFlop/s

Do note that running heavy AVX code on all 28 cores (of the normal CPUs) limits the clock to an absolute maximum of 2.9 GHz per core, and in practice probably no more than 2.5 GHz.

The same holds for the KNLs: running AVX code on all 68 KNL cores limits the clock speed, so turbo cannot be used.

The AVX clock frequency is not the same as the rest of the CPU's clock frequency; it has both a lower starting point and a lower maximum boost.

Compute nodes

The memory is shared within the whole node, but physically 64 GB is placed on each NUMA island. The memory controller on each NUMA island has 4 channels.
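As a minimal sketch (assuming the libnuma development files from the numactl project are installed; the file name is just an example), the NUMA layout can be inspected from C:

    /* numa_info.c - print the memory on each NUMA island.
       Compile with: gcc numa_info.c -lnuma -o numa_info */
    #include <stdio.h>
    #include <numa.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }
        int max_node = numa_max_node();  /* highest NUMA node id */
        for (int node = 0; node <= max_node; node++) {
            long long free_bytes;
            long long size = numa_node_size64(node, &free_bytes);
            printf("NUMA node %d: %lld MB total, %lld MB free\n",
                   node, size >> 20, free_bytes >> 20);
        }
        return 0;
    }

On a compute node this should report two NUMA nodes with roughly 64 GB each.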

Intel Xeon E5-2690v4 (Broadwell)

Instruction set                        AVX2 & FMA3
SP FLOPs/cycle                         32
DP FLOPs/cycle                         16
Base Frequency                         2.5 GHz
Turbo Mode Frequency (single core)     3.8 GHz
Turbo Mode Frequency (all cores)       2.9 GHz
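For reference, these figures give a theoretical double precision peak of 28 cores * 16 DP FLOPs/cycle * 2.5 GHz = 1.12 TFlop/s per compute node at the base frequency; as noted above, the sustained AVX clock, and thus the achievable rate, can be lower.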

Large memory nodes

There are 18 cores on each of the 4 NUMA islands. The cores on each NUMA island share 768 GB memory, but have access to the full 3072 GB on the node. The memory controller on each NUMA island has 4 channels.

Each core has:

  64 kB L1 cache
    32 kB L1 data cache
    32 kB L1 instruction cache
  256 kB L2 cache

The 45 MB L3 cache is shared between the cores on each NUMA island.
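If you want to check these sizes on a node, glibc on Linux exposes them through sysconf; a minimal sketch (the file name is just an example):

    /* cache_info.c - print the cache sizes the OS reports.
       Compile with: gcc cache_info.c -o cache_info */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* L1/L2 are per core; L3 is the shared cache.
           A level the kernel does not expose is reported as 0. */
        printf("L1d: %ld kB\n", sysconf(_SC_LEVEL1_DCACHE_SIZE) / 1024);
        printf("L1i: %ld kB\n", sysconf(_SC_LEVEL1_ICACHE_SIZE) / 1024);
        printf("L2:  %ld kB\n", sysconf(_SC_LEVEL2_CACHE_SIZE) / 1024);
        printf("L3:  %ld MB\n", sysconf(_SC_LEVEL3_CACHE_SIZE) / (1024 * 1024));
        return 0;
    }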

GPU nodes

The CPU cores are identical to those in the compute nodes. In addition:

32 GPU nodes have 2 K80 GPUs each, with one K80 located on each NUMA island.

4 GPU nodes have 4 K80 GPUs each, with two K80s located on each NUMA island.

[Figure: One GK210 compute engine with 15 SMXs (13 enabled). SMX is what NVIDIA calls their Next Generation Streaming Multiprocessor. Picture copyright of NVIDIA.]

[Figure: A single SMX. Picture copyright of NVIDIA.]

Each K80 GPU has two GK210 chips (compute engines), each of which is made up of 15 SMX (Next Generation Streaming Multiprocessor) units and six 64-bit memory controllers. To fit two GK210s within the power and thermal budget of a single K80 board, only 13 of the 15 SMX units are enabled on each GK210. Since there are 192 CUDA cores on each SMX, this adds up to 13 * 192 * 2 = 4992 cores on each K80.
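Each GK210 shows up as its own CUDA device, so a node with two K80s exposes four devices. As a minimal sketch (the file name is just an example; the 192 cores/SMX factor is specific to Kepler), the layout can be verified with the CUDA runtime API:

    /* gpu_info.c - list CUDA devices and their SMX counts.
       Compile with: nvcc gpu_info.c -o gpu_info */
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int n = 0;
        cudaGetDeviceCount(&n);
        for (int i = 0; i < n; i++) {
            struct cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            /* 192 CUDA cores per SMX holds for Kepler GPUs like the K80 */
            printf("Device %d: %s, %d SMXs, %d CUDA cores\n",
                   i, prop.name, prop.multiProcessorCount,
                   prop.multiProcessorCount * 192);
        }
        return 0;
    }

On a 2xGPU node this should print four GK210 devices, each with 13 SMXs (2496 CUDA cores).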

KNL nodes

A KNL 7250 chip is made up of 34 "Tiles", interconnected by a 2D mesh; the Tiles are where the cores are located. For I/O it has a maximum of 36 PCIe Gen3 lanes, plus 4 DMI lanes for the chipset.

[Figure: An Intel Xeon Phi 7250 (Knights Landing) chip with its 34 Tiles.]

A "Tile" contains two cores (each with 2 VPUs), 1 MB shared L2 cache, and a CHA unit used to keep the L2 caches coherent.

Each tile has:

  2 cores (4 threads per core)
    2 VPUs per core
      each VPU has an AVX-512 unit
      32 SP / 16 DP FLOPs/cycle per VPU
      X87, SSE, AVX1, AVX2, and EMU
  1 MB shared L2 cache
    16-way
    1 line read / cycle
    1/2 line write / cycle
  CHA (Caching/Home Agent) to keep the L2 caches coherent

Intel Xeon Phi 7250 (Knights Landing)

Cores / Threads                  68 / 272
Frequency                        1400 MHz
Turbo                            1600 MHz
L2 cache                         34 MB (1 MB / tile)
Memory                           16 GB MCDRAM + 192 GB DDR4-2400 RAM
Memory bandwidth, MCDRAM         400+ GB/s
Memory bandwidth, DDR4 RAM       115.2 GB/s
DDR memory channels              6
Peak double precision            3046 GFlop/s
Max # of PCI Express lanes       36
Instruction set                  64-bit
L1i cache                        32 kB / core
L1d cache                        32 kB / core
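The peak double precision figure follows directly from the tile specifications above: 68 cores * 2 VPUs/core * 16 DP FLOPs/cycle * 1.4 GHz = 3046.4 GFlop/s.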

The KNL nodes have AVX-512 extensions, which provide:

  512-bit FP/integer vectors
  32 vector registers and 8 mask registers
  hardware support for gather and scatter
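As a minimal sketch of the 512-bit vectors (the file name is just an example; the code needs AVX-512 hardware such as the KNL nodes to run), a single fused multiply-add on eight doubles:

    /* avx512_fma.c - one AVX-512 FMA on eight doubles.
       Compile with: gcc -mavx512f avx512_fma.c -o avx512_fma */
    #include <stdio.h>
    #include <immintrin.h>

    int main(void)
    {
        double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        double b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
        double c[8] = {1, 1, 1, 1, 1, 1, 1, 1};
        double r[8];

        __m512d va = _mm512_loadu_pd(a);
        __m512d vb = _mm512_loadu_pd(b);
        __m512d vc = _mm512_loadu_pd(c);
        __m512d vr = _mm512_fmadd_pd(va, vb, vc);  /* r = a*b + c in one instruction */
        _mm512_storeu_pd(r, vr);

        for (int i = 0; i < 8; i++)
            printf("%.1f ", r[i]);
        printf("\n");
        return 0;
    }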

There is a high-bandwidth connection between cores and memory. The DDR4 RAM is used by default and for bulk memory, while the high-bandwidth MCDRAM is explicitly allocated when needed for critical data. This can be done in one of two ways: 1) with the fast malloc functions from the memkind library (see https://github.com/memkind), or 2) with the FASTMEM compiler annotation for Intel Fortran. Intel's web pages contain some examples.
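A minimal sketch of option 1, assuming the memkind library is installed (the file name is just an example):

    /* hbw_example.c - put a critical array in MCDRAM via memkind's
       hbwmalloc interface, falling back to DDR4 if MCDRAM is absent.
       Compile with: gcc hbw_example.c -lmemkind -o hbw_example */
    #include <stdio.h>
    #include <stdlib.h>
    #include <hbwmalloc.h>

    int main(void)
    {
        size_t n = 1 << 20;
        int have_hbw = (hbw_check_available() == 0);
        double *data = have_hbw ? hbw_malloc(n * sizeof(double))
                                : malloc(n * sizeof(double));
        if (!data)
            return 1;

        for (size_t i = 0; i < n; i++)
            data[i] = (double)i;
        printf("first = %.1f, last = %.1f\n", data[0], data[n - 1]);

        /* hbw_free must pair with hbw_malloc, free with malloc */
        if (have_hbw)
            hbw_free(data);
        else
            free(data);
        return 0;
    }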