The Knights Landing Intel Xeon Phi processor is a bootable host processor that delivers massive parallelism and vectorization to support the most demanding High Performance Computing (HPC) applications. The integrated, power-efficient architecture delivers significantly more compute per unit of energy consumed than comparable platforms, improving total cost of ownership. The integration of memory and fabric topples the memory wall and reduces cost to help you solve your biggest challenges faster. Read the MFDn Case Study.
Second-generation Intel Xeon Phi processors, code-named Knights Landing (KNL), are specialized computing platforms capable of delivering better performance than general-purpose CPUs, such as Intel Xeon products, for some applications. Applications run best on Knights Landing if they have a high degree of parallelism and well-behaved communication with memory. Read the BerkeleyGW Case Study.
In Knights Landing, each of its up to 72 cores has an L1 cache; pairs of cores are organized into tiles, with a slice of the L2 cache shared symmetrically between the two cores; and the L2 caches are connected to each other with a mesh. All caches are kept coherent by the mesh using the MESIF (Modified, Exclusive, Shared, Invalid, Forward) protocol states of cache lines. In the mesh, each vertical and horizontal link is a bidirectional ring.
To maintain cache coherency, KNL has a distributed tag directory (DTD), organized as a set of per-tile tag directories (TDs), which identifies the state and the location on the chip of any cache line. For any memory address, the hardware can identify with a hash function the TD responsible for that address.
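Conceptually, the address-to-TD mapping can be pictured as a hash over cache-line addresses. The toy model below is purely illustrative (Intel does not document the real hardware hash, and the TD count here is an assumption of one TD per tile on a 72-core part); it shows the two properties that matter: every byte of a 64-byte line maps to the same TD, and lines are spread across all TDs.

```python
NUM_TDS = 36      # illustrative: one tag directory per tile on a 72-core part
CACHE_LINE = 64   # bytes per cache line

def td_for_address(addr: int) -> int:
    """Illustrative stand-in for the (undocumented) KNL hash: map a
    physical address to the tag directory responsible for its cache line."""
    line = addr // CACHE_LINE      # all bytes of one line share a TD
    return hash(line) % NUM_TDS    # spread lines across all TDs

# Every byte within one 64-byte cache line resolves to the same TD:
assert td_for_address(0x1000) == td_for_address(0x103F)
# Adjacent cache lines may land on different TDs anywhere on the chip:
assert td_for_address(0x1000) != td_for_address(0x1040)
```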
These improvements in cache organization in KNL come with increased complexity of the chip hardware. To manage this complexity and set the optimal mode of operation for any given computational application, the programmer has access to cache clustering modes.
When an application requests data from a memory address, the processing tile (let's call it tile A) first queries its local cache to see if the requested memory address is present there. If it is, the calculation proceeds with minimal data-access latency. Otherwise, tile A queries the DTD for the cache line (i.e., the 64-byte block of memory) containing that data. This means that a message is sent from tile A to the responsible TD on another tile (call it tile B). If, according to the TD, this cache line is present in some other tile's L2 cache (call it tile C), another message is sent from tile B to tile C, and finally tile C sends the data to tile A. If the requested memory address is not cached anywhere, tile B forwards the request to the memory controller responsible for this address (call it controller D). This may be an on-package MCDRAM memory controller or an on-platform DDR4 memory controller.
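The message sequence above can be sketched as a toy model. Everything here is invented for illustration (the tile numbering, the stand-in TD hash, and the dictionaries standing in for hardware state); the point is only the three possible paths a request can take.

```python
def lookup(addr, local_tile, l2_contents, td_owner):
    """Toy model of a KNL cache-line lookup (all structures illustrative).
    l2_contents: tile id -> set of cache-line numbers held in that L2
    td_owner:    cache-line number -> tile whose L2 holds it (absent = uncached)
    Returns the sequence of agents the request visits."""
    line = addr // 64
    if line in l2_contents.get(local_tile, set()):
        return [local_tile]                    # local L2 hit: minimal latency
    td_tile = line % 4                         # stand-in for the TD hash (tile B)
    holder = td_owner.get(line)
    if holder is not None:
        return [local_tile, td_tile, holder]   # B forwards to C; C replies to A
    return [local_tile, td_tile, "memctrl"]    # uncached: B forwards to controller D
```

For example, `lookup(1024, 0, {0: {16}}, {})` returns `[0]` (a local hit), while `lookup(2048, 1, {}, {})` returns `[1, 0, "memctrl"]`: the request visits the TD tile and then a memory controller.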
It is in the developer's interest to maintain locality of these messages to achieve the lowest latency and greatest bandwidth of communication with caches. Knights Landing supports all-to-all, quadrant/hemisphere, and sub-NUMA cluster (SNC-4/SNC-2) modes of cache operation.
With the all-to-all clustering mode, memory addresses are uniformly distributed across all TDs on the chip. This mode can produce unfortunate cases in which the requesting tile, the tag directory, and the memory controller are far apart on the mesh, making the latency of cache hits and cache misses long. The all-to-all mode should not be used for day-to-day operation of Knights Landing; it is supported for troubleshooting and for situations where the other clustering modes cannot operate.
In the quadrant clustering mode, the tiles are divided into four parts called quadrants, which are spatially local to four groups of memory controllers. Memory addresses served by a memory controller in a quadrant are guaranteed to be mapped only to TDs contained in that quadrant. Hemisphere mode functions the same way, except that the die is divided into two hemispheres instead of four quadrants. In the quadrant and hemisphere modes, the latency of L2 cache misses is reduced compared to the all-to-all mode because the worst-case path is shorter. The division into quadrants is hidden from the operating system: there are no breaks in the address space, and the memory appears to be one contiguous block from the user's perspective. This is the recommended mode for applications that treat Knights Landing as a Symmetric Multi-Processor (SMP).
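The quadrant-mode guarantee can be illustrated with a toy model (the tile counts, the quadrant-ownership rule, and the hash are all invented for illustration): whatever quadrant's memory controller serves an address, the TD for that address is constrained to lie in the same quadrant, so a request never has to cross the chip.

```python
TILES_PER_QUADRANT = 9   # illustrative: 36 tiles split into 4 quadrants

def owning_quadrant(addr: int) -> int:
    """Illustrative rule for which quadrant's memory controller serves addr."""
    return (addr // 64) % 4

def td_for_address_quadrant_mode(addr: int) -> int:
    """Toy model of quadrant mode: the TD for an address is chosen only
    among the tiles of the quadrant whose controller serves that address."""
    q = owning_quadrant(addr)
    local_td = (addr // 64) % TILES_PER_QUADRANT   # stand-in intra-quadrant hash
    return q * TILES_PER_QUADRANT + local_td

# The TD always sits in the owning quadrant, keeping miss traffic local:
for addr in (0, 64, 4096, 1 << 20):
    assert td_for_address_quadrant_mode(addr) // TILES_PER_QUADRANT == owning_quadrant(addr)
```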
The sub-NUMA cluster modes SNC-4 and SNC-2 also partition the chip into four quadrants or two hemispheres and, in addition, expose these quadrants (hemispheres) to the operating system as NUMA nodes. In these modes, NUMA-aware software can pin its threads to the quadrant (hemisphere) containing the TDs for the data they use and allocate NUMA-local memory. Sub-NUMA clustering is the recommended mode of operation for NUMA-aware applications, i.e., applications that pin processing threads and their memory to the respective NUMA nodes.
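In practice, such pinning is often done from the command line with `numactl --cpunodebind=0 --membind=0 ./app`. A minimal sketch using only the Python standard library is below; the sysfs path is the standard Linux location for a node's CPU list, and `pin_to_numa_node` is a hypothetical helper name, not a KNL-specific API.

```python
import os

def parse_cpulist(text: str) -> set:
    """Parse a Linux cpulist string such as '0-3,8,10-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def pin_to_numa_node(node: int) -> None:
    """Pin the current process to the CPUs of one NUMA node (Linux only),
    so that in SNC-4/SNC-2 mode its threads stay in one quadrant/hemisphere.
    Memory should additionally be bound to the node (e.g., via numactl/libnuma)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        os.sched_setaffinity(0, parse_cpulist(f.read()))
```

Note that CPU pinning alone is not enough: memory must also be allocated on the same node (numactl's `--membind`, or libnuma's `numa_alloc_onnode`) for the traffic to stay NUMA-local.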
Knights Landing processors are more forgiving of applications sensitive to cache traffic than their predecessor, Knights Corner (KNC), thanks to this more sophisticated cache structure. Additional performance improvements in such applications may come from tuning their execution environment and parallel pattern for the clustering modes supported by KNL. For applications that treat the Knights Landing chip as an SMP, the quadrant and hemisphere modes may be used. For NUMA-aware applications, the sub-NUMA cluster modes (SNC-4 and SNC-2) may be used.
The integrated architecture of Knights Landing improves performance and lowers your costs by reducing bottlenecks and system complexity. Knights Landing delivers up to 490 GB/s of sustained memory bandwidth without the need for additional discrete memory cards, and 100 Gb/s I/O without the added cost and power needed for two discrete fabric adapters.
Supported by a comprehensive Intel roadmap, Knights Landing is a future-ready solution that maximizes your return on investment by using open standards for code that is flexible, portable and reusable.