Overview

The central processing unit (CPU) is the heart of any high performance computing system.  As such, it is vital to have a fundamental understanding of current microprocessor design trends in order to achieve efficient cluster design.  Understanding how these design trends improve HPC performance also helps to clear up certain misconceptions surrounding CPU performance.

 

High performance computing is typically concerned with double-precision floating-point calculations, which require not just a fast floating-point unit (FPU) but also the lowest possible latency to main memory.  The CPU's basic functions are to fetch, decode, execute and write back the instructions that make up a program.  Yet both instructions and data must be fetched from memory outside the core, so performance bottlenecks often manifest as limited bandwidth and high latency between the processors (cores) and memory.  Intensive floating-point calculations, being memory dependent, have always been bottlenecked by this memory latency.
 
The design goals of the Intel Xeon 5500 and AMD Opteron 2300 series chips were to address this memory latency by adding on-die integrated memory controllers, on-die caches and on-die interconnects.  AMD and Intel now both offer multi-core CPUs with dedicated L1 and L2 caches (and increased cache sizes) per core.  Additionally, the on-die L3 caches are now dynamically linked and shared by all of the cores.  High-speed interconnects between cores, memory controllers, external memory and I/O hubs are also characteristic of the Intel Xeon 5500 and AMD Opteron 2300 series chips.  All of these features help to address and improve memory latency in addition to improving the overall performance of the processors.

 

 

Microprocessor Design
 
The Intel Xeon 5500 and AMD Opteron 2300 series use a multi-core architecture built on a 45nm manufacturing process, which allows for higher core densities on the chip, while the close physical proximity of the transistors increases their switching speeds.  This allows higher core and bus clock frequencies within the same power and thermal envelopes as the previous generation of 65nm processors.
 
The shorter signal distances achieved in these 45nm chips result in less signal degradation and improved signal quality, and therefore improved cache coherency and cache snooping performance.  The decrease in signal distance and increase in signal quality also help to greatly increase the energy efficiency of these CPUs.  Increased performance, therefore, does not come at the cost of increased power requirements for these processors.
 

Additional features in the AMD and Intel 45nm chips include turning off portions of the chip that are not being used (clock gating), clocking cores independently and dynamically as demand necessitates (dynamic overclocking), and turning off the floating-point unit (FPU) during integer-heavy calculations.  Individual architectural designs of the AMD Opteron 2300 series and Intel Xeon 5500 series are further detailed in the AMD and Intel sections.

 


Performance
 
One common misconception regarding CPU performance is the difference between "peak" and "sustained" performance.  Peak performance numbers are the ones commonly quoted in benchmarks, yet these measurements can be deceiving.  They are obtained on highly tuned machines running highly tuned benchmark applications.  Such results can be informative in some circumstances, but you will rarely be able to achieve this level of performance on your own system using standard codes.
 
Sustained performance is the more valid measurement when running multiple high performance computing applications.  The peak versus sustained argument can be applied to many different performance metrics, such as clock rate, work per cycle, pipelining, number of cores, and number of processors/sockets.  Core throughput describes how effective a single core is at a given application.  There are many different ways to measure this, but most involve the clock rate or frequency of the processor.  While this can be a good way to judge processors within the same family or class, it generally does not work when comparing different manufacturers or different architectures.
 
The performance of any processor or core can be estimated by multiplying the average work per cycle by the clock rate or frequency.  The average work per cycle can depend on many things, such as the micro-architecture, how the processor is configured in the BIOS, and which operating system is installed.  Another commonly used metric for performance measurement is the number of concurrent stages the processor (core) can perform.  This is called "pipelining".
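The relationship between work per cycle and clock rate can be sketched as a short calculation.  The function name and all figures below are hypothetical illustration values, not measurements of any real processor.

```python
def throughput(work_per_cycle: float, clock_hz: float) -> float:
    """Estimated results per second: average work per cycle times clock rate."""
    return work_per_cycle * clock_hz


# Two hypothetical cores (assumed values, for illustration only):
# core A has the higher clock rate, core B does more work per cycle.
core_a = throughput(work_per_cycle=1.0, clock_hz=3.2e9)
core_b = throughput(work_per_cycle=2.0, clock_hz=2.4e9)

print(f"core A: {core_a / 1e9:.1f} billion results per second")
print(f"core B: {core_b / 1e9:.1f} billion results per second")
```

In this made-up case the core with the lower clock rate comes out ahead, which is why clock rate alone is a poor basis for comparing different architectures.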
 

Pipelining is used in processors to allow overlapping execution of multiple instructions with the same circuitry.  The circuitry is usually divided up into stages, including instruction decoding, arithmetic, and register fetching stages, wherein each stage processes one instruction at a time.  While this can be difficult to conceptualize, it can be understood through the analogy of an assembly line.

 

Short-pipeline processors like the Intel Xeon 5500 or AMD Opteron 2300 series can be viewed as an assembly line with a small number of "worker cells".  Other processors, like the older Xeon 5000 series, have much longer pipelines, so they have many more worker cells.  The next concept to understand is that different cells of the pipeline perform different jobs, and some cells of the pipeline can be duplicated to increase speed.  To make it more complex, there can be more than one pipeline, or assembly line, in the core.  This is called a "superscalar" architecture, and most modern processors possess this capability.
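The assembly-line picture can be put into numbers with a minimal sketch.  It assumes an idealized pipeline in which every stage takes exactly one cycle and nothing ever stalls; the function name, stage count and instruction count are made up for illustration.

```python
import math


def ideal_cycles(instructions: int, stages: int, width: int = 1) -> int:
    """Cycles needed to finish `instructions` on an idealized pipeline.

    stages: number of "worker cells" each instruction passes through.
    width:  number of parallel pipelines (1 = scalar, >1 = superscalar).
    Assumes one cycle per stage and no stalls or flushes.
    """
    if instructions == 0:
        return 0
    # The first group takes `stages` cycles to travel down the line;
    # after that, one group of `width` instructions finishes every cycle.
    groups = math.ceil(instructions / width)
    return stages + groups - 1


n, stages = 1_000, 5  # hypothetical workload and pipeline depth

print("no overlap:      ", n * stages)
print("pipelined:       ", ideal_cycles(n, stages))
print("2-wide pipelined:", ideal_cycles(n, stages, width=2))
```

Once the line is full, the number of stages hardly matters in this idealized model; real processors fall short of it whenever the pipeline stalls or has to be flushed.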
 
So, initially it may seem that cores or processors with high clock rates possess the best performance, as they can produce the highest number of results in the shortest time.  In practice this is not always the case, as processors routinely have to flush their pipelines.  This is done for any number of reasons: perhaps a previous iteration has not completed in time, or stale data is present in the current pipeline.  Since processors with longer pipelines have to flush more data, they require more time to recover.  For this reason, processors with longer pipelines can have extremely high peak performance but may lack sustained performance.  In some real-world workloads, processors with shorter pipelines lose less work when they have to flush, are quicker to recover, and sometimes produce more results in a given time period.  This is one of the reasons Aspen Systems always recommends that you benchmark your code.
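The effect of pipeline length on sustained performance can be illustrated with a deliberately simple model.  It assumes that each flush costs roughly one pipeline's worth of cycles to refill, and that the longer pipeline runs at a higher clock rate.  The clock rates, stage counts and flush rates are hypothetical and do not describe any specific processor.

```python
def sustained_rate(clock_hz: float, stages: int, flush_fraction: float) -> float:
    """Instructions per second for one pipeline that is flushed periodically.

    flush_fraction: fraction of instructions that trigger a flush.
    Assumption: each flush costs `stages` extra cycles to refill the pipeline.
    """
    cycles_per_instruction = 1.0 + flush_fraction * stages
    return clock_hz / cycles_per_instruction


# Hypothetical designs: a long pipeline at a high clock rate
# versus a short pipeline at a lower clock rate.
for flush_fraction in (0.0, 0.02, 0.05):
    long_pipe = sustained_rate(3.6e9, stages=31, flush_fraction=flush_fraction)
    short_pipe = sustained_rate(2.8e9, stages=14, flush_fraction=flush_fraction)
    print(
        f"flush rate {flush_fraction:.0%}: "
        f"long {long_pipe / 1e9:.2f} G/s, short {short_pipe / 1e9:.2f} G/s"
    )
```

With no flushes the long pipeline wins on clock rate alone, but under this model the short pipeline overtakes it as flushes become more frequent.  Where the crossover falls for a real code can only be found by benchmarking.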

 

 

 

Number of Cores and Sockets

 

Multi-core computing is quite prevalent now.  Extreme examples of multi-core processors include the Xeon 7500 series, which makes it possible to house as many as 32 cores in a single system without the need for specialty hardware.  The number of cores and sockets that a computer has should be based on the inherent requirements of your applications.  Some applications are able to scale to thousands of processors in a single system, while others can only scale to eight or even as few as four.  This can be attributed to a variety of factors, such as memory bandwidth and the parallelism of the application.


Generally speaking, more cores are able to produce higher peak and sustained performance.  Yet it is important to balance the number of cores against the scalability of your code; otherwise, an excess number of cores can lead to an actual decrease in performance, as memory, I/O and communications bandwidth can quickly become saturated.  Therefore, larger multi-socket systems are more typically used for complex calculations, while smaller two-socket setups are used for more I/O and memory bound applications.
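One standard way to reason about how far a code will scale, not named in the text above, is Amdahl's law, which bounds the speedup by the fraction of the work that can run in parallel.  The sketch below uses a hypothetical parallel fraction of 90%.  It shows only the diminishing returns from adding cores and does not model the bandwidth saturation that can make performance actually decrease.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup from Amdahl's law.

    parallel_fraction: share of the run time that can be parallelized (0..1).
    cores:             number of cores working on the parallel part.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical application in which 90% of the work parallelizes.
for cores in (2, 4, 8, 32):
    print(f"{cores:>2} cores: at most {amdahl_speedup(0.90, cores):.2f}x faster")
```

Even in this optimistic model, going from 8 to 32 cores yields far less than a fourfold gain, so the core count should be matched to the measured scalability of the application.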
 
Whatever your processing requirements, Aspen Systems can design a solution to fit your needs.  We have built many thousands of systems on a variety of different processing technologies, so rest assured that Aspen Systems has the expertise and experience necessary to engineer your solution.  Talk to your sales engineer today for more information.
