Intel 5500 Nehalem

 

The Intel® Xeon® 5500/7500 series is the most radical change in processor design since the introduction of the Intel® Core microarchitecture in 2006.  Originally developed under the codename Nehalem, the Xeon® 5500/7500 provides 1.1x to 1.25x the single-threaded performance, or 1.2x to 2x the multi-threaded performance, at the same power level as the previous Xeon® 5400 series.  Core for core and clock for clock, Nehalem provides a 15%-20% performance increase over the Xeon® 5400 series, while consuming 30% less power for the same performance.

 


The Intel® Nehalem architectural design provides:

  • Greater scalability by dynamically managing cores, threads, interfaces and power.
  • Greater parallelism by allowing more instructions to be run "out of order", through a larger "out-of-order" window, scheduler and core buffers.
  • More efficient algorithms resulting in fewer stalls (dead cycles).
  • Faster synchronization primitives allowing better synchronization of threaded software. Legacy primitives are also faster, so current threaded programs will see a performance boost as well.
  • Faster handling of mispredictions by optimizing cases where predictions are wrong, so the overall penalty of branch mispredictions is lower than that of previous Intel® processor series.
  • Improved hardware prefetch and better load-store scheduling, further reducing memory latencies.

 

Specific highlights of the Xeon® 5500/7500 series:

  • Integrated memory controller supports three channels of DDR3-1333 memory, resulting in up to 32.0 GB/sec of memory bandwidth and up to 96GB of memory capacity per socket.  The controller's lower latency and higher memory bandwidth deliver strong performance for data-intensive applications.
  • Intel® QuickPath Interconnect is designed for increased bandwidth and low latency.  Each link can achieve data transfer speeds as high as 6.4 GT/s, or 25.6 GB/sec. Intel® QuickPath Interconnect allows memory bandwidth to scale almost linearly with the number of sockets, while minimizing I/O latencies.
  • Native quad-core and, eventually, eight-core designs, with up to eight processing cores per physical CPU chip.
  • Intel® Smart Cache provides a higher-performance, more efficient cache subsystem, and is optimized for industry-leading multi-threaded applications.  Additionally, the integrated 8MB L3 cache is shared across all cores, reducing latency and snoop traffic in the event of a miss.
  • Intel® Hyper-Threading technology enables highly threaded applications to get more work done in parallel.  With 16 threads available to the operating system in a dual-socket, quad-core system, multi-tasking becomes even easier.
  • Simultaneous 32- and 64-bit computing when using Intel® Extended Memory 64 Technology (EM64T).
  • Intel® Turbo Boost technology dynamically increases core operating frequency to match your workload, providing more performance when you need it the most.
  • Streaming SIMD Extensions 4.2 (SSE4.2) instructions are issued at a throughput rate of one per clock cycle, allowing a new level of processing efficiency with SSE4-optimized applications.
  • Intel® Virtualization Technology (VT) optimizes performance for in-system parallel execution.
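
The thread count quoted for Hyper-Threading can be reproduced with simple arithmetic. The sketch below assumes a dual-socket system built from quad-core processors with two hardware threads per core; the function name `logical_threads` is a hypothetical helper, not an Intel or operating system API.

```python
import os


def logical_threads(sockets: int, cores_per_socket: int, threads_per_core: int = 2) -> int:
    """Number of hardware threads the operating system can schedule on."""
    return sockets * cores_per_socket * threads_per_core


# Assumed configuration: two quad-core sockets with Hyper-Threading enabled.
expected = logical_threads(sockets=2, cores_per_socket=4)
print(f"expected logical CPUs: {expected}")

# os.cpu_count() reports what the OS on the current machine exposes.
# It can return None when the count cannot be determined.
print(f"logical CPUs on this machine: {os.cpu_count()}")
```

With Hyper-Threading disabled in the BIOS, `threads_per_core` would be 1 and the same system would expose eight logical CPUs.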

 

 

At the heart of the Nehalem Microarchitecture are four fundamental design changes.

 

The first and foremost is the on-chip memory controller, which allows for faster access and lower latency than is possible with the traditional front-side bus design. When combined with a new three-channel DDR3 memory controller, Nehalem-based CPUs have the highest memory bandwidth available.
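
As a rough illustration of where the peak memory bandwidth figure comes from, the sketch below multiplies the number of channels by the transfer rate and the width of a DDR3 channel (64 bits, or 8 bytes). The function name is hypothetical, and the result is a theoretical peak, not a measured value.

```python
def peak_memory_bandwidth_gb_s(channels: int, mega_transfers_per_s: float,
                               bus_width_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 1000 MB)."""
    return channels * mega_transfers_per_s * bus_width_bytes / 1000.0


# DDR3-1333 performs about 1333.33 million transfers per second.
DDR3_1333_MT_S = 4000 / 3

for channels in (1, 2, 3):
    bandwidth = peak_memory_bandwidth_gb_s(channels, DDR3_1333_MT_S)
    print(f"{channels} channel(s): {bandwidth:.1f} GB/s")
```

Sustained bandwidth in real applications is lower than this peak and depends on DIMM population and access patterns.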

 

The second fundamental design change was to move further beyond the front-side bus design by introducing the Intel® QuickPath Architecture.
 

With more powerful processors, a potential bottleneck can form any time a processor or its individual cores cannot fetch instructions and data as fast as they are being executed.  When this happens, performance slows.  To achieve optimal performance, it is therefore vital that the speeds at which the microprocessor and its execution cores access the system memory (and internal cache) are also improved.  The Intel® QuickPath Architecture achieves this with high-speed interconnects between microprocessors and external memory, and between the microprocessors and the I/O hub.  This point-to-point design improves scalability and eliminates the competition between processors for bandwidth.
 
The Intel® QuickPath Interconnect also handles memory transfers between sockets at a rate of up to 25.6 GB/s per link.  The Intel® QuickPath Interconnect employs a cache coherency protocol to keep memory and caching structures coherent during operation. It can also use both source snoop and home snoop behaviors to provide direct cache-to-cache transfers with the lowest possible latency. These factors are key when running memory-intensive code that uses MPI.
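
The 6.4 GT/s and 25.6 GB/sec figures quoted for QuickPath are related by simple arithmetic: each transfer carries 2 bytes of data in each direction, and a link runs in both directions at once. A minimal sketch of that calculation, with a hypothetical function name, might look like this:

```python
def qpi_link_bandwidth_gb_s(giga_transfers_per_s: float,
                            payload_bytes_per_transfer: int = 2,
                            directions: int = 2) -> float:
    """Theoretical peak bandwidth of one link in GB/s."""
    return giga_transfers_per_s * payload_bytes_per_transfer * directions


one_way = qpi_link_bandwidth_gb_s(6.4, directions=1)
both_ways = qpi_link_bandwidth_gb_s(6.4)
print(f"per direction: {one_way:.1f} GB/s, both directions: {both_ways:.1f} GB/s")
```

The headline figure counts both directions, so traffic flowing mostly one way sees roughly half of it.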
 
Intel® QuickPath Architecture Performance summary:

  • QPA uses links of up to 6.4 GT/s, delivering up to 25.6 GB/s of total bandwidth per link. (GT = gigatransfers, the number of data transfers or operations.)
  • QPA interconnects reduce the amount of communication required across the interface in multi-processor systems, and thereby deliver payloads faster.
  • QuickPath interconnects provide high reliability, availability and serviceability (RAS) through:

                     •  Implicit cyclic redundancy check (CRC) with link-level retry to ensure data quality and performance, without the performance penalty of additional cycles.
                     •  Self-healing links that avoid persistent errors by re-configuring themselves to use the good parts of the link.
                     •  Clock fail-over to automatically re-route clock function to a data lane in the event of a clock-pin failure.
                     •  Hot plug capability to support hot plugging of nodes such as processor cards.
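
The idea behind a CRC with link-level retry can be sketched in software. The example below is only an illustration of the general technique: it uses the CRC-32 from Python's `zlib` module as a stand-in, not the CRC or packet format that QuickPath implements in hardware, and all function names are hypothetical.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes):
    """Return the payload if the checksum matches, otherwise None."""
    payload, received_crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received_crc else None


def flip_bit(frame: bytes, bit: int) -> bytes:
    """Simulate a transient link error by flipping a single bit."""
    damaged = bytearray(frame)
    damaged[bit // 8] ^= 1 << (bit % 8)
    return bytes(damaged)


def send_with_retry(payload: bytes, corrupt_attempts: set, max_retries: int = 3):
    """Send one frame and resend it whenever the receiver's CRC check fails.

    corrupt_attempts holds the 0-based attempt numbers on which the
    simulated link damages the frame. Returns (payload, attempts_used).
    """
    frame = make_frame(payload)  # the sender keeps a copy until delivery succeeds
    for attempt in range(max_retries + 1):
        on_wire = flip_bit(frame, 5) if attempt in corrupt_attempts else frame
        received = check_frame(on_wire)
        if received is not None:
            return received, attempt + 1
    raise RuntimeError("link failed after all retries")


# The first transmission is corrupted; the retry delivers the data intact.
data, attempts = send_with_retry(b"cache line", corrupt_attempts={0})
print(data, attempts)
```

Because the retry happens at the link level, the layers above never see the corrupted frame; they only see a slightly later delivery.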


 

The third fundamental change that the Nehalem microarchitecture introduces is the creation of the Uncore part of the CPU. The Uncore houses the memory controller, the QuickPath interconnect controllers and the shared L3 cache. The Uncore shares a common clock that is separate from the core clock, thereby allowing better performance while at the same time providing a reduction in power consumption.  The Uncore makes it possible for Nehalem and future microarchitectures to scale to newer memory technologies, add or remove QuickPath Interconnect links, and adjust the size of the L3 cache without the need for a complete rebuild.

 


 

The fourth fundamental change was the modularization of the processor cores in the Nehalem microarchitecture. This allows the Nehalem and future microarchitectures to easily and quickly scale the number of cores available per socket. This change also paves the way for asymmetrical simultaneous multi-threading in future microprocessors.

 

The Nehalem microarchitecture, combined with Intel's other processor technologies (e.g. Intel® Virtualization Technology and SSE4.2), provides superior per-core performance, with the ability to meet today's and future computational needs.

 


 

Future Intel microarchitectures

While Nehalem will be available with 2, 4, 6, and 8 cores, its successor microarchitecture, Westmere (2009), will see a die shrink from 45nm down to 32nm.  Additionally, Westmere will offer native six-core and possibly dual-die 12-core processors, coupled with new AES instructions and improved encryption/decryption rates.

 

The Sandy Bridge (2010) microarchitecture will continue with the 32nm process and native 4 and 8 cores, while focusing on higher clock rates, on-die GPU integration, and continued gains in power efficiency.  Sandy Bridge might have to address advances in bus speed technology in order to increase its performance with off-chip components.  The successor to Sandy Bridge, Ivy Bridge (2011), may move to a 22nm process.

 

Stay tuned for future discussions on the exciting evolution of Intel's microarchitectures.


 

 

Aspen can design your cluster around Intel's latest processors, ensuring maximum performance for your applications.  Aspen can also help you compile your code to take advantage of Nehalem's new features.  No matter how complex your processing requirements, Aspen can help. Contact your sales engineer today!
