
Real World Technologies: Niagara II: The Hydra Returns

Written by David Kanter (Real World Technologies)   
Monday, 04 September 2006 16:00

History of Niagara

Two years ago at Hot Chips 16, Sun Microsystems disclosed Niagara, an innovative microprocessor and system design that represented a radical departure from traditional computer architectures. The roots of Niagara lie in Hydra, a research project under Professor Kunle Olukotun that was working on chip multiprocessing in the late 1990s. The Hydra project, much like the DEC Piranha, was targeted at workloads that were rich in thread level parallelism (TLP) but not instruction level parallelism (ILP), such as network processing or commercial server workloads. Both groups proposed sacrificing single threaded performance for the sake of maximizing the number of cores on a single die. After concluding the research project, Olukotun started Afara Websystems to commercialize the efforts of the Hydra project in a SPARC based implementation. Like many startups in the early part of this decade, Afara experienced cash flow difficulties, and was acquired by Sun Microsystems in 2002 for an undisclosed sum.

After the acquisition, the Afara design underwent minor adjustments to plug a hole in Sun’s product portfolio, and to target a 90nm Texas Instruments process. Niagara came to market under the UltraSPARC T1 moniker with much fanfare in late 2005. While each processor core in a Niagara system is rather unimpressive, collectively the system provides good performance for highly parallel workloads. Niagara based servers are marketed under the name Cool Threads, and run at low power by virtue of the low clockspeed (1-1.2GHz) and high degree of integration. Moreover, the system design is easier because the temperature and power variance across different workloads is very slight due to the simplicity and high utilization of each core.

While Niagara is a novel and highly efficient server MPU, the microarchitecture and underlying philosophy explicitly give up general purpose use in exchange for high performance on specific workloads. Niagara focuses on what many consider entry level applications: dynamic web serving (and encryption), mail, Java or lightweight database applications. While these target workloads constitute a large proportion of server unit shipments, they are under encroachment from (or already dominated by) x86 based servers using Windows or Linux. However, for many customers, the benefits that Niagara brings to the table, such as the popular, reliable and robust Solaris 10 operating system and low power consumption, are convincing. Niagara based systems are selling very well, and quite a few customers are first time Sun buyers rather than users upgrading their aging SPARC systems.


Figure 1 – High Level Comparison of Niagara I and II

This year at Hot Chips 18, Greg Grohoski of Sun revealed Niagara II, the successor to their line of highly threaded processors. Niagara II is designed for TI’s 65nm process and uses 1831 pins, 711 for I/O and the remainder for power and ground. Niagara II is philosophically similar to its predecessor; however, the designers concentrated on using the additional space to alter the trade-offs in the microarchitecture and go after broader markets. To some extent this is a tacit acknowledgement that Niagara’s designers faced some very difficult decisions and opted to remove (or at least postpone till the next generation) some features. Given that Niagara I is a 378mm² chip (which was 38mm² over target, after a diet) and dominated by logic, it is very likely that a much larger die would have caused yield problems, and hence some computational resources were removed or omitted.

The design objectives for Niagara II were to double the throughput and enhance single threaded performance while reducing or maintaining the same thermal and power envelope. These improvements largely came from doubling the thread count, increasing per core execution resources and overhauling the general system structure and integration.

Niagara II Execution Core

While Niagara II is largely a refinement of its predecessor, the changes to the microarchitecture are significant. At the heart of the MPU is a 64 bit, 8 threaded, scalar, in-order processor with a relatively short pipeline and limited speculative execution. Niagara II supports 48 bits of virtual addressing, and 40 bits of physical. Figure 2 below shows a detailed comparison of the cores in Niagara I and II.


Figure 2 – Niagara I and II Cores

The most noticeable Niagara II core changes are doubling the thread count, adding an execution pipe, and integrating a floating point unit. The first two improvements are the primary drivers for doubling performance, while the last will enable Niagara II to handle varied workloads (Niagara I was unable to handle workloads with much more than 1-3% floating point instructions). To accommodate these improvements, the basic pipeline for Niagara II gained a stage called “pick”, which selects up to 2 threads for execution from among the 8 threads.

In designing Niagara II, the architects were extremely careful and economical in their planning, which led to more complex internal arrangements. As Figure 2 indicates, the 8 threads in Niagara II are actually partitioned into two groups, each with its own pipeline, to simplify the design. While the thread grouping is static from the perspective of the hardware, the operating system can migrate threads between groups to ensure fairness. Each thread implements 8 register windows, requiring 160 integer registers (32 global, 64 local and 64 for passing parameters).
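The register count quoted above can be checked with a little arithmetic. The sketch below assumes the standard SPARC layout of 8 local registers per window plus 8 registers per window that are shared between adjacent windows for parameter passing; the constant and function names are illustrative and do not come from Sun.

```python
# Assumption: standard SPARC windowing, 8 locals per window and 8 in/out
# registers per window (one window's outs are the next window's ins).
WINDOWS = 8
GLOBAL_REGS = 32            # figure quoted in the article
LOCALS_PER_WINDOW = 8
SHARED_PER_WINDOW = 8       # parameter-passing (in/out) registers
THREADS_PER_CORE = 8


def integer_registers_per_thread(windows=WINDOWS):
    """Total integer registers one thread needs for its register windows."""
    local = windows * LOCALS_PER_WINDOW
    shared = windows * SHARED_PER_WINDOW
    return GLOBAL_REGS + local + shared


per_thread = integer_registers_per_thread()
per_core = per_thread * THREADS_PER_CORE
print(per_thread, per_core)
```

Under these assumptions each thread needs 160 registers, which matches the article, and a full 8 thread core needs 1280.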

The instruction fetch for Niagara II is only slightly modified. Niagara II statically predicts that branches will not be taken, and can speculatively execute past conditional branches with a relatively short 5 cycle mispredict penalty. First, the thread selection logic determines which threads are ready for instruction fetch. Unlike Niagara I, the fetch stage is decoupled from the pick stage. The goal of the instruction fetch is to keep each instruction buffer full, so the fetch selection policy is tailored to that objective. Events such as pipeline dependencies, cache misses and long latency instructions cause threads to go ‘inactive’. Among the active threads, a least recently fetched policy is used to fetch up to 4 instructions from a 32 byte line in the 16KB, 8 way associative L1I cache. The instruction cache also contains a simple prefetcher which can fetch the next sequential cache line.
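As a rough illustration of the least recently fetched policy, the following minimal sketch picks, among the active threads, the one whose last fetch is oldest. It is a software model with hypothetical names (`select_fetch_thread`, `simulate`), not a description of the actual selection logic, and it ignores instruction buffer occupancy.

```python
def select_fetch_thread(active, last_fetch_cycle):
    """Return the active thread that was fetched longest ago, or None."""
    candidates = [t for t in range(len(active)) if active[t]]
    if not candidates:
        return None
    # Ties (e.g. threads never fetched) resolve to the lowest thread id.
    return min(candidates, key=lambda t: last_fetch_cycle[t])


def simulate(active, cycles):
    """Run the policy for a number of cycles with a fixed active mask."""
    last_fetch_cycle = [-1] * len(active)   # -1 means never fetched
    order = []
    for cycle in range(cycles):
        thread = select_fetch_thread(active, last_fetch_cycle)
        order.append(thread)
        if thread is not None:
            last_fetch_cycle[thread] = cycle
    return order


# Thread 1 is inactive (say, waiting on a cache miss) and is never fetched.
print(simulate([True, False, True, True], 6))
```

With a fixed set of active threads the policy degenerates to round robin over those threads; it only behaves differently when threads go inactive and return.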

The instruction fetch is unified, so that a single ported cache can be used. After fetching, the threads are partitioned into two groups, each having its own set of instruction buffers. Each thread group has an instruction selector which picks a single instruction from the four buffers to send to the decoder for execution. The least recently used ‘ready’ thread is picked each cycle with a preference for non-speculative execution. Since the instruction selection is independent, structural hazards (i.e. two instructions trying to use the same resource at once) can be introduced. The decoder detects and resolves structural hazards by delaying one of the contending instructions. A single bit LRU counter is used to alternate which thread group is delayed, to ensure fairness and forward progress. Once decoded, instructions are issued to the functional units.
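The hazard arbitration described above can be sketched as a single toggling bit. In this hypothetical model each thread group presents the unit its picked instruction needs; the ALUs are private to each group, so only the shared LSU and FPU can conflict. The class and unit names are assumptions made for illustration.

```python
class Decoder:
    """Toy model of structural hazard arbitration between two thread groups."""

    SHARED_UNITS = {"lsu", "fpu"}   # one of each per core, per the article

    def __init__(self):
        self.favored = 0            # single bit: group that wins the next conflict

    def arbitrate(self, unit0, unit1):
        """Return the list of groups allowed to issue this cycle.

        unit0 and unit1 name the unit requested by each group's picked
        instruction, or None if that group picked nothing.
        """
        conflict = (
            unit0 is not None
            and unit0 == unit1
            and unit0 in self.SHARED_UNITS
        )
        if not conflict:
            return [g for g, u in enumerate((unit0, unit1)) if u is not None]
        winner = self.favored
        self.favored ^= 1           # the delayed group wins the next conflict
        return [winner]


decoder = Decoder()
print(decoder.arbitrate("alu", "alu"))   # private ALUs never conflict
print(decoder.arbitrate("lsu", "lsu"))   # conflict: one group is delayed
print(decoder.arbitrate("lsu", "lsu"))   # conflict again: the other group is delayed
```

Flipping the bit on every conflict guarantees that neither group can be delayed twice in a row, which is the fairness and forward progress property the text mentions.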

Each thread group has its own private ALU, which is used for both address generation and most computation. Almost all instructions are issued directly to the ALU, but floating point and memory operations will flow through to their respective execution units. Each core shares a single FPU and a single LSU between all 8 threads. The FPU is fed by a 256 entry 64 bit register file, with 32 registers per thread. The FPU supports Sun’s VIS 2.0 SIMD extensions and is fully pipelined, with a 12 stage basic pipeline, except for square root and divide (which can execute simultaneously with pipelined FP instructions from another thread). The FPU also handles more complex integer instructions such as multiply, divide and population count, while in Niagara I, these were handled by a dedicated ALU. Again, this is an instance of avoiding unnecessary replication; more complex integer instructions are just not common enough to merit dedicated hardware.

The SPU is a cryptographic coprocessor operating at full core frequency. The SPU handles common cryptographic algorithms such as SHA, MD5, AES, DES, etc. It contains a modular arithmetic unit (MAU), a cipher unit and a DMA engine to access memory. The MAU shares the FPU’s multiplier and is used for RSA and for binary and integer modular polynomial elliptic curve calculations, which are staples of encryption workloads. For storage, the MAU uses a 160 entry 64 bit scratchpad that can sustain two reads and one write per cycle. The bandwidth of the cipher and hash units was designed to match Niagara II’s dual 10 gigabit Ethernet controllers, enabling “free encryption”.

Niagara II Memory, Crossbar and IO

Naturally, when discussing a chip that focuses on memory level parallelism, the most important part is the memory subsystem, principally the Load Store Unit (LSU), L1D cache, the crossbar, the L2 cache and main memory. Figure 3 below compares the memory systems for Niagara I and II.


Figure 3 – Comparison of Niagara I and II Memory Hierarchies

As noted in the previous section, each thread group owns one ALU that also serves as an address generation unit to feed the LSU with requests. The LSU handles a single memory operation each cycle, and the decode stage is responsible for ensuring that no pipeline hazards occur as a result of contention. Niagara I pessimistically deactivated any thread requesting data from caches, assuming that such a request would miss in the L1D cache. One of the changes that improved single threaded performance in Niagara II was to assume that L1D cache requests would hit and keep the requesting thread active (with the appropriate recovery logic of course).

Niagara II maintains up to 4 page tables, each one supporting 8KB, 64KB, 4MB or 256MB pages, all of which can be cached by the ITLB and DTLB. Memory address translation for the LSU is handled by the 128 entry, fully associative data translation look-aside buffer. Misses to both the instruction and data TLBs are serviced by a hardware page table walker, which is another new addition to the microarchitecture. The page table walker can search the 4 page tables in three different modes: sequentially, in parallel, or according to a prediction based on the virtual address of the requested data.

The L1D cache itself is a single ported 8KB, 4 way set associative design with write-through to the L2 cache for coherency. Data cache fills can occur in parallel with stores to the L2 cache, enabling a single ported cache which lowers power consumption. The L1D cache is also equipped with a 64 entry store buffer (8 entries per thread) for scalability. The store buffer is drained opportunistically, so that there are fewer delays due to capacity constraints. The L1D cache supports a single outstanding miss per thread (since a cache miss causes a thread to go ‘inactive’), for a total of 8 per core and 64 per device. These cache misses are sent to the crossbar to be filled by the L2 cache or main memory.

All external data accesses by the cores go through the crossbar to reach the rest of the system including the L2 cache, memory and I/O. The crossbar port for each core has a 64 bit outbound lane for requests, and a 128 bit inbound data path. Each port has to satisfy requests from the hardware table walker, the cryptographic unit’s DMA engine and the L1D and L1I caches to the L2 caches, memory and I/O. Like all other shared resources in a multithreaded MPU, there is a fairness algorithm for access to the crossbar that balances the needs of all the different types of requests.

The L2 cache for Niagara II is a total of 4MB, spread across 8 banks. Each bank is 512KB and 16 way set associative, can handle an independent access and has a 128 bit outbound and a 64 bit inbound port on the crossbar. With so many threads in the system, hotspots are a significant concern in a shared resource like the L2 cache. The L2 cache is line interleaved across the 8 banks, which avoids many hot spot problems. One new technique used in Niagara II is software or operating system directed index hashing to disperse data between different sets within a cache to reduce contention or any problems caused by associativity and array size.

The L2 cache also connects to 4 dual channel FB-DIMM controllers, which will probably support 667MHz operation. Two L2 cache banks are paired with a dual channel FB-DIMM controller, so effectively each bank is supported by the full bandwidth of a FB-DIMM channel. An added benefit of this arrangement is that since each memory controller is connected to a pair of cache banks, the cache line interleaving also spreads data around to different memory channels.
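To make the interleaving concrete, here is a minimal sketch of how consecutive cache lines could map to banks and memory controllers. The article does not give the L2 line size or the exact address bits used, so the 64 byte line, the modulo mapping and the pairing of adjacent banks are all assumptions, and the software directed index hashing is ignored.

```python
LINE_BYTES = 64              # assumption: line size is not stated in the article
NUM_BANKS = 8                # from the article
BANKS_PER_CONTROLLER = 2     # from the article: two banks per FB-DIMM controller


def l2_bank(addr):
    """Bank holding the line for a physical address (assumed modulo mapping)."""
    return (addr // LINE_BYTES) % NUM_BANKS


def memory_controller(addr):
    """Controller behind that bank, assuming adjacent banks are paired."""
    return l2_bank(addr) // BANKS_PER_CONTROLLER


# Walk ten consecutive cache lines and show where each one lands.
for line in range(10):
    addr = line * LINE_BYTES
    print(hex(addr), l2_bank(addr), memory_controller(addr))
```

A sequential stream touches every bank and every controller once per 8 lines under this mapping, which is why line interleaving spreads load across both the cache banks and the memory channels.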

The I/O devices are all capable of DMA, but the crossbar is equipped with a port for the cores to read from I/O devices. Niagara II implements two built in 10/1 Gigabit Ethernet ports with packet classification and filtering and a x8 PCI Express port, presumably to be used for storage. By integrating the I/O devices on-die, Niagara II will save a fair amount of power, money and design complexity, compared to systems that use multi chip solutions. Handling 20 gigabits/s of Ethernet traffic is rather remarkable, as a single 10GbE port will overwhelm modern MPUs that do not use TCP/IP coprocessors or offload engines. This is another feat that is only possible because Sun owns the entire stack; hopefully the appropriate hooks are all in place, so that Linux will be able to achieve the same performance. If Sun's implementation works well, it will set the bar for other processors from server rivals Intel, AMD and IBM.

All together, the crossbar supports 8 data destinations (the SPARC cores) and 9 data sources (8 L2 cache banks, and I/O). Using the rumored 1.4GHz clock speed, that suggests 268.8GB/s of crossbar bandwidth. This is backed by an impressive 42.7GB/s (FBD-667) of memory bandwidth.
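One way to reproduce both bandwidth figures is shown below. The crossbar number falls out of the per core port widths given earlier (64 bits outbound, 128 bits inbound) at the rumored 1.4GHz. The memory number assumes that each FB-DIMM channel delivers 8 bytes of read data per transfer at 667 million transfers per second; that per channel figure is an assumption, not something stated in the article.

```python
CLOCK_HZ = 1.4e9             # rumored clock speed
CORES = 8
INBOUND_BITS = 128           # per core crossbar port, data returning to the core
OUTBOUND_BITS = 64           # per core crossbar port, requests leaving the core

CONTROLLERS = 4
CHANNELS_PER_CONTROLLER = 2
TRANSFERS_PER_SEC = 667e6    # FBD-667
BYTES_PER_TRANSFER = 8       # assumption: 64 bit read data path per channel


def crossbar_bandwidth_gbs():
    """Aggregate crossbar bandwidth in GB/s, summing both directions."""
    bytes_per_cycle = CORES * (INBOUND_BITS + OUTBOUND_BITS) // 8
    return bytes_per_cycle * CLOCK_HZ / 1e9


def memory_bandwidth_gbs():
    """Peak read bandwidth in GB/s across all FB-DIMM channels."""
    channels = CONTROLLERS * CHANNELS_PER_CONTROLLER
    return channels * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER / 1e9


print(round(crossbar_bandwidth_gbs(), 1))
print(round(memory_bandwidth_gbs(), 1))
```

Under these assumptions the crossbar total splits into 179.2GB/s toward the cores and 89.6GB/s away from them, and the memory figure rounds to the 42.7GB/s quoted above.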

One interesting note is that the MPU presented at Hot Chips will not support multiple processors in a system. However, the presenter indicated that there are no technical barriers to multiprocessor systems. Given the rumors of multisocket Niagara II systems in the future, the best explanation is that Sun chose to first focus on an easier to implement, debug and verify single socket version. Perhaps later, one of the ports on the crossbar will be outfitted with Hypertransport or a Sun proprietary interconnect to create larger systems.

RAS and Power Management

Niagara II is targeted for low power and employs extensive power management features. The first general principle the architects followed was to reduce the power cost of speculation. The microprocessor was designed to only speculate when the outcome was relatively predictable, and also to limit the extent of speculation, and hence the cost of maintaining state and recovering from misspeculation. Some of the previously mentioned examples were different page table walker patterns, static branch prediction and sequential instruction cache line prefetch. Software (operating through the OS and firmware) can also throttle the entire chip, by inserting bubbles in the decoders. Of course, this architectural technique relies on the processor being able to idle efficiently. To that end, many structures in the MPU were clock gated, including many control blocks, data paths and data arrays.

RAS was another key focus area for the Niagara II architects. Generally, error rates increase exponentially as the process geometry decreases, which means that as MPUs scale down to 65nm and lower, more and more protection is necessary. Since Sun controls the MPU, OS and firmware, they rely heavily on cooperation between hardware and software to detect and correct errors. The integer and FP register files are ECC protected, along with the store buffer data, trap stack and certain other arrays. Parity is used for the data and instruction cache tags and data, as well as the TLBs, the modular arithmetic scratchpad memory and the store buffer addresses. Errors in the caches are handled by refetching bad data, while other errors are dealt with in software. One of the novel error correction techniques used in Niagara II is dynamic thread and core management. If a thread experiences unusually frequent errors, it can be disabled without any downtime. Since each individual thread contributes relatively little performance, any degradation from offlining a single thread will be minor. If errors still persist, the impacted cores can be offlined in a similar fashion. A floorplan of Niagara II is shown below.


Figure 4 – Niagara II Floorplan
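The escalation from offlining a thread to offlining a core, described in the RAS discussion above, might be modeled along the following lines. This is purely a sketch: the article gives no thresholds or policy details, so the limits, the class name and the counting scheme are all hypothetical.

```python
class CoreHealth:
    """Toy model of per thread error tracking with escalation to the core."""

    def __init__(self, threads=8, thread_limit=3, core_limit=10):
        # Both limits are hypothetical; the article does not specify any.
        self.errors = [0] * threads
        self.thread_online = [True] * threads
        self.core_online = True
        self.thread_limit = thread_limit
        self.core_limit = core_limit

    def report_error(self, thread):
        """Record one error and offline the thread or core if warranted."""
        if not self.core_online or not self.thread_online[thread]:
            return
        self.errors[thread] += 1
        if self.errors[thread] >= self.thread_limit:
            self.thread_online[thread] = False      # first step: drop the thread
        if sum(self.errors) >= self.core_limit:
            self.core_online = False                # errors persist: drop the core
            self.thread_online = [False] * len(self.thread_online)


core = CoreHealth()
for _ in range(3):
    core.report_error(2)
print(core.thread_online, core.core_online)
```

After three errors on thread 2 only that thread is disabled, leaving seven of the eight threads running, which illustrates why the performance cost of offlining a single thread is small.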

Commentary and Analysis

When assessing Niagara II, the thread partitioning stands out as a novel design decision. Most recent multithreaded designs had 2-4 threads (POWER5, Pentium 4 and Xeon, Itanium 2, EV8, Niagara I), which could be easily handled in a unified manner, so there was no need to group threads together. Since Sun is in new territory, it is hardly surprising that they were forced to use new techniques for scalability. Searching through 8 threads to issue two instructions with no structural hazards would have impacted clockspeed significantly for Niagara II. Architectural simulations revealed that the performance impact of partitioning (and deferred hazard detection in the decode stage) was very small for server workloads, so the design choice was straightforward. Assigning functional units to a specific set of threads creates a certain degree of asymmetry in multithreading, and is also fairly unusual. It will be interesting to see how other participants in the industry plan to handle higher levels of multithreading; although it appears that for now, most other companies will either use fewer than 8 threads, or different types of multithreading. Perhaps just as importantly, this blurring of the architectural lines likely presages future developments in Sun’s upcoming processor code-named Rock.

One of the biggest improvements in Niagara II was the enhanced floating point support. As a general rule of thumb, performance critical floating point applications are rich in ILP, which would make Niagara II a less than ideal processor. However, some workloads simply require a massive amount of bandwidth, and Niagara II is fairly impressive in that regard. Moreover, perhaps this will push Sun into researching techniques to convert ILP into TLP. Certainly, it should be easy to distribute loop iterations (with no carried dependencies) between different threads. More robust techniques along these lines could turn Niagara II into a very attractive HPC system and help the industry as a whole, although the financial merit of such an idea is unclear.
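Distributing loop iterations with no carried dependencies can be sketched as follows: the iteration space is cut into contiguous chunks and each chunk is handed to a separate thread. The function names are illustrative, and the example only shows the partitioning; CPython threads will not actually speed up this arithmetic because of the global interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_chunk(values, factor):
    """Loop body: each element is independent of every other element."""
    return [v * factor for v in values]


def parallel_scale(values, factor, workers=8):
    """Split the loop into contiguous chunks, one per worker thread."""
    if not values:
        return []
    chunk = -(-len(values) // workers)      # ceiling division
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in submission order, so the output order
        # matches the sequential loop.
        parts = pool.map(scale_chunk, chunks, [factor] * len(chunks))
        return [x for part in parts for x in part]


print(parallel_scale(list(range(20)), 2.0))
```

The default of 8 workers mirrors the 8 threads of a single Niagara II core. Because no iteration reads a value written by another, the chunks can run in any order and the result is identical to the sequential loop.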

Although performance numbers were not forthcoming, the design objectives seem feasible and relatively competitive for a processor slated to arrive in the third quarter of 2007. The improvements in the cores and system architecture for Niagara II are substantial and should yield a factor of two improvement in performance. If Sun can hit their targets, these goals would translate into ~320K tpmC and ~150K BOPS in SPECjbb2005. This could put Niagara II at performance parity with the competition, and a lead in performance/watt. Either way, it is encouraging to see that Sun will continue to invest in novel architectures.

Acknowledgements

I would like to thank the following individuals for their help in writing this article:

  • Greg Grohoski
  • Robert Golla
  • Alex Plant
  • Marc Tremblay
  • and of course, anyone else who I may have forgotten.

 

Read the original article: http://www.realworldtech.com/page.cfm?ArticleID=RWT090406012516

 