History of Niagara
Two years ago at Hot Chips 16, Sun Microsystems disclosed Niagara, an
innovative microprocessor and system design that represented a radical
departure from traditional computer architectures. The roots of Niagara
lie in Hydra, a research project under Professor Kunle Olukotun that
was working on chip multiprocessing in the late 1990s. The Hydra
project, much like the DEC Piranha, was targeted at workloads that were
rich in thread level parallelism (TLP), but not instruction level (ILP)
parallelism, such as network processing or commercial server workloads.
Both groups proposed sacrificing single threaded performance for the
sake of maximizing the number of cores on a single die. After
concluding the research project, Kunle started Afara Websystems to
commercialize the efforts of the Hydra project in a SPARC based
implementation. Like many start ups in the early part of this decade,
Afara experienced cash flow difficulties, and was acquired by Sun
Microsystems in 2002 for an undisclosed sum.
After the acquisition, the Afara design underwent minor
adjustments to plug a hole in Sun's product portfolio, and to target a
90nm Texas Instruments process. Niagara came to market under the
UltraSPARC T1 moniker with much fanfare in late 2005. While each
processor core in a Niagara system is rather unimpressive, collectively
the system provides good performance for highly parallel workloads.
Niagara based servers are marketed under the name Cool Threads, and run
at low power by virtue of the low clockspeed (1-1.2GHz) and high degree
of integration. Moreover, the system design is easier because the
temperature and power variance across different workloads is very
slight due to the simplicity and high utilization of each core.
While Niagara is a novel and highly efficient server MPU, the
microarchitecture and underlying philosophy explicitly give up general
purpose use in exchange for high performance on specific workloads.
Niagara focuses on what many consider entry level applications: dynamic
web serving (and encryption), mail, Java or lightweight database
applications. While these target workloads constitute a large
proportion of server unit shipments, they are under encroachment (or
dominated) by x86 based servers using Windows or Linux. However, for
many customers, the benefits that Niagara brings to the table, such as
the popular, reliable and robust Solaris 10 operating system and low
power consumption, are convincing. Niagara based systems are selling
very well and quite a few customers are first time Sun buyers, and not
just users upgrading their aging SPARC systems.

Figure 1 High Level Comparison of Niagara I and II
This year at Hot Chips 18, Greg Grohoski of Sun revealed Niagara II,
the successor to their line of highly threaded processors. Niagara II
is designed for TI's 65nm process and uses 1831 pins, 711 for I/O and
the remainder for power and ground. Niagara II is philosophically
similar to its predecessor; however, the designers concentrated on
using the additional space to alter the trade-offs in the
microarchitecture and go after broader markets. To some extent this is
a tacit acknowledgement that Niagara's designers faced some very
difficult decisions and opted to remove (or at least postpone till the
next generation) some features. Given that Niagara I is a 378mm² chip
(which was 38mm² over target, after a diet) and dominated by logic, it
is very likely that a much larger die would have caused yield problems
and hence some computational resources were removed or omitted.
The design objectives for Niagara II were to double the throughput and
enhance single threaded performance while reducing or maintaining the
same thermal and power envelope. These improvements largely came from
doubling the thread count, increasing per core execution resources and
overhauling the general system structure and integration.
Niagara II Execution Core
While Niagara II is largely a refinement of its predecessor, the
changes to the microarchitecture are significant. At the heart of the
MPU is a 64 bit, 8 threaded, scalar, in-order processor with a
relatively short pipeline and limited speculative execution. Niagara II
supports 48 bits of virtual addressing, and 40 bits of physical. Figure
2 below shows a detailed comparison of the cores in Niagara I and II.

Figure 2 Niagara I and II Cores
The most noticeable Niagara II core changes are doubling the thread
count, adding an execution pipe, and integrating a floating point unit.
The first two improvements are the primary drivers for doubling
performance, while the last will enable Niagara II to handle varied
workloads (Niagara I was unable to handle workloads with much more than
1-3% floating point instructions). To accommodate these improvements,
the basic pipeline for Niagara II gained an additional stage, called
pick, which selects up to 2 threads for execution from among the 8
threads.
In designing Niagara II, the architects were extremely careful
and economical in their planning, which led to more complex internal
arrangements. As Figure 2 indicates, the 8 threads in Niagara II are
actually partitioned into two groups, each with its own pipeline, to
simplify the design. While the thread grouping is static from the
perspective of the hardware, the operating system can migrate threads
between groups to ensure fairness. Each thread implements 8 register
windows, requiring 160 integer registers (32 global, 64 local and 64
for passing parameters).
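The register count quoted above can be checked with a little arithmetic. The sketch below is only a sanity check of the figures in the text; the per-window split (8 locals and 8 parameter-passing registers per window) is an assumption derived by dividing the quoted totals by the 8 windows, and the per-core figure is simply the per-thread figure multiplied by 8 threads.

```python
# Sanity check of the integer register count for one hardware thread.
WINDOWS = 8             # register windows per thread (from the text)
GLOBALS = 32            # global registers per thread (from the text)
LOCALS_PER_WINDOW = 8   # assumption: 64 locals / 8 windows
PARAMS_PER_WINDOW = 8   # assumption: 64 parameter registers / 8 windows
THREADS_PER_CORE = 8    # from the text


def integer_registers_per_thread() -> int:
    """Globals plus the locals and parameter registers of every window."""
    locals_total = WINDOWS * LOCALS_PER_WINDOW
    params_total = WINDOWS * PARAMS_PER_WINDOW
    return GLOBALS + locals_total + params_total


def integer_registers_per_core() -> int:
    """Derived figure: every thread has its own full set of registers."""
    return THREADS_PER_CORE * integer_registers_per_thread()


if __name__ == "__main__":
    print(integer_registers_per_thread())  # 160
    print(integer_registers_per_core())    # 1280
```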
The instruction fetch for Niagara II is only slightly modified.
Niagara II statically predicts that branches will not be taken, and can
speculatively execute past conditional branches with a relatively short
5 cycle mispredict penalty. First, the thread selection logic
determines which threads are ready for instruction fetch. Unlike
Niagara I, the fetch stage is decoupled from the pick stage. The goal
of the instruction fetch is to keep each instruction buffer full, so
the fetch selection policy is tailored to that objective. Events such
as pipeline dependencies, cache misses and long latency instructions
cause threads to go inactive. Among the active threads, a least
recently fetched policy is used to fetch up to 4 instructions from a 32
byte line in the 16KB, 8 way associative L1I cache. The instruction
cache also contains a simple prefetcher which can fetch the next
sequential cache line.
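The least recently fetched policy described above can be sketched in a few lines. This is a minimal software model, not Sun's logic: the class name `FetchSelector` and the list-based ordering are hypothetical, and the hardware almost certainly uses a much more compact encoding. The cache geometry constant is derived from the 16KB, 8 way, 32 byte line figures in the text.

```python
from typing import List, Optional, Set

# L1I geometry from the text: 16KB, 8 way associative, 32 byte lines.
L1I_SETS = (16 * 1024) // (8 * 32)  # 64 sets


class FetchSelector:
    """Picks the active thread that was fetched least recently."""

    def __init__(self, num_threads: int = 8) -> None:
        # order[0] is the thread whose last fetch is oldest.
        self.order: List[int] = list(range(num_threads))

    def select(self, active: Set[int]) -> Optional[int]:
        """Return the thread to fetch for this cycle, or None if all are inactive."""
        for tid in self.order:
            if tid in active:
                # Move the chosen thread to the most recently fetched position.
                self.order.remove(tid)
                self.order.append(tid)
                return tid
        return None


if __name__ == "__main__":
    sel = FetchSelector()
    # Threads 3-7 are inactive (for example, waiting on cache misses).
    print([sel.select({0, 1, 2}) for _ in range(4)])  # [0, 1, 2, 0]
```

Inactive threads keep their place in the ordering, so a thread that returns from a long miss is near the front of the queue when it becomes active again.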
The instruction fetch is unified, so that a single ported cache
can be used. After fetching, the threads are partitioned into two
groups, each having its own set of instruction buffers. Each thread
group has an instruction selector which picks a single instruction from
the four buffers to send to the decoder for execution. The least
recently used ready thread is picked each cycle with a preference for
non-speculative execution. Since the instruction selection is
independent, structural hazards (i.e. two instructions trying to use
the same resource at once) can be introduced. The decoder detects and
resolves structural hazards by delaying one of the contending
instructions. A single bit LRU counter is used to alternate which
thread group is delayed, to ensure fairness and forward progress. Once
decoded, instructions are issued to the functional units.
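The structural hazard resolution in the decoder can be illustrated with a small model. This is a sketch under stated assumptions: the class name `HazardArbiter`, the string labels for the shared units, and the exact meaning of the single bit (here it names the group that wins the next conflict) are all hypothetical; the text only says that one bit alternates which thread group is delayed.

```python
from typing import List, Optional


class HazardArbiter:
    """Alternates which thread group is delayed when both want the same unit."""

    def __init__(self) -> None:
        self.favored = 0  # single bit: the group that wins the next conflict

    def resolve(self, need0: Optional[str], need1: Optional[str]) -> List[int]:
        """Return the thread groups allowed to issue this cycle.

        need0 and need1 name the shared unit (for example "LSU" or "FPU")
        required by the instruction picked from each group, or None when the
        instruction only needs the group's private ALU.
        """
        if need0 is not None and need0 == need1:
            winner = self.favored
            self.favored ^= 1  # the other group wins the next conflict
            return [winner]
        return [0, 1]


if __name__ == "__main__":
    arb = HazardArbiter()
    print(arb.resolve("LSU", "LSU"))  # [0]    group 1 is delayed
    print(arb.resolve("LSU", "LSU"))  # [1]    group 0 is delayed
    print(arb.resolve("LSU", "FPU"))  # [0, 1] no conflict
```

Flipping the bit only on an actual conflict guarantees that neither group can be delayed twice in a row, which is the fairness and forward progress property the text describes.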
Each thread group has its own private ALU, which is used for both
address generation and most computation. Almost all
instructions are issued directly to the ALU, but floating point and
memory operations will flow through to their respective execution
units. Each core shares a single FPU and a single LSU between all 8
threads. The FPU is fed by a 256 entry 64 bit register file, with 32
registers per thread. The FPU supports Sun's VIS 2.0 SIMD extensions
and has a 12 stage basic pipeline. It is fully pipelined, except for
square root and divide (which can execute simultaneously with pipelined
FP instructions from another thread). The FPU also handles more complex
integer instructions such as multiply, divide and population count,
while in Niagara I, these were handled by a dedicated ALU. Again, this is an
instance of avoiding unnecessary replication; more complex integer
instructions are just not common enough to merit dedicated hardware.
The SPU is a cryptographic coprocessor operating at full core
frequency. The SPU handles common cryptographic algorithms such as SHA,
MD5, AES, DES, etc. It contains a modular arithmetic unit (MAU), a
cipher unit and a DMA engine to access memory. The MAU shares the FPU's
multiplier and is used for RSA and for binary polynomial and integer
modular elliptic curve calculations, which are staples of encryption
workloads. For storage, the MAU uses a 160 entry 64 bit scratchpad that
can sustain two reads and one write per cycle. The bandwidth of the
cipher and hash unit was designed to match Niagara II's dual 10
gigabit Ethernet controllers, making encryption effectively free.
Niagara II Memory, Crossbar and IO
Naturally, when discussing a chip that focuses on memory level
parallelism, the most important part is the memory subsystem,
principally the Load Store Unit (LSU), L1D cache, the crossbar, the L2
cache and main memory. Figure 3 below compares the memory systems for
Niagara I and II.

Figure 3 Comparison of Niagara I and II Memory Hierarchies
As noted in the previous section, each thread group owns one ALU
that also serves as an address generation unit to feed the LSU with
requests. The LSU handles a single memory operation each cycle, and the
decode stage is responsible for ensuring that no pipeline hazards occur
as a result of contention. Niagara I pessimistically deactivated any
thread requesting data from caches, assuming that such a request would
miss in the L1D cache. One of the changes that improved single threaded
performance in Niagara II was to assume that L1D cache requests would
hit and keep the requesting thread active (with the appropriate
recovery logic of course).
Niagara II maintains up to 4 page tables, each one supporting
8KB, 64KB, 4MB or 256MB pages, all of which can be cached by the ITLB
and DTLB. Memory address translation for the LSU is handled by the 128
entry, fully associative data translation look-aside buffer. Misses to
both the instruction and data TLBs are serviced by a hardware page
table walker, which is another new addition to the microarchitecture.
The page table walker can search the 4 page tables in three different
modes: sequentially, in parallel, or according to a prediction based on
the virtual address of the requested data.
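The trade-off between the three search modes can be made concrete with a toy model. Everything below is an illustrative assumption rather than the actual walker: each table is modelled as a dictionary keyed by virtual page number, one table is assumed per page size, the `predict` callback stands in for whatever address-based predictor the hardware uses, and a wrong prediction is assumed to fall back to the remaining tables in order. Each function returns the translation together with the number of table probes it issued.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Page sizes from the text: 8KB, 64KB, 4MB, 256MB (as shift amounts).
PAGE_SHIFTS = [13, 16, 22, 28]
Table = Dict[int, str]


def probe(table: Table, shift: int, vaddr: int) -> Optional[str]:
    """Look up the virtual page number of vaddr in one table."""
    return table.get(vaddr >> shift)


def walk_sequential(tables: List[Table], vaddr: int) -> Tuple[Optional[str], int]:
    """Probe one table after another; few probes, but serialized latency."""
    for n, (table, shift) in enumerate(zip(tables, PAGE_SHIFTS), start=1):
        hit = probe(table, shift, vaddr)
        if hit is not None:
            return hit, n
    return None, len(tables)


def walk_parallel(tables: List[Table], vaddr: int) -> Tuple[Optional[str], int]:
    """Probe every table at once; lowest latency, but always 4 probes."""
    hits = [probe(t, s, vaddr) for t, s in zip(tables, PAGE_SHIFTS)]
    found = next((h for h in hits if h is not None), None)
    return found, len(tables)


def walk_predicted(tables: List[Table], vaddr: int,
                   predict: Callable[[int], int]) -> Tuple[Optional[str], int]:
    """Probe the predicted table first, then the rest (assumed fallback)."""
    first = predict(vaddr)
    order = [first] + [i for i in range(len(tables)) if i != first]
    for n, i in enumerate(order, start=1):
        hit = probe(tables[i], PAGE_SHIFTS[i], vaddr)
        if hit is not None:
            return hit, n
    return None, len(tables)


if __name__ == "__main__":
    # A single 4MB mapping; the translation string is a placeholder.
    tables: List[Table] = [{}, {}, {0x1: "pa-4MB"}, {}]
    vaddr = 0x1 << 22
    print(walk_sequential(tables, vaddr))              # ('pa-4MB', 3)
    print(walk_parallel(tables, vaddr))                # ('pa-4MB', 4)
    print(walk_predicted(tables, vaddr, lambda v: 2))  # ('pa-4MB', 1)
```

The probe counts show why a good prediction is attractive for a power-conscious design: it approaches the latency of the parallel search while issuing as few memory requests as the best case of the sequential one.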
The L1D cache itself is a single ported 8KB, 4 way set
associative design with write-through to the L2 cache for coherency.
Data cache fills can occur in parallel with stores to the L2 cache,
enabling a single ported cache which lowers power consumption. The L1D
cache is also equipped with a 64 entry store buffer (8 entries per
thread) for scalability. The store buffer is drained opportunistically,
so that there are fewer delays due to capacity constraints. The L1D
cache supports a single outstanding miss per thread (since a cache miss
causes a thread to go inactive), for a total of 8 per core and 64 per
device. These cache misses are sent to the crossbar to be filled by the
L2 cache or main memory.
All external data accesses by the cores go through the crossbar
to reach the rest of the system including the L2 cache, memory and I/O.
The crossbar port for each core has a 64 bit outbound lane for
requests, and a 128 bit inbound data path. The crossbar port for each
core has to satisfy requests from the hardware table walker, the
cryptographic unit's DMA engine and the L1D and L1I caches to the L2
caches, memory and I/O. Like all other shared resources in a
multithreaded MPU, there is a fairness algorithm for access to the
crossbar that balances the needs of all the different types of
requests.
The L2 cache for Niagara II is a total of 4MB, spread across 8 banks.
Each bank is 512KB and 16 way set associative, can handle an
independent access and has a 128 bit outbound and a 64 bit inbound port
on the crossbar. With so many threads in the system, hotspots are a
significant concern in a shared resource like the L2 cache. The L2
cache is line interleaved across the 8 banks, which avoids many hot
spot problems. One new technique used in Niagara II is software or
operating system directed index hashing to disperse data between
different sets within a cache to reduce contention or any problems
caused by associativity and array size.
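Line interleaving and index hashing are easy to show with address arithmetic. The sketch below assumes a 64 byte L2 line, which the text does not state, and uses a simple XOR of upper address bits as the hash; the real hash function was not disclosed, so `l2_set_hashed` is only a stand-in for the general idea.

```python
LINE_BYTES = 64               # assumption: L2 line size is not given in the text
NUM_BANKS = 8                 # from the text
BANK_BYTES = 512 * 1024       # from the text
WAYS = 16                     # from the text
SETS_PER_BANK = BANK_BYTES // (WAYS * LINE_BYTES)  # 512 under the assumption


def l2_bank(paddr: int) -> int:
    """Consecutive cache lines rotate through the 8 banks."""
    return (paddr // LINE_BYTES) % NUM_BANKS


def l2_set(paddr: int) -> int:
    """Plain set index inside a bank, taken from the next address bits."""
    return (paddr // (LINE_BYTES * NUM_BANKS)) % SETS_PER_BANK


def l2_set_hashed(paddr: int) -> int:
    """Hypothetical index hash: fold upper address bits into the set index."""
    line_in_bank = paddr // (LINE_BYTES * NUM_BANKS)
    upper = line_in_bank // SETS_PER_BANK
    return (line_in_bank ^ upper) % SETS_PER_BANK


if __name__ == "__main__":
    print([l2_bank(a) for a in range(0, 9 * LINE_BYTES, LINE_BYTES)])
    # [0, 1, 2, 3, 4, 5, 6, 7, 0]

    # Two addresses a power-of-two stride apart collide without hashing.
    stride = LINE_BYTES * NUM_BANKS * SETS_PER_BANK
    print(l2_set(0), l2_set(stride))                # 0 0
    print(l2_set_hashed(0), l2_set_hashed(stride))  # 0 1
```

Power-of-two strides are the classic way for many threads to pile onto one set, so spreading them with a hash reduces conflict misses without adding associativity.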
The L2 cache also connects to 4 dual channel FB-DIMM controllers, which
will probably support 667MHz operation. Two L2 cache banks are paired
with a dual channel FB-DIMM controller, so effectively each bank is
supported by the full bandwidth of a FB-DIMM channel. An added benefit
of this arrangement is that since each memory controller is connected
to a pair of cache banks, the cache line interleaving also spreads data
around to different memory channels.
The I/O devices are all capable of DMA, but the crossbar is
equipped with a port for the cores to read from I/O devices. Niagara II
implements two built in 10/1 Gigabit Ethernet ports with packet
classification and filtering and a x8 PCI Express port, presumably to
be used for storage. By integrating the I/O devices on-die, Niagara II
will save a fair amount of power, money and design complexity, compared
to systems that use multi chip solutions. Handling 20 gigabits/s of
Ethernet traffic is rather remarkable, as a single 10GbE port will
overwhelm modern MPUs that do not use TCP/IP coprocessors or offload
engines. This is another feat that is only possible because Sun owns
the entire stack; hopefully the appropriate hooks are all in place, so
that Linux will be able to achieve the same performance. If Sun's
implementation works well, it will set the bar for other processors
from server rivals Intel, AMD and IBM.
All together, the crossbar supports 8 data destinations (the SPARC
cores) and 9 data sources (8 L2 cache banks, and I/O). Using the
rumored 1.4GHz clock speed, that suggests 268.8GB/s of crossbar
bandwidth. This is backed by an impressive 42.7GB/s (FBD-667) of memory
bandwidth.
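Both bandwidth figures follow from numbers given earlier in the article. The sketch below reproduces them; the 1.4GHz clock is the rumored figure quoted above, and the memory calculation assumes that each FBD-667 channel delivers 8 bytes per transfer at 667 million transfers per second, which is an assumption about how the 42.7GB/s peak was derived.

```python
CORES = 8
CLOCK_GHZ = 1.4            # rumored clock speed quoted in the text
OUTBOUND_BYTES = 64 // 8   # 64 bit request lane per core
INBOUND_BYTES = 128 // 8   # 128 bit data path per core

CONTROLLERS = 4
CHANNELS_PER_CONTROLLER = 2
TRANSFERS_PER_SEC = 667e6  # FBD-667
BYTES_PER_TRANSFER = 8     # assumption: 64 bit wide data per transfer


def crossbar_bandwidth_gbs() -> float:
    """Sum of the inbound and outbound lanes of all 8 core ports."""
    return CORES * (OUTBOUND_BYTES + INBOUND_BYTES) * CLOCK_GHZ


def memory_bandwidth_gbs() -> float:
    """Peak bandwidth across all 8 FB-DIMM channels."""
    channels = CONTROLLERS * CHANNELS_PER_CONTROLLER
    return channels * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER / 1e9


if __name__ == "__main__":
    print(f"{crossbar_bandwidth_gbs():.1f} GB/s")  # 268.8 GB/s
    print(f"{memory_bandwidth_gbs():.1f} GB/s")    # 42.7 GB/s
```

Of the crossbar total, 179.2GB/s is data flowing into the cores and 89.6GB/s is requests flowing out.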
One interesting note is that the MPU presented at Hot Chips
will not support multiple processors in a system. However, the
presenter indicated that there are no technical barriers to
multiprocessor systems. Given the rumors of multisocket Niagara II
systems in the future, the best explanation is that Sun chose to first
focus on an easier to implement, debug and verify single socket
version. Perhaps later, one of the ports on the crossbar will be
outfitted with Hypertransport or a Sun proprietary interconnect to
create larger systems.
RAS and Power Management
Niagara II is targeted for low power and employs extensive power
management features. The first general principle the architects
followed was to reduce the power cost of speculation. The
microprocessor was designed to only speculate when the outcome was
relatively predictable, and also to limit the extent of speculation,
and hence the cost of maintaining state and recovering from
misspeculation. Some of the previously mentioned examples were
different page table walker patterns, static branch prediction and
sequential instruction cache line prefetch. Software (operating through
the OS and firmware) can also throttle the entire chip, by inserting
bubbles in the decoders. Of course, this architectural technique relies
on the processor being able to idle efficiently. To that end, many
structures in the MPU were clock gated, including many control blocks,
data paths and data arrays.
RAS was another key focus area for the Niagara II architects.
Generally, error rates increase exponentially as the process geometry
decreases, which means that as MPUs scale down to 65nm and lower, more
and more protection is necessary. Since Sun controls the MPU, OS and
firmware, they rely heavily on cooperation between hardware and
software to detect and correct errors. The integer and FP register
files are ECC protected, along with the store buffer data, trap stack
and certain other arrays. Parity is used for the data and instruction
cache tags and data, as well as the TLBs, the modular arithmetic
scratchpad memory and the store buffer addresses. Errors in the caches
are handled by refetching bad data, while other errors are dealt with
in software. One of the novel error correction techniques used in
Niagara II is dynamic thread and core management. If a thread
experiences unusually frequent errors, it can be disabled without any
downtime. Since each individual thread contributes relatively little
performance, any degradation from offlining a single thread will be
minor. If errors still persist, the impacted cores can be offlined in a
similar fashion. A floorplan of Niagara II is shown below.

Figure 4 Niagara II Floorplan
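The dynamic thread and core management described in this section amounts to a simple escalation policy, which can be sketched as follows. The class name `RasMonitor` and both thresholds are hypothetical; the presentation did not give the actual criteria, which live in Sun's firmware and operating system.

```python
from typing import Dict, Set, Tuple


class RasMonitor:
    """Offlines threads with frequent errors, then whole cores if errors persist."""

    def __init__(self, thread_limit: int = 3, core_limit: int = 2) -> None:
        self.thread_limit = thread_limit  # hypothetical: errors before a thread is offlined
        self.core_limit = core_limit      # hypothetical: offlined threads before a core is offlined
        self.errors: Dict[Tuple[int, int], int] = {}
        self.offline_threads: Set[Tuple[int, int]] = set()
        self.offline_cores: Set[int] = set()

    def report_error(self, core: int, thread: int) -> None:
        """Record one corrected error and escalate if a limit is reached."""
        if core in self.offline_cores:
            return
        key = (core, thread)
        self.errors[key] = self.errors.get(key, 0) + 1
        if self.errors[key] >= self.thread_limit:
            self.offline_threads.add(key)
        bad_threads = sum(1 for c, _ in self.offline_threads if c == core)
        if bad_threads >= self.core_limit:
            self.offline_cores.add(core)


if __name__ == "__main__":
    mon = RasMonitor()
    for _ in range(3):
        mon.report_error(core=0, thread=1)
    print(mon.offline_threads, mon.offline_cores)  # {(0, 1)} set()
    for _ in range(3):
        mon.report_error(core=0, thread=2)
    print(mon.offline_cores)  # {0}
```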
Commentary and Analysis
When assessing Niagara II, the thread partitioning stands out as a
novel design decision. Most recent multithreaded designs had 2-4
threads (POWER5, Pentium 4 and Xeon, Itanium 2, EV8, Niagara I), which
could be easily handled in a unified manner, so there was no need to
group threads together. Since Sun is in new territory, it is hardly
surprising that they were forced to use new techniques for scalability.
Searching through 8 threads to issue two instructions with no
structural hazards would have impacted clockspeed significantly for
Niagara II. Architectural simulations revealed that the performance
impact of partitioning (and deferred hazard detection in the decode
stage) was very small for server workloads, so the design choice was
straightforward. Assigning functional units to a specific set of
threads creates a certain degree of asymmetry in multithreading, and is
also fairly unusual. It will be interesting to see how other
participants in the industry plan to handle higher levels of
multithreading; although it appears that for now, most other companies
will either use fewer than 8 threads, or different types of
multithreading. Perhaps just as importantly, this blurring of the
architectural lines likely presages future developments in Sun's
upcoming processor code-named Rock.
One of the biggest improvements in Niagara II was the enhanced floating
point support. As a general rule of thumb, performance critical
floating point applications are rich in ILP, which would make Niagara
II a less than ideal processor. However, some workloads simply require
a massive amount of bandwidth, and Niagara II is fairly impressive in
that regard. Moreover, perhaps this will push Sun into researching
techniques to convert ILP into TLP. Certainly, it should be easy to
distribute loop iterations (with no carried dependencies) between
different threads. More robust techniques along these lines could turn
Niagara II into a very attractive HPC system and help the industry as a
whole, although the financial merit of such an idea is unclear.
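Distributing independent loop iterations between threads, as suggested above, looks roughly like this. The example is a sketch only: the function names are hypothetical, the loop is a simple `a * x[i] + y[i]` kernel chosen because no iteration depends on another, and CPython threads will not actually run this arithmetic in parallel, so the code shows the partitioning structure rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def saxpy_chunk(a: float, x: Sequence[float], y: Sequence[float],
                start: int, stop: int) -> List[float]:
    """Compute a * x[i] + y[i] for one contiguous block of iterations."""
    return [a * x[i] + y[i] for i in range(start, stop)]


def parallel_saxpy(a: float, x: Sequence[float], y: Sequence[float],
                   threads: int = 8) -> List[float]:
    """Split the iteration space into one block per thread and join the results."""
    n = len(x)
    chunk = max(1, (n + threads - 1) // threads)
    bounds = [(s, min(s + chunk, n)) for s in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() preserves block order, so the results concatenate correctly.
        parts = list(pool.map(lambda b: saxpy_chunk(a, x, y, b[0], b[1]), bounds))
    return [value for part in parts for value in part]


if __name__ == "__main__":
    print(parallel_saxpy(2, [1, 2, 3], [10, 20, 30], threads=2))  # [12, 24, 36]
```

The split is only this easy because there are no loop-carried dependencies; a loop in which each iteration consumes the previous result would need restructuring before it could be spread over threads.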
Although performance numbers were not forthcoming, the design
objectives seem feasible and relatively competitive for a processor
slated to arrive in the third quarter of 2007. The improvements in the
cores and system architecture for Niagara II are substantial and should
yield a factor of two improvement in performance. If Sun can hit their
targets, these goals would translate into ~320K tpmC and ~150K BOPS in
SPECjbb2005. This could put Niagara II at performance parity with the
competition, and give it a lead in performance/watt. Either way, it is
encouraging to see that Sun will continue to invest in novel
architectures.
Acknowledgements
I would like to thank the following individuals for their help in writing this article:
- Greg Grohoski
- Robert Golla
- Alex Plant
- Marc Tremblay
- and of course, anyone else who I may have forgotten.
Read the original article: http://www.realworldtech.com/page.cfm?ArticleID=RWT090406012516