SUN's UltraSparc T1, formerly known as Niagara, is much more than just
a new UltraSparc. It is the harbinger of a new generation of CPUs
that focus almost solely on Thread Level Parallelism. No fewer than 32
threads (independent parts of a program, or of different programs) can
be "in flight" on the chip at once. It is SUN's first implementation of
their Throughput Computing philosophy, and compared to what we are used
to in the AMD/Intel world, it is a pretty extreme architecture that
focuses on network and server performance.
Introduction
SUN's UltraSparc T1 is little less than a revolution in the server
world. How else would you describe a 72 W, 1.2 GHz chip that is almost
3 times as fast (in SPECweb2005) as four Xeon cores at 2.8 GHz, which
consume up to 300 W? Of course, there are a few snakes in the grass
too, as the T1 does not like every kind of server workload. In this
article, we explore the architecture, the principles behind it, and
how it performs.
Stubborn Server applications
The basic idea behind the UltraSparc T1 is that most modern superscalar
Out-Of-Order CPUs may be excellent for games, digital content creation
and scientific calculations, but they are not a good match for
commercial server loads.
These complex CPUs can decode from 3 (Opteron) up to 8 (Power 5)
instructions in parallel, put them in a buffer and try to issue them
across 9 or more execution units. In theory, these CPUs can decode,
issue, execute and retire from 3 (Opteron) up to 5 (IBM Power)
instructions per clock cycle. They have huge buffers (up to 200
instructions) to keep many instructions in flight.
Server workloads, however, cannot make good use of all this parallelism
for several reasons. The main reason is that commercial server loads
move a lot of data around and perform relatively little calculation on
that data. Moving a lot of data around means many accesses to memory,
which results in many cycles wasted while the CPU waits for the data to
arrive. As many different users query different parts of the database,
caching cannot be as efficient (low locality of reference). In recent
years, memory latency measured in CPU cycles has become worse, because
memory speed has increased much more slowly than CPU speed. Memory
latency is even worse on MP (Multi-Processor) systems, where it has
risen from a few tens of CPU cycles to 200-400 clock cycles. The second
reason is that many of the calculations performed on that data involve
data dependent (read: hard to predict) branches, which makes it even
harder to do a lot in parallel.
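To get a feel for why latency matters so much, here is a minimal sketch of a simple stall model: every cache miss adds the full memory latency to the cycle count. The base CPI and the miss rate are hypothetical values picked for illustration, not measurements from this article. Only the latencies (a few tens of cycles versus 200-400 cycles) come from the text above, and the model ignores any overlap of misses that an Out-Of-Order CPU can achieve.

```python
def effective_ipc(base_cpi, misses_per_instr, miss_latency):
    """Instructions per cycle when every miss stalls the CPU for miss_latency cycles."""
    return 1.0 / (base_cpi + misses_per_instr * miss_latency)


def stall_fraction(base_cpi, misses_per_instr, miss_latency):
    """Fraction of all cycles that are spent waiting on memory."""
    stall_cpi = misses_per_instr * miss_latency
    return stall_cpi / (base_cpi + stall_cpi)


# Hypothetical inputs (assumptions, not figures from the article):
BASE_CPI = 0.5           # 2 instructions per cycle when nothing misses
MISSES_PER_INSTR = 0.01  # 1 memory access in 100 instructions goes to DRAM

for latency in (30, 200, 400):  # cycles; the range mentioned in the text
    ipc = effective_ipc(BASE_CPI, MISSES_PER_INSTR, latency)
    stalled = stall_fraction(BASE_CPI, MISSES_PER_INSTR, latency)
    print(f"latency {latency:3d} cycles: IPC {ipc:.2f}, stalled {stalled:.0%}")
```

Even with these modest assumptions, the memory term quickly dominates the total once latency reaches a few hundred cycles, which is the effect the paragraph above describes.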
You might counter these two problems by eliminating branches through
predication and by incorporating very large caches. That is what the
Itanium family does, but even the mighty Itanium is not capable of
running those server loads at high speed, despite predication and
gigantic caches. Below, you can see Intel's own numbers for CPU
utilization on 3 different workloads.
So, while floating point intensive applications such as scientific
simulations and 3D rendering achieve relatively good parallelism on
superscalar CPUs, even the chip with the highest IPC is stalled 85% of
the time on Enterprise workloads (i.e. server loads).
The applications that can be found inside the SPEC Integer benchmark
are still rather compute-intensive compared to server applications.
Compression, FPGA circuit placement and routing, compiling and
interpreting, and computer visualization are representative of very
CPU intensive integer loads. On average, the best desktop CPUs such as
the Athlon 64 or Intel Dothan are capable of sustaining 0.8 to 1
instructions per clock cycle (IPC) in this benchmark, while the
Pentium 4 is around 0.5-0.7 IPC. Itanium is capable of 1.3-1.5 IPC.
Those may sound like very low numbers, but let us compare SPECint with
typical server loads. In the table below, you find how the 4-way
superscalar UltraSparc IIIi (USIIIi) does on the various benchmarks.
| Benchmark | IPC |
|-----------|-----|
| SPECint   | 0.9 |
| SPECjbb   | 0.5 |
| SPECweb   | 0.3 |
| TPC-C     | 0.2 |
Rather than focus on the absolute numbers, it is more important to
note that web applications achieve an IPC 3 times lower than CPU
intensive integer apps. OLTP databases (TPC-C) do even worse: the CPU
sustains on average 0.2 instructions per clock cycle, or 4.5 times
fewer than in SPECint. These numbers are no different for the Opteron
or Xeon. So despite Out-Of-Order execution, nifty branch prediction
schemes and big caches, commercial server loads utilize a very meagre
10 to 15% of the potential of modern CPUs.
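The ratios quoted above follow directly from the USIIIi table. As a quick check, the sketch below divides the SPECint IPC by the IPC of each benchmark. The IPC values are the ones from the table; the function and variable names are just illustrative.

```python
# IPC of the 4-way superscalar USIIIi, taken from the table above.
ipc = {"SPECint": 0.9, "SPECjbb": 0.5, "SPECweb": 0.3, "TPC-C": 0.2}


def slowdown_vs(baseline, table):
    """How many times lower each benchmark's IPC is than the baseline's."""
    base = table[baseline]
    return {name: base / value for name, value in table.items()}


ratios = slowdown_vs("SPECint", ipc)
for name, ratio in ratios.items():
    print(f"{name:8s} IPC {ipc[name]:.1f}  ({ratio:.1f}x lower than SPECint)")
```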
One possible solution is to focus on clock speed instead of trying to
process as many instructions as possible in parallel (ILP, Instruction
Level Parallelism). However, the long pipelines of such high-clocked
CPUs make the branch prediction problem worse, and power consumption
rises steeply, as we discussed in a previous article about dynamic
power and power leakage.
Thread Machine Gun
Besides hard-to-predict branches and high memory latency, server
applications on MP systems also get slowed down by high latency network
communication and by cache coherency traffic (keeping all the data
coherent across the different caches).
To summarize, the challenges and problems that server CPUs face are:
- Memory latency, load to load dependencies
- Branch misprediction
- Cache Coherency overhead
- Keeping Power consumption low
- Latency of the Network subsystem
So, how did Les Kohn, Dr. Marc Tremblay, Poonacha Kongetira,
Kathirgamar Aingaran and other engineers at SUN attack these problems?
Let us take a deeper look at Niagara, or the UltraSparc T1.
Memory latency is by far the worst problem, causing a typical
server CPU to be idle for 75% of the time. So, this is the first
problem that the SUN/Afara engineers attacked.
The 8 cores of the 64-bit T1 can together process 8 instructions per
cycle, each of a different thread, so you might think that it is just a
massive multi-core CPU. However, the register file of each core keeps
track of 4 different active thread contexts, which is where the total
of 32 threads comes from. This means that 4 threads are "kept alive"
all the time by storing the contents of the General Purpose Registers
(GPR), the different status registers and the instruction pointer
register (which points to the instruction that should be executed
next). Each core has a register file of no fewer than 640 64-bit
registers, 5.7 KB in size. That is pretty big for a register file, but
it can be accessed in 1 cycle.
Each core has only one pipeline. During every cycle, each core switches
between the 4 active thread contexts that share that pipeline. So at
each clock cycle, a different thread is scheduled on the pipeline in
round robin order; to put it more violently: it is a machine gun with
threads instead of bullets.
In a conventional CPU, such a switch between two threads would cause a
context switch where the contents of the different registers are copied
to the L1-cache, and this would result in many wasted CPU cycles when
switching from one thread to another. However, thanks to the large
register file, which keeps all thread state in registers, and special
Thread Select Logic, a switch between threads does not waste any CPU
cycles. The CPU can switch between the 4 active threads without any
penalty, without losing a cycle. This is called Fine Grained
Multi-threading or FMT.
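To illustrate why zero-cost switching between 4 threads helps, here is a toy simulation of one core with a single pipeline. It is a sketch under stated assumptions, not a model of the real Thread Select Logic: the miss interval and the miss latency are hypothetical, every instruction takes one issue slot, and a thread that is waiting on memory is simply skipped in the round robin order (an assumption of this sketch).

```python
def simulate_core(num_threads, cycles, instrs_between_misses, miss_latency):
    """Return pipeline utilization (issued instructions / cycles) for one core."""
    ready_at = [0] * num_threads    # first cycle at which each thread may issue
    since_miss = [0] * num_threads  # instructions issued since the last miss
    issued = 0
    last = num_threads - 1          # thread that issued most recently
    for cycle in range(cycles):
        # Round robin: start with the thread after the one picked last.
        for offset in range(1, num_threads + 1):
            t = (last + offset) % num_threads
            if ready_at[t] <= cycle:
                issued += 1         # switching threads costs no extra cycle
                since_miss[t] += 1
                if since_miss[t] == instrs_between_misses:
                    # This instruction misses the cache: park the thread.
                    since_miss[t] = 0
                    ready_at[t] = cycle + 1 + miss_latency
                last = t
                break
        # If no thread was ready, the pipeline idles during this cycle.
    return issued / cycles


# Hypothetical workload: a miss every 10 instructions, 30 cycles per miss.
one_thread = simulate_core(1, 4000, 10, 30)
four_threads = simulate_core(4, 4000, 10, 30)
print(f"1 thread : pipeline busy {one_thread:.0%} of the time")
print(f"4 threads: pipeline busy {four_threads:.0%} of the time")
```

With a single thread, the pipeline sits idle whenever that thread waits for memory. With 4 threads, another thread can fill many of those slots, so the same pipeline issues more instructions per cycle even though no individual thread runs faster.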
There is much more to this article.
Read the original article: http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2657&p=1