Standard
microprocessors increasingly dominate the HPC market, yet Cray, IBM,
SGI and others see a need to complement microprocessors with other
types of processors. At the HPC User Forum meeting in Denver this week,
a panel on the growing number of processor options was led by Richard
Walsh, technical specialist at the Army High Performance Computing
Research Center/Network Computing Services, Inc. HPCwire, a co-sponsor
of the HPC User Forum meetings, talked with Walsh. The views are his
own, not the panel's.
HPCwire:
A decade ago, the expectation in some circles was that commodity
microprocessors would eliminate the need for any other types of
processors in HPC. Commodity microprocessors have become the
mainstream. Is there still a need for other types of processors?
Walsh:
Implicit in your question are the evolving micro-architectural details
of the general-purpose microprocessor. The dominant design theme over
the ten years referred to has been clock period improvement accompanied
by a general widening of the processor to increase ILP
[instruction-level parallelism] and IPC [instructions per clock], and
to turn scalar into superscalar performance. Instruction re-order
buffers and larger, on-chip caches were further additions to support
this design theme. The need to preserve the x86 ISA and to directly
serve the personal computing market to a large extent dictated these
choices. None of these events took direct stock of the needs of HPC;
and yet, the commodity, general-purpose microprocessor has been the
computational engine, when combined with new interconnect technology
and the MPI parallel programming model, to take HPC into the massively
parallel era. Clock speed and price-performance have been the ultimate
aphrodisiac as system processor counts have multiplied dramatically in
HPC, and this is reflected in the fact that today approximately 72
percent of processors in the Top500 supercomputers live in so-called
commodity clusters.
Faster clock speed and more IPC at low price
can improve the performance of almost any application, but further easy
clock and IPC improvements are less likely. Here, even low price refers
only to up-front capital costs, which are shrinking on a
percentage-of-TCO [total cost of ownership] basis, as the typical large
HPC system now has from 1024 to 2048 processors. Power and cooling
bills for such systems over their lifetime are substantial compared to
the up-front price of the processors. The design question of today has
become, how does one enhance performance without faster, power-hungry
processors, without complicated circuitry to find "just-in-time" ILP
prior to execution, and still maintain low price? Of course, the answer
is to discover and use program parallelism of a different sort. The
general-purpose microprocessor industry has chosen multi-cores and
multi-threads (ignoring SSE for the moment). This is an
instruction-intensive approach, which again preserves the x86 ISA,
serves their dominant market, and maximizes the use of the industry's
talent and prior investments in micro-architectures they have already
designed.
Is there still a need for other types of processors?
Well, as I have suggested above, commodity microprocessors are becoming
"other types of processors" themselves, but how well this new
generation serves the HPC parallel performance sweet spot is another
question. I think most would agree that HPC applications are typically
data intensive and that their performance is limited by data latency
rather than instruction latency. The data latency problem is
exacerbated in HPC as processor speed outstrips memory speed and as
program data is partitioned across large, grid-like systems of
distributed nodes. Accepting that as true for this discussion, are
commodity, multi-threaded, general-purpose microprocessors all that are
needed by HPC going forward? I think the answer to this more specific
question is clearly no. Other commodity and non-commodity processor
designs will actually play an increasing role as co-processors, and in
large mixed-processor systems. This is supported when one looks at
where the HPC community has turned its attention and is making its
investments. I see three important, value-added design features in
these "other types of processors" that are not emphasized in today's
traditional general-purpose commodity designs or their ISAs, but that
have a future in HPC.
The first, which takes HPC "back to the
future," is pipelined, data-level parallelism (DLP) as variously
expressed today in the commodity GPU [graphics processing unit], the
perhaps-commodity IBM Cell, the custom Cray BlackWidow, and even FPGA
co-processors. As Seymour Cray knew, DLP-oriented instruction sets and
micro-architectures target the dominant parallel feature of HPC
applications, offer performance with instruction set and circuit
economy, and do so at relatively low clock and power requirements. The
GPU DLP-engine most easily meets the commodity price test with graphics
cards available today capable of 200 32-bit GFLOPS and priced at under
$500. The second design feature is global memory, data latency
remediation. This is expressed in latency-hiding designs like Cray's
Eldorado processor, in experimental designs capable of executing local
instruction parcels remotely (PIM-lite), and even in the global,
vector-memory operations of the Cray BlackWidow. Commodity
microprocessors have not been engineered to address HPC's global memory
latency problem.
Finally, there are the related design features
of polymorphism, heterogeneity, and re-configurability -- really the Holy
Grail of HPC performance. These respond to the fact that as a class HPC
applications are of mixed character, parallel and serial. A polymorphic
or modal processor and ISA allow a tiled-array of processing elements
to be flexibly configured to a particular application. A heterogeneous
system with a variety of processor or core types could distribute the
distinct parts of an application to the processors for which they are
best suited. In reconfigurable systems, the fixed microprocessor ISA is
abandoned altogether in favor of a dynamically configured Application
Specific Architecture (ASA) that provides near-perfect performance for
a key portion of an application kernel. These features are most
obviously expressed today in FPGA co-processors, but also in the RAW
and TRIPS experimental microprocessors, and will be part of future
mixed-architecture systems planned by Cray.
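
[Illustration, not part of the interview: because HPC applications mix parallel and serial work, the gain from sending one part of an application to a better-suited processor is bounded by the part left behind. A minimal sketch of that bound using Amdahl's law, which Walsh does not cite here. The function name, the kernel fractions, and the 50x kernel speedup are hypothetical values chosen only for illustration.]

```python
def offload_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's-law estimate of whole-application speedup.

    kernel_fraction: share of the original run time spent in the part
                     that is moved to a better-suited processor (0..1).
    kernel_speedup:  how many times faster that part runs once moved.
    Both inputs are assumptions supplied by the caller, not measurements.
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    remaining = 1.0 - kernel_fraction          # work left on the host CPU
    return 1.0 / (remaining + kernel_fraction / kernel_speedup)


if __name__ == "__main__":
    # Hypothetical 50x kernel speedup on a co-processor.
    for fraction in (0.50, 0.90, 0.99):
        print(f"kernel share {fraction:.2f}: "
              f"overall speedup {offload_speedup(fraction, 50.0):.1f}x")
```

Under these assumptions, even a very fast co-processor helps little unless the offloaded kernel accounts for most of the run time.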
So, looking ahead at
the next ten years, the new, more power-efficient, multi-core, general
purpose processors that are now forming the backbone of large, parallel
HPC systems will retain an important role, but they will be
increasingly supported in mixed-architecture environments by
special-purpose commodity and custom processors targeting, by design or
coincidence, the special requirements of HPC.
HPCwire:
What's the single biggest strength and weakness of each processor type:
microprocessors, FPGAs, vector, Cell and multithreaded processors?
Walsh:
Taking the list in order, the conventional microprocessor's strength
has been in attacking the HPC performance problem with a lowest common
denominator approach -- a dose of fast clock, low price, and ILP. The
product iteration rate and cutting-edge line widths should also be
mentioned. Today, from an HPC perspective, their weakness is in the
power required to maintain this approach and in the growing cost and
hardware complexity of O-o-O [out-of-order], superscalar designs. The
power remediating effects of die shrink have been reduced by the
non-linear increase in leakage power loss at today's production line
widths. The partial mismatch of today's multi-core, PC market-driven,
TLP design theme with the dominant data-parallel performance theme in
HPC applications is also a weakness.
FPGAs are just arriving on
the HPC scene. They have the strength of being able to work outside the
traditional, fixed-in-advance, ISA-limited performance regime by
providing Application Specific Architectures (ASAs) configured
"just-in-time" to produce near-perfect performance for a piece of the
application kernel. I have perhaps exaggerated above, but it is to make
the point that every kernel at a fixed clock has its own unique,
idealized, perfect-performance architecture in which latency is
overcome by data and circuit (or instruction) pre-placement, and all
hardware-imposed seriality has been removed. This is the performance
that FPGAs target and, in this ideal case, executing an application's
kernel on an FPGA is like firing a gun.
Historically, FPGAs have
been used primarily for integer or bit-manipulation. Programming for
full 64-bit, IEEE floating-point is now possible, but consumes more
circuit logic and remains more difficult. How much of a double
precision, floating-point kernel can be placed on a single FPGA chip is
currently HPC-performance limiting. When a whole kernel does fit, there
are often bandwidth limitations to and from the card/chip. FPGAs also
lack the things provided by the more conventional and evolved HPC
processing alternatives -- a tested, flexible, parallel-programming
environment that responds rapidly to its user community and, of course,
high clock rates. FPGA system vendors are working on the programming
environment, which has improved significantly in the last few years,
but it will take more time to fully integrate FPGA programming into the
HPC culture.
The venerable vector processor's great strength
is that its ISA most aptly and economically reflects the sweet spot of
HPC parallelism, DLP. As a bonus, in extending the vector concept into
memory (both local and remote on the Cray X1E), it elevates sustained
performance from the abysmal single digit percentage realm of the
commodity microprocessor and hides a great portion of the data latency
that would otherwise sour the sweetness of DLP. All this is done at
lower, more power-efficient clock speeds. Its weaknesses all relate
back to the way it clashes with HPC on commodity microprocessors.
First, general-purpose microprocessors consume poorly written code with
limited ill effect (a rising clock speeds all programs). Feed a vector
processor on the same diet and you will give its scalar unit
heartburn. This is because scalar processors inherit the slow clocks of
their vector companions, and because they are not designed by the same
army of electrical engineers that work for AMD and Intel. Scalar units
on vector processors have been invariably weak, and code modifications
to improve vector performance, when they are undertaken, often benefit
cache-based microprocessors to some extent as well. Vector processors
have survived and will continue to do so because well-written vector
code can deliver sustained performance that is 30, 40 or 50 percent or
more of peak.
The IBM Cell processor, like its GPU cousins, is
designed as a high performance graphics engine that will deliver over
200 32-bit, not-quite-IEEE GFLOPS at clock speeds of 3.2 GHz and up in
the Sony Playstation. Such high peak performance figures make great
marketing material for both IBM and the vendors of graphics cards. And
yet, Cell offers more to HPC than a high-end graphics card. It has
greater programming flexibility, true IEEE 64-bit floating-point, and
glue-less SMP capability that the typical graphics card lacks. This has
created a lot of potential interest in Cell in the HPC user community.
A dual-socket Cell node has a peak, 64-bit floating-point performance
of about 50 GFLOPS when its master PowerPC Processing Element (PPE) and
8 Synergistic Processing Element (SPE) cores are counted together. Its
SPEs can be flexibly utilized in a thread-level-parallel or a
data-stream-parallel mode. Cell's weaknesses include: a complicated
programming model that will at first require separately compiled
objects for the master PPE and slave SPEs, and a pthreads-like parallel
API; initial memory-per-socket limitations of 512 MB, due to channel
skew on its XDR RDRAM memory interface; and the question of whether it
will meet the predicted commodity, game-space price-points in the HPC
space.
Regarding threads, care must be taken to distinguish
between multi-cored-ness and multi-threaded-ness. The first is
fundamentally a hardware concept like that expressed in Intel's new
dual-core Woodcrest processors, which double core hardware, retain
exactly the same ISA, and do not offer hyper-threading. From this point
of view, multi-core intersects HPC space in much the same way that
multi-socket does. It provides another partially independent parallel
processing engine that can be applied to a parallel application,
whether via MPI or OpenMP. In theory, its design strength is that it
doubles (or quadruples) the parallel processing power available on the
same chip/node without doubling the clock or the investment in ILP
detecting circuitry. The primary weakness from the HPC point of view is
that code running on each core must work out of its own cache or share
bandwidth to memory -- like a Siamese twin with two torsos and one pair
of legs. Memory bandwidth, which is already often rate-limiting in HPC
codes, becomes potentially even more limiting with dual-core processors.
It
is more interesting to consider true multi-threaded designs whose
micro-architectural plan supports the pre-definition and positioning of
blocks of independent instructions (threads) to sustain the forward
progress of an application or an application mix when there is a bubble
in the processor's pipeline, or when either a data or instruction
latency event occurs. Such a simultaneous, multi-threaded (SMT) design
is enhanced by a multi-core, but does not require it. As an example,
the Cray Eldorado microprocessor pre-positions up to 128 thread
segments in processor-resident segment registers. Its strength is that
if, in any segment instruction queue, there is a latency event (a
memory, branch, or synch instruction), that thread segment is skipped
and forward progress is sustained elsewhere in another thread. Such a
machine reduces the program performance problem to the single goal of
hiding latency. It is worth noting that Eldorado is a single-core chip.
From
an HPC perspective, the potential weakness of this
thread/instruction-intensive approach is that the ratio of independent
executable instructions to latency-generating instructions has to be
high, or the raw latency of the program is exposed as a performance
slowdown. In many HPC applications on scalar processors, the number of
memory operations alone is often close to 45 percent of the instruction
mix. While some instruction-intensive HPC applications with little data
locality are suited for such a multi-threaded design (linked list
searches, graph based algorithms, even sparse matrix operations, etc.),
many are not. This design clearly has an advantage in the
instruction-intensive environment of web and other servers, where
sustaining the throughput of a large mix of jobs is the main goal.
Sun's new UltraSparc T1 processor, capable of running 32 threads
simultaneously, is designed to capitalize on this requirement.
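
[Illustration, not part of the interview: the dependence of latency hiding on the ratio of independent instructions to latency-generating instructions can be sketched with a common textbook model. It assumes one instruction issued per cycle, cost-free switching between threads, and threads that each alternate between a fixed run of useful cycles and a fixed stall. The 128 threads and the 45 percent memory-operation share come from the discussion above. The 500-cycle latency and all function names are hypothetical.]

```python
import math


def utilization(threads: int, run_cycles: float, latency_cycles: float) -> float:
    """Fraction of cycles doing useful work in a simplified multithreaded core.

    Each thread issues for run_cycles, then waits latency_cycles. With enough
    threads the waits are fully overlapped and utilization saturates at 1.0.
    """
    if threads < 1 or run_cycles <= 0 or latency_cycles < 0:
        raise ValueError("threads >= 1, run_cycles > 0, latency_cycles >= 0")
    return min(1.0, threads * run_cycles / (run_cycles + latency_cycles))


def threads_to_saturate(run_cycles: float, latency_cycles: float) -> int:
    """Smallest thread count at which this model reaches full utilization."""
    if run_cycles <= 0 or latency_cycles < 0:
        raise ValueError("run_cycles > 0, latency_cycles >= 0")
    return math.ceil((run_cycles + latency_cycles) / run_cycles)


if __name__ == "__main__":
    memory_op_share = 0.45                 # from the interview
    run_cycles = 1.0 / memory_op_share     # average issue run between memory ops
    latency_cycles = 500.0                 # hypothetical remote-memory latency
    for n in (1, 32, 128):
        print(n, "threads:", round(utilization(n, run_cycles, latency_cycles), 3))
    print("threads to saturate:", threads_to_saturate(run_cycles, latency_cycles))
```

In this model the shorter the run between latency events, the more threads are needed before the latency stops showing up as lost cycles.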
HPCwire:
A number of vendors plan to couple multiple processor types together,
tightly or loosely, in large-scale systems. They claim this form of
heterogeneous processing will be more efficient, because each
application, or portion of an application, can be sent to the processor
type that's best suited for it. What's your opinion?
Walsh:
The notion that applications run most efficiently on processors best
suited for them seems to be beyond question. Yet, in the idea's
self-evidence it stimulates the more interesting thought that takes it
to its limit -- every application will perform best on a custom
processor designed for it. This is the point on the horizon which gives
rise to your question and the mixed-architecture roadmaps being
promulgated by HPC vendors today. The real question is, as lowest
common denominator, clock-driven performance improvements fade, how and
how fast will heterogeneous processing capabilities become integral to
HPC? This is a technology horse race for which we all wish we could
forecast the outcome.
Trends in technology seldom spring from
the ether ab initio, and so it is with heterogeneous processing in HPC.
There is already a pre-history of mixed-architecture systems to help
answer the question above. The recent, rapid evolution of the graphics
processing units into powerful, floating-point engines has placed them
in mixed-architecture systems solving HPC problems. In mostly academic
settings, GPUs integrated into HPC clusters have already been used to
significantly speed up BLAS, FFT, CFD, and sequence analysis
applications. The rock-bottom dollar per GFLOP trajectory of the GPU
will ensure that this mixed-architecture HPC trend will continue for a
time, albeit weighed down by a still clumsy programming model, the lack
of 64-bit IEEE arithmetic, high power consumption, and a rigid DLP-only
architecture.
Mixed-architecture contenders that probably have
an advantage over the GPU are commodity and custom CPU-FPGA systems
such as those provided by Cray, SRC, and HPTi. Compared to GPUs, FPGAs
consume less power; they can perform an increasing number of 64-bit,
full IEEE GFLOPS; and they can be flexibly re-architected for each HPC
application kernel. These advantages and recent, significant efforts to
streamline their unfamiliar code-plus-circuit programming model give them
an advantage over GPUs. There are as many significant, early HPC
applications of mixed-architecture, CPU-FPGA technology to HPC problems
as there are of CPU-GPUs. Moreover, in the long run FPGAs are the
technology most likely to provide every application with a custom
processor designed for it.
The IBM Cell, as a heterogeneous,
multi-core processor, illustrates the difficulty of providing an
HPC-user-friendly parallel programming model for mixed-architecture
systems. The IBM Cell is unencumbered by the GPU's graphics programming
abstraction or the circuit definition requirements of the FPGA, and
benefits from having its mixed-architecture on a single chip. Yet,
IBM's papers on developing its optimizing, parallel compiler for the
Cell demonstrate the difficulty of the task. While the chip is ready,
its single-source parallel programming model is not. Major challenges
that the well-staffed IBM Cell compiler group has had to address
include implementing software branch prediction, developing a software
cache for the SPE's local store, presenting a unified memory
abstraction to the programmer, automating the generation of SIMD
instructions for both the SPEs and PPEs, etc. For all the progress that
has been made, HPC Cell programmers will need to work in dual-source
mode for the time being. Smaller, less well-funded and staffed
companies setting out to provide a single source look-and-feel for
multi-socket or multi-core, mixed-architecture systems should beware.
Regardless,
the short answer to the question is yes, heterogeneous systems offer
the prospect of better performance for HPC, but this will not be
realized broadly without substantial catalytic assists to the
programmer from companion parallel programming environments for
heterogeneous HPC systems.
HPCwire:
What are the main challenges involved in making heterogeneous systems
like this effective? Can it be done in, say, the next 4-5 years?
Walsh:
I think I have partially answered this question. The main challenge HPC
faces with heterogeneous systems is the absence of an HPC-familiar and
productive parallel programming environment to extract performance from
such systems. Heterogeneous systems, particularly those with
co-processors, add layers to the memory stack and create new data
partitioning and communication challenges for the parallel (MPI,
OpenMP, UPC, CAF) programmer. This is in addition to removing the
convenient simplification that all the processors working on a parallel
application are identical, and replacing it with a system in which they
can be radically different.
While there have already been some
programming successes with mixed-architectures (in academic settings in
particular) even without the wished-for HPC programming environments,
and early-adopter organizations, with large or time critical problems,
will invest capital and intellectual resources in this new HPC
technology, it will take as long to mature and integrate into the HPC
community as parallel programming has taken. If this is correct, then I
think your suggested 4 to 5 year timeframe is perhaps a bit optimistic.
Surely, by then significant parts of the HPC user community will be
comfortable with programming for heterogeneous systems, but
significant parts will not yet be comfortable, and some may still be
optimizing first-generation, MPI-parallel versions of their codes.
HPCwire:
With the slowdown in Moore's Law's progress, vendors have already gone
to dual-core, quad-core and even eight-core processors. How does this
trend affect the breadth-of-applicability of HPC systems?
Walsh:
If multi-core architectures offered as much to the HPC community in
improved price-performance going forward as the rapid clock period
improvements and advances in superscalarity did over the last ten
years, we would not be talking at all about mixed-architecture HPC
systems. It could be argued that multi-core, general-purpose
microprocessors are a tool that can be purchased at a low price, but
that were designed for use by a different market or buyer. Their low
cost does not guarantee their effective use in HPC. The question of
viable effective use would seem to grow for the HPC community with the
number of cores on the chip. This relates back to the data-intensive
nature of most HPC applications and the sharing of already limited
bandwidth to memory.
The Stream benchmark performance of Intel's
new Woodcrest dual-core processor illustrates this point. Woodcrest has
helped Intel in its performance race with AMD, and early benchmarks
predict it will be a success in the marketplace. Much effort was put
into improving Woodcrest's memory subsystem, which offers a total of
over 21 GB/sec on nodes with two sockets and four cores. Yet,
four-threaded runs of the memory intensive Stream benchmark on such
nodes that I have seen extract no more than 35 percent of the available
bandwidth from the Woodcrest's memory subsystem. It is perhaps early
and compiler improvements may offer more at some point, but for future
single-socket, quad-core, or eight-core systems (perhaps with a shared
cache like Woodcrest) what should be expected, and what are the
implications for data intensive HPC applications?
Even if cache
remains unshared and grows proportionally with the number of cores on
the chip, blocking for cache on four- and eight-core systems sharing
one path to memory will be less effective. For codes where this is
still possible, or whose kernels are cache-resident, or that have fixed
memory requirements and might see super-linear speedups at some scale,
multi-core, beyond dual, will offer something. However, for HPC
applications demanding high-bandwidth, having smallish FLOP/MOP ratios
in their kernels, and with perhaps limited data-locality, multi-core,
beyond dual, will offer little. HPC users with such applications will
find better performance on those DLP-oriented mixed-architecture or
vector systems with the best bandwidth that we discussed earlier.
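
[Illustration, not part of the interview: the effect of a small FLOP/MOP ratio on a bandwidth-limited node can be estimated with simple arithmetic. The 21 GB/sec and 35 percent figures are the ones quoted above for a two-socket Woodcrest node. The 8-byte operand size, the example FLOP/MOP ratios, the optional peak cap, and the function name are assumptions for this sketch, which is a simplified bound rather than a measurement.]

```python
def bandwidth_bound_gflops(sustained_gb_per_s: float,
                           flops_per_memop: float,
                           bytes_per_memop: float = 8.0,
                           peak_gflops: float = float("inf")) -> float:
    """Upper bound on the floating-point rate of a memory-bound kernel.

    sustained_gb_per_s: memory bandwidth the code actually extracts.
    flops_per_memop:    the kernel's FLOP/MOP ratio.
    bytes_per_memop:    bytes moved per memory operation (8 for 64-bit data).
    peak_gflops:        optional cap at the processor's peak rate.
    """
    if sustained_gb_per_s < 0 or flops_per_memop < 0 or bytes_per_memop <= 0:
        raise ValueError("inputs must be non-negative, bytes_per_memop > 0")
    giga_memops_per_s = sustained_gb_per_s / bytes_per_memop
    return min(peak_gflops, giga_memops_per_s * flops_per_memop)


if __name__ == "__main__":
    sustained = 21.0 * 0.35        # GB/sec, from the figures quoted above
    print(f"sustained bandwidth: {sustained:.2f} GB/sec for the node")
    for ratio in (0.5, 1.0, 4.0):  # hypothetical FLOP/MOP ratios
        rate = bandwidth_bound_gflops(sustained, ratio)
        print(f"FLOP/MOP {ratio}: at most {rate:.2f} GFLOPS across all cores")
```

Because the bound applies to the whole node, adding cores that share the same path to memory divides it among them rather than raising it.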
The
doubling and quadrupling in ILP that multi-core chips offer on paper
will not deliver the same broadly based benefits across HPC
applications space as the clock period and super-scalarity improvements
of the prior decade did. Today's much higher processor-to-memory clock
ratio and memory bus width and bandwidth limitations contribute to this
effect. If tomorrow's Top500 HPC systems are to be based largely on
commodity microprocessors as they are today, but with four to eight
cores per processor, their breadth-of-applicability across HPC
applications space will be reduced. On the other hand, more innovative
and currently experimental, tiled multi-core designs similar to MIT's
RAW microprocessor may offer a way around the multi-core memory
bottleneck through stream-oriented processing and instruction sets that
expose the control of on-chip interconnects to the compiler. This is
not currently part of the commodity multi-core trend or the x86
instruction set.
HPCwire:
When you think about the future of HPC, what keeps you up at night?
Walsh:
It would have to be the excitement and difficulty of tracking the
developments in a field that is incredibly dynamic and broad in scope,
whether viewed purely as an evolving technology or as a catalyst of
science and engineering.
-----
Richard B. Walsh is a
project manager with Network Computing Services Inc. at the Army High
Performance Computing Research Center (AHPCRC).
The Army High
Performance Computing Research Center is funded under contract
DAAD19-03-D-0001 with the U.S. Army Research Laboratory. The views and
conclusions should not be interpreted as representing the official
policies or positions, either expressed or implied, of the U.S. Army
Research Laboratory or the U.S. Government.