Introduction
Every year the International Solid-State Circuits Conference (ISSCC) is
held at the San Francisco Marriott, so that chip design companies can
show off the tricks and techniques used in their latest technological
marvels. ISSCC is sponsored by the IEEE Solid-State Circuits Society, the
local Santa Clara chapter and the University of Pennsylvania; the event
itself is largely co-ordinated by graduate students and staff from
the University of Toronto. The scope of the event, however, is clearly
international: for most European and Japanese companies, the long and
expensive flights dictate that only a few events each year can be
attended, and ISSCC is the conference of choice.
Well over 4000 attendees showed up for tracks ranging from PLL
design to the ever popular analog and MPU sessions. This year's ISSCC
was quite interesting and featured a wide variety of presentations,
often scheduled at mutually exclusive times. There were two papers in
the emerging technology track on the use of carbon nanotubes, both from
IBM Research. The circuit design forum, which occurs the day before
ISSCC, had a very interesting session on 3D integration, with
presentations from IBM, Intel, Georgia Tech, DARPA and others. While
the sessions are usually quite valuable, one of the best parts about
ISSCC is that it brings together an incredible range of talented
engineers in a single venue, and the ensuing discussions are always
very enlightening.
Our coverage of ISSCC will be split into several parts. This first part
will cover several sessions from the MPU track in moderate detail,
particularly those concerning PA Semi, Intel's Merom, Sun's Niagara II
and NEC's early defect prediction. Later articles will go in-depth on a
single subject, such as Barcelona and Intel's Polaris.
PA Semi
The fifth presentation in the microprocessor track was from PA Semi,
describing the techniques used to achieve a low power part with
reasonable performance.
The PA6T-1682 is a system on a chip with a 25W TDP that features a pair
of three-way superscalar out-of-order cores operating at 2GHz, a 2MB L2
cache, two integrated DDR2 controllers, and an I/O system connected
through a coherent crossbar. The I/O portion of the chip contains two
10GBE MACs, four 1GBE MACs, 8 PCI Express lanes and several
coprocessors. A previous article at RWT describes the architecture in
much greater detail. The system on a chip is fabricated on a 65nm,
triple Vt process with 8 metal layers. The entire design uses 200M
transistors, 21M per core, and is 115mm² with 23,000 clock gates. The
device will be packaged in an 1156 ball BGA and is currently sampling
to select customers.

Figure 1 PA6T-1682
The design methodology heavily relied on internally developed standard
cells that were optimized for power efficiency. Relatively few custom
blocks are used, due to power constraints, and the high speed portions
of the chip were done with a structured custom approach. PA also
developed an internal tool that estimates power savings for clock
gating based on the RTL. As the design moved farther along, commercial
tools were used to verify the savings, and correlated well with
estimates from the custom tools.
Figure 1 above shows a die micrograph, with different colors for the
various voltage domains. Each core has an independent supply and
adaptive control. Software specifies the frequency to each processor,
and then the voltage is adjusted to the lowest level such that the
desired frequency can be obtained. This tuning occurs on a per part
basis, and therefore takes into account process variation. Adjusting
the frequency based on demand is nothing new, as modern mobile chips
have done that for a while; but the simultaneous per-part voltage
optimization is novel. The cores can also be shut down, without any
problems. The SRAM arrays have their own fixed Vdd
supply, because the voltage must stay relatively high, to ensure that
writes will function properly. Similarly, the memory and I/O system
also have their own fixed Vdd.
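
The per-part tuning described above can be pictured as a search for the lowest supply voltage at which a given die still meets the frequency that software requested. The sketch below is purely illustrative: the function names, the candidate voltage steps and the toy linear speed model are all assumptions, not details from PA Semi's presentation.

```python
from typing import Callable, Optional, Sequence


def lowest_passing_voltage(
    target_ghz: float,
    candidate_volts: Sequence[float],
    max_freq_at: Callable[[float], float],
) -> Optional[float]:
    """Return the lowest candidate voltage at which this part reaches target_ghz.

    max_freq_at(v) is a hypothetical per-part characterization: the highest
    frequency (in GHz) this particular die sustains at supply voltage v.
    """
    for volts in sorted(candidate_volts):
        if max_freq_at(volts) >= target_ghz:
            return volts
    return None  # the part cannot reach the target at any candidate voltage


# Toy model (assumption): frequency scales linearly with voltage, and a
# "fast" die reaches more GHz per volt than a "slow" die.
def fast_part(volts: float) -> float:
    return 2.2 * volts


def slow_part(volts: float) -> float:
    return 1.9 * volts


steps = [0.80, 0.85, 0.90, 0.95, 1.00, 1.05, 1.10]
fast_v = lowest_passing_voltage(2.0, steps, fast_part)
slow_v = lowest_passing_voltage(2.0, steps, slow_part)
```

In this toy model the fast die settles at a lower voltage than the slow die for the same 2GHz request, which is the point of tuning each part individually rather than using one worst-case voltage for all parts.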
Using a dynamic Vdd for each core saves a substantial amount of power,
but also creates some headaches for clock distribution. Since the core
voltage varies while the bus voltage is fixed, the clock tree delays
may not match up. To solve this problem, hardware tracks phase drift
between the core and bus, and then chooses a path which is both
synchronous and fixed latency. Over time, the appropriate path may
change due to temperature or voltage, and when that happens, the bus
halts to make adjustments.
The memory controller is another large source of power
consumption that PA Semi worked on. The scheduler for the controller
works to put ranks to sleep based on the performance impact. The ranks
with the most outstanding transactions are left on, while others are
powered down. When there are no outstanding transactions, the ranks
are closed relatively quickly. This straightforward optimization
saves around 2 watts on multimedia or floating point workloads, while
only losing 1-2% performance, an acceptable and attractive trade-off.
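
A minimal sketch of the rank selection idea, assuming the scheduler can see a per-rank count of outstanding transactions. The `max_awake` knob and the tie-breaking rule are hypothetical; the presentation did not describe the actual policy at this level of detail.

```python
from typing import Dict, Set


def ranks_to_keep_awake(outstanding: Dict[int, int], max_awake: int) -> Set[int]:
    """Pick which DRAM ranks stay powered, favouring the busiest ones.

    outstanding maps rank id -> number of queued transactions. Ranks with
    no outstanding transactions are never kept awake. max_awake is a
    hypothetical policy knob, not a parameter from the presentation.
    """
    busy = [(count, rank) for rank, count in outstanding.items() if count > 0]
    # Sort by descending queue depth; break ties by lower rank id.
    busy.sort(key=lambda item: (-item[0], item[1]))
    return {rank for _, rank in busy[:max_awake]}


awake = ranks_to_keep_awake({0: 5, 1: 0, 2: 2, 3: 7}, max_awake=2)
```

Here ranks 3 and 0 stay on because they have the deepest queues, while ranks 1 and 2 become candidates for power-down.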
Through the use of novel power saving design techniques, the
PA6T-1682 ultimately achieves 13W typical and 25W maximum
power at 2GHz. Software can lower the frequency to 1.5, 1 or 0.5GHz,
which reduces power to 16, 12 and 9W respectively. However, the real
savings are in the three different sleep modes: doze, nap and sleep.
The doze mode stops the core clock, but still snoops on the crossbar
and offers immediate transition to an active state, while consuming
around 2.5W. Nap and sleep modes go even lower, to 2W, but require slight
entry and recovery times, as they flush the data caches.
Merom
The next paper in the microprocessor track was from Intel, regarding
Merom, or the Core microarchitecture. Merom implements two 4-issue
superscalar, out-of-order microprocessors, sharing a 4MB L2 cache and a
front-side bus interface, in Intel's high performance 65nm process.
The microarchitecture was previously disclosed at IDF and described in
great detail in an earlier article. Unfortunately, it appears that the ISSCC
presentation ran afoul of some rather aggressive marketing staff, and
was relatively light in terms of content. The highlight of the
presentation was a discussion of the L2 cache.
Merom's L2 cache is implemented as 1024 4KB sub-arrays, with 16-way
associativity. The SRAM bit cells are 0.74µm² each and the cache access
time (from when the address arrives to when data is sent out) is 2ns,
including tag check, data read, and any error correction. During such
an access, the cache only powers up 0.8% of all blocks.
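
As a quick sanity check on these figures, the arithmetic below confirms that 1024 sub-arrays of 4KB give the 4MB capacity, and estimates how many sub-arrays a single access wakes up. Reading "blocks" as sub-arrays is our assumption, so the result is only a rough illustration.

```python
SUB_ARRAYS = 1024        # number of sub-arrays quoted in the text
SUB_ARRAY_KB = 4         # size of each sub-array in KB
ACTIVE_FRACTION = 0.008  # "0.8% of all blocks" powered up per access

# Total capacity in MB: 1024 sub-arrays * 4KB each.
total_mb = SUB_ARRAYS * SUB_ARRAY_KB / 1024

# Assumption: "blocks" means sub-arrays, so round to a whole number of them.
active_sub_arrays = round(SUB_ARRAYS * ACTIVE_FRACTION)
active_kb = active_sub_arrays * SUB_ARRAY_KB

print(total_mb, active_sub_arrays, active_kb)
```

Under that assumption, roughly 8 of the 1024 sub-arrays (about 32KB of the 4MB array) are powered during an access.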
The cache uses sleep transistors to set a virtual Vcc as much as 500mV
below the actual Vcc, reducing leakage by 3x. The sleep transistors are
also used in what Intel calls cache on demand mode. Essentially, the
microarchitecture identifies the least frequently (or perhaps least
recently) used cache
blocks, and shuts them off, evicting the data to memory. This is a
risky technique, as it would be easy to hurt performance and increase
power draw (since fetching from memory is very expensive), but reduces
leakage by 7x versus normal array operation. All these techniques
contribute to an excellent idle power consumption of roughly 380mW/MB.

Figure 2 Merom Die Micrograph
Niagara II
The last processor presented was Niagara II, Sun's second throughput
computing oriented processor. The architecture for Niagara II was first
presented at Hot Chips 18, and described in an earlier article.
Niagara I implemented 8 scalar SPARC compatible cores, each supporting
four threads, a single shared FPU and four integrated DDR memory
controllers, all in TI's 90nm process. Niagara II takes advantage of the
denser 65nm process to create a system on a chip with roughly twice the
performance. Niagara II augments each core with an FPU pipeline, an
integer pipeline and four more threads. At the system level, the device
sports a 4MB L2 cache, two 10GBE interfaces, wire speed cryptography, a
PCI-Express x8 port for storage and 4 FB-DIMM memory controllers. The
whole device is 342mm², and uses 503M transistors in TI's
65nm bulk process with 11 layers of metallization. The I/O portion of
the chip mainly uses SerDes at 1.5V, and the core operates at 1.1V. At
1.1V, the device is targeted at 1.4GHz, with a worst case power draw of
84W.
It appears that in the aftermath of the Millennium project, Sun
has really put a lot of emphasis on timely execution and delivery.
Niagara II certainly put a heavy emphasis on design for manufacturing
techniques to increase yields. To avoid project risk and decrease power
consumption, a static cell-based methodology was used for most of
Niagara II. The only custom circuits were for SRAMs and analog, and were
proven on test chips prior to first silicon. As with all of the other
MPU designs presented, low Vt transistors were used, but only sparingly
and in crucial speed paths. Oftentimes, transistors were laid out using
larger than minimum design rules, and critical areas were checked using
OPC simulations to ensure correctness. Architectural DFM features
include support for fewer than 8 SPARC cores or L2 cache banks;
selectively disabling cores/banks on partially flawed dice increases
the overall yield.
One of the more challenging areas that the presentation touched on was
the clocking across the chip. Since Niagara II is a system on a chip,
there are numerous regions of the chip that are running with varying
degrees of synchronization.
Figure 3 Clock Domains in Niagara II
The asynchronous clock crossings are handled by FIFOs that absorb any
clock period or skew mismatches. An on-chip PLL generates ratioed
synchronous clocks, with a wide range of fractional divisors (2-5.25 in
0.25 increments) to accommodate many of the clock domain crossings.
Because the target frequency for Niagara II is relatively low, a less
accurate global clock is tolerable. A combination of H-trees and grids
was used for clock distribution, compromising between low skew and low
power.
The ratioed synchronous clock crossings occur at interfaces
between the SPARC cores, crossbar interconnect and other system
elements; typically the latter run at a slower clock. Data is
transferred between the fast and slow clock domains at the optimal fast
clock cycle. Since the clocks are started based on the reference clock,
there is a periodic alignment between the rising edges of each clock.
An edge detection circuit is responsible for tracking this alignment.
It emits an aligned signal, which tracks the fast clock latency, when
the clocks will be aligned at the destination cluster; at that point a
data transfer is initiated in both directions.
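
The periodic alignment follows directly from the clock ratio being a simple fraction. The sketch below assumes ideal clocks that start aligned, with no skew or jitter; the function name is our own. It enumerates the ratios quoted earlier (2 to 5.25 in 0.25 steps) and shows how often the rising edges coincide.

```python
from fractions import Fraction
from typing import Tuple


def alignment_period(ratio: Fraction) -> Tuple[int, int]:
    """For fast_clock / slow_clock == ratio, return (fast_cycles, slow_cycles)
    between consecutive coincident rising edges.

    Assumes both clocks start aligned and are ideal (no skew or jitter).
    With ratio = n/d in lowest terms, d slow cycles span exactly n fast cycles.
    """
    return ratio.numerator, ratio.denominator


# Divisors from 2 to 5.25 in 0.25 increments, as quoted in the text.
divisors = [Fraction(n, 4) for n in range(8, 22)]

for ratio in divisors:
    fast, slow = alignment_period(ratio)
    print(f"ratio {float(ratio):.2f}: edges align every {fast} fast / {slow} slow cycles")
```

For a ratio of 2.75, for example, the edges line up once every 11 fast cycles (4 slow cycles), which is the kind of recurring event an edge detection circuit would have to track.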
Niagara II incorporates three different high speed, serial I/O
technologies: FB-DIMM for memory, PCI-Express and XAUI for 10GBE. These
run at 4.8GHz, 2.5GHz and 3.125GHz, and provide 921, 40 and 100Gb/s of
raw bandwidth respectively, over a terabit per second in total. All
three interfaces use a common SERDES microarchitecture. To accommodate
the slight differences, specifically that FB-DIMM uses Vss signaling
(rather than Vdd), a level shifter was employed so that all three SERDES
could share the same NMOS-based receivers.
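
The "over a terabit per second" figure is simply the sum of the three raw bandwidths quoted above, as this small check shows (the dictionary keys are just labels chosen for illustration).

```python
# Raw bandwidth per interface in Gb/s, taken from the figures in the text.
raw_gbps = {"FB-DIMM": 921, "PCI-Express": 40, "XAUI": 100}

total_gbps = sum(raw_gbps.values())
print(f"{total_gbps} Gb/s total")
```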
Naturally, a lot of emphasis went into techniques to reduce power
consumption for Niagara II. Clocks are gated at both the cluster and
local clock-header level. The circuit designers also employed
gate-bias cells, which have a 10% longer channel, but reduce leakage
by 40%. Niagara II also incorporates dynamic power management; the
operating system can turn off threads, and a power throttling mode
alters the instruction issue rate for the SPARC cores to manage power
consumption. This power throttling can reduce consumption by up to 30%
at the most aggressive setting, with a suitable workload. Similarly,
the memory controllers can throttle access rates, or enter DRAM
power-down modes to reduce memory power consumption. Lastly, on-chip
thermal diodes monitor the junction temperature; in cases of cooling
failures, the operating system can use the various techniques above to
ensure continuous (albeit slow) operation. All these factors help to
keep power consumption under 84W at worst case, which is fairly
remarkable for a high performance server system. It will be
interesting to see the resulting server products.
NEC's Early Defect Prediction
Presentation 22.3 in the Digital Circuit Innovations track was from
researchers at NEC who demonstrated a method for detecting failures in
integrated circuits. The motivation for this technique is twofold. In
the long term, the semiconductor market will grow fastest in the
embedded world; companies have identified medical and automotive
applications as key targets. Neither of these industries tolerates
failure well. If a PC component dies, it's really not a big deal.
However, if a brake controller or an implanted insulin monitor breaks,
or worse yet, experiences silent data corruption, someone could easily
die. The second trend is that as manufacturing moves to finer and finer
geometries, the error rate increases exponentially. The end result is
that designers must begin to actively plan for failures in the field,
and figure out how to continue operation.

Figure 4 Defect Prediction Flip Flop
NEC uses what they call a defect prediction flip flop (DPFF) to
sense the total path delay in a small logic block. Failures that occur
gradually over time will manifest as increasing path delays, until the
delay exceeds the cycle time. Detecting such a pattern is a relatively
simple matter. If the path delay exceeds a threshold value for several
consecutive cycles, then a failure is likely to occur. This ensures
that a transient failure will not trigger erroneously. As an example,
NEC manufactured a 330MHz test chip with a pseudo-defect circuit. The
cycle time is 3.03ns, and the warning band was set at 95ps, giving a
threshold of ~2.93ns.
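
The detection rule can be sketched in a few lines. The threshold calculation uses the test chip numbers from the text; the choice of three consecutive cycles is an assumption (the text only says "several"), and the function names are ours rather than NEC's.

```python
from typing import Iterable


def warning_threshold_ns(clock_mhz: float, warning_band_ps: float) -> float:
    """Cycle time minus the warning band, in nanoseconds."""
    cycle_ns = 1000.0 / clock_mhz
    return cycle_ns - warning_band_ps / 1000.0


def predicts_failure(
    delays_ns: Iterable[float], threshold_ns: float, consecutive: int = 3
) -> bool:
    """True if the path delay exceeds the threshold for `consecutive` cycles
    in a row. A single slow cycle (a transient) resets nothing permanently
    and does not trigger a warning on its own.

    consecutive=3 is an assumed value; the source only says "several".
    """
    run = 0
    for delay in delays_ns:
        run = run + 1 if delay > threshold_ns else 0
        if run >= consecutive:
            return True
    return False


threshold = warning_threshold_ns(330, 95)  # roughly 2.935ns
transient = predicts_failure([2.90, 2.95, 2.90, 2.95], threshold)
degrading = predicts_failure([2.90, 2.94, 2.95, 2.96], threshold)
```

With these inputs the isolated slow cycles are ignored, while the steadily lengthening delays raise a warning before the delay reaches the full 3.03ns cycle time.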
The DPFF is used in conjunction with fine grained redundancy.
The entire logic portion of a chip can be broken up into small regions,
with a DPFF between each region. When a failure is likely, the DPFF
switches off the main logic block and has the back up take over; this
sort of logic failover prevents any errors from occurring. The only
way for a fatal error to occur is if both a logical block and the
redundant block are hit by errors. However, since the logic can be
divided into very small blocks, this is unlikely to happen (imagine
redundancy at the functional unit level, versus at the core level).
According to NEC's experimental results, their early defect prediction
and fine grained redundancy is superior to 4-way redundancy without
prediction, but only uses 2.5x the area of a normal design. After 2
defects, 81% of the NEC test chips were still functional, and 59% were
functional after 5 defects, compared to 33% and 1% respectively for a 3
way redundant architecture.
Conclusion
Of the sessions we surveyed, there is quite a bit of diversity. Intel's
presentation on Merom, though lackluster, focused on a key challenge:
reducing power consumption of caches, which are an increasingly large
portion of the overall die area. Caches are also, along with I/O, one
of the areas of the chip that is guaranteed to receive plenty of
attention and custom design. No doubt there are many details being left
out of Intel's presentation for competitive reasons.
One challenge that Intel's Haifa designers did not have to
contend with was integration. Sun and PA Semi both had to contend with
the difficulties of full system integration and everything that it
entails. The key challenge here is managing several different voltage
and clock domains, for caches, cores, and multiple (or in PA's case
reconfigurable) I/Os. This is doubly difficult since modern power
saving techniques often involve reducing the voltage and frequency on
the fly in the cores and caches.
Power saving, of course, is a common theme throughout all MPU designs as
almost everyone suffers from thermal and power limits. Clock gating is
a given now, although the degree of clock gating varies from project to
project. Software power management is also de rigueur; pretty much
every design offers some form of software triggered sleep mode. One
clear trend is that dynamic, per-part optimization will likely be
standard in the not too distant future. While the first MPU to employ
such techniques, Montecito, suffered a few missteps, the underlying
ideas were present in force. The POWER6 and PA Semi both employ
techniques that are similar in concept. In fact, one of the design
forums held the day after ISSCC was titled "Adaptive Techniques for
Dynamic Processor Optimization", and included speakers from AMD, IBM,
Intel, Texas Instruments and others.
Of all the presentations that we attended, the one that was most
clearly aimed at the future was the defect prediction techniques that
NEC showed. One of the fundamental issues going forward with CMOS
scaling is that soft error rates, on both logic and memory, are
expected to increase exponentially. Similarly, process variability is
expected to rise dramatically. This means that system designers are
inevitably going to be expected to build systems that can tolerate
failures. When failure becomes a given over the lifetime of a product,
techniques to predict failures before they occur will undoubtedly be
quite useful.
While this concludes our initial coverage of ISSCC 2007, the
other notable papers, from Intel, IBM and AMD, will also be discussed in
later, more detailed articles.
Read the original article: http://www.realworldtech.com/page.cfm?ArticleID=RWT032607001221