
Real World Technologies: ISSCC 2007: A Brief Survey

Written by David Kanter (Real World Technologies)   
Sunday, 25 March 2007 20:01

Every year the International Solid-State Circuits Conference (ISSCC) is held at the San Francisco Marriott, so that chip design companies can show off the tricks and techniques used in their latest technological marvels. ISSCC is sponsored by the IEEE Solid-State Circuits Society, the local Santa Clara chapter and the University of Pennsylvania, although the event is largely co-ordinated by graduate students and staff from the University of Toronto. The scope of the event is nonetheless clearly international: for most European and Japanese companies, the long and expensive flights dictate that only a few events each year can be attended, and ISSCC is the conference of choice.

Introduction

Well over 4000 attendees showed up for tracks ranging from PLL design to the ever-popular analog and MPU sessions. This year’s ISSCC was quite interesting and featured a wide variety of presentations, often scheduled at mutually exclusive times. There were two papers in the emerging technology track on the use of carbon nanotubes, both from IBM Research. The circuit design forum, which occurs the day before ISSCC, had a very interesting session on 3D integration, with presentations from IBM, Intel, Georgia Tech, DARPA and others. While the sessions are usually quite valuable, one of the best parts about ISSCC is that it brings together an incredible range of talented engineers in a single venue, and the ensuing discussions are always very enlightening.

Our coverage of ISSCC will be split into several parts. This first part will cover several sessions from the MPU track in moderate detail, particularly those concerning PA Semi, Intel’s Merom, Sun’s Niagara II and NEC’s early defect prediction. Later articles will each go in depth on a single subject, such as Barcelona and Intel’s Polaris.

 

PA Semi

The fifth presentation in the microprocessor track was from PA Semi, describing the techniques used to achieve a low power part with reasonable performance.

The PA6T-1682 is a system on a chip with a 25W TDP that features a pair of three-way superscalar, out-of-order cores operating at 2GHz, a 2MB L2 cache, two integrated DDR2 controllers, and an I/O system connected through a coherent crossbar. The I/O portion of the chip contains two 10GBE MACs, four 1GBE MACs, 8 PCI Express lanes and several coprocessors. A previous article at RWT describes the architecture in much greater detail. The system on a chip is fabricated on a 65nm, triple-Vt process with 8 metal layers. The entire design uses 200M transistors (21M per core), measures 115mm² and contains 23,000 clock gates. The device will be packaged in an 1156-ball BGA and is currently sampling to select customers.


Figure 1 – PA6T-1682

The design methodology relied heavily on internally developed standard cells that were optimized for power efficiency. Relatively few custom blocks are used, due to power constraints, and the high speed portions of the chip were done with a structured custom approach. PA also developed an internal tool that estimates power savings for clock gating based on the RTL. As the design moved further along, commercial tools were used to verify the savings, and their results correlated well with the estimates from the internal tool.

Figure 1 above shows a die micrograph, with different colors for the various voltage domains. Each core has an independent supply and adaptive control. Software specifies the frequency to each processor, and then the voltage is adjusted to the lowest level such that the desired frequency can be obtained. This tuning occurs on a per part basis, and therefore takes into account process variation. Adjusting the frequency based on demand is nothing new, as modern mobile chips have done that for a while; but the simultaneous per-part voltage optimization is novel. The cores can also be shut down, without any problems. The SRAM arrays have their own fixed Vdd supply, because the voltage must stay relatively high, to ensure that writes will function properly. Similarly, the memory and I/O system also have their own fixed Vdd.
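The per-part tuning described above amounts to a search for the lowest supply voltage at which a given die still meets the requested frequency. The following is only a minimal sketch of that idea. The linear frequency/voltage model, the voltage steps and the names `lowest_passing_mv` and `make_part` are illustrative assumptions, not details from the PA Semi presentation.

```python
from typing import Callable, Iterable

def lowest_passing_mv(target_mhz: int,
                      steps_mv: Iterable[int],
                      passes_at: Callable[[int, int], bool]) -> int:
    """Return the lowest voltage step (mV) at which this part meets target_mhz."""
    for mv in sorted(steps_mv):
        if passes_at(mv, target_mhz):
            return mv
    raise ValueError("target frequency not reachable at any available voltage")

def make_part(mhz_per_mv: int, threshold_mv: int) -> Callable[[int, int], bool]:
    """Hypothetical die model: max frequency rises linearly above a threshold voltage."""
    def passes_at(mv: int, mhz: int) -> bool:
        return mhz_per_mv * (mv - threshold_mv) >= mhz
    return passes_at

STEPS_MV = range(800, 1101, 50)   # assumed regulator steps, 0.80V to 1.10V
fast_die = make_part(4, 400)      # a die from the fast end of the process spread
slow_die = make_part(3, 400)      # a slower die needs more voltage for the same clock

print(lowest_passing_mv(2000, STEPS_MV, fast_die))  # 900
print(lowest_passing_mv(2000, STEPS_MV, slow_die))  # 1100
```

In this toy model the two dice receive the same 2GHz request from software but settle at different voltages, which is the sense in which the optimization accounts for process variation.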

Using a dynamic Vdd for each core saves a substantial amount of power, but also creates some headaches for clock distribution. Since the core voltage varies while the bus voltage is fixed, the clock tree delays may not match up. To solve this problem, hardware tracks phase drift between the core and bus, and then chooses a path which is both synchronous and fixed latency. Over time, the appropriate path may change due to temperature or voltage, and when that happens, the bus will halt to make adjustments.

The memory controller is another large source of power consumption that PA Semi worked on. The scheduler for the controller works to put ranks to sleep based on the performance impact. The ranks with the most outstanding transactions are left on, while others are powered down. When there are no outstanding transactions, the ranks are closed relatively quickly. This straightforward optimization saves around 2 watts on multimedia or floating point workloads, while only losing 1-2% performance, an acceptable and attractive trade-off.
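The rank selection policy can be sketched in a few lines. This is a simplified illustration rather than PA Semi's scheduler: the `max_awake` limit and the function name are hypothetical, and a real controller would also weigh wake-up latency and timing constraints.

```python
from typing import Dict, Set

def ranks_to_keep_awake(outstanding: Dict[int, int], max_awake: int) -> Set[int]:
    """Pick the busiest ranks to stay powered; everything else may be powered down."""
    busy = [rank for rank, count in outstanding.items() if count > 0]
    # Most outstanding transactions first; rank number breaks ties deterministically.
    busy.sort(key=lambda rank: (-outstanding[rank], rank))
    return set(busy[:max_awake])

queue_depths = {0: 5, 1: 0, 2: 2, 3: 7}   # rank -> outstanding transactions
print(sorted(ranks_to_keep_awake(queue_depths, max_awake=2)))  # [0, 3]
print(ranks_to_keep_awake({0: 0, 1: 0}, max_awake=2))          # set()
```

Ranks with no outstanding transactions are never kept awake in this sketch, matching the description that idle ranks are closed quickly.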

Through the use of novel power saving design techniques, the PA6T-1682 is ultimately able to achieve 13W typical and 25W maximum power at 2GHz. Software can lower the frequency to 1.5, 1 or 0.5GHz, which reduces power to 16, 12 and 9W respectively (presumably maximum rather than typical figures, since 16W exceeds the 13W typical at 2GHz). However, the real savings are in the three different sleep modes: doze, nap and sleep. The doze mode stops the core clock, but still snoops on the crossbar and offers immediate transition to an active state, while consuming around 2.5W. Nap and sleep modes go even lower, to 2W, but require slight entry and recovery times, as they flush the data caches.

Merom

The next paper in the microprocessor track was from Intel, regarding Merom, or the Core microarchitecture. Merom implements two 4-issue superscalar, out-of-order microprocessors, sharing a 4MB L2 cache and a front-side bus interface, in Intel’s high performance 65nm process. The microarchitecture was previously disclosed at IDF and described in great detail in an earlier article. Unfortunately, it appears that the ISSCC presentation ran afoul of some rather aggressive marketing staff, and was relatively light in terms of content. The highlight of the presentation was a discussion of the L2 cache.

Merom’s L2 cache is implemented as 1024 4KB sub-arrays, with 16-way associativity. The SRAM bit cells are 0.74µm² each and the cache access time (from when the address arrives to when data is sent out) is 2ns, including tag check, data read, and any error correction. During such an access, the cache only powers up 0.8% of all blocks.
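A quick back-of-the-envelope check of these figures, assuming that the "blocks" powered up during an access are the 4KB sub-arrays (the presentation summary does not say so explicitly):

```python
SUB_ARRAYS = 1024
SUB_ARRAY_KB = 4

total_kb = SUB_ARRAYS * SUB_ARRAY_KB
active = round(0.008 * SUB_ARRAYS)   # 0.8% of the sub-arrays, rounded

print(total_kb // 1024, "MB")        # 4 MB
print(active, "sub-arrays powered")  # 8 sub-arrays powered
print(active * SUB_ARRAY_KB, "KB")   # 32 KB
```

Under that assumption, only about 8 of the 1024 sub-arrays, or roughly 32KB of the 4MB array, would be active for any single access.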

The cache uses sleep transistors to set a virtual Vcc as much as 500mV below the actual Vcc, reducing leakage by 3x. The sleep transistors are also used in what Intel calls cache on demand mode. Essentially, the microarchitecture identifies the least frequently (or perhaps least recently) used cache blocks, and shuts them off, evicting the data to memory. This is a risky technique, as it would be easy to hurt performance and increase power draw (since fetching from memory is very expensive), but reduces leakage by 7x versus normal array operation. All these techniques contribute to an excellent idle power consumption of roughly 380mW/MB.


Figure 2 – Merom Die Micrograph

Niagara II

The last processor presented was Niagara II, Sun’s second throughput computing oriented processor. The architecture for Niagara II was first presented at Hot Chips 18, and was described in an earlier article. Niagara I implemented 8 scalar SPARC compatible cores, each supporting four threads, a single shared FPU and four integrated DDR memory controllers, all in TI’s 90nm process. Niagara II takes advantage of the denser 65nm process to create a system on a chip with roughly twice the performance. Niagara II augments each core with an FPU pipeline, an integer pipeline and four more threads. At the system level, the device sports 4MB L2 cache, two 10GBE interfaces, wire speed cryptography, a PCI-Express x8 port for storage and 4 FB-DIMM memory controllers. The whole device is 342mm², and uses 503M transistors in TI’s 65nm bulk process with 11 layers of metallization. The I/O portion of the chip mainly uses SerDes at 1.5V, and the core operates at 1.1V. At 1.1V, the device is targeted at 1.4GHz, with a worst case power draw of 84W.

It appears that in the aftermath of the Millennium project, Sun has really put a lot of emphasis on timely execution and delivery. Niagara II certainly put a heavy emphasis on ‘design for manufacturing’ techniques to increase yields. To avoid project risk and decrease power consumption, a static cell-based methodology was used for most of Niagara II. The only custom circuits were for SRAMs and analog, and these were proven on test chips prior to first silicon. As with all of the other MPU designs presented, low Vt transistors were used, but only sparingly and in crucial speed paths. Oftentimes, transistors were laid out using larger-than-minimum design rules, and critical areas were checked using OPC simulations to ensure correctness. Architectural DFM features include support for fewer than 8 SPARC cores or L2 cache banks; selectively disabling cores/banks on partially flawed dice increases the overall yield.

One of the more challenging areas that the presentation touched on was the clocking across the chip. Since Niagara II is a system on a chip, there are numerous regions of the chip that are running with varying degrees of synchronization.

Figure 3 – Clock Domains in Niagara II

The asynchronous clock crossings are handled by FIFOs that absorb any clock period or skew mismatches. An on-chip PLL generates ratioed synchronous clocks, with a wide range of fractional divisors (2-5.25 in 0.25 increments) to accommodate many of the clock domain crossings. Because the target frequency for Niagara II is relatively low, a less accurate global clock is tolerable. A combination of H-trees and grids was used for clock distribution, compromising between low skew and low power.

The ratioed synchronous clock crossings occur at interfaces between the SPARC cores, crossbar interconnect and other system elements; typically the latter run at a slower clock. Data is transferred between the fast and slow clock domains at the optimal fast clock cycle. Since the clocks are started based on the reference clock, there is a periodic alignment between the rising edges of each clock. An edge detection circuit is responsible for tracking this alignment. It emits an ‘aligned’ signal, which tracks the fast clock latency, when the clocks will be aligned at the destination cluster, and a data transfer is then initiated in both directions.
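The periodic alignment can be illustrated with exact fractions. The sketch below assumes that a divisor is the ratio of the slow clock period to the fast clock period and that both clocks start with coincident rising edges; the function names are hypothetical and the model ignores skew and latency entirely.

```python
from fractions import Fraction
from typing import List, Tuple

def available_divisors() -> List[Fraction]:
    """Divisors from 2 to 5.25 in steps of 0.25, as quoted in the text."""
    return [Fraction(2) + Fraction(i, 4) for i in range(14)]

def alignment_period(divisor: Fraction) -> Tuple[int, int]:
    """(fast cycles, slow cycles) between coincident rising edges."""
    # Fraction keeps p/q in lowest terms: q slow periods equal exactly p fast periods.
    return divisor.numerator, divisor.denominator

for d in available_divisors()[:4]:
    fast, slow = alignment_period(d)
    print(f"divide by {float(d):.2f}: edges align every {fast} fast / {slow} slow cycles")
```

For a divisor of 2.25, for example, the rising edges coincide once every 9 fast cycles (4 slow cycles), which is the kind of recurring window an edge detection circuit would have to track.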

Niagara II incorporates three different high speed, serial I/O technologies: FB-DIMM for memory, PCI-Express and XAUI for 10GBE. These run at 4.8GHz, 2.5GHz and 3.125GHz and provide 921, 40 and 100Gb/s of raw bandwidth respectively, for a total of over a terabit per second. All three interfaces use a common SERDES microarchitecture. To accommodate the slight differences, specifically that FB-DIMM uses Vss signaling (rather than Vdd), a level shifter was employed so that all three SERDES could share the same NMOS-based receivers.

Naturally, a lot of emphasis went into techniques to reduce power consumption for Niagara II. Clocks are gated at both the cluster and local clock-header level. The circuit designers also employed ‘gate-bias’ cells, which have a 10% longer channel, but reduce leakage by 40%. Niagara II also incorporates dynamic power management; the operating system can turn off threads, and a power throttling mode alters the instruction issue rate for the SPARC cores to manage power consumption. This power throttling can reduce consumption by up to 30% at the most aggressive setting, with a suitable workload. Similarly, the memory controllers can throttle access rates, or enter DRAM power-down modes to reduce memory power consumption. Lastly, on-chip thermal diodes monitor the junction temperature; in cases of cooling failures, the operating system can use the various techniques above to ensure continuous (albeit slow) operation. All these factors help to keep power consumption under 84W at worst case, which is fairly remarkable for a high performance server system. It will be interesting to see the resulting server products.

 

NEC’s Early Defect Prediction

Presentation 22.3 in the Digital Circuit Innovations track was from researchers at NEC who demonstrated a method for detecting failures in integrated circuits. The motivation for this technique is twofold. First, in the long term, the semiconductor market will grow fastest in the embedded world; companies have identified medical and automotive applications as key targets. Neither of these industries tolerates failure well. If a PC component dies, it’s really not a big deal; however, if a brake controller or an implanted insulin monitor breaks or, worse yet, experiences silent data corruption, someone could easily die. The second trend is that as manufacturing moves to finer and finer geometries, the error rate increases exponentially. The end result is that designers must begin to actively plan for failures in the field, and figure out how to continue operation.


Figure 4 – Defect Prediction Flip Flop

NEC uses what they call a ‘defect prediction flip flop’ (DPFF) to sense the total path delay in a small logic block. Failures that occur gradually over time will manifest as increasing path delays, until the delay exceeds the cycle time. Detecting such a pattern is a relatively simple matter. If the path delay exceeds a threshold value for several consecutive cycles, then a failure is likely to occur. Requiring consecutive cycles ensures that a transient failure will not trigger a warning erroneously. As an example, NEC manufactured a 330MHz test chip with a pseudo-defect circuit. The cycle time is 3.03ns, and the ‘warning’ band was set at 95ps, with a threshold of ~2.93ns.
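The detection rule can be modelled in software as a run-length check on measured path delays. This is only a behavioural sketch of the rule as described, not NEC's circuit: the DPFF does this in hardware, the function name is hypothetical, and the choice of three consecutive cycles is an assumption, since the text only says "several".

```python
from typing import Iterable

CYCLE_NS = 3.03          # 330MHz test chip
WARNING_BAND_NS = 0.095  # 95ps warning band
THRESHOLD_NS = CYCLE_NS - WARNING_BAND_NS  # about 2.93ns

def predicts_failure(delays_ns: Iterable[float],
                     threshold_ns: float = THRESHOLD_NS,
                     consecutive: int = 3) -> bool:
    """True once the path delay has exceeded the threshold on `consecutive` cycles in a row."""
    run = 0
    for delay in delays_ns:
        # A single cycle back under the threshold resets the count,
        # so isolated (transient) slow cycles never raise a warning.
        run = run + 1 if delay > threshold_ns else 0
        if run >= consecutive:
            return True
    return False

transient = [2.80, 2.99, 2.80, 2.98, 2.81]   # isolated slow cycles
aging = [2.90, 2.95, 2.97, 2.99]             # delay creeping upward
print(predicts_failure(transient))  # False
print(predicts_failure(aging))      # True
```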

The DPFF is used in conjunction with fine grained redundancy. The entire logic portion of a chip can be broken up into small regions, with a DPFF between each region. When a failure is likely, the DPFF switches off the main logic block and has the backup take over; this sort of ‘logic failover’ prevents any errors from occurring. The only way for a fatal error to occur is if both a logical block and the redundant block are hit by errors. However, since the logic can be divided into very small blocks, this is unlikely to happen (imagine redundancy at the functional unit level, versus at the core level). According to NEC’s experimental results, their early defect prediction and fine grained redundancy are superior to 4-way redundancy without prediction, yet use only 2.5x the area of a normal design. After 2 defects, 81% of the NEC test chips were still functional, and 59% were functional after 5 defects, compared to 33% and 1% respectively for a 3-way redundant architecture.
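The intuition that finer granularity makes a fatal double hit less likely can be checked with a toy combinatorial model. The model is an assumption made purely for illustration: every block has exactly one spare, defects land on distinct units uniformly at random, and failover always works. It is not NEC's methodology and does not attempt to reproduce their measured percentages.

```python
from fractions import Fraction
from math import comb

def survival_probability(blocks: int, defects: int) -> Fraction:
    """P(no block loses both its main and spare copy) for a duplicated design."""
    units = 2 * blocks
    if defects > units:
        raise ValueError("more defects than units")
    # Choose which blocks are hit (at most one unit each), then which copy in each.
    safe = comb(blocks, defects) * 2 ** defects
    return Fraction(safe, comb(units, defects))

for blocks in (1, 10, 100):
    print(blocks, float(survival_probability(blocks, 2)))
```

With a single block (redundancy at the whole-chip level) two defects are always fatal in this model, while splitting the same logic into more, smaller blocks pushes the survival probability toward 1.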

 

Conclusion

The sessions we surveyed were quite diverse. Intel’s presentation on Merom, though lackluster, focused on a key challenge: reducing the power consumption of caches, which are an increasingly large portion of the overall die area. Caches are also, along with I/O, one of the areas of the chip that is guaranteed to receive plenty of attention and custom design. No doubt there are many details being left out of Intel’s presentation for competitive reasons.

One challenge that Intel’s Haifa designers did not have to contend with was integration. Sun and PA Semi both had to contend with the difficulties of full system integration and everything that it entails. The key challenge here is managing several different voltage and clock domains, for caches, cores, and multiple (or in PA’s case reconfigurable) I/Os. This is doubly difficult since modern power saving techniques often involve reducing the voltage and frequency on the fly in the cores and caches.

Power saving, of course, is a common theme throughout all MPU designs, as almost everyone suffers from thermal and power limits. Clock gating is a given now, although the degree of clock gating varies from project to project. Software power management is also de rigueur; pretty much every design offers some form of software triggered sleep mode. One clear trend is that dynamic, per-part optimization will likely be standard in the not too distant future. While the first MPU to employ such techniques, Montecito, suffered a few missteps, the underlying ideas were present in force. The POWER6 and PA Semi both employ techniques that are similar in concept. In fact, one of the design forums held the day after ISSCC was titled “Adaptive Techniques for Dynamic Processor Optimization”, and included speakers from AMD, IBM, Intel, Texas Instruments and others.

Of all the presentations that we attended, the one most clearly aimed at the future was NEC’s, on defect prediction techniques. One of the fundamental issues going forward with CMOS scaling is that soft error rates, on both logic and memory, are expected to increase exponentially. Similarly, process variability is expected to rise dramatically. This means that system designers are inevitably going to be expected to build systems that can tolerate failures. When failure becomes a given over the lifetime of a product, techniques to predict failures before they occur will undoubtedly be quite useful.

While this concludes our initial coverage of ISSCC 2007, the other notable papers, from Intel, IBM and AMD, will be discussed in later, more detailed articles.


Read the original article: http://www.realworldtech.com/page.cfm?ArticleID=RWT032607001221

 