Paul Murphy: The obligatory ‘Victoria Falls’ post
Written by Paul Murphy   
Tuesday, 15 April 2008

For those who don’t know, Sun’s new T2+ machines extend the T2’s CMT capabilities across multiple units to produce 16 and 32 core SMP machines capable of handling 128 and 256 concurrent threads respectively.

Sun blogger Denis Sheahan provides a good overview of the current dual socket releases here

By itself the T2 continues to set new performance records - Sun’s bmseer usually has the latest; most recently a pair of new SPECint_rate2006 and SPECfp_rate2006 records.

 


The new machines don’t offer the kind of quantum leap the T2 did - obviously because the T2+ is a continuation within the UltraSPARC SMP/CMT line and less obviously because market pricing constraints limit the throughput possible in other parts of the system.

The most illustrative benchmark result I’ve seen on this, also as reported by bmseer, involves Lotus Domino. Here’s part of that report:

Lotus Domino 7.0.1 NotesBench R6iNotes Performance Chart (in increasing $/User order)

Users = number of users supported (bigger is better)
NotesMark = the benchmark metric (bigger is better)
$/User = cost per user (smaller is better)

Columns: System | Chip GHz | Cores/Chip | OS | Users | NotesMark (N-MARK) | #Dom Part | AvRT | $/User

Complete benchmark results may be found at the Lotus NotesBench website http://www.notesbench.org.

Notice that doubling the CPU count only produced about a fifty percent increase in throughput - an artifact of limitations elsewhere in the system. Users, however, don’t care about throughput in applications like this: they care about response time - and that’s where the T2+ really shines, reducing the average response time from 584ms to only 224ms - a reduction of a little over 60%.
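For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The two response times are the ones quoted above; the 1.5x throughput factor is only my reading of "about a fifty percent increase", not a figure taken from the benchmark report, and the function name is purely illustrative.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100.0


# Average response times quoted in the text, in milliseconds.
avg_rt_before_ms = 584
avg_rt_after_ms = 224

rt_reduction_pct = percent_reduction(avg_rt_before_ms, avg_rt_after_ms)
rt_speedup = avg_rt_before_ms / avg_rt_after_ms

# ASSUMPTION: "about a fifty percent increase" is taken as exactly 1.5x
# throughput for 2x the CPUs; the real ratio comes from the missing table.
cpu_factor = 2.0
throughput_factor = 1.5
scaling_efficiency_pct = throughput_factor / cpu_factor * 100.0

print(f"response time reduction: {rt_reduction_pct:.1f}%")
print(f"response time speedup:   {rt_speedup:.2f}x")
print(f"scaling efficiency:      {scaling_efficiency_pct:.0f}%")
```

In other words the throughput scaling is mediocre, while the response time improves by a factor of roughly two and a half.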

That’s an artifact of the CMT architecture and a pointer, I think, to the markets that this thing will sell into in volume.

On the other hand, the way the processors are coupled - done by replacing the T2’s on-board 10 Gigabit Ethernet facility - demonstrates that Sun can now produce highly customized versions of the core CPU set, and suggests what I believe may be a unique performance opportunity for this product line.

On the hardware customization side: suppose you consider a couple of million bucks no object for getting T2 machines that do FFT on short (16 way) vector processors - Sun has now shown it can do that with COTS parts that can be produced in volume.

The performance opportunity is a bit esoteric ( :) ) but comes down to this: there are time-critical applications in which the majority of the processing effort goes into moving data between process groups - and the Solaris/T2+ combination lets you move relatively lightweight processes instead of “heavyweight” data across, for the expected four-way machine, 256 threads and 64 direct PCI/E channels.

This possibility isn’t going to change how products like Apache or even compilers are built, but should make it possible to do some things no one could before.

Imagine, for example, that your application will get about 8GB worth of image data every three seconds - potentially 24 x 7; primary per-image base processing now takes about 4.3 seconds on one of Mercury Computing’s dual Cell blades; secondary processing now takes another 8 seconds on one of those blades; you want to keep a minute’s worth of data for instant replay; you generally expect to throw away more than 99.99% of all incoming data; and you want to move the entire system around on a truck.
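To put rough numbers on that scenario, here is a back-of-envelope sketch in Python. The inputs are the figures from the paragraph above; everything else is an assumption of mine: that each blade handles one whole image at a time, that the primary and secondary stages run on separate blades, and that moving the data costs nothing. That last assumption is exactly the one that fails in practice, so treat the blade counts as a floor, not an estimate.

```python
import math

# Figures from the scenario in the text.
IMAGE_GB = 8        # size of each arriving image, in GB
ARRIVAL_S = 3       # a new image arrives every 3 seconds
PRIMARY_S = 4.3     # primary processing time per image on one blade
SECONDARY_S = 8     # secondary processing time per image on one blade
REPLAY_S = 60       # keep a minute's worth of data for instant replay

# Sustained ingest rate the system has to absorb.
ingest_gb_per_s = IMAGE_GB / ARRIVAL_S

# Images (and GB) that must be held for the replay window.
replay_images = REPLAY_S // ARRIVAL_S
replay_buffer_gb = replay_images * IMAGE_GB

# ASSUMPTION: one image per blade at a time and free data movement, so
# keeping up takes at least ceil(stage time / arrival interval) blades.
min_primary_blades = math.ceil(PRIMARY_S / ARRIVAL_S)
min_secondary_blades = math.ceil(SECONDARY_S / ARRIVAL_S)

print(f"ingest rate:       {ingest_gb_per_s:.2f} GB/s")
print(f"replay buffer:     {replay_images} images, {replay_buffer_gb} GB")
print(f"blades (floor):    {min_primary_blades} primary, "
      f"{min_secondary_blades} secondary")
```

The compute floor is small; it is the sustained ingest rate and the replay buffer, both of which have to be moved and stored somewhere, that drive the size of the system.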

To do it now you’ll need a large vehicle because you’ll need to carry and power several rackmounts stuffed with Cell blades - first because the things are incredibly fast at floating point but terribly bad at throughput; and, equally importantly, because memory and bandwidth limitations combine with that playback requirement to force you to spend the majority of the effort you put into processing each arriving image just shuffling it around.

Choose the T2+ instead and you’ll get slower floating point but faster I/O and more storage flexibility - so, while the programming required might be a bit tricky (is there a Pulitzer for understatement?), I think success would give you something you could carry in a small launch or Hummer that would actually run faster and more reliably than anything else you could build.

 

Read the original article: http://blogs.zdnet.com/Murphy/?p=1117
