For those who don’t know, Sun’s new T2+ machines extend the T2’s
CMT capabilities across multiple units to produce 16 and 32 core SMP
machines capable of handling 128 and 256 concurrent threads
respectively.
Sun blogger Denis Sheahan provides a good overview of the current dual-socket releases.
By itself the T2 continues to set new performance records - Sun’s bmseer usually has the latest; most recently a pair of new SPECint_rate2006 and SPECfp_rate2006 records.
The new machines don’t offer the kind of quantum leap the T2 did -
obviously because the T2+ is a continuation within the UltraSPARC
SMP/CMT line and less obviously because market pricing constraints
limit the throughput possible in other parts of the system.
The most illustrative benchmark result I’ve seen on this, also reported by bmseer, involves Lotus Domino. Here’s part of that report:
Lotus Domino 7.0.1 NotesBench R6iNotes Performance Chart (in increasing $/User order)
Users = number of users supported (bigger is better)
NotesMark = the benchmark metric (bigger is better)
$/User = cost per user (smaller is better)
| System | Chip GHz | Cores/Chip | OS | USERS | N-MARK | #Dom Part | AvRT | $/User |
Complete benchmark results may be found at the Lotus NotesBench website http://www.notesbench.org.
Notice that doubling the CPU count produced only about a fifty percent
increase in throughput - an artifact of limitations elsewhere in the
system. Users, however, don’t care about throughput in applications
like this: they care about response time - and that’s where the T2+
really shines, reducing the average response time from 584ms to only
224ms - a reduction of just over 60%.
That’s an artifact of the CMT architecture and a pointer, I think, to the markets that this thing will sell into in volume.
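For anyone who wants to check the arithmetic behind those two figures, here is a minimal sketch. The 584ms and 224ms values come from the report quoted above; the 1.5x throughput ratio is just the "about fifty percent" figure restated, and the helper names are made up for illustration.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100.0


def scaling_efficiency(throughput_ratio, resource_ratio):
    """Fraction of ideal linear scaling actually achieved."""
    return throughput_ratio / resource_ratio


# Average response times from the NotesBench report (milliseconds).
rt_drop = percent_reduction(584, 224)

# Assumption: "about fifty percent" more throughput from twice the CPU.
eff = scaling_efficiency(1.5, 2.0)

print(f"response time reduction: {rt_drop:.1f}%")  # 61.6%
print(f"scaling efficiency: {eff:.0%}")            # 75%
```

So the response time falls by a bit more than 60%, while throughput scales at roughly three quarters of linear.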
On the other hand, the way the processors are coupled - done by
replacing the T2’s on-board 10 Gigabit Ethernet facility - demonstrates that
Sun can now produce highly customized versions of the core CPU set and suggests what I believe may be a unique performance opportunity for this product line.
On the hardware customization side: suppose you consider a couple of
million bucks no object for getting T2 machines that do FFT on short
(16-way) vector processors - Sun has now shown it can do that with COTS
parts that can be produced in volume.
The performance opportunity is a bit esoteric ( :) ) but comes down
to this: there are time-critical applications in which the majority of
the processing effort goes into moving data between process groups - and
the Solaris/T2+ combination lets you move relatively lightweight
processes instead of “heavyweight” data across, for the expected
four-way machine, 256 threads and 64 direct PCI/E channels.
This possibility isn’t going to change how products like Apache or
even compilers are built, but should make it possible to do some things
no one could before.
Imagine, for example, that your application will get about 8GB worth
of image data every three seconds - potentially 24 x 7; primary
per-image base processing now takes about 4.3 seconds on one of Mercury
Computing’s dual Cell blades; secondary processing now takes another 8
seconds on one of those blades; you want to keep a minute’s worth of
data for instant replay; you generally expect to throw away more than
99.99% of all incoming data; and, you want to move the entire system
around on a truck.
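To put rough numbers on that scenario, here is a back-of-the-envelope sketch. The inputs are the figures given above; the assumptions are mine: decimal gigabytes, one image in flight per blade at a time, and no allowance at all for moving data around, so the blade counts are a floor and not an estimate.

```python
import math

IMAGE_GB = 8        # size of each arriving image set
ARRIVAL_S = 3       # a new image set arrives every three seconds
PRIMARY_S = 4.3     # primary processing time per image on one blade
SECONDARY_S = 8     # secondary processing time per image on one blade
REPLAY_S = 60       # instant replay window

# Sustained ingest rate the system has to absorb.
ingest_gb_per_s = IMAGE_GB / ARRIVAL_S

# Images, and gigabytes, that must stay resident for replay.
replay_images = REPLAY_S // ARRIVAL_S
replay_buffer_gb = replay_images * IMAGE_GB

# Minimum blades per stage so processing keeps pace with arrivals
# (assumes each blade works on exactly one image at a time).
primary_blades = math.ceil(PRIMARY_S / ARRIVAL_S)
secondary_blades = math.ceil(SECONDARY_S / ARRIVAL_S)

print(f"ingest: {ingest_gb_per_s:.2f} GB/s")
print(f"replay buffer: {replay_images} images, {replay_buffer_gb} GB")
print(f"minimum blades: {primary_blades} primary + {secondary_blades} secondary")
```

On these assumptions the compute floor is small: about 2.67 GB/s in, 160 GB held for replay, and five blades. The real difficulty is everything this sketch leaves out, which is holding and moving that much data.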
To do it now you’ll need a large vehicle because you’ll need to carry and power
several rackmounts stuffed with cell blades - first because the things
are incredibly fast at floating point, but terribly bad at throughput;
and, equally importantly, because memory and bandwidth limitations
combine with that playback requirement to force you to spend the
majority of the effort you put into processing each arriving image just
shuffling it around.
Choose the T2+ instead and you’ll get slower floating point but
faster I/O and more storage flexibility - so, while the programming
required might be a bit tricky (is there a Pulitzer for
understatement?), I think success would give you something you could
carry in a small launch or Hummer that would actually run faster and
more reliably than anything else you could build.
Read the original article: http://blogs.zdnet.com/Murphy/?p=1117