Message Passing Framework
So I'm developing a framework for distributed computing called Java
Object Collaboration Platform (JOCP). If I miss the target, it'll be a
decent enough platform for running the multiplayer games my brother and
I are writing -- it already accomplishes this. If it meets its ideals,
it'll be a possible contender in the hosted MMORPG space.
If I'm a technical dork, it'll never amount to more than it is now: a
message passing framework. If I learn enough over the near term, I'll be
able to fulfill one of the goals of the project: fully transparent,
shared state between multiple VMs, including .NET stacks.
IT Trends
As I'm sure you're aware, the focus in IT has shifted over the last few
years from ever-faster CPUs doing the chores for applications to
multiple threads, multiple cores, or both packed into the same package.
With SPARC, Intel, and AMD in a full-on onslaught with their competing
multithreading/multi-core CPU technologies, I'm taking it as a clear
sign that I may have had a little luck and prescience in targeting a
framework architecture that capitalizes (nay, relies) on these types of
CPU architectures.
It starts with CoolThreads
For reasons I can't even begin to fathom myself, I've always been a fan
of the SPARC-based Sun platforms. I initially had to sort of bear with
Solaris in my development projects on that platform -- although I've
come to absolutely love Solaris 10 with the major improvements found
there. So at the beginning of '06, along came the CoolThreads Prize for
Innovation contest, and all the ingredients were present to start work
on the framework. Six grueling months later, the minimal framework that
is now JOCP won the first round, and I have a 4-core, UltraSPARC T1
based T2000 to continue proving out the core multithreading components
of JOCP.
During my efforts in phase two of the contest, I was able to confirm
that the platform, as it stands, scales more with the number of cores
than with the speed of the CPU clock. My performance tests during that
phase showed dramatic differences between the framework (and test
clients) running against a 64-bit Opteron at 1.8GHz and the 1.0GHz
UltraSPARC T1 in that CoolThreads server. The tipping point for
performance degradation occurred with as few as 5 clients, and the
latency differential between the two servers as the client count went
up was most definitely not linear.
With Sun's Open Performance Contest, I was offered yet another chance to
confirm whether I'm on the right track with the application. I brought
another T2000, an 8-core model, into the test lab to see if adding more
cores means continued scalability.
Obviously, improving performance isn't always as simple as throwing
more hardware at the problem. Two forces come to mind which are going
to vary the results of the analysis I did on the new test bed: IT's
expanding need to condense, run cooler, and run cheaper; and the design
criteria necessary to build multithreading into an application at its
core. By the latter I mean essentially: if the future of computing
means holding the line on clock speeds and just adding lots more cores,
can you avoid rework by designing for multiple cores from the start,
and how much of a win would that be?
About me
Before I go further and relay some of the analysis and findings I made
with the new test lab, let me be forthright about my own credentials as
an analyst. I do not claim to be an expert on
multi-core/SMP/hyperthreading technologies. I am most definitely not a
hardware engineer. I am a self-taught software developer with about 20
years of programming experience -- over 12 of those in a professional
setting. My past projects relating to scalable, transaction-oriented
architectures include architecting and building a real-time trade
clearing system for a local clearing corporation, the usual cadre of
web applications, various automation control projects, and all sorts of
scary projects where people spend millions and years on software that
will never see a single real user.
Essentially, I'm a mid-level software veteran with distributed
computing science as my personal hobby.
Further Scalability Testing
There has been a lot written about the T2000 with regard to competitive
performance using canned packages, as well as the relative network
performance of that server in "typical" business environments and
applications. I believe I had a more unusual situation: an application
that was intended for that architecture and that may, in fact, rely on
it. There were some fundamental questions and theories I had, and I
decided to run a number of performance tests to find some answers and
validations. These questions included:
1) Was basing the framework on SEDA a good choice?
2) How much difference is there between a single-core, high-frequency chip and a multi-core/multi-thread chip (e.g. an AMD Opteron and the UltraSPARC T1)?
3) How much difference is there between the 4-core and 8-core UltraSPARC T1 CPUs?
4) Is the T2000 relegated to serving out static content for web servers, or can you actually perform some computation?
If you're unsure of what SEDA (Staged Event-Driven Architecture) is,
check here. The basic byline for SEDA is that it breaks client/server
communication and processing up into stages. One of the primary drivers
for SEDA was that it be self-tuning and self-limiting, and prevent
resource exhaustion in the case of request overload (e.g. the "Slashdot
effect"). I'll spare you the details about why that fits with some of
the higher-level design goals of JOCP, but there was a ton of
congruence between the goals of SEDA and JOCP, which is why I chose it
as the basis for JOCP.
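To make the staging idea concrete, here is a minimal sketch of two chained stages, each with a bounded queue and its own worker threads. It is not JOCP code: the names `Stage`, `SedaDemo`, `offer`, and `put` are made up for illustration, the queue sizes and thread counts are arbitrary, and it uses modern Java syntax (lambdas) rather than the JDK 1.5 used in the tests. It also leaves out SEDA's self-tuning controllers and only shows the self-limiting part: a full queue either rejects new work at the front door or pushes back on the upstream stage.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical SEDA-style stage: a bounded event queue drained by its own worker threads. */
final class Stage<I, O> {
    private final BlockingQueue<I> inbox;
    private final List<Thread> workers = new ArrayList<>();

    Stage(String name, int capacity, int threads, Function<I, O> handler, Consumer<O> downstream) {
        inbox = new ArrayBlockingQueue<>(capacity);
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        // Take one event, process it, hand the result to the next stage.
                        downstream.accept(handler.apply(inbox.take()));
                    }
                } catch (InterruptedException e) {
                    // stop() was called; let the worker exit.
                }
            }, name + "-" + i);
            t.setDaemon(true);
            t.start();
            workers.add(t);
        }
    }

    /** Front-door admission: returns false instead of blocking when the stage is saturated. */
    boolean offer(I event) {
        return inbox.offer(event);
    }

    /** Stage-to-stage hand-off: blocks while the queue is full, which pushes back upstream. */
    void put(I event) {
        try {
            inbox.put(event);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    void stop() {
        for (Thread t : workers) {
            t.interrupt();
        }
    }
}

final class SedaDemo {
    /** Pushes requests through two stages (decode, then handle) and returns the replies. */
    static List<String> run(List<String> requests) {
        List<String> replies = Collections.synchronizedList(new ArrayList<String>());
        CountDownLatch done = new CountDownLatch(requests.size());
        Stage<String, String> handle = new Stage<String, String>("handle", 8, 2,
                msg -> "ack:" + msg,
                reply -> { replies.add(reply); done.countDown(); });
        Stage<String, String> decode = new Stage<String, String>("decode", 8, 2,
                String::trim, handle::put);
        for (String request : requests) {
            if (!decode.offer(request)) {
                done.countDown(); // request shed under overload; no reply will arrive
            }
        }
        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        decode.stop();
        handle.stop();
        return new ArrayList<>(replies);
    }
}
```

Calling `SedaDemo.run(...)` with a handful of strings returns one `ack:` reply per accepted request, in no guaranteed order since each stage runs two workers. The design point is that every stage owns its queue and threads, so an overloaded stage degrades by refusing or delaying work rather than exhausting the whole server.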
I could have gone very detailed and examined a myriad of Java VM tuning
options. I decided instead to take the very basic approach and test
using a more or less stock, "-server" enabled JVM (with opportunities
for HotSpot warmup occurring). I decided, for now, to ignore the
arguably platform-specific switches available through the -X options.
I suspect my next test lab will include some of that -- I'm fairly
certain heap optimization is one of the next design facets to tackle.
The Test Setup
So the basic components of my test lab for this recent test suite were:
- A number of HP Proliant DL360 G4 systems. One was selected as target server A. Some were used as test clients.
- A mid-range single-core Opteron (144) system as target server B.
- A T2000 with 4 cores and 8GB memory as target server C.
- A T2000 with 8 cores and 16GB memory as target server D.
- Various other clients (mixed performance levels and CPUs).
- JDK 1.5 (1.5_10).
- 5 test harnesses per physical test client host.
I developed a custom test harness based on JOCP which submits (20) 1K
packets to a target server. Individual packet latency is measured, as
is an average. The thing's pretty slick, as I have a scanner that picks
up the test results (all XML files) and moves them for analysis.
I'll save the details for a later post.
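Until that later post, here is a rough sketch of the kind of measurement loop described above: 20 packets of 1K each, one timing per packet plus an average. It is not the actual JOCP harness. `LatencyProbe` and its methods are hypothetical names, and the transport is abstracted as a `roundTrip` function (an assumption) so the timing logic can be shown without any networking code.

```java
import java.util.Arrays;
import java.util.function.UnaryOperator;

/** Hypothetical latency probe: sends fixed-size packets one at a time and times each round trip. */
final class LatencyProbe {
    static final int PACKETS = 20;        // packets per run, as described in the post
    static final int PACKET_BYTES = 1024; // 1K payload

    /** Returns the per-packet round-trip times in nanoseconds. */
    static long[] run(UnaryOperator<byte[]> roundTrip) {
        long[] latencies = new long[PACKETS];
        byte[] payload = new byte[PACKET_BYTES];
        for (int i = 0; i < PACKETS; i++) {
            Arrays.fill(payload, (byte) i); // vary the content per packet
            long start = System.nanoTime();
            byte[] reply = roundTrip.apply(payload); // send and wait for the reply
            latencies[i] = System.nanoTime() - start;
            if (reply == null) {
                throw new IllegalStateException("no reply for packet " + i);
            }
        }
        return latencies;
    }

    /** Arithmetic mean of the recorded latencies, in nanoseconds. */
    static double averageNanos(long[] latencies) {
        if (latencies.length == 0) {
            return 0.0;
        }
        long sum = 0;
        for (long latency : latencies) {
            sum += latency;
        }
        return (double) sum / latencies.length;
    }
}
```

A local smoke test could pass `p -> p.clone()` as the round trip; a real client would replace that with a function that writes the payload to a socket and blocks until the server's reply arrives.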
The Results (preliminary)
So out of the testing, I found a few interesting test scenarios and results, and I'll go through them quickly for you.
- Opteron server versus the HP Proliant server scaling from 10 to 20 to 30 clients.
Since I had to deal with two different LANs with slightly different
client topologies, I was careful not to draw direct comparisons between
the various servers. Although I was tempted to compare performance
between the CoolThreads servers and the Proliant server, I was unable
to because of differences in client performance (and possible network
latency differences) between the two LANs. Besides, I was really more
interested in the tipping point at which client count starts to
undermine responsiveness.
As you can tell from graph A and graph B below, there is quite a
difference in the application's scalability on the 4-thread ((2) 3.2GHz
Xeons with HT) Proliant versus the single-core Opteron 144. I was
looking for general trends, and I found them quite easily from this
simple testing.

graph A
 graph B
Conclusions:
We take a roughly 2x latency hit going from 10 to 30 clients on the
dual 3.2GHz Xeons (Hyperthreaded), while taking a roughly 8x hit for
doing the same on the single-core Opteron.
What about the T2000's?
- 4-core T2000 vs. 8-core T2000 scaling from 15 to 50 clients.
So is there a benefit to going with Sun's 8-core model T2000 versus the
4-core model? For this particular application, there certainly is.
Graphs C and D show the comparison for going from 15 to 50 clients on
both target servers.

graph C.
graph D.
Conclusions:
We incur a roughly 13x latency hit going from 15 to 50 clients on the
4-core server, while incurring less than a 2x hit for doing the same on
the 8-core server.
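For clarity on how an "Nx hit" can be read: one plausible interpretation, which is an assumption here, is the ratio of average latency at the higher client count to average latency at the lower one. A tiny sketch follows; `LatencyHit` is a hypothetical name.

```java
/** Hypothetical helper: expresses a latency increase as a multiple of the baseline. */
final class LatencyHit {
    /**
     * @param baselineMillis average latency at the lower client count
     * @param loadedMillis   average latency at the higher client count
     * @return how many times slower the loaded case is than the baseline
     */
    static double factor(double baselineMillis, double loadedMillis) {
        if (baselineMillis <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return loadedMillis / baselineMillis;
    }
}
```

With placeholder inputs (not measurements from these tests), `LatencyHit.factor(4.0, 52.0)` evaluates to 13.0, which is the shape of the 4-core result quoted above.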
Where to go from here
I know I've glossed over a ton of details in this rough presentation,
but I'm personally satisfied with my results to date. I've shown that,
at least initially, it can pay to design from the start for
multi-core/multi-threaded architectures. Even at smaller client counts,
the T2000s easily beat the Opteron 144.
A large chunk of analysis is missing, and that's where I'm headed next:
HotSpot compiler and memory optimization. Both T2000s have plenty of
memory for a 64-bit JVM (8GB and 16GB for the 4-core and 8-core models,
respectively), so there's another opportunity for design consideration
when I target the next layer of the JOCP framework: object caching and
persistence. I also want to take advantage of Java's native performance
monitors and run some more exhaustive metrics intra-JVM. There's also a
question still lingering of what improvements Mustang (Java SE 6) has
brought.
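As a starting point for the intra-JVM metrics, here is a small sketch that assumes "native performance monitors" means the standard `java.lang.management` API, which has been part of the JDK since 1.5. The class name `JvmSnapshot` and the output format are made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper: captures a one-line snapshot of basic JVM health figures. */
final class JvmSnapshot {
    static String capture() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        return "cpus=" + os.getAvailableProcessors()      // hardware threads visible to the JVM
             + " liveThreads=" + threads.getThreadCount() // currently live threads
             + " peakThreads=" + threads.getPeakThreadCount()
             + " heapUsedBytes=" + heap.getUsed()
             + " heapCommittedBytes=" + heap.getCommitted();
    }
}
```

Logging `JvmSnapshot.capture()` before and after a test run would give a crude view of thread growth and heap pressure on each target server, which could then be lined up against the latency results.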
I'm also going to look for ways to directly compare the performance of
the Proliant DL360 and the T2000s. I know they're only roughly
comparable in price, but they look fairly close in many other aspects
(redundancy, I/O backplane, management, etc). I have no doubt that
comparing SWaP (Space, Watts and Performance) metrics will show a clear
difference.
I should also add that obviously I've got a lot of performance tuning
to do. 50 simultaneous users was enough for a rough test bed to get
rough test metrics. I'd like to see the server supporting thousands of
clients. Obviously, I'm a long way from that. Keep in mind, though,
that the test harness was designed to simply stream packets as fast as
possible while measuring rough round-trip times.
More to follow...
Read the original article: http://objcollab.blogspot.com/2006/12/java-performance-on-mulithreadedsupersc.html