Mark Martin: Java Performance on multithreaded/superscalar/multicore

Written by Mark Martin   
Sunday, 31 December 2006 13:33

Message Passing Framework

So I'm developing a framework for distributed computing called Java Object Collaboration Platform (JOCP). If I miss the target, it'll be a decent enough platform for running the multiplayer games my brother and I are writing -- it already accomplishes this. If it meets ideals, it'll be a possible contender in the hosted MMORPG space.

If I'm a technical dork, it'll never amount to more than it is now: a message passing framework. If I learn enough over the near term, I'll be able to fulfill one of the goals of the project: fully transparent, shared state between multiple VMs, including .NET stacks.

IT Trends

As I'm sure you're aware, the focus in IT has shifted a lot over the last few years, from ever-faster CPUs performing chores for applications to multiple hardware threads, multiple cores, or even both packed into the same package.

With SPARC, Intel, and AMD in a full-on onslaught with their competing multithreaded/multi-core CPU technologies, I'm taking things as a clear sign that I may have had just a little luck and prescience in targeting a framework architecture that capitalizes (nay, relies) on these types of CPU architectures.

It starts with CoolThreads

For reasons I can't even begin to fathom myself, I've always been a fan of the SPARC-based Sun platforms. I initially had to sort of bear with Solaris in my development projects on that platform -- although I've come to absolutely love Solaris 10 with the major improvements found there. So at the beginning of '06, along came the CoolThreads Prize for Innovation contest, and all the ingredients were present to start work on the framework. Six grueling months later, the minimal framework that is now JOCP won the first round, and I had a 4-core, UltraSPARC T1-based T2000 to continue proving out the core multithreading components of JOCP.

During my efforts in phase two of the contest, I was able to confirm that the platform, as it stands, scales more with the number of cores than with the speed of the CPU clock. My performance tests during that phase showed dramatic differences between the framework (and test clients) running against a 64-bit Opteron at 1.8 GHz and the 1.0 GHz UltraSPARC T1 in that CoolThreads server. The tipping point for performance degradation occurred with as few as 5 clients, and the latency differential between the two servers as the client count went up was most definitely not linear.

With Sun's Open Performance Contest, I was offered yet another chance to find out whether I'm on the right track with the application. I brought another T2000, this one with 8 cores, into the test lab to see if adding more cores means continued scalability. Obviously, things aren't always as simple as throwing more hardware at a problem to improve performance. Two forces come to mind which are going to vary the results of the analysis I did on the new test bed: IT's expanding need to condense, run cooler, and run cheaper; and the design criteria necessary to build multithreading into an application at its core. By the latter I mean essentially: if the future of computing means holding the line on clock speeds and just adding lots more cores, can you avoid rework by designing for multiple cores from the start, and how much of a win would that be?

About me


Before I go further and relay some of the analysis and findings I made with the new test lab, let me please be forthright about my own credentials as an analyst. I do not claim to be an expert on multi-core/SMP/hyperthreading technologies. I am most definitely not a hardware engineer. I am a self-taught software developer with about 20 years of programming experience -- over 12 of those in a professional setting. My past projects relating to the topic of scalable, transaction-oriented architectures include architecting and building a real-time trade clearing system for a local clearing corporation, the usual cadre of web applications, various automation control projects, and all sorts of scary projects where people spend millions and years on software that will never see a single real user. Essentially, I'm a mid-level software veteran with distributed computing science as my personal hobby.

Further Scalability Testing


A lot has been written about the T2000's competitive performance with canned packages, as well as the relative network performance of that server in "typical" business environments and applications. I believe I had a more unusual situation: an application that was intended for that architecture and that may, in fact, rely on it. I had some fundamental questions and theories, and I decided to run a number of performance tests to find some answers and validations. These questions included:

1) Was basing the framework on SEDA a good choice?
2) How much difference is there between a single-core, high-frequency chip and a multi-core/multi-threaded one (e.g. an AMD Opteron versus the UltraSPARC T1)?
3) How much difference is there between the 4-core and 8-core UltraSPARC T1 CPUs?
4) Is the T2000 relegated to serving out static content for web servers, or can you actually perform some computation?

If you're unsure of what SEDA (Staged Event-Driven Architecture) is, the basic tagline is that it breaks client/server communication and processing up into stages. One of the primary drivers for SEDA was that it be self-tuning and self-limiting, and prevent resource exhaustion in the case of request overload (e.g. the "Slashdot effect"). I'll spare you the details about why that fits with some of the higher level design goals of JOCP, but there was a ton of congruence between the goals of SEDA and those of JOCP, which is why I chose it as the basis for JOCP.
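To make the "stages" idea concrete, here is a minimal sketch of a SEDA-style pipeline. It is not JOCP code: the `SedaSketch` and `Stage` names, the queue capacities, and the thread counts are all assumptions made for illustration, and it is written against a modern JDK (8 or later) rather than the JDK 1.5 used in the tests. Each stage owns a bounded queue and its own worker threads, so an overloaded stage can refuse new work rather than exhaust memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch of SEDA-style stages; none of these names come from JOCP.
final class SedaSketch {

    /** One stage: a bounded event queue drained by its own worker threads. */
    static final class Stage<E> {
        private final BlockingQueue<E> inbox;
        private final List<Thread> workers = new ArrayList<>();

        Stage(String name, int capacity, int threads, Consumer<E> handler) {
            BlockingQueue<E> queue = new ArrayBlockingQueue<>(capacity);
            this.inbox = queue;
            for (int i = 0; i < threads; i++) {
                Thread worker = new Thread(() -> {
                    try {
                        while (true) {
                            handler.accept(queue.take());
                        }
                    } catch (InterruptedException stop) {
                        // Interrupted by shutdown(): let the worker exit.
                    }
                }, name + "-" + i);
                worker.setDaemon(true);
                worker.start();
                workers.add(worker);
            }
        }

        /** Admission control: returns false when the queue is full instead of growing it. */
        boolean offer(E event) {
            return inbox.offer(event);
        }

        /** Blocking hand-off, so a slow stage pushes back on whoever feeds it. */
        void put(E event) {
            try {
                inbox.put(event);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        void shutdown() {
            for (Thread worker : workers) {
                worker.interrupt();
            }
        }
    }

    /** Pushes count events through two stages and returns how many reached the end. */
    static int run(int count) {
        CountDownLatch remaining = new CountDownLatch(count);
        Stage<String> reply = new Stage<>("reply", 64, 2, message -> remaining.countDown());
        Stage<String> parse = new Stage<>("parse", 64, 2, raw -> reply.put(raw.trim()));
        for (int i = 0; i < count; i++) {
            parse.put(" event-" + i + " ");
        }
        try {
            remaining.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        parse.shutdown();
        reply.shutdown();
        return count - (int) remaining.getCount();
    }

    public static void main(String[] args) {
        System.out.println("events processed: " + run(1000));
    }
}
```

The `offer` method is the self-limiting part: a front-end stage can call it and turn a `false` into an immediate "busy" response, which is roughly how a staged design avoids falling over under a request flood. Real SEDA implementations also resize thread pools dynamically, which this sketch leaves out.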

I could have gone very detailed and examined a myriad of Java VM tuning options. I decided instead to take the very basic approach and test using a more or less stock, "-server" enabled JVM (with opportunities for HotSpot warmup occurring). I decided, for now, to ignore the arguably platform-specific switches available through the -X options. I suspect my next test lab will include some of that -- I'm fairly certain heap optimization is one of the next design facets to tackle.
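For readers unfamiliar with why warmup matters: HotSpot compiles hot code paths only after they have run for a while, so early timings mostly measure the interpreter. The sketch below shows one common way to allow for that, by running a task untimed before measuring it. The class name, the run counts, and the CRC32 workload are all assumptions for illustration; this is not the code used in these tests.

```java
import java.util.zip.CRC32;

// Hypothetical warm-up helper; the names and run counts are illustrative only.
final class WarmupSketch {

    /** Runs task warmupRuns times untimed, then returns mean nanoseconds over measuredRuns. */
    static double meanNanosAfterWarmup(Runnable task, int warmupRuns, int measuredRuns) {
        if (measuredRuns <= 0) {
            throw new IllegalArgumentException("measuredRuns must be positive");
        }
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // gives the JIT a chance to compile the hot path
        }
        long totalNanos = 0;
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return (double) totalNanos / measuredRuns;
    }

    public static void main(String[] args) {
        byte[] packet = new byte[1024];
        long[] sink = new long[1]; // keeps the checksum live so the work is not optimized away
        Runnable checksum = () -> {
            CRC32 crc = new CRC32();
            crc.update(packet);
            sink[0] += crc.getValue();
        };
        double cold = meanNanosAfterWarmup(checksum, 0, 1_000);
        double warm = meanNanosAfterWarmup(checksum, 20_000, 1_000);
        System.out.printf("cold: %.0f ns, warm: %.0f ns (sink=%d)%n", cold, warm, sink[0]);
    }
}
```

How large the cold/warm gap is depends heavily on the JVM, its flags, and the hardware, so the numbers it prints are only meaningful relative to each other on one machine.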


The Test Setup

So the basic components of my test lab for this recent test suite were:

  • A number of HP Proliant DL360 G4 systems. One was selected as target server A. Some were used as test clients.
  • A mid-range single-core Opteron (144) system as target server B.
  • A T2000 with 4 cores and 8 GB memory as target server C.
  • A T2000 with 8 cores and 16 GB memory as target server D.
  • Various other clients (mixed performance levels and CPUs).
  • JDK 1.5 (1.5_10).
  • 5 test harnesses per physical test client host.

I developed a custom test harness based on JOCP which submits 20 1K packets to a target server. Individual packet latency is measured, as is an average. The thing's pretty slick, as I have a scanner that picks up the test results (all XML files) and moves them for analysis. I'll save the details for a later post.
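Until then, here is a rough sketch of the measurement loop such a harness needs: send 20 packets of 1 KB each, time every round trip, and average the results. The `LatencyProbe` class and the pluggable `roundTrip` function are hypothetical stand-ins for the JOCP client and the network, and the XML reporting is omitted.

```java
import java.util.Arrays;
import java.util.function.UnaryOperator;

// Hypothetical latency probe; roundTrip stands in for "send to server, wait for reply".
final class LatencyProbe {
    static final int PACKETS = 20;        // packet count described in the post
    static final int PACKET_BYTES = 1024; // "1K" payload described in the post

    /** Sends each packet through roundTrip and returns per-packet latency in nanoseconds. */
    static long[] measure(UnaryOperator<byte[]> roundTrip) {
        long[] nanos = new long[PACKETS];
        byte[] payload = new byte[PACKET_BYTES];
        for (int i = 0; i < PACKETS; i++) {
            Arrays.fill(payload, (byte) i); // vary the content per packet
            long start = System.nanoTime();
            byte[] reply = roundTrip.apply(payload);
            nanos[i] = System.nanoTime() - start;
            if (reply == null || reply.length != PACKET_BYTES) {
                throw new IllegalStateException("unexpected reply for packet " + i);
            }
        }
        return nanos;
    }

    /** Mean of the given latencies, converted from nanoseconds to milliseconds. */
    static double averageMillis(long[] nanos) {
        return Arrays.stream(nanos).average().orElse(0.0) / 1_000_000.0;
    }

    public static void main(String[] args) {
        // An in-process echo replaces the real server, so these timings are not network latency.
        long[] latencies = measure(packet -> packet.clone());
        System.out.println("per-packet ns: " + Arrays.toString(latencies));
        System.out.println("average ms:    " + averageMillis(latencies));
    }
}
```

In a real run, `roundTrip` would wrap a socket write and a blocking read against the target server, and several probes would run in parallel per client host.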

The Results (preliminary)

So out of the testing, I found a few interesting test scenarios and results, and I'll go through them quickly for you.

  • Opteron server versus the HP Proliant server scaling from 10 to 20 to 30 clients.

Since I had to deal with two different LANs with slightly different client topologies, I was careful not to draw direct comparisons between the various servers. Although I was tempted to compare performance between the CoolThreads servers and the Proliant server, I was unable to because of differences in client performance (and possible network latency differences) between the two LANs. Besides, I was really more interested in the tipping point at which client count starts to undermine responsiveness.

As you can tell from graph A and graph B below, there is quite a difference in the application's scalability on the 4-thread Proliant (two 3.2 GHz Xeons with HT) versus the single-core Opteron 144. I was looking for general trends, and I found them quite easily from this simple testing.


graph A

graph B

Conclusions: We take a roughly 2x latency hit going from 10 to 30 clients on the dual 3.2 GHz Xeons (Hyperthreaded), while taking a roughly 8x hit for doing the same on the single-core Opteron.

What about the T2000's?

  • 4-core T2000 vs. 8-core T2000, scaling from 15 to 50 clients.

So is there a benefit to going with Sun's 8-core model T2000 versus the 4-core model? For this particular application, there certainly is. Graphs C and D show the comparisons for going from 15 to 50 clients on both target servers.


graph C.

graph D.

Conclusions: We incur a roughly 13x latency hit going from 15 to 50 clients on the 4-core server, while incurring less than a 2x hit for doing the same on the 8-core server.
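One way to put the rough multipliers quoted in this post on a common footing is to divide each latency multiplier by the growth in client count over the same run. A result of 1.0 would mean latency rose exactly in step with load; below 1.0 is better than linear, above it is worse. The helper below is a sketch of that arithmetic only; the class and method names are made up, and the inputs are the approximate figures from the conclusions, not raw measurements.

```java
// Hypothetical helper for comparing latency growth against client growth.
final class ScalingRatio {

    /** Latency growth divided by client growth; 1.0 means latency rose in step with load. */
    static double relativeToLinear(double latencyMultiplier, int clientsBefore, int clientsAfter) {
        if (clientsBefore <= 0 || clientsAfter <= clientsBefore) {
            throw new IllegalArgumentException("need 0 < clientsBefore < clientsAfter");
        }
        double clientGrowth = (double) clientsAfter / clientsBefore;
        return latencyMultiplier / clientGrowth;
    }

    public static void main(String[] args) {
        // Approximate multipliers as stated in the conclusions above.
        System.out.printf("dual Xeon, 10 to 30 clients:     %.2f%n", relativeToLinear(2.0, 10, 30));
        System.out.printf("Opteron 144, 10 to 30 clients:   %.2f%n", relativeToLinear(8.0, 10, 30));
        System.out.printf("4-core T2000, 15 to 50 clients:  %.2f%n", relativeToLinear(13.0, 15, 50));
        // The 8-core figure was "less than 2x", so this one is an upper bound.
        System.out.printf("8-core T2000, 15 to 50 clients: <%.2f%n", relativeToLinear(2.0, 15, 50));
    }
}
```

By this measure the dual Xeon and the 8-core T2000 both come in under 1.0 (about 0.67 and under 0.6), while the Opteron and the 4-core T2000 land well above it (about 2.7 and 3.9). Because the two pairs were tested on different LANs, the ratios are best read within each pair rather than across them.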

Where to go from here


I know I've glossed over a ton of details in this rough presentation, but I'm personally satisfied with my results to date. I've shown that, at least initially, it can pay to design from the start for multi-core/multi-threaded architectures. Even at smaller client counts, the T2000s easily beat the Opteron 144.

A large chunk of analysis is missing and that's where I'm headed next: HotSpot compiler and memory optimization. Both T2000s have plenty of memory for a 64-bit JVM (8 GB and 16 GB for the 4-core and 8-core models, respectively), so there's another opportunity for design consideration when I target the next layer of the JOCP framework: object caching and persistence. I also want to take advantage of Java's native performance monitors and run some more exhaustive metrics intra-JVM. There's also a question still lingering of what improvements Mustang (Java SE 6) has brought.

I'm also going to look for ways to directly compare the performance of the Proliant DL360 and the T2000s. I know they're only loosely comparable in price, but they look fairly close in many other aspects (redundancy, I/O backplane, management, etc). I have no doubt that comparing SWaP (Sun's Space, Watts and Performance) metrics will show a clear difference.

I should also add that I obviously have a lot of performance tuning to do. 50 simultaneous users was enough for a rough test bed to get rough test metrics. I'd like to see the server supporting thousands of clients, and I'm a long way from that. Keep in mind, though, that the test harness was designed to simply stream packets as fast as possible while measuring rough round trip times.

More to follow...

 

Read the original article: http://objcollab.blogspot.com/2006/12/java-performance-on-mulithreadedsupersc.html

 