The parallelization of computing, via
multi-threading cores, multi-core processors and multi-processor
systems is encouraging ever greater levels of application concurrency
to take advantage of the proliferating CPUs. Multi-core processors, in
particular, are fueling this phenomenon.
Beyond the dual- or
quad-core domain, manufacturers are starting to build many-core chips.
Examples include the Niagara T1 UltraSPARC from Sun Microsystems, which
has 8 cores and can support 32 threads (the next-generation Rock
processor will double this to 16 cores and 64 threads); Cavium Networks
16-core OCTEON MIPS64 processor for embedded applications; and Intel's
Polaris prototype processor, which sports 80 cores and boasts a peak
teraflop.
The Polaris prototype, which is part of Intel's
terascale computing initiative, is motivating researchers there to take
a hard look at a relatively new technology -- transactional memory or
TM, for short. Within the next ten years, the prospect of multi-core
teraflop processors like Polaris and new application domains to use
those processors will require vastly more parallel processing than ever
before. Even incorporating relatively low levels of concurrency in
today's applications is already challenging some of our best developers.
One
of the nastiest concurrency problems has to do with keeping data
thread-safe, that is maintaining global data integrity in the presence
of parallel executing threads. Failure to keep data thread-safe leads
to deadlocks, race conditions (data corruption) and priority inversion.
Worse, because these types of problems are time-sensitive, they are
often very hard to find during normal testing and in some cases go
undetected until after the application is deployed.
The typical
way to keep data thread-safe is to use global locks around objects that
are being accessed by more than one thread. Locks provide a
synchronization mechanism that blocks concurrent access of an object,
preventing the data race condition. Seems simple enough. But there
are a number of problems with this approach. Sometimes locks become
dependent on each other, such that each thread is holding a lock the
other thread needs. Or if a thread dies holding a lock it can block
other dependent threads.
Even for correctly implemented locks,
there's the issue of granularity. Coarse granularity protects larger
data objects and uses fewer locks. But as the number of threads scales
up, performance suffers. Finer granularity allows the programmer to
protect smaller data items and gives better performance as long (as
lock overhead is not overwhelming). It makes it possible, for example,
to lock individual record components rather than the entire record
structure. But finer granularity requires more complex algorithms and
more locks, so it is often much more difficult to implement correctly.
Transactional memory to the rescue
Terascale
computing, which relies on many-core parallelism, will be very
difficult to develop. The current languages only provide low-level
concurrency features. For Intel, terascale computing has become the
prime motivator to improve software concurrency technology. The
company's 80-core Polaris prototype will require much greater levels of
application concurrency than today. And the scaling up of multi-core
processors across time and product families will necessitate a solution
that doesn't require reprogramming based on core count.
"How can
the programmer write parallel code more effectively, that is, write
robust code that doesn't have bugs, but still scales and benefits from
the additional cores that each successive generation provides," asks
Ali-Reza Adl-Tabatabai, Intel Principal Engineer? "That's the big
challenge that we're going after."
To that end, Intel researchers
are looking to transactional memory as one of the key technologies that
will enable developers to write the terascale killer apps of the next
decade. The attraction of TM is that is appears to solve the most
annoying problems of global locks: application robustness and
scalability. These attributes are especially important for the type of
large-scale concurrency required by terascale applications.
Like
locks, transactional memory is a construct for concurrency control that
enables access to data shared by multiple threads. But unlike locks it
is an optimistic model. It assumes that in most cases only a single
thread will be contending for a given data item. A transaction is a
high-level construct that executes reads and writes to data as an
indivisible operation. From the application's point of view
intermediate states are not visible to other successful transactions.
So when a logical transaction is complete, the system verifies that
other logical transactions haven't made changes to the same memory that
would conflict with the first transaction. If they have, then the
transaction is re-executed until it succeeds.
Intel's view is
that TM should be encapsulated in language construct, and initially
implemented in software. At some point, it may be useful to provide a
hardware assist to provide better performance or functionality. But
this is not necessarily the case.
Adl-Tabatabai says that dynamic
memory garbage collection, a technology introduced about 15 years ago,
may be a good analogy of how transactional memory will evolve.
Initially, there were a number of techniques for implementing garbage
collection in software. People thought that they really would like
hardware support for this. But the software algorithms evolved and
eventually made it into mainstream languages like Java. It turned out
that hardware assistance wasn't really needed. In any case, once the
language semantics of TM are defined, the software/hardware
implementation should be transparent to the application developer.
Intel
researchers have prototyped extensions to Java and C that incorporate
transactional memory constructs. They've also investigated using
various compiler optimizations for transactional memory
implementations. Below are two trivial code examples that make data
object 'x' thread-safe. The first uses explicit locking, the second
uses the atomic block construct for transactional memory.
lock(L); x++; unlock(L);
atomic {x++;}
The
atomic construct guarantees the enclosed operations will be safe from
thread concurrency. The data transactions within the construct will
either execute completely or have no effect until it is safe to do so.
When the atomic block executes this happens as if in a single step in
relation to the other threads. In other words, from the programmer's
point of view, the transactions run in isolation.
The DARPA HPCS
research languages (Fortress, X10, Chapel) for high productivity
computing all provide atomic block construct in lieu of explicit lock
synchronization. A few other investigators have incorporated TM into
other research languages, but the final language to emerge from the
DARPA HPCS program may be the first one to formally introduce it as a
standard feature.
"The fact that the HPCS languages decided to
provide atomic constructs rather than locking constructs shows that
there's general consensus in the language design community that this is
the way to go for concurrency control in future languages," notes
Adl-Tabatabai.
According to Adl-Tabatabai, The HPCS language
effort represents a real step forward, since it will incorporate TM
from the beginning. Retrofitting transactions into older languages that
previously used locks can be problematic, since mixing explicit locking
and implicit locking via a low-level implementation of TM may introduce
conflicts when dealing with legacy code.
Scaling to advantage
On
many applications, coarse-grained locking -- putting locks around whole
data structures -- doesn't scale well. Threads that are concurrently
accessing the same data structure must wait unnecessarily when they are
reading or writing disjoint data within the structure. To increase
performance, the developer must re-program the algorithm using
fine-grained locking -- putting global locks around individual data
items in the data structures. Fine-grained locking allows different
threads to concurrently access disjoint data in the same data
structures. Transactional memory implicitly provides fine-grained
locking, automatically providing the associated performance benefits.
Adl-Tabatabai
describes an example using traditional fine-grained locking for a hash
table. Using the Java 5 class libraries, a professor of computer
science (who wrote a book on Java concurrency) was able to make the
appropriate modifications to the hash table algorithms to incorporate
fine-grained locking. The changes were subtle, yet quite complex. After
two years of reviews by the Java standards committees, it was approved
for inclusion into the Java class libraries.
"In general, doing
this kind of coding is very difficult," says Adl-Tabatabai. "You're not
going to be able to get your average Joe Programmer to write code like
this, and write it in a way that's error-free and doesn't introduce
data races or deadlocks."
Besides fine-grained locking, TM has
the additional advantage of allowing for concurrent reads of the same
data, which traditional locks cannot do. A special type of lock, called
a reader-writer lock can be used to overcome this deficiency but this
requires application code modifications. Transactional memory, on the
other hand provides a systematic way for programmers to take advantage
of these features.
With no existing base of software, how will TM
get mainstreamed into applications? Jerry Bautista, Director,
Microprocessor Technology Management at Intel says the evolution to
multi-core architectures will create a need in the marketplace for
technologies like transactional memory as the complexity of programming
creates a new set of challenges.
"As we introduce more high
performance hardware, there will be a demand generated in the
programming community for ways to get around these common problems,"
says Bautista. "People are already noting these challenges today. We're
just trying to run ahead here."