Processor Forum 2005
By John Latta, WAVE
San Jose, CA
October 25-26, 2005
This event used to be called the Microprocessor Forum. A single
conference would once have tens of new processors announced, and now
there may be 2 or 3 processors announced. The day of a new general-purpose (GP) processor
has largely faded as competing against Intel, IBM and
ARC has become nearly impossible. One result has been the expansion of
specialized processors such as the network processor. But even in the
graphics processor space, there are effectively only two companies: nVidia
and ATI. Through all of this, innovation in processor development continues,
though at a different pace and now shared on a global basis, with
design efforts taking place everywhere. This Processor Forum has as its
theme The Road to Multicore. This transition has clearly created excitement
in the design community. It has also brought more attendees than
the Processor Forum has seen for a number of years.
The Implications of Multi-Core
Microprocessors, and in particular, the X86 family, have
hit a brick wall with power dissipation when clock speed is the means
to increase processor performance. Specifically, the limit is power density
on the chip, not just average power. Power densities are at 300 to 400
W/sq cm. The solution is to put more processor cores, at lower clock
speeds on the same die, to effectively provide greater computational
power at lower power densities. With this comes a penalty that requires
increased software complexity to take advantage of the multi-core architecture.
One open issue is where this will play out in the number of cores.
A compelling case was made by Azul that there is also a thermal barrier
with multi-core. Effectively, a many-core architecture will face the same
limits as single core does today. The current issue is: what
are the benefits of multi-core to users of traditional desktop PCs? The
answer which emerged from the WAVE probing:
2X - Users will see a significant improvement
4X - Users will only see a slight improvement
>4X - Uncertain if Users will see any improvement,
especially given the current state of software
As Scott Sellers from Azul Systems stated, software developers
have become lazy in optimizing their products for performance when they could
rely on a 40% performance improvement year after year. Thus, the rationale
for the performance improvement implied above is the following:
A multi-threaded OS will be able to readily take advantage
of 2 cores and thus the user will see an immediate improvement. But
without a shift in the development of applications to utilize multi-core
the gains will rapidly erode. Yet, even here there will be limitations
due to the serial processing nature of many applications. Thus, the
user benefit of multi-core will erode over time as the number of
cores on a die keeps rising.
The value proposition of multi-to-many core is an ecosystem
issue. In spite of the fact that today this is being driven by the power
density limitations of the designs of a single core faced by processor
companies, the solution to how to use the resulting designs lies with
the OS, applications and users.
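The diminishing return from added cores described above can be illustrated with Amdahl's law, a standard model that was not named in the talks. The sketch below is only an illustration: the function name and the 50% parallel fraction are assumptions chosen for the example, not figures from the Forum.

```python
def amdahl_speedup(cores: int, parallel_fraction: float) -> float:
    """Ideal speedup on `cores` cores when only `parallel_fraction`
    of the work can be spread across cores (Amdahl's law)."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumed value for illustration only: half of the application is serial.
PARALLEL_FRACTION = 0.5

for cores in (1, 2, 4, 8, 16):
    speedup = amdahl_speedup(cores, PARALLEL_FRACTION)
    print(f"{cores:2d} cores: {speedup:.2f}x")
```

Under this assumed fraction the speedup can never reach 2x however many cores are added, which is consistent with the view that the gains depend on applications being rewritten to reduce their serial portion.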
Azul Systems – Bringing Disruption to Enterprise
Scott Sellers, VP of Hardware Engineering and CTO of Azul
Systems, made a case that multicore processors carried to the many-core
level provide significant value. His presentation included the following:
The server market in the enterprise is approximately
$50B but only grows about 3% per year. The Fortune 1000 companies spend
$7 to $8B per year on servers for VM based applications but this market
is growing at 20% to 30% per year as the VM model becomes dominant. Using
largely J2EE as the development foundation, applications are being developed
which are distributed. But the problem is that the traditional servers
multiply at rapid rates to support the growth in applications. One
of the results is that the effective TCO also rises rapidly.
Azul Systems provides a network computing appliance which
has a close parallel with NAS. At the center of their appliance
is the Azul Vega 1 processor which is capable of scaling to 384 coherent
threads per system – well beyond even the Intel IA64 Montecito.
It is estimated that 50% of the enterprise applications
are today being developed in Java and by 2006 80% will migrate to Java.
With J2EE being fully multi-threaded, it can be effectively employed
on a virtual machine targeted to executing the Java VM.
The Azul Vega processor does not expose its instruction
set because it executes the Java VM code. A major improvement made
with the Vega is “pauseless” garbage collection. The processor
has 24 cores per chip. The design supports multi-chip SMP where each
processor has complete and equal access to memory.
An appliance which is 11RU has up to 384 cores and 256GB
of memory. The appliance can respond to spikes in processor demand
in 10ms. The implementation requires no changes to existing Java applications
and the appliance is OS agnostic. The appliance can just be plugged
into the data center and it runs.
One of the first commercial installations is in the travel
industry, for reservations; the company is Pegasus. They had
8 X 8-way and 4 X 2-way SPARC servers, 72 CPUs in all, which were
running at 70% utilization. When an Azul appliance was added, 15 cores,
its utilization was only 3%. The system still included 3 X 2-way SPARC
servers running at 70% utilization. The net result was a reduction in
CPUs from 72 to 6.
One of the major issues is how to license software. Traditional
software licensing is based on the number of CPUs, and applying this directly
to the Azul processors would make software costs prohibitive. Azul
is working with those that provide the enterprise software, and
BEA is the first one, to have a more reasonable license strategy. The
approach is to license the host server and not the pool of processors.
In summary, Azul stated that by 2010:
No one computer architecture will fit all.
Enterprise compute architecture will separate from
the client architecture.
The WAVE spoke with Scott Sellers and asked how
this architecture extends downstream to the workstation and to the client.
The problem lies with the software architecture today. Even if some applications
can benefit from significant performance improvements from thread-level
parallelism, it may not be economical to rewrite the applications. Thus,
extending the benefits of many-core to the client may be many years off.
Is there a Magic Solution to Using Multi-core?
The problem is uniform: decrease thermal density by going
to multi-core with lower clock speeds. But the solutions for how to use
these multi-core processors are diverse. Solutions presented included
programming to support parallelism, virtualization, an OS per core, and
many-core processors. We describe these solutions as:
Concurrency in software.
Virtualize the software and hardware.
Implement one OS per core.
Create many-core processors.
Here is a sample.
Microsoft – Herb Sutter
Herb gave a compelling presentation that the hardware
and software community needs to work together to address the multi-core
issue. His theme was captured in the first and last slides:
The need for concurrency is here now. “The future
is now, everybody is doing it (concurrency) because they have to.”
Concurrency will affect the way we write software. “The
Free Lunch” is over. Only applications with lots of latent
concurrency will regain the performance free lunch.
The software industry has lots of work to do and we
estimate that the hardware industry vastly underestimates this.
The problem with concurrency lies on the client. There
are many threads per user “request.” The client has
not been optimized to run on a multi-core computer.
An appeal was made to:
Not underestimate the programming problem.
Focus hardware semantics and operations on programmability
first and speed second.
Herb then outlined his work on Concur, a set of Object
Oriented extensions to support concurrency.
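To make "latent concurrency" concrete, here is a minimal sketch in Python. It is not Concur syntax, which was not shown in this report. The function names and the sample data are hypothetical. The point is only that work split into independent units, with no shared state, can be handed to a pool of workers without changing the result.

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(chunk: str) -> int:
    """One independent unit of work: it touches no shared state,
    so chunks can be processed in any order or at the same time."""
    return len(chunk.split())


def count_serial(chunks):
    """Baseline: process the chunks one after another."""
    return sum(word_count(chunk) for chunk in chunks)


def count_concurrent(chunks, workers=4):
    """Same computation, with the chunks handed to a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(word_count, chunks))


# Hypothetical sample data for illustration.
chunks = [
    "the free lunch is over",
    "concurrency is here now",
    "cores keep multiplying",
]

print(count_serial(chunks), count_concurrent(chunks))
```

In CPython, threads do not speed up CPU-bound work like this because of the global interpreter lock, so the sketch shows the program structure rather than a measured gain. A process pool or another runtime would be needed to spread such work across cores.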
XenSource – Simon Crosby
The value of Xen is that it decouples the OS software
and applications from the underlying hardware.
Xen is capable of live relocation which enables a running
virtual machine to be moved in 50ms.
The hypervisor is less than 50K lines of code. One
of its advantages is that it virtualizes only the base platform of
the CPU, MMU and low level interrupts. It also supports the native
OS device drivers.
XenSource is an open source initiative.
When run on the Pacifica extensions by AMD, Xen can
run Windows XP/2003 without any paravirtualization modifications.
AMD – Kevin McGrath
AMD described, with some glee, the innovation they
feel is present in “Pacifica” technology. This allows
for virtualization extensions to the X86 64-bit AMD processors.
These features include:
Processor Guest Mode
New Instruction – VMRUN
New Data Structure – Virtual Machine Control Block
Enhanced Memory Management for virtualization
Interrupt architecture enhancements
Freescale – Toby Foster
Freescale was advocating the use of an OS per core
in embedded applications. The chip which supports this architecture
is the MPC8641D. It is based on the PowerPC e600 core. There is 1MB
of L2 cache per core. It was claimed that this approach supports
many of the embedded applications which have dedicated processing
requirements per core.
IBM – David Krolak and Alex Chow
IBM made it sound easy to program its Cell processor.
Yet, the process to realize the application development flow seems
not quite as easy.
An overview was given of the Cell processor and then
some of its programming considerations in this two part talk.
The Cell processor implements 9 cores which run at a
3 to 4 GHz clock. It is called the Broadband Engine (BE). Control
is done with the PPE (Power Processor Element), and there are 8 Synergistic
Processor Elements (SPE) that use Synergistic Memory Flow Control
(SMF). There is a high bandwidth Element Interconnect Bus (EIB).
It is claimed that the BE can support:
Game console systems
Home Media Servers
The performance of the EIB is impressive:
Four 16-byte data rings
Operates at ½ processor core frequency
Peak rate 300GB/s at 3.2GHz processor clock with
Each EIB Bus supports 25.6GB/sec in each direction.
Each cell has two Rambus I/O controllers which are
capable of 30GB/s outbound and 35GB/s inbound.
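The 25.6GB/sec per-direction figure quoted above is consistent with the other numbers given for the EIB. The short calculation below is a sketch under one assumption that was not stated in the talk: one full 16-byte transfer per bus clock in each direction. The function and constant names are illustrative only.

```python
# Figures quoted in the text above.
CORE_CLOCK_HZ = 3.2e9     # processor clock
RING_WIDTH_BYTES = 16     # width of each EIB data ring
BUS_CLOCK_RATIO = 0.5     # the EIB operates at half the core frequency


def eib_port_bandwidth_gb_s(core_clock_hz=CORE_CLOCK_HZ,
                            width_bytes=RING_WIDTH_BYTES,
                            bus_clock_ratio=BUS_CLOCK_RATIO):
    """Bandwidth in GB/s for one direction of one EIB connection.

    Assumption: one full-width transfer per bus clock.
    """
    bus_clock_hz = core_clock_hz * bus_clock_ratio
    return width_bytes * bus_clock_hz / 1e9


print(f"{eib_port_bandwidth_gb_s():.1f} GB/s per direction")
```

The sketch does not attempt to reproduce the 300GB/s peak figure, since the conditions attached to it are not recoverable from the notes.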
In terms of programming, the data level parallelism
is SIMD. The task level parallelism is 8 SPEs and 2 PPE SMT. The
cell programming model is:
Local Store resident multi-tasking
Kernel-managed SPE scheduling and virtualization
In order to realize the power of the BE this application
development flow was recommended:
Iterative Development Steps
Complexity study of new or legacy algorithm
Data traffic analysis
Experimental partitioning and mapping of the algorithm and program
structure to the architecture.
Additional hints were given:
Start simple – Develop PPE Control, PPE Scalar
Develop PPE Control, partitioned SPE scalar code
Transform SPE scalar code to SPE SIMD code
Re-balance the computational data movement
The thrust of the IBM presentation was that the BE
is easy to program.
When it comes to multi-core, hardware processors are well
ahead of the software needed to fully utilize them. As the WAVE heard at Processor
Forum, in spite of increasing the number of cores, the gains are not
assured. Yet, several speakers spoke of the poor state of enterprise
server utilization: 15%. The data center has specific thermal
and power limitations. Thus, an improvement in performance for a given
power usage is a large win.
Today, the market value for effective use of multi-core
lies in the enterprise. This is the low hanging fruit. Yet, we learned
that there is a limit to the gain from highly parallel architectures.
If applications continue to migrate away from the desktop
to services it is not clear that the client will continue to need the
benefits of Moore’s Law on the desktop. Thus, the gains in improving
performance lie in the data center, which is already the focus
of multi-core processors and virtualization. A good example of this is
Google. Yes, the desktop cannot do Internet crawling and then searching
but Google has shown how a massively parallel implementation can do what
the desktop cannot via an Internet service.
IBM was emphatic that the cell is easy to program. Yet,
we wondered if the cell is supercomputing all over again, a field which has stumbled
because it can only solve specific classes of problems. The cell has
the advantage of a major application – the Sony PS3 – but
it remains to be seen if the technology can go beyond specialized processing.
What is important about the cell is that it is one look into the future
of many core processors (At present the cell has only 8 processors).
The cell has a very fast interconnect bus which supports its fast processors.
Thus, we already see multiple architecture approaches to multi and many
core. It remains to be seen if processing will fragment or settle around
a single solution, as happened with the X86.