Processor Forum 2005

By John Latta, WAVE 0547 11/25/05

San Jose, CA
October 25-26, 2005

This event used to be called the Microprocessor Forum. A single conference would once see tens of new processors announced; now there may be 2 or 3. The day of the new general-purpose (GP) processor has largely faded, as competing against Intel, IBM and ARC has become nearly impossible. One result has been the expansion of specialized processors such as the network processor. But even in the graphics processor space there are effectively only two companies: nVidia and ATI. Through all of this, innovation in processor development continues, though at a different pace and on a global basis, with design efforts taking place everywhere. This Processor Forum has as its theme The Road to Multicore. The transition has clearly created excitement in the design community. It has also brought more attendees than the Processor Forum has seen for a number of years.

The Implications of Multi-Core

Microprocessors, and in particular the X86 family, have hit a brick wall with power dissipation when clock speed is the means of increasing processor performance. The issue is power density on the chip, not just average power: power densities are at 300 to 400 W/sq cm. The solution is to put more processor cores, running at lower clock speeds, on the same die to provide greater computational power at lower power densities. The penalty is the increased software complexity needed to take advantage of the multi-core architecture. One open question is how far the number of cores can go. A compelling case was made by Azul that there is also a thermal barrier with multi-core: a many-core architecture will eventually face the same limits that single core does today. The current issue is: what are the benefits of multi-core to users of traditional desktop PCs? The answer which emerged from the WAVE probing:

2X - Users will see a significant improvement

4X - Users will only see a slight improvement

>4X - Uncertain if Users will see any improvement, especially given the current state of software

As Scott Sellers from Azul Systems stated, software developers have become lazy about optimizing their products for performance because they could rely on a 40% performance improvement year after year. The rationale for the performance improvement implied above is the following:

A multi-threaded OS will readily take advantage of 2 cores, so the user will see an immediate improvement. But without a shift in application development to utilize multi-core, the gains will rapidly erode. Even then there will be limitations due to the serial processing nature of many applications. Thus, the user benefit of each additional core will shrink as more and more cores are put on a die.
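The diminishing return from the serial portion of an application is commonly modeled with Amdahl's law. The sketch below is an illustration, not something shown at the Forum; the 60% parallel fraction is an assumed figure chosen only to make the shape of the curve visible.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the run time scales with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumption for illustration: 60% of the application's run time can be
# spread across cores; the remaining 40% is serial.
PARALLEL_FRACTION = 0.6

for cores in (1, 2, 4, 8, 16):
    speedup = amdahl_speedup(PARALLEL_FRACTION, cores)
    print(f"{cores:2d} cores -> {speedup:.2f}x")
```

Under this assumption the first doubling of cores gives the largest step, and the speedup can never exceed 1 / 0.4 = 2.5x however many cores are added.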

The value proposition of multi-to-many core is an ecosystem issue. Although the move is being driven today by the power density limits that processor companies face with single-core designs, the solution to how to use the resulting designs lies with the OS, applications and users.
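The power argument behind the move to multi-core can be sketched with the usual approximation that dynamic power in CMOS scales with voltage squared times frequency. This is a simplified model and not taken from any presentation: the 0.8 voltage scale at half clock is an assumed figure, leakage is ignored, and the result is total dynamic power rather than power density per unit area.

```python
def relative_dynamic_power(voltage_scale: float, frequency_scale: float, cores: int = 1) -> float:
    """Dynamic power relative to one core at nominal voltage and clock (P ~ V^2 * f)."""
    return cores * voltage_scale ** 2 * frequency_scale


# Baseline: one core at nominal voltage and clock.
single_core = relative_dynamic_power(1.0, 1.0)

# Assumption: at half the clock the supply voltage can drop to 80% of nominal.
dual_core = relative_dynamic_power(0.8, 0.5, cores=2)

# Aggregate clock cycles per second relative to the baseline (2 cores x 0.5 clock).
dual_core_throughput = 2 * 0.5

print(f"single core: power {single_core:.2f}, throughput 1.00")
print(f"dual core:   power {dual_core:.2f}, throughput {dual_core_throughput:.2f}")
```

With these assumed numbers two slower cores match the nominal cycle throughput of the single fast core at about two thirds of the dynamic power, provided the software can keep both cores busy.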

Azul Systems – Bringing Disruption to Enterprise Computing

Scott Sellers, VP of Hardware Engineering and CTO of Azul Systems, made a case that multicore processors carried to the many-core level provide significant value. His presentation included the following:

The server market in the enterprise is approximately $50B but grows only about 3% per year. The Fortune 1000 companies spend $7 to $8B per year on servers for VM based applications, and this market is growing at 20% to 30% per year as the VM model is dominant. Distributed applications are being developed using largely J2EE as the development foundation. The problem is that traditional servers multiply at rapid rates to support the growth in applications. One result is that the effective TCO also rises rapidly.

Azul Systems provides a network computing appliance which has a close parallel with NAS. At the center of the appliance is the Azul Vega 1 processor, which is capable of scaling to 384 coherent threads per system – well beyond even the Intel IA64 Montecito.

It is estimated that 50% of enterprise applications are being developed in Java today and that by 2006 80% will migrate to Java. Because J2EE is fully multi-threaded, it can be effectively employed on a machine targeted to executing the Java VM.

The Azul Vega processor does not expose its instruction set because it executes the Java VM code. A major improvement made with the Vega is “pauseless” garbage collection. The processor has 24 cores per chip. The design supports multi-chip SMP where each processor has complete and equal access to memory.

An appliance which is 11RU has up to 384 cores and 256GB of memory. The appliance can respond to spikes in processor demand in 10ms. The implementation requires no changes to existing Java applications and the appliance is OS agnostic. The appliance can just be plugged into the data center and it runs.

One of the first commercial installations is in the travel industry, for reservations; the company is Pegasus. They had 8 X 8-way SPARC and 4 X 2-way SPARC servers, 72 CPUs in all, which were running at 70% utilization. When an Azul appliance was added, using 15 cores, its utilization was only 3%. The system still included 3 X 2-way SPARC servers running at 70% utilization. The net result was a reduction in CPUs from 72 to 6.
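The CPU counts quoted for Pegasus can be checked with a few lines of arithmetic. The sketch below uses only the server counts given above; the variable and function names are illustrative, and the count covers SPARC CPUs only, not the Azul cores.

```python
# (number of servers, CPUs per server), as quoted in the presentation
sparc_before = [(8, 8), (4, 2)]
sparc_after = [(3, 2)]


def total_cpus(fleet):
    """Sum CPUs across a list of (servers, cpus_per_server) pairs."""
    return sum(servers * cpus for servers, cpus in fleet)


cpus_before = total_cpus(sparc_before)
cpus_after = total_cpus(sparc_after)
reduction = 1 - cpus_after / cpus_before

print(f"SPARC CPUs before: {cpus_before}")
print(f"SPARC CPUs after:  {cpus_after}")
print(f"Reduction:         {reduction:.0%}")
```

This reproduces the 72 and 6 figures, a reduction of roughly 92% in host CPUs.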

One of the major issues is how to license software. Traditional software licensing is based on the number of CPUs, and applying this directly to the Azul processors would make software costs prohibitive. Azul is working with the enterprise software providers, BEA being the first, to arrive at a more reasonable license strategy. The approach is to license the host server and not the pool of processors.

In summary, Azul stated that by 2010:

No one computer architecture will fit all.

Enterprise compute architecture will separate from the client architecture.

The WAVE spoke with Scott Sellers and asked: how does this architecture extend downstream to the workstation and the client? The problem lies with today's software architecture. Even if some applications could gain significant performance from thread level parallelism, it may not be economical to rewrite them. Thus, extending the benefits of many core to the client may be many years off.

Is there a Magic Solution to Using Multi-core?

The problem is uniform: decrease thermal density by going to multi-core with lower clock speeds. But the solutions for how to use these multi-core processors are diverse. Solutions presented included programming to support parallelism, virtualization, an OS per core, and many core.

We describe these solutions as:

Add concurrency in software.
Virtualize the software and hardware.
Implement one OS per core.
Create many-core processors.

Here is a sample.

Microsoft – Herb Sutter

Herb gave a compelling presentation that the hardware and software community needs to work together to address the multi-core issue. His theme was captured in the first and last slides:

The need for concurrency is here now. “The future is now, everybody is doing it (concurrency) because they have to.”

Concurrency will affect the way we write software. “The Free Lunch” is over. Only applications with lots of latent concurrency will regain the performance free lunch.

The software industry has lots of work to do and we estimate that the hardware industry vastly underestimates this.

The problem with concurrency lies on the client. There are many threads per user “request.” The client has not been optimized to run on a multi-core computer.

An appeal was made to

Not underestimate the programming problem.

Focus hardware semantics and operations on programmability first and speed second.

Herb then outlined his work on Concur, a set of Object Oriented extensions to support concurrency.

XenSource – Simon Crosby

The value of Xen is that it decouples the OS software and applications from the underlying hardware.

Xen is capable of live relocation which enables a running virtual machine to be moved in 50ms.

The hypervisor is less than 50K lines of code. One of its advantages is that it virtualizes only the base platform of the CPU, MMU and low level interrupts. It also supports the native OS device drivers.

XenSource is an open source initiative.

When run on the Pacifica extensions by AMD, Xen can run Windows XP/2003 without any paravirtualization modifications.

AMD – Kevin McGrath

AMD described, with some glee, the innovation they feel is present in “Pacifica” technology. It adds virtualization extensions to the 64-bit X86 AMD processors to support hypervisors.

These features include:

Processor Guest Mode
New Instruction – VMRUN
New Data Structure – Virtual Machine Control Block (VMCB)
Enhanced Memory Management for virtualization
Interrupt architecture enhancements

Freescale – Toby Foster

Freescale was advocating the use of an OS per core in embedded applications. The chip which supports this architecture is the MPC8641D. It is based on the PowerPC e600 core. There is 1MB of L2 cache per core. It was claimed that this approach supports many of the embedded applications which have dedicated processing requirements per core.

IBM – David Krolak and Alex Chow

IBM made it sound easy to program its Cell processor. Yet the steps needed to realize the application development flow seem not quite as easy.

This two part talk gave an overview of the Cell processor and then some of its programming considerations.

The Cell processor implements 9 cores which run at a 3 to 4 GHz clock. It is called the Broadband Engine (BE). Control is done with the PPE (Power Processor Element), and there are 8 Synergistic Processor Elements (SPE) that use Synergistic Memory Flow Control (SMF). There is a high bandwidth Element Interconnect Bus (EIB). It is claimed that the BE can support:

Game console systems
Blades
HDTV
Home Media Servers
Supercomputers

The performance of the EIB is impressive:

Four 16-byte data rings
Operates at ½ processor core frequency
Peak rate 300GB/s at 3.2GHz processor clock with 200GB/s sustained

Each EIB Bus supports 25.6GB/sec in each direction.
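The 25.6GB/s figure follows from the ring width and clock quoted above. The sketch below assumes one 16-byte transfer per EIB clock in each direction, which is an interpretation of the slides rather than something IBM stated; the constant names are illustrative.

```python
CORE_CLOCK_HZ = 3.2e9        # processor clock quoted for the peak-rate figure
RING_WIDTH_BYTES = 16        # width of each EIB data ring
QUOTED_PEAK_BYTES_PER_S = 300e9

# The EIB runs at half the processor core frequency.
eib_clock_hz = CORE_CLOCK_HZ / 2

# Assumption: one 16-byte transfer per EIB clock, per direction.
per_direction_bytes_per_s = RING_WIDTH_BYTES * eib_clock_hz

print(f"EIB clock:      {eib_clock_hz / 1e9:.1f} GHz")
print(f"Per direction:  {per_direction_bytes_per_s / 1e9:.1f} GB/s")
print(f"Peak / per-direction ratio: {QUOTED_PEAK_BYTES_PER_S / per_direction_bytes_per_s:.1f}")
```

The last line is only the ratio of the two quoted figures (about 11.7); it is not a statement about how many ports or transfers the EIB actually supports.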

Each Cell has two Rambus I/O controllers which are capable of 30GB/s outbound and 35GB/s inbound.

In terms of programming, the data level parallelism is SIMD. The task level parallelism is 8 SPEs and 2 PPE SMT threads. The Cell programming model is:

Local Store resident multi-tasking
Self-managed multi-tasking
Kernel-managed SPE scheduling and virtualization

In order to realize the power of the BE this application development flow was recommended:

Iterative Development Steps
Complexity study of new or legacy algorithm
Data traffic analysis
Experimental partitioning and mapping of the algorithm and program structure to the architecture.

Additional hints were given:

Start simple – Develop PPE Control, PPE Scalar code
Develop PPE Control, partitioned SPE scalar code
Transform SPE scalar code to SPE SIMD code
Re-balance the computational data movement

The thrust of the IBM presentation was that the BE is easy to program.

WAVE Comments

When it comes to multi-core, hardware processors are well ahead of the software to fully utilize them. As the WAVE heard at Processor Forum, in spite of increasing the number of cores, the gains are not assured. Yet, several speakers spoke of the poor state of enterprise server utilization – 15%. The data center has specific thermal and power limitations. Thus, an improvement in performance for a given power usage is a large win.

Today, the market value for effective use of multi-core lies in the enterprise. This is the low hanging fruit. Yet, we learned that there is a limit to the gain from highly parallel architectures.

If applications continue to migrate away from the desktop to services, it is not clear that the client will continue to need the benefits of Moore’s Law on the desktop. Thus, the gains from improving performance lie in the data center, which is already the focus of multi-core processors and virtualization. A good example of this is Google. Yes, the desktop cannot do Internet crawling and then searching, but Google has shown how a massively parallel implementation can do what the desktop cannot via an Internet service.

IBM was emphatic that the Cell is easy to program. Yet we wondered if the Cell is supercomputing all over again, which stumbled because it could only solve specific classes of problems. The Cell has the advantage of a major application, the Sony PS3, but it remains to be seen if the technology can go beyond specialized processing. What is important about the Cell is that it is one look into the future of many core processors (at present the Cell has only 8 processors). The Cell has a very fast interconnect bus which supports its fast processors. Thus, we already see multiple architecture approaches to multi and many core. It remains to be seen if processing will fragment or settle on a single solution, as happened with the X86.
