Debunking the 100X GPU vs. CPU Myth – really?

by apoppin · June 24, 2010

At the International Symposium on Computer Architecture in Saint-Malo, France, Intel presented a technical paper showing that application kernels run about 2.5 times faster on average on an NVIDIA GeForce GTX 280 than on an Intel Core i7 960. The paper, available as a .pdf, is very technical but well worth reading.

What is strange is that Intel admits that NVIDIA’s last-generation GTX 280 is (at least) 2.5 times faster than their Core i7 960 CPU. From their own chart:

There is a 14x speedup from using the GTX 280 over the Core i7 960 in one of the tests, although the paper does not really specify its testing methodology other than to say that the Core i7 code was heavily optimized:

We measured the performance of our kernels on (1) a 3.2GHz Core i7-960 processor running the SUSE Enterprise Server 11 operating system with 6GB of PC1333 DDR3 memory on an Intel DX58SO motherboard, and (2) a 1.3GHz GTX280 processor (an eVGA GeForce GTX280 card with 1GB GDDR3 memory) in the same Core i7 system with Nvidia driver version 19.180 and the CUDA 2.3 toolkit.

On its own blog, NVIDIA says it believes the codes run on the GTX 280 were run right out of the box, without any optimization. It is somewhat unclear from the technical paper exactly what codes were run and how they were compared between the GPU and CPU. The Intel paper is called “Debunking the 100x GPU vs CPU Myth: An Evaluation of Throughput Computing on CPU and GPU”. It is a fact that not every application can see this kind of speedup from the parallel processing of the GPU. However, below are a few examples, found on CUDA Zone, of other developers who have achieved speedups of more than 100x in their applications.

Developer	Speed Up	Reference
Massachusetts General Hospital	300x	http://www.opticsinfobase.org/oe/abstract.cfm?uri=oe-17-22-20178
University of Rochester	160x	http://cyberaide.googlecode.com/svn/trunk/papers/08-cuda-biostat/vonLaszewski-08-cuda-biostat.pdf
University of Amsterdam	150x	http://arxiv.org/PS_cache/arxiv/pdf/0709/0709.3225v1.pdf
Harvard University	130x	http://www.springerlink.com/content/u1704254764133t5/?p=c5eead9af73340e58a313d95581cfd40?=49
University of Pennsylvania	130x	http://ic.ese.upenn.edu/abstracts/spice_fpl2009.html
Nanyang Tech, Singapore	130x	http://www.opticsinfobase.org/abstract.cfm?URI=oe-17-25-23147
University of Illinois	125x	http://www.nvidia.com/object/cuda_apps_flash_new.html#state=detailsOpen;aid=c24dcc0f-c60c-45f9-8d57-588e9460a58f
Boise State	100x	http://coen.boisestate.edu/senocak/files/BSU_CUDA_Res_v5.pdf
Florida Atlantic University	100x	http://portal.acm.org/citation.cfm?id=1730836.1730839&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264
Cambridge University	100x	http://www.wbic.cam.ac.uk/~rea1/research/AIRWC.pdf

Intel states for their testing, “We typically find that the highest performance is achieved when multiple threads are used per core. For Core i7, the best performance comes from running 8 threads on 4 cores. For GTX280, while the maximum number of warps that can be executed on one GPU SM is 32, a judicious choice is required to balance the benefit of multithreading with the increased pressure on registers and on-chip memory resources. Kernels are often run with 4 to 8 warps per core for best GPU performance.”
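To put the thread counts in Intel's quote side by side, here is a minimal sketch of the arithmetic. The warp size of 32 threads and the GTX 280's 30 SMs are not stated in this article; they are assumptions taken from NVIDIA's published specifications, and the function names are ours, purely for illustration.

```python
# Thread-count arithmetic for the configurations Intel describes.
# Assumptions (not from the article): 32 threads per warp, 30 SMs on GTX 280.
WARP_SIZE = 32
GTX280_SM_COUNT = 30


def cpu_threads_per_core(total_threads: int, cores: int) -> int:
    """Hardware threads per core, e.g. 8 threads on 4 cores."""
    return total_threads // cores


def gpu_threads_per_sm(warps_per_sm: int) -> int:
    """Threads resident on one SM for a given number of warps."""
    return warps_per_sm * WARP_SIZE


core_i7_per_core = cpu_threads_per_core(8, 4)
max_threads_per_sm = gpu_threads_per_sm(32)   # hardware maximum quoted by Intel
typical_low = gpu_threads_per_sm(4)           # "4 to 8 warps per core"
typical_high = gpu_threads_per_sm(8)
chip_wide_low = typical_low * GTX280_SM_COUNT
chip_wide_high = typical_high * GTX280_SM_COUNT

print(f"Core i7: {core_i7_per_core} threads per core, 8 in total")
print(f"GTX 280: {typical_low} to {typical_high} threads per SM "
      f"(max {max_threads_per_sm}), "
      f"{chip_wide_low} to {chip_wide_high} across the chip")
```

Under these assumptions the CPU runs 8 threads in total while the GPU keeps thousands in flight, which is the difference in execution model that Intel's tuning advice is working around.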

Well, what did Intel find after optimizing their Core i7 960 CPU?

For example, the previously reported LBM number on GPUs claims 114X speedup over CPUs. However, we found that with careful multithreading, reorganization of memory access patterns, and SIMD optimizations, the performance on both CPUs and GPUs is limited by memory bandwidth and the gap is reduced to only 5X.

“Only 5X” faster! This is the first time that this editor has seen such an admission from any company about its product being radically slower than a competitor’s. Worst of all for Intel, they are testing with NVIDIA’s last-generation GPU; the new GTX 480 is probably more than twice as fast as the GTX 280 for this type of parallel computing.
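Intel's point that LBM ends up limited by memory bandwidth on both chips can be sanity-checked with simple arithmetic: if a kernel is purely bandwidth-bound, the best possible speedup is roughly the ratio of the two peak bandwidths. The sketch below is ours, not Intel's method. It assumes the 6GB of PC1333 DDR3 runs in triple-channel mode (the article does not say) and uses 141.7 GB/s for the GTX 280, a figure from NVIDIA's published specifications rather than from this article.

```python
# Upper bound on GPU-over-CPU speedup for a purely bandwidth-bound kernel.
# Assumptions (not from the article): triple-channel DDR3-1333 on the CPU,
# and a 141.7 GB/s peak memory bandwidth for the GTX 280.

def ddr3_peak_gbs(mega_transfers_per_s: float, channels: int,
                  bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (64-bit channels by default)."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


def bandwidth_bound_speedup(gpu_gbs: float, cpu_gbs: float) -> float:
    """Best-case speedup when both devices are limited only by bandwidth."""
    if gpu_gbs <= 0 or cpu_gbs <= 0:
        raise ValueError("bandwidths must be positive")
    return gpu_gbs / cpu_gbs


cpu_peak = ddr3_peak_gbs(1333, channels=3)   # assumed triple channel
gpu_peak = 141.7                             # assumed GTX 280 peak, GB/s
ratio = bandwidth_bound_speedup(gpu_peak, cpu_peak)

print(f"CPU peak: {cpu_peak:.1f} GB/s, GPU peak: {gpu_peak:.1f} GB/s")
print(f"Bandwidth-bound speedup limit: {ratio:.1f}x")
```

With these assumed peaks the limit lands in the same ballpark as the 5X Intel reports. Achieved bandwidth differs from peak on both devices, so this is a rough bound, not a prediction.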

Intel has some suggestions for developers to improve their CPU performance, none of which is easy or practical to implement.

CONCLUSION
In this paper, we analyzed the performance of an important set of throughput computing kernels on Intel Core i7-960 and Nvidia GTX280. We show that CPUs and GPUs are much closer in performance (2.5X) than the previously reported orders of magnitude difference. We believe many factors contributed to the reported large gap in performance, such as which CPU and GPU are used and what optimizations are applied to the code. Optimizations for CPU that contributed to performance improvements are: multithreading, cache blocking, and reorganization of memory accesses for SIMDification. Optimizations for GPU that contributed to performance improvements are: minimizing global synchronization and using local shared buffers are the two key techniques to improve performance. Our analysis of the optimized code on the current CPU and GPU platforms led us to identify the key hardware architecture features for future throughput computing machines – high compute and bandwidth, large caches, gather/scatter support, efficient synchronization, and fixed functional units. We plan to perform power efficiency study on CPUs and GPUs in the future.
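For readers unfamiliar with the term, cache blocking (one of the CPU optimizations Intel lists) means restructuring loops so that a small tile of data is reused while it is still in cache. The following is a minimal sketch of the loop structure only, using a matrix multiply as the example. It is not code from the Intel paper, the function names and block size are hypothetical, and pure Python will not show the speed benefit that a compiled, SIMD-friendly version would.

```python
# Cache blocking (loop tiling) illustrated on matrix multiplication.
# Illustrative only: shows the loop restructuring, not real performance.

def matmul_naive(a, b):
    """Straightforward triple loop: C = A x B."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same result, computed tile by tile so each tile is reused while hot."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):            # tile of rows of A and C
        for kk in range(0, m, block):        # tile along the shared dimension
            for jj in range(0, p, block):    # tile of columns of B and C
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        a_ik = a[i][k]
                        # innermost loop walks a row of B contiguously,
                        # the access pattern that SIMD units prefer
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += a_ik * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_blocked(a, b))
```

In a real implementation the block size would be chosen so that the working tiles fit in the target cache level, which is exactly the kind of hand tuning that makes these optimizations hard to apply.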

We have been following Fermi and Tesla computing since our very first article on NVISION08. At NVIDIA’s next conference, GTC, last September, we could see that the emphasis was now on GPU computing on Day 1, Day 2 and Day 3. We also looked very deeply into the Fermi GF100 architecture and into the GTX 480 video card, which has met NVIDIA’s goal of a general-purpose processor (the GPU, aka Graphics Processing Unit) that also renders amazing graphics for gaming. Clearly NVIDIA is moving into Intel’s territory and it will be quite interesting to see what happens. We promise to stay on top of this developing story for our readers.

Mark Poppin

ABT Senior Editor

