Galaxy’s GTX 660 arrives!
Architecture and Features
We have covered Kepler’s GK104 architecture in a lot of detail previously. You can read our GTX 680 introductory article and its follow-up. We also covered the launch of the GTX 690, the launch of the GTX 670 and the launch of the GTX 660 Ti. The new Kepler architecture builds on Fermi with some important improvements and refinements that we will briefly cover here before we get into performance testing.
SMX architecture
As Nvidia’s slide indicates, the new Kepler streaming multiprocessor is called the SMX, and it emphasizes twice the performance per watt of Fermi. The chip’s multi-threaded engine distributes work across three graphics processing clusters, each containing a raster engine and its streaming multiprocessors.
The Fermi SM has become the SMX cluster. Each SMX cluster includes a PolyMorph 2.0 engine, 192 CUDA cores, 16 texture units and its own L1 cache. The full GTX 680 implementation has four raster engines, 128 texture units and 32 ROPs, plus eight geometry units, each with its own tessellation unit, and a shared L2 cache. Both the GTX 670 and the GTX 660 Ti keep the four graphics engines but have one fewer SMX unit, and the GTX 660 Ti is further cut down to 24 ROPs.
The other main differentiation between the GTX 670/680 and the GTX 660 Ti/660 is the memory bus: the 660 and the 660 Ti are cut down from 256-bit to a much narrower 192-bit interface. Nvidia has substantially improved its memory controller over the last generation, however, and the 192-bit GDDR5 interface carries a declared throughput of 6Gbps.
The GeForce GTX 660’s memory runs at a 6008MHz data rate. The base clock of the GeForce GTX 670 and the GTX 660 Ti is 915MHz with a typical Boost clock of 980MHz, whereas the GTX 660 starts from a higher base clock of 980MHz with a typical Boost of 1033MHz, sometimes reaching nearly 1100MHz under good conditions. The Galaxy GC’s factory clock is set to 1006MHz with a typical boost to 1072MHz (and beyond, as we have observed). Galaxy did not adjust the memory speed.
The GeForce GTX 660 Ti ships with 1344 CUDA cores and 7 SMX units, whereas the GTX 660 ships with 960 CUDA cores and 5 SMX units. In addition to its five SMX units and GPU Boost, the GeForce GTX 660 ships with three 64-bit memory controllers (192-bit total), 384KB of L2 cache, and 24 ROP units.
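As a quick sanity check of those core counts, the short sketch below (our own illustration, not Nvidia data) simply multiplies the SMX count by the 192 CUDA cores per SMX described earlier.

```python
# Sanity check of the core counts above: Kepler packs 192 CUDA cores per SMX.
CORES_PER_SMX = 192
for name, smx_units in (("GTX 660 Ti", 7), ("GTX 660", 5)):
    cores = smx_units * CORES_PER_SMX
    print(f"{name}: {smx_units} SMX x {CORES_PER_SMX} = {cores} CUDA cores")
```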
Like certain earlier GeForce GPUs, the GeForce GTX 660’s memory controller supports mixed density memory module operation. This feature allows Nvidia to outfit the board with 2GB of memory instead of the expected 1.5GB while utilizing a 192-bit memory interface.
The memory controller logic divides the GTX 660’s eight memory modules as follows:
- Memory Controller 1: 4 pcs: 128M x 16 GDDR5 (1GB, 16-bit mode)
- Memory Controller 2: 2 pcs: 64M x 32 GDDR5 (512MB, 32-bit mode)
- Memory Controller 3: 2 pcs: 64M x 32 GDDR5 (512MB, 32-bit mode)
The three memory controllers divide the memory into equal-size fragments of 512MB each, creating a 1.5GB frame buffer across the full 192-bit interface. The remaining 512MB is accessed in an additional memory transaction by memory controller 1 over a 64-bit width. This gives the GPU access to a full 2GB of video memory with minimal latency. We are using a 2GB HIS HD 7850 to compare with the Galaxy GTX 660, although there are 1GB versions of the HD 7850 available. The GTX 660 design allows for either 2GB or 3GB of memory – never 1GB.
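For readers who want to see the arithmetic, here is a minimal sketch (our own numbers check, not anything from Nvidia) of how the three controllers combine into the 1.5GB interleaved region plus the extra 512MB on controller 1:

```python
# Illustrative breakdown of the GTX 660's mixed-density memory layout.
controllers_mb = {
    "MC1 (4 x 128M x16)": 1024,   # 1GB on memory controller 1
    "MC2 (2 x 64M x32)": 512,     # 512MB on memory controller 2
    "MC3 (2 x 64M x32)": 512,     # 512MB on memory controller 3
}

interleaved_mb = 3 * 512                                      # striped across all three (192-bit)
remainder_mb = sum(controllers_mb.values()) - interleaved_mb  # accessed by MC1 alone (64-bit)

print(f"Interleaved region: {interleaved_mb} MB over the 192-bit interface")
print(f"Remainder on MC1:   {remainder_mb} MB over a 64-bit width")
print(f"Total frame buffer: {sum(controllers_mb.values())} MB")
```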
The GeForce GTX 680 reference board measures 11″ in length whereas the GTX 670, GTX 660 Ti and GTX 660 are each 9.5″. Display outputs include two dual-link DVIs, HDMI and one mini-DisplayPort connector. Two 6-pin PCIe power connectors are required for the GTX 660 Ti and the GTX 670, but only one is required for the GTX 660. Nvidia’s partners may use a full-sized PCB for their GTX 660 as Galaxy does (below right) or a short one as EVGA does (below left).
Under load, the GeForce GTX 660 typically draws 115W of power in most non-TDP apps. This is with the power target slider at its default 100% setting. We maxed the slider out at its 110% limit using Galaxy’s own Xtreme Tuner overclocking software for all of our benching, just as we maxed out the PowerTune setting for the HIS Radeon HD 7850 in Catalyst Control Center. At this +10% maximum power setting, the GTX 660 will draw around 127W in non-TDP apps.
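As a back-of-the-envelope check (our own arithmetic, using the 115W figure above), the slider’s 110% ceiling works out almost exactly to the 127W we observed:

```python
# Rough check of the power-target slider: 110% of the typical 115W draw.
typical_draw_w = 115
for target_pct in (100, 110):
    print(f"{target_pct}% power target -> ~{typical_draw_w * target_pct / 100:.0f} W")
```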
This is a very brief overview of Kepler architecture as presented to the press at Kepler Editor’s Day in San Francisco a few months ago. We also attended Nvidia’s GPU Technology Conference (GTC) and you can find a lot more details about the architecture in our GTC 2012 report.
GPU Boost
GPU Boost was invented by Nvidia to improve efficiency by raising the GTX 660’s clocks automatically in response to dynamically changing power requirements. Up until now, Nvidia engineers had to select clock speeds based on a specific “worst case” power target – often a benchmark.
Unfortunately, not all apps are equal in their power requirements, and some applications are far more power-hungry than others. That means that in games with lower power requirements, the GPU cannot run at a higher core frequency because it is held back by a global power target.
With GPU Boost, there is real-time dynamic clocking with polling every millisecond. In this way, clocks can be ramped up to meet the power target of each application – not held back by the most stressful application, which is usually a benchmark, not a game.
As we found with the GTX 680, the GTX 670, the GTX 660 Ti and now the GTX 660, GPU Boost goes hand-in-hand with overclocking, delivering additional frequency on top of the clocks set by the end user. GPU Boost continues to work while the GTX 660 is overclocked, up to the maximum allowed by the ever-changing power envelope.
Raising the voltage also raises the frequency and the boost. In practice, if you monitor the frequencies, they constantly shift up and down.
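To make the behaviour concrete, here is a deliberately simplified sketch of the idea (our own illustration, not Nvidia’s actual algorithm; the power target, step size and sensor model are made-up numbers): poll board power roughly every millisecond and nudge the clock up or down to stay inside the target.

```python
import random
import time

POWER_TARGET_W = 140           # hypothetical board power target
BASE_CLOCK_MHZ = 980
MAX_BOOST_MHZ = 1100
CLOCK_STEP_MHZ = 13            # illustrative boost-bin size

def read_board_power_w(clock_mhz):
    """Stand-in for a real power sensor: draw rises with clock speed."""
    return clock_mhz * 0.13 + random.uniform(-3.0, 3.0)

clock = BASE_CLOCK_MHZ
for _ in range(20):                        # a real driver loops continuously
    draw = read_board_power_w(clock)
    if draw < POWER_TARGET_W and clock < MAX_BOOST_MHZ:
        clock += CLOCK_STEP_MHZ            # headroom available: boost
    elif draw > POWER_TARGET_W and clock > BASE_CLOCK_MHZ:
        clock -= CLOCK_STEP_MHZ            # over the target: back off
    time.sleep(0.001)                      # ~1ms polling interval
    print(f"{draw:5.1f} W -> {clock} MHz")
```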
Adaptive VSync
Traditional VSync is great for eliminating tearing until the frame rate drops below the target – then there is a severe drop, usually from 60 fps straight down to 30 fps, if the GPU cannot maintain exactly 60. When that happens, there is a noticeable stutter.
Nvidia’s solution is to adjust VSync dynamically – to turn it on and off instantaneously. In this way VSync continues to prevent tearing, but when the frame rate drops below 60 fps, the driver shuts VSync off to reduce stuttering instead of drastically dropping from 60 to 30 fps or even lower. When the minimum target is again met, VSync kicks back in. In gaming, you never notice Adaptive VSync happening; you just notice less stutter (especially in demanding games).
Adaptive VSync is a good solution that works well in practice. We spent more time with Adaptive VSync by playing games and it is very helpful although we never use it when benching.
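The decision itself is simple; the sketch below (our own illustration of the behaviour described above, not Nvidia driver code) shows the basic rule: keep VSync on while the renderer can hold the refresh rate, drop it when the frame rate falls below.

```python
# Minimal sketch of the Adaptive VSync decision for a 60Hz display.
REFRESH_HZ = 60

def vsync_enabled(current_fps: float) -> bool:
    """VSync stays on only while the frame rate can match the refresh rate."""
    return current_fps >= REFRESH_HZ

for fps in (75, 61, 59, 45):
    state = "on (no tearing)" if vsync_enabled(fps) else "off (no drop to 30 fps)"
    print(f"{fps} fps -> VSync {state}")
```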
FXAA & TXAA
TXAA
There is a need for new kinds of anti-aliasing because many modern engines use deferred lighting, which suffers a heavy performance penalty when traditional MSAA is applied. The alternative, living with jaggies, is unacceptable. TXAA (Temporal Anti-Aliasing) is a mix of hardware multi-sampling with a custom high-quality AA resolve that uses a temporal component: samples gathered over successive frames are compared to give a better AA solution. Its main advantage is that it reduces shimmering and texture crawling when the camera is in motion.
TXAA 1 exacts a performance cost similar to 2xMSAA while, under ideal circumstances, giving results similar to 8xMSAA. Of course, from what little time we have spent with it, it appears not quite as consistent as MSAA, but it works well in areas of high contrast. TXAA 2 is supposed to carry a performance penalty similar to 4xMSAA but with higher quality than 8xMSAA.
TXAA was the subject of a short IQ analysis of The Secret World, the first game to use it. So far, it appears to be a great option for situations where MSAA doesn’t work efficiently, and it almost completely eliminates shimmering and texture crawling when the camera is in motion. It works particularly well for The Secret World, as the slight blur gives the game a cinematic look.
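For the curious, the temporal part of the resolve boils down to blending the current frame’s samples with samples carried over from previous frames, which is what damps the frame-to-frame shimmer. The sketch below is our own very rough illustration of that general idea, not Nvidia’s TXAA resolve; the blend factor is an arbitrary assumption.

```python
# Toy temporal resolve: blend each new sample into an accumulated history.
def temporal_resolve(current_sample, history_sample, blend=0.1):
    """Exponential blend of the new sample into the per-pixel history."""
    return blend * current_sample + (1.0 - blend) * history_sample

pixel_history = 0.80                    # colour accumulated over past frames
for new_sample in (0.20, 0.95, 0.15):   # flickering raw samples
    pixel_history = temporal_resolve(new_sample, pixel_history)
    print(round(pixel_history, 3))      # output varies far less than the input
```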
FXAA
Nvidia has already implemented FXAA (Fast Approximate Anti-Aliasing). In practice, it works well in some games (Duke Nukem Forever/Max Payne 3), while in other games text or other visuals may be a bit blurry. FXAA is a great option to have when MSAA kills performance. We plan to devote an entire evaluation to comparing IQ between the HD 7000 series and the GTX 600 series, as well as against older-series video cards.
Specifications
Here are Nvidia’s specifications for the reference GTX 660:
As discussed, the GTX 660 is very similar to the GTX 660 Ti but with fewer CUDA cores and a higher clock speed to partially compensate. The GeForce GTX 660 was also designed from the ground up to deliver exceptional tessellation performance, which Nvidia claims is higher than the HD 7850’s. Tessellation is a key component of Microsoft’s DirectX 11 development platform for PC games.
Tessellation allows game developers to take advantage of the GeForce GTX 660’s tessellation ability to increase the geometric complexity of models and characters and deliver far more realistic and visually rich gaming environments. Needless to say, the new GTX 660 brings a lot of features to the table that Nvidia’s current customers will appreciate: improved CUDA and PhysX performance, 2D and 3D Surround with the ability to drive up to three LCDs plus a fourth accessory display from a single GTX 660, superb tessellation capabilities, and a really fast and power-efficient GPU compared to the GTX 460 and GTX 570 it replaces.
Surround plus an Accessory display from a single card
One of the criticisms of Fermi that Kepler has addressed is that two video cards in SLI were required to run 3-panel Surround or 3D Vision Surround. From a single card, the GTX 670, GTX 680, GTX 660 Ti and now the GTX 660 can run three displays plus an accessory display. Interestingly, Nvidia has moved the taskbar from the left screen to the center screen. We now prefer the taskbar in the center; it might be more convenient for some users than clicking all the way over to the left for the Start menu, as with Eyefinity.
One thing we did notice: Surround and 3D Vision Surround are now just as easy to configure as AMD’s Eyefinity. And AMD has no real answer to 3D Vision or 3D Vision Surround; HD3D lacks basic support in comparison.
One new option with the GTX 660/660 Ti/670/680/690 is bezel correction. In the past, in-game menus would get occluded by the bezels, which was annoying if you used the correction. Now, with Bezel Peek, you can use hotkeys to instantly see the menus hidden by the bezel. However, this editor never uses bezel correction in gaming.
One thing we are still noting: Surround suffers from less tearing than Eyefinity, although AMD appears to be working on a solution with their latest drivers. The only true solution to tearing in Eyefinity is to have all native DisplayPort displays or opt for the much more expensive active adapters. And most HD 7850s will need two adapters to run Eyefinity, whereas you only need one for Surround with the GTX 660, GTX 660 Ti, GTX 670 and the GTX 680.
Nvidia also claims a faster experience with custom resolutions because of faster center-display acceleration.
A look at the Galaxy GTX 660 GC
The reference GTX 660 uses a short PCB, especially compared to the GTX 680. With the GeForce GTX 660, Nvidia’s board partners have the option to produce custom GTX 660 boards on launch day. Galaxy uses a full-sized PCB measuring 9.6″ long by 4.37″ wide by 1.51″ thick. Just as the GeForce GTX 670 and GTX 660 Ti were fitted into a smaller form factor, Nvidia made a number of adjustments to the reference 660 board to save space, including moving the power circuitry closer to the GPU.
Display outputs include two dual-link DVIs, one HDMI, and one DisplayPort connector. One 6-pin PCIe power connector is required for operation. If a user fails to connect the power connector properly, a brief message is displayed at boot-up instructing them to plug in the power connector.
On the reference GTX 660, the power circuitry was moved to the other side of the board and the area on the right side of the PCB was removed to save board space. In contrast, the Galaxy GTX 660 features a custom PCB design with the following benefits.
Galaxy custom PCB design features:
- Extended PCB length for better layout and signal path
- Thru-board ventilation near MOSFETs for better component cooling
- High quality shielded inductors maintain cleaner signal and eliminate coil whine
- Improved power handling in Galaxy’s custom PCB design increases energy efficiency and OC potential
Galaxy also uses a “Force Air” bracket to help exhaust hot air out of the case. They use a shielded DVI-I port while reference cards do not, meaning that the analog signal will be cleaner when using the included adapter. There is also more to the custom cooling: the fan blades themselves have a special aerodynamic/acoustic design that further reduces noise output. Finally, the PCB uses a 6-phase power design versus the 5-phase design on reference cards, for higher power output and more stable current.
SLI
The GTX 660 is set up for 2-way SLI using two GTX 660s. We hope to bring you a follow-up evaluation comparing GTX 660 SLI performance scaling against a single GTX 660 Ti and perhaps against a single GTX 680. We received our second EVGA GTX 660 Super Overclock on Monday evening, too late to do any SLI benching, although we did install the pair into our case.
Super-Widescreen 5760x1080, Surround, 3D Vision Surround, and PhysX
The Galaxy GTX 660 is set up exactly the same way as the more expensive GTX 660 Ti, GTX 670 and GTX 680. Since the GTX 660 is considerably slower than the GTX 670 overall, one can reasonably expect correspondingly lower performance at super-widescreen resolutions as well as in Surround, 3D Vision Surround and with PhysX, compared with our last evaluation of the GTX 670 in May.
For 3D Vision and for Surround, many games need to have their settings reduced. Just remember that you are playing across three screens and are also rendering each scene twice for 3D Vision! And although turning on PhysX on a GTX 660 affects the frame rate, the card is still fast enough to play with fully maxed-out details and FXAA or AAA, unlike the GTX 460 it replaces.
Overclocking
Our Galaxy GTX 660 GC edition is already overclocked +26MHz over the Nvidia reference clocks. We were able to overclock a further +40MHz with complete stability, even though we did not adjust the voltage or our fan profile. We also managed +170MHz on the memory clocks, which is lower than the GTX 660 Ti’s +190MHz and considerably lower than the +400MHz we managed on the GTX 670 and the +550MHz on the GTX 680 and the GTX 690.
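Adding up the offsets (our own arithmetic, using the reference and factory clocks quoted earlier) gives the core clock we actually ran:

```python
# Tally of the core clock offsets discussed above.
reference_base_mhz = 980     # Nvidia reference GTX 660 base clock
galaxy_gc_offset = 26        # Galaxy GC factory overclock
manual_offset = 40           # the further offset we applied
memory_offset = 170          # our memory overclock on the 6008MHz data rate

print("Base clock as tested:", reference_base_mhz + galaxy_gc_offset + manual_offset, "MHz")
print("Memory offset:", memory_offset, "MHz")
```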
Even when overclocked further, temperatures generally stayed below 60C and the fan rarely exceeded 40%. The Galaxy GTX 660 GC is a quiet card.
The GTX 650
Based on Nvidia’s “GK107” GPU, the GeForce GTX 650 is for entry-level GeForce GTX gaming. The reference GTX 650 has 384 CUDA cores running at 1058MHz, while its TDP is just 64 watts. It can draw its entire power from the PCIe slot, although it ships with a PCIe power connector for overclocking. And with up to 2GB of 5GHz GDDR5 memory, the GeForce GTX 650 has the ability to play the latest DX11 games at 1080p HD resolution with reasonable detail settings.
The GeForce GTX 650 ships with 2 SMX units containing 384 CUDA cores and 32 texture units. The memory subsystem of the GeForce GTX 650 consists of two 64-bit memory controllers (128-bit) with either 1GB or 2GB of GDDR5 memory. The graphics core clock speed of the GeForce GTX 650 is 1058MHz; GPU Boost is not available. The GeForce GTX 650’s memory speed is a 5000MHz data rate.
Basically, the GTX 650 is a GT 640 with higher clock speeds and GDDR5 instead of DDR3. Here is the GTX 650’s specification chart:
With a TDP of just 64 watts, the GeForce GTX 650 draws very little power, yet it ships with an external power connector. This power connector provides additional headroom for overclocking. Nvidia’s board partners will offer GeForce GTX 650 OC SKUs at a variety of speeds at launch. Nvidia claims that many GeForce GTX 650 boards are capable of hitting speeds in excess of 1200MHz, and we are looking forward to testing this for ourselves versus our two overclocked HD 7770s and our HD 7750, which generally overclock well.
The idle power of the GeForce GTX 650 is ~5W and HD video playback is ~13W, again representing excellent power conservation. The GeForce GTX 650 reference board measures 5.7″ in length. Display outputs include two dual-link DVIs and one mini-HDMI. One 6-pin PCIe power connector is required for operation. We hope to have a GTX 650 evaluation published next week.
Check out the performance summary charts and particularly the overclocking charts to note how well the Galaxy GTX 660 scales. Besides overclocking it further than the Galaxy clocks, we also underclocked it to GTX 660 reference clocks. The specifications look good, with solid improvements over the Fermi-based EVGA GTX 460 FTW. Let’s check out performance after unboxing our Galaxy GTX 660 GC. Head to the next page for the unboxing and then to the test configuration.