AMD’s Carrizo architecture detailed and explored

AMD has taken the wraps off its Carrizo APU, its next-generation mobile architecture and the final iteration of the Bulldozer architecture that debuted in 2011. Overall, Carrizo is in an odd position. On the one hand, it represents the pinnacle of AMD’s long effort to improve the Bulldozer architecture’s efficiency and reduce its power consumption. It incorporates a number of genuine firsts for AMD, including the company’s first implementation of Adaptive Voltage and Frequency Scaling (as opposed to DVFS, the technology that Intel uses) and its first hardware High Efficiency Video Coding (HEVC) decoder. It is also the first APU to use the color compression and bandwidth-saving technologies that AMD debuted with its stand-alone Tonga / R9 285 last fall, the first “big-core” SoC that AMD has ever built, and the first part the company has designed with High Density Libraries for both the CPU and GPU.

On the other hand, it’s the last of its line, and it ships even as attention has turned towards AMD’s next-generation Zen architecture. We haven’t had the opportunity to test shipping hardware yet, but AMD’s performance claims for its new hardware are significant. If the chip performs as claimed, it could offer stronger competition for low-power laptop wins than AMD has ever fielded before.


Carrizo: Under the hood

We’ve already covered some of Carrizo’s power optimization features and new capabilities in previous articles, so we’ll focus here on the details that AMD hasn’t previously disclosed, including the specifics of certain core improvements.

As we previously reported, L1 data caches on Excavator are doubling in size to 32KB, up from 16KB. The L1 data caches on Excavator and previous Bulldozer-class processors are unique to each core (the L1 instruction cache is shared between a pair of cores). AMD claims to have kept cache latency identical despite doubling the cache’s size, while also improving the CPU’s prefetch mechanisms.

It’s not clear how much absolute power AMD saved by improving the L1 cache, but cache memory tends to be quite power hungry. Each level of cache is designed to make different tradeoffs between cost, access latency, and power consumption — L1 cache, which sits closest to the CPU, likely draws the most power per bit of storage.

Branch prediction has also been improved, with larger branch target buffers that improve branch predictor efficiency. There’s also support for new instructions and features including, most notably, AVX2. The MOVBE and BMI2 instructions and the SMEP protection feature are more specialized and deal with specific capabilities or bit manipulation. Performance in general-purpose applications isn’t likely to change much from these capabilities, though there’s a chance that AVX2 could deliver its own performance gains.
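For readers who want to check whether a given machine exposes these capabilities, here is a minimal sketch for Linux. It assumes the kernel's usual lowercase flag names in /proc/cpuinfo (avx2, bmi2, movbe, smep). The helper names parse_flags and report_features are ours, for illustration only.

```python
# Feature names as the Linux kernel lists them in /proc/cpuinfo (assumed).
CARRIZO_FEATURES = ("avx2", "bmi2", "movbe", "smep")


def parse_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no "flags" line found


def report_features(cpuinfo_text, wanted=CARRIZO_FEATURES):
    """Map each wanted feature name to True (present) or False (absent)."""
    flags = parse_flags(cpuinfo_text)
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or the file is unreadable
    for name, present in report_features(text).items():
        print(f"{name}: {'yes' if present else 'no'}")
```

A flag being present only says the CPU and kernel advertise the feature. Whether an application benefits depends on whether it was built to use it.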

The new APU also supports what Microsoft is now calling Modern Standby (old title: Connected Standby). Intel had a great deal of trouble shipping 64-bit Connected Standby drivers for Windows 8 / 8.1, which is why certain systems remained on 32-bit versions of the OS for quite some time. In contrast, AMD told us that it prioritized 64-bit support for Modern Standby in Windows 10 and that it expects the OS to support the mode on its hardware out of the box.


Video playback and power consumption

The CPU, of course, is just half the equation. AMD has made a number of low-level improvements to the GPU side of Carrizo as well, including several changes that should significantly improve overall battery life. Previous AMD chips used the GPU to scale and process images during video playback, but this is a relatively inefficient use of that silicon. By rearchitecting the video path, AMD was able to substantially cut power consumption.

According to AMD’s footnotes, a 15W Carrizo using Metro Video Player to play back a 1920×1080 video draws 2.43W without underlay support and 1.90W with underlay. A 19W Kaveri 3GHz reference design draws 4.8W to perform the same task (underlay support is not available on Kaveri). Thus, even in the worst case, without an underlay, Carrizo should draw roughly half the power of its predecessor. The graph in the upper-right hand corner shows where AMD draws its improvements from: some of them are architectural, some come from Carrizo’s SoC design, and some use the aforementioned underlay support.
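The "half the power" comparison is simple arithmetic on the three wattage figures quoted from AMD's footnotes. A short sketch makes it explicit; the constant and function names are ours, and the inputs are AMD's claims rather than our own measurements.

```python
# Playback power figures from AMD's footnotes, in watts (1080p video).
KAVERI_W = 4.8
CARRIZO_NO_UNDERLAY_W = 2.43
CARRIZO_UNDERLAY_W = 1.90


def reduction_pct(baseline_w, new_w):
    """Percentage reduction in power draw relative to the baseline."""
    return (baseline_w - new_w) / baseline_w * 100


for label, watts in [("without underlay", CARRIZO_NO_UNDERLAY_W),
                     ("with underlay", CARRIZO_UNDERLAY_W)]:
    print(f"Carrizo {label}: {watts / KAVERI_W:.0%} of Kaveri's draw, "
          f"{reduction_pct(KAVERI_W, watts):.1f}% lower")
```

On these figures, Carrizo draws about 49% less than Kaveri without underlay and about 60% less with it.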

Next up, there’s the improved Unified Video Decoder, or UVD. Carrizo is the first AMD APU to offer support for H.265/HEVC decoding in hardware. 4K decode for H.264 is also supported, along with 4K MJPEG. What’s more impressive about the UVD unit is that it’s been redesigned for higher bandwidth and much improved power gating.

According to AMD, adding dynamic power gating to the video decoder can improve battery life during video playback by up to 30 minutes. The result of these improvements is a platform power reduction of 50% for a 15W Carrizo compared with a 19W Kaveri reference platform.

Given the importance of online video, whether that means YouTube, Twitch, or Netflix, AMD’s focus on reducing its platform power consumption makes perfect sense.


GCN improvements

Carrizo’s GCN implementation is a third-generation core that incorporates the improvements that AMD first introduced with Tonga, aka the R9 285. Carrizo is also the first AMD APU to implement the GPU compute pre-emption, context switching, and quality of service features that AMD first laid out in its HSA roadmaps years ago.

Exactly what impact these features will have on APU performance in HSA workloads is still unclear. When we reviewed Kaveri, we found that AMD made a number of additional improvements to access latencies and compute workload performance that it hadn’t initially disclosed. If Carrizo further builds on these gains, heterogeneous workload performance should be quite impressive. AMD alludes to such performance gains indirectly, but didn’t offer much detail.

AMD didn’t just do a straight port of Tonga over to an APU. It redesigned the chip to use lower-power transistors, and it implemented separate voltage islands for each area of the SoC. The result is a GPU that can use all eight CUs, even in 15W power envelopes, while hitting graphics performance targets as much as 60% higher than Kaveri within the same power envelope.

The Futuremark results quoted here don’t necessarily tell us what to expect from real-world gaming performance, but they certainly point in a positive direction. According to AMD’s footnotes, Carrizo’s delta color compression improves performance by an average of 5-7% when measured in 24 games at 1920×1080.

The actual Futuremark 3DMark 11 scores for Kaveri and Carrizo (at 19W and 15W, respectively) were 1220 (Kaveri) vs. 1581 (Carrizo) with Skin Temperature Aware Power Management (STAPM) enabled and 1220 vs. 2120 with STAPM disabled. At the 35W power envelope for both chips, Kaveri scored 2184 in 3DMark 11 vs. 2500 for Carrizo. Total gains in the low-power comparison were roughly 30% to 74% depending on STAPM, compared to a gain of about 14.5% in the 35W envelope.
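The percentage gains follow directly from the quoted scores. A minimal sketch of the calculation is below. The labels and the gain_pct helper are ours, and the scores are AMD's published figures, not independent test results.

```python
# 3DMark 11 scores quoted from AMD's footnotes: (Kaveri, Carrizo).
SCORES = {
    "low-power, STAPM enabled": (1220, 1581),
    "low-power, STAPM disabled": (1220, 2120),
    "35W": (2184, 2500),
}


def gain_pct(old, new):
    """Percentage improvement of the new score over the old one."""
    return (new - old) / old * 100


for label, (kaveri, carrizo) in SCORES.items():
    print(f"{label}: {gain_pct(kaveri, carrizo):.1f}% gain")
```

Computed this way, the gains come to roughly 29.6%, 73.8%, and 14.5%.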


Carrizo’s CPU performance

AMD didn’t hand out a great deal of benchmark data, but the slides it did share are rather interesting. If you’re familiar with the popular Cinebench rendering test, you likely know that the test is anything but flattering for AMD. According to Sunnyvale, Carrizo delivers a huge performance gain in Cinebench at the 15W TDP target, thanks to a mixture of improved IPC and higher frequencies.

The improvements are smaller at 35W because Carrizo isn’t optimized for that power envelope, but the chip still turns in more than 10% improved performance in the single-threaded workload, despite a lower base frequency compared to Kaveri. The multi-threaded version manages gains of ~15%.

AMD doubled up on the decode pipes in its UVD 6 engine, and it’s claiming that this, combined with HSA support in applications like HandBrake, will deliver transcode performance that can outstrip the Core i5-5300U using QuickSync. Again, we’ll have to verify these claims.


Conclusion

Ever since Bulldozer debuted, AMD has been fighting a two-front war to improve its flagship CPU’s performance and power efficiency — all while hemorrhaging market share and launching an initiative designed to promote entirely new methods of computing. Every APU revision from Trinity forward has offered better performance per watt, lower TDPs, and better battery life. Carrizo seems poised to carry that trend forward. That’s a tentative conclusion, to be sure — I want to see shipping hardware before I declare the chip a winner — but AMD has been clear and specific about how it targeted lower power envelopes and the improvements that it made to hit its targets.

Currently, Carrizo systems are expected in-market by the end of June or early July. If the platform delivers on its promise, we’ll see AMD laptops with much better battery life and overall performance characteristics in 19W envelopes that can compete more effectively with Intel’s Core i3 and Core M platforms.
