
GDC 2024 - AMD Ryzen™ Processor Software Optimization

Break through CPU bottlenecks to reach higher frames-per-second and more builds-per-day equipped with AMD Game Engineering's expertise forged through its rich history of partnerships with AAA game developers! Learn about exciting AMD Ryzen™ products featuring advanced processor and graphics technologies powering today's laptop, desktop, and workstation PCs. Dive into data flow, simultaneous multi-threading, resource sharing, instruction set evolution, cache hierarchies, and coherency. Unlock powerful profiling tools and application analysis techniques using the Windows Performance Analyzer, Concurrency Visualizer, and AMD uProf. Discover best practices and lessons learned. Upgrade system software and build settings. Attack valuable code optimization opportunities. Examples include C/C++, assembly, and hardware performance-monitoring counters. This talk covers the micro-architecture of modern AMD CPUs, followed by optimizations, frequent issues, and benchmarking best practices to make games run faster!

Download the slides: https://gpuopen.com/gdc-presentations/2024/GDC2024_AMD_Ryzen_Processor_Software_Optimization.pdf

Visit our website: https://gpuopen.com
Follow us on X: https://x.com/GPUOpen
Ryzen Performance Guide: https://gpuopen.com/ryzen-performance

Subscribe: https://bit.ly/Subscribe_to_AMD
Join the AMD Red Team Community: https://www.amd.com/en/gaming/community
Join the AMD Red Team Discord Server: https://discord.gg/amd-red-team
Like us on Facebook: https://bit.ly/AMD_on_Facebook
Follow us on Twitter: https://bit.ly/AMD_On_Twitter
Follow us on Twitch: https://Twitch.tv/AMD
Follow us on LinkedIn: https://bit.ly/AMD_on_Linkedin
Follow us on Instagram: https://bit.ly/AMD_on_Instagram

©2024 Advanced Micro Devices, Inc. AMD, the AMD Arrow Logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

AMD


Hello! My name is Ken Mitchell, and I am thrilled to present the following slides on AMD Ryzen™ processors, which power today's game consoles and PCs. Let's get started! Today's agenda includes: abstract, speaker biography, products, data flow, microarchitecture, best practices, and optimizations.

Abstract. Break through CPU bottlenecks to reach higher frames-per-second and more builds-per-day, equipped with AMD Game Engineering's knowledge, forged through its rich history of partnerships with AAA game developers! Learn about exciting AMD Ryzen™ products featuring advanced processor and graphics technologies powering today's laptop, desktop, and workstation PCs. Dive into data flow, simultaneous multi-threading, resource sharing, instruction set evolution, cache hierarchies, and coherency. Unlock powerful profiling tools and application analysis techniques using the Windows Performance Analyzer, Concurrency Visualizer, and AMD μProf. Discover best practices and lessons learned. Upgrade system software and build settings. Attack valuable code optimization opportunities. Examples include C, C++, assembly, and hardware performance monitoring counters.

Hello again! My name is still Ken Mitchell. I am a Fellow and Technical Lead in the AMD Software Performance Engineering team, where I collaborate with Microsoft® Windows® and AMD engineers to optimize AMD processors for better performance-per-watt. In my previous roles, I worked on optimizing PC games for AMD processors, analyzing PC applications for future product performance projections, and developing system benchmarks. And by the way, just in case you were wondering, this is an image of me holding a metal armadillo.

Products. Today's presentation will focus on AMD "Zen 4" based processors. Fortunately, many of our recommendations also apply to products with previous CPU architectures. Here are some AMD former code name examples for mobile, desktop, and workstation form factors. Not all former code names for each CPU architecture are shown. Some code name families span multiple form factors. For example, "Phoenix" was designed primarily as a mobile product. However, "Phoenix" also scales up to the Ryzen™ 8000G series desktop products. The AMD Ryzen™ 8040 series mobile processors, formerly codenamed "Hawk Point", are built on a cutting-edge 4 nm process node, delivering up to 14.5 hours of video playback. Not only do these processors include powerful "Zen 4" cores, but they also include the highest-performing integrated graphics you can get, thanks to RDNA3. Better yet, they support Microsoft® Windows® Studio Effects powered by AMD Ryzen™ AI, our neural processing unit. AMD is committed to years of future support for its socket AM5 platform, packed with state-of-the-art technologies such as high-speed DDR5 memory, PCIe Gen 5 support, and AMD EXPO one-touch memory overclocking. These desktop processors, formerly codenamed "Raphael", are also powered by the "Zen 4" core architecture. Processors named with the "X3D" suffix feature exclusive AMD 3D V-Cache™ technology for a massive gaming performance boost. Additionally, Eco Mode and 65-watt low-power models have leadership efficiency. Design, build, accelerate on the ultimate workstation processor! Formerly codenamed "Storm Peak", AMD Ryzen™ Threadripper Pro 7000 series "Zen 4" processors deliver battle-tested performance and capability that enable artists, architects, and engineers to get more done in less time. The TRX50 platform boasts an impressive capacity of up to 4 memory channels and 48 PCIe Gen 5 lanes, while the WRX90 platform takes it to the next level with up to 8 memory channels and 128 PCIe Gen 5 lanes. Wow! That is a lot of IO!

Dataflow. First, this slide shows an abstract diagram of a "Zen 4" based AMD Ryzen™ 9 8945HS mobile processor, formerly codenamed "Hawk Point", a member of the "Phoenix" family. Imagine data moving from DRAM on the far right through the data fabric, caches, and cores on the
left. Each unified memory controller manages 32 bits of DRAM. DDR5-5600 and LPDDR5-7500 are supported. In this monolithic-die SoC design, the orange-colored complex L3 cache connects to the blue-colored data fabric at 32 bytes per cycle read and 32 bytes per cycle write. RDNA3-based integrated graphics, a multimedia hub, and a neural processing unit are present in this mobile processor. Some product configurations may feature up to 8.6 teraflops of single-precision FMAC graphics performance at
120 GBps memory bandwidth, and up to 16 TOPS of int8 NPU performance. Next is a "Zen 4" based AMD Ryzen™ 9 7950X desktop processor, formerly codenamed "Raphael-AM5". Each core complex die in this chiplet design has a unified 8-core cluster with a massive 32 MB L3 cache. Each unified memory controller manages 32 bits of DRAM. A DIMM is composed of two 32-bit sub-channels. DDR5-5200 JEDEC speeds are supported. Faster memory speeds are possible using AMD EXPO technology. Some product configurations
may feature up to 1.1 teraflops of single-precision FMA integrated graphics performance at 83 GBps JEDEC memory bandwidth. Thus, the included RDNA2-based integrated graphics are for basic desktop and office use. We recommend a discrete graphics card for gaming. Finally, we have a "Zen 4" based AMD Ryzen™ Threadripper Pro 7995WX with 96 cores and 192 logical processors, formerly codenamed "Storm Peak". Thanks to the new WRX90 platform, this product has up to 8 channels of DDR5 memory and 128 lanes of PCIe Gen 5! You can add many GPUs, NVMe drives, and high-speed networking devices to this battle station! Each unified memory controller manages 32 bits of DRAM. A DIMM is composed of two 32-bit sub-channels. Registered DDR5-5200 JEDEC speeds are supported. Again, faster memory speeds are possible using AMD EXPO technology. Note, registered DDR5 provides more memory bandwidth than registered DDR4, so it can feed more cores. This topology, with up to 12 CCDs, is so large that I have shown it in quadrants for simplicity! This product requires a discrete GPU.

Microarchitecture. Advances in the "Zen 4" microarchitecture include: a 13% IPC improvement using typical desktop applications; a larger op cache to deliver more ops per cycle; a larger L2 cache; an improved load/store unit with a larger load queue; branch prediction improvements with larger branch target buffers; and AVX512 instruction support using a 256-bit data path. To improve instruction throughput, the processor implements Simultaneous Multi-Threading (SMT). Single-threaded applications do not always occupy all the resources of the processor. The processor can take advantage of the unused resources to execute a second thread concurrently. Although each thread has a program counter and architectural register set, core resources may be shared while operating in two-threaded mode. The core is in two-threaded mode while its two logical processors execute program threads. If either of the core's logical processors executes the operating system's idle thread, it may return
to single-threaded mode. Disabling SMT in system BIOS menu options will reduce the number of logical processors in the system and limit cores to operate only in single-threaded mode. Resources such as queue entries, caches, pipelines, and execution units can be competitively shared, watermarked, or statically partitioned while in two-threaded mode. These categories are defined as follows. Competitively shared: resource entries are assigned on demand; a thread may use all resource entries. Watermarked: resource entries are assigned on demand when in two-threaded mode; a thread may not use more resource entries than are specified by a watermark threshold. Statically partitioned: resource entries are partitioned when entering two-threaded mode; a thread may not use more resource entries than are available in its partition; resource entries are unpartitioned when exiting two-threaded mode. Caches and TLBs are competitively shared for "Zen 2", "Zen 3", and "Zen 4". For "Zen 3", the integer
scheduler, integer register file, and load queue changed from competitively shared to watermarked. For "Zen 4", the floating-point physical register file has also been changed from competitively shared to watermarked. These changes improved SMT fairness. A title's recommended-CPU and minimum-CPU system requirements are often from different product generations, with differences in SMT fairness, structure sizes, etc. Consider profiling both recommended and minimum system requirements.

"Zen 4" added GFNI and AVX512 support. More details are on the next slide. Please note, AMX instructions are not supported. "Zen 4" AVX512 support includes many ISA extensions which may benefit workloads such as light baking, texture compression, and neural networks. AVX512_FP16 is not supported. AVXVNNI, which uses VEX rather than EVEX encoding, is also not supported. Use software prefetch instructions on linked data structures experiencing cache misses. Use NTA on use-once data.
While in two-threaded mode, beware that too many software prefetches may evict the other thread's working set from the shared L1 data and L2 caches. Prefetch T0 and NTA fill into the L1 data cache. Prefetch T1 and T2 fill into the L2 cache, a new feature for "Zen 4". For previous generations of "Zen" processors, prefetch T1 and T2 filled into the L1 data cache. Cache details, including ways, associativity, inclusion policy, and write policy, are described in the AMD Software Optimization Guides.

Designing data structures that match hardware prefetcher access patterns may improve performance. The L1 data cache stream and stride prefetchers are my favorites. Stream prefetchers may fetch additional sequential lines in ascending or descending order. Stride prefetchers may fetch additional lines when each access is a constant distance from the previous one. The L2 cache also has a stream prefetcher. Additionally, the up-down prefetcher may fetch the next or previous line.
This simple example may trigger a streaming hardware prefetcher as it iterates the contiguous array. Similarly, iterating a std::vector may also trigger this prefetcher. In our next code sample, two strides are detected. Again, iterating a std::vector may also trigger this prefetcher. However, the stream and stride code snippets may not trigger a hardware prefetcher if they instead iterated a linked data structure with nodes scattered randomly across memory addresses.

The cache hierarchy has evolved considerably since "Zen 1" was introduced, especially for AMD desktop SoCs. "Zen 2" increased the L3 cache size and micro-op cache size. "Zen 3" further increased the L3 cache size. "Zen 4" grew the micro-op cache and L2 cache sizes. Some "Zen 3" and "Zen 4" products feature "X3D" technology for larger L3 caches, which are great for gaming. For example, the Ryzen™ 7 7800X3D 8-core processor has a single 96 MB L3 cache! The AMD cache coherency protocol is MOESI (Modified, Owned, Exclusive, Shared, Invalid). Instruction execution, speculative
execution, prefetching, and external bus transactions may change a cache line's MOESI state. Read hits do not cause a MOESI state change. However, write hits generally cause a MOESI state change into the modified state. If the cache line is already in the modified state, a write hit does not change its state. The AMD "Zen 4" microarchitecture implements a large L3 cache shared by up to eight cores inside each CPU complex, abbreviated as CCX. The L3 cache maintains shadow tags for each L2 cache within its complex. Shadow tags determine if a fast cache-to-cache transfer between cores within the CCX is possible. Cache coherency probe responses may be slower from cores within another CCX. Two CCXs are shown in this example. For "Storm Peak", imagine up to 12 CCXs attached to the data fabric.

Minimize ping-ponging modified cache lines between cores, especially cores in another CCX. Here are a few tips. One, minimize using read-modify-write instructions: use a single atomic add with a local sum rather than many atomic increment operations. Two, improve lock efficiency: use test and test-and-set in user spinlocks with a pause instruction (note, this is especially important for "Zen 1" based processors), or replace user spinlocks entirely with modern sync APIs. And three, use a memory allocator optimized for multithreading. The memory manager is a big repeat offender for contention on locking primitives. Try mimalloc or jemalloc. Some AMD products have cores that are faster than other cores.
AMD calls this feature preferred core. The system BIOS may describe the "ACPI CPPC Highest Performance" ranking for each logical processor. These values are the basis for the GetSystemCpuSetInformation function's SchedulingClass values in Windows® 10 and later for some AMD products. Windows® may use SchedulingClass during thread scheduling. SchedulingClass values may change during runtime. Logical processor 0 may not be the fastest core. CCD0 may not contain the fastest core. Thread affinity masks may interfere with thread scheduling and power management optimizations on Windows PCs. For these reasons, I typically recommend not setting process affinity or thread affinity masks in PC applications.

Best Practices. While CPU profiling, prefer Shipping or Test configuration builds over Development and Debug configuration builds. It is important to keep in mind that Development builds may significantly reduce performance and give rise to false alarms that may waste your time. Additionally
, collecting stats may pollute the cache, leading you to investigate cache issues in the wrong places. Logging can also create serialization points. Moreover, many Debug builds may disable multi-threading optimizations, which can further impact performance. During the investigation of open issues, developers may request changes that enable Debug features on Shipping and Test configurations. However, it is critical to disable Debug features before you ship your software. Anti-tamper and anti-cheat technologies may prevent CPU debugging and profiling tools from working correctly, especially while loading and retrieving symbol information. Consequently, we recommend creating a CPU-profiling-friendly build configuration similar to the Shipping configuration, but with anti-tamper and anti-cheat technologies disabled. Add this build as a launch option during development. Remove this build before release.

It is important to test the cold-shader-cache First Time User Experience. If the application has a shader cache, make sure to clear it. Remember that the end user might not run the same scene repeatedly the way developers do during debugging. The example provided clears the shader caches for Microsoft, AMD, and NVIDIA. After running the script, it is recommended to reboot the system to ensure that any remaining shaders are cleared from memory. Keep in mind that applications and games can have different configurations of shader caches on disk, leading to varying results. Additionally, the GPU vendor and driver versions used can also affect the outcomes.

Use the latest compiler in the Windows® SDK. This ensures you get the latest build-time and link-time improvements. Rebuilding UE4.27 is much faster on Visual Studio 2022 and 2019 compared to Visual Studio 2017. Some developers have experienced larger benefits than those shown. Also, ensure you are using the latest C runtime optimizations, especially for memcpy and memset. Some Visual Studio 2022 updates improved indexing performance and
vectorization. Windows® Defender scans can greatly slow down some workflows. Windows® 11 enabled the Windows Defender sandbox feature by default, which greatly improved security but also increased file copy and file compression times. Add project folders to the virus and threat protection settings exclusions for faster build times. This system showed a 20% reduction in build time after adding folder exclusions. That said, WARNING: this recommendation is for personal development systems, not for continuous integration and continuous deployment systems. Putting all this together, using the latest compiler and SDK with virus and threat exclusions configured on your local build project folder, you can greatly reduce the time spent waiting on builds each day.

A binary may have better code generation using AVX or later ISAs via the Microsoft Visual C++ compiler option /arch:[AVX|AVX2|AVX512]. The minimum hardware requirements for Windows® 10 include SSE2. For Windows® 11, it's only SSE4.1. The Windows® 10 supported processor list includes AMD products which support AVX but not AVX2. The Windows® 10 supported processor list may also include products from other CPU vendors which do not support AVX. Enable AVX512 in development tools for tasks such as light baking, texture compression, and mesh-to-signed-distance-field conversion. We observed a 17% performance increase thanks to AVX512 while using the Intel Embree path tracer with ISPC. Audit content. Ask artists to recommend scenes of interest for profiling. For
example, an indoor dungeon with heavy occlusion, an outdoor city, an outdoor forest with alpha transparency, large crowds, or a specific time of day. Unreal Engine developers may find some performance issues simply by running MapCheck, especially issues related to actor shadows. Unity developers may enforce minimum standards using the AssetPostprocessor. Check stats before CPU profiling. If a scene far exceeds its draw budget or has many duplicate objects, especially duplicate physics objects, report the issue to its artists and consider profiling a different scene. Otherwise, you risk profiling hotspots which will not be hot after the art issues are resolved.

Additional considerations may be necessary to ensure the expected GPU is utilized on hybrid graphics platforms. The Windows® 10 Spring 2018 update added the EnumAdapterByGpuPreference function. Use DXGI_GPU_PREFERENCE_HIGH_PERFORMANCE for game applications. The user may change preferences per application in the Windows graphics settings. Prefer H.264 video and AAC audio codecs, as recommended by the Unreal Engine Electra Media Player. Hardware-accelerated codecs may increase hours of battery life and reduce CPU work. Please note, Radeon graphics devices released since 2022 no longer accelerate WMV3 decoding. If you are using WMV3 content, please replace it with H.264 content, which may provide a better user experience.

Optimizations. Sync APIs. Avoid user spinlocks that starve payload work on other ready threads, consume excessive power, and drain laptop batteries. User spinlocks may waste CPU time since the OS scheduler cannot determine whether a spinning thread should yield to another program thread or continue to spin. In this exclusive lock test, user spinlock implementations consumed 100% of a 96-core, 192-logical-processor Threadripper. However, legacy and modern sync APIs consumed only 5% or less to do the same work. Prefer std::mutex, which has good performance and low CPU utilization. Modern sync APIs like std::mutex may leverage AMD's mwaitx instruction. This instruction can efficiently wait on an address or a timeout. Better yet, it can execute at any privilege level. However, legacy sync APIs like WaitForSingleObject may rely on expensive syscall instructions. The syscall instruction invokes an OS system call handler at privilege level 0 from a user-privileged level 3 application. Transitioning between address spaces or privilege domains may require additional OS and hardware work.
Core Isolation Memory Integrity, also known as Virtualization-Based Security Hypervisor-Protected Code Integrity, may sometimes lower performance. Windows® 11 ships with this feature enabled by default to provide increased protection against malware.

Here is the source code for the main function used in the exclusive lock microbenchmark. We will reuse it in the next few slides. Basically, it times how long it takes all threads to execute the callback function. Here's an example of a bad user spinlock. This lock may waste CPU time since the OS scheduler cannot determine whether it should yield to another program thread or continue to spin. One of the most common traits of a bad user spinlock is a while loop that lacks a pause instruction. Such loops can consume an excessive amount of core resources, performing non-payload work and starving the other logical processor that shares the same core resources. Remember, spinlocks can also consume a lot of power. For notebook gaming, this can reduce hours of battery life, make the system uncomfortable to touch, and steal power budget from the GPU. If you decide to use a user spinlock regardless, here are a few tips: test, then test-and-set; add one or more pause instructions; and align the lock variable. Some developers tune the number of pause instructions within the while loop for their target hardware, or use an exponential backoff. But beware, CPU time may be wasted unless the spinning thread is eventually put into a wait state.

WaitForSingleObject with CreateMutex is shown in this example. These two APIs have been around for a very long time. The WaitForSingleObject function checks the current state of the specified object. If the object's state is non-signaled, the calling thread enters the wait state until the object is signaled or the timeout interval elapses. Thus, WaitForSingleObject may not suffer from the OS thread scheduling and core resource sharing issues caused by user spinlocks. Better yet, std::mutex is even faster than WaitForSingleObject. This modern sync API is based on an SRWLock in Microsoft's implementation, which can be verified using the Windows Performance Analyzer. In my testing, std::shared_mutex and EnterCriticalSection also outperformed WaitForSingleObject. Here is a list of some preferred modern sync APIs which may use the efficient mwaitx instruction: std::mutex, AcquireSRWLock, SleepConditionVariable, and EnterCriticalSection. Avoid or minimize functions with syscall instructions, such as WaitForSingleObject and WaitForMultipleObjects.

Windows® ships with the Windows Performance Recorder built in! No additional tools are required for users to collect these logs. Install the Windows Performance Analyzer from the Windows Store to open these files. Generally, WPA is the first tool I use when analyzing a system or workload. It is highly configurable, with excellent filtering and pivoting capabilities. In this example, we see the improved user spinlock test uses all logical processors at the start of the test.
Total CPU usage decreases as threads finish execution. For std::mutex, we see very little CPU usage throughout the exclusive lock test. We can also see there are no other processes consuming significant CPU time. Although not shown in this example, these logs can also be used to analyze thread priority. Because it downloads and caches symbols so quickly, I use WPA at least once before using other profiling tools. Frequently, I use the Windows Performance Analyzer, the Visual Studio Concurrency Visualizer, AMD uProf, and WinDbg. Warning! The observer effect from profiling tools increases as the number of counters and the sampling rate increase. Beware, user spinlocks obstruct synchronization analysis. The Visual Studio Concurrency Visualizer is unable to determine the unblocking stack in our improved user spinlock example. This can hide other multi-threading performance issues, such as task granularity or load balancing. Fortunately, the Visual Studio Concurrency Visualizer can determine the unblocking stack in our
std::mutex example. Although the current stack is in user code, the unblocking stack still does a syscall into ntoskrnl. More importantly, we can now see worker threads spend most of their time blocked rather than executing. We can also see a waterfall of serial work. These issues may have gone unnoticed had we used a user spinlock rather than synchronization APIs.

Threading. This advice is specific to PCs with AMD processors and is not general guidance for all processor vendors. Profile your game to determine the optimal thread pool size for both game initialization and gameplay. Utilizing all logical processors in SMT two-threaded mode may benefit game initialization, including decompressing assets and compiling and warming shaders. However, SMT and cache contention on the main render threads may lower performance during gameplay. Tuning the thread pool size based on the number of physical cores may reduce this contention and improve performance. For gameplay, we recommend using the physical core count on systems with at least eight Ryzen™ CPU cores. This recommendation is based on our experience and is subject to change. Thanks to this optimization, some games increase frame rates by 5 to 10 percent. Results may vary. A code sample for detecting core counts is available at gpuopen.com.

Avoid hard affinity masks on PC. Hard affinity masks specify the only logical processors allowed to run a thread. Typically, these masks interfere with OS power management and thread scheduling optimizations, especially on notebook and heterogeneous systems. Restricting where a thread can be scheduled can harm performance when other PC applications are running, such as browsers, media players, system monitoring tools, and RGB software. These masks can also reduce hours of battery life. APIs using hard affinity masks include the Windows® XP function SetThreadAffinityMask and the Windows® 7 function SetThreadGroupAffinity. However, CPU Sets provide APIs to declare application affinity in a soft manner that is compatible with OS power management. APIs using soft-affinity CPU Sets include the Windows® 10 function SetThreadSelectedCpuSets and the Windows® 11 function SetThreadSelectedCpuSetMasks. Avoiding hard affinity masks on PC may improve performance and hours of battery life while gaming.

Thread priority describes the order in which threads are scheduled. Each thread has a dynamic priority. The system boosts the dynamic priority under certain conditions, such as foreground window changes, user input, timer messages, and satisfied wait conditions. My colleagues have observed cases where temporarily priority-boosted threads switched in before threads the developer intended to be higher priority. In these cases, the user experience improved by disabling priority boost. The SetProcessPriorityBoost and SetThreadPriorityBoost functions can be used for this purpose.

Data Access. Update your compiler for the latest memcpy, memset, and other C runtime optimizations.
The behavior of memcpy is undefined if the destination and source overlap. However, the compiler may generate rep-move-string instructions, which have defined overlapping behavior. alignas(64) may allow for faster rep-move-string microcode. alignas(4096) may reduce store-to-load conflicts. The processor uses linear address bits 0 through 11 to determine store-to-load-forwarding eligibility. The StliOther event in AMD uProf counts store-to-load conflicts where a load was unable to complete due to a non-forwardable conflict with an older store. Additionally, alignas(4096) may benefit probe filtering on AMD Threadripper™ and EPYC™ processors. Finally, aligning to the bit_floor of the size, clamped between 4 bytes and 4096 bytes, may provide a good balance of cache hits and alignment.

False sharing may occur when two or more cores modify different data within the same cache line. Finding and fixing false sharing issues may have great performance benefits. Common solutions to false sharing in multi-threaded applications include alignas(64), using a local variable when possible, and processing a range rather than a single element. This microbenchmark showed its execution time reduced by about 90% after optimization using alignas(64). In this example, each thread accesses a single ThreadData element within an array. Before the optimization, the array was compact. But unfortunately, different threads frequently modified the same cache line while using their ThreadData in the callback function. Simply padding each ThreadData using alignas of the native cache line size resolved this issue. If and only if Virtualization-Based Security is disabled, AMD uProf offers a cache analysis profiling feature. Using this tool, we can see the top function hotspots sorted by L1 data cache miss latency. Before our optimization, our function fn is at the top of this list by many orders of magnitude. uProf can determine the top shared cache lines, including their cache line address, offset, thread, and function. Before the false sharing optimization, we see several threads doing loads and stores to different offsets within the same cache line address. We can double-click on the function name to open the source code view for a deeper investigation. The sources view shows C++ source lines and disassembly of interest. Often, modifying shared cache lines results in high L1 data cache miss latency. The disassembly shows this latency is high at the add instruction before optimization. After our optimization, our function fn is no longer at the top of this list.

Use software prefetch instructions for linked data experiencing high cache misses. In this example, performance improved over 60% using software prefetch optimizations. Results may vary significantly on systems with different cache sizes or different memory latency. The NVIDIA PhysX Kapla demo iterates a std::vector of pointers. Consequently, the stream and stride hardware prefetchers are of little help, due to the pointer chase. Before the optimization, many of the data accesses in the innermost loop miss the 4 MB last-level cache of the AMD Ryzen™ 7 4700G. Since the Convex class member data is not public, the offsetof keyword cannot be used, and thus literal offsets are shown. Although not illustrated in this example, additional performance may be possible by reordering the hot member data to occupy only one cache line rather than four.
upper 128 bits of the YMM registers contain non-zero data. Transitioning in either direction causes a micro-fault to spill or fill the upper 128 bits of all 16 YMM registers, with an approximately 100-cycle penalty to signal and handle this fault. To avoid this penalty, a VZEROUPPER or VZEROALL instruction should be used to clear the upper 128 bits of all YMM registers when transitioning from AVX code to SSE or unknown code. In this example, benchmark execution time was reduced by over 60% after a VZeroUpper optimization. Thanks to its microarchitecture improvements, “Zen 4” does not have this penalty.

Use the Floating Point Dispatch Faults PMC event to find code which may be missing VZeroUpper or VZeroAll instructions during AVX-to-SSE and SSE-to-AVX transitions. In AMD uProf 4.2, this event is simply called SSE_AVX_STALLS. Here are three simple optimizations which may mitigate this performance issue. First, use the /arch:AVX compiler flag. AVX is supported by 97% of users according to the January 2024 Steam Hardware and Software Survey; however, this may not be a viable option for software with very old SSE2 minimum hardware requirements. Second, return a __m256 value using pass-by-reference in the function parameter list rather than as the function return type. And third, use __forceinline on the function definition. Any of these changes may allow compiler optimizations that reduce Floating Point Dispatch Faults.

Here is an excellent example of the Floating Point Dispatch Faults issue, encountered in the MeshToSDF benchmark. Both the before and after versions are compiled using the default /arch:SSE2 flag. Before the optimization, the function return type is a __m256 value; after the optimization, the __m256 value is returned using pass-by-reference in the function parameter list. Using AMD uProf, we can quickly find the hot line of source code and its assembly. In this example, the “SSE_AVX_STALLS” per-thousand-cycles column is greater than zero! Looking at the assembly, we can see a Variable Blend Packed Single-Precision AVX instruction soon followed by a Move Aligned Packed Single-Precision SSE instruction, where the upper 128 bits have not been cleared by a VZeroUpper or VZeroAll instruction. This needs optimization! After the optimization, using AMD uProf again, we can find the same line of source code and its assembly, and quickly observe that Floating Point Dispatch Faults per thousand cycles is zero! Awesome! The compiler has inserted a VZeroUpper instruction at the AVX-to-SSE2 transition. This performance issue is resolved!

Do you want to know more? Discover your best graphics performance by using our #opensource effects, SDKs, tools, and tutorials at GPUOpen.com. Software optimization guides for “Zen 1”, “Zen 2”, “Zen 3”, and “Zen 4” are available in the AMD Documentation Hub at amd.com. Got questions? Email us! We’d love to hear from you! This is John. This is me. Those are our email addresses. Design faster. Render faster. Iterate faster. We can’t wait to see what you create!
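The false sharing fix described in the talk can be sketched as follows. This is a minimal illustration, not the actual sources from the presentation: the struct and member names are assumptions, and 64 bytes is the typical line size on current x86 CPUs (C++17 also provides std::hardware_destructive_interference_size as a portable hint).

```cpp
#include <cstddef>

// Sketch of padding each ThreadData to the native cache line size.
// Without alignas, per-thread counters packed into one array can land on
// the same 64-byte cache line, which then ping-pongs between cores as
// different threads write different offsets of the same line.
constexpr std::size_t kCacheLineSize = 64;  // native line size on current x86

struct alignas(kCacheLineSize) ThreadData {
    unsigned long long counter = 0;  // hot field, written by one thread only
    // alignas both aligns and pads the struct, so each element of a
    // ThreadData array occupies its own cache line(s).
};

static_assert(alignof(ThreadData) == kCacheLineSize, "line aligned");
static_assert(sizeof(ThreadData) % kCacheLineSize == 0, "line padded");
```

With this change, each thread's loads and stores stay within a private cache line, which is what removes fn from the top of uProf's L1 data cache miss latency list.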
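The software prefetch idea for pointer-chasing loops can be sketched as below. The names (Convex, sumVolumes) and the prefetch distance are illustrative assumptions, not the PhysX Kapla sources: while working on element i, we prefetch the object that element i+DIST points to, so part of the dependent load's miss latency is hidden behind useful work.

```cpp
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
#include <vector>

// Hypothetical stand-in for a class whose hot member misses the LLC.
struct Convex {
    float volume;
    char  padding[124];  // spread objects across multiple cache lines
};

float sumVolumes(const std::vector<Convex*>& objs) {
    constexpr std::size_t DIST = 8;  // tune per cache size / memory latency
    float total = 0.0f;
    for (std::size_t i = 0; i < objs.size(); ++i) {
        // The hardware stream/stride prefetchers cannot follow the pointer
        // chase, so issue a software prefetch for a future element's target.
        if (i + DIST < objs.size())
            _mm_prefetch(reinterpret_cast<const char*>(objs[i + DIST]),
                         _MM_HINT_T0);
        total += objs[i]->volume;  // dependent (pointer-chasing) load
    }
    return total;
}
```

As the talk notes, the best prefetch distance depends on cache sizes and memory latency, so results may vary significantly across systems.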
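The VZeroUpper pattern can be sketched as below. This is a hand-written illustration, not code from the talk: the GCC/Clang target("avx") attribute is used here only so the snippet compiles without global AVX flags (MSVC would use /arch:AVX instead), and within an AVX function the compiler already emits VEX encodings, so the explicit _mm256_zeroupper() matters at the boundary back to SSE or unknown code.

```cpp
#include <immintrin.h>

// Sum eight floats with AVX, then clear the YMM upper halves before
// returning to (potentially) SSE code, avoiding the "Zen 2"/"Zen 3"
// ~100-cycle micro-fault for mixed SSE/AVX execution.
__attribute__((target("avx")))
float sum8_then_sse(const float* p) {
    __m256 v  = _mm256_loadu_ps(p);                 // ymm upper bits now live
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 hi = _mm256_extractf128_ps(v, 1);
    __m128 s  = _mm_add_ps(lo, hi);                 // 8 -> 4 partial sums
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));         // 4 -> 2
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));     // 2 -> 1
    _mm256_zeroupper();  // clear upper 128 bits of all YMM registers
    return _mm_cvtss_f32(s);
}
```

Compilers building with /arch:AVX (or -mavx) insert VZeroUpper automatically at such boundaries, which is why the flag is the first of the three mitigations listed above.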
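The second mitigation, returning a __m256 through a reference parameter instead of the return type, can be sketched as below. The function names and the scaling operation are illustrative assumptions, not the MeshToSDF sources; target("avx") again stands in for building just these functions with AVX under GCC/Clang.

```cpp
#include <immintrin.h>

// Before (sketch): returning __m256 by value from code otherwise built
// for SSE2 can make the compiler move the 256-bit result across an
// SSE/AVX boundary, triggering Floating Point Dispatch Faults.
__attribute__((target("avx")))
__m256 scale8_byvalue(const float* p, float s) {
    return _mm256_mul_ps(_mm256_loadu_ps(p), _mm256_set1_ps(s));
}

// After: the __m256 result comes back through a reference parameter,
// giving the compiler room to keep the value out of the transition.
__attribute__((target("avx")))
void scale8_byref(__m256& out, const float* p, float s) {
    out = _mm256_mul_ps(_mm256_loadu_ps(p), _mm256_set1_ps(s));
}

// Plain-float wrapper so non-AVX callers never touch __m256 directly.
__attribute__((target("avx")))
void scale8(float* out, const float* in, float s) {
    __m256 r;
    scale8_byref(r, in, s);
    _mm256_storeu_ps(out, r);
}
```

The third mitigation, __forceinline on the function definition, works toward the same end: once inlined, the compiler can see the whole AVX region and place a single VZeroUpper at its exit.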
