How to Choose AI Model Quantization Techniques | AI Model Optimization with Intel® Neural Compressor

Learn the fundamentals of AI model quantization. Your application and project have unique requirements, so there is a variety of quantization techniques to choose from. See an overview of each technique, its tradeoffs, and recommended applications.

AI model quantization is one of the most popular ways to optimize models for deployment. Reducing the word lengths of weights and activations reduces the model size and can speed up inference. Learn the first principles of what is required to quantize floating-point models to integer formats, followed by an overview of each of the main model quantization approaches, covering the effort required and the benefits of each. Subsequent videos in this series will cover each technique and how to use it in Intel Neural Compressor.

Intel® Neural Compressor: https://bit.ly/3Nl6pVj
Intel® Neural Compressor GitHub: https://bit.ly/3NlBgkH
Intel® Developer Cloud: https://cloud.intel.com

About the AI Model Optimization with Intel® Neural Compressor series: Learn how to choose and get started with AI model optimization techniques. Get started with examples using Intel® Neural Compressor, which works within PyTorch*, TensorFlow*, and ONNX* Runtime.

About Intel Software: Intel® Developer Zone is committed to empowering and assisting software developers in creating applications for Intel hardware and software products. The Intel Software YouTube channel is an excellent resource for those seeking to enhance their knowledge. Our channel provides the latest news, helpful tips, and engaging product demos from Intel and our numerous industry partners. Our videos cover various topics; you can explore them further by following the links.

Connect with Intel Software:
INTEL SOFTWARE WEBSITE: https://intel.ly/2KeP1hD
INTEL SOFTWARE on FACEBOOK: http://bit.ly/2z8MPFF
INTEL SOFTWARE on TWITTER: http://bit.ly/2zahGSn
INTEL SOFTWARE GITHUB: http://bit.ly/2zaih6z
INTEL DEVELOPER ZONE LINKEDIN: http://bit.ly/2z979qs
INTEL DEVELOPER ZONE INSTAGRAM: http://bit.ly/2z9Xsby
INTEL GAME DEV TWITCH: http://bit.ly/2BkNshu

Powered by oneAPI

#intelsoftware #ai #oneapi

Intel Software

Welcome back to AI Model Optimization with Intel Neural Compressor. Model quantization reduces the word lengths of parameters. Not only does this reduce the model size, allowing for deployment to smaller edge devices, it can also speed up inference by reducing memory bottlenecks and by taking advantage of AI-optimized instruction sets and accelerators. For instance, with shorter word lengths, the AMX technology in 4th Generation Intel Xeon Scalable processors can process more parameters through its tiled multiplies. Of course, reducing word length also reduces the accuracy of the model by some amount. So how do you choose the right model quantization approach for your application's needs?
It helps to start with some first principles of quantization. Most models are developed and trained with 32-bit floating-point parameters. Moving to 16-bit floating-point reduces both the exponent and the mantissa, so the representable precision is more coarse-grained and we lose some of the representable dynamic range. BFloat16 was developed to preserve the range of representable values while giving up a bit more of the precision. It makes for simpler conversion from FP32, truncating the mantissa, and the precision is usually good enough for deep learning, making it an easy way to optimize both inference and training.
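To make that truncation concrete, here is a minimal NumPy sketch (illustrative, not from the video) showing that BF16 is just the top 16 bits of an FP32 value: the sign, the full 8-bit exponent, and the top 7 mantissa bits.

    import numpy as np

    def fp32_to_bf16_bits(x):
        # Keep only the top 16 bits: sign + 8-bit exponent + 7-bit mantissa.
        return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)

    def bf16_bits_to_fp32(b):
        # Zero-fill the dropped mantissa bits to recover an FP32 value.
        return (b.astype(np.uint32) << 16).view(np.float32)

    x = np.array([3.14159265, 1e-30, 65504.0], dtype=np.float32)
    print(bf16_bits_to_fp32(fp32_to_bf16_bits(x)))
    # Same dynamic range as FP32, but coarser precision
    # (e.g. 3.14159... becomes 3.140625)

Real conversions often round to nearest rather than truncate, but either way only 7 mantissa bits survive.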
Moving to integer formats like INT8 can optimize models even further, but this requires mapping all the possible data points to a finite range of integer values. So not only do you lose granularity, you also have to choose what range of floating-point values to map to these 255 discrete integer values. The quantization process therefore needs to figure out the right range of values for a given set of parameters, which might include clipping some outliers to control the size of the range, and, based on that, calculate the scale factor for the mapping. Figuring this all out, along with the accuracy effects on a given model with a given dataset, requires some extra effort, which is why there is a variety of model quantization techniques.
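As an illustrative sketch (the percentile-based clipping is an assumption, one of several possible range-selection strategies), symmetric INT8 quantization boils down to picking a range, deriving a scale factor, and rounding onto the integer grid:

    import numpy as np

    def quantize_int8(x, clip_percentile=99.9):
        # Clip outliers so a few extreme values don't stretch the range.
        max_abs = np.percentile(np.abs(x), clip_percentile)
        scale = max_abs / 127.0  # map [-max_abs, +max_abs] onto [-127, +127]
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize_int8(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.randn(10000).astype(np.float32)
    q, scale = quantize_int8(w)
    print("max round-trip error:", np.abs(w - dequantize_int8(q, scale)).max())

Tightening the clip range shrinks the rounding error for typical values at the cost of larger error on the clipped outliers; balancing that per tensor is exactly the extra effort the techniques below automate.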
To fully quantize to INT8 and maintain the highest accuracy possible, use quantization-aware training. This technique actually trains using FP32 data, but the FP32 values are rounded to mimic the precision of INT8. Hence the training is aware that the parameters will be quantized to INT8 afterward, so it can properly tune the weight values. This comes at a cost of extra time and effort in the training phase.
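Conceptually, QAT wraps weights in "fake quantization": the forward pass rounds FP32 values onto the INT8 grid, while the backward pass lets gradients through unchanged (a straight-through estimator). A minimal PyTorch sketch of the idea, with hypothetical class names (tools like Intel Neural Compressor implement this for you):

    import torch

    class FakeQuant(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, scale):
            # Round onto the INT8 grid, but keep the result in FP32.
            return torch.clamp(torch.round(x / scale), -127, 127) * scale

        @staticmethod
        def backward(ctx, grad_output):
            # Straight-through estimator: pretend rounding was the identity.
            return grad_output, None

    class QATLinear(torch.nn.Linear):
        def forward(self, x):
            scale = self.weight.detach().abs().max() / 127.0
            w_fake_q = FakeQuant.apply(self.weight, scale)
            return torch.nn.functional.linear(x, w_fake_q, self.bias)

    layer = QATLinear(16, 4)
    out = layer(torch.randn(2, 16))
    out.sum().backward()  # gradients still flow to layer.weight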
Post-training static quantization avoids that training overhead, adding just a calibration step. This runs inference passes on a subset of data, observing the ranges of the parameters so it can map to the integer range as well as possible; it then evaluates the accuracy and possibly iterates until it meets its criteria.
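The heart of that calibration step can be sketched in a few lines (a simplified, hypothetical observer; real tools also handle per-channel ranges, accuracy evaluation, and the tuning loop):

    import numpy as np

    def calibrate_scale(run_model, calibration_batches):
        # Run inference on a subset of data and record the observed range.
        max_abs = 0.0
        for batch in calibration_batches:
            activations = run_model(batch)
            max_abs = max(max_abs, float(np.abs(activations).max()))
        # Derive the symmetric INT8 scale from the observed range.
        return max_abs / 127.0

    scale = calibrate_scale(lambda b: b * 2.0,
                            [np.random.randn(32, 64) for _ in range(8)])
    print("calibrated scale:", scale)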
Post-training dynamic quantization is simpler: it only quantizes the weights ahead of time. The activations are quantized during inference based on their observed ranges, so the matrix operations can be performed in INT8. There is some overhead with that dynamic conversion, and the activations are still written to and read from memory as FP32. This technique is good for models whose runtime is dominated by loading weights from memory, such as transformers or LSTMs with small batch sizes.
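For example, stock PyTorch exposes this as a one-line transformation of a trained model (Intel Neural Compressor supports the same approach through its own configuration API):

    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(128, 256),
        torch.nn.ReLU(),
        torch.nn.Linear(256, 10),
    )

    # Weights are converted to INT8 ahead of time; activation scales are
    # computed on the fly at inference from the observed ranges.
    quantized = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
    print(quantized)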
Finally, mixed precision is used more in training, because it maintains good enough floating-point accuracy while optimizing the model, increasing your training capacity. But it can also be a really simple way to achieve some model optimization during inference without much effect on accuracy.
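In PyTorch, for instance, mixed precision at inference is as simple as wrapping the forward pass in an autocast region, which runs numerically safe ops in BF16 and leaves the rest in FP32:

    import torch

    model = torch.nn.Linear(512, 512)
    x = torch.randn(8, 512)

    # Eligible ops run in BF16 (fast on AMX-capable CPUs); others stay FP32.
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16), torch.no_grad():
        y = model(x)
    print(y.dtype)  # torch.bfloat16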
All of these techniques are available in Intel Neural Compressor, and subsequent videos in this series will show how to use each of them. To get started now, check out the resources linked below, or scan the QR code here to learn more and download. And be sure to check out the other videos in the AI Model Optimization series.
