Optimizing Your Model for Inference with PyTorch Quantization

by ITECHNEWS
March 2, 2022
in Data Science, Leading Stories

Quantization is a common technique for making a model run faster at inference time, with a lower memory footprint and lower power consumption, without changing the model architecture. In this blog post, we briefly introduce what quantization is and how to apply it to your PyTorch models.

What is Quantization

Typical neural networks run in 32-bit floating point (float32) precision, which means both the activation and weight Tensors are in float32, and computations are performed in float32 precision as well. Quantization reduces the precision of the model to a more compact data type, for example 8-bit integer (int8), that requires less memory for storage and allows faster computation. Taking int8 as an example, after we quantize the model, both activation and weight Tensors can be stored in int8, and the computations are performed in int8, which is typically more efficient than float32.
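To make the storage saving concrete, here is a minimal sketch that uses only the Python standard library. The count of one million weights is an arbitrary assumption for illustration and does not correspond to any particular model.

```python
from array import array

# One million made-up "weights"; the values are irrelevant for sizing.
num_weights = 1_000_000

float32_weights = array("f", [0.0]) * num_weights  # "f" = 4-byte float
int8_weights = array("b", [0]) * num_weights       # "b" = 1-byte signed integer

float32_bytes = float32_weights.itemsize * len(float32_weights)
int8_bytes = int8_weights.itemsize * len(int8_weights)

print(f"float32: {float32_bytes / 1e6:.1f} MB")
print(f"int8:    {int8_bytes / 1e6:.1f} MB")
print(f"ratio:   {float32_bytes / int8_bytes:.0f}x")
```

The same 4x ratio applies to weights and activations alike, since it follows purely from the element size.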

We can view quantization as a form of lossy compression for the model: the lower-precision data type has less dynamic range and resolution than float32. We therefore have to trade off the accuracy of the model against the speedup, memory, and power-consumption benefits that we get from quantization.
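The tension between dynamic range and resolution can be sketched with a little arithmetic. The helper below is hypothetical (it is not part of PyTorch) and assumes the 256 levels of an 8-bit integer are spread uniformly over the chosen range.

```python
def int8_step_size(min_val: float, max_val: float) -> float:
    """Spacing between adjacent representable values when the range
    [min_val, max_val] is spread uniformly over 256 integer levels."""
    return (max_val - min_val) / 255


narrow = int8_step_size(-1.0, 1.0)      # small range, fine resolution
wide = int8_step_size(-100.0, 100.0)    # large range, coarse resolution

# The worst-case rounding error is half a step in either case.
print(f"range [-1, 1]:     step {narrow:.5f}, max rounding error {narrow / 2:.5f}")
print(f"range [-100, 100]: step {wide:.5f}, max rounding error {wide / 2:.5f}")
```

Covering a range 100 times wider makes every step 100 times coarser, which is why the choice of range matters for accuracy.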

How to Use PyTorch Quantization

How do we obtain a quantized model from a floating point model? There are two ways in general:

  • Post Training Quantization: After we have a trained model, we can convert it to a quantized model. This is typically easy to apply, but we may see some accuracy loss for some types of models.
  • Quantization Aware Training: During training, we insert fake quantization operators into the model to simulate the quantization behavior and convert the model to a quantized model after training based on the model with fake quantize operators. This is harder to apply than post-training quantization since it requires retraining the model, but typically gives better accuracy.
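As a rough illustration of what a fake quantization operator computes, the sketch below quantizes a value and immediately dequantizes it, so the result stays a float but carries the rounding and clamping error of int8. The function name and signature are assumptions made for this example, not PyTorch's API; `scale` and `zero_point` are the affine mapping parameters introduced in the next paragraph.

```python
def fake_quantize(x: float, scale: float, zero_point: int,
                  qmin: int = -128, qmax: int = 127) -> float:
    """Quantize then dequantize: the output is a float that has been
    snapped to the nearest value representable in the integer range."""
    q = round(x / scale) + zero_point   # map to the integer grid
    q = max(qmin, min(qmax, q))         # clamp to the int8 range
    return (q - zero_point) * scale     # map back to float


# A value inside the range is rounded to the nearest step ...
print(fake_quantize(0.123, scale=0.1, zero_point=0))
# ... and a value outside the range saturates at the largest level.
print(fake_quantize(50.0, scale=0.1, zero_point=0))
```

During quantization aware training the model sees these snapped values in the forward pass, which lets the weights adapt to the quantization error before conversion.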

Since we are reducing the precision of the Tensors, we need to establish a mapping between the float32 Tensor and the quantized Tensor. A typical mapping function is an affine transformation. For example, to quantize a float32 Tensor to an int8 Tensor, we divide each float32 value by a `scale`, round the result, add a `zero_point` to it, and then clamp the value to the int8 range. We therefore need to figure out the `scale` and `zero_point` parameters for each Tensor that we want to quantize. In general, we have the following process (Post Training Quantization):

  1. Prepare: We insert observers into the model to record the statistics of a Tensor, for example, its min/max values.
  2. Calibration: We run the model with some representative sample data, which allows the observers to record the Tensor statistics.
  3. Convert: Based on the calibrated model, we figure out the quantization parameters for the mapping function and convert the floating point operators to quantized operators.
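The three steps above can be sketched in plain Python for a single Tensor. This is a simplified illustration under stated assumptions: a min/max observer, a signed int8 range of [-128, 127], and made-up calibration batches. The class and function names are hypothetical, and PyTorch's own observers are more elaborate.

```python
class MinMaxObserver:
    """Prepare: an observer that records the running min/max it sees."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, values):
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))

    def compute_qparams(self, qmin=-128, qmax=127):
        # Include 0.0 in the range so that zero is exactly representable.
        lo = min(self.min_val, 0.0)
        hi = max(self.max_val, 0.0)
        # Fall back to 1.0 to avoid a zero scale if every value was 0.
        scale = (hi - lo) / (qmax - qmin) or 1.0
        zero_point = qmin - round(lo / scale)
        return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Affine mapping: divide by scale, round, shift, clamp."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale


# Calibration: feed representative (here: made-up) batches to the observer.
observer = MinMaxObserver()
for batch in ([0.1, -0.4, 0.9], [1.5, -0.5, 0.3], [0.7, 1.05, -0.2]):
    observer.observe(batch)

# Convert: derive the quantization parameters and apply the mapping.
scale, zero_point = observer.compute_qparams()
q = quantize(0.9, scale, zero_point)
print(scale, zero_point, q, dequantize(q, scale, zero_point))
```

For values inside the observed range, the round-trip error is at most half of `scale`; values outside the range are clamped, which is why calibration data should be representative.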

Currently, PyTorch offers two different ways of quantization: Eager Mode Quantization and FX Graph Mode Quantization. Here I’ll show an example using FX Graph Mode Quantization to quantize a resnet50 model from torchvision:

import copy

import torch
from torch.ao.quantization import get_default_qconfig
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx
from torchvision.models import resnet50

fp32_model = resnet50().eval()
model = copy.deepcopy(fp32_model)
# `qconfig` (quantization configuration) specifies how we should
# observe the activation and weight of an operator.
# `qconfig_dict` specifies the `qconfig` for each operator in the model:
# - we can specify `qconfig` for certain types of modules
# - we can specify `qconfig` for a specific submodule in the model
# - we can specify `qconfig` for some functional calls in the model
# - we can also set `qconfig` to None to skip quantization for some operators
qconfig = get_default_qconfig("fbgemm")
qconfig_dict = {"": qconfig}
# `prepare_fx` inserts observers in the model based on the configuration in `qconfig_dict`.
# Note: more recent PyTorch releases also require an `example_inputs` argument here.
model_prepared = prepare_fx(model, qconfig_dict)
# calibration runs the model with some sample data, which allows observers to record the
# statistics of the activations and weights of the operators
# (random tensors stand in for representative data here)
calibration_data = [torch.randn(1, 3, 224, 224) for _ in range(100)]
with torch.no_grad():
    for sample in calibration_data:
        model_prepared(sample)
# `convert_fx` converts a calibrated model to a quantized model; this includes inserting
# quantize and dequantize operators into the model and swapping floating point operators
# with quantized operators
model_quantized = convert_fx(copy.deepcopy(model_prepared))
# benchmark (`%timeit` is an IPython/Jupyter magic, so run this part in a notebook)
x = torch.randn(1, 3, 224, 224)
%timeit fp32_model(x)
%timeit model_quantized(x)

Output:

38.7 ms ± 296 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

8.65 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

As we can see, the quantized model achieved around 4.5x speedup over the original float32 model.
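Outside a notebook, where the `%timeit` magic is unavailable, a similar measurement can be sketched with the standard library's `timeit` module. The helper names below are hypothetical, and the callable passed in would be something like `lambda: fp32_model(x)`.

```python
import timeit


def mean_latency_ms(fn, number=10, repeat=7):
    """Mean time per call in milliseconds, using the same
    repeat-and-loop structure that %timeit reports."""
    runs = timeit.repeat(fn, number=number, repeat=repeat)
    per_call = [total / number for total in runs]
    return 1000 * sum(per_call) / len(per_call)


def speedup(baseline_ms, optimized_ms):
    return baseline_ms / optimized_ms


# Ratio of the two timings reported in the output above.
print(f"{speedup(38.7, 8.65):.2f}x")

# Timing an arbitrary callable (a stand-in for a model call).
print(f"{mean_latency_ms(lambda: sum(range(1000))):.4f} ms")
```

Absolute timings depend on the hardware and the PyTorch build, so the ratio is more meaningful than either number on its own.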

In the above example, we used `qconfig_dict` to control how the model is quantized. The empty-string key sets the global configuration, which applies to every operator that has no more specific entry. We can use `qconfig_dict` to decide the type of quantization we want for each individual operator used in the model.
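As a sketch of a more specific configuration, the helper below builds a `qconfig_dict` that keeps the global setting but disables quantization for selected submodules through the `"module_name"` key, with `None` as the `qconfig`. The helper itself is hypothetical, the string placeholder only stands in for a real `qconfig` so the sketch runs without PyTorch, and `"fc"` is used as an example submodule name (the final linear layer in torchvision's resnet50).

```python
def build_qconfig_dict(global_qconfig, skip_module_names=()):
    """Return a qconfig_dict with one global qconfig plus per-module
    overrides that skip quantization (qconfig set to None)."""
    return {
        "": global_qconfig,  # applies to every operator unless overridden
        "module_name": [(name, None) for name in skip_module_names],
    }


# Stand-in value; in the example above this would be the object
# returned by get_default_qconfig("fbgemm").
placeholder_qconfig = "default-fbgemm-qconfig"

qconfig_dict = build_qconfig_dict(placeholder_qconfig, skip_module_names=["fc"])
print(qconfig_dict)
```

Entries for module types or functional calls would go under an `"object_type"` key in the same way, as a list of `(type_or_function, qconfig)` pairs.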

How to Do Numerical Debugging after Quantization

After we quantize the model, we may find that we get a great speedup but that accuracy suffers because we quantized too many operators. How do we find the operators that are most sensitive to quantization, so that we can skip quantizing them and recover the accuracy?

We can use Numeric Suite in PyTorch to measure the impact of quantization on the activations and weights of the model. It is available for both quantization workflows, as Eager Mode Numeric Suite and FX Graph Mode Numeric Suite, and it has three features:

  • Compare the quantization loss for weight
  • Compare the accumulative quantization loss for activation
  • Compare the per operator quantization loss for activation

Below, we show a simple example that compares the quantization loss for the weights of the resnet50 model using FX Graph Mode Numeric Suite.

# Compare weights of `model_prepared` and `model_quantized`.
import torch.ao.ns._numeric_suite_fx as ns
# Note: when comparing weights in models with Conv-BN for PTQ, we need to compare
# weights after Conv-BN fusion for a proper comparison.  Because of this, we use
# `model_prepared` instead of `fp32_model` when comparing weights.
# Extract conv and linear weights from corresponding parts of two models, and save
# them in `resnet50_wt_compare_dict`.
resnet50_wt_compare_dict = ns.extract_weights(
   'fp32',  # string name for model A
   model_prepared,  # model A
   'int8',  # string name for model B
   model_quantized,  # model B
)
# calculate SQNR between each pair of weights
# SQNR is a measure of quantization loss, large SQNR value means the quantization loss is small
ns.extend_logger_results_with_comparison(
   resnet50_wt_compare_dict,  # results object to modify inplace
   'fp32',  # string name of model A (from previous step)
   'int8',  # string name of model B (from previous step)
   torch.ao.ns.fx.utils.compute_sqnr,  # the function to use to compare two tensors
   'sqnr',  # the name to use to store the results under
)
# massage the data into a format easy to graph and print
# Note: no util function for this since use cases may be different per user
# Note: there is a lot of debugging data, and it will be challenging to print all of it
# and fit on a laptop screen.  It is up to the user to decide which data is useful for them.
resnet50_wt_to_print = []
for idx, (layer_name, v) in enumerate(resnet50_wt_compare_dict.items()):
    resnet50_wt_to_print.append([
        idx,
        layer_name,
        v['weight']['int8'][0]['prev_node_target_type'],
        v['weight']['int8'][0]['values'][0].shape,
        v['weight']['int8'][0]['sqnr'][0],
    ])
%matplotlib inline
import matplotlib.pyplot as plt
# Note: newer matplotlib releases renamed this style to 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
# a simple line graph
def plot(xdata, ydata, xlabel, ylabel, title):
    # create the figure and its axes together so the labels and the line share one axes
    fig, ax = plt.subplots(figsize=(10, 5), dpi=100)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    ax.plot(xdata, ydata)
# plot the SQNR between fp32 and int8 weights for each layer
# Note: we may explore easier to read charts (bar chart, etc) at a later time, for now
# line chart + table is good enough.
plot([x[0] for x in resnet50_wt_to_print], [x[4] for x in resnet50_wt_to_print], 'idx', 'sqnr', 'weights, idx to sqnr')

After plotting the SQNR for all the weights, we can find the layer with the lowest SQNR, which means it has the largest quantization error, and skip quantizing that layer by changing the `qconfig_dict` settings. We can find an optimal point for our use case by gradually skipping quantization for the layers that are most sensitive to quantization.
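To illustrate the metric and the ranking step, here is a small pure-Python sketch. It assumes the common definition of SQNR in decibels, 20 * log10(||x|| / ||x - y||), and the layer names and weight values are made up for the example rather than taken from resnet50.

```python
import math


def sqnr_db(reference, quantized):
    """Signal-to-quantization-noise ratio in dB; higher means less loss."""
    signal = math.sqrt(sum(r * r for r in reference))
    noise = math.sqrt(sum((r - q) ** 2 for r, q in zip(reference, quantized)))
    if noise == 0:
        return math.inf  # identical tensors: no quantization loss
    return 20 * math.log10(signal / noise)


# Hypothetical (float weights, dequantized int8 weights) per layer.
layers = {
    "layer_a": ([1.0, -2.0, 3.0], [1.01, -1.99, 3.0]),
    "layer_b": ([0.1, 0.2, -0.1], [0.15, 0.2, -0.1]),
    "layer_c": ([0.5, 0.5], [0.5, 0.49]),
}

scores = {name: sqnr_db(ref, quant) for name, (ref, quant) in layers.items()}

# Rank from lowest SQNR (most sensitive to quantization) to highest.
ranked = sorted(scores, key=scores.get)
for name in ranked:
    print(f"{name}: {scores[name]:.1f} dB")
```

The layers at the top of this ranking are the first candidates to exclude from quantization; re-measuring accuracy and speed after each exclusion shows when the trade-off is acceptable.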

Source: Jerry Zhang is a Software Engineer in the PyTorch Architecture Optimization
Tags: Optimizing PyTorch Quantization