Post-training quantization of your (custom MobileNetV2) .h5 or .pb models using TensorFlow Lite 2.4

Short summary:
In this article, I will explain what quantization is, what types of quantization currently exist, and I will show how to quantize your (custom MobileNetV2) .h5 or .pb models with TensorFlow Lite for the Raspberry Pi 4 CPU only.
Code for this article is available here.
Note before you start:
So, let’s start :)
Hardware preparation: a Raspberry Pi 4.
Software preparation: TensorFlow 2.4 (with TensorFlow Lite).
1) What is a quantized model in the context of TensorFlow?
It is a model that does the same thing as the standard model, but faster, smaller, and with similar accuracy. Like in the gif below :)

or this :)

The big turtle is the plain CNN (.h5 or .pb) model.
The baby turtle is the quantized model.
:)
To put it simply: a model usually stores weights such as 465.949494, etc., and these weights are usually floating-point numbers.
In the screenshot below, you can see these floating-point weights (it's just an example).

But after quantization, these weights can be changed to …

So this is a rough example of what quantization is.
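To make this a bit more concrete, here is a minimal NumPy sketch (my own illustration, not code from this article or from TensorFlow Lite itself) of how a few float32 weights can be mapped to 8-bit integers with a single scale factor:
import numpy as np

# a few example float32 weights, like the ones in the screenshot above
weights = np.array([465.949494, -12.25, 0.0031], dtype=np.float32)

# symmetric quantization to int8: real_value is roughly scale * int_value
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale

print(quantized)     # the small int8 values the quantized model stores
print(dequantized)   # the approximate float values recovered at inference time
The quantized model keeps only the small integer values plus the scale, which is why it is smaller and faster while staying close to the original accuracy.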
2) So what is the difference between a quantized model and a plain one (.h5, .pb, etc.)?
1. Smaller model size
For example:
We had a plain CNN model for image classification (.h5, .tflite, etc.) with a file size of 11.59 MB. After quantization, the result is:
the model becomes 3.15 MB.
2. Lower latency for recognizing one image
For example:
We had a plain CNN model for image classification (.h5 or .tflite) with a latency of 300 ms per image on the CPU. After quantization, the result is:
170 ms on the CPU.
3. Accuracy loss of up to 1 percent
For example:
We had a plain CNN model for image classification (.h5 or .tflite) with an accuracy of 99%. After quantization, the result is:
the accuracy stays at 99% (a small sketch of how to check the file sizes yourself follows this list).
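A minimal sketch of how to compare the file sizes yourself (the paths are hypothetical examples, not files shipped with this article):
import os

# sizes of the original Keras model and the quantized TensorFlow Lite model, in MB
print(os.path.getsize("/content/test/mobilenetv2.h5") / 1e6)
print(os.path.getsize("/content/test/quantized_model.tflite") / 1e6)
Latency and accuracy have to be measured by actually running the models; section 5 below shows how I did that on the Raspberry Pi 4.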
3) What types of quantization exist?
Basically, there are 2 types of quantization:
- Quantization-aware training;
- Post-training quantization, with 3 different approaches (post-training dynamic range quantization, post-training integer quantization, post-training float16 quantization).
P.S.: In this article, I will explain the second type (post-training quantization).
4) How can I do post-training quantization?
A notebook with these examples is available here.
1 How to convert a .h5 model to a quantized .tflite model (8-bit, named float8 in the repo scripts):
1.0 using Optimize.DEFAULT
import tensorflow as tf

model = tf.keras.models.load_model("/content/test/mobilenetv2.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

# save the converted quantized model in tflite format
open("/content/test/quantization_DEFAULT_8bit_model_h5_to_tflite.tflite", "wb").write(tflite_quant_model)
1.1 using Optimize.OPTIMIZE_FOR_SIZE
import tensorflow as tf

model = tf.keras.models.load_model("/content/test/mobilenetv2.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()

# save the converted quantized model in tflite format
open("/content/test/quantization_OPTIMIZE_FOR_SIZE_8bit_model_h5_to_tflite.tflite", "wb").write(tflite_quant_model)
1.2 using Optimize.OPTIMIZE_FOR_LATENCY
import tensorflow as tf

model = tf.keras.models.load_model("/content/test/mobilenetv2.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_LATENCY]
tflite_quant_model = converter.convert()

# save the converted quantized model in tflite format
open("/content/test/quantization_OPTIMIZE_FOR_LATENCY_8bit_model_h5_to_tflite.tflite", "wb").write(tflite_quant_model)
2 How to convert a .h5 model to a quantized .tflite model (float16):
2.0 using Optimize.DEFAULT
import tensorflow as tf

model = tf.keras.models.load_model("/content/test/mobilenetv2.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()

# save the converted quantized model in tflite format
open("/content/test/quantization_DEFAULT_float16_model_h5_to_tflite.tflite", "wb").write(tflite_quant_model)
2.1 using Optimize.OPTIMIZE_FOR_SIZE
import tensorflow as tf

model = tf.keras.models.load_model("/content/test/mobilenetv2.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()

# save the converted quantized model in tflite format
open("/content/test/quantization_OPTIMIZE_FOR_SIZE_float16_model_h5_to_tflite.tflite", "wb").write(tflite_quant_model)
2.2 using Optimize.OPTIMIZE_FOR_LATENCY
import tensorflow as tf

model = tf.keras.models.load_model("/content/test/mobilenetv2.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_LATENCY]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()

# save the converted quantized model in tflite format
open("/content/test/quantization_OPTIMIZE_FOR_LATENCY_float16_model_h5_to_tflite.tflite", "wb").write(tflite_quant_model)
5) What will the result be after post-training quantization of the MobileNetV2 model?
I created 8 Python scripts that measure recognition speed over 100 images on the Raspberry Pi 4. To run these tests, execute the following commands sequentially.
# clone my repo
git clone https://github.com/oleksandr-g-rock/Quantization.git

# go to the directory
cd Quantization

# run the tests
python3 test_h5.py
python3 test_tflite.py
python3 OPTIMIZE_FOR_SIZE_float16.py
python3 DEFAULT_float16.py
python3 OPTIMIZE_FOR_LATENCY_float16.py
python3 OPTIMIZE_FOR_SIZE_float8.py
python3 DEFAULT_float8.py
python3 OPTIMIZE_FOR_LATENCY_float8.py
These 8 scripts test the following models:
- the TensorFlow .h5 model
- the plain converted TensorFlow Lite .tflite model (no quantization)
- the TensorFlow Lite OPTIMIZE_FOR_SIZE/DEFAULT/OPTIMIZE_FOR_LATENCY float16 quantized .tflite models
- the TensorFlow Lite OPTIMIZE_FOR_SIZE/DEFAULT/OPTIMIZE_FOR_LATENCY 8-bit (float8 in the script names) quantized .tflite models
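For reference, here is a minimal sketch of what such a benchmark loop could look like (my own illustration with hypothetical paths, not the exact code of the repo scripts; it assumes the model keeps float32 inputs, which is the case for the conversions above):
import glob
import time

import numpy as np
import tensorflow as tf
from PIL import Image

# load the model to benchmark (any of the .tflite files above)
interpreter = tf.lite.Interpreter(model_path="quantization_DEFAULT_8bit_model_h5_to_tflite.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
height, width = int(inp["shape"][1]), int(inp["shape"][2])

# assumption: 100 test images in a local "images" folder
paths = glob.glob("images/*.jpg")[:100]

start = time.time()
for path in paths:
    image = Image.open(path).convert("RGB").resize((width, height))
    # depending on how the model was trained, you may also need to scale pixels, e.g. / 255.0
    data = np.expand_dims(np.asarray(image, dtype=np.float32), axis=0)
    interpreter.set_tensor(inp["index"], data)
    interpreter.invoke()
    _ = interpreter.get_tensor(out["index"])
elapsed = time.time() - start

print("total: %.2f s, average per image: %.1f ms" % (elapsed, 1000 * elapsed / len(paths)))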
My results after testing are below and here:

Results:
1 Results for model size:
- After converting the model from .h5 to .tflite, the size went from 11.59 MB to 10.96 MB.
- After quantizing the model from .h5 to .tflite with TensorFlow Lite OPTIMIZE_FOR_SIZE/DEFAULT/OPTIMIZE_FOR_LATENCY float16, the size went from 10.96 MB to 5.8 MB.
- After quantizing the model from .h5 to .tflite with TensorFlow Lite OPTIMIZE_FOR_SIZE/DEFAULT/OPTIMIZE_FOR_LATENCY 8-bit (float8), the size went from 5.8 MB to 3.3 MB.
So I would say the size reduction results are excellent.
2 Results for recognition speed: the speedup is not as large as I expected, but the recognition speed is acceptable for the Raspberry Pi, and accuracy stays at about 99%.
Code for this article is available here.