
Quantizing your custom MobileNetV2 .h5 or .pb models with post-training quantization in TensorFlow Lite 2.4

Quantizing your .h5 or .tflite model with TensorFlow Lite (photo/GIF by author): https://github.com/oleksandr-g-rock/Quantization/blob/main/1_8z4lTcSfyEad87GGsf9IPw.png

Short summary:

In this article, I will explain what quantization is and what types of quantization exist at this time, and I will show how to quantize your (custom MobileNetV2) .h5 or .pb models using TensorFlow Lite, for the CPU of a Raspberry Pi 4 only.

Code for this article is available here.

Note before you start:

So, let’s start :)

Hardware preparation:

Software preparation:

1) What is a quantized model in the context of TensorFlow?

It is a model that does the same job as the standard model but is faster and smaller, with similar accuracy. Like in the GIF below :)

Quantizing your .h5 or .tflite model with TensorFlow Lite (photo/GIF by author): https://github.com/oleksandr-g-rock/Quantization/blob/main/1_-bmgkbQNGwVkDJJq64yVxQ.gif

or this :)

Quantizing your .h5 or .tflite model with TensorFlow Lite (photo/GIF by author): https://github.com/oleksandr-g-rock/Quantization/blob/main/1_8z4lTcSfyEad87GGsf9IPw.png

The big turtle is the plain CNN (.h5 or .pb) model.
The baby turtle is the quantized model.
:)

To put it simply: the models we normally use store their weights (465.949494, etc.) as floating-point numbers.
In the screenshot below, you can see these floating-point weights; it's just an example.

Quantizing your .h5 or .tflite model with TensorFlow Lite (photo/GIF by author): https://github.com/oleksandr-g-rock/Quantization/blob/main/1_KvHwa5eUfyVTaNzEm3TJ8A.png

But after quantization, these weights can be changed to …

Quantizing your .h5 or .tflite model with TensorFlow Lite (photo/GIF by author): https://github.com/oleksandr-g-rock/Quantization/blob/main/1_3bYW4pjMPEAN47aXLrDVvA.png

So this is a rough example of what quantization is.
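To make this more concrete, here is a toy sketch of the idea (my own illustration, not the exact algorithm TensorFlow Lite uses): float32 weights get mapped to 8-bit integers through a scale and a zero-point, and can be approximately recovered afterwards.

import numpy as np

# a few example float32 weights
weights = np.array([465.949494, -12.3, 0.5, 87.25], dtype=np.float32)

# affine quantization: real_value ≈ scale * (int8_value - zero_point)
scale = (weights.max() - weights.min()) / 255.0
zero_point = int(round(-128 - weights.min() / scale))

quantized = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
dequantized = scale * (quantized.astype(np.float32) - zero_point)

print(quantized)    # small 8-bit integers instead of 32-bit floats
print(dequantized)  # approximately the original weights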

2) So what is the difference between a quantized model and a plain one (.h5 or .pb, etc.)?

1. Reduced model size

For example:
we had a plain CNN model (for image classification) as an .h5 or .tflite file of 11.59 MB.

After quantization:
the model is 3.15 MB.

2. Lower latency per image

For example:
we had a plain CNN model (for image classification) in .h5 or .tflite format, with a latency of 300 ms per image on the CPU.

After quantization:
170 ms on the CPU.

3. Accuracy loss of up to 1 percent

For example:
we had a plain CNN model (for image classification) with 99% accuracy.

After quantization:
accuracy stays at about 99%.
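You can check the size reduction on your own files with a few lines of Python (the paths below match the examples later in this article; replace them with your own):

import os

for path in ["/content/test/mobilenetv2.h5",
             "/content/test/quantization_DEFAULT_8bit_model_h5_to_tflite.tflite"]:
    print(path, os.path.getsize(path) / (1024 * 1024), "MB")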

3) What types of quantization exist?

Basically, two types of quantization exist:
- Quantization-aware training;
- Post-training quantization, with 3 different approaches (post-training dynamic range quantization, post-training integer quantization, post-training float16 quantization).

P.S.: In this article, I will cover the second type, post-training quantization.
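Of the three post-training approaches, I show dynamic range and float16 quantization below. The integer approach additionally needs a representative dataset so the converter can calibrate the activations; here is a minimal sketch of what it looks like, in case you need it (representative_images is a hypothetical array of preprocessed input samples):

import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("/content/test/mobilenetv2.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# representative_images is a hypothetical array of preprocessed samples
def representative_dataset():
    for image in representative_images[:100]:
        yield [np.expand_dims(image, axis=0).astype(np.float32)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_int8_model = converter.convert()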

4) How can I do post-training quantization?

A notebook with these examples is available here.

1. How to convert .h5 to a quantized .tflite model (8-bit weights):

(Note: OPTIMIZE_FOR_SIZE and OPTIMIZE_FOR_LATENCY are documented as deprecated aliases of DEFAULT in TensorFlow Lite, so the three variants below should behave very similarly.)

1.0 using Optimize.DEFAULT

import tensorflow as tf

# load the trained Keras model
model = tf.keras.models.load_model("/content/test/mobilenetv2.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()
# save the converted quantized model in tflite format
open("/content/test/quantization_DEFAULT_8bit_model_h5_to_tflite.tflite", "wb").write(tflite_quant_model)

1.1 using Optimize.OPTIMIZE_FOR_SIZE

import tensorflow as tf

model = tf.keras.models.load_model("/content/test/mobilenetv2.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
# save the converted quantized model in tflite format
open("/content/test/quantization_OPTIMIZE_FOR_SIZE_8bit_model_h5_to_tflite.tflite", "wb").write(tflite_quant_model)

1.2 using Optimize.OPTIMIZE_FOR_LATENCY

import tensorflow as tf

model = tf.keras.models.load_model("/content/test/mobilenetv2.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_LATENCY]
tflite_quant_model = converter.convert()
# save the converted quantized model in tflite format
open("/content/test/quantization_OPTIMIZE_FOR_LATENCY_8bit_model_h5_to_tflite.tflite", "wb").write(tflite_quant_model)

2. How to convert .h5 to a quantized .tflite model (float16):

2.0 using Optimize.DEFAULT

import tensorflow as tf

model = tf.keras.models.load_model("/content/test/mobilenetv2.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# store weights as 16-bit floats instead of 32-bit
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()
# save the converted quantized model in tflite format
open("/content/test/quantization_DEFAULT_float16_model_h5_to_tflite.tflite", "wb").write(tflite_quant_model)

2.1 using Optimize.OPTIMIZE_FOR_SIZE

import tensorflow as tf

model = tf.keras.models.load_model("/content/test/mobilenetv2.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()
# save the converted quantized model in tflite format
open("/content/test/quantization_OPTIMIZE_FOR_SIZE_float16_model_h5_to_tflite.tflite", "wb").write(tflite_quant_model)

2.2 using Optimize.OPTIMIZE_FOR_LATENCY

import tensorflow as tf

model = tf.keras.models.load_model("/content/test/mobilenetv2.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_LATENCY]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()
# save the converted quantized model in tflite format
open("/content/test/quantization_OPTIMIZE_FOR_LATENCY_float16_model_h5_to_tflite.tflite", "wb").write(tflite_quant_model)

5) What results can you expect after post-training quantization of a MobileNetV2 model?

I created 8 Python scripts that test the recognition speed on 100 images on the Raspberry Pi 4. To run these tests, run the following commands, but only sequentially.

#clone my repo
git clone https://github.com/oleksandr-g-rock/Quantization.git

#go to directory
cd Quantization

#run tests
python3 test_h5.py
python3 test_tflite.py
python3 OPTIMIZE_FOR_SIZE_float16.py
python3 DEFAULT_float16.py
python3 OPTIMIZE_FOR_LATENCY_float16.py
python3 OPTIMIZE_FOR_SIZE_float8.py
python3 DEFAULT_float8.py
python3 OPTIMIZE_FOR_LATENCY_float8.py

These 8 scripts test the following models:

  • TensorFlow .h5
  • Plain converted TensorFlow Lite .tflite
  • TensorFlow Lite OPTIMIZE_FOR_SIZE/DEFAULT/OPTIMIZE_FOR_LATENCY float16 quantized .tflite models
  • TensorFlow Lite OPTIMIZE_FOR_SIZE/DEFAULT/OPTIMIZE_FOR_LATENCY 8-bit ("float8" in the script names) quantized .tflite models
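Each script measures latency along these lines (a sketch of the idea, not the exact code from my repo; "model.tflite" is a placeholder for any of the models above):

import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# 100 dummy inputs - the real scripts load actual photos
images = [np.random.rand(1, 224, 224, 3).astype(np.float32) for _ in range(100)]

start = time.time()
for image in images:
    interpreter.set_tensor(input_details[0]["index"], image)
    interpreter.invoke()
    _ = interpreter.get_tensor(output_details[0]["index"])
elapsed = time.time() - start
print("average latency:", elapsed / len(images) * 1000, "ms per image")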

My results after testing are below and here:

Quantizing your .h5 or .tflite model with TensorFlow Lite (photo/GIF by author): https://github.com/oleksandr-g-rock/Quantization/blob/main/1_P6oEnzouLdQzcsDZW3SZnQ.png

Results:

1. Model size:

  • After converting the model from .h5 to .tflite, the size went from 11.59 MB to 10.96 MB
  • After quantizing from .h5 to .tflite with TensorFlow Lite OPTIMIZE_FOR_SIZE/DEFAULT/OPTIMIZE_FOR_LATENCY float16, the size went from 10.96 MB to 5.8 MB
  • After quantizing from .h5 to .tflite with TensorFlow Lite OPTIMIZE_FOR_SIZE/DEFAULT/OPTIMIZE_FOR_LATENCY 8-bit, the size went from 5.8 MB to 3.3 MB

So I would say the model size results are excellent.

2. Recognition speed: the results are not as good as I expected, but the recognition speed is acceptable for the Raspberry Pi, and accuracy stays at about 99%.

Code for this article is available here.
