Optimized and Quantized deepset/roberta-base-squad2 with a custom handler.py

This repository implements a custom question-answering handler for 🤗 Inference Endpoints, using 🤗 Optimum for accelerated inference. The code for the custom handler is in handler.py.

Below we also describe how we converted & optimized the model, based on the Accelerate Transformers with Hugging Face Optimum blog post. You can also check out the notebook.

Expected Request payload

{
    "inputs": {
        "question": "As what is Philipp working?", 
        "context": "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
    }
}

Below is an example of how to run a request using Python and requests.

Run Request

import requests as r

ENDPOINT_URL = ""
HF_TOKEN = ""


def predict(question: str = None, context: str = None):
    payload = {"inputs": {"question": question, "context": context}}
    response = r.post(
        ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
    )
    return response.json()


prediction = predict(
    question="As what is Philipp working?",
    context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science."
)

Expected output

{
    'score': 0.4749588668346405,
    'start': 88,
    'end': 102,
    'answer': 'Technical Lead'
}
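
Since the endpoint returns a plain dict, individual fields of the prediction above can be read directly, for example:

print(prediction["answer"])
# 'Technical Lead'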

Convert & Optimize model with Optimum

Steps:

  1. Convert model to ONNX
  2. Optimize & quantize model with Optimum
  3. Create Custom Handler for Inference Endpoints
  4. Test Custom Handler Locally
  5. Push to repository and create Inference Endpoint

Setup & Installation

%%writefile requirements.txt
optimum[onnxruntime]==1.4.0
mkl-include
mkl

!pip install -r requirements.txt

0. Baseline Performance

from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

Okay, let's test the performance (latency) with a sequence length of 128.

context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value." 
question="As what is Philipp working?" 

payload = {"inputs": {"question": question, "context": context}}
from time import perf_counter
import numpy as np 

def measure_latency(pipe, payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
    # timed run
    for _ in range(50):
        start_time = perf_counter()
        _ = pipe(question=payload["inputs"]["question"], context=payload["inputs"]["context"])
        latency = perf_counter() - start_time
        latencies.append(latency)
    # compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}"

print(f"Vanilla model {measure_latency(qa,payload)}")
#     Vanilla model Average latency (ms) - 64.15 +\- 2.44
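
As a quick sanity check (an addition, not part of the original notebook), you can confirm that the test payload really lands near the 128-token sequence length assumed above:

# tokenize question and context as a pair, just like the QA pipeline does
inputs = qa.tokenizer(question, context)
print(len(inputs["input_ids"]))
# should be close to 128 tokens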

1. Convert model to ONNX

from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer
from pathlib import Path


model_id = "deepset/roberta-base-squad2"
onnx_path = Path(".")

# load vanilla transformers and convert to onnx
model = ORTModelForQuestionAnswering.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
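
As an optional sanity check (reusing the question/context variables from the baseline step), you can wire the exported ONNX model into a regular transformers pipeline and verify it produces the same answer as the vanilla model:

from transformers import pipeline

# run the exported ONNX model through the same QA pipeline as the baseline
onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
print(onnx_qa(question=question, context=context))
# the answer and score should closely match the vanilla pipeline output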

2. Optimize & quantize model with Optimum

from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig

# Create the optimizer
optimizer = ORTOptimizer.from_pretrained(model)

# Define the optimization strategy by creating the appropriate configuration
optimization_config = OptimizationConfig(optimization_level=99) # enable all optimizations

# Optimize the model
optimizer.optimize(save_dir=onnx_path, optimization_config=optimization_config)
# create ORTQuantizer and define quantization configuration
dynamic_quantizer = ORTQuantizer.from_pretrained(onnx_path, file_name="model_optimized.onnx")
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# apply the quantization configuration to the model
model_quantized_path = dynamic_quantizer.quantize(
    save_dir=onnx_path,
    quantization_config=dqconfig,
)
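
To see what quantization buys in disk footprint, here is a small sketch comparing file sizes (the file names follow from the save steps above):

import os

# compare the optimized ONNX model with its quantized counterpart
size_optimized = os.path.getsize(onnx_path / "model_optimized.onnx") / (1024 * 1024)
size_quantized = os.path.getsize(onnx_path / "model_optimized_quantized.onnx") / (1024 * 1024)
print(f"Optimized model: {size_optimized:.2f} MB")
print(f"Quantized model: {size_quantized:.2f} MB")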

3. Create Custom Handler for Inference Endpoints

%%writefile handler.py
from typing import Dict, Any
from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer, pipeline


class EndpointHandler:
    def __init__(self, path=""):
        # load the optimized and quantized model
        self.model = ORTModelForQuestionAnswering.from_pretrained(path, file_name="model_optimized_quantized.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        # create inference pipeline
        self.pipeline = pipeline("question-answering", model=self.model, tokenizer=self.tokenizer)

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        """
        Args:
            data (:obj:`dict`):
                includes the input data for the inference.
        Return:
            A :obj:`dict` containing the answer, score, and start/end character positions.
        """
        inputs = data.get("inputs", data)
        # run the model on the inputs
        prediction = self.pipeline(**inputs)
        return prediction

4. Test Custom Handler Locally

from handler import EndpointHandler

# init handler
my_handler = EndpointHandler(path=".")

# prepare sample payload
context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value." 
question="As what is Philipp working?" 

payload = {"inputs": {"question": question, "context": context}}

# test the handler
my_handler(payload)
from time import perf_counter
import numpy as np 

def measure_latency(handler, payload):
    latencies = []
    # warm up
    for _ in range(10):
        _ = handler(payload)
    # timed run
    for _ in range(50):
        start_time = perf_counter()
        _ = handler(payload)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +/- {time_std_ms:.2f}"

print(f"Optimized & Quantized model {measure_latency(my_handler,payload)}")
#     

Optimized & Quantized model Average latency (ms) - 29.90 +\- 0.53 Vanilla model Average latency (ms) - 64.15 +\- 2.44

5. Push to repository and create Inference Endpoint

# add all our new files
!git add * 
# commit our files
!git commit -m "add custom handler"
# push the files to the hub
!git push
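
Alternatively, if you prefer not to use git directly, the same files can be uploaded with the huggingface_hub client; the repo id below is a placeholder for your own repository:

from huggingface_hub import HfApi

api = HfApi()
# upload the current folder (model files, handler.py, requirements.txt) to the Hub
api.upload_folder(
    folder_path=".",
    repo_id="<your-username>/roberta-base-squad2-optimized",  # placeholder repo id
    repo_type="model",
)

Once the files are on the Hub, you can create the Inference Endpoint from the repository in the Inference Endpoints UI.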