Serving inferences from your machine learning model with SageMaker and TensorFlow

Jesus Larrubia
Jan 27, 2020

In Summary

Get your apps to make use of your trained machine learning model via standard REST requests with SageMaker and TensorFlow.

When working on a project that requires building and training a custom machine learning model, TensorFlow makes the work much easier. It provides you with most of the tools you’ll need to:

  • Handle your datasets.
  • Pre-process the data and post-process the model results.
  • Train your model.
  • Transform the information into useful visualizations (especially when running in a Jupyter notebook).
  • Save your pre-trained models.
  • Test inferences.
  • Etc.

TensorFlow is really well documented, with material ranging from complete-beginner guides to advanced topics.

Making real use of our trained model

Once your model has been designed, built and trained to infer predictions about the problem on the table, you’ll likely need to make use of it from one of your applications.

For example, let’s say you’ve trained the model on this TensorFlow example to retrieve the segmentation mask of a pet, given an image. When a user uploads a new picture of a pet, we’ll need to pre-process the image to a valid format that the model can use and post-process the result to convert the segmentation mask to something that makes sense to be used internally by our backend or visualised externally in an app. Note: in this article, we’ll focus on a model based on a convolutional neural network built with Keras but, basically, it should apply to any other model built with TensorFlow.

SageMaker to serve model inferences

Although TensorFlow already provides some tools to serve your model inferences through its API, AWS SageMaker lets you cover the rest of the deployment:

  • Host the model in a Docker container that can be deployed to your AWS infrastructure.
  • Take advantage of one of the machine-learning-optimised AWS instances. They come with different combinations of CPU, memory and network performance, and can rely on GPUs for accelerated computing.
  • Create an endpoint that can be externally invoked to request predictions.

You can either select one of the available pre-built Docker images when building a SageMaker model, or build your own container following a specific structure and deploy it via ECR. The advantages of opting for the latter are:

  • Local testing of your inference endpoint
  • Total control over the machine

Amazon has simplified the job of creating your container by publishing and documenting projects like SageMaker TensorFlow Serving Container on GitHub, with all the code you need to run the container locally pretty much out of the box, which is much appreciated.

Preparing the SageMaker TensorFlow Serving Container

Clone that repository and choose the TensorFlow version and architecture you need, then start the container that will host your model on your machine.

Save the already trained model:

import tensorflow as tf
tf.saved_model.save(model, "./savedmodels/imagesegmentation/1/")  # model: the trained Keras model

Check the model is saved correctly:

!saved_model_cli show --dir ./savedmodels/imagesegmentation/1/ --tag_set serve --signature_def serving_default
# Output:
The given SavedModel SignatureDef contains the following input(s):
  inputs['input_2'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 128, 128, 3)
      name: serving_default_input_2:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['conv2d_transpose_4'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 128, 128, 3)
      name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict

Then place it in the container to start testing it. The expected location is /opt/ml/model in the container, which is shared with your (host) machine through test/resources/models:

├── models
│   └── petsegmentation
│       ├── saved_model.pb
│       └── variables
│           └── variables.index
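
Before starting the container, it can help to confirm that the model folder has the layout shown above. The following is only a minimal sketch using the Python standard library: the function name missing_model_files is hypothetical, and it checks just the two files listed in the tree, not everything TensorFlow writes when saving a model.

```python
from pathlib import Path

# Files listed in the tree above, relative to the model folder.
EXPECTED_FILES = ("saved_model.pb", "variables/variables.index")


def missing_model_files(model_dir):
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling missing_model_files("test/resources/models/petsegmentation") should return an empty list when the folder is complete.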

At this point, your model is already reachable and predictions can be requested via HTTP…

curl -X POST --data-binary @test/resources/inputs/test-image-segmentation.json -H 'Content-Type: application/json' -H 'X-Amzn-SageMaker-Custom-Attributes: tfs-model-name=imagesegmentation' http://localhost:8080/invocations

… if you have data examples that follow exactly the structure expected by the model: in our case, a 4D tensor of floats.
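
To make that structure concrete, here is a rough sketch that builds a dummy request body and the equivalent POST request with the Python standard library. The function names are hypothetical, the 128x128x3 shape comes from the signature printed earlier, and the constant-valued image is only a placeholder for real pixel data.

```python
import json
import urllib.request

HEIGHT, WIDTH, CHANNELS = 128, 128, 3  # from the model signature


def build_request_body(pixel_value=0.0):
    """Return a JSON string holding one image as a 4D tensor of floats."""
    image = [[[pixel_value] * CHANNELS for _ in range(WIDTH)] for _ in range(HEIGHT)]
    # The outer list is the batch dimension, giving shape (1, 128, 128, 3).
    return json.dumps({"inputs": [image]})


def build_invocation_request(url, body, model_name):
    """Build (but do not send) the POST request that the curl command issues."""
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Amzn-SageMaker-Custom-Attributes": f"tfs-model-name={model_name}",
        },
        method="POST",
    )
```

With the container running, passing the result of build_invocation_request("http://localhost:8080/invocations", build_request_body(), "imagesegmentation") to urllib.request.urlopen would send the request.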

However, the goal is to make the endpoint as accessible as possible to our apps and to take any heavy calculations away from them. So, we’ll simplify the design of the API to work with standard application/json requests.

The pre- and post-inference data processing can be achieved by implementing the methods input_handler and output_handler (or just handler, which covers both) in a Python file (inference.py in the container’s convention) that must be placed in a folder named code, alongside a requirements.txt listing the packages to install. The final structure would look like:

├── models
│   ├── code
│   │   ├── inference.py
│   │   └── requirements.txt
│   └── petsegmentation
│       ├── saved_model.pb
│       └── variables
│           └── variables.index

The handler file (inference.py) looks like this; requirements.txt lists the extra packages to install:

# inference.py
import tensorflow as tf
import numpy as np
import requests
import base64
import json
import os

# image_to_array() and mask_to_image() are helper functions not shown in this article.


def handler(data, context):
    """Handle request.

    Args:
        data (obj): the request data
        context (Context): an object containing request and configuration details

    Returns:
        (bytes, string): data to return to client, (optional) response content type
    """
    processed_input = _process_input(data, context)
    # Forward the pre-processed payload to the TensorFlow Serving REST API.
    response = requests.post(context.rest_uri, data=processed_input)
    return _process_output(response, context)


def _process_input(data, context):
    if context.request_content_type == 'application/json':
        decoded_data = data.read().decode('utf-8')
        # Converts the JSON object to an array
        # that meets the model signature.
        image_array = image_to_array(decoded_data)
        return json.dumps({"inputs": [image_array]})

    raise ValueError('{{"error": "unsupported content type {}"}}'.format(
        context.request_content_type or "unknown"))


def _process_output(data, context):
    if data.status_code != 200:
        raise ValueError(data.content.decode('utf-8'))

    response_content_type = context.accept_header
    prediction = json.loads(data.content.decode('utf-8'))
    # Convert the raw segmentation mask into something the client can use.
    segmented_image = mask_to_image(prediction["outputs"])

    return json.dumps({"segmented_image": segmented_image}), response_content_type
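
The two helpers called by the handler, image_to_array and mask_to_image, are not shown in the article. The sketch below is one possible shape for them, not the original implementation. It assumes a hypothetical payload of the form {"image": "<base64 of 128*128*3 raw RGB bytes>"}, scales pixels to the [0, 1] range (which may not match how your model was trained), and uses plain Python lists to stay dependency-free; NumPy would be the more usual and faster choice.

```python
import base64
import json

HEIGHT, WIDTH, CHANNELS = 128, 128, 3  # from the model signature


def image_to_array(decoded_data):
    """JSON string -> nested list of floats with shape (128, 128, 3).

    Assumed payload: {"image": "<base64 of 128*128*3 raw RGB bytes>"}.
    """
    raw = base64.b64decode(json.loads(decoded_data)["image"])
    if len(raw) != HEIGHT * WIDTH * CHANNELS:
        raise ValueError("expected 128x128 raw RGB bytes")
    # Scale each byte to [0, 1]; this normalisation is an assumption.
    values = [byte / 255.0 for byte in raw]
    rows = []
    for y in range(HEIGHT):
        start = y * WIDTH * CHANNELS
        rows.append([
            values[start + x * CHANNELS:start + (x + 1) * CHANNELS]
            for x in range(WIDTH)
        ])
    return rows


def mask_to_image(outputs):
    """Model outputs of shape (1, H, W, classes) -> H x W grid of class ids."""
    scores = outputs[0]  # drop the batch dimension
    # For every pixel, keep the index of the highest-scoring class.
    return [[max(range(len(px)), key=px.__getitem__) for px in row] for row in scores]
```

Both functions return plain lists, so their results can be passed straight to json.dumps as the handler does.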

Now, our container is ready to be used by real applications.
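
Before deploying, the input-processing logic can also be exercised without starting the container. The sketch below fakes the two arguments the container passes to a handler: a readable byte stream and a context object. Both function names are hypothetical, only the context fields used in this article are filled in, and the rest_uri value is just a placeholder.

```python
import io
import json
from types import SimpleNamespace


def make_fake_context(content_type="application/json", accept="application/json"):
    """Stand-in for the container's context object (only the fields used here)."""
    return SimpleNamespace(
        request_content_type=content_type,
        accept_header=accept,
        # Placeholder URI; the real value is provided by the container.
        rest_uri="http://localhost:8501/v1/models/imagesegmentation:predict",
    )


def run_input_handler(process_input, payload):
    """Call an input-processing function the way the container would."""
    stream = io.BytesIO(json.dumps(payload).encode("utf-8"))
    return process_input(stream, make_fake_context())
```

For example, run_input_handler(_process_input, {...}) would return the JSON string that gets posted to TensorFlow Serving, which can then be checked in a unit test.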

Deploying the container and creating the endpoint

Push your custom image to ECR by executing the script:

./scripts/ --version 1.14 --arch gpu --region eu-west-x

Compress and upload the saved model and code folder to an S3 bucket:

tar -czvf model.tar.gz imagesegmentation code
aws s3 cp model.tar.gz s3://model-bucket/mymodel/model.tar.gz
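
If you prefer to build the archive from a script, the tar step could be sketched with Python's tarfile module as below. The function name make_model_archive is hypothetical; it simply places the given folders at the root of the archive, like the tar command does.

```python
import tarfile
from pathlib import Path


def make_model_archive(base_dir, members, out_path="model.tar.gz"):
    """Create a gzipped tar with the given folders at the archive root."""
    base = Path(base_dir)
    with tarfile.open(out_path, "w:gz") as archive:
        for name in members:
            # arcname keeps paths relative, e.g. "code/requirements.txt".
            archive.add(str(base / name), arcname=name)
    return out_path
```

Calling make_model_archive(".", ["imagesegmentation", "code"]) from the models folder would produce the same model.tar.gz; the upload can stay with the aws s3 cp command.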

Create the model in SageMaker, choosing our published custom image and the uploaded model.tar.gz, and setting other parameters like TensorFlow version and architecture type:

# Create the model in SageMaker

# See the following document for more on SageMaker Roles:

aws sagemaker create-model \
--model-name $MODEL \
--primary-container Image=$IMAGE,ModelDataUrl=s3://model-bucket/mymodel/model.tar.gz \
--execution-role-arn $ROLE_ARN

Create the endpoint configuration (referencing the SageMaker model you just created) and spin up the endpoint. You’ll need to choose the instance type(s) that best fit your needs:

# You need to change the timestamp value with the output of

# It creates endpoint configuration.
aws sagemaker create-endpoint-config \
--endpoint-config-name $ENDPOINT_CONFIG_NAME \
--production-variants VariantName=TFS,ModelName=$MODEL,InitialInstanceCount=$INITIAL_INSTANCE_COUNT,InstanceType=$INSTANCE_TYPE

# Create the endpoint
aws sagemaker create-endpoint \
--endpoint-name $MODEL \
--endpoint-config-name $ENDPOINT_CONFIG_NAME
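
Endpoint creation is asynchronous, so a deployment script usually has to wait for it. The sketch below polls the endpoint status until it is ready. The function name wait_for_endpoint and its defaults are hypothetical, and the client argument is assumed to behave like boto3.client("sagemaker"), which exposes describe_endpoint.

```python
import time


def wait_for_endpoint(client, endpoint_name, poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is InService, fails, or we give up.

    `client` is expected to behave like boto3.client("sagemaker").
    """
    for _ in range(max_polls):
        status = client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"endpoint {endpoint_name} failed to deploy")
        sleep(poll_seconds)
    raise TimeoutError(f"endpoint {endpoint_name} not ready after {max_polls} polls")
```

Passing the client in as an argument keeps the function easy to test with a stub instead of a real AWS account.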

And voilà, the endpoint will be live and available for serving inferences in a few minutes. The invoke-endpoint command lets you double-check that the deployment was successful:

# output.json is an arbitrary name for the file that receives the response
aws sagemaker-runtime invoke-endpoint \
--endpoint-name petsegmentation \
--content-type=application/json \
--body "file://./test/resources/inputs/test-image-segmentation.json" \
output.json

If everything works as expected, you should get back a JSON document with the processed prediction.
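
From application code, the same call might look like the sketch below. The function name request_segmentation and the shape of the payload are hypothetical; runtime_client is assumed to behave like boto3.client("sagemaker-runtime"), whose invoke_endpoint call returns the response body as a stream.

```python
import json


def request_segmentation(runtime_client, endpoint_name, payload):
    """Send a JSON payload to the endpoint and return the parsed JSON reply.

    `runtime_client` is expected to behave like boto3.client("sagemaker-runtime").
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload).encode("utf-8"),
    )
    # The response body is a stream; read it fully before decoding.
    return json.loads(response["Body"].read().decode("utf-8"))
```

With the handler from this article, the returned dictionary would carry the processed mask under the segmented_image key.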



Originally published at