Recognising indoor scenes with Custom Vision

Jesus Larrubia
10 min read · Nov 16, 2020

In Summary

In this article, I’ll run through the possibilities and tradeoffs of Microsoft Custom Vision. I’ll use a 2009 MIT study on classifying indoor images as a baseline, compare its results with mine and base the final conclusion on empirical observations.

Custom Vision has been on my radar for a while. The platform, created by Microsoft and part of the Azure ecosystem, allows users to easily upload and tag images to build and train custom Machine Learning (ML) models that can be used to perform classification and object detection. Once a model has been sufficiently trained, it can be deployed with a few clicks and be used as an API.

Custom Vision logo

Sounds exciting, doesn’t it? Would this tool be capable of saving us some of the most arduous steps when building a custom ML model? And what would the accuracy of the predictions be? In this article, I’ll share my findings.

Recognising indoor scenes

I decided to use an MIT research paper from 2009 to test the capabilities of Custom Vision.

Obviously, AI and ML have progressed enormously since then, but the research caught my attention for a couple of reasons:

  • It highlights how difficult it is to classify indoor images compared with, for example, outdoor scenes.
  • They published the labelled image dataset used in the research.

The paper

Details and conclusions from the research can be found in the paper Recognizing Indoor Scenes, but here are a few interesting notes from it:

  • They created a dataset of 15620 images, classified into 67 different categories.
  • Images were obtained from different online sources (Google, Altavista, Flickr…) with heterogeneous sizes and proportions (minimum resolution of 200 pixels in the smallest axis).
  • All images in the dataset have the same file format (jpg).
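A quick way to check those figures against a downloaded copy is to count the files in each category folder. The following is a minimal sketch, assuming the dataset is laid out as one sub-folder per category (the layout the scripts later in this article rely on). The function name and the demo category names are hypothetical.

```python
import os
import tempfile


def count_images_per_category(images_folder):
    """Return {category: number of files}, skipping hidden entries such as .DS_Store."""
    counts = {}
    for category in sorted(os.listdir(images_folder)):
        category_path = os.path.join(images_folder, category)
        if category.startswith('.') or not os.path.isdir(category_path):
            continue
        counts[category] = len(
            [name for name in os.listdir(category_path) if not name.startswith('.')]
        )
    return counts


# Demo on a throwaway folder tree, so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as demo_root:
    for category, n_images in (("bakery", 3), ("kitchen", 2)):
        os.mkdir(os.path.join(demo_root, category))
        for i in range(n_images):
            with open(os.path.join(demo_root, category, "img_" + str(i) + ".jpg"), "w"):
                pass
    demo_counts = count_images_per_category(demo_root)

print(demo_counts)  # {'bakery': 3, 'kitchen': 2}
```

Pointed at the real dataset folder, `len(counts)` and `sum(counts.values())` should match the category and image totals quoted above.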


Custom Vision can be used in two ways: via its console, a UI that lets users drag and drop images, tag them and train/test the models, or via its SDK.

From my point of view, the console looks very simple, intuitive and easy to use. It achieves what it’s meant to: it makes AI accessible to a wider audience, including people with no deep knowledge of how machine learning works behind the scenes.

The SDK (available in a considerable number of languages), on the other hand, allowed us to script some of the steps. I opted for a Jupyter notebook to carry out the steps required to:

  • Create the training and validation subsets
  • Upload and tag the dataset images
  • Train and deploy the model
  • Validate the model via API inferences

Step 1 — Creating the training/validation subsets

After creating a new image classification project on Custom Vision and downloading the images (2.4GB), the first step is to select which images will be used for training and which for validation.

As an interesting aside, Custom Vision offers a free tier where you can use up to 5000 images and 50 tags per project. These figures seemed reasonable for the prototype, although the tag limit means only 50 of the 67 categories can be used. Following their recommendations, I used 50 images per category, which means we’d train our model with a total of 50 tags x 50 images = 2500 images.
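The arithmetic is simple, but it is worth having as a check when changing the number of images per tag. This is a small sketch that uses the free-tier figures quoted above as constants. The limits may have changed since the article was written, and the function name is hypothetical.

```python
# Free-tier limits as quoted in this article (verify against the current Azure documentation).
MAX_IMAGES_PER_PROJECT = 5000
MAX_TAGS_PER_PROJECT = 50


def training_budget(categories_available, images_per_tag):
    """Return (tags_used, total_images), or raise if the image limit would be exceeded."""
    tags_used = min(categories_available, MAX_TAGS_PER_PROJECT)
    total_images = tags_used * images_per_tag
    if total_images > MAX_IMAGES_PER_PROJECT:
        raise ValueError(
            str(total_images) + " images exceed the limit of " + str(MAX_IMAGES_PER_PROJECT)
        )
    return tags_used, total_images


tags_used, total_images = training_budget(categories_available=67, images_per_tag=50)
print(tags_used, total_images)  # 50 2500
```

With these limits, the largest training set that fits is 100 images per tag (50 x 100 = 5000).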

The following code will create 2 separate folders to divide images into training and validation.

import os
import shutil
from os import path

# Change this parameter to modify the number of training images.
number_images_training = 50
images_folder = 'indoor_Images'
training_folder = str(number_images_training) + '_training_' + images_folder
validation_folder = str(number_images_training) + '_validation_' + images_folder

# Create the training/validation folders (removing the output of any previous run first).
if path.isdir(training_folder):
    shutil.rmtree(training_folder)
os.mkdir(training_folder)
if path.isdir(validation_folder):
    shutil.rmtree(validation_folder)
os.mkdir(validation_folder)

# Divide images into training/validation
categories = sorted(os.listdir(images_folder))
print("Creating validation vs training folders...")
for category in categories:
    copied_count = 0

    # Create the category directories.
    if not category.startswith('.'):
        category_training_folder = training_folder + "/" + category
        category_validation_folder = validation_folder + "/" + category
        os.mkdir(category_training_folder)
        os.mkdir(category_validation_folder)

        category_images_folder = images_folder + "/" + category
        category_images = sorted(os.listdir(category_images_folder))
        for category_image in category_images:
            copied_count += 1
            # The first number_images_training images go to training, the rest to validation.
            if copied_count <= number_images_training:
                shutil.copy(category_images_folder + "/" + category_image, category_training_folder + "/" + category_image)
            else:
                shutil.copy(category_images_folder + "/" + category_image, category_validation_folder + "/" + category_image)
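The script above picks the training images in alphabetical order of their file names. If the file names correlate with the image source, a seeded random split may give a more representative training set. The following sketch shows that variation, not the approach used for the results in this article. The function name and the seed value are hypothetical.

```python
import random


def split_images(image_names, n_training, seed=42):
    """Shuffle the names reproducibly and return (training_names, validation_names)."""
    shuffled = sorted(image_names)          # sort first so the result does not depend on listing order
    random.Random(seed).shuffle(shuffled)   # dedicated generator, leaves the global random state alone
    return shuffled[:n_training], shuffled[n_training:]


names = ["img_%02d.jpg" % i for i in range(12)]
training, validation = split_images(names, n_training=5)
print(len(training), len(validation))  # 5 7
```

Because the generator is seeded, rerunning the notebook produces the same split, which keeps later validation figures comparable between runs.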


Step 2 — Uploading and tagging the dataset images

2.1. Set up the project

Before we can start working with our dataset, we’ll need to set up the Custom Vision project. You can either create a new one or, as in our case, reuse an existing project created via the console by referencing its id.

from azure.cognitiveservices.vision.customvision.training import CustomVisionTrainingClient
from azure.cognitiveservices.vision.customvision.training.models import ImageFileCreateBatch, ImageFileCreateEntry
from msrest.authentication import ApiKeyCredentials

# Replace with valid values
ENDPOINT = "<your endpoint>"
project_id = "<your project id>"
training_key = "<your training key>"
prediction_key = "<your prediction key>"
prediction_resource_id = "<your prediction resource id>"
publish_iteration_name = "iteration name"

credentials = ApiKeyCredentials(in_headers={"Training-key": training_key})
trainer = CustomVisionTrainingClient(ENDPOINT, credentials)

# If you need to create a new project
# print ("Creating project...")
# project = trainer.create_project("My New Project")

The required parameters can be found under the settings tab of the project via the console.
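Since these values include secret keys, one option is to keep them out of the notebook and read them from environment variables instead. This is a minimal sketch of that idea. The variable names are hypothetical and nothing in the Custom Vision SDK requires them.

```python
import os

# Hypothetical environment variable names, chosen for this example only.
REQUIRED_SETTINGS = (
    "CUSTOM_VISION_ENDPOINT",
    "CUSTOM_VISION_PROJECT_ID",
    "CUSTOM_VISION_TRAINING_KEY",
    "CUSTOM_VISION_PREDICTION_KEY",
    "CUSTOM_VISION_PREDICTION_RESOURCE_ID",
)


def load_settings(environ=None):
    """Return the required settings as a dict, failing early if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_SETTINGS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_SETTINGS}


# Demo with a plain dict standing in for os.environ.
demo_settings = load_settings({name: "value-for-" + name for name in REQUIRED_SETTINGS})
print(sorted(demo_settings))
```

In the notebook, `load_settings()` without arguments would read the real environment, and the returned values would replace the hard-coded placeholders.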

2.2. Tags

Now, we are in a position to start creating the tags that will be used to categorise the images.

When creating our project, we selected the option “Multiclass (Single tag per image)” as the Classification type since it best fitted our problem. We’ll follow the original dataset folder structure to create one tag per category (keeping in mind the maximum 50 tags limitation).

IMPORTANT: tags are referenced by using the id (instead of name) so we’ll have to store them in some kind of data structure so they can be subsequently used.

# Create the tags from our image categories (retrieve them if they already exist).
existing_tags = trainer.get_tags(project_id)
limit = 50
tag_dictionary = {}
categories = sorted(os.listdir(images_folder))
for category in categories:
    tag_exists = next((x for x in existing_tags if x.name == category), None)
    if tag_exists:
        tag_dictionary[category] = tag_exists.id

    elif not category.startswith('.') and len(tag_dictionary) < limit:
        print('Creating tag ' + category + '...')
        tag = trainer.create_tag(project_id, category)
        tag_dictionary[category] = tag.id

2.3. Uploading the training images

Once we have our tags in Custom Vision, we can upload the images. We’ll iterate through our categories and upload each image together with its tag. To speed up the process, the images are uploaded in batches of up to 64 (the maximum the SDK accepts per batch).

import time

def upload_batch(image_batch):
    start = time.time()
    upload_result = trainer.create_images_from_files(project_id, ImageFileCreateBatch(images=image_batch))
    total_time = time.time() - start
    if upload_result.is_batch_successful:
        print('Batch successfully uploaded in ' + str(total_time) + ' s')
    else:
        print("Image batch upload failed.")
        for image in upload_result.images:
            print("Image status: ", image.status)

# Upload and tag images
print("Adding images...")
image_batch = []
batch_limit = 64
batch_count = 0
for training_category in tag_dictionary.keys():
    if not training_category.startswith('.'):
        category_training_folder = training_folder + "/" + training_category
        category_images = os.listdir(category_training_folder)
        for category_image in category_images:
            image_path = category_training_folder + '/' + category_image
            with open(image_path, "rb") as image_contents:
                image_batch.append(ImageFileCreateEntry(name=category_image, contents=image_contents.read(), tag_ids=[tag_dictionary[training_category]]))
            # Send the batch as soon as it is full and start a new one.
            if len(image_batch) == batch_limit:
                print("Reached batch limit " + str(batch_limit) + '. Uploading images belonging to batch ' + str(batch_count) + '...')
                upload_batch(image_batch)
                batch_count += 1
                image_batch = []

# Upload whatever is left in the last, partially filled batch.
if len(image_batch) > 0:
    print('Uploading images belonging to batch ' + str(batch_count) + '...')
    upload_batch(image_batch)
    image_batch = []
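The batching logic in the script above (fill a list, flush at 64, flush the remainder at the end) can also be isolated in a small helper, which makes it easier to test without touching the service. This is a sketch of that helper only. The name `chunked` is hypothetical and it is not part of the Custom Vision SDK.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 150 entries split into batches of at most 64.
batches = list(chunked(list(range(150)), 64))
print([len(batch) for batch in batches])  # [64, 64, 22]
```

With such a helper, the upload loop could first collect every entry into a single list and then call the upload function once per chunk.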


One of the tasks that can take a substantial amount of time when working with image recognition models is preprocessing. A custom model will usually only accept images in a specific format and with a fixed size, so it falls to the software or data engineer to build the mechanisms that convert the images into the correct shape. None of this was needed with Custom Vision: we could send our images to the service as they were and they just worked. I consider this a big step forward, especially when time and budget constraints play an important role in the project.
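To give an idea of the work being skipped, here is a sketch of one common preprocessing step for a hand-built model: scaling an image so its shorter side matches a fixed target and then cropping the centre to a square. It only computes the geometry, so it runs without an imaging library. The function name and the 224-pixel target are assumptions for illustration, and this is not something Custom Vision asks you to do.

```python
def resize_and_center_crop_box(width, height, target):
    """Return (resized_size, crop_box) for a 'resize shorter side, then centre crop' step.

    resized_size is (new_width, new_height); crop_box is (left, top, right, bottom)
    in the coordinates of the resized image.
    """
    shorter = min(width, height)
    new_width = round(width * target / shorter)
    new_height = round(height * target / shorter)
    left = (new_width - target) // 2
    top = (new_height - target) // 2
    return (new_width, new_height), (left, top, left + target, top + target)


# A 400x200 landscape image prepared for a model expecting 224x224 inputs.
print(resize_and_center_crop_box(400, 200, 224))  # ((448, 224), (112, 0, 336, 224))
```

In a real pipeline these numbers would be passed to an imaging library to perform the actual resize and crop, for every image, before training and before every inference.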

Uploaded indoor images

Step 3 — Training and deploying the model

To fit our model, we just need to call the train_project method and the magic will happen. Custom Vision does the hard work for us: there is no need to choose the best learning algorithm, design the NN layout or tweak the model parameters.

# Train and publish the project
import time

print("Training...")
start = time.time()
iteration = trainer.train_project(project_id)
while iteration.status != "Completed":
    iteration = trainer.get_iteration(project_id, iteration.id)
    print("Training status: " + iteration.status)
    # Wait a few seconds between status checks.
    time.sleep(10)

total_time = time.time() - start
print('Training was successfully completed in ' + str(total_time) + ' s')

# The iteration is now trained. Publish it to the project endpoint
trainer.publish_iteration(project_id, iteration.id, publish_iteration_name, prediction_resource_id)
print("Done!")
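The loop above keeps polling until the status becomes "Completed", so it would never end if training stopped in any other final state. A more defensive variant adds a timeout and a set of failure states. This is a generic sketch, independent of the SDK: the helper name is hypothetical, and treating "Failed" as a terminal status is an assumption rather than something taken from the Custom Vision documentation.

```python
import time


def wait_for_status(get_status, done="Completed", failed=("Failed",),
                    poll_seconds=10, timeout_seconds=3600,
                    sleep=time.sleep, clock=time.monotonic):
    """Call get_status() until it returns `done`; raise on a failed status or on timeout."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status == done:
            return status
        if status in failed:
            raise RuntimeError("Stopped with status: " + status)
        if clock() >= deadline:
            raise TimeoutError("Still '" + status + "' after " + str(timeout_seconds) + " s")
        sleep(poll_seconds)


# Demo with a fake status source and a no-op sleep, so it finishes instantly.
fake_statuses = iter(["Training", "Training", "Completed"])
print(wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None))  # Completed
```

In the notebook, `get_status` could be a small lambda that calls `trainer.get_iteration(project_id, iteration.id)` and returns its `status`.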

The training iteration took 11 minutes 28 seconds for our dataset composed of 2500 images. The free tier allowed us to use up to 1 hour of training and 20 iterations per month.

The results

From the console, we can access the performance of our trained model. In our case, after 2 iterations, the model showed very good Precision (85.6%) and AP (89.9%) values.

Model performance summary

Testing the inference API

So far so good. However, when creating an ML model we should keep an eye on how it’ll behave when deployed to production. Ideally, we should calculate the accuracy of the production model with images retrieved from a totally different source, but for this purpose we’ll make use of our separate validation dataset.

The following script will select 10 random images from each category to check whether the predictions returned by the inference API are correct. Following the default Custom Vision configuration, we’ll only count a prediction as a true positive if it has the right tag and a probability score of more than 50%.

from azure.cognitiveservices.vision.customvision.prediction import CustomVisionPredictionClient
from msrest.authentication import ApiKeyCredentials
import time
from random import randrange

# Now there is a trained endpoint that can be used to make a prediction
prediction_credentials = ApiKeyCredentials(in_headers={"Prediction-key": prediction_key})
predictor = CustomVisionPredictionClient(ENDPOINT, prediction_credentials)

category_predictions = 10
correct = 0
failed = 0
api_errors = 0
total_prediction_time = 0
results = {}

# Create the structure to calculate the inference results.
for validation_category in tag_dictionary.keys():
    results[validation_category] = {
        'true_positive': 0,
        'false_positive': 0,
        'false_negative': 0
    }

for validation_category in tag_dictionary.keys():
    if not validation_category.startswith('.'):
        category_validation_folder = validation_folder + "/" + validation_category
        category_images = os.listdir(category_validation_folder)
        current_category_prediction = 0

        while current_category_prediction < category_predictions:
            random_image = category_images[randrange(len(category_images))]
            image_path = category_validation_folder + '/' + random_image
            print('Image: ' + image_path)

            start = time.time()
            with open(image_path, "rb") as image_contents:
                prediction_result = predictor.classify_image(project_id, publish_iteration_name, image_contents.read())

            # Use the first prediction in the response, if there is one.
            prediction = prediction_result.predictions[0] if prediction_result.predictions else None

            if prediction:
                prediction_time = time.time() - start
                total_prediction_time += prediction_time

                print('Prediction: ' + prediction.tag_name)
                print('Prediction confidence: ' + str(prediction.probability))
                print('Prediction time: ' + str(prediction_time))

                if prediction.tag_name == validation_category and prediction.probability * 100 > 50:
                    results[validation_category]['true_positive'] += 1
                    correct += 1
                    print('Is correct! :)')
                else:
                    results[validation_category]['false_negative'] += 1
                    results[prediction.tag_name]['false_positive'] += 1
                    failed += 1
                    print('Failed! :(')
            else:
                api_errors += 1
                print('API ERROR :/')

            current_category_prediction += 1
            # Ensure use of free tier maximum (2 inferences per second).
            time.sleep(0.5)

total = correct + failed + api_errors
total_predicted = correct + failed

# Calculate precision (averaged over categories)
precision_sum = 0
for category in results.keys():
    if results[category]['true_positive'] > 0:
        category_precision = results[category]['true_positive'] / (results[category]['true_positive'] + results[category]['false_positive'])
    else:
        category_precision = 0
    precision_sum += category_precision
precision = precision_sum / len(results)

# Calculate recall (averaged over categories)
recall_sum = 0
for category in results.keys():
    if results[category]['true_positive'] > 0:
        category_recall = results[category]['true_positive'] / category_predictions
    else:
        category_recall = 0
    recall_sum += category_recall
recall = recall_sum / len(results)

print('Total: ' + str(total))
print('API errors: ' + str(api_errors))
print('Correct predictions: ' + str(correct))
print('Failed predictions: ' + str(failed))
print('Precision: ' + str(precision))
print('Recall: ' + str(recall))
print('Average prediction time: ' + str(total_prediction_time / (total_predicted)))

The script shows the following results:

  • Total: 500
  • API errors: 0
  • Correct predictions: 363
  • Failed predictions: 137
  • Precision: 0.75
  • Recall: 0.73
  • Average prediction time: 0.53 seconds

As we can observe, the precision and recall metrics have dropped noticeably compared with the figures shown in the console. Still, taking into account the small training dataset and the reduced number of iterations, the model behaves fairly well (although this accuracy could be insufficient depending on the requirements of your project).
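The precision and recall figures above are averages of per-category values (macro-averaging). The computation can be sketched in isolation from the API calls as follows, using made-up outcomes rather than the real predictions. The function name is hypothetical, and unlike the script, this version only counts a false positive when the predicted tag differs from the expected one.

```python
from collections import Counter


def macro_precision_recall(outcomes, threshold=0.5):
    """outcomes: iterable of (expected_tag, predicted_tag, probability) tuples.

    Returns (precision, recall), each averaged over the expected tags.
    """
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    expected_tags = set()
    for expected, predicted, probability in outcomes:
        expected_tags.add(expected)
        if predicted == expected and probability > threshold:
            true_pos[expected] += 1
        else:
            false_neg[expected] += 1
            if predicted != expected:
                false_pos[predicted] += 1

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precisions = [ratio(true_pos[t], true_pos[t] + false_pos[t]) for t in expected_tags]
    recalls = [ratio(true_pos[t], true_pos[t] + false_neg[t]) for t in expected_tags]
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)


# Made-up outcomes for two categories.
outcomes = [
    ("kitchen", "kitchen", 0.9),  # correct
    ("kitchen", "bakery", 0.7),   # wrong tag
    ("bakery", "bakery", 0.8),    # correct
    ("bakery", "bakery", 0.4),    # right tag, but below the 50% threshold
]
precision, recall = macro_precision_recall(outcomes)
print(precision, recall)  # 0.75 0.5
```

Because every category is tested with the same number of images, the macro-averaged recall equals the overall share of correct predictions, which is consistent with 363 / 500 ≈ 0.73 in the results above.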

2009 results

As an interesting note, the best-performing approach among the experiments described in the original paper, published in 2009, achieved an average precision of 26%.

Comparing that figure with the results of our quick 2020 prototype (implemented in a few hours with a highly automated tool) highlights the progress of AI over the last few years.


Unfortunately, it’s not always sunny in Philadelphia. The service has some tradeoffs that should be taken into account if you are choosing the best tool to cover the requirements of your project.

The main limitation you need to accept when using Custom Vision is that you can’t export your trained model to run in an external environment unless you choose a “compact” version. With a standard model you’ll always rely on Microsoft Azure to provide inferences, and you won’t have control to tweak model parameters or scale as you need (though this might not be an issue unless the requirements of your project demand it). To be fair, this seems a logical move from Microsoft to ensure their clients stay with them after building a custom model with minimum effort.

The aforementioned “compact” models are reduced versions, with some limitations, designed to be deployed on devices with limited resources (IoT devices or mobile phones). They are very useful for that kind of deployment, but they are not enough when you want to export and run the full model on a third-party platform (or train it locally). In a different article, I’ll test and compare the results of both the standard and compact versions.

Besides, one of its main strengths, making the whole training and deployment process transparent to the user, could also be one of its main weaknesses. It is quite likely that Custom Vision won’t reach the same precision as a model created from scratch, where you can optimise and change every learning parameter depending on the results obtained in each iteration. That being said, this doesn’t seem to be the market they want to reach, at least in the short term.


My general impression of the service and its capabilities is quite positive. In spite of its limitations, I think Custom Vision is a tool to consider for projects that require a custom model but are perhaps limited by time or budget. Custom Vision would also be a good option for prototyping, or for companies without the required resources or knowledge to dive deep into the challenges of AI and ML.