Analyzing a Stored Video Using a Custom Model

Identify scene-based features and objects, and detect faces and label frames in a video by calling a video analysis custom model.

The maximum size and duration of each video is shown in the Limits section.

For more information about video analysis, see the section on Stored Video Analysis.

Follow these steps to use a custom model in Vision. Metrics are available to analyze the custom model's performance.

Create the Dataset

Vision custom models are intended for users without a data science background. By creating a dataset, and instructing Vision to train a model based on the dataset, you can have a custom model ready for your scenario.

The key to building a useful custom model is preparing and training it with a good dataset.

Data labeling is the process of identifying properties of records, such as documents, text, and images, and annotating them with labels to identify those properties. The caption of an image and the identification of an object in an image are both examples of a data label. You can use Oracle Cloud Infrastructure Data Labeling to do the data labeling. For more information, see the Data Labeling service guide. Here is an outline of the steps to take:

  1. Collect enough images that match the distribution of the intended application.

    When choosing how many images are needed for your dataset, use as many images as you can in your training dataset. For each label to be detected, provide at least 10 images for the label. Ideally, provide 50 or more images per label. The more images you provide, the better the detection robustness and accuracy. Robustness is the ability to generalize to new conditions, such as view angle or background.

  2. Collect a variety of other images to capture different camera angles, lighting conditions, backgrounds, and other conditions.

    Collect a dataset that's representative of the problem and space you intend to apply the trained model on. While data from other domains might work, a dataset generated from the same intended devices, environments, and conditions of use, outperforms any other.

    Provide enough perspectives for the images, as the model uses not only the annotations to learn what is correct, but also the background to learn what is wrong. For example, provide views from different sides of the object detected, with different lighting conditions, from different image capture devices, and so on.
  3. Label all instances of the objects that occur in the sourced dataset.
    Keep the labels consistent. If you label many apples together as one apple, do so consistently in every image. Don't leave space between the objects and the bounding box; the bounding boxes must closely match the labeled objects.
    Important

    Verify each annotation, because annotation quality is important to the model's performance.
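The label-count guidelines above can be checked programmatically before training. The following sketch assumes a hypothetical in-memory annotation format (a list of records, each with an `image` name and a `labels` list); the real Data Labeling export format may differ.

```python
from collections import Counter

def validate_label_counts(records, minimum=10, recommended=50):
    """Count how many images carry each label in a hypothetical
    annotation list, and flag labels below the guideline thresholds."""
    counts = Counter(label for rec in records for label in rec["labels"])
    warnings = {}
    for label, n in counts.items():
        if n < minimum:
            warnings[label] = f"only {n} images; at least {minimum} required"
        elif n < recommended:
            warnings[label] = f"{n} images; {recommended}+ recommended"
    return counts, warnings

# Hypothetical annotation records for illustration.
records = [
    {"image": "img001.jpg", "labels": ["apple"]},
    {"image": "img002.jpg", "labels": ["apple", "pear"]},
]
counts, warnings = validate_label_counts(records)
```

A check like this is cheap to run on every dataset revision, and catches under-represented labels before you spend hours of training time on them.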

Building a Custom Model

Build custom models in Vision to extract insights from images without needing data scientists.

You need the following before building a custom model:
  • A paid tenancy account in Oracle Cloud Infrastructure.
  • Familiarity with Oracle Cloud Infrastructure Object Storage.
  • The correct policies.

Train the Custom Model

After creating your dataset, you can train your custom model.

Train your model using one of Vision's custom model training modes. The training modes are:
  • Recommended training: Vision automatically selects the training duration to create the best model. The training might take up to 24 hours.
  • Quick training: This option produces a model that's not fully optimized but is available in about an hour.
  • Custom duration: This option lets you set your own maximum training duration.

The best training duration depends on the complexity of your detection problem, the typical number of objects in an image, the image resolution, and other factors. Consider these needs, and allocate more time as the training complexity increases. The minimum recommended training time is 30 minutes. A longer training time gives greater accuracy, but with diminishing returns over time. Use the quick training mode to get an idea of the smallest amount of time it takes to get a model with reasonable performance. Use the recommended mode to get a base optimized model. If you want a better result, increase the training time.
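The three training modes map naturally onto a request body for creating a model. The sketch below builds such a body as a plain dictionary; the field names (`isQuickMode`, `maxTrainingDurationInHours`, and so on) are modeled on the OCI Vision CreateModel API but are assumptions here and should be verified against the current API reference.

```python
def build_training_request(compartment_id, dataset_id, mode, hours=None):
    """Sketch of a CreateModel-style request body for the three
    training modes. Field names are assumptions modeled on the OCI
    Vision API; check them against the API reference before use."""
    body = {
        "compartmentId": compartment_id,
        "modelType": "OBJECT_DETECTION",
        "trainingDataset": {
            "datasetType": "DATA_SCIENCE_LABELING",  # assumed dataset source
            "datasetId": dataset_id,
        },
    }
    if mode == "quick":
        # Quick training: a non-optimized model in about an hour.
        body["isQuickMode"] = True
    elif mode == "custom":
        # Custom duration: caller sets the maximum training time.
        if hours is None or hours < 0.5:
            raise ValueError("custom training needs at least 0.5 hours")
        body["maxTrainingDurationInHours"] = hours
    # "recommended" mode: omit both fields and let the service choose
    # the duration (which might take up to 24 hours).
    return body
```

Keeping the duration logic in one helper makes the 30-minute floor and the mode choice explicit, instead of scattering magic numbers through scripts.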

Call the Custom Model

You call a custom model the same way you call a pretrained model.

You can call the custom model to analyze images as a single request, or as a batch request. You must first have created a dataset and trained the custom model.
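The only difference from a pretrained-model call is that the feature entry carries the custom model's OCID. The sketch below assembles a single-request body with an inline image; the field names (`featureType`, `modelId`, the `INLINE` source) mirror the OCI Vision AnalyzeImage REST API but are assumptions here and should be checked against the API reference.

```python
import base64

def build_analyze_request(image_bytes, model_id, compartment_id):
    """Sketch of an AnalyzeImage-style request body that routes an
    object-detection feature to a custom model via its OCID. Field
    names are assumptions modeled on the OCI Vision REST API."""
    return {
        "compartmentId": compartment_id,
        "image": {
            "source": "INLINE",
            # Inline images are sent base64-encoded in the request body.
            "data": base64.b64encode(image_bytes).decode("ascii"),
        },
        "features": [
            {
                "featureType": "OBJECT_DETECTION",
                # Omitting modelId would fall back to the pretrained model.
                "modelId": model_id,
            }
        ],
    }
```

For batch analysis, the same feature entry would instead accompany an Object Storage input location in an asynchronous job request.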

Custom Model Metrics

The following metrics are provided for custom models in Vision.

mAP@0.5 score
The mean Average Precision (mAP) score with a threshold of 0.5 is provided only for custom object detection models. It is calculated by taking the mean of the Average Precision over all classes. It ranges from 0.0 to 1.0, where 1.0 is the best result.
Precision
The fraction of relevant instances among the retrieved instances.
Recall
The fraction of relevant instances that were retrieved.
Threshold
The decision threshold to make a class prediction for the metrics.
Total images
The total number of images used for training and testing.
Test images
The number of images from the dataset that were used for testing and not used for training.
Training duration
The length of time in hours that the model was trained.
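The precision, recall, and mAP definitions above can be made concrete with a few lines of arithmetic. This is a minimal sketch of the standard formulas (precision = TP / (TP + FP), recall = TP / (TP + FN), mAP = mean of per-class Average Precision), not the service's internal evaluation code.

```python
def precision_recall(tp, fp, fn):
    """Precision: fraction of predictions that were correct.
    Recall: fraction of actual instances that were found."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class Average Precision values."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Example: 8 true positives, 2 false positives, 4 false negatives.
p, r = precision_recall(tp=8, fp=2, fn=4)
# Hypothetical per-class AP values at an IoU threshold of 0.5.
m = mean_average_precision({"apple": 0.9, "pear": 0.7})
```

Note the trade-off the Threshold metric controls: raising the decision threshold typically increases precision (fewer false positives) at the cost of recall (more false negatives).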