Multimodel Serving

Multimodel Serving (MMS) introduces a new capability where you can deploy and manage several machine learning models as a group through a construct called a Model Group.

This approach marks a shift from traditional single-model deployments to more scalable, efficient, and cost-effective model deployments. By sharing infrastructure components such as compute instances and load balancers across several models, you can dramatically reduce overheads.

Also, this feature supports live updates, enabling you to update or replace models in an existing deployment without causing downtime or requiring infrastructure recreation. This ensures high availability and operational continuity for production workloads.

The model grouping mechanism is also designed with immutability and versioning, for robust lifecycle tracking, reproducibility, and safe iteration of deployments. Use cases such as stacked LLM inferencing, where several fine-tuned models share a base model, are now supported, bringing better performance and resource utilisation on shared GPUs.

Benefits:

  • Deploy up to 500 models (subject to instance shape limitations) in a single deployment.

  • Share infrastructure (CPU, GPU, memory) across models.

  • Seamless Live Update capability.

  • Enhanced support for Stacked LLM inferencing.

  • Improved cost-efficiency and manageability.

A Model Group is a logical construct used to encapsulate several machine learning models into a single, version-controlled unit. With a Model Group, you can group deployments, share resources, and perform live updates while maintaining immutability and reproducibility. For more information about Model Groups, see the Model Groups documentation.

Step 1: Deploy the Model Group

  • Ensure that the appropriate policies are applied. For more information, see Model Group Policies.
    1. Follow the steps in Creating a Model Deployment up to the model section.
    2. Select Models.
    3. Select Model Groups.
    4. Select the Model Group to deploy.

      Find the model group by selecting the compartment and project, or by selecting Using OCID and entering the model group's OCID.

    5. Select Submit.
  • Create the model deployment:
    # 1. Create the model group configuration details object
        model_group_config_details = ModelGroupConfigurationDetails(
            model_group_id="ocid1.modelgroup.oc1..exampleuniqueID",
            bandwidth_mbps=<bandwidth-mbps>,
            instance_configuration=<instance-configuration>,
            scaling_policy=<scaling-policy>
        )
     
     
    # 2. Create the infrastructure configuration details object
        infrastructure_config_details = InstancePoolInfrastructureConfigurationDetails(
            infrastructure_type="INSTANCE_POOL",
            instance_configuration=<instance-configuration>,
            scaling_policy=<scaling-policy>
        )
     
    # 3. Create the environment configuration
        environment_config_details = ModelDeploymentEnvironmentConfigurationDetails(
            environment_configuration_type="DEFAULT",
            environment_variables={"WEB_CONCURRENCY": "1"}
        )
      
    # 4. Create the category log details
        category_log_details = CategoryLogDetails(
            access=LogDetails(
                log_group_id=<log-group-id>,
                log_id=<log-id>
            ),
            predict=LogDetails(
                log_group_id=<log-group-id>,
                log_id=<log-id>
            )
        )
     
    # 5. Bundle everything into the deployment configuration
        model_group_deployment_config_details = ModelGroupDeploymentConfigurationDetails(
            deployment_type="MODEL_GROUP",
            model_group_configuration_details=model_group_config_details,
            infrastructure_configuration_details=infrastructure_config_details,
            environment_configuration_details=environment_config_details
        )
     
    # 6. Set up the parameters required to create a new model deployment
        create_model_deployment_details = CreateModelDeploymentDetails(
            display_name=<deployment-name>,
            description=<description>,
            compartment_id=<compartment-id>,
            project_id=<project-id>,
            model_deployment_configuration_details=model_group_deployment_config_details,
            category_log_details=category_log_details
        )
     
    # 7. Create the deployment using the SDK client
        response = data_science_client.create_model_deployment(
            create_model_deployment_details=create_model_deployment_details
        )
     
    print("Model Deployment OCID:", response.data.id)
    # Creation is asynchronous; see the polling sketch after this list.
  • Create the model group deployment using a JSON request payload. For example:
    {
            "displayName": "MMS Model Group Deployment",
            "description": "mms",
            "compartmentId": compartment_id,
            "projectId": project_id,
            "modelDeploymentConfigurationDetails": {
                "deploymentType": "MODEL_GROUP",
                "modelGroupConfigurationDetails": {
                    "modelGroupId": model_group_id
                },
                "infrastructureConfigurationDetails": {
                    "infrastructureType": "INSTANCE_POOL",
                    "instanceConfiguration": {
                        "instanceShapeName": "VM.Standard.E4.Flex",
                        "modelDeploymentInstanceShapeConfigDetails": {
                            "ocpus": 8,
                            "memoryInGBs": 128
                        }
                    },
                    "scalingPolicy": {
                        "policyType": "FIXED_SIZE",
                        "instanceCount": 1
                    }
                },
                "environmentConfigurationDetails": {
                    "environmentConfigurationType": "DEFAULT",
                    "environmentVariables": {
                        "WEB_CONCURRENCY": "1"
                    }
                }
            },
            "categoryLogDetails": {
                "access": {
                    "logGroupId": "ocid1.loggroup.oc1.iad.amaaaaaav66vvniaygnbicsbzb4anlmf7zg2gsisly3ychusjlwuq34pvjba",
                    "logId": "ocid1.log.oc1.iad.amaaaaaav66vvniavsuh34ijk46uhjgsn3ddzienfgquwrr7dwa4dzt4pirq"
                },
                "predict": {
                    "logGroupId": "ocid1.loggroup.oc1.iad.amaaaaaav66vvniaygnbicsbzb4anlmf7zg2gsisly3ychusjlwuq34pvjba",
                    "logId": "ocid1.log.oc1.iad.amaaaaaav66vvniavsuh34ijk46uhjgsn3ddzienfgquwrr7dwa4dzt4pirq"
                }
            }
        }
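
Model deployment creation is asynchronous. The following is a minimal sketch, assuming the data_science_client and response objects from the SDK example above, that waits for the new model group deployment to reach the ACTIVE lifecycle state:

import oci

# Fetch the deployment created above and wait until it becomes ACTIVE.
get_response = data_science_client.get_model_deployment(
    model_deployment_id=response.data.id
)
active_deployment = oci.wait_until(
    data_science_client,
    get_response,
    "lifecycle_state",
    "ACTIVE",
    max_wait_seconds=1800  # illustrative timeout
).data

print("Deployment state:", active_deployment.lifecycle_state)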

Step 2: Perform Inferencing

Supported routes:

  • /predict with the model OCID (model-ocid header) or the inference key (model-key header)

  • For more information on inference keys, see Inference Keys in Model Group.
  • Inference routing is handled internally based on these keys. Provide the keys as HTTP request headers.

  1. Update the HTTP headers. For example:
    predict_headers = {
        'Content-Type': 'application/json',
        'opc-request-id': 'test-id',
        'model-ocid': model_ocid  # or use 'model-key': model_key
    }
  2. Send a POST request to the inferencing URL:
    https://modeldeployment.<region>.oci.customer-oci.com/<model_deployment_ocid>/predict
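
A minimal sketch of an inference call, assuming an API key configuration in ~/.oci/config, the predict_headers defined above, and a hypothetical JSON payload (the payload schema depends on the deployed model server):

import oci
import requests

# Sign the request with the caller's API key (resource principals can also be used).
config = oci.config.from_file()
auth = oci.signer.Signer(
    tenancy=config["tenancy"],
    user=config["user"],
    fingerprint=config["fingerprint"],
    private_key_file_location=config["key_file"],
    pass_phrase=config.get("pass_phrase"),
)

predict_url = "https://modeldeployment.<region>.oci.customer-oci.com/<model_deployment_ocid>/predict"
payload = {"data": [[1.0, 2.0, 3.0]]}  # hypothetical input; use the schema the model expects

response = requests.post(predict_url, json=payload, headers=predict_headers, auth=auth)
print(response.status_code, response.json())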

Step 3: Update Model Deployment

    1. From the model deployments page, select the model deployment.
    2. Navigate to the model deployment details page.
    3. Select Edit.
    4. Change the options by selecting one field at a time.
      1. Select Select models.
      2. Select Model Group.
      3. Select the active model group you want to update from the model catalog.
    5. Select Select to choose the update type.
      Note

      The LIVE update type is supported only for updating the model group alone.
    6. Change the options one at a time by using Select, Change shape or Show advanced options.
    7. Select Submit.
      You can track the progress of the update operation using the work request percentage completion displayed in the Console. To check the status of the models in a multimodel deployment, see Step 4: Model Deployment Status API.
  • Keep all other model deployment configuration the same; change only the Model Group OCID.
    1. Update the model group configuration details:
      update_model_group_configuration_details = UpdateModelGroupConfigurationDetails(
          model_group_id=update_model_group_id
      )
    2. Create the model group deployment configuration with updateType = LIVE:
      model_deployment_configuration_details = ModelGroupDeploymentConfigurationDetails(
          deployment_type="MODEL_GROUP",
          update_type="LIVE",
          model_group_configuration_details=update_model_group_configuration_details,
          infrastructure_configuration_details=infra_config,
          environment_configuration_details=environment_config
      )
    3. Build the update payload:
      update_model_deployment_details = UpdateModelDeploymentDetails(
          display_name="MMS Model Group Deployment - Test",
          description="mms",
          compartment_id=compartment_id,
          project_id=project_id,
          model_deployment_configuration_details=model_deployment_configuration_details,
          category_log_details=category_log_details
      )
    4. Submit the update request:
      response = data_science_client.update_model_deployment(
          model_deployment_id=md_id,
          update_model_deployment_details=update_model_deployment_details
      )
      print("Update submitted. Status:", response.status)
      # The update runs as a work request; see the tracking sketch after this list.
  • Update the model group deployment using a JSON request payload. For example:
    {
            "displayName": "MMS Model Group Deployment - Test",
            "description": "mms",
            "compartmentId": compartment_id,
            "projectId": project_id,
            "modelDeploymentConfigurationDetails": {
                "deploymentType": "MODEL_GROUP",
                "updateType": "LIVE",
                "modelGroupConfigurationDetails": {
                    "modelGroupId": update_model_group_id
                },
                "infrastructureConfigurationDetails": {
                    "infrastructureType": "INSTANCE_POOL",
                    "instanceConfiguration": {
                        "instanceShapeName": "VM.Standard.E4.Flex",
                        "modelDeploymentInstanceShapeConfigDetails": {
                            "ocpus": 8,
                            "memoryInGBs": 128
                        }
                    },
                    "scalingPolicy": {
                        "policyType": "FIXED_SIZE",
                        "instanceCount": 1
                    }
                },
                "environmentConfigurationDetails": {
                    "environmentConfigurationType": "DEFAULT",
                    "environmentVariables": {
                        "WEB_CONCURRENCY": "1"
                    }
                }
            },
            "categoryLogDetails": {
                "access": {
                    "logGroupId": "ocid1.loggroup.oc1.iad.amaaaaaav66vvniaygnbicsbzb4anlmf7zg2gsisly3ychusjlwuq34pvjba",
                    "logId": "ocid1.log.oc1.iad.amaaaaaav66vvniavsuh34ijk46uhjgsn3ddzienfgquwrr7dwa4dzt4pirq"
                },
                "predict": {
                    "logGroupId": "ocid1.loggroup.oc1.iad.amaaaaaav66vvniaygnbicsbzb4anlmf7zg2gsisly3ychusjlwuq34pvjba",
                    "logId": "ocid1.log.oc1.iad.amaaaaaav66vvniavsuh34ijk46uhjgsn3ddzienfgquwrr7dwa4dzt4pirq"
                }
            }
        }
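
Live updates run as work requests. The following is a minimal sketch, assuming the response returned by update_model_deployment above and the same data_science_client, that reads the work request progress (the same percentage completion shown in the Console):

# The opc-work-request-id header identifies the asynchronous update operation.
work_request_id = response.headers["opc-work-request-id"]

work_request = data_science_client.get_work_request(work_request_id).data
print("Operation:", work_request.operation_type)
print("Status:", work_request.status)
print("Percent complete:", work_request.percent_complete)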

Step 4: Model Deployment Status API

Use this API to view the multimodel deployment state to see which models are active and which have failures.
Method: GET
API: {endpoint}/modelDeployments/{modelDeploymentId}/modelStates
Parameters: pagination (offset and count)
Response payload: a list of models with their model state.
  • Verb: GET
  • URL: /modelDeployments/{modelDeploymentId}/models/modelState
  • Parameters
    • Headers
      • RequestIdHeader
      • RetryTokenHeader

    • Query Parameters:

      • page
      • limit
      • sortOrder
      • sortBy
      • compartmentId
      • projectId
      • displayName
      • inferenceKey
      • modelId
An example request snapshot:
url = f"{endpoint}/modelDeployments/{md_id}/models/modelState"
response = requests.request("GET", url, headers=util.headers, auth=auth)
An example response snapshot:
GET /20190101/modelDeployments/ocid1.datasciencemodeldeploymentdev.oc1.iad.aaaaaaaah4wlp2v4rwzz7qgbmdad4w4m4g3xygzhfhrv7mquxvyylajmh6ra/models/modelState
[ {
  "modelId" : "ocid1.datasciencemodel.oc1.iad.aaaaaaaaumqu5snfwbsfuiy6xvs6mvtomfmseox5php356mxm5jnuzmwa6lq",
  "state" : "SUCCESS"
}, {
  "modelId" : "ocid1.datasciencemodel.oc1.iad.aaaaaaaaqoucitgwgmdn6kre3j67l4e7r4xtzhm3rkvuwbmtyrkvjicjlflq",
  "state" : "SUCCESS"
}, {
  "modelId" : "ocid1.datasciencemodel.oc1.iad.aaaaaaaamaowevcsufxhhzrewzebeqoak7krx24mvlprdxzlsenuwxkxhkra",
  "state" : "SUCCESS"
} ]
Response code: 200
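
A minimal sketch that combines the request above with a simple health summary, assuming the endpoint, md_id, and signed auth objects from the earlier examples:

import requests

url = f"{endpoint}/modelDeployments/{md_id}/models/modelState"
response = requests.get(url, auth=auth)
response.raise_for_status()

# Report any model that is not in the SUCCESS state.
failed = [m["modelId"] for m in response.json() if m["state"] != "SUCCESS"]
if failed:
    print("Models with failures:", failed)
else:
    print("All models are active.")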

Use Bring Your own Container with a Model Group

  1. Follow the steps in Bring Your Own Container.
  2. To return the inference server's health, expose a /health endpoint.
  3. To return the health of a model, expose the /models/model-id/health endpoint.
  4. To load a model into the model serving engine, or to remove a model from it, expose the /models/model-id endpoint.
    BYOC API

    Verb     Endpoint                    Description
    GET      /models/model-id/health     Return the health of a model
    POST     /models/model-id            Load the model into the model serving engine
    POST     /models/model-id/predict    Run inference for a particular model
    DELETE   /models/model-id            Remove the model from the model serving engine
    An example of a BYOC payload:
    "environmentConfigurationDetails": {
        "environmentConfigurationType": "OCIR_CONTAINER",
        "serverPort": 8080,
        "image": "iad.ocir.io/ociodscdev/mms-ref-byoc:3.0",
        # "entrypoint": ["python", "-m", "uvicorn", "a/model/server:app", "--port", "5000", "--host", "0.0.0.0"],
        # "cmd": ["param1"],
        "environmentVariables": {
            "WEB_CONCURRENCY": "1"
        }
    }
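
The routes in the table above can be served by any web framework inside the container. The following is a minimal illustrative sketch, not the reference image, using FastAPI with an in-memory model registry; the loading and prediction logic are placeholders:

from fastapi import FastAPI, HTTPException

app = FastAPI()
loaded_models = {}  # model-id -> model object (placeholder registry)

@app.get("/health")
def server_health():
    # Overall health of the inference server.
    return {"status": "healthy"}

@app.get("/models/{model_id}/health")
def model_health(model_id: str):
    # Health of an individual model.
    if model_id not in loaded_models:
        raise HTTPException(status_code=404, detail="model not loaded")
    return {"modelId": model_id, "status": "healthy"}

@app.post("/models/{model_id}")
def load_model(model_id: str):
    # Placeholder: load the model artifacts into the serving engine here.
    loaded_models[model_id] = object()
    return {"modelId": model_id, "loaded": True}

@app.post("/models/{model_id}/predict")
def predict(model_id: str, payload: dict):
    if model_id not in loaded_models:
        raise HTTPException(status_code=404, detail="model not loaded")
    # Placeholder: run real inference with the loaded model.
    return {"modelId": model_id, "prediction": payload}

@app.delete("/models/{model_id}")
def unload_model(model_id: str):
    # Remove the model from the serving engine.
    loaded_models.pop(model_id, None)
    return {"modelId": model_id, "loaded": False}

Run the server on the port declared in serverPort, for example: python -m uvicorn server:app --host 0.0.0.0 --port 8080.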

LLM Stacked Inferencing

LLM Stacked Inferencing enables efficient deployment of large language models by packaging a base model with several sets of fine-tuned weights, allowing runtime selection of the weights for improved GPU usage and A/B testing. This setup uses a model group deployed as the STACKED type and is supported only with vLLM containers.

For more information, see LLM Stacked Inferencing on GitHub.
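
As a hypothetical illustration of runtime selection, the same /predict route from Step 2 can be called with different inference keys, one per set of weights registered in the model group. The keys, payload schema, and auth object below are assumptions for illustration; the actual request schema depends on the vLLM container configuration:

import requests

predict_url = "https://modeldeployment.<region>.oci.customer-oci.com/<model_deployment_ocid>/predict"
payload = {"prompt": "Summarize the quarterly report.", "max_tokens": 128}

# Route to the base model, then to a fine-tuned variant, by switching the inference key.
for inference_key in ["base-llm", "fine-tuned-support"]:  # hypothetical inference keys
    headers = {"Content-Type": "application/json", "model-key": inference_key}
    response = requests.post(predict_url, json=payload, headers=headers, auth=auth)
    print(inference_key, response.status_code)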

Heterogeneous Model Deployments

With Heterogeneous Model Group Deployment, you can serve models built on different ML frameworks (for example, PyTorch, TensorFlow, or ONNX) behind a unified endpoint using BYOC containers. It is ideal for deploying diverse architectures together, with NVIDIA Triton recommended for automatic routing and execution.

For more information, see Heterogeneous Model Group Deployment on GitHub.

Model Deployment Metrics

For information on the metrics for model deployment, see Metrics.