Multimodel Serving
Multimodel Serving (MMS) introduces a new capability where you can deploy and manage several machine learning models as a group through a construct called a Model Group.
This approach marks a shift from traditional single-model deployments to more scalable, efficient, and cost-effective model deployments. By sharing infrastructure components such as compute instances and load balancers across several models, you can dramatically reduce overheads.
Also, this feature supports live updates, enabling you to update or replace models in an existing deployment without causing downtime or requiring infrastructure recreation. This ensures high availability and operational continuity for production workloads.
The model grouping mechanism is also designed with immutability and versioning for robust lifecycle tracking, reproducibility, and safe iteration of deployments. Use cases such as stacked LLM inferencing, where several fine-tuned models share a base model, are now supported, improving performance and resource utilization on shared GPUs.
Benefits:
- Deploy up to 500 models (subject to instance shape limitations) in a single deployment.
- Share infrastructure (CPU, GPU, memory) across models.
- Seamless Live Update capability.
- Enhanced support for Stacked LLM inferencing.
- Improved cost-efficiency and manageability.
A Model Group is a logical construct used to encapsulate several machine learning models into a single, version-controlled unit. With a Model Group, you can group deployments, share resources, and perform live updates while maintaining immutability and reproducibility. For more information, see the Model Groups documentation.
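As a purely conceptual sketch (not the actual Model Group API schema), a model group can be pictured as a versioned collection of member models, each addressable by its model OCID or an inference key. The field names below are illustrative only; see the Model Groups documentation for the actual resource definition.

# Conceptual illustration only; field names are hypothetical, not the Model Group API schema.
example_model_group = {
    "displayName": "fraud-detection-group",      # hypothetical group name
    "version": 2,                                # each change produces a new immutable version
    "members": [
        {"modelId": "<model-ocid-1>", "inferenceKey": "fraud-lightgbm"},
        {"modelId": "<model-ocid-2>", "inferenceKey": "fraud-xgboost"},
    ],
}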
Step 1: Deploy the Model Group
Ensure that the appropriate policies are applied. For more information, see Model Group Policies.
- Create the model deployment using the OCI Python SDK:
# 1. Create model group configuration details object
model_group_config_details = ModelGroupConfigurationDetails(
    model_group_id="ocid1.modelgroup.oc1..exampleuniqueID",
    bandwidth_mbps=<bandwidth-mbps>,
    instance_configuration=<instance-configuration>,
    scaling_policy=<scaling-policy>
)

# 2. Create infrastructure configuration details object
infrastructure_config_details = InstancePoolInfrastructureConfigurationDetails(
    infrastructure_type="INSTANCE_POOL",
    instance_configuration=instance_config,
    scaling_policy=scaling_policy
)

# 3. Create environment configuration
environment_config_details = ModelDeploymentEnvironmentConfigurationDetails(
    environment_configuration_type="DEFAULT",
    environment_variables={"WEB_CONCURRENCY": "1"}
)

# 4. Create category log details
category_log_details = CategoryLogDetails(
    access=LogDetails(
        log_group_id=<log-group-id>,
        log_id=<log-id>
    ),
    predict=LogDetails(
        log_group_id=<log-group-id>,
        log_id=<log-id>
    )
)

# 5. Bundle into deployment configuration
model_group_deployment_config_details = ModelGroupDeploymentConfigurationDetails(
    deployment_type="MODEL_GROUP",
    model_group_configuration_details=model_group_config_details,
    infrastructure_configuration_details=infrastructure_config_details,
    environment_configuration_details=environment_config_details
)

# 6. Set up parameters required to create a new model deployment.
create_model_deployment_details = CreateModelDeploymentDetails(
    display_name=<deployment_name>,
    description=<description>,
    compartment_id=<compartment-id>,
    project_id=<project-id>,
    model_deployment_configuration_details=model_group_deployment_config_details,
    category_log_details=category_log_details
)

# 7. Create deployment using SDK client
response = data_science_client.create_model_deployment(
    create_model_deployment_details=create_model_deployment_details
)
print("Model Deployment OCID:", response.data.id)
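The snippet above references instance_config, scaling_policy, and data_science_client without defining them. A minimal sketch of how they might be constructed with the OCI Python SDK follows; the shape name, OCPU, and memory values are assumptions to replace with your own.

import oci
from oci.data_science.models import (
    InstanceConfiguration,
    ModelDeploymentInstanceShapeConfigDetails,
    FixedSizeScalingPolicy,
)

# Authenticate with the default config file profile (adjust for your environment).
config = oci.config.from_file()
data_science_client = oci.data_science.DataScienceClient(config)

# Instance shape for the deployment; the values here are examples only.
instance_config = InstanceConfiguration(
    instance_shape_name="VM.Standard.E4.Flex",
    model_deployment_instance_shape_config_details=ModelDeploymentInstanceShapeConfigDetails(
        ocpus=8,
        memory_in_gbs=128,
    ),
)

# Fixed-size scaling policy with a single instance.
scaling_policy = FixedSizeScalingPolicy(instance_count=1)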
- Create the model group deployment. For example, as a request payload:
{ "displayName": "MMS Model Group Deployment", "description": "mms", "compartmentId": compartment_id, "projectId": project_id, "modelDeploymentConfigurationDetails": { "deploymentType": "MODEL_GROUP", "modelGroupConfigurationDetails": { "modelGroupId": model_group_id }, "infrastructureConfigurationDetails": { "infrastructureType": "INSTANCE_POOL", "instanceConfiguration": { "instanceShapeName": "VM.Standard.E4.Flex", "modelDeploymentInstanceShapeConfigDetails": { "ocpus": 8, "memoryInGBs": 128 } }, "scalingPolicy": { "policyType": "FIXED_SIZE", "instanceCount": 1 } }, "environmentConfigurationDetails": { "environmentConfigurationType": "DEFAULT", "environmentVariables": { "WEB_CONCURRENCY": "1" } } }, "categoryLogDetails": { "access": { "logGroupId": "ocid1.loggroup.oc1.iad.amaaaaaav66vvniaygnbicsbzb4anlmf7zg2gsisly3ychusjlwuq34pvjba", "logId": "ocid1.log.oc1.iad.amaaaaaav66vvniavsuh34ijk46uhjgsn3ddzienfgquwrr7dwa4dzt4pirq" }, "predict": { "logGroupId": "ocid1.loggroup.oc1.iad.amaaaaaav66vvniaygnbicsbzb4anlmf7zg2gsisly3ychusjlwuq34pvjba", "logId": "ocid1.log.oc1.iad.amaaaaaav66vvniavsuh34ijk46uhjgsn3ddzienfgquwrr7dwa4dzt4pirq" } } } }
Step 2: Perform Inferencing
Supported routes:
- /predict with the model OCID (model-ocid) or the inference key (model-key). For more information on inference keys, see Inference Keys in Model Group.
- Inference routing is handled internally based on these keys. Provide these keys in the HTTP header, as in the sketch after this list.
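A minimal sketch of a predict call against a model group deployment follows. The routing header name is a placeholder because it is not spelled out on this page; take the exact header name from Inference Keys in Model Group. The endpoint, payload shape, and model OCID are placeholders as well.

import requests

# `signer` as constructed in the REST sketch under Step 1.
# Model deployment predict endpoint (region and OCID are placeholders).
predict_url = (
    "https://modeldeployment.us-ashburn-1.oci.customer-oci.com/"
    "<model-deployment-ocid>/predict"
)

# Placeholder: use the routing header documented in Inference Keys in Model Group;
# its value is either a member model OCID or that model's inference key.
routing_headers = {"<routing-header-name>": "<model-ocid-or-inference-key>"}

# The request body depends entirely on the scoring interface of the target model.
payload = {"inputs": [[0.1, 0.2, 0.3]]}

response = requests.post(predict_url, json=payload, auth=signer, headers=routing_headers)
print(response.status_code, response.text)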
Step 3: Update Model Deployment
Keep all the model deployment configuration the same; change only the Model Group OCID.
- Create a model group, then update the deployment to reference its OCID. For example:
{ "displayName": "MMS Model Group Deployment - Test", "description": "mms", "compartmentId": compartment_id, "projectId": project_id, "modelDeploymentConfigurationDetails": { "deploymentType": "MODEL_GROUP", "updateType": "LIVE", "modelGroupConfigurationDetails": { "modelGroupId": update_model_group_id }, "infrastructureConfigurationDetails": { "infrastructureType": "INSTANCE_POOL", "instanceConfiguration": { "instanceShapeName": "VM.Standard.E4.Flex", "modelDeploymentInstanceShapeConfigDetails": { "ocpus": 8, "memoryInGBs": 128 } }, "scalingPolicy": { "policyType": "FIXED_SIZE", "instanceCount": 1 } }, "environmentConfigurationDetails": { "environmentConfigurationType": "DEFAULT", "environmentVariables": { "WEB_CONCURRENCY": "1" } } }, "categoryLogDetails": { "access": { "logGroupId": "ocid1.loggroup.oc1.iad.amaaaaaav66vvniaygnbicsbzb4anlmf7zg2gsisly3ychusjlwuq34pvjba", "logId": "ocid1.log.oc1.iad.amaaaaaav66vvniavsuh34ijk46uhjgsn3ddzienfgquwrr7dwa4dzt4pirq" }, "predict": { "logGroupId": "ocid1.loggroup.oc1.iad.amaaaaaav66vvniaygnbicsbzb4anlmf7zg2gsisly3ychusjlwuq34pvjba", "logId": "ocid1.log.oc1.iad.amaaaaaav66vvniavsuh34ijk46uhjgsn3ddzienfgquwrr7dwa4dzt4pirq" } } } }
Step 4: Model Deployment Status API
- Method: GET
- API: {endpoint}/modelDeployments/{modelDeploymentId}/modelStates
- Parameters: Pagination (Offset and Count)
- Response Payload: A list of models with their model state.

Request details:
- Verb: GET
- URL: /modelDeployments/{modelDeploymentId}/models/modelState
- Headers: RequestIdHeader, RetryTokenHeader
- Query Parameters: page, limit, sortOrder, sortBy, compartmentId, projectId, displayName, inferenceKey, modelId

For example:
url = f"{endpoint}/modelDeployments/{md_id}/models/modelState"
response = requests.request("GET", url, headers=util.headers, auth=auth)
An example response snapshot for GET /20190101/modelDeployments/ocid1.datasciencemodeldeploymentdev.oc1.iad.aaaaaaaah4wlp2v4rwzz7qgbmdad4w4m4g3xygzhfhrv7mquxvyylajmh6ra/models/modelState:
[ {
"modelId" : "ocid1.datasciencemodel.oc1.iad.aaaaaaaaumqu5snfwbsfuiy6xvs6mvtomfmseox5php356mxm5jnuzmwa6lq",
"state" : "SUCCESS"
}, {
"modelId" : "ocid1.datasciencemodel.oc1.iad.aaaaaaaaqoucitgwgmdn6kre3j67l4e7r4xtzhm3rkvuwbmtyrkvjicjlflq",
"state" : "SUCCESS"
}, {
"modelId" : "ocid1.datasciencemodel.oc1.iad.aaaaaaaamaowevcsufxhhzrewzebeqoak7krx24mvlprdxzlsenuwxkxhkra",
"state" : "SUCCESS"
} ]
HTTP status code: 200
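The per-model states can also be checked programmatically. The sketch below reuses the endpoint, headers, and auth from the status call above, and assumes that any state other than SUCCESS (the only state shown in the example response) means the model is not yet ready to serve.

import requests

# `endpoint`, `md_id`, `util.headers`, and `auth` as in the status call above.
url = f"{endpoint}/modelDeployments/{md_id}/models/modelState"
response = requests.get(url, headers=util.headers, auth=auth, params={"limit": 100})
response.raise_for_status()

model_states = response.json()
not_ready = [m["modelId"] for m in model_states if m["state"] != "SUCCESS"]
if not_ready:
    print("Models not yet in SUCCESS state:", not_ready)
else:
    print(f"All {len(model_states)} models are in SUCCESS state.")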
Use Bring Your Own Container with a Model Group
LLM Stacked Inferencing
With LLM Stacked Inferencing, you can deploy large language models efficiently by packaging the base model with several sets of fine-tuned weights, enabling runtime selection for improved GPU usage and A/B testing. This setup uses a model group and is deployed as a STACKED type, which is supported only with vLLM containers.
For more information, see LLM Stacked Inferencing on GitHub.
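As an illustration of the runtime-selection idea only: vLLM's OpenAI-compatible interface lets a request choose between the base model and a fine-tuned weight through the "model" field of the request body. How this maps onto a STACKED model group deployment, including any routing headers involved, is covered in the GitHub example; the endpoint and names below are placeholders.

import requests

# Placeholder endpoint for a STACKED model group deployment running a vLLM container.
predict_url = "https://modeldeployment.<region>.oci.customer-oci.com/<model-deployment-ocid>/predict"

# With vLLM's OpenAI-compatible interface, "model" selects the base model or one of the
# fine-tuned weights served alongside it; the value here is a placeholder.
payload = {
    "model": "<base-model-or-fine-tuned-weight-name>",
    "prompt": "Summarize the quarterly report in two sentences.",
    "max_tokens": 128,
}

# Reuse `signer` and `routing_headers` from the inferencing sketch in Step 2.
response = requests.post(predict_url, json=payload, auth=signer, headers=routing_headers)
print(response.status_code, response.text)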
Heterogeneous Model Deployments
With Heterogeneous Model Group Deployment, you can serve models built on different ML frameworks (for example, PyTorch, TensorFlow, or ONNX) behind a unified endpoint using BYOC containers. It is ideal for deploying diverse architectures together, with NVIDIA Triton recommended for automatic routing and execution.
For more information, see Heterogeneous Model Group Deployment on GitHub.
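For context on why Triton suits this pattern (generic Triton knowledge, not specific to this service): a single Triton model repository can host models from different frameworks side by side, with the backend chosen per model in its config.pbtxt, and requests are routed automatically by model name. The layout below is a generic illustration with hypothetical model names; the OCI-specific container and deployment wiring is in the GitHub example.

model_repository/
├── resnet_onnx/
│   ├── config.pbtxt        # platform: "onnxruntime_onnx"
│   └── 1/
│       └── model.onnx
├── bert_pytorch/
│   ├── config.pbtxt        # platform: "pytorch_libtorch"
│   └── 1/
│       └── model.pt
└── classifier_tf/
    ├── config.pbtxt        # platform: "tensorflow_savedmodel"
    └── 1/
        └── model.savedmodel/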
Model Deployment Metrics
For information on the metrics for model deployment, see Metrics.