Use the /predictWithResponseStream endpoint for real-time streaming of
inference results over HTTP/1.1.
It supports both standard chunked transfer encoding and Server-Sent Events (SSE), letting
clients receive prediction outputs incrementally as they're generated. This is especially
useful for applications that require low-latency responses, such as interactive AI
experiences or large language model outputs.
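Ahead of the full SDK example later in this topic, the following minimal sketch shows incremental consumption of the stream with the requests library. It assumes an OCI config file at the default location; the endpoint URI and the empty payload are placeholders for your deployment's values, not values from this page:
import oci
import requests

# Hedged sketch: the OCI Signer doubles as a requests-compatible auth object,
# so the streaming endpoint can be called directly over HTTP.
config = oci.config.from_file()
auth = oci.signer.Signer(
    tenancy=config['tenancy'],
    user=config['user'],
    fingerprint=config['fingerprint'],
    private_key_file_location=config['key_file'],
    pass_phrase=config.get('pass_phrase'))

uri = '<model-deployment-url>/predictWithResponseStream'  # placeholder endpoint
with requests.post(uri, json={}, auth=auth, stream=True) as response:
    # With SSE, each event arrives as one or more "data: ..." lines; with
    # chunked transfer encoding, partial output arrives incrementally.
    for line in response.iter_lines(decode_unicode=True):
        if line:
            print(line)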
If invocations fail with bandwidth-related errors, consider increasing the provisioned load balancer bandwidth by editing the model deployment.
Tenancy request-rate limit exceeded
The maximum number of requests per second per tenancy is set to 150.
If you're still consistently receiving rate-limit errors after increasing the load balancer bandwidth, use the OCI Console to submit a support ticket for the tenancy. Include the following details in the ticket:
Describe the issue, include the error message that occurred, and indicate the new requests-per-second limit needed for the tenancy.
Indicate that the severity is a minor loss of service.
Indicate the service as Analytics & AI and the category as Data Science.
Indicate that the issue is creating and managing models.
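Until a limit increase is in place, clients can soften the impact of throttling, which typically surfaces as HTTP 429 responses, by retrying with backoff. The following is a minimal sketch using the SDK's built-in retry strategy; it assumes the model_deployment_client, model_deployment_ocid, and request_body defined in the Python SDK example later in this topic:
import oci

# Hedged sketch: retry throttled invocations with the SDK's default retry
# strategy, which backs off and retries on HTTP 429 responses. The client and
# variables below are assumed from the SDK example later in this topic.
response = model_deployment_client.predict_with_response_stream(
    model_deployment_id=model_deployment_ocid,
    request_body=request_body,
    retry_strategy=oci.retry.DEFAULT_RETRY_STRATEGY)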
With HTTP/1.1 streaming, when any part of the response body is sent to the client, the
response headers and status code can no longer be changed. As a result, even in cases of
partial success, where not all expected inference data is received, the client still sees a
200 OK status code.
To provide better visibility into these model stream failures, the service uses HTTP trailers. A trailer is a response header sent after the entire response stream or message body, and it carries extra metadata about the request back to the client. This metadata sits in the trailer section of the response body, which clients can read separately from the regular data chunks.
StreamFailure: This trailer header field is set in the response body at the end of the stream to pass extra metadata to the client if any failures occur after the request is accepted.
It captures the following types of error:
ErrorCode - RequestTimeout
  HttpCode - 408
  ErrorReason - ServiceTimeout/ModelResponseTimeExceeded
    ServiceTimeout: The same as the 60-second idle timeout.
    ModelResponseTimeExceeded: The model failed to finish sending the entire response within the 5-minute time window.
ErrorCode - InternalServerError
  HttpCode - 500
  ErrorReason - InternalServerError
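Schematically, a chunked response that carries a StreamFailure trailer looks like the following on the wire. The chunk payload and the trailer's value format shown here are illustrative only, not a guaranteed wire format:
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Trailer: StreamFailure

1b
{"token": "partial output"}
0
StreamFailure: ErrorCode=RequestTimeout, HttpCode=408, ErrorReason=ModelResponseTimeExceeded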
Invoking with the OCI Python SDK
This example code is a reference to help you invoke your model deployment:
# The OCI SDK must be installed for this example to function properly.
# Installation instructions can be found here: https://docs.oracle.com/iaas/Content/API/SDKDocs/pythonsdk.htm
import oci
from oci.signer import Signer
from oci.model_deployment import ModelDeploymentClient

config = oci.config.from_file("~/.oci/config")  # replace with the location of your oci config file
auth = Signer(
    tenancy=config['tenancy'],
    user=config['user'],
    fingerprint=config['fingerprint'],
    private_key_file_location=config['key_file'],
    pass_phrase=config['pass_phrase'])

# For security token based authentication
# token_file = config['security_token_file']
# token = None
# with open(token_file, 'r') as f:
#     token = f.read()
# private_key = oci.signer.load_private_key_from_file(config['key_file'])
# auth = oci.auth.signers.SecurityTokenSigner(token, private_key)

model_deployment_ocid = '${modelDeployment.id}'
request_body = {}  # payload goes here

model_deployment_client = ModelDeploymentClient({'region': config['region']}, signer=auth)
# Enable a realm-specific endpoint: https://docs.oracle.com/iaas/tools/python/2.152.0/sdk_behaviors/realm_specific_endpoint_template.html
model_deployment_client.base_client.client_level_realm_specific_endpoint_template_enabled = True

response = model_deployment_client.predict_with_response_stream(
    model_deployment_id=model_deployment_ocid,
    request_body=request_body)

# Read chunks from the response stream and write them to a file as they arrive
with open('example_file_retrieved_streaming', 'wb') as f:
    for chunk in response.data.raw.stream(1024 * 1024, decode_content=False):
        f.write(chunk)
Invoking with the OCI CLI
Invoke a model deployment from the CLI. The CLI is included in the OCI Cloud Shell environment and is preauthenticated.
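The following is a minimal sketch using oci raw-request; <model-deployment-url> and the JSON payload are placeholders for your deployment's invoke endpoint and request body:
# Hedged sketch: POST to the streaming endpoint with the preauthenticated CLI.
oci raw-request \
  --http-method POST \
  --target-uri <model-deployment-url>/predictWithResponseStream \
  --request-body '{"prompt": "..."}'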