Pretrained Foundational Models in Generative AI

You can use the following pretrained foundational models in OCI Generative AI:

Important

For supported model timelines, see Retiring the Models.
Chat Models (New)

Ask questions and get conversational responses through an AI chat interface.

cohere.command-r-16k
Available in these regions:
  • Brazil East (Sao Paulo)
  • Germany Central (Frankfurt)
  • UK South (London)
  • US Midwest (Chicago)
Key features:
  • User prompt can be up to 16,000 tokens, and the response can be up to 4,000 tokens for each run.
  • Optimized for conversational interaction and long-context tasks. Ideal for text generation, summarization, translation, and text-based classification.
  • You can fine-tune this model with your dataset.
cohere.command-r-plus
Available in these regions:
  • Brazil East (Sao Paulo)
  • Germany Central (Frankfurt)
  • UK South (London)
  • US Midwest (Chicago)
Key features:
  • User prompt can be up to 128,000 tokens, and the response can be up to 4,000 tokens for each run.
  • Optimized for complex tasks. Offers advanced language understanding, higher capacity, and more nuanced responses, and can maintain context across its 128,000-token conversation history. Also ideal for question answering, sentiment analysis, and information retrieval.
meta.llama-3.1-70b-instruct
Available in these regions:
  • Brazil East (Sao Paulo)
  • Germany Central (Frankfurt)
  • UK South (London)
  • US Midwest (Chicago)
Key features:
  • Model has 70 billion parameters.
  • User prompt and response can be up to 128,000 tokens for each run.
  • You can fine-tune this model with your dataset.
meta.llama-3.1-405b-instruct
Available in these regions:
  • Brazil East (Sao Paulo) (dedicated AI cluster only)
  • Germany Central (Frankfurt) (dedicated AI cluster only)
  • UK South (London) (dedicated AI cluster only)
  • US Midwest (Chicago)
Key features:
  • Model has 405 billion parameters.
  • User prompt and response can be up to 128,000 tokens for each run.
  • On-demand inferencing is available only in the US Midwest (Chicago) region. In other regions, you must create your own dedicated AI cluster and an endpoint on that cluster to host this model for inferencing.
meta.llama-3-70b-instruct (deprecating soon)
Available in these regions:
  • Brazil East (Sao Paulo)
  • Germany Central (Frankfurt)
  • UK South (London)
  • US Midwest (Chicago)
Key features:
  • Model has 70 billion parameters.
  • User prompt and response can be up to 8,000 tokens for each run.
  • You can fine-tune this model with your dataset.
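The per-run token limits above vary by model, so client code often checks a prompt against its target model's budget before sending it. The following sketch is illustrative only: the `CHAT_MODEL_LIMITS` table is copied from the list above, and the 4-characters-per-token heuristic is an assumption, not the tokenizer any of these models actually uses.

```python
# Hypothetical helper; limits copied from the model list above.
# "prompt"/"response" are separate budgets; "combined" means the prompt
# and response share one budget for the run.
CHAT_MODEL_LIMITS = {
    "cohere.command-r-16k": {"prompt": 16_000, "response": 4_000},
    "cohere.command-r-plus": {"prompt": 128_000, "response": 4_000},
    "meta.llama-3.1-70b-instruct": {"combined": 128_000},
    "meta.llama-3.1-405b-instruct": {"combined": 128_000},
    "meta.llama-3-70b-instruct": {"combined": 8_000},
}

def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)

def fits_model(model: str, prompt: str, max_response_tokens: int = 0) -> bool:
    """Return True if the prompt plus the requested response budget is
    likely to fit the model's per-run limits from the table above."""
    limits = CHAT_MODEL_LIMITS[model]
    needed = estimate_tokens(prompt)
    if "combined" in limits:
        return needed + max_response_tokens <= limits["combined"]
    return needed <= limits["prompt"] and max_response_tokens <= limits["response"]
```

For example, a prompt of roughly 10,000 tokens with a 1,000-token response budget fits the 128,000-token Llama 3.1 models but not the 8,000-token combined limit of meta.llama-3-70b-instruct.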
Tip

Learn about chat models.

Embedding Models

Convert text to vector embeddings to use in applications for semantic searches, text classification, or text clustering.

cohere.embed-english-v3.0
Available in these regions:
  • Brazil East (Sao Paulo)
  • Germany Central (Frankfurt)
  • UK South (London)
  • US Midwest (Chicago)
Key features:
  • Model for English-language text.
  • Model creates a 1,024-dimensional vector for each embedding.
  • Maximum 96 sentences per run.
  • Maximum 512 tokens per embedding.
cohere.embed-multilingual-v3.0
Available in these regions:
  • Brazil East (Sao Paulo)
  • Germany Central (Frankfurt)
  • UK South (London)
  • US Midwest (Chicago)
Key features:
  • Multilingual model.
  • Model creates a 1,024-dimensional vector for each embedding.
  • Maximum 96 sentences per run.
  • Maximum 512 tokens per embedding.
cohere.embed-english-light-v3.0
Available in these regions:
  • US Midwest (Chicago)
Key features:
  • Light models are smaller and faster than the original models.
  • Model for English-language text.
  • Model creates a 384-dimensional vector for each embedding.
  • Maximum 96 sentences per run.
  • Maximum 512 tokens per embedding.
cohere.embed-multilingual-light-v3.0
Available in these regions:
  • US Midwest (Chicago)
Key features:
  • Light models are smaller and faster than the original models.
  • Multilingual model.
  • Model creates a 384-dimensional vector for each embedding.
  • Maximum 96 sentences per run.
  • Maximum 512 tokens per embedding.
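Because each run accepts at most 96 inputs, client code typically splits larger input lists into batches, and semantic search over the returned vectors is usually a cosine-similarity comparison. The sketch below is plain Python illustrating both ideas; the batching limit is taken from the list above, and the similarity function is standard math, not an OCI API.

```python
import math

MAX_INPUTS_PER_RUN = 96  # per-run input limit from the model list above

def batches(texts, size=MAX_INPUTS_PER_RUN):
    """Split a list of inputs into chunks the service accepts per run."""
    for i in range(0, len(texts), size):
        yield texts[i:i + size]

def cosine_similarity(a, b):
    """Standard cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Example: 250 documents require 3 embedding runs (96 + 96 + 58).
chunks = list(batches([f"doc {i}" for i in range(250)]))
print(len(chunks), [len(c) for c in chunks])  # → 3 [96, 96, 58]
```

In a real application, each chunk would be sent to the embedding model in one request, and `cosine_similarity` would rank stored document vectors against a query vector.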
Tip

Learn about the embedding models.

Generation Models (Deprecated)

Give instructions to generate text or extract information from text.

Important

All OCI Generative AI pretrained foundational models that are supported for the on-demand serving mode and use the text generation and summarization APIs (including in the playground) are now retired. If you host a generation or summarization model, such as cohere.command, on a dedicated AI cluster (dedicated serving mode), you can continue to use that model until it's retired. See Retiring the Models for retirement dates and definitions. We recommend that you use the chat models instead.
cohere.command (deprecated)
Available in these regions:
  • US Midwest (Chicago)
Key features:
  • Model has 52 billion parameters.
  • User prompt and response can be up to 4,096 tokens for each run.
  • You can fine-tune this model with your dataset.
cohere.command-light (deprecated)
Available in these regions:
  • US Midwest (Chicago)
Key features:
  • Model has 6 billion parameters.
  • User prompt and response can be up to 4,096 tokens for each run.
  • You can fine-tune this model with your dataset.
meta.llama-2-70b-chat (deprecated)
Available in these regions:
  • US Midwest (Chicago)
Key features:
  • Model has 70 billion parameters.
  • User prompt and response can be up to 4,096 tokens for each run.
The Summarization Model (Deprecated)

Summarize text with your instructed format, length, and tone.

Important

The cohere.command model supported for the on-demand serving mode is now retired, and the model is deprecated for the dedicated serving mode. If you're hosting cohere.command on a dedicated AI cluster (dedicated serving mode) for summarization, you can continue to use that hosted model replica with the summarization API and in the playground until cohere.command retires for the dedicated serving mode. See Retiring the Models for retirement dates and definitions. We recommend that you use the chat models instead, which offer the same summarization capabilities, including control over summary length and style.
cohere.command (deprecated)
Available in these regions:
  • US Midwest (Chicago)
Key features:
  • Model has 52 billion parameters.
  • User prompt and response can be up to 4,096 tokens for each run.
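When moving summarization workloads to the chat models, as the note above recommends, the format, length, and tone controls that the summarization API exposed can be reproduced in the chat prompt itself. A minimal sketch, where the prompt template wording is an illustrative assumption rather than an OCI API:

```python
def build_summarization_prompt(text: str, fmt: str = "bullet points",
                               length: str = "short", tone: str = "neutral") -> str:
    """Compose a chat prompt that reproduces the summarization model's
    format/length/tone controls. The template wording is illustrative."""
    return (
        f"Summarize the following text as {fmt}. "
        f"Keep the summary {length} and use a {tone} tone.\n\n"
        f"Text:\n{text}"
    )
```

The returned string would be sent as the user message to a chat model such as cohere.command-r-16k.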
Tip

Learn about the summarization model.