Concepts for Generative AI
To help you understand OCI Generative AI, review some concepts and terms related to the service.
Generative AI Model
An AI model trained on large amounts of data which takes inputs that it hasn't seen before and generates new content.
Retrieval-Augmented Generation (RAG)
A program that retrieves data from given sources and augments large language model (LLM) responses with the given information to generate grounded responses.
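The retrieve-then-augment flow described above can be sketched in a few lines. This is a toy illustration only: the word-overlap scoring, the function names `retrieve` and `build_grounded_prompt`, and the sample documents are all hypothetical, and a real RAG program would typically use embeddings and a vector store rather than word matching.

```python
from typing import List


def retrieve(question: str, documents: List[str], top_n: int = 1) -> List[str]:
    """Return the documents that share the most words with the question."""
    question_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_n]


def build_grounded_prompt(question: str, documents: List[str]) -> str:
    """Augment the question with retrieved text so the answer can be grounded."""
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical source documents for illustration.
docs = [
    "The summer solstice is the longest day of the year.",
    "Trees sway when the wind blows.",
]
prompt = build_grounded_prompt("What is the summer solstice?", docs)
```

The resulting `prompt` string is what would be sent to the LLM in place of the bare question.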
Prompts and Prompt Engineering
- Prompts
- Strings of text in natural language used to instruct or extract information from a large language model. For example:
- What is the summer solstice?
- Write a poem about trees swaying in the breeze.
- Rewrite the previous text in a lighter tone.
- Prompt Engineering
- The iterative process of crafting specific requests in natural language to extract optimized responses from a large language model (LLM). Based on the exact language used, the prompt engineer can guide the LLM to provide better or different outputs.
Inference
The ability of a large language model (LLM) to generate a response based on instructions and context provided by the user in the prompt. An LLM can generate new data, make predictions, or draw conclusions based on its learned patterns and relationships in the training data, without having been explicitly programmed.
Inference is a key feature of natural language processing (NLP) tasks such as question answering, summarizing text, and translating. You can use the foundational models in Generative AI for inference.
Streaming
Generation of content by a large language model (LLM) in which the user sees the tokens as they're generated, one at a time, instead of waiting for the complete response to be generated and returned.
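As a rough illustration of the difference, the sketch below uses a Python generator to stand in for a streaming response. The functions `stream_tokens` and `partial_texts` are hypothetical helpers, not part of any OCI SDK, and the token list is made up.

```python
from typing import Iterable, Iterator, List


def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield tokens one at a time, standing in for a streaming model response."""
    for token in tokens:
        yield token


def partial_texts(stream: Iterable[str]) -> List[str]:
    """Collect the text a user would see after each token arrives."""
    seen = ""
    snapshots = []
    for token in stream:
        seen += token
        snapshots.append(seen)
    return snapshots


snapshots = partial_texts(
    stream_tokens(["Trees", " sway", " in", " the", " breeze", "."])
)
```

Each entry in `snapshots` is the partial text visible at that moment. Without streaming, the user would see only the final entry.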
Embedding
A numerical representation that preserves the meaning of a piece of text. The text can be a phrase, a sentence, or one or more paragraphs. The Generative AI embedding models transform each phrase, sentence, or paragraph that you input into an array of 384 or 1024 numbers, depending on the embedding model that you choose. You can use these embeddings to find phrases that are similar in context or category. Embeddings are typically stored in a vector database and are mostly used for semantic search, where the search function focuses on the meaning of the text that it's searching through rather than on matching keywords. To create the embeddings, you can input phrases in English and other languages.
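One common way to compare embeddings is cosine similarity. The text above doesn't name a similarity measure, so treat that choice as an assumption. The sketch uses made-up four-number vectors for readability, whereas the service's embeddings have 384 or 1024 numbers.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, for illustration only.
query = [0.9, 0.1, 0.0, 0.2]
stored = {
    "doc_a": [0.8, 0.2, 0.1, 0.1],
    "doc_b": [0.0, 0.9, 0.7, 0.0],
}

# A semantic search returns the stored item whose embedding is closest to the query.
best_match = max(stored, key=lambda name: cosine_similarity(query, stored[name]))
```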
Playground
An interface in the Oracle Cloud Console for exploring the hosted pretrained and custom models without writing a single line of code. Use the playground to test your use cases and refine prompts and parameters. When you're happy with the results, copy the generated code or use the model's endpoint to integrate Generative AI into your applications.
Custom Model
A model that you create by using a pretrained model as a base and using your own dataset to fine-tune that model.
Tokens
A token is a word, part of a word, or a punctuation mark. For example, apple is one token, friendship is two tokens (friend and ship), and don’t is two tokens (don and ’t). When you run a model in the playground, you can set the maximum number of output tokens. Estimate four characters per token.
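The four-characters-per-token estimate can be turned into a quick calculation. This is only the rule of thumb from the text, not a real tokenizer, and the function name `estimate_tokens` is hypothetical.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb from the text, not an exact tokenizer


def estimate_tokens(text: str) -> int:
    """Return a rough token count using the four-characters-per-token estimate."""
    if not text:
        return 0
    return math.ceil(len(text) / CHARS_PER_TOKEN)
```

Actual token counts depend on the model's tokenizer, so use the estimate only for rough sizing, such as choosing a maximum number of output tokens.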
Temperature
The level of randomness used to generate the output text. To generate a similar output for a prompt every time that you run that prompt, use 0. To generate a random new text for that prompt, increase the temperature.
Start with the temperature set to 0 and increase the temperature as you regenerate the prompts to refine the output. High temperatures can introduce hallucinations and factually incorrect information.
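To see why a higher temperature produces more random output, here is a minimal sketch of one common implementation: dividing the model's raw scores (logits) by the temperature before converting them to probabilities. The text doesn't describe how the service applies temperature, so the formula, the treatment of 0 as "always pick the top token", and the name `softmax_with_temperature` are assumptions.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Convert logits to probabilities, flattening them as temperature rises."""
    if temperature <= 0:
        # Assumed behavior for 0: all probability goes to the top-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]
greedy = softmax_with_temperature(logits, 0)
focused = softmax_with_temperature(logits, 0.5)
spread = softmax_with_temperature(logits, 2.0)
```

In this sketch, the top token's share of the probability shrinks as the temperature grows, so less likely tokens are sampled more often.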
Top k
A sampling method in which the model chooses the next token randomly from the top k most likely tokens. A higher value for k generates more random output, which makes the output text sound more natural. The default value for k is 0 for command models and -1 for Llama models, which means that the models should consider all tokens and not use this method.
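A minimal sketch of top k sampling follows, assuming the model's next-token probabilities are available as a dictionary. The function name `top_k_sample` and the sample probabilities are hypothetical. Values of k at or below 0 turn the method off, matching the defaults described above.

```python
import random
from typing import Dict


def top_k_sample(probs: Dict[str, float], k: int, rng: random.Random) -> str:
    """Pick the next token at random from the k most likely tokens."""
    if k <= 0:
        # 0 or -1: the method is disabled, so every token stays eligible.
        candidates = list(probs.items())
    else:
        ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
        candidates = ranked[:k]
    tokens = [token for token, _ in candidates]
    weights = [prob for _, prob in candidates]
    # Sample in proportion to each surviving token's probability.
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token probabilities after the word "favorite".
next_token_probs = {"food": 0.5, "book": 0.3, "movie": 0.15, "zebra": 0.05}
rng = random.Random(0)
token = top_k_sample(next_token_probs, 2, rng)  # only "food" or "book" can be chosen
```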
Top p
A sampling method that controls the cumulative probability of the top tokens to consider for the next token. Assign p a decimal number between 0 and 1 for the probability. For example, enter 0.75 for the top 75 percent to be considered. Set p to 1 to consider all tokens.
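The cumulative-probability cutoff can be sketched as follows. The sketch keeps the most likely tokens until their combined probability reaches p, then rescales the survivors so they sum to 1. The rescaling step, the function name `top_p_filter`, and the sample probabilities are assumptions made for illustration.

```python
from typing import Dict


def top_p_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the most likely tokens whose cumulative probability reaches p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    # Rescale so the remaining candidates form a valid distribution.
    return {token: prob / total for token, prob in kept}


# Hypothetical next-token probabilities.
next_token_probs = {"food": 0.5, "book": 0.3, "movie": 0.15, "zebra": 0.05}
candidates = top_p_filter(next_token_probs, 0.75)
```

With these sample values, p = 0.75 keeps "food" and "book" because together they cover 80 percent of the probability, and p = 1 keeps all four tokens.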
Frequency Penalty
A penalty that is assigned to a token when that token appears frequently. High penalties encourage fewer repeated tokens and produce a more random output.
Presence Penalty
A penalty that is assigned to each token when it appears in the output to encourage generating outputs with tokens that haven't been used.
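The difference between the two penalties is easiest to see side by side. The sketch below assumes a widely used formulation in which the frequency penalty scales with how many times a token has already appeared and the presence penalty is a flat amount applied once a token has appeared at all. The text doesn't give the service's exact formula, so this formulation and the name `apply_penalties` are assumptions.

```python
from collections import Counter
from typing import Dict, List


def apply_penalties(
    logits: Dict[str, float],
    generated: List[str],
    frequency_penalty: float,
    presence_penalty: float,
) -> Dict[str, float]:
    """Lower the scores of tokens that already appear in the generated output."""
    counts = Counter(generated)
    adjusted = {}
    for token, logit in logits.items():
        count = counts.get(token, 0)
        # Frequency penalty grows with each repeat; presence penalty is flat.
        penalty = frequency_penalty * count
        if count > 0:
            penalty += presence_penalty
        adjusted[token] = logit - penalty
    return adjusted


# Hypothetical scores and output so far.
scores = {"the": 2.0, "tree": 1.5, "breeze": 1.0}
adjusted = apply_penalties(scores, ["the", "the", "tree"], 0.5, 0.2)
```

Here "the" is penalized most because it appeared twice, "tree" less because it appeared once, and "breeze" keeps its original score because it hasn't been used.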
Likelihood
In the output of a large language model (LLM), how likely it is for a token to follow the current generated token. When an LLM generates a new token for the output text, a likelihood is assigned to all tokens, where tokens with higher likelihoods are more likely to follow the current token. For example, it's more likely that the word favorite is followed by the word food or book rather than the word zebra. Likelihood is defined by a number between -15 and 0. The more negative the number, the less likely it is that the token follows the current token.
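The text doesn't say how the likelihood number is computed. A range of -15 to 0 is consistent with a log-probability, so the sketch below assumes a natural-log probability with a floor of -15. Both the log scale and the floor are assumptions, and `log_likelihood` is a hypothetical helper.

```python
import math

MIN_LOG_LIKELIHOOD = -15.0  # assumed floor, taken from the range in the text


def log_likelihood(probability: float) -> float:
    """Map a probability in [0, 1] to an assumed log-likelihood in [-15, 0]."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability == 0.0:
        return MIN_LOG_LIKELIHOOD
    return max(math.log(probability), MIN_LOG_LIKELIHOOD)


# Hypothetical probabilities for the token that follows "favorite".
likely = log_likelihood(0.5)       # for example, "food"
unlikely = log_likelihood(0.0001)  # for example, "zebra"
```

Under this assumption, a certain token scores 0 and rarer tokens score increasingly negative values.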
Preamble
An initial context or guiding message for a chat model. When you don't give a preamble to a chat model, the default preamble for that model is used. The default preamble for the cohere.command-r-plus and cohere.command-r-16k models is:
You are Command.
You are an extremely capable large language model built by Cohere.
You are given instructions programmatically via an API that you follow to the best of your ability.
It's optional to give a preamble. If you want to use your own preamble, for best results, give the model context, instructions, and a conversation style. Here are some examples:
- You are a seasoned marketing professional with a deep understanding of consumer behavior and market trends. Answer with a friendly and informative tone, sharing industry insights and best practices.
- You are a travel advisor that focuses on fun activities. Answer with a sense of humor and a pirate tone.
You can also include a preamble in a chat conversation and directly ask the model to answer in a certain way. For example, "Answer the following question in a marketing tone. Where's the best place to go sailing?"
Model Endpoint
A designated point on a dedicated AI cluster where a large language model (LLM) can accept user requests and send back responses such as the model's generated text.
In OCI Generative AI, you can create endpoints for ready-to-use pretrained models and custom models. Those endpoints are listed in the playground for testing the models. You can also reference those endpoints in applications.
Content Moderation
Content moderation addresses the following categories of harmful content:
- Hate and harassment, such as identity attacks, insults, threats of violence, and sexual aggression
- Self-inflicted harm, such as self-harm and eating-disorder promotion
- Ideological harm, such as extremism, terrorism, organized crime, and misinformation
- Exploitation, such as scams and sexual abuse
By default, OCI Generative AI doesn't add a content moderation layer on top of the ready-to-use pretrained models. However, pretrained models have some level of content moderation that filters the output responses. To incorporate content moderation into models, you must enable content moderation when creating an endpoint for a pretrained or a fine-tuned model. See Creating an Endpoint in Generative AI.
Dedicated AI Clusters
Compute resources that you can use for fine-tuning custom models or for hosting endpoints for pretrained and custom models. The clusters are dedicated to your models and not shared with other customers.
Retired and Deprecated Models
- Retirement
- When a model is retired, it's no longer available for use in the Generative AI service.
- Deprecation
- When a model is deprecated, it remains available in the Generative AI service for a defined amount of time before it's retired.
For more information, see Retiring the Models.