Training Data in Generative AI
Here are guidelines for creating training data for fine-tuning the pretrained models in OCI Generative AI. A custom model can be fine-tuned with only one dataset, which the system automatically splits into 80% training and 20% validation data. The dataset must be a JSONL file containing at least 32 prompt/completion pairs, with each line formatted as: {"prompt": "<your prompt>", "completion": "<expected response>"}. Save the file in an OCI Object Storage bucket and reference it when creating the custom model.
Dataset Requirements
Datasets for training custom models have the following requirements:
- A maximum of one fine-tuning dataset is allowed per custom model. This dataset is randomly split in an 80:20 ratio for training and validation.
- Each file must have at least 32 prompt/completion pair examples.
- The file format is JSONL.
- Each line in the JSONL file has the following format: {"prompt": "<a prompt>", "completion": "<expected response given the prompt>"}\n
- The file must be stored in an OCI Object Storage bucket.
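To get a feel for what an 80:20 random split leaves for validation, the following is a minimal Python sketch. The service performs the split itself; the function name `split_80_20`, the seed, and the rounding at the cut point are assumptions made for illustration, not the service's documented behavior.

```python
import random


def split_80_20(examples, seed=0):
    """Shuffle a copy of the examples and cut it at the 80% mark."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Rounding down at the cut is an assumption; the service may round differently.
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


# Placeholder examples at the minimum dataset size of 32.
examples = [{"prompt": f"question {i}", "completion": f"answer {i}"} for i in range(32)]
train, validation = split_80_20(examples)
print(len(train), len(validation))  # 25 7 with this rounding choice
```

With the minimum of 32 examples, only a handful end up in the validation set, so a larger dataset is likely to give a more stable validation signal.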
JSONL Format
About JSONL
A JSONL file contains a new JSON value or object on each line. The file isn't evaluated as a whole, like a regular JSON file. Instead, each line is treated as if it is a separate JSON file. This format is ideal for storing a set of inputs in JSON format.
The OCI Generative AI service accepts a JSONL file for fine-tuning custom models in the following format:
{"prompt": "<first prompt>", "completion": "<expected completion given first prompt>"}
{"prompt": "<second prompt>", "completion": "<expected completion given second prompt>"}
...
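One way to produce a file in this shape is to serialize each pair separately with Python's standard `json` module, as in the sketch below. The helper names `write_jsonl` and `read_jsonl` and the sample pairs are hypothetical and are not part of the OCI service.

```python
import json


def write_jsonl(pairs, path):
    """Write (prompt, completion) tuples as one JSON object per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for prompt, completion in pairs:
            record = {"prompt": prompt, "completion": completion}
            # json.dumps escapes newlines inside strings, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Parse each non-empty line as its own JSON document."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `write_jsonl([("Summarize: ...", "A short summary.")], "train.jsonl")` writes a single line, and `read_jsonl("train.jsonl")` returns it as a list with one dictionary.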
Ensure that each JSONL dataset file that you create for Generative AI has the following properties:
- The file is UTF-8 encoded.
- Each line item contains a valid JSON object.
- Each JSON object has two properties: "prompt" and "completion".
- Each JSON object is entered in a new line or followed by a newline character (\n).
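The properties above, together with the 32-example minimum, can be checked locally before uploading. The following is a minimal sketch of such a check. The function name `validate_dataset` and its messages are hypothetical, and the service may apply further checks that are not covered here.

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}


def validate_dataset(path, min_examples=32):
    """Return a list of problems found in the file; an empty list means none were found."""
    try:
        with open(path, "rb") as f:
            text = f.read().decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"file is not valid UTF-8: {exc}"]

    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a newline after the last record is expected, not an error

    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: not valid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
        elif set(record) != REQUIRED_KEYS:
            problems.append(f'line {number}: expected exactly "prompt" and "completion"')

    if len(lines) < min_examples:
        problems.append(f"found {len(lines)} lines, need at least {min_examples}")
    return problems
```

Returning a list of messages instead of raising on the first error lets all problems in a file be reported in one pass.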
After you create the JSONL file, add your dataset to an Object Storage bucket.
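As one possible way to do this programmatically, the sketch below uploads a file with the OCI Python SDK's Object Storage client. The helper name `upload_dataset`, the bucket name, and the object name are placeholders, and the usage lines assume a default configuration file at `~/.oci/config` and an existing bucket. The Console or the OCI CLI work equally well.

```python
def upload_dataset(client, bucket_name, object_name, local_path):
    """Upload a local JSONL file to a bucket and return the namespace that was used.

    `client` is expected to be an oci.object_storage.ObjectStorageClient.
    """
    namespace = client.get_namespace().data
    with open(local_path, "rb") as f:
        client.put_object(namespace, bucket_name, object_name, f)
    return namespace


# Usage sketch (requires the `oci` package and valid credentials):
#   import oci
#   config = oci.config.from_file()
#   client = oci.object_storage.ObjectStorageClient(config)
#   upload_dataset(client, "my-training-bucket", "train.jsonl", "train.jsonl")
```

Passing the client in as a parameter keeps the credential setup separate from the upload step.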