What are Datasets?
A dataset in Phoenix is a structured collection of examples, where each example consists of:
- Input: The data provided to your model or application (e.g., user prompts, questions)
- Output: The expected or reference output (e.g., correct answers, ideal responses)
- Metadata: Additional information about the example (e.g., difficulty level, category, source)
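The three-part structure above can be sketched in plain Python. The field names here are illustrative, not the exact Phoenix schema:

```python
# One example: input, expected output, and metadata.
# Keys and values are assumptions for illustration only.
example = {
    "input": {"question": "What is the capital of France?"},
    "output": {"answer": "Paris"},
    "metadata": {"difficulty": "easy", "category": "geography"},
}

# A dataset is then an ordered collection of such examples.
dataset = [example]
```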
What are Experiments?
An experiment is a systematic evaluation of your AI application’s performance on a dataset. Each experiment:
- Runs a task (your AI application logic) on every example in a dataset
- Captures the output and execution trace for each run
- Optionally evaluates the outputs using evaluators (metrics and quality checks)
- Stores results for comparison and analysis
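The steps above amount to a simple loop: run the task, evaluate, store. This is a minimal sketch in plain Python, not the Phoenix API; `task` and `exact_match` stand in for your application logic and an evaluator:

```python
def task(example):
    # Stand-in for your AI application logic: here a trivial transform.
    return example["input"]["question"].upper()

def exact_match(output, expected):
    # A simple evaluator: 1.0 if the output matches the reference exactly.
    return 1.0 if output == expected else 0.0

dataset = [
    {"input": {"question": "hello"}, "output": {"answer": "HELLO"}},
    {"input": {"question": "world"}, "output": {"answer": "WORLD"}},
]

results = []
for example in dataset:
    output = task(example)                                     # run the task
    score = exact_match(output, example["output"]["answer"])   # evaluate
    results.append({"output": output, "score": score})         # store for analysis
```

Each record in `results` corresponds to one example, ready for comparison across experiment runs.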
Key Use Cases
Model Evaluation
Test your models against benchmarks and ground truth data to measure accuracy and quality.
Regression Testing
Ensure changes to your application don’t degrade performance on known examples.
A/B Testing
Compare different models, prompts, or configurations to find the best approach.
Fine-tuning Preparation
Curate high-quality datasets for training and fine-tuning language models.
Workflow Overview
Create a Dataset
Build datasets from production traces, upload CSV/DataFrame, or manually create examples.
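For the CSV path, rows typically map one-to-one onto examples. A hedged sketch using only the standard library; the column names (`question`, `answer`) are assumptions about your file, not a Phoenix requirement:

```python
import csv
import io

# In practice you would open a file; an in-memory string keeps this self-contained.
csv_text = "question,answer\nWhat is 2+2?,4\nCapital of Japan?,Tokyo\n"

examples = []
for row in csv.DictReader(io.StringIO(csv_text)):
    examples.append({
        "input": {"question": row["question"]},
        "output": {"answer": row["answer"]},
    })
```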
Dataset Versioning
Phoenix automatically versions your datasets:
- Each modification creates a new version
- Experiments are tied to specific versions for reproducibility
- You can retrieve and compare any historical version
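Conceptually, versioning means every modification produces a new immutable snapshot that can be retrieved later. This toy class mirrors that idea; it is not Phoenix's internal storage model:

```python
class VersionedDataset:
    """Illustrative model of append-only dataset versioning."""

    def __init__(self):
        self._versions = []  # each entry is an immutable snapshot

    def commit(self, examples):
        # Each modification creates a new version; return its id.
        self._versions.append(tuple(examples))
        return len(self._versions) - 1

    def get(self, version):
        # Any historical version can be retrieved for reproducibility.
        return self._versions[version]

ds = VersionedDataset()
v0 = ds.commit([{"input": "a"}])                    # first version
v1 = ds.commit([{"input": "a"}, {"input": "b"}])    # modification -> new version
```

Because experiments are tied to a specific version id, re-running against `v0` always sees the same examples even after later edits.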
Trace Association
Datasets can be linked to production traces, enabling you to:
- Create datasets from real user interactions
- Track which traces contributed to each example
- Debug issues by reviewing original execution context
Next Steps
Creating Datasets
Learn different ways to create and populate datasets
Running Experiments
Execute experiments and evaluate your AI applications
Dataset Versioning
Manage versions, tags, and export datasets
Evaluators
Understand built-in and custom evaluators