Open AI Datasets
25 widely used datasets for text, code, image, audio, video, and benchmark workflows.
| Name | Provider | Category | Size | License | Commercial | Link |
|---|---|---|---|---|---|---|
| AudioSet | Google | audio | 2M+ 10s clips | CC-BY 4.0 | Yes | Open |
| BIG-Bench | Google | benchmark | 204 tasks | Apache 2.0 | Yes | Open |
| C4 (Colossal Clean Crawled Corpus) | Google/AllenAI | text | 750GB | ODC-By 1.0 | Yes | Open |
| Common Voice | Mozilla | audio | 30K+ hours | CC-0 | Yes | Open |
| Cosmopedia | HuggingFace | text | 25B tokens | Apache 2.0 | Yes | Open |
| Dolma | AI2 (Allen Institute) | text | 3T tokens | ODC-By 1.0 | Yes | Open |
| FineWeb | HuggingFace | text | 15T tokens | ODC-By 1.0 | Yes | Open |
| FineWeb-Edu | HuggingFace | text | 1.3T tokens | ODC-By 1.0 | Yes | Open |
| ImageNet | Stanford/Princeton | image | 14M images | Custom (research) | No | Open |
| LAION-5B | LAION | image | 5.85B img-text pairs | CC-BY 4.0 | Yes | Open |
| LibriSpeech | OpenSLR | audio | 1,000 hours | CC-BY 4.0 | Yes | Open |
| MS COCO | Microsoft | image | 330K images | CC-BY 4.0 | Yes | Open |
| MS MARCO | Microsoft | text | 1M+ passages | Custom | No | Open |
| Natural Questions | Google | text | 300K+ questions | CC-BY-SA 3.0 | Yes | Open |
| Open Images V7 | Google | image | 9M images | CC-BY 4.0 | Yes | Open |
| OpenAssistant Conversations | LAION | text | 161K messages | Apache 2.0 | Yes | Open |
| RedPajama v2 | Together AI | text | 30T+ tokens | Apache 2.0 | Yes | Open |
| SlimPajama | Cerebras | text | 627B tokens | Apache 2.0 | Yes | Open |
| SQuAD 2.0 | Stanford | text | 150K questions | CC-BY-SA 4.0 | Yes | Open |
| StarCoderData | BigCode | code | 783GB | Apache 2.0 | Yes | Open |
| The Pile | EleutherAI | text | 825GB | MIT | Yes | Open |
| The Stack v2 | BigCode / HuggingFace | code | 67.5TB | Mixed per-repo | No | Open |
| UltraChat | Tsinghua NLP | text | 1.5M dialogues | MIT | Yes | Open |
| WebVid | University of Oxford | video | 10M clips | CC-BY 4.0 | Yes | Open |
| Wikipedia Dump | Wikimedia | text | 22GB (English) | CC-BY-SA 4.0 | Yes | Open |