
Open AI Datasets

25 widely used datasets for text, code, image, audio, video, and benchmark workflows.

| Name | Provider | Category | Size | License | Commercial use | Link |
| --- | --- | --- | --- | --- | --- | --- |
| AudioSet | Google | audio | 2M+ 10s clips | CC-BY 4.0 | Yes | Open |
| BIG-Bench | Google | benchmark | 204 tasks | Apache 2.0 | Yes | Open |
| C4 (Colossal Clean Crawled Corpus) | Google/AllenAI | text | 750GB | ODC-By 1.0 | Yes | Open |
| Common Voice | Mozilla | audio | 30K+ hours | CC-0 | Yes | Open |
| Cosmopedia | HuggingFace | text | 25B tokens | Apache 2.0 | Yes | Open |
| Dolma | AI2 (Allen Institute) | text | 3T tokens | ODC-By 1.0 | Yes | Open |
| FineWeb | HuggingFace | text | 15T tokens | ODC-By 1.0 | Yes | Open |
| FineWeb-Edu | HuggingFace | text | 1.3T tokens | ODC-By 1.0 | Yes | Open |
| ImageNet | Stanford/Princeton | image | 14M images | Custom (research) | No | Open |
| LAION-5B | LAION | image | 5.85B img-text pairs | CC-BY 4.0 | Yes | Open |
| LibriSpeech | OpenSLR | audio | 1,000 hours | CC-BY 4.0 | Yes | Open |
| MS COCO | Microsoft | image | 330K images | CC-BY 4.0 | Yes | Open |
| MS MARCO | Microsoft | text | 1M+ passages | Custom | No | Open |
| Natural Questions | Google | text | 300K+ questions | CC-BY-SA 3.0 | Yes | Open |
| Open Images V7 | Google | image | 9M images | CC-BY 4.0 | Yes | Open |
| OpenAssistant Conversations | LAION | text | 161K messages | Apache 2.0 | Yes | Open |
| RedPajama v2 | Together AI | text | 30T+ tokens | Apache 2.0 | Yes | Open |
| SlimPajama | Cerebras | text | 627B tokens | Apache 2.0 | Yes | Open |
| SQuAD 2.0 | Stanford | text | 150K questions | CC-BY-SA 4.0 | Yes | Open |
| StarCoderData | BigCode | code | 783GB | Apache 2.0 | Yes | Open |
| The Pile | EleutherAI | text | 825GB | MIT | Yes | Open |
| The Stack v2 | BigCode / HuggingFace | code | 67.5TB | Mixed per-repo | No | Open |
| UltraChat | Tsinghua NLP | text | 1.5M dialogues | MIT | Yes | Open |
| WebVid | University of Oxford | video | 10M clips | CC-BY 4.0 | Yes | Open |
| Wikipedia Dump | Wikimedia | text | 22GB (English) | CC-BY-SA 4.0 | Yes | Open |
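
A listing like the one above is often filtered by category and commercial availability before choosing a dataset. The sketch below shows one way that could look in Python. The `Dataset` record, the `CATALOG` list, and the `select` helper are hypothetical names introduced only for illustration, the rows are a small sample copied from the table, and treating the Commercial column as a plain boolean is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """One row of the table (hypothetical record type for illustration)."""
    name: str
    provider: str
    category: str
    size: str
    license: str
    commercial: bool  # assumption: "Yes"/"No" in the table mapped to True/False


# A sample of rows copied from the table above; not the full list.
CATALOG = [
    Dataset("AudioSet", "Google", "audio", "2M+ 10s clips", "CC-BY 4.0", True),
    Dataset("Common Voice", "Mozilla", "audio", "30K+ hours", "CC-0", True),
    Dataset("ImageNet", "Stanford/Princeton", "image", "14M images", "Custom (research)", False),
    Dataset("MS MARCO", "Microsoft", "text", "1M+ passages", "Custom", False),
    Dataset("StarCoderData", "BigCode", "code", "783GB", "Apache 2.0", True),
    Dataset("The Stack v2", "BigCode / HuggingFace", "code", "67.5TB", "Mixed per-repo", False),
]


def select(catalog, category=None, commercial_only=False):
    """Return names of datasets matching the filters (hypothetical helper)."""
    return [
        d.name
        for d in catalog
        if (category is None or d.category == category)
        and (not commercial_only or d.commercial)
    ]


# Code datasets marked as usable commercially in the sample.
print(select(CATALOG, category="code", commercial_only=True))
# All audio datasets in the sample, regardless of the commercial flag.
print(select(CATALOG, category="audio"))
```

Keeping the license string next to the boolean flag makes it easy to review the actual terms for any dataset the filter returns.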