#25 Unsafe files (opened 8 days ago by Ali-C137)
#23 Sample dataset? (7 comments; opened 12 days ago by dweb)
#22 Details on the evaluation with lighteval (2 comments; opened 12 days ago by amaracani)
#19 tiny-fineweb (1 comment; opened 15 days ago by 3thn)
#14 Training configs for data ablation study (1 comment; opened 19 days ago by jimmyhbx)
#12 Reprocessing for a new language (7 comments; opened 20 days ago by pere)
#9 Are copyrighted works included in this dataset? (4 comments; opened 23 days ago by umm-maybe)
#8 Any plan to train models on larger subset of dataset? (1 comment; opened 23 days ago by mrfakename)
#7 Split by languages? (3 comments; opened 24 days ago by mhenrichsen)
#5 Thank you for the great dataset (opened 24 days ago by musicurgy)
#4 Torrent? (2 comments; opened 24 days ago by emilss)
#3 Scoring documents with LLM and making scores available as a quality filter (Ask-LLM) (1 comment; opened 24 days ago by Lauler)