OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
Published on Jun 21, 2023
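Authors:
Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh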
Abstract
Large multimodal models trained on natural documents, which interleave images
and text, outperform models trained on image-text pairs on various multimodal
benchmarks. However, the datasets used to train these models have not been
released, and the collection process has not been fully specified. We introduce
the OBELICS dataset, an open web-scale filtered dataset of interleaved
image-text documents comprising 141 million web pages extracted from Common
Crawl, 353 million associated images, and 115 billion text tokens. We describe
the dataset creation process, present comprehensive filtering rules, and
provide an analysis of the dataset's content. To show the viability of OBELICS,
we train vision and language models of 9 and 80 billion parameters named
IDEFICS, and obtain competitive performance on different multimodal benchmarks.
We release our dataset, models and code.