What matters when building vision-language models?
Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh
arXiv:2405.02246 · Published 2024-05-03
The growing interest in vision-language models (VLMs) has been driven by
improvements in large language models and vision transformers. Despite the
abundance of literature on this subject, we observe that critical decisions
regarding the design of VLMs are often not justified. We argue that these
unsupported decisions impede progress in the field by making it difficult to
identify which choices improve model performance. To address this issue, we
conduct extensive experiments on pre-trained models, architecture choices,
data, and training methods. Consolidating these findings, we develop
Idefics2, an efficient foundational VLM of 8 billion parameters.
Idefics2 achieves state-of-the-art performance within its size category across
various multimodal benchmarks, and is often on par with models four times its
size. We release the model (base, instructed, and chat variants) along with
the datasets created for its training.
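Since the checkpoints are released on the Hugging Face Hub, here is a minimal sketch of querying the instruction-tuned model with the `transformers` library; the repository id `HuggingFaceM4/idefics2-8b` follows the Hub's naming for this release, and the image URL is a placeholder, not something stated in the abstract.

```python
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Load the released instruction-tuned checkpoint and its processor.
model_id = "HuggingFaceM4/idefics2-8b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

# Placeholder image URL; substitute any accessible image.
image = Image.open(
    requests.get("https://example.com/cat.jpg", stream=True).raw
)

# Build a multimodal chat turn: one image followed by a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# Generate and decode the model's answer.
generated = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```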