https://huggingface.co/google/paligemma-3b-mix-448/discussions/2


MoonRide replied:

Thanks - with your "caption" prompt it produced somewhat better output, so it's also a matter of prompt comprehension, not just the training dataset.

But my point still stands: if a model fails to understand a simple, clear prompt in plain English, and/or struggles to describe an image that the vast majority of people could easily describe, that should be visible in its benchmark scores. If a model gets higher benchmark scores but worse results in practice, then something is missing from the benchmark.

merve replied:

@MoonRide if you check the model card you can see the scores. The mix models are trained on a mixture of academic benchmark datasets (COCO captions, VQAv2, OCR-VQA, etc.), where you just say e.g. "caption" and the model captions. These datasets tend to have short descriptions rather than long prompts, but they are grounded, so the models do well on those benchmarks' test sets and can be used in many industry use cases (document AI, etc., since they hardly hallucinate). For your image, for instance, I simply input "caption" and it came up with a very grounded caption.

The main point of the PaliGemma release is to provide fine-tunable models, not heavy models with broad zero-shot capabilities (where you input very long instruction- or chat-style prompts). So if you want, you can fine-tune a "pt" model on any benchmark of your choice and it should perform well.
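To make the short task-prefix convention concrete, here is a hypothetical helper. The function name and the exact prefix set are assumptions drawn from the examples above and the blog post ("caption", "detect", "segment", "answer", "ocr"), not an official API:

```python
def build_prompt(task: str, arg: str = "", lang: str = "en") -> str:
    """Build a PaliGemma mix-checkpoint prompt from a task name.

    The mix models expect short task prefixes rather than chat-style
    instructions, e.g. "caption en", "detect cat", "answer en where is the cow?".
    """
    if task == "caption":
        return f"caption {lang}"
    if task == "ocr":
        return "ocr"
    if task in ("detect", "segment"):
        if not arg:
            raise ValueError(f"'{task}' needs a target, e.g. build_prompt('{task}', 'cat')")
        return f"{task} {arg}"
    if task == "answer":
        if not arg:
            raise ValueError("'answer' needs a question")
        return f"answer {lang} {arg}"
    raise ValueError(f"unknown task: {task}")
```

For example, `build_prompt("detect", "cat")` yields `"detect cat"`, the short grounded prompt style the mix checkpoints were trained on.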


merve posted an update about 13 hours ago
New open Vision Language Model by @Google: PaliGemma πŸ’™πŸ€

πŸ“ Comes in 3B, pretrained, mix and fine-tuned models in 224, 448 and 896 resolution
🧩 Combination of Gemma 2B LLM and SigLIP image encoder
πŸ€— Supported in transformers

PaliGemma can do...
🧩 Image segmentation and detection! 🀯
πŸ“‘ Detailed document understanding and reasoning
πŸ™‹ Visual question answering, captioning and any other VLM task!
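As a sketch of what the detection output looks like in practice: with a prompt like "detect cat", the model emits four `<locYYYY>` tokens per box (y_min, x_min, y_max, x_max, each binned to 0-1023 relative to the image) followed by the label. The parser below is a hypothetical helper based on that format, not an official API:

```python
import re

# One box: four <locYYYY> tokens, then a label (boxes are separated by ";").
_BOX = re.compile(r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^<;]+)")

def parse_detections(text: str, width: int, height: int):
    """Return (label, (x0, y0, x1, y1)) pixel boxes from PaliGemma detect output."""
    results = []
    for ymin, xmin, ymax, xmax, label in _BOX.findall(text):
        # Loc tokens are normalized to a 0-1023 grid; rescale to pixels.
        y0, x0, y1, x1 = (int(v) / 1024.0 for v in (ymin, xmin, ymax, xmax))
        results.append((label.strip(), (x0 * width, y0 * height, x1 * width, y1 * height)))
    return results
```

For example, `parse_detections("<loc0256><loc0512><loc0768><loc1024> cat", 1024, 1024)` recovers a single "cat" box in pixel coordinates.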

Read our blog πŸ”– hf.co/blog/paligemma
Try the demo πŸͺ€ hf.co/spaces/google/paligemma
Check out the Spaces and the models all in the collection πŸ“š google/paligemma-release-6643a9ffbf57de2ae0448dda
Collection of fine-tuned PaliGemma models google/paligemma-ft-models-6643b03efb769dad650d2dda
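For reference, a minimal sketch of captioning an image with the mix checkpoint through transformers. It assumes a transformers version with PaliGemma support (roughly v4.41+) and that you have accepted the gated license on the Hub and logged in; the image path is a placeholder:

```python
def caption_image(image_path: str, prompt: str = "caption en", max_new_tokens: int = 40) -> str:
    """Caption an image with google/paligemma-3b-mix-448 (weights are gated;
    accept the license on the Hub and run `huggingface-cli login` first)."""
    import torch
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    model_id = "google/paligemma-3b-mix-448"
    processor = AutoProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    image = Image.open(image_path).convert("RGB")
    # Mix checkpoints expect short task prefixes ("caption en", "detect cat", ...)
    # rather than chat-style prompts.
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    with torch.no_grad():
        generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated caption is decoded.
    new_tokens = generated[0][inputs["input_ids"].shape[1]:]
    return processor.decode(new_tokens, skip_special_tokens=True).strip()


if __name__ == "__main__":
    print(caption_image("my_test_image.jpg"))
```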

Nice scores in benchmarks, but it failed at my first test image: https://huggingface.co/google/paligemma-3b-mix-448/discussions/2

It might be something wrong with the demo space configuration, or... we need better benchmarks.


@MoonRide it's not about benchmarks, but the training dataset of the mix checkpoint is different from your use case. I responded on your issue with more details.

Hi! Nice work!
I tried this model and it is more than capable of doing what I thought it could do - it's awesome! I have some questions about the details.
Is the training data mentioned in the blog all of the training data, or did PaliGemma have other training data that is not mentioned?
Is there any plan to open-source a chatty model?