https://huggingface.co/google/paligemma-3b-mix-448/discussions/2


MoonRide replied:

Thanks - with your "caption" prompt it produced somewhat better output, so it's also a matter of prompt comprehension, not just the training dataset.

But my point still stands: if a model fails to understand a simple, clear prompt in plain English, and/or struggles to describe an image that the vast majority of people could easily describe, that should be visible in its benchmark scores. If a model gets higher benchmark scores but worse results in practice, then something is missing from the benchmark.

merve replied:

@MoonRide if you check the model card you can see the scores. The mix models are trained on a mixture of academic benchmark datasets (COCO captions, VQAv2, OCR-VQA, etc.), where you just say e.g. "caption" and the model captions. These datasets tend to have short descriptions rather than long prompts, but they are grounded, so the models do well on those benchmarks' test sets and can be used in many industry use cases (document AI, etc., since they hardly hallucinate). For your image, for instance, I simply input "caption" and it came up with a very grounded caption.

The main point of the PaliGemma release is to provide fine-tunable models, not heavy models with broad zero-shot capabilities (where you input very long instruction- or chat-style prompts). So if you want, you can fine-tune a "pt" model on any benchmark of your choice and it should perform well.
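To make the short task-prefix convention concrete, here is a hypothetical helper. The function name and the exact prefix set are assumptions drawn from the examples above and the blog post ("caption", "detect", "segment", "answer", "ocr"), not an official API:

```python
def build_prompt(task: str, arg: str = "", lang: str = "en") -> str:
    """Build a PaliGemma mix-checkpoint prompt from a task name.

    The mix models expect short task prefixes rather than chat-style
    instructions, e.g. "caption en", "detect cat", "answer en where is the cow?".
    """
    if task == "caption":
        return f"caption {lang}"
    if task == "ocr":
        return "ocr"
    if task in ("detect", "segment"):
        if not arg:
            raise ValueError(f"'{task}' needs a target, e.g. build_prompt('{task}', 'cat')")
        return f"{task} {arg}"
    if task == "answer":
        if not arg:
            raise ValueError("'answer' needs a question")
        return f"answer {lang} {arg}"
    raise ValueError(f"unknown task: {task}")
```

For example, `build_prompt("detect", "cat")` yields `"detect cat"`, the short grounded prompt style the mix checkpoints were trained on.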


merve posted an update about 13 hours ago
New open Vision Language Model by @Google: PaliGemma πŸ’™πŸ€

πŸ“ Comes in 3B, pretrained, mix and fine-tuned models in 224, 448 and 896 resolution
🧩 Combination of Gemma 2B LLM and SigLIP image encoder
πŸ€— Supported in transformers

PaliGemma can do...
🧩 Image segmentation and detection! 🀯
πŸ“‘ Detailed document understanding and reasoning
πŸ™‹ Visual question answering, captioning and any other VLM task!
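As a sketch of what the detection output looks like in practice: with a prompt like "detect cat", the model emits four `<locYYYY>` tokens per box (y_min, x_min, y_max, x_max, each binned to 0-1023 relative to the image) followed by the label. The parser below is a hypothetical helper based on that format, not an official API:

```python
import re

# One box: four <locYYYY> tokens, then a label (boxes are separated by ";").
_BOX = re.compile(r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^<;]+)")

def parse_detections(text: str, width: int, height: int):
    """Return (label, (x0, y0, x1, y1)) pixel boxes from PaliGemma detect output."""
    results = []
    for ymin, xmin, ymax, xmax, label in _BOX.findall(text):
        # Loc tokens are normalized to a 0-1023 grid; rescale to pixels.
        y0, x0, y1, x1 = (int(v) / 1024.0 for v in (ymin, xmin, ymax, xmax))
        results.append((label.strip(), (x0 * width, y0 * height, x1 * width, y1 * height)))
    return results
```

For example, `parse_detections("<loc0256><loc0512><loc0768><loc1024> cat", 1024, 1024)` recovers a single "cat" box in pixel coordinates.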

Read our blog πŸ”– hf.co/blog/paligemma
Try the demo πŸͺ€ hf.co/spaces/google/paligemma
Check out the Spaces and the models all in the collection πŸ“š google/paligemma-release-6643a9ffbf57de2ae0448dda
Collection of fine-tuned PaliGemma models google/paligemma-ft-models-6643b03efb769dad650d2dda
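For reference, a minimal sketch of captioning an image with the mix checkpoint through transformers. It assumes a transformers version with PaliGemma support (roughly v4.41+) and that you have accepted the gated license on the Hub and logged in; the image path is a placeholder:

```python
def caption_image(image_path: str, prompt: str = "caption en", max_new_tokens: int = 40) -> str:
    """Caption an image with google/paligemma-3b-mix-448 (weights are gated;
    accept the license on the Hub and run `huggingface-cli login` first)."""
    import torch
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    model_id = "google/paligemma-3b-mix-448"
    processor = AutoProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    image = Image.open(image_path).convert("RGB")
    # Mix checkpoints expect short task prefixes ("caption en", "detect cat", ...)
    # rather than chat-style prompts.
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    with torch.no_grad():
        generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated caption is decoded.
    new_tokens = generated[0][inputs["input_ids"].shape[1]:]
    return processor.decode(new_tokens, skip_special_tokens=True).strip()


if __name__ == "__main__":
    print(caption_image("my_test_image.jpg"))
```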

Nice scores in benchmarks, but it failed at my first test image: https://huggingface.co/google/paligemma-3b-mix-448/discussions/2

It might be something wrong with the demo space configuration, or... we need better benchmarks.


@MoonRide it's not about benchmarks, but the training dataset of the mix checkpoint is different from your use case. I responded on your issue with more details.

Hi! Nice work!
I tried this model and it is more than capable of doing what I thought it could do - it's awesome! I have some questions about the details.
Is the training data mentioned in the blog all of the training data, or did PaliGemma have other training data that is not mentioned?
Is there any plan to open-source a chatty model?