MoonRide (2024-05-14): It might be something wrong with the demo space configuration, or... we need better benchmarks.
merve (2024-05-14): @MoonRide it's not about benchmarks, but the training dataset of the mix checkpoint is different from your use case. I responded on your issue with more details.
MoonRide (2024-05-14): Thanks - with your "caption" prompt it provided a somewhat better output - so it's also prompt comprehension, not just the training dataset.

But my point still stands - if a model fails to understand a simple, clear prompt in plain English, and/or struggles to describe an image that the vast majority of human beings could easily describe, then that should be visible in its benchmark score. If I get a higher benchmark score but worse results in practice, then something is missing from the benchmark.
merve (2024-05-14): @MoonRide if you check the model card you can see the scores. The mix models are trained on a mix of academic benchmark datasets (COCO Captions, VQAv2, OCR-VQA, etc.), where you just say e.g. "caption" and it captions. These datasets often have short descriptions rather than long prompts, but they are grounded, so the models do well on the test sets of those benchmarks and can be used in many industry use cases (document AI etc., since they hardly hallucinate). For your prompt, I just input "caption" and it came up with a very grounded caption, for instance.

The main point of the PaliGemma release is to provide fine-tuneable models, not heavy models with wide zero-shot capabilities (where you input super long instruction or chat-like prompts). So if you want, you can fine-tune a "pt" model on any benchmark of your choice and it should perform well.
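For readers who want to try the short task-style prompting described above, here is a minimal, untested sketch using the `transformers` PaliGemma classes. The exact checkpoint id, image path, and helper name are illustrative assumptions; the key point is the bare "caption" prompt rather than a long chat-style instruction.

```python
# Sketch: captioning with a PaliGemma "mix" checkpoint via transformers.
# The short task prefix ("caption") matches the academic datasets the mix
# checkpoints were trained on; long chat-style instructions do not.

MODEL_ID = "google/paligemma-3b-mix-224"  # assumed mix checkpoint id
PROMPT = "caption"  # bare task prefix, not a chat-style instruction


def caption_image(image_path: str) -> str:
    """Return a short grounded caption for the image at image_path."""
    # Heavy dependencies are imported lazily so the module loads without them.
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = PaliGemmaForConditionalGeneration.from_pretrained(MODEL_ID)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=PROMPT, images=image, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=50)

    # Skip the echoed prompt tokens and decode only the generated caption.
    prompt_len = inputs["input_ids"].shape[1]
    return processor.decode(output[0][prompt_len:], skip_special_tokens=True)


# Example usage (requires transformers, torch, pillow and a local image):
# print(caption_image("example.jpg"))
```

Swapping `PROMPT` for a paragraph-long instruction is exactly the mismatch discussed in this thread: the mix checkpoints were not trained on that prompt style.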
Cuiunbo (2024-05-15): Hi! Nice work! I tried this model and it is more than capable of doing what I thought it could do - it's awesome! I have some questions about the details. Is the training data mentioned in the blog all of the training data, or did PaliGemma have any other training data that is not mentioned? Is there any plan to open-source a chatty model?