I care about whether these VLMs can accurately _see_ and _describe_ things in a picture. Meanwhile, the vision part of these benchmarks is mostly extremely basic OCR that any VLM from the past year can handle. The gains in score come from the LM's logic skills improving, not from the actual vision ability improving.
Llama can be trusted to summarize and format information, and some of the other models make OK coding assistants, but when I was showing Ollama off to a friend I struggled to think of anything useful beyond the party trick of "yup, that's what is in the picture".
Obviously it would be useful to blind people, but the hard part is finding a use where the person couldn't just look at the picture themselves. It could possibly be used on a security camera combined with a basic keyword alert, but I imagine there would be a lot of false positives and false negatives.
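As a rough sketch of that idea (everything here is invented for illustration: the model call is a stub, the keyword list is arbitrary, and this kind of naive substring matching is exactly where the false positives and negatives would come from):

```python
ALERT_KEYWORDS = {"person", "intruder", "smoke", "fire"}

def describe_frame(frame_path: str) -> str:
    """Stub for a VLM call (e.g. via Ollama) that returns a text description."""
    return "A person is standing near the back door at night."  # canned output

def keyword_hits(description: str) -> list[str]:
    """Naive keyword alert over the VLM's free-text description."""
    text = description.lower()
    return [kw for kw in ALERT_KEYWORDS if kw in text]

hits = keyword_hits(describe_frame("frame_0001.jpg"))
if hits:
    print(f"ALERT: {', '.join(hits)}")  # -> ALERT: person
```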
Quote from this paper: “Moreover, they (VLMs) frequently deviate from logical reasoning toward conclusions, instead presenting a conclusion prematurely and subsequently attempting to justify it. Given that language models generate responses token-by-token, once an erroneous conclusion is introduced, the model typically continues along a flawed reasoning path.”
Consider their Proposed Method:
"Each stage is initiated at the model’s discretion, without external prompt engineering frameworks or additional prompting. Specifically, we provide the model with four pairs of special tags: <SUMMARY></SUMMARY>, <CAPTION></CAPTION>, <REASONING></REASONING>, and <CONCLUSION></CONCLUSION>.
These tags correspond to summarizing the response approach, describing relevant image content, conducting reasoning, and preparing a final answer, respectively. Upon training, the model autonomously selects these tags as needed, activating each stage based on its own judgment.
As with OpenAI o1 [63], all stages are completed by the model in a single inference pass."
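The paper doesn't spell out how these staged outputs get consumed downstream, but since all four stages arrive in one pass, a minimal sketch of pulling them apart might look like this (the response string is an invented placeholder, not real model output):

```python
import re

STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")

def parse_stages(text: str) -> dict[str, str]:
    """Extract whichever tagged stages the model chose to emit."""
    stages = {}
    for tag in STAGES:
        if match := re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL):
            stages[tag] = match.group(1).strip()
    return stages

response = (
    "<SUMMARY>I will identify the objects and count them.</SUMMARY>"
    "<CAPTION>The image shows three red apples on a table.</CAPTION>"
    "<REASONING>Each distinct apple is counted once: 1, 2, 3.</REASONING>"
    "<CONCLUSION>There are three apples.</CONCLUSION>"
)

print(parse_stages(response)["CONCLUSION"])  # -> There are three apples.
```

Since the model "autonomously selects these tags as needed", a parser shouldn't assume every stage is present, which is why the sketch returns only the stages it actually finds.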
[63]: https://arxiv.org/pdf/2409.18486