Model         SWE-bench  Aider  Cost ($/M out)  Tok/s  Cutoff
Claude 3.7    70%        65%    $15             77     8/24
Gemini 2.5    64%        69%    $10             200    1/25
GPT-4.1       55%        53%    $8              169    6/24
DeepSeek R1   49%        57%    $2.2            22     7/24
Grok 3 Beta   ?          53%    $15             ?      11/24
I'm not sure this is really an apples-to-apples comparison, as it may involve different test scaffolding and levels of "thinking". Tokens-per-second numbers are from https://artificialanalysis.ai/models/gpt-4o-chatgpt-03-25/pr... and I'm assuming 4.1 runs at the speed of 4o, given that the "latency" graph in the article puts them at the same latency.

Is it available in Cursor yet?
- telling the model to be persistent (+20%)
- don't self-inject/parse tool calls (+2%)
- prompted planning (+4%)
- JSON BAD - use XML or arXiv 2406.13121 (GDM format)
- put instructions + user query at TOP -and- BOTTOM - bottom-only is VERY BAD
- no evidence that ALL CAPS or Bribes or Tips or threats to grandma work
source: https://cookbook.openai.com/examples/gpt4-1_prompting_guide#... (rough sketch of that layout below)
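For what it's worth, here's roughly how I'd translate a few of those tips into code. This is not the cookbook's template, just my reading of it: the build_prompt helper and the exact persistence/planning wording are my own.

```python
# Sketch of a prompt layout following the tips above (my own wording, not the
# cookbook's): persistence + planning reminders, and the instructions/query
# repeated at the TOP and the BOTTOM of a long context.
from openai import OpenAI

client = OpenAI()

INSTRUCTIONS = """\
You are a coding agent. Keep going until the user's request is fully resolved
before ending your turn (persistence). Plan before each tool call and reflect
on the result of the previous one (prompted planning).
"""

def build_prompt(instructions: str, context: str, user_query: str) -> str:
    # Instructions + query go at both ends; bottom-only placement did much
    # worse in OpenAI's long-context tests.
    return (
        f"{instructions}\n{user_query}\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"{instructions}\n{user_query}"
    )

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{
        "role": "user",
        "content": build_prompt(INSTRUCTIONS, "<long retrieved docs here>",
                                "Refactor the auth module."),
    }],
)
print(resp.choices[0].message.content)
```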
My takeaways:
- This is the first model from OpenAI that feels relatively agentic to me (o3-mini sucks at tool use, 4o just sucks). It seems to be able to piece together several tools to reach the desired goal and follows a roughly coherent plan.
- There is still more work to do here. Despite OpenAI's cookbook[0] and some prompt engineering on my side, GPT-4.1 stops too quickly to ask questions, getting into a quite useless "convo mode". Its tool calls fail way too often as well, in my opinion (a sketch of the kind of tool loop I'm running is below).
- It's also able to handle significantly less complexity than Claude, resulting in some comical failures. Where Claude would create server endpoints, frontend components and routes, and connect them, GPT-4.1 creates a simplistic UI that calls a mock API despite explicit instructions. When prompted to fix it, it went haywire and couldn't handle the multiple scopes involved in that test app.
- With that said, within all these parameters, it's much less unnerving than Claude and it sticks to the request, as long as the request is not too complex.
My conclusion: I like it, and I totally see where it shines: narrow, targeted work. It slots in alongside Claude 3.7 for creative work and Gemini 2.5 Pro for deep, complex tasks. GPT-4.1 does feel like a smaller model compared to those two, but maybe I just need to use it for longer.
0: https://cookbook.openai.com/examples/gpt4-1_prompting_guide
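For context, the loop I'm testing with looks roughly like this. It's a minimal sketch, not the cookbook's code: the run_shell tool, the system prompt wording, and the task are placeholders I made up, using the standard Chat Completions function-calling API.

```python
# Minimal agentic tool-call loop sketch. The run_shell tool and prompts are
# hypothetical; the point is the shape of the loop, not a real harness.
import json
import subprocess
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

messages = [
    {"role": "system", "content": "You are a coding agent. Keep working until "
     "the task is done; don't stop to ask questions unless truly blocked."},
    {"role": "user", "content": "Add a /health endpoint and wire the frontend "
     "check to it."},
]

while True:
    resp = client.chat.completions.create(
        model="gpt-4.1", messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:            # model is done (or dropped into "convo mode")
        print(msg.content)
        break
    messages.append(msg)              # keep the assistant's tool-call turn in context
    for tc in msg.tool_calls:
        args = json.loads(tc.function.arguments)
        out = subprocess.run(args["command"], shell=True,
                             capture_output=True, text=True)
        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "content": (out.stdout + out.stderr)[-4000:],  # truncate long output
        })
```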
> Qodo tested GPT‑4.1 head-to-head against Claude Sonnet 3.7 on generating high-quality code reviews from GitHub pull requests. Across 200 real-world pull requests with the same prompts and conditions, they found that GPT‑4.1 produced the better suggestion in 55% of cases. Notably, they found that GPT‑4.1 excels at both precision (knowing when not to make suggestions) and comprehensiveness (providing thorough analysis when warranted).
- 4o (can search the web, use Canvas, evaluate Python server-side, generate images, but has no chain of thought)
- o3-mini (web search, CoT, canvas, but no image generation)
- o1 (CoT, maybe better than o3, but no canvas or web search and also no images)
- Deep Research (very powerful, but I have only 10 attempts per month, so I end up using roughly zero)
- 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers, but slower and request limited, and I don't even know which of the other features it supports)
- 4o "with scheduled tasks" (why on earth is that a model and not a tool that the other models can use!?)
Why do I have to figure all of this out myself?