Immediately after switching the page, it will work with CSR.
Please reload your browser to see how it works.

Source:https://github.com/SoraKumo001/next-streaming

lxgr 5 daysReload

As a ChatGPT user, I'm weirdly happy that it's not available there yet. I already have to make a conscious choice between

- 4o (can search the web, use Canvas, evaluate Python server-side, generate images, but has no chain of thought)

- o3-mini (web search, CoT, canvas, but no image generation)

- o1 (CoT, maybe better than o3, but no canvas or web search and also no images)

- Deep Research (very powerful, but I have only 10 attempts per month, so I end up using roughly zero)

- 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers, but slower and request limited, and I don't even know which of the other features it supports)

- 4o "with scheduled tasks" (why on earth is that a model and not a tool that the other models can use!?)

Why do I have to figure all of this out myself?

modeless 5 daysReload

Numbers for SWE-bench Verified, Aider Polyglot, cost per million output tokens, output tokens per second, and knowledge cutoff month/year:

             SWE  Aider Cost Fast Fresh
 Claude 3.7  70%  65%   $15  77   8/24
 Gemini 2.5  64%  69%   $10  200  1/25
 GPT-4.1     55%  53%   $8   169  6/24
 DeepSeek R1 49%  57%   $2.2 22   7/24
 Grok 3 Beta ?    53%   $15  ?    11/24

I'm not sure this is really an apples-to-apples comparison as it may involve different test scaffolding and levels of "thinking". Tokens per second numbers are from here: https://artificialanalysis.ai/models/gpt-4o-chatgpt-03-25/pr... and I'm assuming 4.1 is the speed of 4o given the "latency" graph in the article putting them at the same latency.

Is it available in Cursor yet?

swyx 5 daysReload

don't miss that OAI also published a prompting guide WITH RECEIPTS for GPT 4.1 specifically for those building agents... with a new recommendation for:

- telling the model to be persistent (+20%)

- dont self-inject/parse toolcalls (+2%)

- prompted planning (+4%)

- JSON BAD - use XML or arxiv 2406.13121 (GDM format)

- put instructions + user query at TOP -and- BOTTOM - bottom-only is VERY BAD

- no evidence that ALL CAPS or Bribes or Tips or threats to grandma work

source: https://cookbook.openai.com/examples/gpt4-1_prompting_guide#...

omneity 5 daysReload

I have been trying GPT-4.1 for a few hours by now through Cursor on a fairly complicated code base. For reference, my gold standard for a coding agent is Claude Sonnet 3.7 despite its tendency to diverge and lose focus.

My take aways:

- This is the first model from OpenAI that feels relatively agentic to me (o3-mini sucks at tool use, 4o just sucks). It seems to be able to piece together several tools to reach the desired goal and follows a roughly coherent plan.

- There is still more work to do here. Despite OpenAI's cookbook[0] and some prompt engineering on my side, GPT-4.1 stops quickly to ask questions, getting into a quite useless "convo mode". Its tool calls fails way too often as well in my opinion.

- It's also able to handle significantly less complexity than Claude, resulting in some comical failures. Where Claude would create server endpoints, frontend components and routes and connect the two, GPT-4.1 creates simplistic UI that calls a mock API despite explicit instructions. When prompted to fix it, it went haywire and couldn't handle the multiple scopes involved in that test app.

- With that said, within all these parameters, it's much less unnerving than Claude and it sticks to the request, as long as the request is not too complex.

My conclusion: I like it, and totally see where it shines, narrow targeted work, adding to Claude 3.7 - for creative work, and Gemini 2.5 Pro for deep complex tasks. GPT-4.1 does feel like a smaller model compared to these last two, but maybe I just need to use it for longer.

0: https://cookbook.openai.com/examples/gpt4-1_prompting_guide

marsh_mellow 5 daysReload

From OpenAI's announcement:

> Qodo tested GPT‑4.1 head-to-head against Claude Sonnet 3.7 on generating high-quality code reviews from GitHub pull requests. Across 200 real-world pull requests with the same prompts and conditions, they found that GPT‑4.1 produced the better suggestion in 55% of cases. Notably, they found that GPT‑4.1 excels at both precision (knowing when not to make suggestions) and comprehensiveness (providing thorough analysis when warranted).

https://www.qodo.ai/blog/benchmarked-gpt-4-1/