howtonotcode.com

GPT-5.2

AI Tool

GPT-5.2 is an advanced model for natural language processing tasks.

9 stories | First seen: 2026-01-06 | Last seen: 2026-03-03

Stories


Coding Benchmarks Shake-up: Qwen 3.5, MiniMax M2.5, and a SWE-bench Reality Check

Open models like Alibaba’s Qwen 3.5 and MiniMax M2.5 post strong coding-agent results, but OpenAI’s audit of SWE-bench Verified found contamination and flawed tests that can make scores misleading for real-world adoption decisions. Alibaba’s Qwen 3.5 family uses a sparse MoE design (397B total/17B active), ships open weights under Apache 2.0, and shows strong instruction following and competitive coding scores in public benchmarks; setup guidance and comparisons to frontier models are detailed in a deep-dive guide ([Qwen 3.5: The Complete Guide](https://techie007.substack.com/p/qwen-35-the-complete-guide-benchmarks)). MiniMax’s latest model claims state-of-the-art coding and agentic performance, faster task completion, and ultra-low runtime cost (about $1/hour at 100 tok/s), alongside reported scores on coding and browsing evaluations ([MiniMax-M2.5 on Hugging Face](https://huggingface.co/unsloth/MiniMax-M2.5)). OpenAI, however, reports that many SWE-bench Verified tasks have broken tests and that major models were trained on benchmark solutions; it has stopped using the metric and urges caution in interpreting scores ([OpenAI Abandons SWE-bench Verified](https://blockchain.news/news/openai-abandons-swe-bench-verified-contamination-flawed-tests)). For quick, low-cost trials of multiple “top models,” a short explainer points to an Alibaba Cloud coding plan bundling popular options ([This $3 AI Coding Plan Gives You Every Top Model You Need](https://www.youtube.com/watch?v=Qnz7S-5fzWo&pp=ygUXbmV3IEFJIG1vZGVsIGZvciBjb2RpbmfSBwkJrgoBhyohjO8%3D)).
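To put the quoted figures in more familiar units, the arithmetic can be sketched as below. This is a rough back-of-the-envelope calculation, not vendor pricing: it assumes the "$1/hour at 100 tok/s" claim means continuous generation at that rate for the full hour, and the function names are illustrative only.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of a sparse MoE model's parameters used per token."""
    return active_params_b / total_params_b


def cost_per_million_tokens(dollars_per_hour: float, tokens_per_second: float) -> float:
    """Convert an hourly runtime cost into dollars per million generated tokens.

    Assumes the model generates at the stated rate for the whole hour.
    """
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Qwen 3.5: 397B total parameters, 17B active per token.
    print(f"Active share: {active_fraction(397, 17):.1%}")  # Active share: 4.3%
    # MiniMax M2.5 claim: about $1/hour at 100 tok/s.
    print(f"Cost: ${cost_per_million_tokens(1.0, 100):.2f} per 1M tokens")  # Cost: $2.78 per 1M tokens
```

Under these assumptions, roughly 4.3% of Qwen 3.5's weights are active per token, and the MiniMax claim works out to about $2.78 per million generated tokens.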

2026-03-03
qwen-35 alibaba alibaba-cloud minimax-m25 openai

Open-weight "AI engineer" models arrive: Qwen 3.5, GLM-5, MiniMax M2.5

A new wave of open-weight frontier models now rivals closed systems on coding and long-horizon agent tasks, making self-hosted AI engineer workflows practical for backend and data teams. Alibaba’s Qwen 3.5 ships as an open‑weights Mixture‑of‑Experts model (397B total, 17B active) with multimodal input and a 256K context, alongside a hosted Qwen3.5‑Plus variant offering 1M context and built‑in tools; details and early impressions are summarized by Simon Willison’s write‑up of the [Qwen 3.5 release](https://simonwillison.net/2026/Feb/17/qwen35/#atom-everything) and the official [Qwen blog](https://qwen.ai/blog?id=qwen3.5). Z.ai’s GLM‑5 launched open source with top open-model scores on SWE‑bench‑Verified (77.8) and Terminal Bench 2.0 (56.2), plus long‑context and RL‑driven agent training advances, with the announcement and code at [BusinessWire](https://www.businesswire.com/news/home/20260215030665/en/GLM-5-Launch-Signals-a-New-Era-in-AI-When-Models-Become-Engineers) and the [GitHub repo](https://github.com/zai-org/GLM-5). MiniMax M2.5 claims state‑of‑the‑art coding/agent performance (e.g., 80.2% SWE‑Bench Verified) and aggressive cost/speed on its [Hugging Face card](https://huggingface.co/unsloth/MiniMax-M2.5), while hands‑on videos compare real coding runs for GLM‑5 and M2.5; you can also quickly trial free models via [OpenRouter’s free router](https://openrouter.ai/openrouter/free).

2026-02-17
qwen35-397b-a17b qwen35-plus qwen-chat alibaba-cloud glm-5

Cursor 2.4.x instability: agent command failures, stalls, and plan-mode popups

Multiple recent reports indicate Cursor 2.4.35–2.4.37 have agent command execution failures, stalls, and plan-mode file refresh loops on macOS and Linux. On macOS (2.4.37), users report the agent cannot run any commands in IDE or CLI; a staff response flags a known sandbox issue and suggests switching Agents > Auto-Run Mode to “Ask Every Time” to disable sandbox as a workaround ([thread](https://forum.cursor.com/t/in-the-cursor-ide-and-cli-when-the-agent-tries-to-run-a-command-all-commands-fail/152020)). On Linux (2.4.35 AppImage), plan tabs repeatedly reopen with “The content of the file is newer” prompts, copy actions fail during generation, navigation lags, and the window crashes (code 132) when GPT-5.2 is running ([thread](https://forum.cursor.com/t/dozend-of-plans-reopen-at-once-with-the-content-of-the-file-is-newer/152041)). Related posts describe agents stalling/getting stuck and ignoring iteration rules, suggesting broader instability across the 2.4.x line ([stalls](https://forum.cursor.com/t/cursor-agent-stalls-gets-stuck/152008), [iteration rules](https://forum.cursor.com/t/cursor-is-ignoring-even-the-most-simple-rules-in-iteration-development-approach/152025), [tag change question](https://forum.cursor.com/t/what-the-think-tag-in-cursor-version-2-5-14-has-been-changed-to/152026)).

2026-02-17
cursor cursor-ide cursor-cli ai-ide agent-workflows

OpenAI Skills + Shell for long‑running agents: patterns and pitfalls

OpenAI’s new Skills and Shell tooling make it easier to ship capability‑scoped, long‑running agents for real backend work, but early adopters report reliability gaps you should engineer around. OpenAI’s cookbook shows how to turn discrete capabilities into reusable Skills that your agent invokes via tool calls, enabling least‑privilege execution and clearer observability ([Skills in API](https://developers.openai.com/cookbook/examples/skills_in_api/)); paired with the “tool‑call render” pattern, this turns a chatty bot into a doer with predictable handoffs ([render pattern explainer](https://dev.to/programmingcentral/the-tool-call-render-pattern-turning-your-ai-from-a-chatty-bot-into-a-doer-4cb2)). For workloads that run minutes to hours, OpenAI’s guidance combines Shell, Skills, and compaction to manage state bloat, retry long steps, and keep transcripts affordable and debuggable ([Shell + Skills + Compaction tips](https://developers.openai.com/blog/skills-shell-tips/)). Plan for rough edges reported by developers: an embedding outage returned all‑zero vectors in text‑embedding‑3‑small, some Assistants API file uploads expired immediately, GPT‑5.2 extended‑thinking had very low tokens/sec for some, and Apps SDK toolInvocation status UI required a widget workaround ([embedding outage](https://community.openai.com/t/embedding-model-outage-text-embedding-3-small-api-ev3-model-name-with-all-0-values/1374079#post_10), [files expiring](https://community.openai.com/t/files-instantly-expiring-upon-upload/1366339#post_5), [slow generation](https://community.openai.com/t/gpt-5-2-extended-thinking-webchat-has-unworkably-slow-token-4-tps-generation/1373185?page=3#post_49), [toolInvocation UI bug](https://community.openai.com/t/bug-meta-openai-toolinvocation-invoking-and-meta-openai-toolinvocation-invoked-not-shown-unless-the-tool-registers-a-widget/1374087#post_1)).
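One way to engineer around the all-zero embedding outage described above is to validate every vector before it is stored or indexed. The following is a minimal sketch under stated assumptions: `validate_embedding` and `DegenerateEmbeddingError` are hypothetical names, not part of any OpenAI SDK, and the vector is assumed to arrive as a plain sequence of floats.

```python
import math
from typing import List, Sequence


class DegenerateEmbeddingError(ValueError):
    """Raised when an embedding is empty, non-finite, or effectively all zeros."""


def validate_embedding(vector: Sequence[float], *, eps: float = 1e-12) -> List[float]:
    """Return the vector as a list if it looks usable, else raise.

    Hypothetical guard: rejects empty vectors, NaN/inf components,
    and vectors whose L2 norm is at or below `eps` (e.g. all zeros).
    """
    if not vector:
        raise DegenerateEmbeddingError("embedding is empty")
    if any(not math.isfinite(x) for x in vector):
        raise DegenerateEmbeddingError("embedding contains NaN or infinity")
    norm = math.sqrt(sum(x * x for x in vector))
    if norm <= eps:
        raise DegenerateEmbeddingError("embedding has zero norm")
    return list(vector)
```

A caller would typically catch `DegenerateEmbeddingError`, skip the write, and retry or alert, so that a provider-side outage does not silently poison a vector index.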

2026-02-12
openai chatgpt assistants-api agents-sdk chatgpt-apps-sdk

OpenAI recommends GPT-5.3-Codex as the default agentic coding model

OpenAI now recommends GPT-5.3-Codex as the default Codex model, signaling a step up in agentic coding and reasoning for real-world engineering. The official Codex Models page highlights GPT-5.3-Codex as the most capable, with GPT-5.2-Codex as its predecessor and a smaller GPT-5.1-Codex-mini option for cost-sensitive tasks ([OpenAI Codex Models](https://developers.openai.com/codex/models/)). An anecdotal report describes spending $10,000 to automate research with Codex, an early sign of substantial Codex-driven automation and spend ([Practitioner report](https://links.tldrnewsletter.com/J7poJAf)).

2026-02-07
openai codex gpt-53-codex gpt-52-codex gpt-51-codex-mini

Codex 0.95–0.96 ship async compaction, rate-limit signals; MassGen adds Codex backend

OpenAI’s Codex app/server shipped 0.95–0.96 with v2 async thread compaction, websocket rate‑limit signaling, expanded skill loading/remote catalogs, shell parallelism, state‑DB correctness, telemetry, and Linux sandbox groundwork ([0.95.0](https://github.com/openai/codex/releases/tag/rust-v0.95.0)[^1], [0.96.0](https://github.com/openai/codex/releases/tag/rust-v0.96.0)[^2]). MassGen now offers a Codex backend with local/Docker modes to orchestrate multi‑agent workflows and MCP tooling ([MassGen v0.1.47](https://github.com/massgen/MassGen/releases/tag/v0.1.47)[^3]). Expect workflow differences vs IDEs (Codex is positioned as an agentic assistant, not a full IDE), and note a Windows PowerShell 5.1 ANSI‑encoding issue affecting Cyrillic output ([video](https://www.youtube.com/watch?v=ts7yQdfBW_U&pp=ygURQ3Vyc29yIElERSB1cGRhdGU%3D)[^4], [forum thread](https://community.openai.com/t/incorrect-cyrillic-rendering-in-codex-agent-on-windows-due-to-powershell-5-1-default-ansi-encoding/1356123#post_5)[^5]).

[^1]: Release notes: skills loading and remote catalogs, macOS `codex app` CLI, shell parallelism, Git safety hardening, TUI improvements, Linux sandbox groundwork.
[^2]: Release notes: `thread/compact` async RPC, websocket `codex.rate_limits` event, `unified_exec` enablement, state DB-first thread listing, telemetry.
[^3]: MassGen adds a Codex backend (local/Docker), native tool architecture, and a quick start to try Codex workflows.
[^4]: Explains Codex app’s agentic workflow vs IDEs like Cursor and how to use it effectively.
[^5]: Documents Windows PowerShell 5.1 ANSI encoding causing Cyrillic rendering issues and workaround considerations.
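The general mechanism behind garbled Cyrillic output like the PowerShell 5.1 issue above is an encoding mismatch: bytes written in one encoding are read back in another. The sketch below illustrates this with Python's built-in codecs. It assumes the ANSI code page involved is Windows-1251 (`cp1251`, common on Russian-locale systems) and that UTF-8 bytes are being read as ANSI; neither detail is confirmed by the story, and the reported bug may run in the opposite direction.

```python
def mis_decode(text: str, wrong_codec: str = "cp1251") -> str:
    """Simulate mojibake: encode as UTF-8, then decode with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


def repair(garbled: str, wrong_codec: str = "cp1251") -> str:
    """Reverse the mistake, assuming every byte survived the wrong decode."""
    return garbled.encode(wrong_codec).decode("utf-8")


if __name__ == "__main__":
    original = "Привет"
    garbled = mis_decode(original)
    print(garbled)          # unreadable text, two characters per Cyrillic letter
    print(repair(garbled))  # round-trips back to the original string
```

The round trip only works when every UTF-8 byte maps to a defined character in the wrong codec (`cp1251` leaves byte 0x98 undefined, for example), so making both sides agree on UTF-8 is generally safer than repairing output after the fact.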

2026-02-04
openai codex massgen cursor claude-code

E2E coding agents: 27% pass, cheaper scaling, and safer adoption

A new end-to-end benchmark, [ProjDevBench](https://arxiv.org/html/2602.01655v1)[^1] with [code](https://github.com/zsworld6/projdevbench)[^2], reports only 27.38% acceptance for agent-built repos, highlighting gaps in system design, complexity, and resource management. Efficiency is improving: [SWE-Replay](https://quantumzeitgeist.com/17-4-percent-performance-swe-replay-achieves-gain-efficient/)[^3] recycles prior agent trajectories to cut test-time compute by up to 17.4% while maintaining or slightly improving fix rates. For evaluation and safety, Together AI shows open LLM judges can beat GPT‑5.2 on preference alignment ([post](https://www.together.ai/blog/fine-tuning-open-llm-judges-to-outperform-gpt-5-2at/))[^4], Java teams get a pragmatic path via [ASTRA‑LangChain4j](https://quantumzeitgeist.com/ai-astra-langchain4j-achieves-llm-integration/)[^6], and an open‑weight coding LM targets agentic/local dev ([Qwen3‑Coder‑Next](https://www.youtube.com/watch?v=UwVi2iu-xyA&pp=ygURU1dFLWJlbmNoIHJlc3VsdHM%3D))[^7].

[^1]: Adds: defines an E2E agent benchmark with architecture, correctness, and refinement criteria plus pass-rate findings.
[^2]: Adds: benchmark repository for tasks, harnesses, and evaluation assets.
[^3]: Adds: test-time scaling via trajectory replay with up to 17.4% cost reduction and small performance gains on SWE-Bench variants.
[^4]: Adds: DPO-tuned open "LLM-as-judge" models outperform GPT‑5.2 on RewardBench 2 preference alignment, with code/how-to.
[^5]: Adds: security analysis of self-propagating adversarial prompts ("prompt worms") and the OpenClaw agent network example.
[^6]: Adds: Java integration pattern for agent+LLM via ASTRA modules and LangChain4J, including BeliefRAG and Maven packaging.
[^7]: Adds: open-weight coding model positioned for agentic workflows and local development.

2026-02-03
projdevbench swe-replay swe-bench-verified swe-bench-pro astra

GitHub Copilot: GPT-5.1 Codex preview, Spaces sharing, and model retirements

GitHub Copilot added a public preview of GPT-5.1-Codex-Max across web, IDE, mobile, and CLI (Enterprise/Business admins must enable it), made Spaces shareable publicly or per user with a code-viewer add-to-Space flow, and refined the VS model picker. Older OpenAI/Anthropic/Google models were retired with suggested replacements, agents gained mission control and skills with broader IDE coverage, and knowledge bases were fully sunset in favor of Spaces.

2026-01-06
github-copilot agentic-ai context-grounding model-lifecycle jetbrains