OPEN-WEIGHT CODING AGENTS HIT 60%+ SWE-BENCH AND GET EASIER TO RUN ON-PREM
Open-weight coding agents leaped forward as NVIDIA’s Nemotron 3 Super tops SWE-Bench and new research streamlines on‑prem and local runs.
NVIDIA unveiled Nemotron 3 Super, a 120B-parameter hybrid MoE model that scores 60.47% on SWE-Bench Verified and ships with open weights, training recipes, and a 1M-token context window. NVIDIA also reports strong throughput in its own benchmarks (Smart Chunks). The pitch targets enterprises that want agentic coding on their own hardware.
On the research side, CodeScout uses RL to train a code search agent that works with only a Unix shell and posts competitive repo‑level localization results on SWE-Bench; the team open-sourced the code and models (repo).
Local inference got a boost from a community “LLM in a Flash” experiment that streamed MoE experts from SSD to run Qwen3.5‑397B at 5.5+ tok/s on a 48GB M3 Max MacBook, with code and a write‑up shared (Simon Willison, repo). A hands‑on report shows that smaller Qwen3.5 variants are already usable in VS Code via LM Studio and Continue, though they still trail the top cloud IDE copilots (InfoWorld).
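To make the idea of streaming experts from SSD concrete, here is a minimal sketch of loading experts on demand with a small in-memory cache. It is a generic illustration, not the experiment's actual implementation: the `ExpertCache` class, the least-recently-used eviction policy, and the `load_from_disk` callback are all assumptions introduced for this example.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class ExpertCache:
    """Keep at most `capacity` experts in memory, evicting the least recently used."""

    def __init__(self, capacity: int, load_from_disk: Callable[[Hashable], object]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Hypothetical callback: reads one expert's weights from SSD.
        self.load_from_disk = load_from_disk
        self._cache: "OrderedDict[Hashable, object]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: Hashable) -> object:
        """Return the expert's weights, loading them from disk on a cache miss."""
        if expert_id in self._cache:
            self.hits += 1
            self._cache.move_to_end(expert_id)  # mark as most recently used
            return self._cache[expert_id]
        self.misses += 1
        weights = self.load_from_disk(expert_id)
        self._cache[expert_id] = weights
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict the least recently used expert
        return weights
```

Since an MoE router activates only a few experts per token, the ratio of `hits` to `misses` gives a rough sense of how often a token would have to wait on the SSD under this assumed policy.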
Stronger open weights plus RL-driven repo search shrink the gap with closed coding copilots while keeping code on-prem.
Feasible local MoE inference lowers hardware barriers for large models, expanding deployment options outside cloud APIs.
-
Re-run a subset of SWE-Bench Verified, plus comparable tasks from your own codebases: compare Nemotron 3 Super against your current model, with and without a CodeScout-style localization step.
-
Prototype SSD-streamed MoE inference (flash-moe) on a dev workstation/server and measure throughput, latency spikes, and quality vs 4–8 bit baselines.
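For the throughput and latency-spike measurement, a small helper along these lines may be enough to start. It is a sketch that assumes you can record an arrival time in seconds for each generated token; the function name, the returned keys, and the rule that a "spike" is any gap larger than `spike_factor` times the median gap are illustrative choices, not part of any tool named above.

```python
import statistics
from typing import Dict, Sequence


def summarize_token_times(timestamps: Sequence[float], spike_factor: float = 3.0) -> Dict[str, float]:
    """Summarize one generation run from per-token arrival times (in seconds)."""
    if len(timestamps) < 2:
        raise ValueError("need at least two token timestamps")
    total = timestamps[-1] - timestamps[0]
    if total <= 0:
        raise ValueError("timestamps must span a positive duration")
    # Time between consecutive tokens.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    median_gap = statistics.median(gaps)
    # Assumed definition: a spike is a gap well above the typical (median) gap.
    spikes = [gap for gap in gaps if gap > spike_factor * median_gap]
    return {
        "tokens_per_second": len(gaps) / total,
        "median_gap_s": median_gap,
        "max_gap_s": max(gaps),
        "spike_count": len(spikes),
    }
```

Reporting the median and maximum gap alongside tokens per second matters here because SSD streaming can stall on individual tokens even when average throughput looks acceptable.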
Legacy codebase integration strategies...
- 01.
Integrate agent runs with existing CI to gate PRs: require green tests plus agent-suggested patches behind feature flags.
- 02.
Lock down repo access: run models in isolated runners, audit tool use, and capture prompts/diffs for compliance.
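The two points above can be sketched as a small gate plus an audit record. This is an assumption-heavy illustration: `AgentPatch`, `gate_patch`, and `audit_record` are hypothetical names, and how test results and feature flags are actually obtained depends on your CI system.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AgentPatch:
    prompt: str                  # prompt that produced the patch
    diff: str                    # unified diff proposed by the agent
    tests_passed: bool           # result reported by the existing CI test job
    feature_flag: Optional[str]  # flag guarding the change, if any


def gate_patch(patch: AgentPatch) -> Tuple[bool, str]:
    """Allow a patch only if tests are green and the change sits behind a flag."""
    if not patch.tests_passed:
        return False, "tests failed"
    if not patch.feature_flag:
        return False, "patch is not behind a feature flag"
    return True, "ok"


def audit_record(patch: AgentPatch, allowed: bool, reason: str) -> str:
    """Serialize the prompt, diff, and decision as one JSON line for compliance logs."""
    return json.dumps(
        {
            "prompt": patch.prompt,
            "diff": patch.diff,
            # Digests make later tampering with the stored text detectable.
            "prompt_sha256": hashlib.sha256(patch.prompt.encode("utf-8")).hexdigest(),
            "diff_sha256": hashlib.sha256(patch.diff.encode("utf-8")).hexdigest(),
            "feature_flag": patch.feature_flag,
            "allowed": allowed,
            "reason": reason,
        },
        sort_keys=True,
    )
```

A CI step could call `gate_patch`, append the `audit_record` line to a write-once log, and fail the job when the gate returns `False`.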
Fresh architecture paradigms...
- 01.
Design repos for agents: consistent test scaffolds, richer docstrings, and lightweight code maps to aid localization.
- 02.
Build an on-prem agent stack early: retrieval, terminal tools, and an evaluation harness around SWE-Bench-like tasks.
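As one possible reading of a "lightweight code map", the sketch below uses Python's standard `ast` module to list each function and class in a file with its line number and the first line of its docstring. The function name `build_code_map` and the entry fields are assumptions for illustration, and the approach covers Python sources only.

```python
import ast
from typing import Dict, List


def build_code_map(source: str, path: str) -> List[Dict[str, object]]:
    """List functions and classes in `source` with line numbers and docstring summaries."""
    entries: List[Dict[str, object]] = []
    tree = ast.parse(source, filename=path)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            doc = (ast.get_docstring(node) or "").strip()
            entries.append(
                {
                    "path": path,
                    "name": node.name,
                    "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                    "line": node.lineno,
                    # Only the first docstring line, to keep the map compact.
                    "summary": doc.splitlines()[0] if doc else "",
                }
            )
    # ast.walk does not guarantee source order, so sort by line number.
    return sorted(entries, key=lambda entry: entry["line"])
```

An empty `summary` doubles as a cheap signal of where richer docstrings would help an agent localize code.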