howtonotcode.com

Hugging Face

Company

Hugging Face develops open-source AI tools and models for natural language processing.

6 stories · First seen: 2026-02-13 · Last seen: 2026-03-05 · Website · Wikipedia

Resources

Links to check for updates: homepage, feed, or git repo.

Homepage

Stories

Showing 1-6 of 6

Operationalizing Agent Evaluation: SWE-CI + MLflow + OTel Tracing

A new CI-loop benchmark and practical guidance on evaluation and observability outline how to move coding agents from pass/fail demos to production-grade reliability. The SWE-CI benchmark shifts assessment from one-shot bug fixes to long-horizon repository maintenance, requiring multi-iteration changes across realistic CI histories; see the paper and assets on [arXiv](https://arxiv.org/html/2603.03823v1), the [Hugging Face dataset](https://huggingface.co/datasets/skylenage/SWE-CI), and the [GitHub repo](https://github.com/SKYLENAGE-AI/SWE-CI) for tasks averaging 233 days and 71 commits of evolution. Complementing this, MLflow’s guide to [LLM and agent evaluation](https://mlflow.org/llm-evaluation) details using LLM judges, regression checks, and safety/compliance scoring to turn non-deterministic outputs into CI-enforceable quality signals across correctness, relevance, and grounding. For runtime assurance, a hands-on pattern combines agent loop tracing with OpenTelemetry and SigNoz as outlined in this [observability walkthrough](https://hackernoon.com/production-observability-for-multi-agent-ai-with-kaos-otel-signoz?source=rss), while testing/monitoring playbooks from HackerNoon and a roundup of tools like LangSmith, Langfuse, Arize Phoenix, and WhyLabs in this [monitoring guide](https://www.webpronews.com/monitoring-ai-generated-code/) help catch subtle regressions post-deploy; see additional testing tactics in this [strategy piece](https://hackernoon.com/testing-strategies-for-llm-generated-web-development-code?source=rss).

2026-03-05
Tags: mlflow, hugging-face, github, opentelemetry, signoz
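The "CI-enforceable quality signals" idea above can be sketched in a few lines. This is a hypothetical illustration of the pattern, not MLflow's actual API: the dimension names, the 0-1 judge-score scale, and the threshold values are all assumptions.

```python
# Hypothetical sketch: turning per-example LLM-judge scores into a CI gate,
# in the spirit of the regression checks described above. Scores are assumed
# to be floats in [0, 1]; dimensions and thresholds are illustrative.

def evaluate_gate(scores, baseline, min_score=0.7, max_regression=0.05):
    """Fail the build if any dimension's mean score drops below an
    absolute floor or regresses too far from a stored baseline."""
    failures = []
    for dim, values in scores.items():
        mean = sum(values) / len(values)
        if mean < min_score:
            failures.append(f"{dim}: mean {mean:.2f} below floor {min_score}")
        if dim in baseline and baseline[dim] - mean > max_regression:
            failures.append(f"{dim}: regressed {baseline[dim] - mean:.2f} vs baseline")
    return failures

# Example run: grounding regressed versus the last release, correctness held.
scores = {"correctness": [0.9, 0.8, 0.85], "grounding": [0.6, 0.7, 0.65]}
baseline = {"correctness": 0.84, "grounding": 0.80}
failures = evaluate_gate(scores, baseline)
```

In a real pipeline the `scores` dict would come from judge runs recorded by an evaluation framework, and a non-empty `failures` list would fail the CI step.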

MiniMax-M2.5 launches with SOTA coding claims; verify SWE-bench results

MiniMax launched MiniMax-M2.5, a fast, low-cost coding and agentic model, but teams should validate its headline SWE-bench gains with internal tests given recent concerns about benchmark contamination. MiniMax-M2.5 claims state-of-the-art results in coding, agentic tool use, and search—scoring 80.2% on SWE-bench Verified, 51.3% on Multi-SWE-Bench, and 76.3% on BrowseComp—while running 37% faster than M2.1 (matching Claude Opus 4.6 speed) and costing about $1/hour at 100 tokens/sec according to its [Hugging Face card](https://huggingface.co/unsloth/MiniMax-M2.5). OpenAI has ceased reporting on SWE-bench Verified after an audit found flawed tests and evidence of benchmark contamination across major models, suggesting reported gains may reflect training exposure rather than general capability; details are summarized here ([Blockchain.News report](https://blockchain.news/news/openai-abandons-swe-bench-verified-contamination-flawed-tests)). If you trial M2.5, note the card’s operational tips (Unsloth quantization and llama.cpp’s --jinja template) to streamline self-hosting and cost control via the same [Hugging Face source](https://huggingface.co/unsloth/MiniMax-M2.5).

2026-03-04
Tags: minimax-m25, minimax, openai, hugging-face, swe-bench-verified
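The cost claim above is easy to sanity-check with back-of-envelope arithmetic, assuming "tokens" means output generated at the quoted sustained rate:

```python
# Convert an hourly price at a sustained generation rate into the more
# familiar price-per-million-tokens figure. Figures are the ones quoted
# in the summary above ($1/hour at 100 tokens/sec).

def cost_per_million_tokens(dollars_per_hour, tokens_per_second):
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000

m25 = cost_per_million_tokens(1.0, 100)  # ~$2.78 per 1M generated tokens
```

That works out to roughly $2.78 per million generated tokens, a useful baseline when comparing against hosted API pricing.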

Coding Benchmarks Shake-up: Qwen 3.5, MiniMax M2.5, and a SWE-bench Reality Check

Open models like Alibaba’s Qwen 3.5 and MiniMax M2.5 post strong coding-agent results, but OpenAI’s audit of SWE-bench Verified shows contamination and flawed tests that can mislead real-world adoption. Alibaba’s Qwen 3.5 family uses a sparse MoE design (397B total/17B active), ships open weights under Apache 2.0, and shows strong instruction following and competitive coding scores in public benchmarks, with setup guidance and comparisons to frontier models detailed in this deep-dive guide [Qwen 3.5: The Complete Guide](https://techie007.substack.com/p/qwen-35-the-complete-guide-benchmarks). MiniMax’s latest model claims state-of-the-art coding and agentic performance, faster task completion, and ultra-low runtime cost (about $1/hour at 100 tok/s), alongside reported scores on coding and browsing evaluations [MiniMax-M2.5 on Hugging Face](https://huggingface.co/unsloth/MiniMax-M2.5). OpenAI, however, reports that many SWE-bench Verified tasks have broken tests and that major models were trained on benchmark solutions, halting its use of the metric and urging caution in interpreting scores [OpenAI Abandons SWE-bench Verified](https://blockchain.news/news/openai-abandons-swe-bench-verified-contamination-flawed-tests). For quick, low-cost trials of multiple “top models,” a short explainer points to an Alibaba Cloud coding plan bundling popular options [This $3 AI Coding Plan Gives You Every Top Model You Need](https://www.youtube.com/watch?v=Qnz7S-5fzWo&pp=ygUXbmV3IEFJIG1vZGVsIGZvciBjb2RpbmfSBwkJrgoBhyohjO8%3D).

2026-03-03
Tags: qwen-35, alibaba, alibaba-cloud, minimax-m25, openai

Practical LLM efficiency: Magma optimizer, Unsloth on HF Jobs, and NVLink realities

A new wave of efficiency wins—masked optimizers, free small‑model fine‑tuning, and faster GPU interconnects—can cut LLM costs without sacrificing quality. Google proposes masking-based adaptive optimization that outperforms Adam/Muon with negligible overhead and drop‑in simplicity; their Momentum‑aligned gradient masking (Magma) reduced 1B‑scale perplexity versus strong baselines in pretraining experiments, making it a compelling swap for existing pipelines ([paper](https://arxiv.org/abs/2602.15322)). For fast, low‑cost customization, Unsloth + Hugging Face Jobs deliver ~2x faster training and ~60% lower VRAM with free credits for fine‑tuning compact models like LFM2.5‑1.2B, which can be deployed on CPUs/phones; the post walks through submitting HF Jobs and provides a ready SFT script ([guide](https://huggingface.co/blog/unsloth-jobs), [training script](https://huggingface.co/datasets/unsloth/jobs/resolve/main/sft-lfm2.5.py)). At the hardware layer, multi‑GPU throughput is gated by interconnects: within a node, NVLink dwarfs PCIe (A100 ~600 GB/s, H100 ~900 GB/s, Blackwell up to 1.8 TB/s per GPU), so collective ops and DDP settings should match topology to avoid communication bottlenecks ([multi‑GPU overview](https://towardsdatascience.com/how-gpus-communicate/)).

2026-02-20
Tags: google, hugging-face, hugging-face-jobs, unsloth, nvidia
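To make the masking idea concrete, here is an illustrative sketch of one plausible reading of momentum-aligned gradient masking: drop gradient coordinates whose sign disagrees with the running momentum before applying an SGD-style step. This is NOT the paper's exact Magma algorithm (the momentum formulation, masking rule, and hyperparameters here are assumptions); consult the arXiv preprint for the real method.

```python
# Illustrative only: mask out gradient coordinates that point against the
# momentum direction, then take a plain SGD step on the surviving ones.
# All details (beta, lr, the sign-agreement test) are assumptions.

def magma_like_step(params, grads, momentum, lr=0.01, beta=0.9):
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        m_new = beta * m + (1 - beta) * g        # EMA momentum update
        g_masked = g if g * m_new > 0 else 0.0   # keep only aligned coords
        new_params.append(p - lr * g_masked)
        new_momentum.append(m_new)
    return new_params, new_momentum

# The second coordinate's gradient opposes its momentum, so it is masked
# and that parameter is left unchanged this step.
params, momentum = magma_like_step([1.0, 1.0], [0.5, -0.5], [0.4, 0.4])
```

The appeal claimed for this family of methods is that the masking adds negligible overhead on top of an existing optimizer loop, making it a near drop-in change.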

Open-weight "AI engineer" models arrive: Qwen 3.5, GLM-5, MiniMax M2.5

A new wave of open-weight frontier models now rivals closed systems on coding and long-horizon agent tasks, making self-hosted AI engineer workflows practical for backend and data teams. Alibaba’s Qwen 3.5 ships as an open‑weights Mixture‑of‑Experts model (397B total, 17B active) with multimodal input and a 256K context, alongside a hosted Qwen3.5‑Plus variant offering 1M context and built‑in tools; details and early impressions are summarized by Simon Willison’s write‑up of the [Qwen 3.5 release](https://simonwillison.net/2026/Feb/17/qwen35/#atom-everything) and the official [Qwen blog](https://qwen.ai/blog?id=qwen3.5). Z.ai’s GLM‑5 launched open source with top open-model scores on SWE‑bench‑Verified (77.8) and Terminal Bench 2.0 (56.2), plus long‑context and RL‑driven agent training advances, with the announcement and code at [BusinessWire](https://www.businesswire.com/news/home/20260215030665/en/GLM-5-Launch-Signals-a-New-Era-in-AI-When-Models-Become-Engineers) and the [GitHub repo](https://github.com/zai-org/GLM-5). MiniMax M2.5 claims state‑of‑the‑art coding/agent performance (e.g., 80.2% SWE‑Bench Verified) and aggressive cost/speed on its [Hugging Face card](https://huggingface.co/unsloth/MiniMax-M2.5), while hands‑on videos compare real coding runs for GLM‑5 and M2.5; you can also quickly trial free models via [OpenRouter’s free router](https://openrouter.ai/openrouter/free).

2026-02-17
Tags: qwen35-397b-a17b, qwen35-plus, qwen-chat, alibaba-cloud, glm-5
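A quick sizing note on the MoE figures above: the 17B active parameters govern per-token compute, but self-hosting still requires memory for all 397B weights. A rough sketch of the weight footprint (ignoring KV cache, activations, and runtime overhead):

```python
# Back-of-envelope weight memory for the quoted 397B-total / 17B-active
# MoE configuration. Decimal GB; quantization factors are the usual
# bytes-per-parameter values, not vendor-specific figures.

def weights_gb(n_params_billion, bytes_per_param):
    return n_params_billion * 1e9 * bytes_per_param / 1e9

total_bf16 = weights_gb(397, 2)    # bf16: ~794 GB for all weights
total_int4 = weights_gb(397, 0.5)  # 4-bit quantized: ~199 GB
active_bf16 = weights_gb(17, 2)    # only ~34 GB touched per token
```

This is why MoE open-weight releases are attractive: per-token compute resembles a 17B dense model, but the deployment still needs multi-GPU (or aggressively quantized) memory capacity.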

GLM-5 and MiniMax M2.5 push low-cost, agentic coding into production range

Two Chinese releases—Zhipu AI’s GLM-5 and MiniMax M2.5—signal a shift toward affordable, agentic coding models that challenge frontier systems on practical benchmarks. Zhipu AI’s GLM-5 is positioned as an MIT-licensed open model with a native Agent Mode that rivals proprietary leaders on multiple benchmarks, with a deep-dive detailing its pre-launch appearance under a pseudonym and hints from vLLM pull requests ([official overview](https://z.ai/blog/glm-5?_bhlid=d84a093754c9e11cb0d2e9ff416fd99cb5f0e2da), [leak analysis](https://medium.com/reading-sh/glm-5-chinas-745b-parameter-open-source-model-that-leaked-before-it-launched-b2cfbafe99ef?source=rss-8af100df272------2), [weights claim](https://medium.com/ai-software-engineer/glm-5-arrive-with-a-bang-from-vibe-coding-to-agentic-engineering-disrupts-opus-b2b13f02b819)). MiniMax’s M2.5 posts strong results on coding and agentic tasks—80.2% SWE-Bench Verified, 51.3% Multi-SWE-Bench, 76.3% BrowseComp—while running 37% faster than M2.1 and costing roughly $1/hour at 100 tokens/sec (or $0.30/hour at 50 tps), with speed reportedly matching Claude Opus 4.6 ([release details](https://www.minimax.io/news/minimax-m25)). For developer workflows, quick-start videos show GLM-5 (and similarly Kimi K2.5) slotting into Claude Code with minimal setup, lowering trial friction inside existing IDEs ([GLM-5 with Claude Code](https://www.youtube.com/watch?v=Ey-HW-nJBiw&pp=ygURQ3Vyc29yIElERSB1cGRhdGU%3D), [Kimi K2.5 with Claude Code](https://www.youtube.com/watch?v=yZtLwOhmHps&pp=ygURQ3Vyc29yIElERSB1cGRhdGU%3D)).

2026-02-12
Tags: zhipu-ai, glm-5, minimax, minimax-m25, openrouter