howtonotcode.com

Hugging Face

Company

Hugging Face develops open-source AI tools and models for natural language processing.

5 stories · First seen: 2026-02-11 · Last seen: 2026-03-03 · Website · Wikipedia

Resources

Links to check for updates: homepage, feed, or git repo.

Homepage

Stories

Showing 1-5 of 5

Coding Benchmarks Shake-up: Qwen 3.5, MiniMax M2.5, and a SWE-bench Reality Check

Open models like Alibaba’s Qwen 3.5 and MiniMax M2.5 post strong coding-agent results, but OpenAI’s audit of SWE-bench Verified shows contamination and flawed tests that can mislead real-world adoption. Alibaba’s Qwen 3.5 family uses a sparse MoE design (397B total/17B active), ships open weights under Apache 2.0, and shows strong instruction following and competitive coding scores in public benchmarks, with setup guidance and comparisons to frontier models detailed in this deep-dive guide [Qwen 3.5: The Complete Guide](https://techie007.substack.com/p/qwen-35-the-complete-guide-benchmarks). MiniMax’s latest model claims state-of-the-art coding and agentic performance, faster task completion, and ultra-low runtime cost (about $1/hour at 100 tok/s), alongside reported scores on coding and browsing evaluations [MiniMax-M2.5 on Hugging Face](https://huggingface.co/unsloth/MiniMax-M2.5). OpenAI, however, reports that many SWE-bench Verified tasks have broken tests and that major models were trained on benchmark solutions, halting its use of the metric and urging caution in interpreting scores [OpenAI Abandons SWE-bench Verified](https://blockchain.news/news/openai-abandons-swe-bench-verified-contamination-flawed-tests). For quick, low-cost trials of multiple “top models,” a short explainer points to an Alibaba Cloud coding plan bundling popular options [This $3 AI Coding Plan Gives You Every Top Model You Need](https://www.youtube.com/watch?v=Qnz7S-5fzWo&pp=ygUXbmV3IEFJIG1vZGVsIGZvciBjb2RpbmfSBwkJrgoBhyohjO8%3D).
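The "397B total / 17B active" figure reflects sparse top-k expert routing: each token is processed by only a few experts, so the parameters actually used per token are a small fraction of the total. A toy NumPy sketch of that routing idea (toy sizes, not Qwen's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

n_experts, top_k, d = 8, 2, 16  # toy config, not Qwen's real one
experts = rng.normal(size=(n_experts, d, d)) / np.sqrt(d)  # one weight matrix per expert
router = rng.normal(size=(d, n_experts)) / np.sqrt(d)

def moe_forward(x):
    """Route each token to its top-k experts and mix outputs by softmax weight."""
    logits = x @ router                             # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]   # top-k expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        w = np.exp(logits[t, top[t]])
        w /= w.sum()                                # softmax over the selected experts
        for weight, e in zip(w, top[t]):
            out[t] += weight * (x[t] @ experts[e])
    return out

x = rng.normal(size=(4, d))                         # 4 tokens
y = moe_forward(x)
active_frac = top_k / n_experts                     # fraction of experts active per token
print(y.shape, active_frac)                         # (4, 16) 0.25
```

Scaling the same ratio up is what lets a 397B-parameter model run with roughly 17B parameters of compute per token.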

2026-03-03
qwen-35 alibaba alibaba-cloud minimax-m25 openai

Graph-structured dependency navigation fixes missed-file failures in repo-scale coding agents

New results show that wiring coding agents to traverse a code dependency graph outperforms expanding context or keyword/vector retrieval on architecture-heavy tasks where critical files are semantically distant. An arXiv study introduces the Navigation Paradox: as context windows grow, failures shift from retrieval capacity to navigational salience, and presents CodeCompass, an MCP-based graph tool exposing IMPORTS/INHERITS/INSTANTIATES edges during agent runs with Claude Code; on a FastAPI RealWorld benchmark, BM25 hits 100% on semantic (G1) tasks but gives no lift on hidden-dependency (G3) tasks (78.2% vs 76.2% baseline), while CodeCompass reaches 99.4% ACS on G3, a +23.2 point jump over both baselines ([paper](https://arxiv.org/html/2602.20048v1), [code/benchmark](https://github.com/tpaip607/research-codecompass)). Crucially, benefit depends on tool invocation: trials that actually used the graph (42%) averaged 99.5% ACS; those that skipped it despite instructions scored 80.2%, indistinguishable from vanilla—highlighting that prompt design and agent policies must reliably trigger graph consultation. For teams piloting repo-level agents, treat structural navigation as a first-class capability: generate a per-repo AST-derived dependency graph, expose it via MCP, and enforce early graph lookups when touching modules with broad non-local impact; the author also shares a practitioner-friendly narrative on why assistants miss critical files ([Medium](https://medium.datadriveninvestor.com/why-do-ai-coding-assistants-miss-critical-files-i-built-a-graph-database-to-find-out-9c6c98fe6456?source=rss----32881626c9c9---4)).
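The "per-repo AST-derived dependency graph" step can be sketched with Python's standard `ast` module. This minimal version extracts only IMPORTS edges (the INHERITS/INSTANTIATES edges CodeCompass exposes would need extra node visitors), and the toy file contents are illustrative:

```python
import ast
from collections import defaultdict

def imports_of(source: str) -> set[str]:
    """Extract top-level module names imported by a Python source string."""
    mods = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            mods.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            mods.add(node.module.split(".")[0])
    return mods

def build_graph(files: dict[str, str]) -> dict[str, set[str]]:
    """IMPORTS edges, restricted to modules defined inside the repo."""
    graph = defaultdict(set)
    for name, src in files.items():
        graph[name] = imports_of(src) & files.keys()
    return dict(graph)

# Toy repo: 'app' depends on 'models' only transitively, via 'routes' --
# exactly the hidden-dependency (G3) situation keyword retrieval misses.
repo = {
    "app": "import routes\n",
    "routes": "from models import User\n",
    "models": "class User: ...\n",
}
print(build_graph(repo))
# {'app': {'routes'}, 'routes': {'models'}, 'models': set()}
```

Serving such a graph behind an MCP tool, and instructing the agent to query it before editing, is the pattern the study evaluates.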

2026-02-24
codecompass claude-code mcp fastapi github

Practical LLM efficiency: Magma optimizer, Unsloth on HF Jobs, and NVLink realities

A new wave of efficiency wins—masked optimizers, free small‑model fine‑tuning, and faster GPU interconnects—can cut LLM costs without sacrificing quality. Google proposes masking-based adaptive optimization that outperforms Adam/Muon with negligible overhead and drop‑in simplicity; their Momentum‑aligned gradient masking (Magma) reduced 1B‑scale perplexity versus strong baselines in pretraining experiments, making it a compelling swap for existing pipelines ([paper](https://arxiv.org/abs/2602.15322)). For fast, low‑cost customization, Unsloth + Hugging Face Jobs deliver ~2x faster training and ~60% lower VRAM with free credits for fine‑tuning compact models like LFM2.5‑1.2B, which can be deployed on CPUs/phones; the post walks through submitting HF Jobs and provides a ready SFT script ([guide](https://huggingface.co/blog/unsloth-jobs), [training script](https://huggingface.co/datasets/unsloth/jobs/resolve/main/sft-lfm2.5.py)). At the hardware layer, multi‑GPU throughput is gated by interconnects: within a node, NVLink dwarfs PCIe (A100 ~600 GB/s, H100 ~900 GB/s, Blackwell up to 1.8 TB/s per GPU), so collective ops and DDP settings should match topology to avoid communication bottlenecks ([multi‑GPU overview](https://towardsdatascience.com/how-gpus-communicate/)).
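The interconnect numbers translate directly into gradient-sync cost. Under the standard ring all-reduce model, each of N GPUs moves about 2·(N−1)/N times the payload, so ideal sync time is that traffic divided by link bandwidth. A back-of-envelope sketch using the bandwidths from the article (the payload size is an illustrative assumption):

```python
def allreduce_seconds(payload_gb: float, n_gpus: int, bw_gbps: float) -> float:
    """Ideal ring all-reduce time: each GPU moves 2*(N-1)/N of the payload."""
    traffic = 2 * (n_gpus - 1) / n_gpus * payload_gb
    return traffic / bw_gbps

# Assumed example: fp16 gradients for a 7B-param model ~= 14 GB, synced across 8 GPUs.
payload = 14.0
for name, bw in [("PCIe 4.0 x16 (~32 GB/s)", 32),
                 ("A100 NVLink (~600 GB/s)", 600),
                 ("H100 NVLink (~900 GB/s)", 900)]:
    print(f"{name}: {allreduce_seconds(payload, 8, bw) * 1000:.1f} ms per sync")
```

Real runs add latency and protocol overhead, but the order-of-magnitude gap is why DDP settings should match the node's topology.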

2026-02-20
google hugging-face hugging-face-jobs unsloth nvidia

Open-weight "AI engineer" models arrive: Qwen 3.5, GLM-5, MiniMax M2.5

A new wave of open-weight frontier models now rivals closed systems on coding and long-horizon agent tasks, making self-hosted AI engineer workflows practical for backend and data teams. Alibaba’s Qwen 3.5 ships as an open‑weights Mixture‑of‑Experts model (397B total, 17B active) with multimodal input and a 256K context, alongside a hosted Qwen3.5‑Plus variant offering 1M context and built‑in tools; details and early impressions are summarized by Simon Willison’s write‑up of the [Qwen 3.5 release](https://simonwillison.net/2026/Feb/17/qwen35/#atom-everything) and the official [Qwen blog](https://qwen.ai/blog?id=qwen3.5). Z.ai’s GLM‑5 launched open source with top open-model scores on SWE‑bench‑Verified (77.8) and Terminal Bench 2.0 (56.2), plus long‑context and RL‑driven agent training advances, with the announcement and code at [BusinessWire](https://www.businesswire.com/news/home/20260215030665/en/GLM-5-Launch-Signals-a-New-Era-in-AI-When-Models-Become-Engineers) and the [GitHub repo](https://github.com/zai-org/GLM-5). MiniMax M2.5 claims state‑of‑the‑art coding/agent performance (e.g., 80.2% SWE‑Bench Verified) and aggressive cost/speed on its [Hugging Face card](https://huggingface.co/unsloth/MiniMax-M2.5), while hands‑on videos compare real coding runs for GLM‑5 and M2.5; you can also quickly trial free models via [OpenRouter’s free router](https://openrouter.ai/openrouter/free).
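OpenRouter's router speaks the OpenAI-compatible chat-completions protocol, so trialing any of these models is one POST request. A minimal stdlib sketch that assembles (but does not send) the request; the model slug is illustrative, so check OpenRouter's catalog for exact free-tier IDs:

```python
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Assemble an OpenAI-compatible chat request for OpenRouter."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# Model slug below is an assumption for illustration, not a confirmed ID.
req = build_request("sk-or-...", "minimax/minimax-m2.5:free",
                    "Write a binary search in Python.")
print(req.full_url, json.loads(req.data)["model"])
# Send with urllib.request.urlopen(req) once a real API key is in place.
```

Swapping the `model` string is all it takes to A/B the same coding prompt across Qwen 3.5, GLM-5, and M2.5.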

2026-02-17
qwen35-397b-a17b qwen35-plus qwen-chat alibaba-cloud glm-5

Enterprise LLM fine-tuning is maturing fast—precision up, guardrails required

LLM fine-tuning is getting easier to scale and more precise, but safety, evaluation reliability, and reasoning-compute pitfalls demand stronger guardrails in your ML pipeline. AWS details a streamlined Hugging Face–on–SageMaker path while new research flags safety regressions, more precise activation-level steering, unreliable public leaderboards, reasoning "overthinking" inefficiencies, and limits of multi-source summarization like Perplexity’s aggregation approach ([AWS + HF on SageMaker overview](https://theaireport.net/news/new-approaches-to-llm-fine-tuning-emerge-from-aws-and-academ/)[^1]; [three fine-tuning safety/security/mechanism studies](https://theaireport.net/news/three-new-studies-examine-fine-tuning-safety-security-and-me/)[^2]; [AUSteer activation-unit control](https://quantumzeitgeist.com/ai-steering-made-far-more-precise/)[^3]; [MIT on ranking instability](https://sciencesprings.wordpress.com/2026/02/10/from-the-computer-science-artificial-intelligence-laboratory-csail-and-the-department-of-electrical-engineering-and-computer-science-in-the-school-of-engineering-both-in-the-s/)[^4]; [reasoning models wasting compute](https://www.webpronews.com/the-hidden-cost-of-thinking-harder-why-ai-reasoning-models-sometimes-get-dumber-with-more-compute/)[^5]; [Perplexity multi-source synthesis limits](https://www.datastudios.org/post/can-perplexity-summarize-multiple-web-pages-accurately-multi-source-aggregation-and-quality)[^6]).

[^1]: Adds: Enterprise-oriented path to scale LLM fine-tuning via Hugging Face on SageMaker.
[^2]: Adds: Evidence of safety degradation post-fine-tune, secure code RL alignment approach, and PEFT mechanism insight.
[^3]: Adds: Fine-grained activation-unit steering (AUSteer) for more precise model control.
[^4]: Adds: Study showing LLM leaderboards can be swayed by a few votes, undermining reliability.
[^5]: Adds: Research summary on "overthinking" where more reasoning tokens can hurt accuracy and waste compute.
[^6]: Adds: Analysis of how Perplexity aggregates sources and where summarization can miss nuance.
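Activation-level steering of the kind AUSteer refines has a simple core: at inference time, add a direction vector to a hidden state to push the model toward or away from a trait. A toy NumPy sketch of that core idea (this is not AUSteer's actual unit-selection method, just the baseline steering-vector mechanic):

```python
import numpy as np

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a hidden-state vector along a unit-normalized steering direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(1)
d = 8
h = rng.normal(size=d)   # a hidden-state vector from some layer
v = rng.normal(size=d)   # e.g. mean(activations | trait) - mean(activations | no trait)

h_steered = steer(h, v, alpha=2.0)

# The projection onto the steering direction moves by exactly alpha:
unit = v / np.linalg.norm(v)
print(round(float((h_steered - h) @ unit), 6))   # 2.0
```

Methods like AUSteer improve on this by choosing *which* activation units to move, rather than shifting the whole vector uniformly.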

2026-02-10
amazon-web-services amazon-sagemaker hugging-face perplexity openai