howtonotcode.com

Kilo

Company

Kilo may refer to: kilo- (k-), a metric prefix denoting a factor of 10³; kilogram (kg), a metric unit of mass

2 stories · First seen: 2026-03-07 · Last seen: 2026-03-07 · Website · Wikipedia

Resources

Links to check for updates: homepage, feed, or git repo.

Homepage

Stories

Showing 1-2 of 2

Getting AI Coding Assistants Right on Large Repos

Hybrid indexing, agentic loops, and model routing—not bigger context windows—are the real keys to making AI coding assistants reliable on large codebases. The [Kilo Blog post](https://blog.kilo.ai/p/ai-coding-assistants-for-large-codebases) argues that context window size is a red herring. Most tools fetch the wrong files, ignore dependency graphs, and reset state on every request. It proposes combining AST/code graphs with vector search to give assistants structural and semantic understanding. It recommends agentic loops so models can plan, act, observe, and self-correct, plus routing work to the right model for each task. The post also offers evaluation guidance and purchase questions for leaders choosing tools. Use it to shape proofs of concept and your platform roadmap.

2026-03-07
kilo ai-coding-assistants code-graph vector-search agentic-loops
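The hybrid indexing idea above can be sketched in a few dozen lines. This is a toy illustration, not Kilo's implementation: the two-file "repo", the bag-of-words stand-in for embeddings, and all function names are assumptions. It builds a structural index from Python's `ast` module (who calls whom), a semantic index over docstrings, and merges the two so retrieval follows dependency edges instead of stopping at textually similar files.

```python
import ast
import math
from collections import Counter

# Hypothetical two-file repo; contents are illustrative only.
REPO = {
    "billing.py": '''
def compute_invoice(items):
    """Sum line items and apply tax to produce an invoice total."""
    return apply_tax(sum(items))

def apply_tax(subtotal):
    """Apply a flat 10 percent tax to a subtotal."""
    return subtotal * 1.10
''',
    "reports.py": '''
def monthly_report(invoices):
    """Aggregate invoice totals into a monthly revenue report."""
    return sum(invoices)
''',
}

def build_code_graph(repo):
    """Structural index: map each function to the functions it calls, via AST."""
    graph = {}
    for path, src in repo.items():
        for node in ast.walk(ast.parse(src)):
            if isinstance(node, ast.FunctionDef):
                calls = {n.func.id for n in ast.walk(node)
                         if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)}
                graph[node.name] = {"file": path, "calls": calls,
                                    "doc": ast.get_docstring(node) or ""}
    return graph

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def vector_search(graph, query, k=2):
    """Semantic index: rank functions by bag-of-words cosine over docstrings
    (a stand-in for real embeddings)."""
    qv = Counter(query.lower().split())
    scored = [(cosine(qv, Counter(info["doc"].lower().split())), name)
              for name, info in graph.items()]
    return [name for score, name in sorted(scored, reverse=True)[:k] if score > 0]

def hybrid_retrieve(graph, query):
    """Seed with semantic hits, then expand along call edges so the assistant
    sees structurally related code, not just textually similar code."""
    seeds = vector_search(graph, query)
    expanded = set(seeds)
    for name in seeds:
        expanded |= graph[name]["calls"] & graph.keys()
    return expanded

graph = build_code_graph(REPO)
print(hybrid_retrieve(graph, "apply tax to an invoice"))
```

The call-edge expansion is the point: a pure vector search over a large repo can return the "right-looking" file while missing the helper it depends on, which is exactly the wrong-files failure mode the post describes.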

Benchmarks Are Breaking: Evaluate LLMs in Your Harness, Not Theirs

LLM benchmark scores are failing under real-world conditions, so choose and tune models by testing them in your own harness with controlled tools and web access. Recent releases keep resetting the leaderboard, yet no model wins everything; the release cadence and the divergence in cost both demand context-specific picks, as seen in the side-by-side analyses from the Kilo team and Clarifai's deep 2026 guide ([Benchmarking the Benchmarks](https://blog.kilo.ai/p/benchmarking-the-benchmarks-new-gpt), [MiniMax M2.5 vs GPT-5.2 vs Claude Opus 4.6 vs Gemini 3.1 Pro](https://www.clarifai.com/blog/minimax-m2.5-vs-gpt-5.2-vs-claude-opus-4.6-vs-gemini-3.1-pro)). Anthropic's engineers showed real contamination and even eval awareness on BrowseComp, where Claude Opus 4.6 identified the benchmark itself, undermining the reliability of static, web-enabled tests ([Eval awareness in Claude Opus 4.6's BrowseComp performance](https://www.anthropic.com/engineering/eval-awareness-browsecomp)). Harness design changes outcomes dramatically: one report saw the same model swing from 78% to 42% when moved between environments, reinforcing that tool access, memory, and isolation drive results, not just the model label ([harness audit and prompts](https://natesnewsletter.substack.com/p/same-model-78-vs-42-the-harness-made)). Bigger models will not fix flaky terminal agents either; reliability depends on constrained execution, telemetry, and task design, pushing teams toward bespoke evals like SWE-rebench and pragmatic agent limits ([Bigger Models Won't Fix Terminal Agents](https://hackernoon.com/bigger-models-wont-fix-terminal-agents), [JetBrains Research podcast with SWE-rebench discussion](https://www.youtube.com/watch?v=-G3e0qffIPE&t=2020s&pp=ygURU1dFLWJlbmNoIHJlc3VsdHM%3D)).

2026-03-07
anthropic openai google clarifai jetbrains
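The "evaluate in your own harness" advice reduces to a small loop. The sketch below is a minimal, assumption-laden illustration: the "model" is a stub callable, and the task format, verifier style, and all names are invented here, not any vendor's API. It runs each task in a fresh call with only the tools you grant, scores against your own verifiers, and shows how tool access alone can swing a score, echoing the 78%-vs-42% harness effect cited above.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # your own verifier, not a public leaderboard's

@dataclass
class HarnessResult:
    passed: int = 0
    total: int = 0
    transcript: list = field(default_factory=list)  # telemetry for later audits

def run_harness(model, tasks, tools):
    """Run each task in a fresh call with a fixed tool registry, and score
    against in-house verifiers."""
    result = HarnessResult(total=len(tasks))
    for task in tasks:
        answer = model(task.prompt, tools)  # fresh call: no shared state
        ok = task.check(answer)
        result.passed += ok
        result.transcript.append({"prompt": task.prompt, "answer": answer, "ok": ok})
    return result

# Stub "model" that only succeeds when the harness grants a calculator tool,
# illustrating that harness design (tool access), not the model label,
# can decide the score.
def stub_model(prompt, tools):
    if "calc" in tools and prompt.startswith("compute "):
        return str(tools["calc"](prompt.removeprefix("compute ")))
    return "unknown"

tasks = [
    Task("compute 6*7", lambda a: a == "42"),
    Task("compute 2+2", lambda a: a == "4"),
]

with_tools = run_harness(stub_model, tasks, {"calc": lambda expr: eval(expr)})
without_tools = run_harness(stub_model, tasks, {})
print(with_tools.passed, without_tools.passed)  # identical model, different harness, different score
```

Swap the stub for real model calls and the lambdas for your production tasks; the transcript field is where constrained-execution telemetry would accumulate for the kind of audit the harness report describes.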