Agents ace SWE-bench but stumble on OpenTelemetry tasks
Recent benchmarks show AI agents excel at code-fix tasks but falter on real-world observability work, signaling teams must evaluate agents against domain-specific, production-grade objectives.
Google released Gemini 3.1 Pro with major reasoning gains, a context window up to 1 million tokens, and broad availability across developer and enterprise surfaces.
You can automate RFP scoring and spreadsheet analysis with Gemini today using n8n, while planning around concrete file-format and size limits across Gemini and ChatGPT. An end-to-end n8n workflow shows how to accept vendor PDFs via a form webhook, fetch the RFP from Drive, extract text, merge both streams, call the Gemini API with a structured prompt to return JSON scores, and append results to Sheets—plus Drive auth scopes and download details like alt=media are covered in this guide ([n8n + Gemini RFP evaluation](https://dev.to/hackceleration/building-ai-powered-rfp-evaluation-with-n8n-and-google-gemini-pf5)). For data handling at scale, Gemini supports XLS/XLSX/CSV/TSV and Google Sheets; Gemini chat allows up to 10 files per prompt at 100 MB each, while the Files API permits up to 2 GB per file and 20 GB per project for 48 hours—useful for batch or programmatic flows ([Gemini spreadsheet upload and limits](https://www.datastudios.org/post/google-gemini-spreadsheet-uploading-excel-and-csv-support-data-analysis-capabilities-formula-hand)). If you compare providers, ChatGPT accepts many document and data types but caps file size at 512 MB (with spreadsheet practical limits around ~50 MB) and also enforces token and image-specific ceilings, which can influence provider selection for large artifacts ([ChatGPT file upload limits](https://www.datastudios.org/post/chatgpt-file-uploading-capabilities-supported-file-types-upload-size-limits-rules-and-document-r)).
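The Gemini scoring step in that workflow can be sketched in plain Python: build a structured prompt that asks for JSON scores, then defensively parse the model's reply before appending it to Sheets. This is a minimal sketch under assumptions — the criteria names and helper functions below are illustrative, not taken from the n8n guide.

```python
import json

# Hypothetical scoring rubric; the guide's actual criteria may differ.
CRITERIA = ["technical_fit", "pricing", "compliance"]

def build_scoring_prompt(rfp_text: str, proposal_text: str) -> str:
    """Structured prompt instructing Gemini to return strict JSON scores."""
    return (
        "You are evaluating a vendor proposal against an RFP.\n"
        f"Score each criterion from 1-10: {', '.join(CRITERIA)}.\n"
        'Reply with JSON only, e.g. {"technical_fit": 7, ...}.\n\n'
        f"RFP:\n{rfp_text}\n\nPROPOSAL:\n{proposal_text}"
    )

def parse_scores(reply: str) -> dict:
    """Tolerate models wrapping JSON in markdown fences, then validate keys."""
    cleaned = (
        reply.strip()
        .removeprefix("```json")
        .removeprefix("```")
        .removesuffix("```")
        .strip()
    )
    scores = json.loads(cleaned)
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"model omitted criteria: {missing}")
    return scores
```

In the n8n flow, logic like `parse_scores` would sit in a Code node between the Gemini HTTP call and the Sheets append step, so a malformed model reply fails loudly instead of writing garbage rows.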
OpenAI’s new Skills and Shell tooling make it easier to ship capability‑scoped, long‑running agents for real backend work, but early adopters report reliability gaps you should engineer around. OpenAI’s cookbook shows how to turn discrete capabilities into reusable Skills that your agent invokes via tool calls, enabling least‑privilege execution and clearer observability ([Skills in API](https://developers.openai.com/cookbook/examples/skills_in_api/)); paired with the “tool‑call render” pattern, this turns a chatty bot into a doer with predictable handoffs ([render pattern explainer](https://dev.to/programmingcentral/the-tool-call-render-pattern-turning-your-ai-from-a-chatty-bot-into-a-doer-4cb2)). For workloads that run minutes to hours, OpenAI’s guidance combines Shell, Skills, and compaction to manage state bloat, retry long steps, and keep transcripts affordable and debuggable ([Shell + Skills + Compaction tips](https://developers.openai.com/blog/skills-shell-tips/)). Plan for rough edges reported by developers: an embedding outage returned all‑zero vectors in text‑embedding‑3‑small, some Assistants API file uploads expired immediately, GPT‑5.2 extended‑thinking had very low tokens/sec for some, and Apps SDK toolInvocation status UI required a widget workaround ([embedding outage](https://community.openai.com/t/embedding-model-outage-text-embedding-3-small-api-ev3-model-name-with-all-0-values/1374079#post_10), [files expiring](https://community.openai.com/t/files-instantly-expiring-upon-upload/1366339#post_5), [slow generation](https://community.openai.com/t/gpt-5-2-extended-thinking-webchat-has-unworkably-slow-token-4-tps-generation/1373185?page=3#post_49), [toolInvocation UI bug](https://community.openai.com/t/bug-meta-openai-toolinvocation-invoking-and-meta-openai-toolinvocation-invoked-not-shown-unless-the-tool-registers-a-widget/1374087#post_1)).
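The capability-scoping and render ideas above can be sketched without any OpenAI-specific machinery: a registry that allow-lists which skills the agent may invoke, and a dispatcher that returns compact structured results instead of raw output. The skill names and registry shape here are assumptions for illustration, not OpenAI's Skills API.

```python
import json
from typing import Callable

# Allow-list of capability-scoped "skills" (least-privilege execution):
# a tool call whose name is not registered here is refused outright.
# These entries are hypothetical stand-ins for real backend capabilities.
SKILLS: dict[str, Callable[..., str]] = {
    "fetch_build_status": lambda job_id: f"job {job_id}: passing",
    "restart_worker": lambda worker: f"restarted {worker}",
}

def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Execute one model-emitted tool call and render a compact result.

    The 'render' half of the tool-call render pattern: return small,
    structured strings rather than raw tool output, which keeps
    long-running transcripts cheap before compaction kicks in and
    gives the model predictable handoffs.
    """
    if name not in SKILLS:
        return json.dumps({"error": f"skill '{name}' not permitted"})
    try:
        args = json.loads(arguments_json)
        return json.dumps({"result": SKILLS[name](**args)})
    except (json.JSONDecodeError, TypeError) as exc:
        # Surface malformed arguments back to the model so it can retry.
        return json.dumps({"error": str(exc)})
```

Returning errors as data (rather than raising) matters for long-running agents: the model sees the failure in-band and can retry the step, which is the same posture you would take toward the reliability gaps listed above.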
Anthropic's Claude Opus 4.6 brings multi-agent "Agent Teams" and a 1M-token context while OpenAI's GPT-5.3-Codex counters with faster, stronger agentic coding, together signaling a step change in AI-assisted development. Opus 4.6 adds team-based parallelization in Claude Code, long‑context retrieval gains, adaptive reasoning/effort controls, and Office sidebars, with pricing unchanged [Data Points](https://www.deeplearning.ai/the-batch/claude-opus-4-6-pushes-the-envelope/)[^1], and launch coverage frames initial benchmark leads at release [AI Collective](https://aicollective.substack.com/p/the-brief-anthropics-opus-46-agent)[^2]. OpenAI’s GPT‑5.3‑Codex posts top results on SWE‑Bench Pro and Terminal‑Bench 2.0 and helped debug its own training pipeline [Data Points](https://www.deeplearning.ai/the-batch/claude-opus-4-6-pushes-the-envelope/)[^3], while practitioners surface Claude Code’s new Auto‑Memory behavior and controls for safer long‑running projects [Reddit](https://www.reddit.com/r/ClaudeCode/comments/1qzmofn/how_claude_code_automemory_works_official_feature/)[^4] and Anthropic leaders say AI now writes nearly all of their internal code [India Today](https://www.indiatoday.in/technology/news/story/anthropic-says-ai-writing-nearly-100-percent-code-internally-claude-basically-writes-itself-now-2865644-2026-02-09)[^5].

[^1]: Adds: Opus 4.6 features (1M context), long‑context results, adaptive/effort/compaction API controls, and unchanged pricing.
[^2]: Adds: Agent Teams in Claude Code, Office (Excel/PowerPoint) sidebars, 1M context, and benchmark framing at launch.
[^3]: Adds: GPT‑5.3‑Codex benchmarks, 25% speedup, availability, and self‑use in OpenAI’s training/deployment pipeline.
[^4]: Adds: Concrete Auto‑Memory details (location, 200‑line cap) and disable flag for policy compliance.
[^5]: Adds: Real‑world claim of near‑100% AI‑written internal code at Anthropic, indicating mature SDLC use.