Early agent benchmarks: Claude leads tool-calling, Gemini 3 Flash rebounds, GPT Mini/Nano lag
A practitioner benchmarked LLMs on real operational tasks (data enrichment, calendar scheduling, CRM clean-up) with minimal prompting and explicit tool specs. Claude was the most reliable at tool-calling but could hit context limits on long tasks; Gemini 3 Flash improved notably and outperformed 3 Pro; GPT Mini/Nano struggled to follow constraints with reasoning disabled. These are early, single-source results, but they map closely to common backend/data-engineering agent patterns.
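The benchmark's actual tool specs aren't published, but "explicit tool specs" typically means JSON-schema function definitions passed alongside the prompt. A minimal illustrative sketch (the `enrich_contact` name and fields are hypothetical, not from the benchmark):

```python
import json

# Hypothetical tool spec in the JSON-schema style common to function-calling
# APIs. The name, description, and fields below are illustrative only.
enrich_contact_tool = {
    "name": "enrich_contact",
    "description": "Look up a CRM contact and fill in missing fields.",
    "input_schema": {
        "type": "object",
        "properties": {
            "email": {
                "type": "string",
                "description": "Contact's email address.",
            },
            "fields": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Missing fields to enrich, e.g. company, title.",
            },
        },
        "required": ["email"],
    },
}

# The spec is serialized and sent with the request so the model emits a
# structured tool call instead of free text.
print(json.dumps(enrich_contact_tool, indent=2))
```

Constraint adherence in this setting means the model's emitted call validates against the schema (required fields present, correct types), which is where the summary reports GPT Mini/Nano faltering without reasoning enabled.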