PRACTICAL PATTERNS FOR LLM BACKENDS: STREAMING, BACKGROUND JOBS, AND A DUAL‑MODEL SPLIT
A hands-on DEV post shows how to harden an LLM chatbot backend with streaming, background jobs, and a dual-model setup to cut latency and cost.
The author extends a FastAPI + PostgreSQL chatbot with streaming, Dockerization, and a background task that auto-titles conversations via asyncio.create_task, documenting pitfalls and fixes in PR-sized steps.
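The fire-and-forget pattern behind the auto-titling job can be sketched as below. This is a minimal illustration, not the article's code: `generate_title`, `save_title`, and the dict-backed store are hypothetical stand-ins for the real model call and database write.

```python
import asyncio

async def generate_title(first_message: str) -> str:
    # Stand-in for a call to a small utility model; here we just
    # truncate the opening message.
    await asyncio.sleep(0)  # simulate an async model round-trip
    return first_message[:40].strip() or "New conversation"

async def save_title(conversation_id: int, title: str, store: dict) -> None:
    store[conversation_id] = title  # stand-in for a DB UPDATE

async def schedule_auto_title(conversation_id: int, first_message: str,
                              store: dict) -> asyncio.Task:
    async def _job() -> None:
        try:
            title = await generate_title(first_message)
            await save_title(conversation_id, title, store)
        except Exception:
            # A background failure must never crash the request path;
            # a real service would log it here.
            pass
    # create_task returns immediately, so the chat response is not delayed.
    return asyncio.create_task(_job())

async def main() -> dict:
    store: dict = {}
    task = await schedule_auto_title(1, "How do I stream tokens from FastAPI?", store)
    await task  # awaited here only so the demo can inspect the result
    return store
```

In production the returned task is not awaited on the request path; keeping a reference to it (and logging exceptions) avoids silent garbage collection of the job.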
Big lesson: token budget matters for reasoning models. Title generation initially took ~22s; switching that job to a small “utility” model dropped it to under 2s. The post also mentions trying newer “gpt-5.4” variants for a 3x speedup, but teams should verify availability and parity before adopting them.
Separating utility tasks to smaller models reduces latency and spend without touching the main chat path.
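A dual-model split can be as simple as a routing table in front of the model client. A minimal sketch, with illustrative model names and task labels that are assumptions, not the article's configuration:

```python
# Hypothetical routing table: task labels and model names are illustrative.
UTILITY_TASKS = {"title", "summary", "tags"}

# Per-task output budgets keep cheap jobs cheap (numbers are examples).
MAX_TOKENS = {"title": 32, "summary": 128, "chat": 2048}

def pick_model(task: str) -> tuple[str, int]:
    """Route metadata work to a small model; chat stays on the primary."""
    model = "small-utility-model" if task in UTILITY_TASKS else "primary-chat-model"
    return model, MAX_TOKENS.get(task, 2048)
```

Because routing happens outside the chat handler, the main conversation path is untouched, which is what makes the change low-risk to ship.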
Background jobs and streaming improve perceived performance and production readiness.
A/B test dual-model routing: keep main chat on your higher-quality model; move titles/summaries to a cheaper model and measure latency and cost.
Benchmark streaming vs. non-streaming responses in FastAPI under load; watch event loop backpressure and client time-to-first-token.
Legacy codebase integration strategies...
- 01. Start by offloading low-risk background work (titles, summaries) to a utility model via asyncio.create_task, and verify idempotency and retries.
- 02. Set per-route token and timeout budgets to avoid starving reasoning-heavy endpoints.
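A per-route timeout budget can be enforced with `asyncio.wait_for`, which cancels the awaited work when the budget is exceeded. A sketch with demo-scale numbers (the budgets below are illustrative, not real SLAs):

```python
import asyncio

# Hypothetical per-route budgets in seconds; demo-scale values.
ROUTE_BUDGETS = {"chat": 60.0, "title": 0.05}

async def run_with_budget(route: str, coro):
    """Run coro but cancel it if it exceeds the route's timeout budget."""
    try:
        return await asyncio.wait_for(coro, timeout=ROUTE_BUDGETS[route])
    except asyncio.TimeoutError:
        # A real service would log the overrun and return a fallback
        # (e.g. a default title) instead of None.
        return None

async def slow_title():
    await asyncio.sleep(0.2)  # blows the 0.05 s title budget
    return "never reached"
```

Keeping the budgets in one table makes it easy to tighten utility routes aggressively while leaving the reasoning-heavy chat route room to think.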
Fresh architecture paradigms...
- 01. Design a dual-model architecture from day one: primary for chat, utility for metadata and summaries, with clear SLAs.
- 02. Containerize the service and enable server-sent events or websockets for streaming to shrink perceived latency.