FASTAPI PUB_DATE: 2026.04.06

PRACTICAL PATTERNS FOR LLM BACKENDS: STREAMING, BACKGROUND JOBS, AND A DUAL‑MODEL SPLIT


A hands-on DEV post shows how to harden an LLM chatbot backend with streaming, background jobs, and a dual-model setup to cut latency and cost.

The author extends a FastAPI + PostgreSQL chatbot with streaming, Dockerization, and a background task that auto-titles conversations via asyncio.create_task, documenting pitfalls and fixes in PR-sized steps.
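The fire-and-forget titling pattern looks roughly like this; `generate_title` is a hypothetical stand-in for the real utility-model call, and the echo reply stands in for the main chat model:

```python
import asyncio

async def generate_title(first_message: str) -> str:
    """Stand-in for a call to a small utility model (hypothetical helper)."""
    await asyncio.sleep(0.01)  # simulate model latency
    return first_message.strip().split("\n")[0][:40] or "New conversation"

async def handle_message(first_message: str) -> tuple[str, str]:
    # Schedule titling in the background so the chat reply isn't blocked on it.
    title_task = asyncio.create_task(generate_title(first_message))
    reply = f"echo: {first_message}"  # main chat-model call would go here
    # In production you'd attach a done-callback that persists the title to
    # PostgreSQL; awaiting here is just to show the result.
    title = await title_task
    return reply, title

reply, title = asyncio.run(handle_message("How do I deploy FastAPI?"))
```

One real-world caveat the create_task pattern carries: keep a reference to the task (or use a done-callback), since the event loop holds tasks only weakly and an unreferenced task can be garbage-collected mid-flight.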

Big lesson: token budget matters for reasoning models. Title generation initially took ~22s; moving that job to a small “utility” model cut it to under 2s. The post also mentions trying newer “gpt-5.4” variants for a 3x speedup, but per the same source, teams should verify availability and parity before adopting them.
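One way to make the dual-model split explicit is a small routing table keyed by task kind; the model names below are placeholders, not the article's actual choices:

```python
# Hypothetical model IDs -- substitute whatever your provider actually offers.
MODEL_ROUTES = {
    "chat": "big-reasoning-model",      # main chat path: quality first
    "title": "small-utility-model",     # metadata jobs: latency/cost first
    "summary": "small-utility-model",
}

def pick_model(task_kind: str) -> str:
    # Default unknown utility work to the cheap model, never the expensive one.
    return MODEL_ROUTES.get(task_kind, "small-utility-model")
```

Defaulting unrouted tasks to the cheap model means a newly added background job can never silently inflate spend on the reasoning model.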

[ WHY_IT_MATTERS ]
01.

Separating utility tasks to smaller models reduces latency and spend without touching the main chat path.

02.

Background jobs and streaming improve perceived performance and production readiness.

[ WHAT_TO_TEST ]
  • 01.

    A/B test dual-model routing: keep main chat on your higher-quality model; move titles/summaries to a cheaper model and measure latency and cost.

  • 02.

    Benchmark streaming vs. non-streaming responses in FastAPI under load; watch event loop backpressure and client time-to-first-token.
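Time-to-first-token can be measured against any async token stream; in the sketch below a fake generator stands in for the model response:

```python
import asyncio
import time

async def fake_stream(n_tokens: int = 5, delay: float = 0.01):
    # Stand-in for a streaming model response (hypothetical).
    for i in range(n_tokens):
        await asyncio.sleep(delay)
        yield f"tok{i} "

async def measure_ttft(stream):
    """Return (time_to_first_token, total_time, tokens) for an async stream."""
    start = time.perf_counter()
    ttft, tokens = None, []
    async for tok in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        tokens.append(tok)
    total = time.perf_counter() - start
    return ttft, total, tokens

ttft, total, tokens = asyncio.run(measure_ttft(fake_stream()))
```

The same harness wraps a real client stream unchanged, which makes the streaming vs. non-streaming A/B comparison a one-line swap.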

[ BROWNFIELD_PERSPECTIVE ]

Legacy codebase integration strategies...

  • 01.

    Start by offloading low-risk background work (titles, summaries) to a utility model via asyncio.create_task and verify idempotency and retries.

  • 02.

    Set per-route token and timeout budgets to avoid starving reasoning-heavy endpoints.
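Per-route budgets can be enforced with asyncio.wait_for; the routes, token limits, and deliberately tiny timeouts below are illustrative only:

```python
import asyncio

# Hypothetical (timeout_seconds, max_tokens) budgets per route;
# timeouts are shrunk here so the demo runs in milliseconds.
ROUTE_BUDGETS = {
    "/chat": (0.5, 4096),
    "/title": (0.05, 64),
}

async def call_with_budget(route: str, coro):
    timeout, _max_tokens = ROUTE_BUDGETS[route]
    try:
        return await asyncio.wait_for(coro, timeout=timeout)
    except asyncio.TimeoutError:
        # Fall back (e.g. a default title) instead of starving other endpoints.
        return None

async def model_call():
    await asyncio.sleep(0.1)  # stand-in for a model request
    return "done"

ok = asyncio.run(call_with_budget("/chat", model_call()))       # within budget
timed_out = asyncio.run(call_with_budget("/title", model_call()))  # over budget
```

The `_max_tokens` half of the budget would be passed to the model client as its output cap, which is what keeps a chatty utility job from consuming reasoning-model throughput.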

[ GREENFIELD_PERSPECTIVE ]

Fresh architecture paradigms...

  • 01.

    Design a dual-model architecture from day one: primary for chat, utility for metadata and summaries, with clear SLAs.

  • 02.

    Containerize the service and enable server-sent events or websockets for streaming to shrink perceived latency.
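For server-sent events, the wire framing can live apart from the framework; a sketch like the generator below would be handed to FastAPI's StreamingResponse with media_type="text/event-stream" (names here are illustrative):

```python
def sse_event(data, event=None):
    """Frame one SSE message: optional 'event:' line, then 'data:', blank line."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    lines.append(f"data: {data}")
    return "\n".join(lines) + "\n\n"

def stream_tokens(tokens):
    # Yield each model token as an SSE frame, then a terminal sentinel
    # so the client knows the stream ended cleanly.
    for tok in tokens:
        yield sse_event(tok, event="token")
    yield sse_event("[DONE]", event="end")

frames = list(stream_tokens(["Hello", "world"]))
```

Keeping the framing pure like this makes it unit-testable without spinning up the ASGI app.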
