
Latency and UX in LLM Products: Lessons from 2023–24

Thinkscoop Engineering · Aug 22, 2024 · 13 min read

LLMs are slower than traditional APIs, but users are surprisingly tolerant when the UX is honest, responsive, and designed around perceived speed.

From 2022–24, the biggest complaint we heard from early users of AI products was not that outputs were wrong - it was that the experience felt slow and unresponsive. Underlying model latency improved only intermittently over that period; the biggest wins came from UX decisions that respected the user's time and attention rather than exposing them to the raw mechanics of inference.

Design for Perceived, Not Absolute, Speed

Streaming partial responses, showing token-by-token typing, and revealing structure early made even 3–4 second responses feel fast. Silent spinners made 1.5 seconds feel like forever. The perception of speed is shaped almost entirely by feedback quality - how much the interface communicates that work is in progress and how quickly useful content starts appearing.

A response that starts streaming within 300ms and takes 5 seconds to complete feels faster than one that shows nothing for 1.5 seconds and then dumps the full output at once. The first byte of content matters as much as the total response time. Optimising time to first token as aggressively as total completion time became a standard part of our LLM product design checklist by 2024.
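As a minimal sketch of why time to first token deserves its own measurement, the `measure_stream` helper below (a hypothetical name, not a real client API) wraps any token iterator and records both numbers separately:

```python
import time
from typing import Iterator, List, Tuple

def measure_stream(tokens: Iterator[str]) -> Tuple[List[str], float, float]:
    """Consume a token stream, recording time to first token (TTFT)
    and total wall-clock time from the same start point."""
    start = time.monotonic()
    ttft = 0.0
    collected: List[str] = []
    for tok in tokens:
        if not collected:
            ttft = time.monotonic() - start  # first content the user sees
        collected.append(tok)
    total = time.monotonic() - start
    return collected, ttft, total

def fake_model() -> Iterator[str]:
    # Stand-in for a streaming model client; yields tokens with delays.
    for tok in ["Hello", " ", "world"]:
        time.sleep(0.01)
        yield tok

tokens, ttft, total = measure_stream(fake_model())
```

In practice the same two timestamps would be emitted as separate metrics, so dashboards can show TTFT regressing even when total time holds steady.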

Streaming: Implementation Considerations

  • Buffer a few tokens before starting to render to avoid single-character flicker on fast connections
  • Render markdown structure (headers, bullets) as soon as the opening syntax is detected - do not wait for the full block
  • Show a typing indicator or pulse animation in the first 100–200ms before tokens arrive
  • For long structured outputs, use a skeleton layout that hints at the final structure while content fills in
  • Handle partial JSON outputs from tool-use models carefully - display a clear loading state for structured data sections
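The first bullet - buffering a few tokens before rendering - can be sketched as follows. `buffered_render` and the returned list of "paint calls" are illustrative, not a real rendering API:

```python
from typing import Iterator, List

def buffered_render(tokens: Iterator[str], min_buffer: int = 3) -> List[str]:
    """Hold back the first few tokens, flush them in one paint to avoid
    single-character flicker, then render token-by-token afterwards."""
    flushes: List[str] = []    # each entry represents one render/paint call
    buffer: List[str] = []
    started = False
    for tok in tokens:
        if not started:
            buffer.append(tok)
            if len(buffer) >= min_buffer:
                flushes.append("".join(buffer))  # one combined first paint
                buffer.clear()
                started = True
        else:
            flushes.append(tok)
    if buffer:  # stream ended before the buffer filled
        flushes.append("".join(buffer))
    return flushes

# buffered_render(iter("Hello"), 3) → ["Hel", "l", "o"]
```

The same idea applies to markdown: once the buffer contains an opening `#` or `- `, the structural element can be rendered immediately while the rest streams in.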

Loading States That Set the Right Expectations

Good loading states are not just spinners. They communicate what kind of work is happening and set a realistic expectation for how long it will take. For agent systems involving multiple steps, showing each step as it starts - Searching documents, Analysing results, Drafting response - dramatically improved user patience and satisfaction, even when total time was longer.
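The step-announcement pattern can be sketched as a pipeline that notifies the UI when each step *starts*, not when it finishes. `run_agent_steps` and the step functions here are hypothetical stand-ins:

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable]

def run_agent_steps(steps: List[Step], on_step: Callable[[str], None]):
    """Run named pipeline steps in order, announcing each one as it
    begins so the UI can show live progress."""
    result: Optional[object] = None
    for name, fn in steps:
        on_step(name)      # e.g. render "Searching documents…" immediately
        result = fn(result)
    return result

shown: List[str] = []
steps: List[Step] = [
    ("Searching documents", lambda _: ["doc1", "doc2"]),
    ("Analysing results", lambda docs: f"{len(docs)} relevant docs"),
    ("Drafting response", lambda summary: f"Based on {summary}: ..."),
]
answer = run_agent_steps(steps, shown.append)
```

Announcing on start rather than on completion is the whole trick: the user sees movement during the slow step, not after it.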

What users actually said

In one usability study from 2024, users who saw step-by-step progress indicators rated a 7-second response as 'fast' and 'thorough'. The same users, shown a silent spinner for the same 7 seconds, rated the experience as 'broken' or 'too slow to use'. Same latency, completely different perception.

Task-Level vs. Call-Level Latency

Engineering teams optimising for LLM latency often focused on individual model call times. But users experience task-level latency - the time from sending a request to having what they need. For agent systems, that includes multiple model calls, tool executions, and retries. Optimising a single model call from 1.2s to 0.9s while adding two extra retrieval steps produces a net regression from the user's perspective.

Tracking task-level completion time as a first-class metric - separate from p95 latency per call - was one of the changes that most improved product decisions. It made the true cost of architectural complexity visible and created better incentives for simplifying agent pipelines.
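One way to make the comparison concrete is a span object that times the whole task alongside each call inside it. `TaskSpan` is a hypothetical sketch, not a real tracing API:

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class TaskSpan:
    """Times a whole user task while also recording per-call durations,
    so call-level wins can be checked against task-level reality."""
    start: float = field(default_factory=time.monotonic)
    call_durations: List[float] = field(default_factory=list)

    def record_call(self, fn: Callable, *args):
        t0 = time.monotonic()
        out = fn(*args)
        self.call_durations.append(time.monotonic() - t0)
        return out

    def task_seconds(self) -> float:
        # Includes orchestration overhead between calls, retries, and
        # tool executions - everything the user actually waits through.
        return time.monotonic() - self.start

span = TaskSpan()
span.record_call(time.sleep, 0.01)   # stand-in for a model call
span.record_call(time.sleep, 0.01)   # stand-in for a retrieval step
```

The gap between `task_seconds()` and the sum of `call_durations` is exactly the architectural overhead that call-level dashboards hide.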

Timeouts, Fallbacks, and Graceful Degradation

  1. Set aggressive client-side timeouts (15–30s depending on expected task complexity) and surface a clear message when they fire
  2. Design explicit fallback paths: if the primary model times out, can a simpler response be generated by a faster model?
  3. Show partial results when available - a partial summary is often more valuable than a timeout error
  4. Cache the last successful result for tasks where a slightly stale response is acceptable
  5. Log every timeout and slow call - they are your most actionable performance data
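The first two items can be combined into a small sketch. `call_with_fallback`, `slow_primary`, and `fast_fallback` are hypothetical stand-ins for real model clients:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def call_with_fallback(primary, fallback, timeout_s: float) -> str:
    """Try the primary (slower, better) model under a hard timeout;
    on timeout, answer with a faster model instead of an error."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # This is where the timeout should also be logged (item 5).
        return fallback()
    finally:
        pool.shutdown(wait=False)  # don't block on the abandoned call

def slow_primary() -> str:
    time.sleep(0.2)               # simulates a slow model call
    return "detailed answer"

def fast_fallback() -> str:
    return "shorter answer from a faster model"
```

Note that the abandoned primary call keeps running in its worker thread; a production version would also cancel the underlying request to avoid wasted inference cost.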

Measuring Latency in Production

  • Time to first token: measures how quickly the user sees any response - strongly correlated with perceived responsiveness
  • Time to complete: the full wall-clock time from request to final token - relevant for task completion rates
  • p95 and p99 latency: most users with a bad experience are in the tail, not the median
  • Timeout rate: the percentage of requests exceeding your client-side threshold - a leading indicator of product health
  • User abandonment rate: how many users navigate away before a response completes - the business-level consequence of bad latency
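Most of these metrics fall out of per-request latency samples. `latency_report` below is an illustrative sketch using the nearest-rank percentile method, not a production metrics pipeline:

```python
import math
from typing import Dict, List

def latency_report(samples_ms: List[float], timeout_ms: float) -> Dict[str, float]:
    """Summarise per-request latency samples into tail percentiles and
    timeout rate, using nearest-rank percentiles on sorted samples."""
    ordered = sorted(samples_ms)

    def pct(p: float) -> float:
        rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)
        return ordered[rank]

    return {
        "p50": pct(50),
        "p95": pct(95),
        "p99": pct(99),
        "timeout_rate": sum(s > timeout_ms for s in ordered) / len(ordered),
    }

# Five requests, one of which blew past a 5s client-side timeout:
report = latency_report([800, 900, 1100, 1200, 9000], timeout_ms=5000)
```

With small sample counts p95 and p99 collapse onto the worst request, which is precisely why the tail metrics surface problems the median hides.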

Building something in this space?

We'd be happy to talk through your use case. No pitch - just an honest conversation about what's feasible.

Book a 30-minute call

Key takeaways

  • Perceived speed mattered more than raw latency numbers
  • Streaming responses made even slower models feel faster
  • Good loading states and skeletons set the right expectations
  • Task-level latency was more important than individual model calls
  • Graceful timeouts and fallbacks turned failures into acceptable delays