68%
queries resolved without human intervention
Intelligent Support Agent Resolving 68% of Queries Autonomously for Booking.com
Booking.com
68%
Autonomous resolution rate
<2min
First response time
+22pts
Customer satisfaction
10k+
Daily queries handled
Context
The business context
At Booking.com's scale, a 4-hour average support response time isn't just frustrating - it's a retention problem. Customers who can't resolve accommodation disputes, cancellation questions, or payment issues quickly don't rebook. The team had tried off-the-shelf chatbots and rule-based automation. Both failed for the same reason: the policy surface was too large and too dynamic for static decision trees to cover. Policies vary by property, country, booking type, and promotional tier. Any bot that couldn't reason across that complexity was destined to escalate.
The problem
Five specific problems that needed solving
10,000+ daily queries across accommodation disputes, cancellations, payment failures, and policy questions - with no triage
Existing chatbots resolved fewer than 20% of queries without escalation, flooding human agent queues with issues that could have been resolved automatically
4-hour average first-response time during peak periods, directly tied to negative reviews and reduced rebooking rates
500,000+ policy documents across 50+ jurisdictions - impossible to maintain as static rules in any existing system
No context handoff between bot and human: agents received escalations with no background, forcing customers to repeat themselves
Our approach
Escalation design is as important as resolution design.
Most support AI projects are measured on resolution rate. We argued from day one that raw resolution rate is the wrong metric - a bot that closes 80% of queries but gets many of them wrong is worse than one that resolves 68% correctly and escalates the rest gracefully. We built the escalation experience first. Every escalated query arrives at a human agent with the booking data, the policy evidence, the conversation history, and the AI's explanation of why it was uncertain already assembled - so the human agent can resolve it in 90 seconds rather than 8 minutes. This changed how the team thought about the project: instead of trying to maximise automation, we focused on making the human-AI handoff seamless.
Confidence-threshold escalation: when its confidence falls below the threshold, the AI agent escalates with full assembled context rather than a cold handoff
Policy RAG with versioning: the knowledge base tracks when policies change and which version applies to which booking date
Reasoning traces exposed to human reviewers: human agents can see exactly which policy clauses the AI cited for any decision
Separate evaluation dataset for each query category (refunds, disputes, payment failures) to prevent cross-category accuracy masking
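To illustrate why per-category evaluation matters, here is a minimal sketch of how a single aggregate accuracy figure can hide a weak category. The function names, category labels, and sample outcomes are hypothetical and made up for illustration; they are not results from the project.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """Return accuracy per category from (category, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {c: correct[c] / totals[c] for c in totals}


def overall_accuracy(results):
    """Return accuracy across all categories pooled together."""
    results = list(results)
    return sum(int(ok) for _, ok in results) / len(results)


# Made-up evaluation outcomes: a large, easy category and a small, hard one.
sample = (
    [("faq", True)] * 90 + [("faq", False)] * 10
    + [("refunds", True)] * 5 + [("refunds", False)] * 5
)

per_category = accuracy_by_category(sample)
overall = overall_accuracy(sample)
print(per_category, round(overall, 3))
```

In this made-up sample the pooled figure is dominated by the large FAQ set, so the much weaker refunds accuracy only becomes visible once the categories are scored separately.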
What we built
A reasoning agent with live system access
The system is a multi-step LangGraph agent with direct tool access to Booking.com's booking retrieval API, payment and refund processing API, and a Pinecone vector store containing 500,000+ policy documents chunked and tagged by jurisdiction, property type, and booking tier. The agent retrieves the relevant booking, identifies the policy applicable to that specific booking context, checks refund eligibility, and - within defined thresholds - can initiate a refund or send a policy-based resolution without any human involvement. For edge cases, it routes to a human queue with a pre-populated resolution context card.
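The flow described above can be sketched in plain Python, without reproducing the actual LangGraph graph or any Booking.com API. Everything here is an assumption for illustration: the `handle_query` function, the injected tool callables, the `amount` field, and the refund limit are hypothetical stand-ins for the real integrations.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    action: str  # "auto_resolve" or "escalate"
    reason: str


def handle_query(
    booking_id: str,
    fetch_booking: Callable[[str], Optional[dict]],
    find_policy: Callable[[dict], Optional[dict]],
    is_refund_eligible: Callable[[dict, dict], bool],
    refund_limit: float,
) -> Outcome:
    """Retrieve booking, find applicable policy, then resolve or escalate."""
    booking = fetch_booking(booking_id)
    if booking is None:
        return Outcome("escalate", "booking not found")

    policy = find_policy(booking)
    if policy is None:
        return Outcome("escalate", "no applicable policy found")

    if not is_refund_eligible(booking, policy):
        # No refund is due, so a policy-based explanation can be sent.
        return Outcome("auto_resolve", "policy-based resolution sent")

    if booking["amount"] <= refund_limit:
        return Outcome("auto_resolve", "refund initiated")

    # Above the autonomous limit: a human approves the recommendation.
    return Outcome("escalate", "refund above autonomous limit")


# Hypothetical usage with in-memory stand-ins for the real tools.
bookings = {"b1": {"amount": 40.0}, "b2": {"amount": 900.0}}
result_small = handle_query(
    "b1", bookings.get, lambda b: {"refundable": True},
    lambda b, p: p["refundable"], refund_limit=100.0,
)
result_large = handle_query(
    "b2", bookings.get, lambda b: {"refundable": True},
    lambda b, p: p["refundable"], refund_limit=100.0,
)
```

Passing the tools in as callables keeps the decision logic testable without live system access, which is one plausible way to build evaluation sets around it.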
Policy RAG engine
500,000+ policy documents indexed in Pinecone with jurisdiction metadata, property category, and effective date ranges. Every policy retrieval includes the document version applicable at the time of the booking, not the current version - preventing incorrect application of retrospective policy changes.
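A minimal sketch of the time-aware lookup, assuming each policy version carries an effective date range. The `PolicyVersion` type, the `version_at` helper, and the half-open date convention are illustrative assumptions; in the production system this selection would presumably be expressed as a metadata filter on the vector store rather than an in-memory scan.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class PolicyVersion:
    policy_id: str
    version: int
    effective_from: date          # inclusive
    effective_to: Optional[date]  # exclusive; None means still in force


def version_at(
    versions: Iterable[PolicyVersion], policy_id: str, booking_date: date
) -> Optional[PolicyVersion]:
    """Return the version of a policy that was in force on booking_date."""
    for v in versions:
        if v.policy_id != policy_id:
            continue
        started = v.effective_from <= booking_date
        not_ended = v.effective_to is None or booking_date < v.effective_to
        if started and not_ended:
            return v
    return None


# Hypothetical version history for one cancellation policy.
history = [
    PolicyVersion("cancellation", 1, date(2023, 1, 1), date(2024, 1, 1)),
    PolicyVersion("cancellation", 2, date(2024, 1, 1), None),
]

old_booking = version_at(history, "cancellation", date(2023, 6, 15))
new_booking = version_at(history, "cancellation", date(2024, 1, 1))
```

A booking made in mid-2023 resolves to version 1 even though version 2 is current, which is the behaviour the versioning requirement calls for.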
Booking context retrieval
Live integration with Booking.com's booking API gives the agent real-time access to booking status, payment history, cancellation window, and property-level terms - so responses are always grounded in the actual booking, not a generic policy summary.
Refund authority layer
The agent can initiate refunds up to a defined threshold autonomously. Above that threshold, it prepares a refund recommendation with supporting evidence for a human agent to approve in a single click.
Context handoff card
Every escalation generates a structured handoff card: booking summary, conversation history, policies retrieved, the AI agent's confidence score, and the specific reason for escalation. Human agents now resolve escalated queries more than 5× faster than before (about 90 seconds instead of 8 minutes).
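As a sketch of what such a card could look like as a data structure, the example below uses the five fields listed above. The `HandoffCard` class, its field names, and the sample values are hypothetical; the real card format is not described in this case study.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class HandoffCard:
    booking_summary: str
    conversation_history: List[str]
    policies_retrieved: List[str]
    confidence: float
    escalation_reason: str

    def to_json(self) -> str:
        """Serialise the card so it can be attached to a human queue item."""
        return json.dumps(asdict(self), indent=2)


# Made-up example values for illustration only.
card = HandoffCard(
    booking_summary="2 nights, prepaid, cancelled 3 days before check-in",
    conversation_history=["Customer: I was charged after cancelling."],
    policies_retrieved=["cancellation policy v1, clause 4"],
    confidence=0.55,
    escalation_reason="confidence below refund threshold",
)
print(card.to_json())
```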
Multi-jurisdiction compliance logging
Every customer interaction is logged with the policy version cited, the jurisdiction applied, and the resolution taken - satisfying data retention and consumer protection requirements across 50+ countries.
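One way to capture these fields is a structured, append-only log line per interaction. The sketch below is an assumption about shape only: the `audit_record` function and its field names are hypothetical, and real retention rules would depend on each jurisdiction.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(
    interaction_id: str,
    policy_id: str,
    policy_version: int,
    jurisdiction: str,
    resolution: str,
    now: Optional[datetime] = None,
) -> str:
    """Build one JSON log line recording what was cited and what was done."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return json.dumps(
        {
            "interaction_id": interaction_id,
            "policy_id": policy_id,
            "policy_version": policy_version,
            "jurisdiction": jurisdiction,
            "resolution": resolution,
            "logged_at": timestamp,
        },
        sort_keys=True,
    )


# Hypothetical example with a fixed timestamp for reproducibility.
line = audit_record(
    "int-001", "cancellation", 1, "NL", "refund_initiated",
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```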
Impact
What changed in production
The resolution rate improvement was significant. The first-response time improvement changed the customer experience entirely.
Autonomous resolution rate rose from under 20% to 68%. First response time dropped from 4 hours to under 2 minutes. Customer satisfaction rose by 22 points.
“We went from a chatbot that frustrated customers to an AI that actually solves their problems. The escalation design is what makes it trustworthy at scale - our human agents love it as much as our customers do.”
Head of Customer Experience - Booking.com
Learnings
What we took away from this project
Policy versioning is a first-class engineering problem
We underestimated how complex Booking.com's policy versioning was. Policies change at the property level, country level, and platform level - and the correct policy for a dispute is the one that applied at booking time, not today. Building a time-aware policy retrieval system added two weeks to the project but was the difference between a legally defensible system and a liability.
Confidence calibration requires category-specific evaluation
A single confidence threshold across all query types led to overconfident responses on refund decisions and overconservative escalation on simple FAQ queries. We moved to per-category confidence thresholds with separate evaluation datasets for each - immediately improving both resolution accuracy and escalation precision.
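A minimal sketch of per-category thresholds, assuming the agent produces a confidence score between 0 and 1 for each proposed resolution. The category names, threshold values, and the conservative default are hypothetical placeholders, not the values used in production.

```python
# Hypothetical thresholds: stricter where a wrong answer costs more.
THRESHOLDS = {
    "refunds": 0.90,
    "disputes": 0.85,
    "payment_failures": 0.80,
    "faq": 0.60,
}

# Unknown categories fall back to a deliberately strict threshold.
DEFAULT_THRESHOLD = 0.95


def should_escalate(category: str, confidence: float) -> bool:
    """Escalate when confidence is below the threshold for this category."""
    return confidence < THRESHOLDS.get(category, DEFAULT_THRESHOLD)


# The same score leads to different decisions depending on the category.
refund_decision = should_escalate("refunds", 0.80)
faq_decision = should_escalate("faq", 0.80)
```

With a single shared threshold, one of these two cases would be handled badly: either the refund goes out on weak evidence or the simple FAQ answer is escalated unnecessarily.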
The handoff experience determines agent adoption
Human agents initially viewed the AI system with scepticism. What changed their minds wasn't the resolution rate - it was the quality of the escalation handoff card. When they realised the AI was assembling all the context they'd previously had to gather themselves, and that escalated queries now took 90 seconds instead of 8 minutes, they became active advocates for the system.