Thinkscoop
Intelligent Support Agent Resolving 68% of Queries Autonomously for Booking.com
Travel & eCommerce | 10 weeks | AI Agents | AI-Powered Development | AI Integration

68% of queries resolved without human intervention

Autonomous resolution rate: 68%
First response time: under 2 minutes
Customer satisfaction: +22 points
Daily queries handled: 10,000+

Context

The business context

At Booking.com's scale, a 4-hour average support response time isn't just frustrating - it's a retention problem. Customers who can't resolve accommodation disputes, cancellation questions, or payment issues quickly don't rebook. The team had tried off-the-shelf chatbots and rule-based automation. Both failed for the same reason: the policy surface was too large and too dynamic for static decision trees to cover. Policies vary by property, country, booking type, and promotional tier. Any bot that couldn't reason across that complexity was destined to escalate.

The problem

Five specific problems that needed solving

10,000+ daily queries across accommodation disputes, cancellations, payment failures, and policy questions - with no triage

Existing chatbots resolved less than 20% of queries without escalation, flooding human agent queues with fully resolvable issues

4-hour average first-response time during peak periods, directly tied to negative reviews and reduced rebooking rates

500,000+ policy documents across 50+ jurisdictions - impossible to maintain as static rules in any existing system

No context handoff between bot and human: agents received escalations with no background, forcing customers to repeat themselves

Our approach

Escalation design is as important as resolution design.

Most support AI projects are measured on resolution rate. We argued from day one that resolution rate is the wrong metric - a bot that closes 80% of queries but gets many of them wrong is worse than one that resolves 68% correctly and escalates the rest gracefully. We built the escalation experience first. Every escalated query arrives at a human agent with the booking data, the policy evidence, the conversation history, and the AI agent's uncertainty reasoning already assembled - so the human can resolve it in 90 seconds rather than 8 minutes. This changed how the team thought about the project: instead of trying to maximise automation, we focused on making the human-AI handoff seamless.

Confidence-threshold escalation: when its confidence falls below a threshold, the AI agent escalates with full assembled context rather than a cold handoff

Policy RAG with versioning: the knowledge base tracks when policies change and which version applies to which booking date

Reasoning traces exposed to human reviewers: human agents can see exactly which policy clauses the AI cited for any decision

Separate evaluation dataset for each query category (refunds, disputes, payment failures) to prevent cross-category accuracy masking
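To illustrate why per-category evaluation matters, here is a minimal sketch of cross-category accuracy masking. The category names and counts are made up for illustration and are not figures from this project; the helper `accuracy_by_category` is hypothetical.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """Return {category: accuracy} from (category, was_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, was_correct in results:
        totals[category] += 1
        correct[category] += int(was_correct)
    return {category: correct[category] / totals[category] for category in totals}


# Illustrative evaluation results only (not project data):
# 80 FAQ queries with 78 correct, 20 refund queries with 12 correct.
results = (
    [("faq", True)] * 78
    + [("faq", False)] * 2
    + [("refunds", True)] * 12
    + [("refunds", False)] * 8
)

overall = sum(was_correct for _, was_correct in results) / len(results)
per_category = accuracy_by_category(results)

print(f"overall: {overall:.3f}")
for category, accuracy in sorted(per_category.items()):
    print(f"{category}: {accuracy:.3f}")
```

In this toy data the aggregate accuracy looks healthy while the refund category, where mistakes cost money, is far weaker. A single blended dataset would hide that gap.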

What we built

A reasoning agent with live system access

The system is a multi-step LangGraph agent with direct tool access to Booking.com's booking retrieval API, payment and refund processing API, and a Pinecone vector store containing 500,000+ policy documents chunked and tagged by jurisdiction, property type, and booking tier. The agent retrieves the relevant booking, identifies the policy applicable to that specific booking context, checks refund eligibility, and - within defined thresholds - can initiate a refund or send a policy-based resolution without any human involvement. For edge cases, it routes to a human queue with a pre-populated resolution context card.

1. Policy RAG engine

500,000+ policy documents indexed in Pinecone with jurisdiction metadata, property category, and effective date ranges. Every policy retrieval includes the document version applicable at the time of the booking, not the current version - preventing incorrect application of retrospective policy changes.
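The time-aware lookup described above can be sketched in plain Python. This is an assumption-laden illustration, not the production code: the `PolicyVersion` fields, the `version_at` helper, and the sample dates are all hypothetical, and in a vector store the same condition would more likely be expressed as a metadata filter on the effective date range.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class PolicyVersion:
    policy_id: str
    version: int
    effective_from: date           # inclusive
    effective_to: Optional[date]   # exclusive; None means still in force


def version_at(
    versions: Iterable[PolicyVersion], policy_id: str, booking_date: date
) -> Optional[PolicyVersion]:
    """Return the version of a policy that was in force on the booking date."""
    for candidate in versions:
        if candidate.policy_id != policy_id:
            continue
        started = candidate.effective_from <= booking_date
        not_ended = candidate.effective_to is None or booking_date < candidate.effective_to
        if started and not_ended:
            return candidate
    return None


# Hypothetical policy history: version 2 replaced version 1 on 2024-06-01.
history = [
    PolicyVersion("cancellation-flex", 1, date(2023, 1, 1), date(2024, 6, 1)),
    PolicyVersion("cancellation-flex", 2, date(2024, 6, 1), None),
]

# A booking made in March 2024 is judged against version 1, not the current version.
applicable = version_at(history, "cancellation-flex", date(2024, 3, 15))
print(applicable.version if applicable else "no applicable version")
```

Treating the end date as exclusive means a booking made on the changeover day matches exactly one version.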

2. Booking context retrieval

Live integration with Booking.com's booking API gives the agent real-time access to booking status, payment history, cancellation window, and property-level terms - so responses are always grounded in the actual booking, not a generic policy summary.

3. Refund authority layer

The agent can initiate refunds up to a defined threshold autonomously. Above that threshold, it prepares a refund recommendation with supporting evidence for a human agent to approve in a single click.
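A minimal sketch of that threshold routing follows. The limit value, the action names, and the `decide_refund` helper are assumptions made for illustration; the actual threshold and the refund API are not described in this case study.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Sequence, Tuple

# Hypothetical limit; the real threshold is not stated in this case study.
AUTO_REFUND_LIMIT = Decimal("100.00")


@dataclass(frozen=True)
class RefundDecision:
    action: str                # "auto_refund", "recommend_to_human", or "no_refund"
    amount: Decimal
    evidence: Tuple[str, ...]  # policy clauses supporting the decision


def decide_refund(
    amount: Decimal,
    eligible: bool,
    evidence: Sequence[str],
    limit: Decimal = AUTO_REFUND_LIMIT,
) -> RefundDecision:
    """Route a refund: act autonomously up to the limit, otherwise recommend."""
    if not eligible:
        return RefundDecision("no_refund", Decimal("0"), tuple(evidence))
    if amount <= limit:
        return RefundDecision("auto_refund", amount, tuple(evidence))
    # Above the limit the agent only prepares a recommendation for human approval.
    return RefundDecision("recommend_to_human", amount, tuple(evidence))


small = decide_refund(Decimal("45.00"), True, ["cancellation-flex v1, clause 3"])
large = decide_refund(Decimal("450.00"), True, ["cancellation-flex v1, clause 3"])
print(small.action, large.action)
```

Using `Decimal` rather than floats avoids rounding surprises when comparing monetary amounts against the limit.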

4. Context handoff card

Every escalation generates a structured handoff card: booking summary, conversation history, policies retrieved, the AI agent's confidence score, and the specific reason for escalation. Human agents now resolve escalated queries in about 90 seconds rather than 8 minutes, more than 5× faster than before.
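One way to represent such a card is a small dataclass that serialises to JSON for the human agent's queue. The field names mirror the items listed above, but the class itself, the sample values, and the JSON transport are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class HandoffCard:
    booking_summary: Dict[str, Any]
    conversation_history: List[str]
    policies_retrieved: List[Dict[str, Any]]
    confidence_score: float
    escalation_reason: str

    def to_json(self) -> str:
        """Serialise the card so it can be attached to the escalated ticket."""
        return json.dumps(asdict(self), indent=2)


# Hypothetical example values.
card = HandoffCard(
    booking_summary={"booking_id": "EXAMPLE-123", "status": "cancelled"},
    conversation_history=["Customer: I cancelled but have not been refunded."],
    policies_retrieved=[{"policy_id": "cancellation-flex", "version": 1}],
    confidence_score=0.62,
    escalation_reason="Refund amount exceeds the autonomous limit",
)

payload = card.to_json()
print(payload)
```

Keeping the escalation reason as an explicit field lets the human agent see at a glance why the query was not resolved automatically.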

5. Multi-jurisdiction compliance logging

Every customer interaction is logged with the policy version cited, the jurisdiction applied, and the resolution taken - satisfying data retention and consumer protection requirements across 50+ countries.

Impact

What changed in production

The resolution rate improvement was significant. The first-response time improvement changed the customer experience entirely.

Autonomous resolution rate rose from under 20% to 68%. First response time dropped from 4 hours to under 2 minutes. Customer satisfaction rose 22 points.

We went from a chatbot that frustrated customers to an AI that actually solves their problems. The escalation design is what makes it trustworthy at scale - our human agents love it as much as our customers do.

Head of Customer Experience - Booking.com

Learnings

What we took away from this project

Policy versioning is a first-class engineering problem

We underestimated how complex Booking.com's policy versioning was. Policies change at the property level, country level, and platform level - and the correct policy for a dispute is the one that applied at booking time, not today. Building a time-aware policy retrieval system added two weeks to the project but was the difference between a legally defensible system and a liability.

Confidence calibration requires category-specific evaluation

A single confidence threshold across all query types led to overconfident responses on refund decisions and overconservative escalation on simple FAQ queries. We moved to per-category confidence thresholds with separate evaluation datasets for each - immediately improving both resolution accuracy and escalation precision.
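The move to per-category thresholds can be sketched as a simple lookup. The category names and threshold values below are hypothetical; in practice each value would be tuned against that category's own evaluation dataset.

```python
# Hypothetical thresholds: stricter where a wrong answer costs money.
CATEGORY_THRESHOLDS = {
    "refund": 0.90,
    "dispute": 0.85,
    "payment_failure": 0.80,
    "faq": 0.60,
}

# Unknown categories fall back to the strictest setting.
DEFAULT_THRESHOLD = max(CATEGORY_THRESHOLDS.values())


def should_escalate(category: str, confidence: float) -> bool:
    """Escalate when confidence is below the threshold for this query category."""
    threshold = CATEGORY_THRESHOLDS.get(category, DEFAULT_THRESHOLD)
    return confidence < threshold


# The same confidence score leads to different outcomes per category.
print(should_escalate("refund", 0.80))  # escalated: below the refund threshold
print(should_escalate("faq", 0.80))     # answered: above the FAQ threshold
```

Defaulting unknown categories to the strictest threshold errs on the side of escalation, in keeping with the escalation-first approach described earlier.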

The handoff experience determines agent adoption

Human agents initially viewed the AI system with scepticism. What changed their minds wasn't the resolution rate - it was the quality of the escalation handoff card. When they realised the AI was assembling all the context they'd previously had to gather themselves, and that escalated queries now took 90 seconds instead of 8 minutes, they became active advocates for the system.

At a glance

Client: Booking.com
Industry: Travel & eCommerce
Timeline: 10 weeks

Tech stack

GPT-4o, LangGraph, Pinecone, FastAPI, React, PostgreSQL, Datadog

Capabilities

AI Agents
AI-Powered Development
AI Integration
