Evolution of Cloudflare's Cache Purging System: From Centralized to Distributed Architecture
Table of Contents
- Introduction
- Background
- The Old System: Centralized Architecture
- Challenges with the Old System
- The New System: Distributed Architecture
- Technical Implementation
- Performance Analysis
- Code Examples
- Future Improvements
- Conclusion
- References & Further Reading
Introduction
Cloudflare, which started as a Content Delivery Network (CDN), has evolved into a comprehensive platform offering security, performance, and reliability services. One of its critical features is the ability to purge cached content globally in near real time. This article explores how Cloudflare transformed its cache purging system from a centralized architecture that took roughly 1.5 seconds per purge to a distributed system that completes purges in about 150 milliseconds.
Background
Cloudflare operates one of the world’s largest CDNs, with data centers in over 275 cities worldwide. Its CDN acts as a reverse proxy between users and origin servers, caching content close to users to improve performance and reduce origin load.
Key concepts to understand:
- Cache Key: A unique identifier generated from the request URL and headers (see the sketch after this list)
- TTL (Time To Live): Duration for which cached content remains valid
- Purge Request: Command to invalidate cached content
- Edge Nodes: Cloudflare’s distributed data centers
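To make the cache key concept concrete, here is a minimal, hypothetical sketch of how a key might be derived from the request URL plus selected headers. The function name `build_cache_key` and the `vary_on` header set are illustrative assumptions, not the fields Cloudflare actually hashes:

```python
import hashlib

def build_cache_key(url, headers, vary_on=("accept-encoding",)):
    """Derive a hypothetical cache key from the URL plus selected headers."""
    # Only headers the cache is configured to vary on influence the key
    parts = [url] + [f"{h}={headers.get(h, '')}" for h in sorted(vary_on)]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

key = build_cache_key("https://example.com/style.css",
                      {"accept-encoding": "gzip"})
print(key[:16])  # short prefix for display
```

Two requests that differ only in a header the cache does not vary on map to the same key and therefore share one cached object.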
The Old System: Centralized Architecture
The original purging system utilized a centralized approach with these key components:
- Quicksilver Database: Central configuration database located in the US
- Lazy Purging: Cached content was checked against purge records only when it was requested
- Core-based Distribution: All purge requests routed through a central location
Here’s a simplified Python example demonstrating the old system’s purge request flow; the Quicksilver round trip and the purge-matching check are stubbed out for illustration:
```python
import time

class QuicksilverPurgeSystem:
    def __init__(self):
        self.purge_history = []
        self.cache_store = {}

    def submit_purge_request(self, purge_key):
        # Send the purge to the central Quicksilver DB and measure the round trip
        latency = self.send_to_quicksilver(purge_key)
        # Record the purge locally so later lookups can be checked against it
        self.purge_history.append({
            'key': purge_key,
            'timestamp': time.time()
        })
        return latency

    def send_to_quicksilver(self, purge_key):
        # Stub for the round trip to the central database in the US
        return 1500  # simulated latency in milliseconds

    def verify_cache_hit(self, cache_key):
        # Check whether the content exists in the local cache
        content = self.cache_store.get(cache_key)
        if content:
            # Lazy purging: verify the entry against the purge history
            for purge in self.purge_history:
                if self.matches_purge_criteria(content, purge):
                    return None  # Content has been purged
            return content
        return None

    def matches_purge_criteria(self, content, purge):
        # Stub match: cached entries are assumed to be dicts with a 'tags' list
        return purge['key'] in content.get('tags', [])


# Example usage
purge_system = QuicksilverPurgeSystem()
latency = purge_system.submit_purge_request("content_type=json")
print(f"Purge request latency: {latency}ms")
```
In the old system’s flow, a purge request from any data center first traveled to the central Quicksilver database in the US, and only then was the purge distributed out to edge locations around the globe.
Challenges with the Old System
The centralized architecture faced several significant challenges:
- High Latency: Global round trip to the US-based central database
- Purge History Growth: Accumulating purge records consumed storage
- Limited Scalability: Quicksilver not designed for high-frequency updates
- Resource Competition: Purge history competed with cache storage
Performance metrics for the old system:
- P50 Latency: 1.5 seconds
- Global Propagation Time: 2-5 seconds
- Storage Overhead: ~20% for purge history
The New System: Distributed Architecture
Cloudflare redesigned their purging system with these key improvements:
- Coreless Architecture: Direct peer-to-peer communication
- Active Purging: Immediate content invalidation
- Local Databases: RocksDB-based CacheDB on each machine
Here’s a simplified Python sketch of the new system. The real per-machine store is a RocksDB-backed CacheDB service, so the example models it with a small in-memory stand-in (`LocalCacheDB`) whose methods are illustrative, not the real API:
```python
import asyncio

class LocalCacheDB:
    """In-memory stand-in for the per-machine, RocksDB-backed CacheDB."""
    def __init__(self):
        self.pending_purges = set()
        self.entries = {}  # cache_key -> cached content
    async def add_purge(self, purge_key):
        self.pending_purges.add(purge_key)
    async def find_matches(self, purge_key):
        return [key for key in self.entries if purge_key in key]
    async def remove_cache_entry(self, cache_key):
        self.entries.pop(cache_key, None)
    async def has_pending_purge(self, cache_key):
        return any(p in cache_key for p in self.pending_purges)
    async def get_cache_entry(self, cache_key):
        return self.entries.get(cache_key)

class DistributedPurgeSystem:
    def __init__(self):
        self.local_db = LocalCacheDB()  # local CacheDB instance
        self.peer_nodes = []

    async def handle_purge_request(self, purge_key):
        # Process the purge locally first
        await self.local_db.add_purge(purge_key)
        # Fan out to peer data centers in parallel
        tasks = [self.notify_peer(peer, purge_key) for peer in self.peer_nodes]
        await asyncio.gather(*tasks)
        # Actively purge matching content instead of waiting for a request
        await self.execute_purge(purge_key)

    async def notify_peer(self, peer, purge_key):
        # Placeholder for the peer-to-peer purge broadcast
        pass

    async def execute_purge(self, purge_key):
        # Find matching cache entries and remove them from cache and index
        matches = await self.local_db.find_matches(purge_key)
        for entry in matches:
            await self.local_db.remove_cache_entry(entry)

    async def verify_cache_hit(self, cache_key):
        # Check pending purges first, then fall back to the cache
        if await self.local_db.has_pending_purge(cache_key):
            return None
        return await self.local_db.get_cache_entry(cache_key)


# Example usage
async def main():
    purge_system = DistributedPurgeSystem()
    await purge_system.handle_purge_request("content_type=json")

asyncio.run(main())
```
In the new flow, purge requests are handled entirely at the edge: the receiving data center processes the purge locally and fans it out directly to its peers in parallel, with no central core in the path.
Technical Implementation
The new system’s key technical components:
- RocksDB: LSM-tree based storage engine
- CacheDB: Custom service written in Rust
- Peer-to-peer Protocol: Direct edge node communication (a hypothetical message format is sketched after this list)
- Active Purging: Immediate content invalidation
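To illustrate the peer-to-peer component, here is a hypothetical sketch of the kind of message each node might broadcast to its peers when it receives a purge. The field names and JSON wire format are assumptions for illustration, not Cloudflare’s actual protocol:

```python
import json
import time
import uuid
from dataclasses import dataclass, asdict

@dataclass
class PurgeMessage:
    """Hypothetical purge broadcast sent from one edge node to its peers."""
    purge_id: str     # unique ID so duplicate deliveries can be ignored
    purge_key: str    # URL, prefix, tag, or hostname to invalidate
    origin_node: str  # data center that first received the request
    issued_at: float  # timestamp used for ordering and expiry

    def to_wire(self) -> bytes:
        return json.dumps(asdict(self)).encode()

msg = PurgeMessage(str(uuid.uuid4()), "content_type=json", "SJC", time.time())
print(msg.to_wire())
```

Carrying a unique ID and origin in each message lets nodes deduplicate purges that arrive via more than one peer.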
Implementation considerations:
- Index Management: Efficient lookup for flexible purge patterns
- Tombstone Handling: Managing deleted entry markers (see the sketch after this list)
- Storage Optimization: Balancing index size with performance
- Network Protocol: Reliable peer-to-peer communication
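To illustrate the tombstone idea in an LSM-style store: deletions are recorded as markers rather than removed in place, and readers skip tombstoned keys until compaction discards them. The sketch below is a toy in-memory model of that behavior, not the RocksDB or CacheDB implementation:

```python
TOMBSTONE = object()  # sentinel marking a deleted (purged) entry

class LSMLikeStore:
    """Toy model of tombstone handling in an LSM-style key-value store."""
    def __init__(self):
        self.memtable = {}

    def put(self, key, value):
        self.memtable[key] = value

    def delete(self, key):
        # Write a tombstone instead of removing the key in place
        self.memtable[key] = TOMBSTONE

    def get(self, key):
        value = self.memtable.get(key)
        return None if value is TOMBSTONE or value is None else value

    def compact(self):
        # Compaction drops tombstones, reclaiming the space they occupied
        self.memtable = {k: v for k, v in self.memtable.items()
                         if v is not TOMBSTONE}

store = LSMLikeStore()
store.put("asset:/logo.png", "<cached bytes>")
store.delete("asset:/logo.png")   # purge writes a tombstone
assert store.get("asset:/logo.png") is None
store.compact()                   # tombstone removed during compaction
```

Until compaction runs, tombstones consume space, which is why balancing index size and purge volume matters for storage optimization.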
Performance Analysis
The new system achieved significant improvements:
- P50 Latency: 150 milliseconds
- Storage Overhead: Roughly halved, from ~20% to ~10%
- Scalability: Linear with node count
- Reliability: No single point of failure
Performance comparison:
| Metric | Old System | New System | Improvement |
|---|---|---|---|
| P50 Latency | 1500 ms | 150 ms | 90% |
| Storage Overhead | 20% | 10% | 50% |
| Max Throughput | 10K/s | 100K/s | 10x |
Future Improvements
Cloudflare continues to enhance the system:
- Further latency reduction for single-file purges
- Expanded purge types for all plan levels
- Enhanced rate limiting capabilities
- Additional optimization opportunities
Conclusion
Cloudflare’s evolution from a centralized to distributed purging system demonstrates the challenges and solutions in building global-scale content delivery networks. The new architecture achieves impressive performance improvements while maintaining reliability and consistency.
References & Further Reading
- Cloudflare Blog: “Instant Purge: Invalidating cached content in under 150 milliseconds”
- Research Paper: “LSM-tree based Storage Systems”
- Distributed Systems: “Peer-to-peer Protocols and Applications”
- RocksDB Documentation
- Content Delivery Networks: Architecture and Performance