System Design, Building a Real-Time Collaborative Editor

Why Study System Design with Katas?

Just as coding katas help programmers refine their skills through repetitive practice, system design katas train engineers to break down complex problems, evaluate tradeoffs, and architect scalable solutions. By simulating real-world constraints (e.g., latency, fault tolerance), katas foster:

Pattern Recognition: Identifying reusable solutions (e.g., CRDTs for conflict resolution).
Decision-Making Skills: Balancing consistency vs. availability, or latency vs. durability.
Communication: Explaining technical choices clearly, as you would in a team or interview.

Let’s apply this to a classic problem: designing a real-time collaborative editor.

The Problem: Real-Time Collaborative Editing System

Requirements

Functional:

User authentication, document management, real-time sync, conflict resolution, version history, presence awareness.

Non-Functional:

Latency ≤200ms, 99.9% uptime, scalability for 10M+ users, zero data loss.

Constraints:

≤50 concurrent editors per document.
Support for offline editing.
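
To make the 99.9% uptime target concrete, it helps to convert it into a yearly downtime budget. The snippet below is a back-of-envelope sketch; the function name `downtime_budget_hours` is made up for illustration.

```python
# Back-of-envelope arithmetic for the uptime target in the requirements.
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_budget_hours(uptime_percent: float) -> float:
    """Hours per year the system may be unavailable at a given uptime target."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


budget = downtime_budget_hours(99.9)
print(f"99.9% uptime allows about {budget:.2f} hours of downtime per year")
```

At 99.9%, the budget is roughly 8.76 hours per year, which has to cover deploys, failovers, and incidents combined.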

Step-by-Step Solution

(Below is the full thought process and design, as if pairing with a collaborator.)

Step 1: High-Level Architecture

Goal: Outline components and their interactions.

graph TD  
    Client -->|WebSocket| RT[Real-Time Engine]  
    RT -->|gRPC| DS[Document Service]  
    DS --> Cassandra[(Cassandra: Documents)]  
    DS --> Redis[(Redis: Sessions)]  
    DS --> S3[(S3: Version History)]  
    RT --> PS[Presence Service]  
    PS --> Redis

Key Decisions:

WebSocket for real-time updates (lower latency than polling).
Cassandra for scalable document storage (handles high write throughput).
Redis for session management and pub/sub messaging.

Step 2: Conflict Resolution with CRDTs

Problem: Two users editing the same text region.

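Solution: Use Conflict-free Replicated Data Types (CRDTs). Each edit is an operation tagged with a unique ID (for example, a logical timestamp plus a replica ID), so replicas that receive the same set of operations converge on the same state without manual conflict resolution.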

stateDiagram-v2  
    AliceEdit: Alice inserts "X" (timestamp A=1)  
    BobEdit: Bob inserts "Y" (timestamp B=1)  
    MergedState: "XY" (ordered by unique IDs)  
    AliceEdit --> MergedState  
    BobEdit --> MergedState
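
The merge in the diagram can be sketched in a few lines. This is a deliberately simplified model, not a production text CRDT: the state is a grow-only set of operations, and the document is rendered by sorting on a unique `(timestamp, site)` ID. The names `Op`, `merge`, and `render` are assumptions for illustration; real sequence CRDTs (such as the one inside Automerge) position characters relative to their neighbors instead.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    timestamp: int  # logical clock of the author at insert time
    site: str       # replica ID; breaks ties between equal timestamps
    char: str       # the inserted character


def merge(a: frozenset[Op], b: frozenset[Op]) -> frozenset[Op]:
    """Set union is commutative, associative, and idempotent."""
    return a | b


def render(state: frozenset[Op]) -> str:
    """Sort by the unique (timestamp, site) ID so every replica agrees."""
    ordered = sorted(state, key=lambda op: (op.timestamp, op.site))
    return "".join(op.char for op in ordered)


alice = frozenset({Op(1, "A", "X")})  # Alice inserts "X" at timestamp 1
bob = frozenset({Op(1, "B", "Y")})    # Bob inserts "Y" at timestamp 1

# Merge order does not matter: both replicas end up with "XY".
assert render(merge(alice, bob)) == render(merge(bob, alice)) == "XY"
```

Because both timestamps are 1, the site ID decides the order, which is why the tie resolves to "XY" on every replica.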

Why CRDTs Over OT?

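Operational Transformation (OT) typically relies on a central server to order and transform concurrent operations.
CRDTs merge without central coordination, which makes them more resilient to network partitions and a natural fit for offline editing.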

Step 3: Real-Time Sync Flow

Scenario: User A edits a document.

sequenceDiagram  
    participant A as User A  
    participant RT as Real-Time Engine  
    participant DS as Document Service  
    participant Redis  
    participant Cassandra  

    A->>RT: Send edit (CRDT op)  
    RT->>DS: Forward edit  
    DS->>Redis: Buffer op  
    RT->>A: Acknowledge  
    RT->>All Clients: Broadcast update  
    DS->>Cassandra: Persist merged state (async)

Optimizations:

Redis Buffering: Temporarily store edits to handle retries during network failures.
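Asynchronous Persistence: Cassandra writes happen after broadcasting to minimize latency. The Redis buffer covers the window before the write lands, so it must itself be durable (for example, Redis with append-only file persistence) to meet the zero-data-loss requirement.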
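
Here is a minimal asyncio sketch of that ordering: buffer the op, broadcast it, acknowledge, and let persistence finish in the background. Everything here is a stand-in under stated assumptions: plain lists replace Redis and Cassandra, an `asyncio.Queue` replaces each WebSocket, and `SyncEngine`, `handle_edit`, and `drain` are hypothetical names.

```python
import asyncio


class SyncEngine:
    def __init__(self):
        self.buffer = []       # stands in for the Redis op buffer
        self.store = []        # stands in for Cassandra
        self.clients = []      # one asyncio.Queue per connected client
        self.events = []       # records the order of steps for inspection
        self._pending = set()  # keeps references to background tasks

    def connect(self) -> asyncio.Queue:
        queue = asyncio.Queue()
        self.clients.append(queue)
        return queue

    async def _persist(self, op):
        await asyncio.sleep(0)  # yield control, as a real network write would
        self.store.append(op)
        self.events.append("persisted")

    async def handle_edit(self, op) -> str:
        self.buffer.append(op)  # buffer first so a retry is possible
        self.events.append("buffered")
        for queue in self.clients:  # fan out to every connected client
            queue.put_nowait(op)
        self.events.append("broadcast")
        task = asyncio.create_task(self._persist(op))  # do not await here
        self._pending.add(task)
        task.add_done_callback(self._pending.discard)
        return "ack"

    async def drain(self):
        """Wait for all background persistence tasks to finish."""
        await asyncio.gather(*self._pending)


async def demo():
    engine = SyncEngine()
    inbox = engine.connect()
    ack = await engine.handle_edit({"insert": "X"})
    await engine.drain()
    return ack, inbox.get_nowait(), engine.events


ack, received, events = asyncio.run(demo())
print(ack, received, events)
```

The recorded events show the persistence step landing after the broadcast, which is the latency tradeoff the optimization describes.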

Step 4: Version History

Goal: Allow users to roll back to any prior state.

graph LR  
    v1[Version 1: Hello] -->|User A inserts !| v2[Version 2: Hello!]  
    v2 -->|User B deletes e| v3[Version 3: Hllo!]  
    v3 --> S3[(S3: Store deltas)]

Implementation:

Store deltas (changes) in S3 with vector clocks for ordering.
Reconstruct versions by replaying deltas.
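
Replaying deltas could look roughly like the sketch below, using the same three versions as the diagram. It assumes the deltas have already been put in order (the job the vector clocks do) and are held in a plain list rather than fetched from S3; `Delta`, `apply_delta`, and `reconstruct` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delta:
    kind: str        # "insert" or "delete"
    index: int       # position in the document the delta applies to
    text: str = ""   # used by inserts
    length: int = 0  # used by deletes


def apply_delta(doc: str, delta: Delta) -> str:
    if delta.kind == "insert":
        return doc[:delta.index] + delta.text + doc[delta.index:]
    if delta.kind == "delete":
        return doc[:delta.index] + doc[delta.index + delta.length:]
    raise ValueError(f"unknown delta kind: {delta.kind}")


def reconstruct(base: str, deltas: list, version: int) -> str:
    """Version 1 is the base; version n applies the first n - 1 deltas."""
    doc = base
    for delta in deltas[: version - 1]:
        doc = apply_delta(doc, delta)
    return doc


deltas = [
    Delta("insert", 5, text="!"),    # User A inserts "!"
    Delta("delete", 1, length=1),    # User B deletes "e"
]

for v in (1, 2, 3):
    print(v, reconstruct("Hello", deltas, v))
```

Replaying from the base gets slower as history grows, so a real system would likely add periodic snapshots and replay only the deltas after the nearest one.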

Step 5: Scaling to Millions of Users

Challenge: Handle 1M+ concurrent documents.

graph TB  
    LoadBalancer -->|Shard 1| US-West[US-West Region]  
    LoadBalancer -->|Shard 2| EU[EU Region]  
    US-West --> Cassandra-West[(Cassandra)]  
    EU --> Cassandra-EU[(Cassandra)]  
    US-West & EU --> GlobalS3[(Global S3)]

Tactics:

Sharding: Distribute documents by doc_id % shard_count.
Active-Active Replication: Cassandra syncs data across regions.
Edge Caching: Use Redis in each region for low-latency reads.
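
The sharding rule can be sketched as follows, assuming document IDs are strings. Python's built-in `hash()` is randomized per process for strings, so the sketch uses a stable digest before applying the modulo; `shard_for` is a hypothetical helper name.

```python
import hashlib


def shard_for(doc_id: str, shard_count: int) -> int:
    """Map a document ID to a shard with a stable hash, then doc_id % shard_count."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# The same document always routes to the same shard.
assert shard_for("doc-42", 8) == shard_for("doc-42", 8)

for doc_id in ("doc-1", "doc-2", "doc-3"):
    print(doc_id, "->", shard_for(doc_id, 8))
```

One caveat with plain modulo: changing `shard_count` remaps most documents, so consistent hashing is a common alternative when shards are added or removed often.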

Step 6: Security

Requirements: Auth, encryption, access control.

OAuth2 for user authentication.
TLS for encrypting data in transit.
Role-Based Access Control (RBAC):
- Owners: Full control.
- Editors: Can edit content but cannot delete the document.
- Viewers: Read-only access.
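
The three roles can be expressed as a small permission table. This is a sketch under assumptions: the action names `read`, `edit`, and `delete` are illustrative, and "full control" for owners would likely cover more actions (such as sharing) in a real system.

```python
from enum import Enum


class Role(Enum):
    OWNER = "owner"
    EDITOR = "editor"
    VIEWER = "viewer"


# Each role maps to the set of actions it may perform on a document.
PERMISSIONS = {
    Role.OWNER: {"read", "edit", "delete"},
    Role.EDITOR: {"read", "edit"},
    Role.VIEWER: {"read"},
}


def can(role: Role, action: str) -> bool:
    """Return True if the role is allowed to perform the action."""
    return action in PERMISSIONS[role]


assert can(Role.OWNER, "delete")
assert can(Role.EDITOR, "edit") and not can(Role.EDITOR, "delete")
assert can(Role.VIEWER, "read") and not can(Role.VIEWER, "edit")
```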

Step 7: Handling Edge Cases

Example: User A edits offline while User B deletes the same text.

Solution:

Offline edits are queued locally on the client, since a disconnected client cannot reach a server-side Redis queue.
On reconnect, CRDT merges User A’s edits, respecting deletion tombstones.
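
A rough sketch of the tombstone idea, assuming a simplified model where each character has a unique `(counter, site)` ID and deletions are recorded in a tombstone set that is never shrunk. The `Replica` class and its methods are hypothetical, and ordering by ID is a simplification of how a real sequence CRDT positions characters.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    elements: dict = field(default_factory=dict)  # element ID -> character
    tombstones: set = field(default_factory=set)  # IDs of deleted elements

    def insert(self, elem_id, char):
        self.elements[elem_id] = char

    def delete(self, elem_id):
        self.tombstones.add(elem_id)  # mark as deleted; never remove the entry

    def merge(self, other):
        """Union both sets; a tombstone always wins over a live element."""
        self.elements.update(other.elements)
        self.tombstones |= other.tombstones

    def text(self):
        live = sorted(
            (eid, ch) for eid, ch in self.elements.items()
            if eid not in self.tombstones
        )
        return "".join(ch for _, ch in live)


base = {(1, "s"): "c", (2, "s"): "a", (3, "s"): "t"}  # shared text "cat"
user_a = Replica(dict(base))
user_b = Replica(dict(base))

user_a.insert((4, "A"), "s")  # User A, offline, appends "s" -> "cats"
user_b.delete((3, "s"))       # User B, online, deletes "t" -> "ca"

user_a.merge(user_b)          # User A reconnects
user_b.merge(user_a)
assert user_a.text() == user_b.text() == "cas"
```

User A's replica still holds the "t" element, but the merged tombstone hides it, so the deletion is not undone while A's own insert survives.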

Final Architecture

flowchart TD  
    Client -->|WebSocket| RT[Real-Time Engine]  
    RT -->|gRPC| DS[Document Service]  
    DS --> Cassandra[(Cassandra)]  
    DS --> Redis[(Redis)]  
    DS --> S3[(S3)]  
    RT --> PS[Presence Service]  
    PS --> Redis

Components:

Real-Time Engine: Manages WebSocket connections and CRDT merging.
Document Service: Handles CRUD operations and versioning.
Redis: Pub/sub for live updates and session storage.
Cassandra: Persistent, scalable document storage.

Key Takeaways

CRDTs simplify conflict resolution in distributed systems.
Tradeoffs are inevitable: Prioritize low latency and availability over strong consistency for real-time apps.
Katas force clarity: Breaking down the problem step-by-step exposes gaps early.

Next Steps

Build a CRDT prototype (try Automerge).
Benchmark Redis vs. Kafka for real-time messaging.
Simulate network partitions with a fault-injection tool such as Toxiproxy, and instance failures with Chaos Monkey.

By practicing system design katas, you’ll develop the muscle memory to tackle even the most daunting architectures. Happy building! 🛠️

This exercise encapsulates the iterative, practice-driven approach of system design katas. For deeper learning, recreate the diagrams, code snippets, and tradeoff analysis yourself!

Erick Santana