Designing a Real-Time Collaborative Editor Using the “Blurry to Sharp” Technique

How to iteratively solve system design problems by progressing from ambiguity to precision.


Introduction

Designing a system like Google Docs requires balancing real-time collaboration, conflict resolution, and scalability. Traditional approaches often dive into technical details too early, leading to analysis paralysis or missed edge cases. Inspired by Kalid Azad’s Blurry to Sharp learning strategy, this article walks through a progressive refinement process for building a collaborative editor. We’ll start with a vague vision and iteratively sharpen it into a detailed architecture.


Phase 1: Blurry Overview

Define the system’s purpose and scope without technical specifics.

Core Requirements

  1. Users edit documents together in real time.
  2. Changes shouldn’t clash.
  3. Track document history.
  4. Handle 50+ concurrent editors.

Blurry Diagram

flowchart TD
    A[User] --> B[System]
    B --> C[Sync Changes]
    B --> D[Resolve Conflicts]
    B --> E[Save History]
    B --> F[Handle Many Users]

Key Risks

  • How to sync changes instantly?
  • How to resolve conflicts when two users edit the same text?
  • How to scale to 50+ concurrent editors?

Lesson: Start by identifying what needs to be solved, not how.


Phase 2: Sharpen Layer 1 – Core Architecture

Define major components and interactions.

Decisions

  • Real-Time Sync: Use WebSockets for low-latency updates; the server pushes changes over a persistent connection instead of waiting for clients to poll.
  • Conflict Resolution: CRDTs (Conflict-free Replicated Data Types) for decentralized merging.
  • Storage: Separate stores for active sessions (Redis, in-memory) and document history (S3, object storage).
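
To make the WebSocket decision more concrete, here is a minimal sketch of an edit message a client might send. The `EditMessage` type and its field names (`doc_id`, `site_id`, `counter`, `kind`, `payload`) are assumptions made for illustration; the design above does not prescribe a wire format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EditMessage:
    """Hypothetical wire format for one edit sent over the WebSocket."""
    doc_id: str    # which document the edit targets
    site_id: str   # which client produced the edit
    counter: int   # per-client sequence number, useful for spotting duplicates
    kind: str      # "insert" or "delete"
    payload: dict  # operation-specific data, e.g. the character and its anchor

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(raw: str) -> "EditMessage":
        return EditMessage(**json.loads(raw))


msg = EditMessage("doc-1", "alice", 1, "insert", {"char": "H", "after": None})
round_tripped = EditMessage.from_json(msg.to_json())
```

Carrying a `(site_id, counter)` pair on every message is one common way to let the receiver ignore retransmissions.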

Diagram (First Sharpening)

flowchart TD
    A[Client] -->|WebSocket| B[Real-Time Service]
    B --> C[Conflict Resolver]
    C --> D[Session Store]
    C --> E[Document Store]
    C --> F[Version History]

Key Questions

  • Which CRDT type is best for text?
  • How to handle offline edits?

Lesson: Prioritize foundational components that address the riskiest unknowns (e.g., conflict resolution).


Phase 3: Sharpen Layer 2 – Component Details

Drill into technologies, data flow, and critical logic.

Decisions

  1. Conflict Resolver: Use Replicated Growable Array (RGA) CRDT for ordered text.
  2. Session Store: Redis (pub/sub for real-time updates).
  3. Document Store: Cassandra (handles high write throughput).
  4. Version History: S3 for delta storage, with a vector clock attached to each delta to record its causal order.
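
The RGA choice can be sketched in a few dozen lines. This is a simplified illustration, not a production implementation: it assumes operations arrive in causal order (an element's anchor is always delivered before the element), it keeps tombstones forever, and the `RGADoc` class and its method names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

ElemId = Tuple[int, str]  # (Lamport counter, site id); tuples compare lexicographically


@dataclass
class Element:
    id: ElemId
    char: str
    deleted: bool = False  # tombstone: the element stays in the list but is hidden


class RGADoc:
    def __init__(self, site: str) -> None:
        self.site = site
        self.clock = 0
        self.elems: List[Element] = []

    def _index_of(self, elem_id: ElemId) -> int:
        for i, e in enumerate(self.elems):
            if e.id == elem_id:
                return i
        raise KeyError(elem_id)

    def local_insert(self, after: Optional[ElemId], char: str):
        self.clock += 1
        op = ("ins", (self.clock, self.site), after, char)
        self.apply(op)
        return op

    def local_delete(self, target: ElemId):
        op = ("del", target)
        self.apply(op)
        return op

    def apply(self, op) -> None:
        if op[0] == "del":
            self.elems[self._index_of(op[1])].deleted = True
            return
        _, new_id, after, char = op
        self.clock = max(self.clock, new_id[0])  # Lamport clock update
        if any(e.id == new_id for e in self.elems):
            return  # duplicate delivery, already integrated
        pos = 0 if after is None else self._index_of(after) + 1
        # Skip elements with a larger id so concurrent inserts at the same
        # anchor end up in the same order on every replica.
        while pos < len(self.elems) and self.elems[pos].id > new_id:
            pos += 1
        self.elems.insert(pos, Element(new_id, char))

    def text(self) -> str:
        return "".join(e.char for e in self.elems if not e.deleted)


alice, bob = RGADoc("alice"), RGADoc("bob")
prev, ids = None, []
for ch in "Hello":
    op = alice.local_insert(prev, ch)
    bob.apply(op)
    prev = op[1]
    ids.append(prev)

# Concurrent edits: Alice turns "Hello" into "Hullo"; Bob inserts "a" after "H".
a_ops = [alice.local_delete(ids[1]), alice.local_insert(ids[0], "u")]
b_ops = [bob.local_insert(ids[0], "a")]
for op in b_ops:
    alice.apply(op)
for op in a_ops:
    bob.apply(op)
```

Both replicas end with the same text even though they applied the concurrent operations in different orders. The tombstones and per-character ids are where the metadata cost of this approach comes from.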

Data Flow Diagram

sequenceDiagram
    participant User
    participant RealTimeService
    participant CRDTResolver
    participant Redis
    participant Cassandra
    participant S3

    User->>RealTimeService: Edit "Hello" → "Hullo"
    RealTimeService->>CRDTResolver: Apply CRDT op
    CRDTResolver->>Redis: Store pending op
    CRDTResolver->>Cassandra: Persist merged state
    CRDTResolver->>S3: Append delta to history
    RealTimeService->>User: Confirm edit
    RealTimeService->>All Clients: Broadcast update
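
The deltas appended to history need a causal order, which is what the vector clocks are for. The helpers below are a minimal sketch; representing a clock as a plain dictionary from site id to counter, and the function names, are assumptions.

```python
from typing import Dict

VectorClock = Dict[str, int]  # site id -> number of events seen from that site


def _le(a: VectorClock, b: VectorClock) -> bool:
    """True if every component of a is <= the matching component of b."""
    return all(a.get(s, 0) <= b.get(s, 0) for s in a.keys() | b.keys())


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    return _le(a, b) and not _le(b, a)


def concurrent(a: VectorClock, b: VectorClock) -> bool:
    return not _le(a, b) and not _le(b, a)


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Component-wise maximum: the clock after seeing both histories."""
    return {s: max(a.get(s, 0), b.get(s, 0)) for s in a.keys() | b.keys()}


v1 = {"alice": 1}
v2 = {"alice": 1, "bob": 1}   # Bob edited after seeing Alice's first edit
v3 = {"alice": 2}             # Alice edited again without seeing Bob's edit
```

Here `v1` happened before `v2`, while `v2` and `v3` are concurrent, so a history reader cannot order them by clock alone and needs a deterministic tie-break.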

Key Risks

  • CRDT metadata overhead.
  • Whether Cassandra's write path keeps up when merged state is persisted on every edit.

Lesson: Prototype core logic early (e.g., CRDT merging) to validate assumptions.


Phase 4: Sharpen Layer 3 – Edge Cases & Optimization

Address failure modes and scaling bottlenecks.

Decisions

  1. Offline Edits: Queue operations on the client while it is offline (it cannot reach Redis); on reconnect, send the queued operations and let the CRDT merge them.
  2. Scalability: Shard documents by doc_id % 100 across Cassandra clusters.
  3. Performance: Cache frequently accessed documents in Redis.
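
The `doc_id % 100` rule is short enough to write down directly. The sketch below assumes numeric document ids; the string variant is an added assumption for systems that use opaque ids, and both function names are hypothetical.

```python
import hashlib

NUM_SHARDS = 100  # from the "doc_id % 100" rule above


def shard_for(doc_id: int) -> int:
    """Shard index in the range 0..99 for a numeric document id."""
    return doc_id % NUM_SHARDS


def shard_for_str(doc_id: str) -> int:
    """Same idea for string ids, using a hash that is stable across processes.

    Python's built-in hash() is randomized per process for strings, so it
    would route the same document to different shards on different nodes.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

One design caveat: with plain modulo sharding, changing `NUM_SHARDS` remaps most documents, so the shard count is worth fixing early or replacing with consistent hashing.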

Scalability Diagram

graph TB
    Client -->|WebSocket| LoadBalancer
    LoadBalancer -->|Shard 1| Node1[Real-Time Service]
    LoadBalancer -->|Shard 2| Node2[Real-Time Service]
    Node1 & Node2 --> ShardedCassandra[(Cassandra: Shards 0-99)]
    Node1 & Node2 --> GlobalRedis[(Redis Cluster)]
    Node1 & Node2 --> GlobalS3[(S3)]

Key Tests

  • Simulate network partitions during merges.
  • Benchmark CRDT merge latency under load.
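
A partition test needs something to exercise. From the client's point of view a partition looks like going offline, so a small harness can buffer operations locally, flush them on reconnect, and check that retries are harmless. The `OfflineQueue` and `Server` classes and the `(site, counter, payload)` operation shape are assumptions for illustration.

```python
from collections import deque
from typing import Callable, Deque, List, Set, Tuple

Op = Tuple[str, int, str]  # (site id, per-site counter, payload); hypothetical shape


class OfflineQueue:
    """Client-side buffer: holds ops while disconnected, flushes in order on reconnect."""

    def __init__(self, send: Callable[[Op], None]) -> None:
        self._send = send
        self._pending: Deque[Op] = deque()
        self.online = True

    def disconnect(self) -> None:
        self.online = False

    def submit(self, op: Op) -> None:
        if self.online:
            self._send(op)
        else:
            self._pending.append(op)

    def reconnect(self) -> None:
        self.online = True
        while self._pending:
            self._send(self._pending.popleft())


class Server:
    """Receiver that ignores ops it has already seen, so retries are harmless."""

    def __init__(self) -> None:
        self.seen: Set[Tuple[str, int]] = set()
        self.log: List[Op] = []

    def receive(self, op: Op) -> None:
        key = (op[0], op[1])
        if key in self.seen:
            return
        self.seen.add(key)
        self.log.append(op)


server = Server()
client = OfflineQueue(server.receive)
client.submit(("alice", 1, "insert H"))
client.disconnect()                        # simulated partition
client.submit(("alice", 2, "insert i"))    # buffered locally
client.reconnect()                         # flushed in original order
server.receive(("alice", 2, "insert i"))   # duplicate retry is dropped
```

A fuller test would randomize when the disconnect happens and compare the final document across replicas.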

Lesson: Design for failure (e.g., network splits, node crashes).


Phase 5: Validate & Iterate

Test assumptions and refine.

Actions

  1. Prototype CRDT Logic:
    • Use Automerge (a JavaScript CRDT library) to simulate 50 users editing the same paragraph.
  2. Stress-Test Infrastructure:
    • Measure Redis/Cassandra latency with 1K concurrent edits.
  3. Chaos Testing:
    • Kill a Cassandra node mid-edit; verify recovery.
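
Before wiring up a real CRDT library, the shape of the 50-user simulation can be tried with a toy stand-in. The sketch below does not use Automerge and is not a text CRDT: it models the document as a grow-only set of `(counter, user, character)` entries, which is enough to check that shuffled and duplicated delivery still converges. The `simulate` function and its parameters are hypothetical.

```python
import random
from typing import List, Set, Tuple

Entry = Tuple[int, int, str]  # (per-user counter, user id, character)


def simulate(num_users: int = 50, edits_per_user: int = 20, seed: int = 0) -> List[str]:
    rng = random.Random(seed)
    # Every user generates its own edits independently.
    ops: List[Entry] = [
        (counter, user, rng.choice("abcdefghijklmnopqrstuvwxyz"))
        for user in range(num_users)
        for counter in range(edits_per_user)
    ]
    rendered = []
    for _ in range(num_users):
        # Each replica sees every op, plus about 10% duplicates, in its own order.
        delivery = ops + rng.sample(ops, k=len(ops) // 10)
        rng.shuffle(delivery)
        state: Set[Entry] = set()
        for op in delivery:
            state.add(op)  # set union is commutative, associative, and idempotent
        rendered.append("".join(ch for _, _, ch in sorted(state)))
    return rendered


docs = simulate()
```

If every string in `docs` is identical, the replicas converged. Swapping the set for a real sequence CRDT keeps the same harness and makes the check meaningful for text.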

Final Architecture (Sharpened)

flowchart TD
    Client -->|WebSocket| LoadBalancer
    LoadBalancer --> RealTimeService[Real-Time Service]
    RealTimeService --> CRDTResolver[CRDT Resolver]
    CRDTResolver --> Redis[(Redis: Sessions & Ops)]
    CRDTResolver --> Cassandra[(Cassandra: Docs)]
    CRDTResolver --> S3[(S3: Version Deltas)]
    RealTimeService --> PresenceService[Presence Service]
    PresenceService --> Redis

Lessons Learned

  1. CRDTs simplified conflict resolution but required careful metadata tracking.
  2. Redis + Cassandra separated fast session data from scalable document storage.
  3. Storing version deltas in S3 proved cost-effective but required delta-replay logic to reconstruct a given version.
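
The delta-replay logic can be illustrated with a simplified model. The sketch assumes index-based deltas of the form `("ins", index, text)` or `("del", index, removed_text)`; a CRDT-backed history would store CRDT operations instead, and the function names are hypothetical.

```python
from typing import List, Tuple

Delta = Tuple[str, int, str]  # ("ins", index, text) or ("del", index, removed_text)


def apply_delta(text: str, delta: Delta) -> str:
    kind, index, payload = delta
    if kind == "ins":
        return text[:index] + payload + text[index:]
    if kind == "del":
        # Check that the delta matches the document before removing anything.
        if text[index:index + len(payload)] != payload:
            raise ValueError("delta does not match document")
        return text[:index] + text[index + len(payload):]
    raise ValueError(f"unknown delta kind: {kind}")


def replay(snapshot: str, deltas: List[Delta], upto: int) -> str:
    """Rebuild the document as it was after the first `upto` deltas."""
    text = snapshot
    for delta in deltas[:upto]:
        text = apply_delta(text, delta)
    return text


history = [("ins", 0, "Hello"), ("del", 1, "e"), ("ins", 1, "u")]
latest = replay("", history, len(history))
```

Replay cost grows with the number of deltas, so storing a periodic full snapshot and replaying only the deltas after it is a common way to bound it.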

Next Iterations

  1. Security Layer: Add OAuth2 and role-based access control.
  2. Global Replication: Sync data across regions with CRDT-aware conflict resolution.
  3. Client Optimization: Compress WebSocket payloads for low-bandwidth users.
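
As a rough sketch of the payload-compression idea, a batch of operations can be serialized compactly and deflated before it is sent. The `pack` and `unpack` helpers and the message layout are assumptions for illustration.

```python
import json
import zlib


def pack(message: dict) -> bytes:
    """Serialize without extra whitespace, then deflate."""
    raw = json.dumps(message, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, 6)


def unpack(blob: bytes) -> dict:
    return json.loads(zlib.decompress(blob).decode("utf-8"))


batch = {
    "doc_id": "doc-1",
    "ops": [
        {"kind": "insert", "site": "alice", "counter": i, "char": "a"}
        for i in range(200)
    ],
}
blob = pack(batch)
raw_size = len(json.dumps(batch).encode("utf-8"))
```

Batched operations share most of their keys, so they tend to compress well. Many WebSocket stacks can also negotiate the permessage-deflate extension (RFC 7692), which compresses at the protocol level without application code.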

Conclusion

The Blurry to Sharp method avoids premature optimization by focusing on the riskiest unknowns first. By iterating from ambiguity to precision, we designed a scalable collaborative editor while minimizing wasted effort. Whether you’re building real-time systems or solving complex problems, this approach helps you learn by refining rather than guessing.