Designing a Real-Time Collaborative Editor Using the “Blurry to Sharp” Technique
How to iteratively solve system design problems by progressing from ambiguity to precision.
Introduction
Designing a system like Google Docs requires balancing real-time collaboration, conflict resolution, and scalability. Traditional approaches often dive into technical details too early, leading to analysis paralysis or missed edge cases. Inspired by Kalid Azad’s Blurry to Sharp learning strategy, this article walks through a progressive refinement process for building a collaborative editor. We’ll start with a vague vision and iteratively sharpen it into a detailed architecture.
Phase 1: Blurry Overview
Define the system’s purpose and scope without technical specifics.
Core Requirements
- Users edit documents together in real time.
- Concurrent changes must merge without clashing or silently losing edits.
- Track document history.
- Handle 50+ concurrent editors per document.
Blurry Diagram
flowchart TD
    A[User] --> B[System]
    B --> C[Sync Changes]
    B --> D[Resolve Conflicts]
    B --> E[Save History]
    B --> F[Handle Many Users]
Key Risks
- How do we sync changes with low enough latency to feel instant?
- How do we resolve conflicts when two users edit the same text?
- How do we scale to 50+ concurrent editors?
Lesson: Start by identifying what needs to be solved, not how.
Phase 2: Sharpen Layer 1 – Core Architecture
Define major components and interactions.
Decisions
- Real-Time Sync: Use WebSockets for instant updates (lower latency than polling).
- Conflict Resolution: CRDTs (Conflict-free Replicated Data Types) for decentralized merging.
- Storage: Separate stores for active sessions (Redis, an in-memory store) and document history (S3, an object store).
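To make "decentralized merging" concrete, here is a minimal sketch of one of the simplest CRDTs, a last-writer-wins register. It is an illustration only, not the editor's text model: the class name, fields, and sample values are assumptions. The point is that merging gives the same result in any order, which is what lets replicas converge without a central coordinator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LWWRegister:
    """State-based last-writer-wins register (illustrative, hypothetical names)."""
    value: str
    timestamp: int   # logical clock value supplied by the writer
    replica_id: str  # tie-breaker when two timestamps are equal

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        # Keep whichever write has the larger (timestamp, replica_id) pair.
        mine = (self.timestamp, self.replica_id)
        theirs = (other.timestamp, other.replica_id)
        return self if mine >= theirs else other


a = LWWRegister("Design notes", timestamp=3, replica_id="alice")
b = LWWRegister("Design Notes v2", timestamp=3, replica_id="bob")
c = LWWRegister("Draft", timestamp=1, replica_id="carol")

# Merge order does not matter, and merging a state with itself changes nothing.
assert a.merge(b) == b.merge(a)
assert a.merge(b).merge(c) == a.merge(b.merge(c))
assert a.merge(a) == a
```

A register like this suits a single value such as a document title, but it discards one of two concurrent writes. That is why document text needs a sequence CRDT rather than this one.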
Diagram (First Sharpening)
flowchart TD
    A[Client] -->|WebSocket| B[Real-Time Service]
    B --> C[Conflict Resolver]
    C --> D[Session Store]
    C --> E[Document Store]
    C --> F[Version History]
Key Questions
- Which CRDT type is best for text?
- How to handle offline edits?
Lesson: Prioritize foundational components that address the riskiest unknowns (e.g., conflict resolution).
Phase 3: Sharpen Layer 2 – Component Details
Drill into technologies, data flow, and critical logic.
Decisions
- Conflict Resolver: Use a Replicated Growable Array (RGA) CRDT for ordered text.
- Session Store: Redis (pub/sub for real-time updates).
- Document Store: Cassandra (handles high write throughput).
- Version History: S3 for delta storage, with a vector clock on each delta to record causal order.
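The RGA choice is easier to evaluate with a small prototype. The sketch below is a simplified RGA and makes several assumptions: element IDs are (Lamport counter, replica ID) pairs, operations arrive in causal order, and deletes leave tombstones. The class, method, and op names are hypothetical and do not come from any library.

```python
from dataclasses import dataclass
from typing import List, Tuple

ElemId = Tuple[int, str]   # (Lamport counter, replica id)
HEAD: ElemId = (0, "")     # sentinel that precedes every character


@dataclass
class Element:
    id: ElemId
    char: str
    deleted: bool = False  # tombstone: deleted characters stay in the list


class RGA:
    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.clock = 0
        self.elements: List[Element] = [Element(HEAD, "")]

    def _index_of(self, elem_id: ElemId) -> int:
        return next(i for i, e in enumerate(self.elements) if e.id == elem_id)

    def _visible(self) -> List[Element]:
        return [e for e in self.elements[1:] if not e.deleted]

    def text(self) -> str:
        return "".join(e.char for e in self._visible())

    def insert(self, pos: int, char: str):
        """Local insert at visible position pos; returns the op to broadcast."""
        ref = HEAD if pos == 0 else self._visible()[pos - 1].id
        op = ("ins", ref, (self.clock + 1, self.replica_id), char)
        self.apply(op)
        return op

    def delete(self, pos: int):
        """Local delete of the visible character at pos; returns the op."""
        op = ("del", self._visible()[pos].id)
        self.apply(op)
        return op

    def apply(self, op) -> None:
        """Apply a local or remote op; inserts are idempotent."""
        if op[0] == "del":
            self.elements[self._index_of(op[1])].deleted = True
            return
        _, ref, new_id, char = op
        self.clock = max(self.clock, new_id[0])
        if any(e.id == new_id for e in self.elements):
            return  # duplicate delivery
        i = self._index_of(ref) + 1
        # Skip elements with larger IDs so concurrent inserts at the same
        # spot end up in the same order on every replica.
        while i < len(self.elements) and self.elements[i].id > new_id:
            i += 1
        self.elements.insert(i, Element(new_id, char))


alice, bob = RGA("alice"), RGA("bob")
for i, ch in enumerate("Hello"):
    bob.apply(alice.insert(i, ch))

# Concurrent edits: Alice changes "e" to "u" while Bob appends "!".
alice_ops = [alice.delete(1), alice.insert(1, "u")]
bob_ops = [bob.insert(5, "!")]
for op in bob_ops:
    alice.apply(op)
for op in alice_ops:
    bob.apply(op)

assert alice.text() == bob.text() == "Hullo!"
```

The tombstones and per-character IDs are the metadata cost of this design: the list never shrinks unless a separate garbage-collection step is added.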
Data Flow Diagram
sequenceDiagram
    participant User
    participant RealTimeService
    participant CRDTResolver
    participant Redis
    participant Cassandra
    participant S3
    User->>RealTimeService: Edit "Hello" → "Hullo"
    RealTimeService->>CRDTResolver: Apply CRDT op
    CRDTResolver->>Redis: Store pending op
    CRDTResolver->>Cassandra: Persist merged state
    CRDTResolver->>S3: Append delta to history
    RealTimeService->>User: Confirm edit
    RealTimeService->>All Clients: Broadcast update
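The "Append delta to history" step relies on vector clocks to tell whether one delta happened before another or whether the two were concurrent. The sketch below shows that comparison under the assumption that a clock is a plain mapping from replica ID to counter. The function names are illustrative, not taken from a specific library.

```python
from typing import Dict

VectorClock = Dict[str, int]  # replica id -> number of events seen from it


def increment(clock: VectorClock, replica: str) -> VectorClock:
    """Return a copy of clock with replica's counter advanced by one."""
    updated = dict(clock)
    updated[replica] = updated.get(replica, 0) + 1
    return updated


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum: the clock of a state that has seen both inputs."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify a relative to b: 'equal', 'before', 'after', or 'concurrent'."""
    replicas = a.keys() | b.keys()
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in replicas)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


base = increment({}, "alice")          # {"alice": 1}
alice_edit = increment(base, "alice")  # {"alice": 2}
bob_edit = increment(base, "bob")      # {"alice": 1, "bob": 1}

assert compare(base, alice_edit) == "before"
assert compare(alice_edit, bob_edit) == "concurrent"
assert compare(merge(alice_edit, bob_edit), alice_edit) == "after"
```

Deltas that compare as "concurrent" have no causal order, so replaying history needs a deterministic tie-break between them, for example by replica ID.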
Key Risks
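- CRDT metadata overhead (per-character IDs and tombstones grow with edit history).
- Whether Cassandra’s write throughput holds up if merged state is persisted on every edit.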
Lesson: Prototype core logic early (e.g., CRDT merging) to validate assumptions.
Phase 4: Sharpen Layer 3 – Edge Cases & Optimization
Address failure modes and scaling bottlenecks.
Decisions
- Offline Edits: Buffer operations on the client while it is disconnected; on reconnect, send them to the server, which queues them in Redis and merges them.
- Scalability: Shard documents by doc_id % 100 across Cassandra clusters.
- Performance: Cache frequent document accesses in Redis.
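The offline-edit path can be pictured as a buffer that holds operations in order and replays them on reconnect. The sketch below is a minimal illustration under that assumption. The in-memory deque stands in for whatever buffer is actually used (client-side storage or a Redis list), and the class and method names are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Op = Dict[str, object]  # an edit operation, kept opaque here


class OfflineQueue:
    """Buffers ops while disconnected and replays them in order on reconnect."""

    def __init__(self, send: Callable[[Op], None]) -> None:
        self._send = send
        self._pending: Deque[Op] = deque()
        self.online = True

    def submit(self, op: Op) -> None:
        if self.online:
            self._send(op)
        else:
            self._pending.append(op)

    def go_offline(self) -> None:
        self.online = False

    def reconnect(self) -> int:
        """Flush buffered ops oldest first; returns how many were sent."""
        self.online = True
        flushed = 0
        while self._pending:
            self._send(self._pending.popleft())
            flushed += 1
        return flushed


server_log: List[Op] = []
queue = OfflineQueue(send=server_log.append)

queue.submit({"seq": 1, "char": "H"})
queue.go_offline()
queue.submit({"seq": 2, "char": "i"})
queue.submit({"seq": 3, "char": "!"})
assert len(server_log) == 1      # offline edits have not reached the server

assert queue.reconnect() == 2    # both buffered ops are replayed
assert [op["seq"] for op in server_log] == [1, 2, 3]
```

Because CRDT operations merge regardless of arrival order, the replay needs no locking against edits other users made in the meantime. It only needs to preserve each client's own operation order.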
Scalability Diagram
graph TB
    Client -->|WebSocket| LoadBalancer
    LoadBalancer -->|Shard 1| Node1[Real-Time Service]
    LoadBalancer -->|Shard 2| Node2[Real-Time Service]
    Node1 & Node2 --> ShardedCassandra[(Cassandra: Shards 1-50)]
    Node1 & Node2 --> GlobalRedis[(Redis Cluster)]
    Node1 & Node2 --> GlobalS3[(S3)]
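The doc_id % 100 routing rule is small enough to sketch directly. The first function below applies the rule as stated for numeric IDs. The second is an assumption for deployments that use string document keys: it hashes the key with SHA-256 first, because Python's built-in hash() for strings is randomized per process and would route the same document differently on different nodes.

```python
import hashlib

NUM_SHARDS = 100  # from the doc_id % 100 rule


def shard_for(doc_id: int) -> int:
    """Route a numeric document ID to a shard index in [0, NUM_SHARDS)."""
    return doc_id % NUM_SHARDS


def shard_for_key(doc_key: str) -> int:
    """Variant for string keys (assumption): stable hash, then modulo."""
    digest = hashlib.sha256(doc_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


assert shard_for(12345) == 45
assert shard_for(100) == 0
assert shard_for_key("doc-abc") == shard_for_key("doc-abc")  # deterministic
assert 0 <= shard_for_key("doc-abc") < NUM_SHARDS
```

A fixed modulus keeps routing trivial, but changing NUM_SHARDS later remaps most documents. That trade-off is worth checking before committing to the scheme.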
Key Tests
- Simulate network partitions during merges.
- Benchmark CRDT merge latency under load.
Lesson: Design for failure (e.g., network splits, node crashes).
Phase 5: Validate & Iterate
Test assumptions and refine.
Actions
- Prototype CRDT Logic:
  - Use Automerge (a JavaScript CRDT library) to simulate 50 users editing the same paragraph.
- Stress-Test Infrastructure:
  - Measure Redis/Cassandra latency with 1K concurrent edits.
- Chaos Testing:
  - Kill a Cassandra node mid-edit; verify recovery.
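Before wiring in a real CRDT library, the shape of the 50-user test can be sketched with a stand-in. The harness below does not use Automerge or its API. It uses a grow-only set of uniquely identified operations as a placeholder CRDT, delivers the same operations to several replicas in different shuffled orders (with some duplicates), and checks that every replica renders the same text. All names and parameters are assumptions for illustration.

```python
import random
from typing import List, Set, Tuple

Op = Tuple[int, str, str]  # (per-user sequence number, user id, inserted word)


class Replica:
    """Placeholder CRDT: a grow-only set of ops, rendered in sorted order."""

    def __init__(self) -> None:
        self.ops: Set[Op] = set()

    def apply(self, op: Op) -> None:
        self.ops.add(op)  # set insertion is commutative and idempotent

    def text(self) -> str:
        return " ".join(word for _, _, word in sorted(self.ops))


def simulate(num_users: int = 50, edits_per_user: int = 4,
             num_replicas: int = 3, seed: int = 0) -> List[str]:
    rng = random.Random(seed)  # seeded so a failing run can be reproduced
    ops: List[Op] = [
        (n, f"user{u:02d}", f"w{u}_{n}")
        for u in range(num_users)
        for n in range(edits_per_user)
    ]
    replicas = [Replica() for _ in range(num_replicas)]
    for replica in replicas:
        # Each replica sees every op, in its own order, with ~10% duplicates.
        delivery = ops + rng.sample(ops, k=len(ops) // 10)
        rng.shuffle(delivery)
        for op in delivery:
            replica.apply(op)
    return [replica.text() for replica in replicas]


texts = simulate()
assert len(set(texts)) == 1  # every replica converged to the same text
```

The same harness structure carries over once the placeholder is replaced: generate operations, deliver them out of order with duplicates, and assert that all replicas agree.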
Final Architecture (Sharpened)
flowchart TD
    Client -->|WebSocket| LoadBalancer
    LoadBalancer --> RealTimeService[Real-Time Service]
    RealTimeService --> CRDTResolver[CRDT Resolver]
    CRDTResolver --> Redis[(Redis: Sessions & Ops)]
    CRDTResolver --> Cassandra[(Cassandra: Docs)]
    CRDTResolver --> S3[(S3: Version Deltas)]
    RealTimeService --> PresenceService[Presence Service]
    PresenceService --> Redis
Lessons Learned
- CRDTs simplified conflict resolution but required careful metadata tracking.
- Redis + Cassandra separated fast session data from scalable document storage.
- Storing version deltas in S3 was cost-effective but required delta-replay logic to reconstruct any given version.
Next Iterations
- Security Layer: Add OAuth2 and role-based access control.
- Global Replication: Sync data across regions with CRDT-aware conflict resolution.
- Client Optimization: Compress WebSocket payloads for low-bandwidth users.
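As a rough feel for the payload-compression item, the sketch below batches operations as JSON and compresses the batch with zlib before it would be sent. The operation fields are made up for illustration, and application-level compression is only one option: WebSocket stacks that support the permessage-deflate extension (RFC 7692) can compress frames without any application code.

```python
import json
import zlib
from typing import Any, Dict, List


def encode_batch(ops: List[Dict[str, Any]]) -> bytes:
    """Serialize a batch of ops to compact JSON, then deflate it."""
    raw = json.dumps(ops, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, level=6)


def decode_batch(payload: bytes) -> List[Dict[str, Any]]:
    """Inverse of encode_batch."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))


# Hypothetical op shape: field names are assumptions, not a wire format.
ops = [
    {"type": "ins", "doc": "doc-42", "ref": [i, "alice"],
     "id": [i + 1, "alice"], "char": "x"}
    for i in range(200)
]

raw_size = len(json.dumps(ops, separators=(",", ":")).encode("utf-8"))
packed = encode_batch(ops)

assert decode_batch(packed) == ops   # lossless round trip
assert len(packed) < raw_size        # repetitive op metadata compresses well
```

Batching matters as much as the codec here: compressing one tiny operation at a time saves little, while a batch of similar operations shares most of its structure.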
Conclusion
The Blurry to Sharp method avoids premature optimization by focusing on the riskiest unknowns first. By iterating from ambiguity to precision, we designed a scalable collaborative editor while minimizing wasted effort. Whether you’re building real-time systems or solving complex problems, this approach helps you learn by refining rather than guessing.