Integrating and Normalizing Data from Multiple Sources

TLDR:

When integrating data from various Online Travel Agencies (OTAs) like Booking.com and Expedia, it’s crucial to aggregate and normalize the data while maintaining a clear separation between external and internal models. This article explains how to structure your database and codebase using known architectural patterns, and why avoiding the term “core” is important to prevent confusion with Domain-Driven Design (DDD)’s core domain.

Introduction:

In systems that interact with external services, particularly in industries like hospitality where different OTAs (Online Travel Agencies) play a major role, it’s common to encounter challenges related to data consistency and integration. Each OTA has its own way of representing reservations, properties, and other entities, which can make it difficult to achieve a unified internal system.

This challenge leads to questions like: How do you aggregate data from multiple sources while ensuring consistency? How do you structure your code and database so that external and internal models remain clearly separated? And what terminology should you use to avoid confusion with established concepts like DDD’s core domain?

In this article, we’ll walk through the process of integrating and normalizing OTA data into a unified system, and how to properly organize both your database and codebase using known patterns like Data Aggregation, Data Normalization, the Adapter Pattern, and Anti-Corruption Layers. Additionally, we’ll explore why the term “core” can be misleading and suggest more appropriate terminology for your project.

Problem: Integrating Data from Multiple OTAs into a Unified System

Managing data from OTAs such as Booking.com and Expedia is complex because each one has its own data model and represents reservations, properties, and other entities differently. This variation makes it difficult to provide a consistent internal view of the data.

For instance, you might receive reservation data from Booking and Expedia, but the way each represents customer details, room information, or reservation statuses may differ. You need a way to transform this external data into a format that your system can use consistently, without introducing unnecessary complexity.

Moreover, the term “core” is often used to describe tables or models that centralize data from these different sources, but this can lead to confusion with Domain-Driven Design (DDD)’s concept of core domain, which refers to the most business-critical part of a system.

The challenge lies in:

- Aggregating and normalizing data from different OTAs while maintaining a clear separation between external and internal data models.
- Organizing your database and codebase in a way that is scalable, maintainable, and avoids introducing confusion with improper terminology like “core.”

Solution: A Structured Approach to Aggregation, Normalization, and Code Architecture

To solve these challenges, it’s important to use well-defined concepts for both database design and code architecture, while replacing terms like “core” with more meaningful terminology to avoid confusion with DDD’s core domain.

Clarifying Data Aggregation vs. Data Normalization

Data Aggregation and Data Normalization are distinct but related concepts, and they should not be used interchangeably. Each has a specific purpose and is applied in different contexts:

Data Aggregation:
- Definition: Data aggregation is the process of collecting and summarizing data from different sources into a unified form. In your case, this means gathering data from different OTAs (like Booking and Expedia) and bringing it together in a meaningful way.
- Purpose: Aggregation is typically used when you need to bring together diverse sets of data to derive insights or make it available for further processing. The goal is often to summarize or combine data.
- Usage: You aggregate when you’re pulling together reservations from Booking, Expedia, and other OTAs into a unified view, for instance, a high-level overview of all reservations.
- Example: Aggregating reservation counts from different OTAs into a single daily or monthly report.

Data Normalization:
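- Definition: Data normalization is the process of transforming data into a standard structure and format so that it is consistent and free of redundant representations, which makes it easier to process and store. In this article the term means mapping source-specific schemas onto one shared schema, rather than applying relational normal forms.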
- Purpose: Normalization is used when you need to transform diverse data into a standardized, clean format. It’s about making sure that data from different sources conforms to a single schema.
- Usage: You normalize when you take data from Booking and Expedia, which might have different fields or structures, and transform it into a single, standardized format that your internal system can use.
- Example: Normalizing different reservation schemas (e.g., booking_reservation vs. expedia_reservation) into a common reservation format.
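
The normalization step described above can be sketched in Kotlin as follows. This is a minimal illustration, not the actual OTA payloads: the field names on BookingReservation and ExpediaReservation, the status strings, and the ReservationStatus enum are all assumptions made up for the example.

```kotlin
import java.time.LocalDate

// Hypothetical external shapes; real OTA payloads will differ.
data class BookingReservation(
    val reservationId: String,
    val guestName: String,
    val arrival: String,    // ISO date string, e.g. "2024-05-01"
    val departure: String,
    val status: String      // assumed values: "ok", "cancelled"
)

data class ExpediaReservation(
    val confirmationNumber: String,
    val firstName: String,
    val lastName: String,
    val checkIn: LocalDate,
    val checkOut: LocalDate,
    val state: String       // assumed values: "CONFIRMED", "CANCELED"
)

// Internal, source-independent representation.
enum class ReservationStatus { CONFIRMED, CANCELLED, UNKNOWN }

data class NormalizedReservation(
    val source: String,
    val externalId: String,
    val guestFullName: String,
    val checkIn: LocalDate,
    val checkOut: LocalDate,
    val status: ReservationStatus
)

// Each mapping translates one external schema into the common one.
fun BookingReservation.toNormalized() = NormalizedReservation(
    source = "BOOKING",
    externalId = reservationId,
    guestFullName = guestName.trim(),
    checkIn = LocalDate.parse(arrival),
    checkOut = LocalDate.parse(departure),
    status = when (status.lowercase()) {
        "ok" -> ReservationStatus.CONFIRMED
        "cancelled" -> ReservationStatus.CANCELLED
        else -> ReservationStatus.UNKNOWN
    }
)

fun ExpediaReservation.toNormalized() = NormalizedReservation(
    source = "EXPEDIA",
    externalId = confirmationNumber,
    guestFullName = "$firstName $lastName".trim(),
    checkIn = checkIn,
    checkOut = checkOut,
    status = when (state.uppercase()) {
        "CONFIRMED" -> ReservationStatus.CONFIRMED
        "CANCELED", "CANCELLED" -> ReservationStatus.CANCELLED
        else -> ReservationStatus.UNKNOWN
    }
)
```

Mapping unrecognized status values to UNKNOWN rather than throwing is a design choice for this sketch; a stricter system might prefer to reject such records.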

Key Difference:

Aggregation refers to collecting and summarizing data from different sources, while normalization refers to transforming data into a consistent structure.
In short, aggregation combines data; normalization makes data consistent.
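
To make the contrast concrete, here is a small Kotlin sketch of aggregation: it takes reservations that are already in one common shape and summarizes them into daily counts per OTA. The Reservation and DailyCount classes and the dailyCounts function are hypothetical names introduced only for this example.

```kotlin
import java.time.LocalDate

// Trimmed-down stand-in for an already-normalized reservation (hypothetical).
data class Reservation(val source: String, val checkIn: LocalDate)

// One row of the aggregated report.
data class DailyCount(val day: LocalDate, val source: String, val count: Int)

// Groups reservations by (check-in day, source) and counts each group.
fun dailyCounts(reservations: List<Reservation>): List<DailyCount> =
    reservations
        .groupingBy { it.checkIn to it.source }
        .eachCount()
        .map { (key, count) -> DailyCount(key.first, key.second, count) }
        .sortedWith(compareBy<DailyCount> { it.day }.thenBy { it.source })
```

Note that the aggregation only works cleanly because its input shares one schema, which is why normalization usually comes first.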

Replacing the Term “Core” and Understanding Core Domain

The term “core” is worth replacing because it can be confused with Domain-Driven Design (DDD)’s Core Domain. Let’s clarify:

Core Domain (in DDD):
- The core domain refers to the central part of your business logic that differentiates your product or service from competitors. This is where the most valuable and complex business rules live.
- Using the term “core” in a non-business-critical context (e.g., referring to tables that aggregate or normalize data) could confuse developers or stakeholders, as it implies that this data or logic is the most critical part of the business, which may not be the case.

Inappropriate Use of “Core”:
- Using “core” to describe technical or infrastructure elements (like a data normalization process) could give the impression that those processes are business-critical, when they may simply be support layers. In DDD, “core domain” should be reserved for the strategic, competitive part of the business logic.

Suggested Terminology for Your Data Model

Instead of “core,” you might want to use terms that more clearly describe the role of these tables in the system:

- Unified Model or Common Model: To represent the internal data model that unifies data from multiple OTAs.
- Normalized Reservation / Normalized Property: To describe the internal tables that store data after normalization.
- Aggregated View: If there are tables specifically created to summarize or aggregate data from different sources.

Example Code Package Structure

To ensure clarity and maintainability, you should organize your code into distinct modules that handle specific responsibilities. Here’s an example structure for your project:

com.zatlas
│
├── reservation
│   ├── model                
│   │   ├── NormalizedReservation.kt
│   │   └── NormalizedProperty.kt
│   ├── repository           
│   │   └── ReservationRepository.kt
│   ├── service
│   │   └── ReservationService.kt
│   └── adapter
│       ├── BookingAdapter.kt
│       └── ExpediaAdapter.kt
│
├── ota
│   ├── booking
│   │   ├── model
│   │   └── service
│   ├── expedia
│   │   ├── model
│   │   └── service
│   └── common
│       └── OTAService.kt
│
└── common
    ├── dto
    └── util

Explanation of Package Structure:

reservation Package:
- This package contains all code related to the internal, normalized model of reservations and properties. It includes:
  - Model: Defines the internal representation (NormalizedReservation, NormalizedProperty).
  - Repository: Interacts with the database (e.g., saving and retrieving normalized data).
  - Service: Contains business logic related to reservations.
  - Adapter: Adapts external OTA models (e.g., Booking, Expedia) to the internal format.

ota Package:
- This package deals specifically with OTA models and logic. Each OTA (Booking, Expedia) has its own sub-package containing:
  - Model: External model representing the OTA’s data structure.
  - Service: OTA-specific logic for fetching or processing OTA data.

common Package:
- Contains shared utilities, DTOs, and helpers that are used throughout the application, including utilities for things like date conversion, data validation, and more.
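
As a rough sketch of how these packages could fit together, the Kotlin below wires adapters, a repository, and a service in a single file. The class names follow the package tree, but every field, the ReservationAdapter interface, the in-memory repository, and the ingest function are assumptions for illustration; a real project would split these across the packages noted in the comments and back the repository with a database.

```kotlin
// ota.booking.model and ota.expedia.model: hypothetical external shapes.
data class BookingReservation(val reservationId: String, val guestName: String)
data class ExpediaReservation(val confirmationNumber: String, val firstName: String, val lastName: String)

// reservation.model: internal representation.
data class NormalizedReservation(val source: String, val externalId: String, val guestFullName: String)

// reservation.adapter: translates an external model into the internal one.
interface ReservationAdapter<T> {
    fun adapt(external: T): NormalizedReservation
}

class BookingAdapter : ReservationAdapter<BookingReservation> {
    override fun adapt(external: BookingReservation) =
        NormalizedReservation("BOOKING", external.reservationId, external.guestName)
}

class ExpediaAdapter : ReservationAdapter<ExpediaReservation> {
    override fun adapt(external: ExpediaReservation) =
        NormalizedReservation("EXPEDIA", external.confirmationNumber, "${external.firstName} ${external.lastName}")
}

// reservation.repository: persistence boundary (in-memory here for the sketch).
interface ReservationRepository {
    fun save(reservation: NormalizedReservation)
    fun findAll(): List<NormalizedReservation>
}

class InMemoryReservationRepository : ReservationRepository {
    // Keyed by (source, externalId) so re-ingesting a reservation updates it.
    private val store = LinkedHashMap<Pair<String, String>, NormalizedReservation>()

    override fun save(reservation: NormalizedReservation) {
        store[reservation.source to reservation.externalId] = reservation
    }

    override fun findAll(): List<NormalizedReservation> = store.values.toList()
}

// reservation.service: only ever sees the normalized model.
class ReservationService(private val repository: ReservationRepository) {
    fun <T> ingest(external: T, adapter: ReservationAdapter<T>): NormalizedReservation {
        val normalized = adapter.adapt(external)
        repository.save(normalized)
        return normalized
    }
}
```

In this arrangement the adapters play the role of a small anti-corruption layer: OTA-specific types stop at the adapter boundary, and the service and repository depend only on NormalizedReservation.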

Why This Structure Works:

Separation of Concerns: Each part of the system (normalization, OTA-specific logic, database interaction) is clearly separated, making it easier to maintain and extend the code.
Adhering to DDD: You are now explicitly modeling the difference between external models (OTA-specific models) and internal models (normalized or unified structures), without conflating them with terms like “core,” which could imply deeper business-critical significance.

Conclusion

- Data Aggregation refers to the process of collecting and summarizing data from multiple sources. It’s used when you need a high-level view or combined data.
- Data Normalization refers to standardizing and structuring data into a consistent format. It’s used when you need to ensure data consistency and remove redundancy.
- The term “core” should be avoided unless it truly refers to the most critical business logic (core domain in DDD).
- A clear separation between external models (from OTAs) and internal models (normalized or unified structures) should be maintained both in database and code architecture.
- The suggested code structure separates concerns between OTA-specific data and internal normalized models, providing clarity and maintainability.

Erick Santana

Explorer