Imagine an organization that has four applications—let’s call them A, B, C, and D—each with its own database. If we use the ETL (Extract, Transform, and Load) method to combine data from these four databases, the steps are as follows:
- Extract
Data from each database (A, B, C, and D) must be fully downloaded. This means a large volume of data is transferred over the network.
- Transform
Once the data is extracted, the next step is transformation, because the format or representation of data in each database may differ. For instance, database A might store data in a “triangular” format, while B uses a “quadrilateral” format. Consequently, B’s data must be transformed into a “triangular” format so it can be combined with A’s data. Similarly, if C and D store data in “circular” or “pentagonal” formats, each must be transformed accordingly to ensure compatibility.
- Load
The transformed data is then loaded into a centralized database—let’s call it Z. This database is where all the data from A, B, C, and D converges.
The challenge is that Z must have a large storage capacity, because it holds the combined data from all sources, and it also needs high computational power (CPU) to process that volume of data. This approach is commonly referred to as building a data warehouse or a big data platform.
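To make the three ETL steps concrete, here is a minimal sketch in Python. The sources are simulated as in-memory lists, and the field names (amount, total, value, sum) are purely illustrative assumptions; a real pipeline would read from the actual databases of A, B, C, and D, typically through a dedicated ETL tool.

```python
# Minimal ETL sketch (illustrative only): every source is fully extracted,
# transformed into the common "triangular" representation used by A, then
# loaded into the central warehouse Z.

# --- Extract: pull the ENTIRE dataset from each source over the network ---
source_a = [{"amount": 120, "year": 2023}, {"amount": 90, "year": 2024}]  # "triangular"
source_b = [{"total": 300, "fiscal_year": 2023}]                          # "quadrilateral"
source_c = [{"value": 75, "period": "2024"}]                              # "circular"
source_d = [{"sum": 40, "yr": 2023}]                                      # "pentagonal"

# --- Transform: map each source's schema onto the common (A-style) schema ---
def transform(record):
    amount = (record.get("amount") or record.get("total")
              or record.get("value") or record.get("sum"))
    year = (record.get("year") or record.get("fiscal_year")
            or record.get("period") or record.get("yr"))
    return {"amount": amount, "year": int(year)}

# --- Load: copy all transformed rows into the central warehouse Z ---
warehouse_z = [transform(r) for r in source_a + source_b + source_c + source_d]
print(f"Z now holds {len(warehouse_z)} rows copied from all four sources")
```

Even in this toy version, every row from every source crosses the network and ends up duplicated in Z, which is exactly the cost the next section avoids.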
Data Virtualization as a Solution
Like ETL, data virtualization aims to integrate data from A, B, C, and D to generate insights or reports, but it does so more efficiently. The concept is to make it appear as if all the data from these databases is stored in one virtual location—let’s call it X—even though, in reality, A, B, C, and D keep their data locally.
When a data request is made (for example, “Please show me the annual financial report”), X sends a specific request to each source database:
- Processing at the Source
When X requests data from A, for example, the initial processing happens in A. Only the necessary data is sent to X, not the entire dataset. The same process occurs for B, C, and D.
- Filtered Data
The data sent to X is usually smaller, such as aggregated data, transformed data, or filtered data. This reduces network load because only relevant data is transferred.
- Consolidation at X
Finally, X consolidates the data from A, B, C, and D. The user accessing X sees the results as if all the data were stored in a single place, even though it remains physically distributed, as illustrated in the sketch below.
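Here is a minimal Python sketch of this flow, reusing the same kind of simulated sources as the ETL example above. The query_source and annual_report helpers are made-up names, and per-source format differences are ignored to keep it short; real products express the pushed-down step as federated queries against each source.

```python
# Data virtualization sketch (illustrative only): X pushes the work down to
# each source and consolidates only the small, pre-aggregated results.

def query_source(rows, year):
    """Runs AT the source: filter and aggregate locally, return one tiny row."""
    subtotal = sum(r["amount"] for r in rows if r["year"] == year)
    return {"year": year, "subtotal": subtotal}

def annual_report(sources, year):
    """Runs at X: collect one small result per source and consolidate them."""
    partials = [query_source(rows, year) for rows in sources]  # only these cross the network
    return {"year": year, "total": sum(p["subtotal"] for p in partials)}

# Each application keeps its data locally; only subtotals travel to X.
sources = [
    [{"amount": 120, "year": 2023}, {"amount": 90, "year": 2024}],  # A
    [{"amount": 300, "year": 2023}],                                # B
    [{"amount": 75, "year": 2024}],                                 # C
    [{"amount": 40, "year": 2023}],                                 # D
]
print(annual_report(sources, 2023))  # {'year': 2023, 'total': 460}
```

Note that X never holds the full datasets; it only ever sees one small subtotal per source, which is the basis for the advantages listed next.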
Advantages of Data Virtualization
- Reduced Network Load
Because only processed or filtered data is transferred, the volume of data moving through the network is significantly lower than downloading the entire dataset.
- Efficient Storage
With ETL, database Z must hold all the combined data, requiring substantial storage capacity. In data virtualization, X only stores the integrated results (aggregated or filtered data), requiring less storage space.
- Distributed CPU Utilization
Computational processes (initial transformations) occur on each database server (A, B, C, and D). X primarily consolidates the final data, so it does not require a powerful CPU to handle large-scale data processing.
- Greater Flexibility and Speed
If the data format changes or a new data source is added, adjustments can be made more quickly because each source continues to operate on its own system. Integration takes place virtually, rather than by moving all data to a single location, as sketched below.
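The last point can be sketched as well. The adapter registry below is a hypothetical structure, not any product’s API; it only illustrates that integrating a new source E means registering one more adapter at X, rather than re-extracting and re-loading everything into Z.

```python
# Flexibility sketch (illustrative only): X keeps a registry of source
# adapters, each of which pushes a query down to its own database.

def adapter_a(year):
    return {"subtotal": 210}  # pretend result pushed down to A's database

def adapter_e(year):
    return {"subtotal": 55}   # new source E, added later, with its own internal format

virtual_layer = {"A": adapter_a}   # existing registration
virtual_layer["E"] = adapter_e     # adding E is one registration, not a reload of Z

total = sum(adapter(2023)["subtotal"] for adapter in virtual_layer.values())
print(total)  # 265
```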
Through these mechanisms, data virtualization proves to be a more efficient choice than ETL. It minimizes the network, storage, and computational burdens that arise when data must be extracted, transformed, and loaded into a single, large database. Thus, data virtualization is a reliable solution for integrating data across multiple applications within an organization.