Your business collects mountains of data. All the insights to improve your operations, strategies, and revenue are right in it. These insights are as valuable as diamonds, and they’re already yours. But you can’t use them until you get them out of the rock.
To get insights from data, you’ll need to collect and blend it using the so-called pipeline. It’s a series of processing steps to integrate and prepare data for analysis. The type, number, and sequence of steps depends on the pipeline architecture.
The ETL architecture is one of the most widespread. This design pattern has been around since the 1970s. It evolved, survived the move of data platforms from on-premises to the cloud, and embraced the rise of microservice software.
What is ETL architecture?
Without getting into much detail about what ETL is, let’s define it as a process of getting information from siloed sources, reshaping it, and pushing it to storage. The sequence of steps in this process—extract, transform, load—plays a key role in defining the pipeline architecture.
The distinctive feature of the ETL architecture is that data goes through all required preparation procedures before it reaches the warehouse. As a result, the final repository contains clean, complete, and trustworthy data to be used for BI without further amendments.
Now let’s visualize an ETL pipeline architecture in the form of a simplified diagram. It will show the main components of the pipeline and the processes that take place inside them.
The ETL architecture diagram
Every company has a unique set of information sources, tech stack, and business logic. So, there won’t be two identical real-word ETL architecture diagrams. For example, if you use a particular cloud service, your diagram may name the special tools it provides for ETL:
- Azure Data Factory and Azure Synapse Analytics by Microsoft
- Cloud Data Fusion, Dataflow, and Dataproc by Google
- Amazon S3, AWS Glue, and AWS Redshift by Amazon
Alternatively, you may have the whole pipeline inside one SaaS tool, such as Coupler.io, Skyvia, or Fivetran. Inside such a data integration tool, you’ll deal with an interface without bothering about the software components running under its hood.
In any case, an ETL architecture diagram shows the data sources, analytics tools, and data flow between them. It outlines the exact sequence of sub-processes in three main areas: ingestion, processing, and storage.
This area temporarily stores a copy of the raw data extracted from multiple sources. These can be production, operational, sales, marketing, customer support, and other software and files:
- Operational systems registering day-to-day transactions
- Databases, CRM, ERP, and project management apps
- XBRL, XML, JSON, flat files, or files in other formats
The information can be sourced using integration tools, APIs, webhooks, or web scraping tools. Every software has quotas and rates for requests. This requires you to balance the frequency and volume of extractions to stay within the limits.
This part of the pipeline hosts a set of automated processes transforming the data. The most common processes that take place at this stage are:
- Cleansing. Raw data may contain corrupt information that should be detected and removed before any other actions.
- Dropping irrelevant. If the information has no value for BI, it should be removed to save your computing and storage resources.
- Enrichment. Information extracted from different sources can be merged into a more comprehensive data set.
- Reshaping. You may need to change the format or the level of granularity for some data before you can use it for analytics.
- Business logic. Some data points may not exist in the source but should be calculated using particular rules and formulas.
- Validation. After several transformations, the resulting data set should be checked against a set of criteria to exclude errors.
Most of the transformations are simple. Yet, the implementation of business logic is often challenging, even more so if it changes with time. A change in business definitions may affect several sequential transformations and make you reconsider the processing stage of the pipeline.
This area represents the destination point where ready-to-use data is loaded. That’s the database that connects to reporting, visualization, and BI tools used by the company employees and management.
Depending on how much space your storage has, you can decide whether to overwrite existing data with every new load or maintain both historical and new data side-by-side. Overwriting data conserves storage and enhances performance but discards potentially valuable information.
The target repository may reside on-site on your proprietary server or use AWS, Azure, GCP, or other public cloud infrastructure. In any case, the storage represents a monolithic, centralized warehouse that serves as a single source of truth for the whole company.
ETL architecture best practices
When choosing ETL architecture, it’s important to be aware of its limitations and know ways to overcome them. You might face some performance and maintenance issues if your pipeline doesn’t embrace the following best practices.
The ETL pipeline consists of sequential and interdependent operations that form one big end-to-end process. Yet, from the technical perspective, every component of the extract, transform, and load stages should represent a separate microservice. This helps you distribute computing between independent modules that run in parallel. As a result, the whole system becomes more resilient, scalable, and easier to maintain.
Optimal extraction volumes
During the first extraction, you might need to load all the available data from the source systems. Every subsequent extraction should only capture the data that was added or updated since the last collection. This incremental approach helps you save computing power and stay within your quotas for webhook calls and API requests.
Incremental batch loading
Common reporting and BI tasks rarely require real-time updates. Instead of using complex data streaming technology, you can configure your pipeline to load new information to the storage in batches weekly, daily, or hourly… Similar to extraction, the load should be incremental and push new and updated data only.
Process logging and monitoring
To stay on top of what’s going on inside your ETL pipeline, you should monitor its performance and log every sub-process. For every step of the pipeline, the logs should contain beginning and ending timestamps, current statuses, error codes, and the amount of data processed. The logs should be comprehensive enough for troubleshooting and be stored for a reasonable time.
ETL security and access
Your central warehouse might be used by employees from different departments. You need to manage rights and permissions to control what information users can access. In addition, regardless of their status, end users should not be able to access or manipulate the data in the ingestion and processing areas.
Is ETL architecture right for your pipeline?
The ETL architecture has stood the test of time. Yet, it slowly gives way to newer pipeline design patterns—such as ELT, Data Mesh, or Data Fabric—that proved to be more flexible and scalable. Whether you should use ETL or choose one of its advanced alternatives depends on several factors.
|Factors||Use ETL||Use alternatives|
|Company size||SMB||SMB, Enterprise|
|Raw & output data||Structured||Structured, semi-structured, unstructured|
|Repository||Single, centralized, domain-agnostic warehouse with limited space||Multiple domain-specific, decentralized, interoperable storages, data lakes, unlimited space|
|Data volume||Small data sets that require complicated transformations||Large datasets that require high processing speed and efficiency|
|Privacy||PII data to be encrypted||Non-PII data|
Summarising the comparison table, we should admit that alternative pipeline designs can do everything that the ETL does. Moreover, they can do it at scale. There are still a few cases where you may prefer to adhere to the classic ETL architecture, for example:
- You need to feed data into a legacy, on-site warehouse.
- You collect PII data and need to encrypt it to make your storage privacy compliant.
- You move data that requires complex transformations before it’s loaded to a repository.
One more reason to choose the ETL architecture is its simplicity. All the ETL practices are well documented. There’s also no lack of highly-qualified talent. You can easily find a data architect able to build an ETL pipeline for you.
Moreover, for simple ETL pipelines, you don’t even have to build anything from scratch. Instead, you can choose among the best ETL tools available on the market and get a solid pipeline even if you don’t know how to code.
Connect your ETL application to a data analytics service, and you’ll have a fully-fledged BI tool.
With this setup, you can quickly turn raw data into actionable information. This is the fastest way to start making informed decisions, get improved outcomes, and achieve your business goals.Back to Blog