Data Transformation in ETL

Today, businesses and organizations generate and collect massive amounts of data from a variety of sources, including social media, IoT devices, and legacy systems. However, the collected data is often incomplete, inconsistent, and scattered across those sources.

This data can be transformed into a useful format and integrated into a single repository, such as a data warehouse, to enable data-driven decision-making. This is where data transformation with ETL (Extract, Transform, Load) comes in.

What is ETL transformation?

ETL transformation is the process of converting raw data from source systems into a format that is suitable for the target system. A good example of this is a retail business that operates multiple stores across different regions. The business collects data on sales, inventory, and customer demographics every day and wants to integrate this data into a data warehouse or data lake for analysis and reporting.

The first step is to extract the data from different sources. These sources could be relational databases, a Customer Relationship Management (CRM) tool, or a Point of Sale (POS) system. The next step is to transform the extracted data into a format that is suitable for the data warehouse. The final step is to load the transformed data into the data warehouse. This involves mapping the data to the target schema and then verifying that the load completes successfully in the target system.
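
As a rough sketch, the transform and load steps could be expressed directly in SQL against a staging table. The table and column names below (staging_sales, warehouse_sales, and so on) are purely illustrative:

SQL
-- Illustrative sketch: transform staged POS records and load them into a warehouse table.
-- Table and column names (staging_sales, warehouse_sales) are hypothetical.
INSERT INTO warehouse_sales (store_id, sale_date, product_id, revenue)
SELECT store_id,
       CAST(sold_at AS DATE) AS sale_date,   -- standardize timestamps to dates
       product_id,
       quantity * unit_price AS revenue      -- derive revenue from the raw fields
FROM staging_sales
WHERE sold_at IS NOT NULL;                   -- skip records that cannot be dated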

Data transformations don’t always happen after data extraction. They can also happen after the data is loaded into the target system. This process is called ELT (Extract, Load, Transform) and enables users to take advantage of the massive processing capabilities of modern data warehouses to run more efficient queries. Since most data warehouse tools like Redshift and Snowflake support massively parallel processing of large volumes of data and are now more accessible due to affordable pricing, ELT has become more popular.
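
With ELT, a similar transformation would run inside the warehouse itself after the raw data has been loaded. The raw_orders and clean_orders tables below are hypothetical:

SQL
-- Illustrative ELT-style transform: the raw data already sits in the warehouse,
-- and the cleanup runs there using the warehouse's own compute.
CREATE TABLE clean_orders AS
SELECT order_id,
       CAST(order_date AS DATE) AS order_date,
       UPPER(region) AS region               -- standardize region codes
FROM raw_orders
WHERE order_id IS NOT NULL;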

By using ETL (or ELT) to centralize data from its various data sources, the retail store from the example above can create a single source of truth for its sales, inventory, and customer data. This data can be used to analyze trends and patterns across different stores and regions, identify opportunities for growth and optimization, and make data-driven decisions to improve business performance.

What are ETL transformation types?

It’s important for both data engineers and analysts to be aware of and understand data transformations, regardless of where they occur in the data pipeline.

Ingested data may arrive in different formats like JSON, XML, or CSV. It may be structured, semi-structured, or unstructured. Initial data transformation steps require standardizing the data and choosing the ETL tools that appropriately fit the type and format of the data.

There are some additional steps that are often required to shape data into a dataset that’s easy to extract business insights from. These steps can happen within a data pipeline automated by an ETL tool, but they often require a mix of SQL and Python scripting to build out an end-to-end workflow:

Deduplication

Data deduplication is a technique used in data management to identify and eliminate duplicate data entries within a data set.

Data deduplication involves analyzing the content of data blocks within a data set and comparing them to identify duplicates. This process can be done using a variety of techniques, such as checksums, hashes, or content-aware algorithms.

The goal of data deduplication is to reduce data storage space and improve data management efficiency by eliminating redundant data.
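
One common way to express deduplication in SQL is to keep a single row per business key, for example the most recent record per email address. The customers table and its columns below are hypothetical:

SQL
-- Illustrative deduplication sketch: keep only the most recent row for each email address.
-- Table and column names (customers, email, updated_at) are hypothetical.
SELECT email, first_name, last_name, updated_at
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY updated_at DESC) AS row_num
    FROM customers
) ranked
WHERE row_num = 1;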

Derivation

Data derivation is the process of creating new data elements or modifying existing data elements from one or more source data elements using a defined set of rules or formulas. It is a type of data transformation that involves manipulating data to generate new data values.

At a technical level, data derivation involves using programming logic or mathematical formulas to create new data values based on one or more source data elements.

A simple example of data derivation is calculating a new data element based on existing data elements in a database. For instance, consider a database that stores customer order information, including the order date and order amount. It would be valuable to derive a new data element that indicates the month in which each order was placed.

A derivation rule that truncates each order date to its month and stores the result in a new data element could be used to generate the order month. For example, the following query could be used in SQL:

SQL
SELECT order_id, order_date, order_amount, DATE_TRUNC('month', order_date) AS order_month
FROM orders;

In this example, the DATE_TRUNC function truncates each order_date to the first day of its month, and the result is returned as a new field called order_month.

Joining

Data joining is the process of combining two or more data sets by matching rows based on a common attribute or set of attributes. Data joining involves comparing the values in one or more columns of two or more tables or data sets and identifying rows with matching values. These matching rows are then combined into a single output table or data set with the columns from each input table being concatenated to form the columns of the output table.

Data joining is a powerful technique that allows data from multiple sources to be combined and analyzed together, enabling new insights and a deeper understanding of the data. It is commonly used in business intelligence, data warehousing, and analytics applications, as well as in scientific research and other data-driven fields.
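
For example, order records could be combined with customer attributes by matching on a shared customer ID. The orders and customers tables below are hypothetical:

SQL
-- Illustrative join sketch: enrich each order with attributes of the customer who placed it,
-- matching rows on the shared customer_id column.
SELECT o.order_id,
       o.order_date,
       o.order_amount,
       c.customer_name,
       c.region
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id;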

Aggregating

Data aggregation is the process of summarizing, grouping, or consolidating data from multiple rows or records into a single value or set of values. It allows large amounts of data to be reduced to more manageable and meaningful representations.

At a technical level, data aggregation involves applying a function or calculation to a set of data elements and producing a new output value or set of values. The function used for data aggregation can be a simple mathematical operation, such as summing or averaging, or a more complex statistical or analytical function.

Say you need to calculate the total sales revenue by product category. Suppose you have a sales data set that includes information on individual sales transactions, including the date, product ID, product category, quantity sold, and sale price.

To aggregate this data by product category and calculate the total sales revenue for each category, you would group the data by the product category attribute and apply a sum function to the sale price column. This could be done using SQL, for example, with the following query:

SQL
SELECT product_category, SUM(sale_price) AS total_sales
FROM sales_data
GROUP BY product_category;

Splitting

Data splitting is a technique used in data management and analytics to divide a large data set into smaller, more manageable subsets.

At a technical level, data splitting can be performed using a variety of methods, depending on the specific requirements of the analysis or application. One common method is to randomly split the data set into two or more subsets, with each subset containing a proportionate sample of the original data. This is useful for data scientists who need to perform data modeling or train machine learning algorithms, where it is necessary to have representative samples of the data for training, testing, and validation.
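
As a rough sketch, a random split can be expressed in SQL on warehouses that expose a random-number function, although the exact syntax varies by platform. The events table and the 80/20 ratio below are illustrative:

SQL
-- Illustrative random split sketch: tag roughly 80% of rows for training and 20% for testing.
-- RANDOM() returning a value between 0 and 1 is PostgreSQL/Redshift behavior; other warehouses differ.
SELECT *,
       CASE WHEN RANDOM() < 0.8 THEN 'train' ELSE 'test' END AS split
FROM events;

In practice, a deterministic split (for example, one based on hashing a stable ID) is often preferred so the same rows land in the same subset on every run.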

Cleaning

Data cleaning refers to identifying and correcting or removing errors, inconsistencies, and inaccuracies in a dataset. Data cleaning involves a range of tasks, including:

  • Removing duplicates and correcting errors.
  • Standardizing data and converting it to a specific format, such as converting all dates to a consistent format or converting currency values to a standard currency.
  • Handling missing data by imputing missing values or removing records with missing data.
  • Identifying and removing outliers to get rid of records with extreme values that may skew the analysis.
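
A minimal SQL sketch of a few of these tasks might look like the following, assuming a hypothetical raw_customers table with stray whitespace, inconsistent country codes, and missing phone numbers:

SQL
-- Illustrative cleaning sketch on a hypothetical raw_customers table:
-- trim stray whitespace, standardize country codes, and fill in missing values.
SELECT customer_id,
       TRIM(customer_name) AS customer_name,
       UPPER(country_code) AS country_code,
       COALESCE(phone, 'unknown') AS phone
FROM raw_customers
WHERE customer_id IS NOT NULL;               -- drop records missing the primary key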

Sorting and ordering

Sorting and ordering are techniques used in data management and analysis to arrange data in a specific sequence based on certain criteria. Sorting refers to arranging data by the values of one or more keys, while ordering specifies the direction of that arrangement: ascending or descending.

Suppose you have a database of customer orders, and you want to retrieve the top 10 orders by order value. To do this, you would use a SQL query that includes sorting and ordering statements, such as:

SQL
SELECT customer_name, order_date, order_value
FROM orders
ORDER BY order_value DESC
LIMIT 10;

Mapping

Data mapping is the process of transforming data from one format or structure to another. It involves identifying the relationships between data elements in different datasets and creating a mapping between them to enable data integration and transformation.
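
For example, a mapping step might rename source columns to match the target schema and translate coded values into the labels the target expects. The source_orders table and the status codes below are hypothetical:

SQL
-- Illustrative mapping sketch: rename source columns to the target schema
-- and translate numeric status codes into readable labels.
SELECT ord_id AS order_id,
       ord_dt AS order_date,
       CASE status_cd
            WHEN 1 THEN 'pending'
            WHEN 2 THEN 'shipped'
            WHEN 3 THEN 'delivered'
            ELSE 'unknown'
       END AS order_status
FROM source_orders;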

Filtering

Filtering is a data processing technique that involves selecting a subset of data based on specified criteria. It is commonly used to extract relevant data from large datasets or to remove unwanted data from a dataset.

An example of data filtering would be filtering a dataset of customer orders to show only the orders from a specific date range. Let's say you have a dataset of customer orders with the following columns: OrderID, CustomerID, OrderDate, ProductID, Quantity, and TotalPrice.

To filter the dataset to show only the orders from January 1, 2022, to March 31, 2022, you could use a filtering function or query that specifies the date range criteria. For example, in SQL, you could use the following query:

SQL
SELECT * FROM Orders
WHERE OrderDate >= '2022-01-01' AND OrderDate <= '2022-03-31';

Conclusion

Data transformations play a crucial role in the Extract-Transform-Load (ETL) process of data integration. ETL transformations are used to clean, validate, and transform data from various sources into a common format so it can be loaded into data warehouses for analytics and fed into business intelligence tools for building rich visualizations.

Changing business needs require data professionals to quickly adapt pipelines to handle new use cases. This makes it crucial to automate as much of the pipeline as possible in order to streamline data transformation processes and reduce the potential for human error.

Ultimately, data transformations within the ETL process are a key component of any data management strategy and are essential to unlocking data-driven decision-making.
