Easy and quick validator for tables present in the same DB for ETL validations

“ETL testing refers to the process of validating, verifying, and qualifying data while preventing duplicate records and data loss.”

In ETL testing we often come across scenarios in which the source table(s) and the target table being compared are present in the same database. Being on the same database gives us the power to make use of the minus query.
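
For instance, a query of the following form (table and column names are purely illustrative; in Redshift, MINUS is accepted as a synonym for EXCEPT) surfaces the rows present in the source but missing from the target:

(SELECT id, name FROM source_table) MINUS (SELECT id, name FROM target_table)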

We will be making use of JDBC through a script in Java to achieve this.

Parts of the Validator:

METADATA VALIDATOR: Before starting any kind of validation, the metadata should look good.

Before proceeding with validation, we need two things:

  1. SOURCE QUERY: We can make use of INFORMATION_SCHEMA.COLUMNS to get the metadata. Based on the transformation logic and the number (n) of source tables being used, we can write the script to prepare n metadata queries and use UNION to put them together in a single query (preparing a single query with all tables in it can lead to duplicate entries in the source query result, as joins are definitely going to be used in the case of multiple sources). In case of complex and inconsistent mapping logic, we can design our script to read the mapping from an Excel sheet. A sketch of stitching the per-table queries together follows this list.

2. TARGET QUERY: Similarly, we need a target query, which is a simple query like: SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'target_table_name'.
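
For example, a minimal sketch of how the script could assemble the source metadata query (the table names and the selected metadata columns are assumptions for illustration):

// Hypothetical list of source tables, e.g. read from a mapping/Excel sheet.
String[] sourceTables = {"src_table_1", "src_table_2"};

StringBuilder sourceQuery = new StringBuilder();
for (int i = 0; i < sourceTables.length; i++) {
    if (i > 0) {
        sourceQuery.append(" UNION ");
    }
    // One metadata query per source table, combined with UNION.
    sourceQuery.append("SELECT column_name, data_type FROM INFORMATION_SCHEMA.COLUMNS")
               .append(" WHERE table_name = '").append(sourceTables[i]).append("'");
}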

Now that we have both the queries, we can start the validations and prepare reports (this is covered in the last part).

COUNT VALIDATOR: Before proceeding with Data Validations, we must ensure the Source and Target counts match.

Here also we need two queries:

  1. SOURCE QUERY: It can be as simple as "SELECT COUNT(*) FROM table". For a complex query with joins, we can enclose the same in SELECT COUNT(*) FROM (query), as shown in the sketch after this list. Based on the requirements, we can skip counting for complex queries and directly proceed with Data Validations. In case all tables are mapped one to one, even with some columns dropped, we can make use of the "svv_table_info" system view (just an example in Redshift); with its help we can get the counts of all the tables in a single query.
  2. TARGET QUERY: Similar to the source, we can prepare our target query.
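
As a minimal sketch (the join query here is a hypothetical placeholder; note that Redshift, like Postgres, requires an alias on the derived table):

// Wrap an arbitrary, possibly join-heavy, source query in a count.
String complexSourceQuery = "SELECT a.id, b.name FROM a JOIN b ON a.id = b.a_id"; // hypothetical
String sourceCountQuery = "SELECT COUNT(*) FROM (" + complexSourceQuery + ") AS q";
String targetCountQuery = "SELECT COUNT(*) FROM target_table_name";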

Now that we have both the queries, we can start the validations and prepare reports (this is covered in the last part).

DATA VALIDATOR: At last come the Data Validations, which give us the final confidence that everything is looking good. (A filter might be added here to proceed with Data Validations only for those tables which passed count validation.)

  1. SOURCE QUERY: For preparing Data Validation queries we need the actual column names, so the script does need some reference to prepare those queries (this depends entirely on the business logic). If it's a direct one-to-one mapping between source and target, we can simply make use of INFORMATION_SCHEMA.COLUMNS to get the column names. In case some particular columns are dropped, we can add them to the filters while querying INFORMATION_SCHEMA. In case of complex mapping logic with joins, there needs to be a staging table or Excel sheet with all the relevant info required by the script to generate the data validation query. A sketch of the one-to-one case follows this list.
  2. TARGET QUERY: Similarly, we can prepare the target query. Target queries generally don't have any join conditions or data filters, so they are pretty simple to make.
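
For the direct one-to-one case, a minimal sketch (the dropped column name is a hypothetical example; statement is the JDBC Statement created in the connection example shown later):

import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

// Collect the target table's column names, skipping columns known to be dropped.
List<String> columns = new ArrayList<>();
ResultSet meta = statement.executeQuery(
    "SELECT column_name FROM INFORMATION_SCHEMA.COLUMNS" +
    " WHERE table_name = 'target_table_name' AND column_name NOT IN ('dropped_col')");
while (meta.next()) {
    columns.add(meta.getString("column_name"));
}
// Build the data validation query from the collected column list.
String dataQuery = "SELECT " + String.join(", ", columns) + " FROM target_table_name";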

Now that we have both the queries, we can start the validations and prepare reports.

Doing validations and preparing reports:

As we are done with the preparation of all the queries, we can proceed with the validations. For the validations we can have a common function which takes in, say, the test case name, source query, and target query, performs the validations, and prepares the reports too. Reports are prepared as csv files, as they can hold a lot of data and are space-efficient.
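
A minimal sketch of such a function (the method name and the report-writing steps are illustrative; in Redshift, MINUS is accepted as a synonym for EXCEPT):

import java.sql.SQLException;

// Common entry point: derives the three report queries from the inputs,
// executes them, and writes the csv reports.
void validate(String testCaseName, String sourceQuery, String targetQuery) throws SQLException {
    String onlySource = "(" + sourceQuery + ") MINUS (" + targetQuery + ")";
    String onlyTarget = "(" + targetQuery + ") MINUS (" + sourceQuery + ")";
    String match = "(" + targetQuery + ") MINUS ((" + targetQuery + ") MINUS (" + sourceQuery + "))";
    // For each of the three queries: execute, stream rows into its csv file,
    // maintain a counter, then append a Pass/Fail line to SUMMARY.csv.
}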

For executing the queries we will open a JDBC connection, execute each query through a Statement, and iterate over the resulting ResultSet.

Example:

import java.sql.*;

// Load the Redshift JDBC 4.2 driver.
Class.forName("com.amazon.redshift.jdbc42.Driver");
String jdbcURL = ""; // e.g. jdbc:redshift://<endpoint>:<port>/<database>
Connection connection = DriverManager.getConnection(jdbcURL);
Statement statement = connection.createStatement();
ResultSet rs = statement.executeQuery("Query");
while (rs.next())
{
    // read the current row's columns from rs here
}

We will be making 4 types of reports:

1. ONLY SOURCE.csv: query = (Source Query) minus (Target Query). This gives us the records present only in the source. While iterating the result set we can keep writing to the csv file line by line. We can also maintain a counter which gives us the count of records present only in the source.

ResultSet rs = stmt.executeQuery(query);
int onlySourceCount = 0;
while (rs.next())
{
    // write the current row to ONLY SOURCE.csv, then increment the counter
    onlySourceCount++;
}

2. ONLY TARGET.csv: (Target Query) minus (Source Query). This gives us the records present only in the target. While iterating the result set we can keep writing to the csv file line by line. We can also maintain a counter which gives us the count of records present only in the target.

3. MATCH.csv: (Target Query) minus ((Target Query) minus (Source Query)). This gives us the records present in both. While iterating the result set we can keep writing to the csv file line by line. We can also maintain a counter which gives us the count of matching records.

4. SUMMARY.csv: This contains the status of a particular test case, Pass/Fail. It also contains details such as the Source Query, Target Query, Execution Time, Executor's Name, Only Source Count, Only Target Count, and Matching Records Count.
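
A minimal sketch of appending one summary line (the column order is illustrative, and a real implementation should quote or escape commas inside the queries before writing csv):

import java.io.FileWriter;
import java.io.IOException;

// Append one summary line per test case to SUMMARY.csv.
void writeSummary(String testCaseName, String sourceQuery, String targetQuery,
                  String executionTime, String executorName,
                  long onlySourceCount, long onlyTargetCount, long matchCount) throws IOException {
    // Pass only when no records are exclusive to either side.
    String status = (onlySourceCount == 0 && onlyTargetCount == 0) ? "Pass" : "Fail";
    try (FileWriter fw = new FileWriter("SUMMARY.csv", true)) {
        fw.write(String.join(",", testCaseName, status, sourceQuery, targetQuery,
                executionTime, executorName,
                String.valueOf(onlySourceCount), String.valueOf(onlyTargetCount),
                String.valueOf(matchCount)) + "\n");
    }
}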

This summarizes the whole validator, which can be designed to achieve full automation, where just providing table names in an Excel sheet should be sufficient.

NOTE: This is just an overview of the steps that can be followed to build such a process, and it will vary with user needs and complexity.
