Data Quality

It matters. Data quality problems can have a significant impact on a company’s bottom line. Bad data can result in redundant work and missed opportunities. Data quality problems can accumulate, increasing in scope and impact, as data moves through the enterprise. In the worst cases, this can cause executives to reach incorrect conclusions and make bad business decisions. Pretty serious stuff. Yet most companies have no formal data quality programs that can measure and mitigate data quality problems. Most companies are not even aware that they have a data quality problem.

The solution is to institute an enterprise data quality (DQ) program. By its very nature, an enterprise DQ program is beyond the capabilities of any single canned solution. DQ requires a holistic approach – with touchpoints throughout the business and implemented across a range of technologies. DQ should be an integral part of the data processing pipeline and should not be limited to just offline, retrospective analysis. DQ is not just about customer name and address cleansing. It’s about the consistency and representation of all enterprise information.

If the technologies used for DQ are to be part of the processing pipeline, they have to be production-level robust. They have to deal with complex legacy data, real-time transactions, and high sustained processing volumes. Approaches that do not meet all these requirements end up being relegated to offline deployments and rarely meet expectations. This is what typically happens with special-purpose niche DQ tools that specialize in certain types of data and that can be used only in limited circumstances.

Ab Initio’s approach to data quality is different – it is end-to-end. Because the Ab Initio Co>Operating System is a complete application development and execution environment, Ab Initio’s approach to data quality works anywhere the Co>Operating System can be deployed, which is in practically any operational or analytics environment. The Co>Operating System natively processes complex legacy data, runs distributed across heterogeneous sets of servers, is very high performance and completely scalable, and can implement highly complex logic. (Learn more about the Co>Operating System.)

Ab Initio’s end-to-end approach to data quality is based on design patterns using Ab Initio’s seamlessly coupled technologies – they are all architected together – including the Co>Operating System, the Enterprise Meta>Environment (EME), the Business Rules Environment (BRE), and the Data Profiler. Using Ab Initio, a company can implement a complete data quality program including detection, remediation, reporting, and alerting.

Architectural overview

When it comes to DQ, one size does not fit all, especially for large organizations with many legacy systems. Ab Initio, therefore, provides a series of powerful building blocks that allow users to put together custom data quality solutions that meet their specific needs, whatever those needs might be. For users who are just starting to put a data quality program in place, Ab Initio supplies a reference implementation that can serve as the foundation of a complete program. For users who have different needs, or who already have pieces of a data quality program in place, Ab Initio’s DQ building blocks may be plugged together with existing infrastructure as desired.

A typical data quality implementation starts with constructing a powerful, reusable DQ processing component with the Co>Operating System, as shown below:

The Co>Operating System enables components to contain whole applications. This particular reusable DQ processing component is an application in its own right, and includes the following (a conceptual sketch follows the list):

  • A subsystem that detects and possibly corrects data quality problems. The Co>Operating System serves as the foundation for implementing defect detection. The BRE can be used to specify validation rules in an analyst-friendly interface, and the Data Profiler can be integrated into the process for trend analysis and detailed problem detection.
  • A data quality reporting system. The EME includes built-in data quality reporting that integrates data quality metrics, error counts, and data profile results with the rest of an enterprise’s metadata. Users can extend the EME schema to store additional data quality information and to augment the base EME capabilities with their own reporting infrastructure.
  • An issue reporting database. Records that have data quality issues are logged in a database or file so they can be examined as part of a complete data quality workflow. Ab Initio provides the technology to store, retrieve, and view those records, although users are free to select any data storage technology that meets their needs.
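To make that structure concrete, here is a minimal sketch in Python. Ab Initio applications are built as graphs running on the Co>Operating System, not Python, so every name below (DQComponent, Issue, and so on) is a hypothetical stand-in rather than an Ab Initio API:

```python
# Conceptual sketch only; the real component is built from Ab Initio graphs
# running on the Co>Operating System. All names are hypothetical stand-ins.
from dataclasses import dataclass, field
from typing import Any, Iterable

@dataclass
class Issue:
    record_id: Any
    field_name: str
    error_code: str

@dataclass
class DQComponent:
    # (field_name, error_code, test) triples; test returns True when the value is OK
    rules: list
    metrics: dict = field(default_factory=dict)    # stands in for EME DQ reporting
    issue_log: list = field(default_factory=list)  # stands in for the issue database

    def process(self, records: Iterable[dict]):
        """Validate each record, log issues, count errors, and pass records downstream."""
        for rec in records:
            for fld, error_code, test in self.rules:
                if not test(rec.get(fld)):
                    self.issue_log.append(Issue(rec.get("id"), fld, error_code))
                    self.metrics[error_code] = self.metrics.get(error_code, 0) + 1
            yield rec  # every record continues through the pipeline
```

In an actual deployment, the metrics and issue_log stand-ins correspond to the EME reporting and the issue database described above, and the validation step would typically be driven by rules authored in the BRE.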

This DQ processing component is typically run as part of existing applications. If an application has been built with Ab Initio, the DQ component can easily be plugged into it. For applications not built with Ab Initio, the DQ processing component has to be explicitly invoked. The DQ component can also be implemented as an independent job that sources data directly. Below are examples of both deployment cases, standalone and integrated into an existing application:

Data quality processing workflow

The diagram below illustrates a sample complete data quality detection workflow. It is important to remember that each DQ deployment is tailored to the user’s specific needs.

As indicated earlier, the input to this DQ Process (A) can be any type of data from any source. It can be a flat file, a database table, a message queue, or a transaction in a web service. It can also be the output of some other process implemented with Ab Initio or another technology. Because the DQ Process runs on top of the Co>Operating System, the data can be anything the Co>Operating System can handle: complex legacy data, hierarchical transactions, international data, and so on.

The output of the DQ Process (B) can also be any type of data going to any target.

The first step is to apply Validation Rules (1) to the data. Validation rules can be run against individual fields, whole records, or whole datasets. Since each record may have one or more issues, the validation rules may produce a set of DQ issues on a per-record basis (E). The severity of these issues and what to do about them are decided further downstream.

Next, cleansing rules are applied to the data (2), and the output is the result of the DQ Process (B). Users can apply built-in Ab Initio cleansing rules or build their own with the Co>Operating System. While validation and cleansing rules are easily entered with the BRE, there is no limit to the sophistication of these rules, since they can use the full power of the Co>Operating System’s data processing.

Records that cannot be cleansed are output to a Problems Archive (4). These problem records then typically go through a human workflow to resolve their issues.
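The routing just described (validate, attempt to cleanse, archive what cannot be fixed) can be summarized in a short hedged sketch. This is plain Python for illustration only; in practice each step is a graph component running on the Co>Operating System:

```python
# Illustrative routing logic for the workflow above; not Ab Initio code.
def run_dq_process(records, validate, cleanse, archive, output):
    """validate(rec) -> list of issue codes; cleanse(rec, issues) -> fixed rec or None."""
    for rec in records:
        issues = validate(rec)           # (1) per-record list of DQ issues (E)
        if not issues:
            output(rec)                  # clean records flow straight through
            continue
        fixed = cleanse(rec, issues)     # (2) apply cleansing rules
        if fixed is not None:
            output(fixed)                # (B) cleansed output of the DQ Process
        else:
            archive(rec, issues)         # (4) problem archive for human review
```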

The list of issues for each record (E) may also be analyzed (3) to generate reports and alerts (5). Because this process is built using standard Ab Initio “graphs” with the Co>Operating System, practically any type of reporting and processing can be done. Ab Initio’s standard DQ approach includes the following (a brief sketch of these calculations follows the list):

  • Calculating data quality metrics, such as completeness, accuracy, consistency, and stability
  • Determining frequency distributions for individual fields
  • Generating aggregate counts of error codes and values
  • Comparing current values for all the above with historical values
  • Signaling significant deviations in any of the current measurements from the past
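As a rough illustration of these calculations, the sketch below assumes that completeness is the fraction of non-blank values and that a “significant deviation” is a simple percentage change from a historical value; the actual metrics and thresholds in a deployment are configurable and computed by Ab Initio graphs:

```python
# Hedged sketch of the calculations listed above; not Ab Initio's implementation.
from collections import Counter

def completeness(values):
    """Fraction of non-null, non-blank values in a field."""
    if not values:
        return 1.0
    return sum(1 for v in values if v not in (None, "")) / len(values)

def frequency_distribution(values):
    """Counts of each distinct value in a field."""
    return Counter(values)

def deviation_alert(current, historical, threshold=0.10):
    """Signal when a current measurement drifts more than `threshold` from its history."""
    if historical == 0:
        return current != 0
    return abs(current - historical) / abs(historical) > threshold
```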

All the information generated above is stored in the Ab Initio EME for monitoring and future reference. All DQ information can be integrated with all other metadata, including reference data that is also stored in the EME.

While all the computation associated with these steps may consume significant CPU resources, the Co>Operating System’s ability to distribute workload across multiple CPUs, potentially on multiple servers, allows full data quality processing to always be part of the processing pipeline.

As demonstrated above, Ab Initio’s approach to data quality measurement includes a rich set of options that can be customized and configured to meet a user’s needs. The processing of the data, calculation of results, and all the steps in between are implemented using the Ab Initio Co>Operating System. This means that data quality detection can be run on almost any platform (Unix, Windows, Linux, mainframe z/OS), and on any type of data, with very high performance. In situations where large volumes of data are being processed, the entire data quality detection process can be run in parallel to minimize latency.

The next several sections present examples of the analyst-friendly user interfaces for creating validation rules and reporting on data quality results.

Validation rules

Most data quality issues are detected by applying validation rules to the source dataset. With the Ab Initio data quality design pattern, record-at-a-time validation rules can be defined using the Ab Initio Business Rules Environment (BRE). The BRE is designed to allow less technical users, subject matter experts, and business analysts to create and test validation rules using a spreadsheet-like interface.

The BRE provides two ways to define validation rules. In most cases, users define rules by filling out a simple spreadsheet (a validation grid) with field names down the left side and validation tests across the top:

This interface makes it very easy to specify which validation tests should be applied to each field or column in a dataset. The BRE includes a number of built-in validation tests (nulls, blanks, value ranges, data formats, domain membership, etc.). But it is also possible for the development staff to define custom validation tests that can be applied to individual fields. Custom validation tests are written by developers using the Ab Initio Data Manipulation Language, and then made available in the BRE.

For more complex validation rules, the BRE allows for the definition of “tabular rules.” These rules can examine multiple input fields within a record to determine whether there are data quality issues. Each rule can produce an error code and a disposition code, which together drive the amelioration process.
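Both styles of rule can be pictured with a small hedged sketch. The grid is represented here as a mapping from field names to named tests, and the tabular rule as a function over several fields returning an (error code, disposition) pair; the field names, tests, and codes are invented for illustration, and the BRE expresses the same ideas in its spreadsheet interface rather than in code:

```python
# Illustrative only; the BRE captures these rules in its spreadsheet interface,
# and the field names, tests, and codes below are invented for the example.
import re

# "Validation grid": field names down the side, named tests across the top.
VALIDATION_GRID = {
    "customer_id": ["not_null"],
    "zip_code":    ["not_null", "matches_zip"],
    "birth_date":  ["valid_date_format"],
}

TESTS = {
    "not_null":          lambda v: v not in (None, ""),
    "matches_zip":       lambda v: bool(re.fullmatch(r"\d{5}(-\d{4})?", str(v or ""))),
    "valid_date_format": lambda v: bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(v or ""))),
}

def apply_grid(record):
    """Yield (field, failed_test) for every grid test that fails on the record."""
    for fld, test_names in VALIDATION_GRID.items():
        for name in test_names:
            if not TESTS[name](record.get(fld)):
                yield (fld, name)

# "Tabular rule": several input fields examined together, producing an
# error code and a disposition code.
def check_closed_account(record):
    """Return (error_code, disposition) or None if the record passes."""
    if record.get("status") == "CLOSED" and record.get("balance", 0) != 0:
        return ("E_CLOSED_NONZERO_BALANCE", "route_to_archive")
    return None
```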

The BRE enables subject matter experts to design, enter, and test validation rules all from the same user interface. The BRE’s testing capability allows users to interactively see which rules trigger for various inputs. This makes it easy to ensure that the rules are behaving as expected.

The screen shot below shows validation rules during testing. The BRE displays trigger counts for every validation test, as well as the details for each test record.

Validation rules are saved in the EME, which provides for version control, access control, and configuration management. For applications that are built entirely with Ab Initio, including the DQ process, the application and the DQ rules are versioned, tagged, and promoted into production together. This ensures a robust DQ process.

While the BRE makes it easy for less technical users to define validation rules, it is not the only way to define such rules. The full power of the Co>Operating System’s transformation technology is available for implementing the most complex rules. Because the BRE and transformation rules both run on top of the Co>Operating System, it’s possible to create a very comprehensive data quality measurement strategy.

Reporting

Detection is the first part of a complete data quality implementation. The second major component of a data quality program is reporting.

Data quality reporting is driven by the Enterprise Meta>Environment (EME). Ab Initio’s EME is an enterprise-class and enterprise-scale metadata system architected to manage the metadata needs of business analysts, developers, operations staff, and others. It handles many types of metadata from different technologies in three categories – business, technical, and operational – and this metadata includes data quality statistics.

Ab Initio stores data quality statistics in the EME for reporting purposes. One type of DQ information stored in the EME is the aggregated counts of error codes (issues) for individual fields and datasets. The counts are linked to the dataset being measured and to the fields with issues. Issues are aggregated and reported by error code; the error codes come from a global set of reference codes stored in the EME (the EME supports reference code management).
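As a simple illustration of this aggregation step, assuming each detected issue carries a dataset name, field name, and error code (the details of how such counts are structured and loaded into the EME differ in practice):

```python
# Hedged sketch: roll detected issues up into per-dataset, per-field, per-error-code
# counts of the kind loaded into the metadata repository; the structure is assumed.
from collections import Counter

def aggregate_issue_counts(issues):
    """issues: iterable of (dataset, field_name, error_code) tuples."""
    return Counter((dataset, fld, code) for dataset, fld, code in issues)
```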

The screen shot below shows the EME’s ability to display field-level issues along with historical trending graphs. Counts that exceed configurable thresholds are highlighted in yellow or red.

As shown below, Ab Initio is able to calculate data quality metrics for datasets and fields (columns), and these, too, are stored in the EME. There is a corresponding tabular report of these metrics, which includes trending graphs and yellow/red thresholds.

When data quality measurements are captured throughout a large environment, it is possible to aggregate the information according to the user’s organizational structure. This makes it possible for managers to view data quality metrics for entire systems, applications, and/or subject areas in one report. From this report, problem areas can then be investigated by drilling down into the details.

The screen shot below shows a number of higher-level subject areas and their aggregated data quality metrics:

Reporting: Lineage

Many users roll out a data quality program by implementing data quality detection across a number of datasets in a single system. For example, it is not unusual to see data quality measured for all the tables in an enterprise data warehouse, but nowhere else. Although measuring data quality in one system is better than not measuring data quality at all, a more useful data quality program includes data quality checks at multiple stages across the entire enterprise processing pipeline. For example, data quality might be measured at the enterprise data warehouse, but also at the system of record, at intermediate processing points, and downstream in the various data marts or extract systems. Each of these systems can capture quality metrics whether or not they were built with Ab Initio.

The EME multiplies the value of a data quality program when data quality measurements are made at multiple points in an enterprise. This is because the EME can combine data lineage with data quality metrics to help pinpoint precisely where, and in which systems, data quality problems are being introduced.

Consider the following screen shot:

This screen shot shows an expanded lineage diagram in the EME. Each large gray box represents a different system. The smaller green, red, and gray boxes represent datasets and applications.

Data quality metrics may flag individual elements. Green is good. Red indicates a data quality problem. With these diagrams, it is easy to follow the path of data quality problems, from where they start to where they go. For the first time, management can actually see how data and problems are flowing through their environment.

Finally, DQ reporting is not limited to the built-in EME screens. The EME’s information is stored in a commercial relational database, and Ab Initio provides documentation on the schema. Users are free to use business intelligence reporting tools of their choice to develop custom views of their enterprise data quality.

Reporting: Data Profiler

The Ab Initio Data Profiler results can also be used as part of a DQ workflow. As with all other DQ measurements, these results are stored in the EME and can be viewed through the EME web portal.

Many organizations consider data profiling to be an activity reserved for data discovery at the beginning of a project. But periodic automated data profiling can add significant value to a complete data quality program. While data quality metrics can capture the overall health and characteristics of the data, data profiler statistics allow for drilling down into a more detailed analysis of the contents of various datasets.

Below is a screen shot of the top-level report of a Data Profiler run on a particular dataset. Diversity (distinct values), validity, and completeness are just some of the measures the Data Profiler reports. This information can be used to select which fields require further inspection.
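A minimal sketch of the kind of per-field statistics such a profiling run produces (distinct-value diversity, completeness, and validity against a supplied test) is shown below; the function and its inputs are illustrative assumptions, not the Data Profiler’s implementation:

```python
# Illustrative per-field profile; not the Data Profiler's implementation.
def profile_field(values, is_valid=lambda v: True):
    """Distinct-value diversity, completeness, and validity for one field."""
    total = len(values)
    non_null = [v for v in values if v not in (None, "")]
    valid = sum(1 for v in non_null if is_valid(v))
    return {
        "distinct_values": len(set(non_null)),
        "completeness": len(non_null) / total if total else 1.0,
        "validity": valid / len(non_null) if non_null else 1.0,
    }
```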

Below is a screen shot of a particular field chosen by the user for additional analysis.

From this screen, it is possible to drill down to a display of the actual records that contain specific values in the selected field.

Conclusion

While data quality is a problem that every company faces, there is no single approach to detecting, reporting, and studying data quality problems that fits every organization’s needs.

Ab Initio’s end-to-end data quality design patterns can be used with little or no customization. For users with specific data quality needs, such as additional types of detection, reporting, or issue management, Ab Initio provides a general-purpose, flexible approach based on powerful pre-existing building blocks.

Ab Initio’s approach to data quality is based on the Co>Operating System. The Co>Operating System provides a high-performance multi-platform computing environment that performs data quality detection, amelioration, data profiling, and statistics aggregation for any type of data. The Co>Operating System provides unlimited scalability, and so can perform all of these tasks on very large data volumes.

Ab Initio’s Business Rules Environment allows validation rules to be developed and tested by analysts and/or subject matter experts using an easy-to-use graphical interface. The result is significantly improved productivity and agility around creating and maintaining data quality rules.

And Ab Initio’s Enterprise Meta>Environment provides an unprecedented level of integration of data quality statistics with other metadata, including data lineage, data dictionaries, domain code sets, operational statistics, data stewardship, and other technical, operational, and business metadata.

The unmatched combination of these capabilities within a single integrated technology puts Ab Initio’s data quality capabilities in a class of their own.
