The Impact of
Data Integrity on Your Data Warehouse
You don't need to perform
lexical analysis to recognise that the first word data warehouse is data.
The bottom line is: without data integrity, every effort your organisation
puts into a data warehouse is wasted. Even the most elegant, relational, client/server
solutions depend on information extracted from various data-clumsy legacy
systems. This means that to be successful, your data warehouse requires legacy
data that is cleaned, validated and properly aggregated. The quality of your
data has a profound impact on retention, sales, marketing, customer service
and risk assessment. In large organisations, an integrity improvement of only
4% can yield millions of dollars. This article examines the impact of data
integrity on the bottom line, and outlines software options and strategies
for achieving data integrity in your data warehouse.
The Impact of
Data Integrity on Your Corporation
Corporations implementing
corporate-wide data warehouse initiatives are realising huge gains, wile those
ignoring this technology shift are being left behind. High quality data in
these systems doesn't guarantee success, but poor quality data always ensures
failure. Here are some areas of business affected by data quality:
Decision Support
- If the data is suspect, your users will lose faith in the data. If important
decisions are made based on questionable data, wrong decisions will be made.
Customer Retention
- Sophisticated customer retention campaigns begin with modelling those customers
who have defected to identify patterns that led to their defection. These
models are then applied to the current customers to identify likely defectors
so that preventative action can be initiated. In order to accomplish this,
it is necessary to have detailed profiles of each customer, with as many accurate
variables as possible. Inaccurate data leads to inaccurate profiles, resulting
in weakened campaigns.
Sales and Customer
Service - In today's environment, superior customer service creates
the sales leaders. When information is properly aggregated and delivered to
front-line sales and service professionals, customer service is greatly enhanced.
If customer information is available, rule-based software can be employed
to automatically recommend products. For example, IF a customer is female
AND is over 40 years of age AND is married AND husband's birthday is within
next three months THEN promote special on golf clubs. Such programs have shown
phenomenal gains in cross-selling ratios.
Marketing
- Marketing depends heavily on accurate information to execute retention campaigns,
lifetime value analysis, trending, targeted promotions, etc. For example,
only by having a complete customer profile can promotions be targeted, and
targeting dramatically increases response rates and decreases campaign costs.
Direct mail costs are directly proportional to the accuracy of customer data.
Risk Assessment/Fraud
Detection - An accessible customer base significantly reduces the
chance of entering into undo risk. For example, a mail order retailer can
identify payment patterns from different customers at the same address - identifying
potentially fraudulent practises by an individual using different names; an
insurance company can identify its complete relationship with a client who
may have different kinds of policies totalling more than acceptable level
of exposure; before making a loan to one company that appears to be solvent,
a bank can identify fiscally related companies that may be in financial jeopardy.
The 4 Percent
Factor
This brings us to the
4 percent factor - how an apparently small increase in data integrity, 4 percent,
is worth millions of dollars to larger organisations.
Suppose you were considering
two data integrity solutions, one of which would yield a 95 percent data accuracy
rate, and the other would guarantee 99 percent accuracy. What, in real dollars,
would this mean to your organisation? If your company is small, it probably
wouldn't make that much of a difference. However, consider a large catalogue
company that typically mails 10 million catalogues a quarter at a cost of
$2 per mailing. That's $80m per year in direct mailing costs alone and 4 percent
of $80m is $3.2m per year!
Through a 4 percent increase
in data integrity, an Innovative Systems Inc. client, a large bank holding
company approximates revenue increases, loss avoidance and cost savings in
the following areas:
Collections $1.3m
Credit Cards $1.3m
Lending $1.4m
Deposits $1.0m
Insurance $200k Total $5.2m
Many types of sophisticated
calculations can be performed to determine the bottom line impact of a 4 percent
improvement in data integrity. With 95 percent integrity, it is probable that
2 to 4 percent of all revenue from data-based marketing, sales and service
efforts is at risk.
Market Strategies
Drive Data Requirements
To understand the relationship
between data and warehousing efforts, it is important to examine the trends
that shaped today's information requirements. These are illustrated in Figure
1.
Figure
1 - Data trends for evolving marketing dynamics
Initially companies focused
on specific product activity, such as the number of men's jumpers sold, the
total number of deposit accounts opened and the number of car insurance policies
sold. This data could easily be extracted from operational systems.
When companies recognised
that by offering superior service they could sell more product, customer service
became central. Accordingly, operational systems were extended with customer
information files (CIF) to identify all products held by a single client.
With better customer profiles, companies could more appropriately address
their customers' needs.
Sophisticated target marketing
drove the next data revolution, making householding a priority. The term "householding"
originally meant bringing together all customer information for a single residence.
For example, how much business is done with everyone at 1/23 Harvey Street?
Today, the role of householding has been extended beyond physical address.
It is especially important in commercial relationships, where companies want
to view their total relationship with each entity in a conglomerate. For example,
if you know that SAG, Sydney Apples and Sydney Apple Growers are all part
of the same organisation, and you can see how much money you are spending
with the total organisation, you can negotiate better volume purchase agreements.
Customer relationship
management (CRM), the current focus of many large organisations, is the next
iteration of data warehousing. This is the process of gathering complete customer
and product information, and making it available both at the front-lines,
to sales and customer service and also to the departmental users in the areas
of marketing, decision support, risk management, profitability measurement,
etc. The same concept applies to vendor relationship management. Many organisations
have easily justified CRM on the potential benefits of leveraging integrated
data.
Re-engineering
Legacy Data
To serve today's warehousing
efforts, legacy data must be "re-engineered". Data re-engineering
is the process of taking data from disparate legacy systems and manipulating
it to serve flexible relational databases. This process includes parsing and
analysing legacy data, transforming the data into more meaningful integrated
information and then remapping the data to relational databases.
For example, at the level
of "customer", a legacy system of trust accounts may contain as
one of its input lines: Bill and Brenda Smith in trust for Marshall Smith.
There are two custodians, Bill and Brenda Smith, and one owner, Marshall Smith.
Modern software should be able to automatically identify how many customers
are contained in the line, that Bill is male and is related to female Brenda
Smith and male Marshall Smith; it should also identify and store the nature
of these relationships. After mapping this data to a relational database,
it should be simple to identify each of the individuals and their relationships.
If the data from another
legacy system, eg. savings accounts, is then re-engineered, you should be
able to link Bill Smith in the trust system to William Smith in the savings
account system. In doing so, you begin to build a complete profile of Bill
Smith and, accordingly, Brenda Smith and Marshall Smith. That is, querying
the database for any of these individuals should enable you to view all of
their relationships.
Take the Data
Integrity Test
Any serious data re-engineering
effort begins with an analysis of the current level of data quality and a
specification of the desired level of quality. Can you answer the following
questions about your database?
1. Can you find
the customer when you need to? If you search the database for Bill
Church, can you find him? To do so, his name needed to be correctly identified
as a person and not as an organisation (eg, a church).
2. Does extraneous
residual information exist in the database? If a customer service
representative typed in, "for the benefit of Gladys Smith, Incompetent"
you want to be sure that "Incompetent" doesn't appear on the mailing
label.
3. Are there address
errors? The address information should be correct and standardised,
with no key elements missing.
4. Are all customers
set up on the database? The original record had "John and Mary
Smith". Does "Mary" exist on the database, or just "John?".
5. Is there relationship
fragmentation? If a customer exists on the database as two or more
separate entities, either entity may reflect an incomplete profile of your
organisation's relationship with that customer. Also, that customer's relationship
with other customers of your organisation may be fragmented.
6. Are any key
pieces of customer information missing? Some organisations require
specific elements, such as Tax File Number (TFN), Date of birth (DOB) and
Drivers License Number (DLN).
7. Does all financial
information appear correct? Specific checks are performed to identify
missing or incorrect information.
Items one through five
can be measured using diagnostic software tools that are tuned to detect even
the slightest error. This identifies the potential error rate, and includes
many conditions that are correct but appear suspicious. For example, "John
Bar" may in fact be a mistyped business "John's Bar". Items
six and seven only require simple tools to check for presence or absence of
information, perform simple range checking, strip unnecessary punctuation
and stray characters, total occurrences, etc. The software generates a potential
error report that is manually reviewed by trained professionals to determine
the actual error rate.
The analysis only needs
to be performed on a sample of the database. Depending on the size of the
institution and contents of the database, 2 to 5 percent is statistically
significant.
You need to have an idea
of what is an acceptable level of data integrity - and benchmark your findings
against that level. This diagnosis should be performed at least once a year,
so that not only are you benchmarking against you theoretical goals, but also
against the previous year's analysis.
Software Options
for Achieving Data Integrity
Once a corporation commits
to re-engineering its data, it must determine the most cost effective method
to ensure complete success. The options are: custom-built solutions, in-house
standards solutions, third-party vendor solutions and outsourcing.
Figure
2 - Data integrity solutions
Custom-built solutions
are in-house attempts to automate a manual review procedure. These are developed
for specific source files on a one-off basis. Internal resources are used
for these projects, and the initial learning curve is time-consuming and steep.
These solutions are inflexible; often geared for a single application, they
do not achieve standards across the enterprise. In the long run, custom-built
solutions are expensive to develop, document and maintain. Scarce IS resources
must be devoted to all phases of development and ongoing maintenance.
In-house standards
solutions attempt to create data standards for the organisation,
so that all data may ultimately be uniform. Often this solution requires expensive
CASE-type tools to gain sufficient productivity to deliver results in a workable
time frame. However, no matter what tools are employed, the successful in-house
standards solution requires a lot of data integrity expertise and a lot of
ongoing tuning. This approach requires several person-years to develop, and
calendar-years to tune. It takes years to build up a verifiably accurate knowledge
base of acceptable first and last names, gender identifications, word and
phrase elements and patterns to be able to identify every possible way someone
could enter "avenue" or "in trust", for example.
Another problem with this
solution is that the resulting system requires documentation and maintenance.
Considerable expertise must be devoted to managing such a system, and organisations
become subject to a small group of in-house experts who may someday leave
the company, taking their detailed knowledge with them.
The third-party
vendor solution uses complete, but tuneable, existing software. There
are essentially two classes of software: "industrial-strength" data
integrity software and "merge/purge" routines. The latter are insufficient
for any significant data integrity needs and are used primarily by mail and
list houses. The industrial-strength solution typically uses a professional
services group to work hand-in-hand with the organisation to investigate needs
and implement a data re-engineering program with appropriate software.
Over the long run, the
third-party solution is the least expensive approach. It offers expertise
on demand without having to hire and maintain that expertise. This solution
also boasts the least maintenance and documentation costs. This is the only
viable solution for large organisations that require the highest level of
data integrity.
Very high levels of data
integrity can also be achieved using outsourcing. This is an ideal solution
for medium-to-large organisations that require high data integrity, but don't
have, or want to allocate the technical resources to maintain the data.
Figure 2 illustrates applicable
software solutions based on an organisation's size and its required level
of data integrity. While large organisations with data integrity requirements
should always use third-party tools or outsource, organisations with no need
for high integrity can use inexpensive merge/purge programs, especially if
they are only using the information for the limited direct marketing. For
example, if you are performing a single mailing at $0.40 a piece to a database
of 100,000 customers, the difference between 75 percent accuracy and 95 percent
accuracy is only $8,000.
Even if your database
is small, if the value of the data is high, then high integrity data solutions
are warranted. This is especially true, for example, for trading systems,
where clients are of very high value, and for commercial files, where business
is transacted with a large number of organisations. Retaining just a few of
these customers can completely justify the cost of implementing enhanced data
integrity.
Given infinite time and
resources, it is possible to get over 95 percent accuracy with in-house solutions
and CASE tools, but the reality is that the job is much bigger than any IS
organisation would want to attempt, and the labour cost is not justifiable.
Maintaining Data
Quality Over Time
Successful data maintenance
begins with an understanding of your organisation's quality expectations.
In approaching maintenance solutions, it is essential to first establish data
standards, policies and procedures.
Make it as easy as possible
for users entering data to adhere to those standards. In an online environment,
an editor should be able to recognise all the variations of "Avenue",
for example, and translate that to your standard (eg, "Ave"). Further,
it helps to have the screen navigation set up to support the user, for example,
by highlighting the data elements especially important to capture, and not
allowing users to casually exit without entering key information.
If data is being entered
in a batch mode, editors are still needed to check the new data. Potential
errors should be automatically fed to a report, if not a separate database.
Someone should be responsible for reviewing and correcting the errors.
Perform periodic maintenance
processes to find duplicate records that creep in over time. The processes
perform the analysis discussed above, purge the database and do reconciliation
checks of redundant data sources using diagnostic software.
Post-processing can be
performed for further verification. For example, addresses can be run against
a POSTCODE database to ensure that your postcodes are correct and to automatically
fill in any that are missing.
Establish a small team
to be the custodians of the database, ensuring the diagnostics are run when
they should be; that the errors identified are corrected; that people entering
data are trained; that data standards are up-to-date and relevant; and that
accountability is enforced. The data custodians may put processes into place
to have the data entry personnel accountable for acquiring complete and accurate
data.
Summary
Data warehouses fail when
the attention is on the warehouse and not the data. The 4 percent factor is
driving organisations to a higher standard of data quality, realising significant
gains in critical benefit areas such as customer acquisition and retention,
customer service and risk avoidance. Whatever level of quality you pursue,
it must precede any successful data warehousing effort. You must identify
the desired level of quality, and establish a means for auditing your data
and achieving and maintaining the quality to which you aspire. A robust warehouse
requires a strong foundation, and data integrity is the foundation of any
data warehousing effort.