Home
Partners
Services
White Papers
Our Promise
Contact
About us
   
 


The Magic 4% (Innovative Systems, Inc.)

The Impact of Data Integrity on Your Data Warehouse You don't need to perform lexical analysis to recognise that the first word data warehouse is data. The bottom line is: without data integrity, every effort your organisation puts into a data warehouse is wasted. Even the most elegant, relational, client/server solutions depend on information extracted from various data-clumsy legacy systems. This means that to be successful, your data warehouse requires legacy data that is cleaned, validated and properly aggregated. The quality of your data has a profound impact on retention, sales, marketing, customer service and risk assessment. In large organisations, an integrity improvement of only 4% can yield millions of dollars. This article examines the impact of data integrity on the bottom line, and outlines software options and strategies for achieving data integrity in your data warehouse. The Impact of Data Integrity on Your Corporation Corporations implementing corporate-wide data warehouse initiatives are realising huge gains, wile those ignoring this technology shift are being left behind. High quality data in these systems doesn't guarantee success, but poor quality data always ensures failure. Here are some areas of business affected by data quality: Decision Support - If the data is suspect, your users will lose faith in the data. If important decisions are made based on questionable data, wrong decisions will be made. Customer Retention - Sophisticated customer retention campaigns begin with modelling those customers who have defected to identify patterns that led to their defection. These models are then applied to the current customers to identify likely defectors so that preventative action can be initiated. In order to accomplish this, it is necessary to have detailed profiles of each customer, with as many accurate variables as possible. Inaccurate data leads to inaccurate profiles, resulting in weakened campaigns. Sales and Customer Service - In today's environment, superior customer service creates the sales leaders. When information is properly aggregated and delivered to front-line sales and service professionals, customer service is greatly enhanced. If customer information is available, rule-based software can be employed to automatically recommend products. For example, IF a customer is female AND is over 40 years of age AND is married AND husband's birthday is within next three months THEN promote special on golf clubs. Such programs have shown phenomenal gains in cross-selling ratios. Marketing - Marketing depends heavily on accurate information to execute retention campaigns, lifetime value analysis, trending, targeted promotions, etc. For example, only by having a complete customer profile can promotions be targeted, and targeting dramatically increases response rates and decreases campaign costs. Direct mail costs are directly proportional to the accuracy of customer data. Risk Assessment/Fraud Detection - An accessible customer base significantly reduces the chance of entering into undo risk. For example, a mail order retailer can identify payment patterns from different customers at the same address - identifying potentially fraudulent practises by an individual using different names; an insurance company can identify its complete relationship with a client who may have different kinds of policies totalling more than acceptable level of exposure; before making a loan to one company that appears to be solvent, a bank can identify fiscally related companies that may be in financial jeopardy. The 4 Percent Factor This brings us to the 4 percent factor - how an apparently small increase in data integrity, 4 percent, is worth millions of dollars to larger organisations. Suppose you were considering two data integrity solutions, one of which would yield a 95 percent data accuracy rate, and the other would guarantee 99 percent accuracy. What, in real dollars, would this mean to your organisation? If your company is small, it probably wouldn't make that much of a difference. However, consider a large catalogue company that typically mails 10 million catalogues a quarter at a cost of $2 per mailing. That's $80m per year in direct mailing costs alone and 4 percent of $80m is $3.2m per year! Through a 4 percent increase in data integrity, an Innovative Systems Inc. client, a large bank holding company approximates revenue increases, loss avoidance and cost savings in the following areas: Collections $1.3m
Credit Cards $1.3m
Lending $1.4m
Deposits $1.0m
Insurance $200k
Total $5.2m Many types of sophisticated calculations can be performed to determine the bottom line impact of a 4 percent improvement in data integrity. With 95 percent integrity, it is probable that 2 to 4 percent of all revenue from data-based marketing, sales and service efforts is at risk. Market Strategies Drive Data Requirements To understand the relationship between data and warehousing efforts, it is important to examine the trends that shaped today's information requirements. These are illustrated in Figure 1. Figure 1 - Data trends for evolving marketing dynamics Initially companies focused on specific product activity, such as the number of men's jumpers sold, the total number of deposit accounts opened and the number of car insurance policies sold. This data could easily be extracted from operational systems. When companies recognised that by offering superior service they could sell more product, customer service became central. Accordingly, operational systems were extended with customer information files (CIF) to identify all products held by a single client. With better customer profiles, companies could more appropriately address their customers' needs. Sophisticated target marketing drove the next data revolution, making householding a priority. The term "householding" originally meant bringing together all customer information for a single residence. For example, how much business is done with everyone at 1/23 Harvey Street? Today, the role of householding has been extended beyond physical address. It is especially important in commercial relationships, where companies want to view their total relationship with each entity in a conglomerate. For example, if you know that SAG, Sydney Apples and Sydney Apple Growers are all part of the same organisation, and you can see how much money you are spending with the total organisation, you can negotiate better volume purchase agreements. Customer relationship management (CRM), the current focus of many large organisations, is the next iteration of data warehousing. This is the process of gathering complete customer and product information, and making it available both at the front-lines, to sales and customer service and also to the departmental users in the areas of marketing, decision support, risk management, profitability measurement, etc. The same concept applies to vendor relationship management. Many organisations have easily justified CRM on the potential benefits of leveraging integrated data. Re-engineering Legacy Data To serve today's warehousing efforts, legacy data must be "re-engineered". Data re-engineering is the process of taking data from disparate legacy systems and manipulating it to serve flexible relational databases. This process includes parsing and analysing legacy data, transforming the data into more meaningful integrated information and then remapping the data to relational databases. For example, at the level of "customer", a legacy system of trust accounts may contain as one of its input lines: Bill and Brenda Smith in trust for Marshall Smith. There are two custodians, Bill and Brenda Smith, and one owner, Marshall Smith. Modern software should be able to automatically identify how many customers are contained in the line, that Bill is male and is related to female Brenda Smith and male Marshall Smith; it should also identify and store the nature of these relationships. After mapping this data to a relational database, it should be simple to identify each of the individuals and their relationships. If the data from another legacy system, eg. savings accounts, is then re-engineered, you should be able to link Bill Smith in the trust system to William Smith in the savings account system. In doing so, you begin to build a complete profile of Bill Smith and, accordingly, Brenda Smith and Marshall Smith. That is, querying the database for any of these individuals should enable you to view all of their relationships. Take the Data Integrity Test Any serious data re-engineering effort begins with an analysis of the current level of data quality and a specification of the desired level of quality. Can you answer the following questions about your database? 1. Can you find the customer when you need to? If you search the database for Bill Church, can you find him? To do so, his name needed to be correctly identified as a person and not as an organisation (eg, a church). 2. Does extraneous residual information exist in the database? If a customer service representative typed in, "for the benefit of Gladys Smith, Incompetent" you want to be sure that "Incompetent" doesn't appear on the mailing label. 3. Are there address errors? The address information should be correct and standardised, with no key elements missing. 4. Are all customers set up on the database? The original record had "John and Mary Smith". Does "Mary" exist on the database, or just "John?". 5. Is there relationship fragmentation? If a customer exists on the database as two or more separate entities, either entity may reflect an incomplete profile of your organisation's relationship with that customer. Also, that customer's relationship with other customers of your organisation may be fragmented. 6. Are any key pieces of customer information missing? Some organisations require specific elements, such as Tax File Number (TFN), Date of birth (DOB) and Drivers License Number (DLN). 7. Does all financial information appear correct? Specific checks are performed to identify missing or incorrect information. Items one through five can be measured using diagnostic software tools that are tuned to detect even the slightest error. This identifies the potential error rate, and includes many conditions that are correct but appear suspicious. For example, "John Bar" may in fact be a mistyped business "John's Bar". Items six and seven only require simple tools to check for presence or absence of information, perform simple range checking, strip unnecessary punctuation and stray characters, total occurrences, etc. The software generates a potential error report that is manually reviewed by trained professionals to determine the actual error rate. The analysis only needs to be performed on a sample of the database. Depending on the size of the institution and contents of the database, 2 to 5 percent is statistically significant. You need to have an idea of what is an acceptable level of data integrity - and benchmark your findings against that level. This diagnosis should be performed at least once a year, so that not only are you benchmarking against you theoretical goals, but also against the previous year's analysis. Software Options for Achieving Data Integrity Once a corporation commits to re-engineering its data, it must determine the most cost effective method to ensure complete success. The options are: custom-built solutions, in-house standards solutions, third-party vendor solutions and outsourcing. Figure 2 - Data integrity solutions Custom-built solutions are in-house attempts to automate a manual review procedure. These are developed for specific source files on a one-off basis. Internal resources are used for these projects, and the initial learning curve is time-consuming and steep. These solutions are inflexible; often geared for a single application, they do not achieve standards across the enterprise. In the long run, custom-built solutions are expensive to develop, document and maintain. Scarce IS resources must be devoted to all phases of development and ongoing maintenance. In-house standards solutions attempt to create data standards for the organisation, so that all data may ultimately be uniform. Often this solution requires expensive CASE-type tools to gain sufficient productivity to deliver results in a workable time frame. However, no matter what tools are employed, the successful in-house standards solution requires a lot of data integrity expertise and a lot of ongoing tuning. This approach requires several person-years to develop, and calendar-years to tune. It takes years to build up a verifiably accurate knowledge base of acceptable first and last names, gender identifications, word and phrase elements and patterns to be able to identify every possible way someone could enter "avenue" or "in trust", for example. Another problem with this solution is that the resulting system requires documentation and maintenance. Considerable expertise must be devoted to managing such a system, and organisations become subject to a small group of in-house experts who may someday leave the company, taking their detailed knowledge with them. The third-party vendor solution uses complete, but tuneable, existing software. There are essentially two classes of software: "industrial-strength" data integrity software and "merge/purge" routines. The latter are insufficient for any significant data integrity needs and are used primarily by mail and list houses. The industrial-strength solution typically uses a professional services group to work hand-in-hand with the organisation to investigate needs and implement a data re-engineering program with appropriate software. Over the long run, the third-party solution is the least expensive approach. It offers expertise on demand without having to hire and maintain that expertise. This solution also boasts the least maintenance and documentation costs. This is the only viable solution for large organisations that require the highest level of data integrity. Very high levels of data integrity can also be achieved using outsourcing. This is an ideal solution for medium-to-large organisations that require high data integrity, but don't have, or want to allocate the technical resources to maintain the data. Figure 2 illustrates applicable software solutions based on an organisation's size and its required level of data integrity. While large organisations with data integrity requirements should always use third-party tools or outsource, organisations with no need for high integrity can use inexpensive merge/purge programs, especially if they are only using the information for the limited direct marketing. For example, if you are performing a single mailing at $0.40 a piece to a database of 100,000 customers, the difference between 75 percent accuracy and 95 percent accuracy is only $8,000. Even if your database is small, if the value of the data is high, then high integrity data solutions are warranted. This is especially true, for example, for trading systems, where clients are of very high value, and for commercial files, where business is transacted with a large number of organisations. Retaining just a few of these customers can completely justify the cost of implementing enhanced data integrity. Given infinite time and resources, it is possible to get over 95 percent accuracy with in-house solutions and CASE tools, but the reality is that the job is much bigger than any IS organisation would want to attempt, and the labour cost is not justifiable. Maintaining Data Quality Over Time Successful data maintenance begins with an understanding of your organisation's quality expectations. In approaching maintenance solutions, it is essential to first establish data standards, policies and procedures. Make it as easy as possible for users entering data to adhere to those standards. In an online environment, an editor should be able to recognise all the variations of "Avenue", for example, and translate that to your standard (eg, "Ave"). Further, it helps to have the screen navigation set up to support the user, for example, by highlighting the data elements especially important to capture, and not allowing users to casually exit without entering key information. If data is being entered in a batch mode, editors are still needed to check the new data. Potential errors should be automatically fed to a report, if not a separate database. Someone should be responsible for reviewing and correcting the errors. Perform periodic maintenance processes to find duplicate records that creep in over time. The processes perform the analysis discussed above, purge the database and do reconciliation checks of redundant data sources using diagnostic software. Post-processing can be performed for further verification. For example, addresses can be run against a POSTCODE database to ensure that your postcodes are correct and to automatically fill in any that are missing. Establish a small team to be the custodians of the database, ensuring the diagnostics are run when they should be; that the errors identified are corrected; that people entering data are trained; that data standards are up-to-date and relevant; and that accountability is enforced. The data custodians may put processes into place to have the data entry personnel accountable for acquiring complete and accurate data. Summary Data warehouses fail when the attention is on the warehouse and not the data. The 4 percent factor is driving organisations to a higher standard of data quality, realising significant gains in critical benefit areas such as customer acquisition and retention, customer service and risk avoidance. Whatever level of quality you pursue, it must precede any successful data warehousing effort. You must identify the desired level of quality, and establish a means for auditing your data and achieving and maintaining the quality to which you aspire. A robust warehouse requires a strong foundation, and data integrity is the foundation of any data warehousing effort.