Data Cleaning

Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning routines attempt to fill in missing values, smooth out noise, and correct inconsistencies in the data. Basic methods for data cleaning are as follows:

·       Missing Values: Many tuples may have no recorded value for several attributes, such as customer income. This scenario can be handled in the following ways.
o   Ignore the tuple: This is usually done when the class label is missing. This method is not very effective, unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.
o   Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible given a large data set with many missing values.
o   Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like “Unknown” or −∞. If missing values are replaced by, say, “Unknown,” then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common—that of “Unknown.” Hence, although this method is simple, it is not foolproof.
o   Use the attribute mean to fill in the missing value: For example, suppose that the average income of customers is $56,000. Use this value to replace the missing value for income.
o   Use the attribute mean for all samples belonging to the same class as the given tuple: For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
o   Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income. Of these strategies, this is the most popular and generally the most accurate.
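The simpler fill-in strategies above can be sketched in a few lines of pandas; the DataFrame, its column names (income, risk), and the values are made-up assumptions for illustration.

```python
import pandas as pd

# Hypothetical customer data: "income" has missing values, "risk" is the class label.
df = pd.DataFrame({
    "income": [45000, None, 56000, None, 72000, 38000],
    "risk":   ["low", "low", "high", "high", "high", "low"],
})

# Fill with a global constant (a sentinel value the mining step must be told about).
by_constant = df["income"].fillna(-1)

# Fill with the overall attribute mean.
by_mean = df["income"].fillna(df["income"].mean())

# Fill with the mean income of customers in the same credit-risk class.
by_class_mean = df.groupby("risk")["income"].transform(lambda s: s.fillna(s.mean()))

print(by_class_mean)
```

The class-conditional mean usually preserves more of the data's structure than a single constant or the overall mean, at the cost of requiring a reliable class label.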
·         Noisy data: “What is noise?” Noise is a random error or variance in a measured variable. Given a numerical attribute such as, say, price, how can we “smooth” out the data to remove the noise? The following are popular data-smoothing techniques:
o        Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it. The sorted values are distributed into a number of “buckets,” or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. In general, the larger the bin width, the greater the effect of the smoothing. Binning is also used as a discretization technique. For example:
Sorted data for price: 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
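A small, self-contained Python sketch that reproduces the example above, assuming equal-depth bins of three values; ties between boundaries, if any, are resolved here toward the lower boundary.

```python
# Equal-frequency (equal-depth) binning with smoothing, reproducing the price example above.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted
depth = 3                                    # three values per bin

bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: replace every value in a bin by the bin's mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value by the closer of the bin's min or max.
by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]

print(by_means)       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_boundaries)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```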


o        Regression: Data can be smoothed by fitting the data to a function. Linear regression involves finding the “best” line to fit two attributes (or variables), so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface. (A short sketch of regression-based smoothing follows this list.)
o     Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or “clusters.” Intuitively, values that fall outside the set of clusters may be considered outliers. (A sketch of this also follows below.)
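As a rough illustration of regression-based smoothing, the sketch below fits a least-squares line with NumPy and replaces each observed value by its fitted value; the attribute names (area, price) and the numbers are invented for the example.

```python
import numpy as np

# Hypothetical paired attributes: use "area" to predict (and smooth) a noisy "price".
area = np.array([50, 60, 70, 80, 90, 100], dtype=float)
price = np.array([110, 128, 135, 160, 171, 198], dtype=float)

# Fit the "best" straight line: price ~ slope * area + intercept (least squares).
slope, intercept = np.polyfit(area, price, deg=1)

# Replace each observed price by its value on the fitted line, smoothing out the noise.
smoothed_price = slope * area + intercept
print(smoothed_price)
```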
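For clustering-based outlier detection, one concrete choice (not the only one) is a density-based clusterer such as scikit-learn's DBSCAN, which labels points that fit no cluster as noise; the data and the eps/min_samples settings below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 1-D price values with one obvious outlier (999).
values = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34, 999], dtype=float).reshape(-1, 1)

# DBSCAN groups nearby values into clusters; points that fit no cluster get the label -1.
labels = DBSCAN(eps=10, min_samples=2).fit_predict(values)

# Values labelled -1 fall outside every cluster and can be treated as outliers.
outliers = values.ravel()[labels == -1]
print(outliers)  # [999.]
```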
