Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning
routines attempt to fill in missing values, smooth out noise, and correct
inconsistencies in the data. The basic methods for data cleaning are as follows:
· Missing values: Many tuples may have no recorded value for several
attributes, such as customer income. This scenario can be handled in the
following ways:
o Ignore the tuple: This is usually done when the class label is missing. This
method is not very effective unless the tuple contains several attributes with
missing values. It is especially poor when the percentage of missing values per
attribute varies considerably.
o Fill in the missing value manually: In general, this approach is
time-consuming and may not be feasible for a large data set with many missing
values.
o Use a global constant to fill in the missing value: Replace all missing
attribute values with the same constant, such as a label like “Unknown” or ∞.
If missing values are replaced by, say, “Unknown,” then the mining program
may mistakenly think that they form an interesting concept, since they all have
a value in common: that of “Unknown.” Hence, although this method is
simple, it is not foolproof.
o Use the attribute mean to fill in the missing value: For example, suppose
that the average income of customers is $56,000. Use this value to replace the
missing value for income.
o Use the attribute mean for all samples belonging to the same class as the
given tuple: For example, if classifying customers according to credit risk,
replace the missing value with the average income value for customers in the
same credit risk category as that of the given tuple.
o Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian formalism,
or decision tree induction. For example, using the other customer attributes in
your data set, you may construct a decision tree to predict the missing values
for income. This is a popular strategy, and it is often the most accurate one
because it uses the other attributes of the data to predict the missing value.
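
The simpler fill-in strategies above (global constant, attribute mean, and class-wise attribute mean) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the customer records, the attribute names income and risk, and the helper function names are all hypothetical, and missing values are assumed to be stored as None.

```python
from statistics import mean

# Hypothetical customer records; None marks a missing income value.
customers = [
    {"id": 1, "risk": "low", "income": 60000},
    {"id": 2, "risk": "low", "income": None},
    {"id": 3, "risk": "high", "income": 30000},
    {"id": 4, "risk": "high", "income": 40000},
    {"id": 5, "risk": "high", "income": None},
    {"id": 6, "risk": "low", "income": 80000},
]


def fill_with_constant(rows, attr, constant="Unknown"):
    """Replace every missing value of attr with the same global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_with_mean(rows, attr):
    """Replace missing values of attr with the mean of the observed values."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: overall if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, label):
    """Replace missing values of attr with the mean of attr in the same class.

    Assumes every class has at least one observed value of attr.
    """
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[label], []).append(r[attr])
    class_means = {c: mean(vals) for c, vals in by_class.items()}
    return [
        {**r, attr: class_means[r[label]] if r[attr] is None else r[attr]}
        for r in rows
    ]


by_constant = fill_with_constant(customers, "income")
by_mean = fill_with_mean(customers, "income")
by_class_mean = fill_with_class_mean(customers, "income", "risk")
```

Each helper returns new records and leaves the input untouched, so the three strategies can be compared side by side on the same data.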
· Noisy data: Noise is a random error or variance in a measured variable.
Given a numerical attribute such as price, how can we “smooth” out the data to
remove the noise? The following are popular data smoothing techniques:
o Binning: Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it. The sorted values are
distributed into a number of “buckets,” or bins. Because binning methods
consult the neighborhood of values, they perform local smoothing. In general,
the larger the bin width, the greater the effect of the smoothing. Binning is
also used as a discretization technique. For example:
Sorted data for price: 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
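
The binning example above can be reproduced with a short Python sketch. The function names are hypothetical, the code assumes there are at least as many values as bins, and it assumes that a value exactly halfway between the two bin boundaries goes to the lower one (the example itself contains no such tie).

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # each bin is already sorted
        # Assumption: a tie goes to the lower boundary.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Increasing the number of values per bin averages over a wider neighborhood, which is the sense in which wider bins smooth more strongly.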
o Regression: Data can be smoothed by fitting the data to a function, such as
with regression. Linear regression involves finding the “best” line to fit two
attributes (or variables), so that one attribute can be used to predict the
other. Multiple linear regression is an extension of linear regression, where
more than two attributes are involved and the data are fit to a
multidimensional surface.
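
As a rough illustration of smoothing by linear regression, the sketch below fits a least-squares line to two attributes and replaces each observed value with the value predicted by the line. The data points and function names are hypothetical, and the code assumes the x values are not all identical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the xs are not all equal
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_with_line(xs, ys):
    """Replace each y by the value the fitted line predicts for its x."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy measurements of one attribute (ys) against another (xs).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
slope, intercept = fit_line(xs, ys)
smoothed = smooth_with_line(xs, ys)
```

The fitted line can also predict the second attribute for a new value of the first, which is how regression can serve to estimate a missing value.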
o Clustering:
Outliers may be detected by clustering, where similar values are organized into
groups, or “clusters.” Intuitively, values that fall outside of the set of
clusters may be considered outliers.
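
To make the clustering idea concrete, here is a deliberately simple sketch for a single numeric attribute. It is a stand-in for a real clustering algorithm, not a specific method from the text: sorted values are grouped whenever neighbors are at most max_gap apart, and any group smaller than min_size is treated as an outlier. Both parameters, the function names, and the extra price of 95 are assumptions made for illustration, and the input is assumed to be non-empty.

```python
def cluster_1d(values, max_gap):
    """Group sorted values; a gap larger than max_gap starts a new cluster."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]  # assumes at least one value
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def find_outliers(values, max_gap, min_size=2):
    """Return the values that belong to clusters smaller than min_size."""
    clusters = cluster_1d(values, max_gap)
    return [v for c in clusters if len(c) < min_size for v in c]


# The price data from the binning example plus one hypothetical extreme value.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34, 95]
clusters = cluster_1d(prices, max_gap=10)
outliers = find_outliers(prices, max_gap=10)
```

The result depends heavily on max_gap and min_size, so in practice these thresholds would need to be chosen with the attribute's scale in mind.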