Data cleaning

Fill in missing values:

Use the attribute mean (or, for a nominal attribute, the most frequent value)
to fill in the missing value.
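
A minimal sketch of this strategy, assuming missing values are marked with
None. The function names fill_numeric and fill_nominal are hypothetical,
chosen only for this illustration.

```python
from collections import Counter
from statistics import mean


def fill_numeric(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    attribute_mean = mean(known)
    return [attribute_mean if v is None else v for v in values]


def fill_nominal(values):
    """Replace None entries with the most frequent known value."""
    known = [v for v in values if v is not None]
    majority = Counter(known).most_common(1)[0][0]
    return [majority if v is None else v for v in values]


ages = fill_numeric([1.0, None, 3.0])
colors = fill_nominal(["red", None, "blue", "red"])
```

Both functions assume that at least one value of the attribute is known.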

Use the attribute mean (or majority nominal value) computed only over the
samples that belong to the same class as the sample with the missing value.
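
The class-conditional variant could look like the sketch below. It assumes
a numeric attribute, None as the missing-value marker, and a known class
label for every sample; fill_by_class is a hypothetical name.

```python
from collections import defaultdict
from statistics import mean


def fill_by_class(values, labels):
    """Replace each None with the mean of the known values in the same class."""
    known_by_class = defaultdict(list)
    for value, label in zip(values, labels):
        if value is not None:
            known_by_class[label].append(value)
    # Assumes every class has at least one known value of the attribute.
    class_mean = {label: mean(vs) for label, vs in known_by_class.items()}
    return [class_mean[label] if value is None else value
            for value, label in zip(values, labels)]


values = [1.0, 3.0, None, 10.0, None, 20.0]
labels = ["a", "a", "a", "b", "b", "b"]
filled = fill_by_class(values, labels)
```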

Predict the missing value by using a learning algorithm: treat the attribute
with the missing value as the dependent (class) variable, train a learning
algorithm (usually Bayes or a decision tree) on the samples where that
attribute is known, and use it to predict the missing value.
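
As an illustration, the sketch below uses a small naive Bayes classifier
with add-one smoothing over nominal attributes. This is one possible choice
of learner, not a prescribed one. It assumes each sample is a dict, that
only the target attribute can be missing (None), and all function names are
hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, target):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(row[target] for row in rows)
    cond_counts = defaultdict(Counter)   # (attribute, class) -> value counts
    values_seen = defaultdict(set)       # attribute -> distinct values
    for row in rows:
        for attr, value in row.items():
            if attr != target:
                cond_counts[(attr, row[target])][value] += 1
                values_seen[attr].add(value)
    return class_counts, cond_counts, values_seen


def predict(model, row, target):
    """Return the class with the highest smoothed log-probability."""
    class_counts, cond_counts, values_seen = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for cls, n_cls in class_counts.items():
        score = math.log(n_cls / total)
        for attr, value in row.items():
            if attr == target or attr not in values_seen:
                continue
            k = len(values_seen[attr])
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((cond_counts[(attr, cls)][value] + 1) / (n_cls + k))
        if score > best_score:
            best_class, best_score = cls, score
    return best_class


def impute(rows, target):
    """Train on rows where target is known, then fill in the missing ones."""
    known = [row for row in rows if row[target] is not None]
    model = train_naive_bayes(known, target)
    return [row if row[target] is not None
            else {**row, target: predict(model, row, target)}
            for row in rows]


rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": None},
]
completed = impute(rows, "play")
```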

Identify outliers and smooth out noisy data:

Binning

Sort the attribute values and partition them into bins (see "Unsupervised
discretization" below).

Then smooth each bin by replacing its values with the bin mean, the bin
median, or the nearest bin boundary.
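
The two binning steps can be sketched as follows, assuming equal-depth
(equal-frequency) bins; other partitioning schemes would work the same way.
The function names and the sample values are chosen only for illustration.

```python
from statistics import mean, median


def equal_depth_bins(values, depth):
    """Sort the values and split them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_means(bins):
    """Replace every value with the mean of its bin."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    """Replace every value with the median of its bin."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]   # bins are already sorted
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


bins = equal_depth_bins([4, 8, 15, 21, 21, 24, 25, 28, 34], depth=3)
by_means = smooth_by_means(bins)            # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
by_boundaries = smooth_by_boundaries(bins)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

In smooth_by_boundaries, a value equally distant from both boundaries is
mapped to the lower one; that tie-breaking rule is an arbitrary choice.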

Clustering: group values into clusters, then detect and remove outliers
(values that fall outside the clusters), either automatically or manually.
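
One simple way to automate this for a single numeric attribute is sketched
below. It assumes a gap-based clustering of the sorted values (a new cluster
starts wherever two neighbors differ by more than max_gap) and treats
clusters smaller than min_size as outliers. Both thresholds and the function
names are assumptions of this example; any clustering method could be
substituted.

```python
def gap_clusters(values, max_gap):
    """Group sorted values; start a new cluster when the gap exceeds max_gap."""
    ordered = sorted(values)          # assumes a non-empty list
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def remove_outliers(values, max_gap, min_size=2):
    """Drop values that belong to clusters with fewer than min_size members."""
    clusters = gap_clusters(values, max_gap)
    outliers = {v for c in clusters if len(c) < min_size for v in c}
    kept = [v for v in values if v not in outliers]
    return kept, sorted(outliers)


kept, outliers = remove_outliers([10, 11, 12, 50, 30, 31, 29], max_gap=3)
```

Returning the outliers separately leaves room for the manual variant, where
a person reviews the flagged values before they are discarded.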

Regression: smooth by fitting the data to regression functions.
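
A minimal sketch of regression smoothing, assuming a straight line fitted by
ordinary least squares to pairs (x, y); other regression functions could be
used instead. The names fit_line and smooth are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)   # must be non-zero
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def smooth(xs, ys):
    """Replace each observed y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


xs = [0, 1, 2, 3]
noisy = [1.0, 3.2, 4.8, 7.0]
smoothed = smooth(xs, noisy)   # approximately [1.06, 3.02, 4.98, 6.94]
```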

Correct inconsistent data: use domain knowledge or expert decision.

Data transformation

Normalization:

Scaling attribute values to fall within a specified range.

Example: to transform V in [Min, Max] to V' in [0, 1],
apply V' = (V - Min)/(Max - Min), where Min and Max are the smallest and
largest values of the attribute.
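
The formula can be sketched in code as shown below. The optional new_min and
new_max parameters are an addition of this example: they generalize the
target range from [0, 1] to any specified range by rescaling the result.
The function name min_max is hypothetical.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so that Min maps to new_min and Max to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; Max - Min is zero")
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]


unit = min_max([10, 20, 30])                 # [0.0, 0.5, 1.0]
signed = min_max([10, 20, 30], -1.0, 1.0)    # [-1.0, 0.0, 1.0]
```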

Scaling by using mean and standard deviation (useful when min and max are
unknown or when there are outliers): V'=(V-Mean)/StDev
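
A short sketch of this scaling, assuming StDev means the population standard
deviation (statistics.pstdev); the sample standard deviation
(statistics.stdev) would be used the same way. The name z_score is
hypothetical.

```python
from statistics import mean, pstdev


def z_score(values):
    """Return (V - Mean) / StDev for every value."""
    mu = mean(values)
    sigma = pstdev(values)   # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("standard deviation is zero")
    return [(v - mu) / sigma for v in values]


scores = z_score([2, 4, 4, 4, 5, 5, 7, 9])   # mean is 5, StDev is 2
```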

Aggregation: moving up in the concept hierarchy on numeric attributes.

Generalization: moving up in the concept hierarchy on nominal attributes.
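
Both operations can be illustrated with the sketch below. The hierarchies
are hypothetical examples: a city-to-country mapping for a nominal
attribute, and day-to-month grouping of a numeric sales amount, summed per
month. None of these names or values come from the notes.

```python
from collections import defaultdict

# Hypothetical concept hierarchy for a nominal attribute: city -> country.
CITY_TO_COUNTRY = {"Paris": "France", "Lyon": "France", "Berlin": "Germany"}


def generalize(values, hierarchy):
    """Replace each nominal value with its parent concept, if one is known."""
    return [hierarchy.get(v, v) for v in values]


def month_of(date_string):
    """Map an ISO date such as '2024-01-15' to its month, '2024-01'."""
    return date_string[:7]


def aggregate(records, parent_of):
    """Sum numeric amounts over all records that share the same parent key."""
    totals = defaultdict(float)
    for key, amount in records:
        totals[parent_of(key)] += amount
    return dict(totals)


countries = generalize(["Paris", "Berlin", "Lyon"], CITY_TO_COUNTRY)
daily_sales = [("2024-01-15", 100.0), ("2024-01-20", 50.0), ("2024-02-01", 25.0)]
monthly_sales = aggregate(daily_sales, month_of)
```

Summation is only one possible aggregate; averages or counts could be
computed at the higher level in the same way.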