Data Preprocessing

Data preprocessing is a data mining technique that transforms raw data into an understandable format. Real-world data is often incomplete, inconsistent, or lacking in certain behaviors or trends, and is likely to contain many errors. Preprocessing is an established way of resolving such issues, and it prepares raw data for further processing. [1]

Data preprocessing in Python

import pandas as pd

df = pd.read_csv('../data/titanic-train.csv')
df.head()  # shows the head of the loaded dataset
df.info()  # number of entries for each feature and feature type
df.describe()  # see information about the numerical features
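
Since real-world data is often incomplete, a natural next step after the inspection calls above is to count and fill missing values. The following is only a minimal sketch: the `toy` DataFrame, its column names, and the choice of median filling are assumptions made for illustration and are not taken from the Titanic file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a loaded dataset; column names and values are made up.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "fare": [7.25, 71.28, np.nan, 26.55],
    "survived": [0, 1, 1, 0],
})

# Count the missing entries in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# One simple option among many: fill each gap with its column's median.
toy_filled = toy.fillna(toy.median())
print(toy_filled)
```

Median filling is only one possible strategy; dropping rows or using a model-based imputer may suit other datasets better.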

IMDB dataset

We’ll want to build a reverse dictionary (from word index back to word) so that reviews can be decoded, and pad our data so that all reviews have the same length.

from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences

(X_train, y_train), (X_test, y_test) = imdb.load_data(
    '/tmp/imdb.npz', num_words=None, skip_top=0, maxlen=None,
    start_char=1, oov_char=2, index_from=3)

idx = imdb.get_word_index()  # dictionary mapping each word to its index
max(idx.values())  # number of different words

# reverse dictionary: shift by index_from=3 to make room for the special tokens
rev_idx = {v + 3: k for k, v in idx.items()}
rev_idx[0] = 'padding_char'
rev_idx[1] = 'start_char'
rev_idx[2] = 'oov_char'
rev_idx[3] = 'unk_char'

# transform review from indices to words
example_review = ' '.join([rev_idx[word] for word in X_train[0]])
print(example_review)

maxlen = 100  # target review length (assumed value; pick one that suits your data)

# this type of padding preserves the last maxlen datapoints
X_train_pad = pad_sequences(X_train, maxlen=maxlen)
X_test_pad = pad_sequences(X_test, maxlen=maxlen)
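
To make the padding behaviour concrete without downloading the dataset, here is a small pure-Python sketch of what "preserves the last maxlen datapoints" means. The helper `pad_pre` and the four-word `toy_idx` vocabulary are hypothetical and only imitate the default pre-padding and pre-truncation of `pad_sequences`; they are not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` items (maxlen > 0)."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # truncation drops the beginning
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


# Hypothetical toy vocabulary: word -> index
toy_idx = {"the": 1, "movie": 2, "was": 3, "great": 4}

# Reverse dictionary, shifted by 3 to leave room for the special tokens.
toy_rev_idx = {v + 3: k for k, v in toy_idx.items()}
toy_rev_idx[0] = "padding_char"
toy_rev_idx[1] = "start_char"
toy_rev_idx[2] = "oov_char"
toy_rev_idx[3] = "unk_char"

reviews = [[1, 4, 5, 6, 7], [1, 5, 7]]
padded = pad_pre(reviews, maxlen=4)
print(padded)  # [[4, 5, 6, 7], [0, 1, 5, 7]]

decoded = [" ".join(toy_rev_idx[i] for i in row) for row in padded]
print(decoded[0])  # the movie was great
print(decoded[1])  # padding_char start_char movie great
```

The long review loses its start marker because truncation happens at the front, while the short review gains padding at the front.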

Train set split
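
One common way to hold out data for evaluation is scikit-learn's `train_test_split`. The sketch below is an assumption about what this step might look like: the toy arrays, the 80/20 ratio, and the fixed `random_state` are illustrative choices, not values from the original.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, plus binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the samples; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

Splitting before any scaling step keeps statistics of the test data from leaking into the fitted preprocessing.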

Normalization

Standard normalization

We want each feature to have mean μ = 0 and all features to share the same standard deviation, σ(Xi) = σ(Xj) for any i ≠ j (in practice σ = 1, i.e. unit variance).

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()  # for training data and unseen data
train = ss.fit_transform(train)  # learn a set of scaling/shifting operations to fit the data in a standard distribution with mean 0 and variance 1
test = ss.transform(test)  # apply the same operations to previously unseen test data
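
The arithmetic behind this kind of scaling can be sketched by hand with NumPy: subtract the per-feature mean and divide by the per-feature standard deviation, both computed on the training data only. The arrays below are made-up toy values, and the sketch assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses by default.

```python
import numpy as np

# Toy data: rows are samples, columns are features (values are made up).
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

# "Fit": per-feature statistics come from the training data only.
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population standard deviation (ddof=0)

# "Transform": the same shift and scale are applied to both sets.
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma

print(train_scaled.mean(axis=0))  # approximately [0. 0.]
print(train_scaled.std(axis=0))   # approximately [1. 1.]
print(test_scaled)
```

The scaled test values need not have mean 0 or unit variance, because they reuse the statistics learned from the training set.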

MinMax normalization

We’ll scale each feature of our data to the range from 0.0 to 1.0.

from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()  # for training data and unseen data
train = mms.fit_transform(train)  # learn a set of scaling/shifting operations to fit the data in the [0,1] range
test = mms.transform(test)  # apply the same operations to previously unseen test data
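
As a rough illustration of the underlying formula, (x - min) / (max - min) with the minimum and maximum taken per feature from the training data, the same steps can be written by hand with NumPy. The arrays are made-up toy values chosen so that the test sample falls outside the training range.

```python
import numpy as np

# Toy data: rows are samples, columns are features (values are made up).
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 5.0]])

# "Fit": per-feature minimum and maximum come from the training data only.
lo = train.min(axis=0)
hi = train.max(axis=0)

# "Transform": rescale both sets with the training range.
train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)

print(train_scaled)  # every column spans exactly 0.0 to 1.0
print(test_scaled)   # [[ 1.5  -0.25]]
```

Unseen values outside the training range land outside [0, 1], so the guarantee holds only for the data the scaler was fitted on.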