I. Building a Classification Model with Spark:

1 Download the data from http://www.kaggle.com/c/stumbleupon/data.
2 Start up the Hadoop and Spark environments.
3 Change to the directory in which you downloaded the data (referred to as PATH here) and run the following command to remove the first line and pipe the result to a new file called train_noheader.tsv:

sed 1d train.tsv > train_noheader.tsv
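The effect of the sed command can be sketched in plain Python (Python 3, with made-up stand-in lines rather than the real train.tsv contents):

```python
# Minimal sketch of header removal, mirroring `sed 1d`.
# The lines below are stand-ins, not real StumbleUpon data.
lines = ["col_a\tcol_b", "1\t2", "3\t4"]

def drop_header(lines):
    """Return every line except the first (the header row)."""
    return lines[1:]

no_header = drop_header(lines)
print(no_header)  # -> ['1\t2', '3\t4']
```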

Now, we are ready to start up our Spark shell (remember to run this command from your Spark installation directory):

II. Building a Regression Model with Spark:

Spark’s MLlib library offers two broad classes of regression models: linear models and decision tree regression models.
（1）Linear models are essentially the same as their classification counterparts; the only difference is that linear regression models use a different loss function, related link function, and decision function. MLlib provides a standard least squares regression model (although other types of generalized linear models for regression are planned).
（2）Decision trees can also be used for regression by changing the impurity measure.
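To make the distinction concrete, here is a rough numpy sketch (Python 3, toy numbers only) of the squared-error loss a linear regression model minimizes, alongside the variance impurity a regression tree uses when evaluating the targets at a node:

```python
import numpy as np

# Toy data: three examples with two features each (illustrative values).
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])
y = np.array([3.0, 2.5, 4.0])
w = np.array([1.0, 0.5])  # an arbitrary weight vector

# Least squares loss for linear regression: mean of (w . x_i - y_i)^2
preds = X.dot(w)
squared_loss = np.mean((preds - y) ** 2)

# Variance impurity used by regression trees for the targets at a node
variance_impurity = np.var(y)

print(squared_loss, variance_impurity)
```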

1 Prepare the data.
The dataset is available at http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
Once you have downloaded the Bike-Sharing-Dataset.zip file, unzip it. This will create a directory called Bike-Sharing-Dataset, which contains the day.csv, hour.csv, and Readme.txt files.
The Readme.txt file contains information on the dataset, including the variable names and descriptions. Take a look at the file, and you will see that we have the following variables available:
instant: This is the record ID
dteday: This is the raw date
season: This is the season (spring, summer, fall, or winter)
yr: This is the year (2011 or 2012)
mnth: This is the month of the year
hr: This is the hour of the day
holiday: This is whether the day was a holiday or not
weekday: This is the day of the week
workingday: This is whether the day was a working day or not
weathersit: This is a categorical variable that describes the weather at a particular time
temp: This is the normalized temperature
atemp: This is the normalized apparent temperature
hum: This is the normalized humidity
windspeed: This is the normalized wind speed
cnt: This is the target variable, that is, the count of bike rentals for that hour

We will work with the hourly data contained in hour.csv. If you look at the first line of the dataset, you will see that it contains the column names as a header. You can confirm this by running the following command:

head -1 hour.csv

This should output the following result:
instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
Before we work with the data in Spark, we will again remove the header from the first line of the file using the same sed command that we used previously to create a new file called hour_noheader.csv:

sed 1d hour.csv > hour_noheader.csv

Since we will be doing some plotting of our dataset later on, we will use the Python shell for this chapter. This also serves to illustrate how to use MLlib’s linear model and decision tree functionality from PySpark.

2 Start up the Hadoop and Spark environments.

3 Copy the data to the HDFS filesystem.

4 Install pip: apt-get install python-pip

5 Start up your PySpark shell from your Spark installation directory. If you want to use IPython, which we highly recommend, remember to include the IPYTHON=1 environment variable together with the pylab functionality:
Change to the spark-1.3.1/bin directory and run:

IPYTHON=1 IPYTHON_OPTS="--pylab" ./pyspark

from pyspark.mllib.regression import LabeledPoint
import numpy as np
path = "./hour_noheader.csv"
raw_data = sc.textFile(path)
num_data = raw_data.count()
records = raw_data.map(lambda x: x.split(","))
first = records.first()
print first
print num_data
We will first cache our dataset, since we will be reading from it many times:
records.cache()
In order to extract each categorical feature into a binary vector form, we will need to know the feature mapping of each feature value to the index of the nonzero value in our binary vector. Let's define a function that will extract this mapping from our dataset for a given column:
def get_mapping(rdd, idx):
    return rdd.map(lambda fields: fields[idx]).distinct().zipWithIndex().collectAsMap()
Our function first maps the field to its unique values and then uses the zipWithIndex transformation to zip the value up with a unique index such that a key-value RDD is formed, where the key is the variable and the value is the index. This index will be the index of the nonzero entry in the binary vector representation of the feature. We will finally collect this RDD back to the driver as a Python dictionary.
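Outside of Spark, the same kind of mapping can be sketched in a few lines of plain Python (toy category values; note that the index order Spark's zipWithIndex assigns is not guaranteed to match this):

```python
# Build a value -> index mapping over the distinct values of a column.
column = ["spring", "summer", "spring", "winter", "summer"]

mapping = {}
for value in column:
    if value not in mapping:
        mapping[value] = len(mapping)  # next unused index

print(mapping)  # -> {'spring': 0, 'summer': 1, 'winter': 2}
```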
We can test our function on the third variable column (index 2):
print "Mapping of first categorical feature column: %s" % get_mapping(records, 2)
Now, we can apply this function to each categorical column (that is, for variable indices 2 to 9):
mappings = [get_mapping(records, i) for i in range(2, 10)]
cat_len = sum(map(len, mappings))
num_len = len(records.first()[10:14])
total_len = num_len + cat_len
We now have the mappings for each variable, and we can see how many values in total we need for our binary vector representation:
print "Feature vector length for categorical features: %d" % cat_len
print "Feature vector length for numerical features: %d" % num_len
print "Total feature vector length: %d" % total_len
Creating feature vectors for the linear model
def extract_features(record):
    cat_vec = np.zeros(cat_len)
    i = 0
    step = 0
    for field in record[2:10]:
        m = mappings[i]
        idx = m[field]
        cat_vec[idx + step] = 1
        i = i + 1
        step = step + len(m)
    num_vec = np.array([float(field) for field in record[10:14]])
    return np.concatenate((cat_vec, num_vec))

def extract_label(record):
    return float(record[-1])
In the preceding extract_features function, we ran through each column in the row of data. We extracted the binary encoding for each variable in turn from the mappings we created previously. The step variable ensures that the nonzero feature index in the full feature vector is correct (and is somewhat more efficient than, say, creating many smaller binary vectors and concatenating them). The numeric vector is created directly by first converting the data to floating point numbers and wrapping these in a numpy array. The resulting two vectors are then concatenated. The extract_label function simply converts the last column variable (the count) into a float.
With our utility functions defined, we can proceed with extracting feature vectors and labels from our data records:
data = records.map(lambda r: LabeledPoint(extract_label(r), extract_features(r)))
Let’s inspect the first record in the extracted feature RDD:
first_point = data.first()
print "Raw data: " + str(first[2:])
print "Label: " + str(first_point.label)
print "Linear Model feature vector:\n" + str(first_point.features)
print "Linear Model feature vector length: " + str(len(first_point.features))
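To see the step logic of extract_features in isolation, here is the same encoding applied to a toy record with two hypothetical categorical columns (plain Python 3, not the bike sharing data):

```python
import numpy as np

# Toy mappings for two categorical features (illustrative only).
mappings = [{"a": 0, "b": 1}, {"x": 0, "y": 1, "z": 2}]
cat_len = sum(len(m) for m in mappings)  # 5 slots in total

def encode(record):
    """Binary-encode a record of categorical values using the mappings."""
    cat_vec = np.zeros(cat_len)
    step = 0
    for i, field in enumerate(record):
        m = mappings[i]
        cat_vec[m[field] + step] = 1  # set this value's slot
        step += len(m)                # shift past this feature's block
    return cat_vec

print(encode(["b", "y"]))  # -> [0. 1. 0. 1. 0.]
```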
Creating feature vectors for the decision tree
As we have seen, decision tree models typically work on raw features (that is, it is not required to convert categorical features into a binary vector encoding; they can, instead, be used directly). Therefore, we will create a separate function to extract the decision tree feature vector, which simply converts all the values to floats and wraps them in a numpy array:
def extract_features_dt(record):
    return np.array(map(float, record[2:14]))
data_dt = records.map(lambda r: LabeledPoint(extract_label(r), extract_features_dt(r)))
first_point_dt = data_dt.first()
print "Decision Tree feature vector: " + str(first_point_dt.features)
print "Decision Tree feature vector length: " + str(len(first_point_dt.features))
Training the regression models follows the same import-and-train pattern as for classification; note that the decision tree model has a trainRegressor method (in addition to a trainClassifier method for classification models):
from pyspark.mllib.regression import LinearRegressionWithSGD
from pyspark.mllib.tree import DecisionTree
help(LinearRegressionWithSGD.train)
help(DecisionTree.trainRegressor)
Training a regression model on the bike sharing dataset
We're ready to use the features we have extracted to train our models on the bike sharing data. First, we'll train the linear regression model and take a look at the first few predictions that the model makes on the data:
linear_model = LinearRegressionWithSGD.train(data, iterations=10, step=0.1, intercept=False)
true_vs_predicted = data.map(lambda p: (p.label, linear_model.predict(p.features)))
print "Linear Model predictions: " + str(true_vs_predicted.take(5))
Next, we will train the decision tree model simply using the default arguments to the trainRegressor method (which equates to using a tree depth of 5). Note that we need to pass in the other form of the dataset, data_dt, that we created from the raw feature values (as opposed to the binary encoded features that we used for the preceding linear model).
We also need to pass in an argument for categoricalFeaturesInfo. This is a dictionary that maps the categorical feature index to the number of categories for the feature. If a feature is not in this mapping, it will be treated as continuous. For our purposes, we will leave this as is, passing in an empty mapping:
dt_model = DecisionTree.trainRegressor(data_dt, {})
preds = dt_model.predict(data_dt.map(lambda p: p.features))
actual = data.map(lambda p: p.label)
true_vs_predicted_dt = actual.zip(preds)
print "Decision Tree predictions: " + str(true_vs_predicted_dt.take(5))
print "Decision Tree depth: " + str(dt_model.depth())
print "Decision Tree number of nodes: " + str(dt_model.numNodes())
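As an aside, if we did want to supply a non-empty categoricalFeaturesInfo, it could be derived from the mappings list built earlier. A sketch with toy mappings (in the chapter, the real list covers the eight columns at indices 2 to 9, and the keys would be those features' positions within the decision tree feature vector):

```python
# Sketch: derive a categoricalFeaturesInfo-style dictionary from the
# mappings. Each entry maps a feature's position in the feature vector
# to its number of distinct categories. Toy mappings shown below.
mappings = [{"spring": 0, "summer": 1}, {"2011": 0, "2012": 1}]

categorical_features_info = {i: len(m) for i, m in enumerate(mappings)}
print(categorical_features_info)  # -> {0: 2, 1: 2}
```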
Evaluating the performance of regression models
def squared_error(actual, pred):
    return (pred - actual)**2

def abs_error(actual, pred):
    return np.abs(pred - actual)

def squared_log_error(pred, actual):
    return (np.log(pred + 1) - np.log(actual + 1))**2
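A quick sanity check of these three error functions on a couple of hand-picked (true, predicted) pairs, run as plain Python 3 rather than on an RDD:

```python
import numpy as np

def squared_error(actual, pred):
    return (pred - actual) ** 2

def abs_error(actual, pred):
    return np.abs(pred - actual)

def squared_log_error(pred, actual):
    return (np.log(pred + 1) - np.log(actual + 1)) ** 2

# Hand-picked (true, predicted) pairs, not model output.
pairs = [(3.0, 2.5), (4.0, 5.0)]
mse = np.mean([squared_error(t, p) for t, p in pairs])
mae = np.mean([abs_error(t, p) for t, p in pairs])
rmsle = np.sqrt(np.mean([squared_log_error(t, p) for t, p in pairs]))

print(mse, mae, rmsle)  # MSE = 0.625, MAE = 0.75
```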
Linear model
Our approach will be to apply the relevant error function to each record in the RDD we computed earlier, which is true_vs_predicted for our linear model:
mse = true_vs_predicted.map(lambda (t, p): squared_error(t, p)).mean()
mae = true_vs_predicted.map(lambda (t, p): abs_error(t, p)).mean()
rmsle = np.sqrt(true_vs_predicted.map(lambda (t, p): squared_log_error(t, p)).mean())
print "Linear Model - Mean Squared Error: %2.4f" % mse
print "Linear Model - Mean Absolute Error: %2.4f" % mae
print"Linear Model - Root Mean Squared Log Error: %2.4f" % rmsle
Decision tree
We will use the same approach for the decision tree model, using the true_vs_predicted_dt RDD:
mse_dt = true_vs_predicted_dt.map(lambda (t, p): squared_error(t, p)).mean()
mae_dt = true_vs_predicted_dt.map(lambda (t, p): abs_error(t, p)).mean()
rmsle_dt = np.sqrt(true_vs_predicted_dt.map(lambda (t, p): squared_log_error(t, p)).mean())
print "Decision Tree - Mean Squared Error: %2.4f" % mse_dt
print "Decision Tree - Mean Absolute Error: %2.4f" % mae_dt
print"Decision Tree - Root Mean Squared Log Error: %2.4f" % rmsle_dt