<mark>Regression algorithms belong to supervised learning</mark>: each instance carries an associated real-valued label, and given the numerical features that represent an instance, we want the predicted label to be as close as possible to the actual value.
<br/>
MLlib currently supports the following regression algorithms: linear regression, ridge regression, Lasso, and decision trees.
<br/>
**1. Linear Regression**
(1) Data: `$SPARK_HOME/data/mllib/ridge-data/lpsa.data`
```txt
-0.1625189,-1.57881887548545 -2.1887840293994 1.36116336875686 ...
-0.1625189,-2.16691708463163 -0.807993896938655 -0.787896192088153 ...
0.3715636,-0.507874475300631 -0.458834049396776 -0.250631301876899 ...
0.7654678,-2.03612849966376 -0.933954647105133 -1.86242597251066 ...
1.2669476,-2.28833047634983 -0.0706369432557794 -0.116315079324086 ...
...
```
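Each line of this file is a comma-separated label followed by space-separated feature values. Before wiring this into Spark, the parsing step can be sketched in plain Scala on a single line taken from the data above (the `line` value here is just the first sample, truncated to three features for illustration):

```scala
object ParseSketch {
  def main(args: Array[String]): Unit = {
    // One line of lpsa.data: "label,feature1 feature2 feature3 ..."
    val line = "-0.1625189,-1.57881887548545 -2.1887840293994 1.36116336875686"
    // Split off the label from the feature string at the comma
    val parts: Array[String] = line.split(",")
    val label: Double = parts(0).toDouble
    // The features are space-separated doubles
    val features: Array[Double] = parts(1).split(" ").map(_.toDouble)
    println(label)           // -0.1625189
    println(features.length) // 3
  }
}
```

This is exactly the transformation the `data.map { ... }` step in the full program below applies to every line of the RDD, wrapping the result in a `LabeledPoint`.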
(2) Code
```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionModel, LinearRegressionWithSGD}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object LinearAlgorithm {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]")
      .setAppName(this.getClass.getName)
    val sc: SparkContext = SparkContext.getOrCreate(conf)
    val data: RDD[String] = sc.textFile("F:/mllib/lpsa.data")
    // Each line is "label,feature1 feature2 ..."; parse it into a LabeledPoint
    val parsedData: RDD[LabeledPoint] = data.map { line =>
      val parts: Array[String] = line.split(",")
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(" ").map(_.toDouble)))
    }
    // Number of SGD iterations
    val numIterations = 100
    // Train the model
    val model: LinearRegressionModel = LinearRegressionWithSGD.train(parsedData, numIterations)
    // Optionally save the trained model
    // model.save(sc, "F:/mllib/model/linear")
    // Load a previously saved model
    // val model: LinearRegressionModel = LinearRegressionModel.load(sc, "F:/mllib/model/linear")
    // Predict on the training data, pairing each actual label with its prediction
    val valuesAndPreds: RDD[(Double, Double)] = parsedData.map { point =>
      val prediction: Double = model.predict(point.features)
      (point.label, prediction)
    }
    // Evaluate how well predictions fit the actual values using mean squared error (MSE)
    val trainMSE: Double = valuesAndPreds.map {
      case (v, p) => math.pow(v - p, 2)
    }.reduce(_ + _) / valuesAndPreds.count()
    println(trainMSE) // 6.207597210613578
  }
}
```
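The MSE computed at the end is just the average of the squared differences between each label and its prediction. Detached from Spark, the same formula can be checked on a handful of hypothetical `(label, prediction)` pairs (the values below are made up purely to illustrate the arithmetic):

```scala
object MseSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical (label, prediction) pairs, standing in for valuesAndPreds
    val valuesAndPreds = Seq((3.0, 2.5), (1.0, 1.5), (2.0, 2.0))
    // MSE: mean of squared (label - prediction) differences
    val mse = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.sum / valuesAndPreds.size
    println(mse) // (0.25 + 0.25 + 0.0) / 3 ≈ 0.1667
  }
}
```

In the Spark program the only difference is that the mean is taken over an RDD (`reduce(_ + _) / count()`) rather than a local collection; MLlib also provides `org.apache.spark.mllib.evaluation.RegressionMetrics`, which accepts an `RDD[(Double, Double)]` of (prediction, label) pairs and exposes `meanSquaredError` directly.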