
Gradient Boosted Tree Regression — Spark and Python


This story demonstrates how to implement a “gradient boosted tree regression” model using Python and Spark machine learning. The dataset is “bike rental info” from 2011–2012 in the Capital Bikeshare system. Our goal is to predict the count of bike rentals.

1. Load the data

The data is stored as a CSV file. We create a Spark DataFrame containing the bike dataset and cache it so that it is read from disk only once.

# Load the dataset and cache it
df = spark.read.csv("/databricks-datasets/bikeSharing/data-001/hour.csv", header="true", inferSchema="true")
df.cache()

# View the imported dataset
display(df)

Output:

2. Pre-Process the data

Fields such as “weekday” are indexed, and all the other fields except the date field “dteday” are numerical. The count is our target label: the “cnt” column we aim to predict equals the sum of the “casual” and “registered” columns.

The next step is to remove the “casual” and “registered” columns from the dataset to make sure we do not use them when predicting “cnt”. We also discard “dteday” and keep columns such as “season”, “yr”, “mnth” and “weekday”.
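
Before dropping the columns, an optional sanity check can confirm the relationship described above. This is a minimal sketch that assumes the column names shown in the schema:

from pyspark.sql.functions import col
# Count rows where "cnt" does not equal "casual" + "registered" -- we expect 0
mismatches = df.filter(col("cnt") != col("casual") + col("registered")).count()
print("Rows where cnt != casual + registered: %d" % mismatches)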

# Drop the features mentioned above
df = df.drop("instant").drop("dteday").drop("casual").drop("registered")

# Print the schema of our dataset to see the type of each column
df.printSchema()

3. Cast Data types

The DataFrame uses string categories, but we know the columns are numerical in nature, so we cast them to a numeric type before proceeding.

# Cast all columns to a numeric type
from pyspark.sql.functions import col  # for referring to a column by its name in the line below

df = df.select([col(c).cast("double").alias(c) for c in df.columns])
df.printSchema()

4. Train & Test Sets

The data preparation step splits the dataset into training and test sets. We train and tune the model on the training set.

# Split 70% for training and 30% for testing
train, test = df.randomSplit([0.7, 0.3])
print("We have %d training examples and %d test examples." % (train.count(), test.count()))

There are 12160 training samples & 5219 test samples.


5. Machine Learning Pipeline

Now that the data is prepared, let’s train an ML model to predict future rentals.

For every row in the data, the feature vector describes what we know, such as the weather, the day of the week, etc., and the label is what we aim to predict, in this case “cnt”.

We then assemble a Pipeline with the following stages:

  • VectorAssembler: This assembles the feature columns into a feature vector.
  • VectorIndexer: This heuristically identifies columns that are meant to be categorical, treating any column with a small number of distinct values as categorical.
  • GBTRegressor: This uses the Gradient-Boosted Trees (GBT) algorithm to learn how to predict rental counts from the feature vectors.
  • CrossValidator: This tunes the GBT algorithm’s parameters to improve the accuracy of our model.
from pyspark.ml.feature import VectorAssembler, VectorIndexer

featuresCols = df.columns
featuresCols.remove('cnt')

# Concatenates all feature columns into a single feature vector in a new column "rawFeatures"
vectorAssembler = VectorAssembler(inputCols=featuresCols, outputCol="rawFeatures")

# Identifies categorical features and indexes them
vectorIndexer = VectorIndexer(inputCol="rawFeatures", outputCol="features", maxCategories=4)

Next, we define the training stage of the Pipeline. GBTRegressor takes feature vectors and labels as input and learns to predict the target labels of new samples.

from pyspark.ml.regression import GBTRegressor

# Takes the "features" column and learns to predict "cnt"
gbt = GBTRegressor(labelCol="cnt")

We then use cross-validation to tune the parameters and achieve the best results. It trains multiple models and chooses the best one by minimizing a metric; our metric is Root Mean Squared Error (RMSE).

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

# Define a grid of hyperparameters to test:
# - maxDepth: max depth of each decision tree in the GBT ensemble
# - maxIter: iterations, i.e., number of trees in each GBT ensemble
# In this example notebook, we keep these values small. In practice, to get the highest accuracy,
# you would likely want to try deeper trees (10 or higher) and more trees in the ensemble (>100).
paramGrid = ParamGridBuilder()\
  .addGrid(gbt.maxDepth, [2, 5])\
  .addGrid(gbt.maxIter, [10, 100])\
  .build()

# We define an evaluation metric. This tells CrossValidator how well we are doing by comparing the true labels with predictions.
evaluator = RegressionEvaluator(metricName="rmse", labelCol=gbt.getLabelCol(), predictionCol=gbt.getPredictionCol())

# Declare the CrossValidator, which runs model tuning for us.
cv = CrossValidator(estimator=gbt, evaluator=evaluator, estimatorParamMaps=paramGrid)

Lastly, we tie feature processing and model training together into a single Pipeline.

from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[vectorAssembler, vectorIndexer, cv])

6. Train & Test the Pipeline

pipelineModel = pipeline.fit(train)

MLlib can log its tuning trials to MLflow. Once the tuning fit() call completes, the MLflow UI can be opened to view the logged runs.
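
As a minimal sketch (assuming the mlflow client library is available and the tuning runs were auto-logged to the current experiment, as on Databricks), the logged runs can also be queried programmatically:

import mlflow
# Fetch the runs logged for the current experiment as a pandas DataFrame
runs = mlflow.search_runs()
print(runs.head())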

predictions = pipelineModel.transform(test)
display(predictions.select("cnt", "prediction", *featuresCols))

The result may not be the best, but that’s where model tuning kicks in.

The RMSE mentioned above tells us how well our model predicts on new samples: the lower the RMSE, the better.

rmse = evaluator.evaluate(predictions)
print("RMSE on our test set: %g" % rmse)

RMSE of the test set: 44.6918

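To see which hyperparameter combination won the tuning, the fitted pipeline can be inspected. This is a minimal sketch; it assumes the CrossValidator is the last stage of the pipeline defined above:

# The last pipeline stage is the fitted CrossValidatorModel
cvModel = pipelineModel.stages[-1]
bestGbt = cvModel.bestModel  # GBT model trained with the winning parameters

# Winning hyperparameter values from the grid defined earlier
print("Best maxDepth: %d" % bestGbt.getOrDefault("maxDepth"))
print("Best maxIter: %d" % bestGbt.getOrDefault("maxIter"))

# Average cross-validated RMSE for every point on the grid
for params, metric in zip(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics):
    print({p.name: v for p, v in params.items()}, "->", metric)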

7. Tips on improving the model

There are several ways we could further improve our model:


  • Expert knowledge
  • Better tuning
  • Feature engineering

Different combinations of the hyperparameters can be tried to find the best solution, as in the sketch below.
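
For example, a broader grid could be defined for the CrossValidator above (the values here are purely illustrative; more combinations mean longer tuning time):

# A larger hyperparameter grid -- illustrative values only
paramGrid = ParamGridBuilder() \
    .addGrid(gbt.maxDepth, [4, 6, 8, 10]) \
    .addGrid(gbt.maxIter, [50, 100, 200]) \
    .addGrid(gbt.stepSize, [0.05, 0.1]) \
    .build()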

Connect with me on LinkedIn and check out my GitHub for the complete notebook.

Translated from: https://towardsdatascience.com/gradient-boosted-tree-regression-spark-dd5ac316a252

