

DecisionTreeRegressor — Stop Using It for Future Projections!

Published: 2023/12/15


Inside AI

Scikit-learn is one of the most famous machine learning libraries in Python. It offers several classification, regression, and clustering algorithms, and its key strength, in my opinion, is its seamless integration with NumPy, Pandas, and SciPy.


Scikit-learn is so well written by its developers that with a couple of lines of code we can check model predictions from many different algorithms. I sometimes feel this strength of Scikit-learn inadvertently works to its disadvantage: machine learning developers, especially those with relatively little experience, implement an inappropriate algorithm for prediction without grasping a particular algorithm's salient features and limitations.


In this article, I will discuss the reason we should not use the decision tree regression algorithm for predictions that involve extrapolating the data.


Objective


We have the iron, calcium, and protein content of peas from the day they are picked on the farm through day 1142. Let us assume that it is easier and more economical to measure the iron and calcium content compared to the protein content.


We will use this data to train the DecisionTreeRegressor algorithm and then predict the protein content based on new data points for iron content, calcium content, and days passed.


Sample Data File


I think the data file is pretty much self-explanatory. Each row shows the iron, calcium, and protein content of the peas along with the days since harvesting.

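The original Excel file is not attached here, so for anyone following along, a synthetic stand-in with the same layout can be generated. The column names (including the trailing spaces in "Protein Content ", which the article's later code relies on) mirror the ones used below; the value ranges and trends are made up purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
days = np.arange(1, 1143)  # day 1 through day 1142

# Made-up nutrient trends drifting over time, plus a little noise.
data = pd.DataFrame({
    "Days Passed": days,
    "Iron Content ": 2.0 + 0.001 * days + rng.normal(0, 0.05, days.size),
    "Calcium Content ": 25.0 + 0.01 * days + rng.normal(0, 0.5, days.size),
    "Protein Content ": 40.0 + 0.01 * days + rng.normal(0, 0.3, days.size),
})

# data.to_excel("Peas Nutrient.xlsx", index=False)  # needs openpyxl installed
```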

Step 1- We will import the packages pandas, Matplotlib, and NumPy, along with the DecisionTreeRegressor class, which we are going to use for our analysis.


from sklearn.tree import DecisionTreeRegressor
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Step 2- Read the sample data Excel file into a pandas DataFrame called “data”.


data= pd.read_excel("Peas Nutrient.xlsx")

In this article, I will not focus on preliminary data quality checks like blank values, outliers, etc., or the respective correction approaches, and I will assume the data series contains no such discrepancies.

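Although the article skips these checks, a minimal sanity pass along those lines could look like the sketch below. The frame here is a small stand-in (in the article it would come from `pd.read_excel("Peas Nutrient.xlsx")`), and the 3-standard-deviation outlier cutoff is an arbitrary illustrative choice.

```python
import numpy as np
import pandas as pd

# Stand-in frame; in the article this would be pd.read_excel("Peas Nutrient.xlsx").
data = pd.DataFrame({
    "Days Passed": np.arange(1, 1143),
    "Protein Content ": np.linspace(40, 52, 1142),
})

# Count blank values per column.
missing = data.isna().sum()

# Crude outlier count: values more than 3 standard deviations from the column mean.
z = (data - data.mean()) / data.std()
outliers = (z.abs() > 3).sum()

print(missing)
print(outliers)
```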

Step 3- We will split the full dataset into two parts: a training set and a testing set. As the names suggest, we will use the training dataset to train the decision tree regressor algorithm, and then compare the protein predictions with the actual content in the testing set.


In the code below, data records from day 1 to day 900 are sliced as training data, and data records from day 901 to day 1142 as testing data.


Training_data = data.iloc[:900]   # rows for day 1 to day 900
Test_data = data.iloc[900:]       # rows for day 901 to day 1142

Step 4- “Days passed”, “iron content”, and “calcium content” are the independent variables used for prediction. The predicted “protein content” is the dependent variable. Conventionally, the independent variables are denoted “X” and the dependent variable “y”.


In the code below, the “Protein content” data column is dropped from the DataFrame, and the remaining data, i.e. the independent variable data points, is declared as X_train. Similarly, all data columns except “Protein content” are dropped, and the result is declared as y_train.


X_train=Training_data.drop(["Protein Content "], axis=1)
y_train=Training_data.drop(["Days Passed", "Iron Content " ,"Calcium Content "], axis=1)

The same process is repeated in the code below for the testing dataset, i.e. values from day 901 to day 1142.


X_test=Test_data.drop(["Protein Content "], axis=1)
y_test=Test_data.drop(["Days Passed", "Iron Content " ,"Calcium Content "], axis=1)
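The same chronological split can also be expressed with scikit-learn's train_test_split by turning shuffling off, which keeps the time order intact. This is a sketch assuming the same 1142-row frame; the stand-in data here is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    "Days Passed": np.arange(1, 1143),
    "Protein Content ": np.linspace(40, 52, 1142),
})

X = data.drop(["Protein Content "], axis=1)
y = data["Protein Content "]

# shuffle=False keeps chronological order: first 900 rows train, last 242 test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=900, shuffle=False
)
```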

Step 5- The DecisionTreeRegressor model is trained on the training dataset. Then the score is checked to understand how well the algorithm fits this data.


tree_reg = DecisionTreeRegressor().fit(X_train, y_train)
print("The model training score is" , tree_reg.score(X_train, y_train))

A perfect score of 1.0 is itself an indication that the model is overfitting.

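The reason is that an unconstrained tree keeps splitting until every leaf is pure, effectively memorizing the training set. The leaf count and depth make this concrete; the sketch below uses pure-noise synthetic data, where there is nothing real to learn, yet the training score is still perfect.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(900, 3))
y = rng.normal(size=900)  # pure noise: no real signal to learn

tree = DecisionTreeRegressor().fit(X, y)

# With no depth constraint, one leaf per training sample -> memorized data.
print(tree.get_n_leaves(), tree.get_depth())
print(tree.score(X, y))  # a "perfect" 1.0 training score
```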

Step 6- To address the overfitting caused by the unconstrained depth of the tree during training, we will constrain the maximum depth to 6.


tree_reg = DecisionTreeRegressor(max_depth=6).fit(X_train, y_train)
print("The model training score is" , tree_reg.score(X_train, y_train))

This mitigates the model's overfitting on the training data, and the model is ready to predict the protein content for the test data points.

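max_depth is only one of several complexity controls; min_samples_leaf and min_samples_split serve the same purpose. Rather than hand-picking a value, a small grid search with time-aware cross-validation is a more principled way to choose. The sketch below uses synthetic trend data, and the grid values are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
X = np.arange(900, dtype=float).reshape(-1, 1)
y = 40 + 0.01 * X.ravel() + rng.normal(0, 0.3, 900)

# TimeSeriesSplit always validates on data *after* the training fold,
# mimicking the forward-in-time prediction task.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 5, 20]},
    cv=TimeSeriesSplit(n_splits=5),
)
search.fit(X, y)
print(search.best_params_)
```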

Step 7- In the code below, the “protein content” of the test dataset, i.e. from day 901 to day 1142, is predicted based on the respective “days passed”, “iron content”, and “calcium content” data.


y_pred_tree = tree_reg.predict(X_test)
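Before plotting, it is worth scoring the model on the held-out set with the same .score method used on the training set, alongside an error metric such as mean absolute error. The sketch below reproduces the article's setup with a single synthetic upward trend; on such chronologically split trend data, the test R² even goes negative, i.e. worse than always predicting the test-set mean.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
days = np.arange(1, 1143, dtype=float).reshape(-1, 1)
protein = 40 + 0.01 * days.ravel() + rng.normal(0, 0.3, 1142)

X_train, X_test = days[:900], days[900:]
y_train, y_test = protein[:900], protein[900:]

tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X_train, y_train)
y_pred = tree.predict(X_test)

print("train R^2:", tree.score(X_train, y_train))
print("test  R^2:", tree.score(X_test, y_test))  # negative on this trend data
print("test  MAE:", mean_absolute_error(y_test, y_pred))
```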

Step 8- We will plot the protein content predicted by the decision tree regression model and compare it with the actual protein content in the test dataset from day 901 to day 1142.


plt.plot(X_test["Days Passed"], y_test, label="Actual Data")
plt.plot(X_test["Days Passed"], np.rint(y_pred_tree), label="Predicted Data")
plt.xlabel("Days Passed")
plt.ylabel("Protein Content (in Grams)")
plt.legend(loc="best")
plt.show()

We can see that the decision tree regressor model, which fits the training dataset quite well with a score of 0.93, fails miserably at predicting the protein content on the test data. The model predicts the same protein content of ~51.34 for all days.


We should not use the decision tree regression model for predictions that involve extrapolating the data. This is just one example; the main takeaway for us machine learning practitioners is to consider the data, the prediction objective, and each algorithm's strengths and limitations before starting to model.

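This failure mode generalizes: a tree's prediction is always the mean of the training targets in some leaf, so it can never output a value beyond the range seen during training, whereas a linear model simply continues the trend. A minimal one-feature comparison (synthetic data, not the pea dataset) makes the contrast plain:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

X_train = np.arange(0, 100, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()          # simple upward trend: y = 2x
X_future = np.array([[150.0], [200.0]])  # beyond the training range

tree = DecisionTreeRegressor().fit(X_train, y_train)
line = LinearRegression().fit(X_train, y_train)

print(tree.predict(X_future))  # capped at the largest training target: 198, 198
print(line.predict(X_future))  # follows the trend: 300, 400
```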

We can make similar mistakes while selecting the independent variables for machine learning supervised algorithms. In the article “How to identify the right independent variables for Machine Learning Supervised Algorithms?”, I have discussed a structured approach to identifying the appropriate independent variables to make accurate predictions.


Translated from: https://towardsdatascience.com/decisiontreeregressor-stop-using-for-future-projections-e27104537f6a

