Machine Learning Process
If you would like to get a general introduction to Machine Learning before this, check out this article:
Now that we understand what Machine Learning is, let us look at how Machine Learning is applied to solve any problem.
This is the basic process used to apply machine learning to any problem:
Data Gathering
The first step to solving any machine learning problem is to gather relevant data. It could come from different sources and in different formats, such as plain text, categorical, or numerical data. Data Gathering is important, as the outcome of this step directly affects the nature of our problem.
In most cases, data is not handed to us on a silver platter, all ready-made; it is not usually the case that the data we have decided is relevant is available right away. It is very much possible that we may have to perform some sort of exercise or a controlled experiment to gather data that we can work with. We must also keep in mind that the data we are collecting should come from legitimate and legal processes, such that all the parties involved are well aware of what is being collected.
Let us, for the purpose of this article, assume that we have gathered data about cars and that we are trying to predict the price of a new car with the help of machine learning.
Data Preprocessing
Now that we have gathered data that is relevant to the problem at hand, we must bring it to a homogeneous state. The present form of our data could include datasets of various types: maybe a table made up of a thousand rows and multiple columns of car data, or maybe pictures of cars from different angles. It is always advisable to keep things simple and work with data of one particular type; that is, we should decide before we start working on our algorithm whether we want to work with image data, text data, or, if we are feeling a little too adventurous, video data!
Types of Data. Photo by Author.

Like every computer program, Machine Learning algorithms only understand 1s and 0s. So in order to run any such algorithm, we have to first convert the data into a machine-readable format; it simply won't understand if we put on a slideshow of our pictures! We can go with any type of data (numerical, image, video, or text), but we will have to configure it so that it is machine-understandable. We make sure this happens by Encoding the data: a process in which we take all the types of data and represent them numerically.
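As a quick illustration of encoding (a hand-rolled sketch; real projects typically use library encoders), here is one-hot encoding of a categorical car feature in plain Python. The color values are invented for the example:

```python
# Sketch: one-hot encoding a categorical feature by hand.
# The feature values here (car colors) are illustrative, not from a real dataset.
colors = ["red", "blue", "red", "green"]

# Build a stable vocabulary of the distinct categories.
vocab = sorted(set(colors))  # ['blue', 'green', 'red']

# Represent each value as a one-hot vector over the vocabulary.
encoded = [[1 if value == v else 0 for v in vocab] for value in colors]

print(vocab)    # ['blue', 'green', 'red']
print(encoded)  # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Every category becomes a purely numerical vector, which is exactly the machine-readable form the algorithm needs.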
For a simple and comprehensible introduction to Data Preprocessing and all the steps involved, check out this article:
Train and Test Data
Before we start building a Machine Learning model, we have to first identify our features and decide on our goal. Features are the attributes of our data which tell us about the different entities in the data. For instance, we could have a huge dataset about cars for predicting the price of a new car using machine learning. With these cars being the entities, the features in this case might be the engine power, mileage, top speed, color, seating capacity, type of car, and so on. The goal, or the Target variable, in this case would be the price of the car.
When we work on any machine learning problem, we always split the dataset that we have into a Training set and a Test set, usually with a 70/30 or 80/20 split. The Training set, as the name suggests, is used to train the model. When we "train" the model, it tries to understand how all the features of the dataset form the target variable (in the case of supervised learning), or the relationships and correlations between all the features (in the case of unsupervised learning). After this, the Test set is used to find out how well the model understands the data.
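The split itself can be sketched in a few lines (plain Python, no libraries; the records are placeholder numbers rather than real car data):

```python
import random

# Sketch: an 80/20 train/test split without external libraries.
# `records` stands in for our car dataset; the values are placeholders.
records = list(range(100))

random.seed(42)          # fix the shuffle so the split is reproducible
random.shuffle(records)  # shuffle so the split is not order-dependent

split = int(len(records) * 0.8)
train, test = records[:split], records[split:]

print(len(train), len(test))  # 80 20
```

In practice a library helper does the same thing, but the idea is just this: shuffle once, then cut the dataset at the chosen ratio.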
Machine Learning Algorithm Selection
After transforming the data so that it is clean and workable, we get a better idea of the solution we will try to implement to solve the problem. This is because it is actually the data that decides what we can and cannot use.
Say we want to build a chatbot. The chatbot will answer as per the user's queries. So, we can say that the first step in any conversation will be the chatbot trying to identify the intent of the user, and there is our first machine learning problem: Intent Classification.
This problem requires us to use a particular type of data: text-based data. The machine learning algorithm we choose must be a classification algorithm, that is, one that classifies new input data into a certain label class as per the data it has already seen. Before this step, of course, the text from the user will get encoded and go through all the necessary data preprocessing steps, and only then will it be fed into the machine learning algorithm. Although we have to be careful in selecting our machine learning algorithm, it is good to explore all the available options and experiment with various appropriate machine learning algorithms before selecting the final one; it is considered a best practice anyway.
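As a toy illustration only, a keyword-overlap "classifier" hints at what intent classification does; a real chatbot would train a proper classification algorithm on encoded text. The intents and keywords here are invented:

```python
# Toy sketch of intent classification: pick the intent whose keyword set
# overlaps the user's query the most. Real systems learn this from data;
# the intents and keyword sets below are invented for illustration.
INTENT_KEYWORDS = {
    "greeting": {"hello", "hi", "hey"},
    "price_query": {"price", "cost", "much"},
}

def classify_intent(query: str) -> str:
    tokens = set(query.lower().split())
    # Score each intent by keyword overlap and return the best match.
    return max(INTENT_KEYWORDS, key=lambda intent: len(tokens & INTENT_KEYWORDS[intent]))

print(classify_intent("how much is the price"))  # price_query
```

A trained classifier replaces the hand-written keyword sets with patterns it has learned from labeled examples, but the input/output shape (text in, label out) is the same.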
Cost Function
A Cost Function, in a nutshell, is a mathematical function which gives out cost as a metric. And as you may have heard, there is a cost associated with every decision that we take.
“A Cost function or a Loss function is a function that maps an event or values of one or more variables onto a real number intuitively representing some “cost” associated with the event.” — Wikipedia
This function is used to quantify the penalty corresponding to every step in any procedure. In terms of an optimization problem, we must work on minimizing this cost value.
Let us go through an example. Suppose you are climbing down a cliff. At any point, you have several paths you can take to eventually reach the bottom, but you will:
- Look for the best path from that point
- Reject all the paths taking you any higher than you already are
If we associate going up with a penalty, or a cost, then going up will increase the total cost (in terms of time and effort). So we can keep time and effort as factors if we were to design a mathematical function to quantify this cost metric.
Another example: suppose you are driving on a road trip from Place A to Place B. Again, we have several paths to reach B, but we:
- Look for the shortest path
- Reject all the paths which take us somewhere else (obviously!)
If we associate this situation with cost, we will incur a high cost if we neglect the two points mentioned above. Here we can keep time and gas money as the factors making up our cost function, and use it to judge any path we take.
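The road-trip example can be turned into a tiny cost function over time and fuel; the weights and path values below are invented purely for illustration:

```python
# Toy sketch: quantifying the road-trip example as a cost over time and fuel.
# The hourly time value and the path data are invented for illustration.
def trip_cost(hours, fuel_dollars, hour_value=15.0):
    """Total cost of a path = money spent on fuel + time valued in dollars."""
    return fuel_dollars + hours * hour_value

# Each path: (hours taken, fuel cost in dollars)
paths = {"highway": (2.0, 40.0), "scenic": (3.5, 30.0)}

# Choosing a path becomes a minimization problem over the cost function.
best = min(paths, key=lambda p: trip_cost(*paths[p]))
print(best)  # highway (cost 70.0 vs 82.5)
```

The point is not the numbers but the shape of the problem: once the cost is a function, "pick the best path" becomes "minimize the function."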
Photo by Vidar Nordli-Mathisen on Unsplash

Cost Function in Machine Learning
Any machine learning algorithm must reach an optimal state for it to function properly. A Cost Function helps us determine whether or not our model is at that optimal state. The model finds that optimal state by continuously comparing its hypothesis values to the original values in the training set. Woah… back up! What!? Don't worry, we will go through all the concepts carefully!
Hypothesis Function
Behind any machine learning model is essentially a mathematical function which explains the role of the various features in the data, either in forming the target variable or in forming correlations between different features.
Hypothesis Function. Photo by Author.

As mentioned before, during training the machine learning model tries to understand how different combinations of values of the training data features form the corresponding target variables. To understand this better, let us take one training record: training essentially means taking all the features of this record and somehow mapping them to this record's target value. A brilliant example would be the cars dataset we were talking about earlier. Notation-wise, the features are taken as X and the Target variable is taken as Y. During this process, training data is fed into a learning algorithm, which is chosen based on the problem we are trying to solve. It could be a classification problem, a regression problem, or maybe something else entirely. It is the job of this learning algorithm to output this Hypothesis Function.
Hypothesis Function. Source: CS229 lecture notes by Andrew Ng.

For a two-variable problem, this could be our Hypothesis function. All the θ values are parameters, or weights, which are chosen such that we get an estimate closest to the corresponding Target value for each record.
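As a minimal sketch, the two-variable hypothesis hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ can be written directly as a function; the θ values and inputs below are arbitrary illustrative numbers, not fitted parameters:

```python
# Sketch of the two-variable linear hypothesis from the CS229 notes:
#   h_theta(x) = theta0 + theta1 * x1 + theta2 * x2
# The weights below are arbitrary, chosen only to show the mechanics.
def hypothesis(theta, x):
    """Estimate the target for one record x = (x1, x2)."""
    theta0, theta1, theta2 = theta
    x1, x2 = x
    return theta0 + theta1 * x1 + theta2 * x2

print(hypothesis((1.0, 2.0, 3.0), (4.0, 5.0)))  # 1 + 2*4 + 3*5 = 24.0
```

Training is then the process of finding θ values that make this estimate land close to each record's actual target.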
The Hypothesis Function then takes in the features from each training record and tries to estimate the corresponding target value. This function could be a simple linear function or something complex; it really depends on the data and the type of algorithm being used. And because it is an estimator function, the output values are not expected to be exactly equal to the target values, at least not in the first attempt. Let us take our cars dataset once more: if we put a learning algorithm to use on this dataset and train it using the features, we will get an estimate of the price of each car in the dataset. And as this is a training dataset, we already have the price of each car as the Target variable.
Hypothesis trendline of estimated values (blue) cutting close to the Target values (red). Photo by Author.

Cost Function
This is where the Cost Function comes into play. We want the difference between the estimated value and the actual Target value in the training data to be as low as possible; only then can we say our model is a strong one, meaning that it will give out the correct Target value, or at least a value that is very close to it, for a particular training record. So this becomes a minimization problem. The difference is what is called the cost, and the minimization function is what is called the cost function. There are several ways to achieve a state of minima. We could simply minimize the difference between the estimated value and the target value over the whole training set, or we could take the squared difference, or some other variation that achieves the same thing. One of the most widely accepted, and quite reasonable, cost functions is this one, which you will stumble upon very easily if you are reading up on machine learning algorithms:
Cost Function. Source: CS229 lecture notes by Andrew Ng.

This function works well for most regression problems. Yeah, I know, I know, I said I would keep it simple and not scare you with weird equations. Worry not, we are not going to fill up a giant chalkboard with formulas (or formulae, if you were to be so formal). Let me give you a quick explanation and make everything crystal clear.
J(θ): the Cost Function
m: number of training records
hθ: the Hypothesis Function
x⁽ⁱ⁾: the iᵗʰ training data record
hθ(x⁽ⁱ⁾): the Hypothesis Function value for the iᵗʰ training record
y⁽ⁱ⁾: the iᵗʰ target value
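Under these definitions, the squared-error cost J(θ) = (1/2m) · Σ(hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² can be sketched in a few lines; the prediction and target values below are toy numbers, not from any real dataset:

```python
# Sketch of the squared-error cost J = (1/2m) * sum((h_i - y_i)^2),
# matching the notation above. Predictions and targets are toy numbers.
def cost(predictions, targets):
    m = len(targets)
    return sum((h - y) ** 2 for h, y in zip(predictions, targets)) / (2 * m)

# Differences are 1 and -2; squared: 1 and 4; sum 5; divided by 2m=4 -> 1.25.
print(cost([2.0, 4.0], [1.0, 6.0]))  # 1.25
```

Note how the negative difference (-2) still contributes positively once squared, which is exactly the point made below.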
- As I mentioned before, we denote the cost as the difference between the hypothesis function value and the target value
- We eventually sum up all the squared differences we get from each training record to get the total cost over the complete training data
- We square the difference in each case, which has two significant advantages
The difference between the estimated value from the Hypothesis function and the Target value might be negative. Squaring it eliminates the possibility of decreasing the total cost while summing the differences from each training record, simply because a few of them turned out to be negative.
| Note that we could also take absolute values, but that leads to an issue: the absolute value function is much harder to analyse mathematically than a squared function. Let me know in the comments if you need further explanation on this; basically, the absolute value function is not differentiable at all points, and differentiability is necessary to find the point at which the cost function is at a minimum. |
- Squaring also puts more emphasis on larger differences, as the effect of these differences is squared as well. Thus, the contribution of these larger differences increases and stands out.
- Performing minimization by differentiating the function over the θ values will cancel out the 1/2 present in the formula
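To see why that 1/2 cancels, differentiate the single-record squared error with respect to a parameter θⱼ (a standard step written out here for clarity, not shown in the original):

```latex
\frac{\partial}{\partial \theta_j} \, \frac{1}{2}\bigl(h_\theta(x) - y\bigr)^2
  = \bigl(h_\theta(x) - y\bigr) \cdot \frac{\partial h_\theta(x)}{\partial \theta_j}
  = \bigl(h_\theta(x) - y\bigr)\, x_j
```

The last equality holds for the linear hypothesis above, where the derivative of hθ(x) with respect to θⱼ is simply xⱼ; the 2 from the power rule cancels the 1/2, leaving a clean gradient.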
Machine Learning Model
As shown in the Hypothesis trendline graph above, the main objective of minimizing the cost function is to get a straight trendline that covers most of the Target values, or at least comes as close to the points as possible. This is why we calculate the differences and write a cost function to reduce them.
And this is not a one-time process; it is an iterative process in which we choose the parameters for the Hypothesis function, calculate the estimated values, then use the cost function to find the cost. After that we minimize this cost and perform the whole activity again. In this way we redo the whole calculation until we think we have the most optimized function. We can check the state of the current result at any time by simply plotting the function against the Target values. This iterative process is at the heart of all the optimization algorithms in place today, so you don't have to perform the activity by hand. The most popular one, and you might even have heard about it, is Gradient Descent.
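To make the loop concrete, here is a hedged sketch of gradient descent on a one-parameter model h(x) = θ·x; the data, learning rate, and iteration count are all illustrative choices, not from the article:

```python
# Sketch: gradient descent on a one-parameter model h(x) = theta * x.
# The data below is generated from theta = 2, so we expect to recover ~2.0.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

theta = 0.0
learning_rate = 0.05
m = len(xs)

for _ in range(500):
    # Gradient of J(theta) = (1/2m) * sum((theta*x - y)^2) w.r.t. theta.
    grad = sum((theta * x - y) * x for x, y in zip(xs, ys)) / m
    # Step downhill: move theta against the gradient.
    theta -= learning_rate * grad

print(round(theta, 3))  # converges to 2.0
```

Each pass is exactly the cycle described above: estimate, measure the cost's slope, adjust the parameter, and repeat until the cost stops shrinking.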
And when we do have the Hypothesis function whose estimated values are closest to the Target values, we can take that function and claim it as the one that fits the data in the best possible manner. And there we have our model!
In this article, I wanted to write about the general process that is followed when solving any machine learning problem and building a machine learning model. This was more of a theoretical explanation, but I do have more technical guides lined up.
I hope you liked this article. If you have any concerns/questions regarding the content, let me know in the comments! Thanks for reading!
Originally published at https://thedatascienceportal.com on August 7, 2020. For more such content head over to thedatascienceportal.
Translated from: https://towardsdatascience.com/machine-learning-process-7beab5c4f31b