Using Open Source Data & Machine Learning to Predict Ocean Temperatures
In this tutorial, we’re going to show you how to take open source data from the National Oceanic and Atmospheric Administration (NOAA), clean it, and forecast future temperatures using no-code machine learning methods.
This particular data comes from the Harmful Algal BloomS Observation System (HABSOS). There are several interesting questions to ask of this data, namely: what is the relationship between algal blooms and water temperature fluctuations? For this tutorial, we’re going to start with a basic question: can we predict what temperatures will be over the next five months?
The first part of this tutorial deals with acquiring and cleaning the dataset. There are a lot of approaches to this; what is shown below is just one approach. Further, if your dataset is already clean, you can skip all that “data engineering” and jump straight into no-code AI bliss :)
Step 1: Download & Clean the Data
First, we download the data from the HABSOS site linked above. For convenience, we are posting the file here as well.
This CSV has 21 columns, which we discovered with this bash command.
$ awk -F',' '{print NF}' habsos_20200310.csv | sort -nu | tail -n 1
21
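If you'd rather stay in Python, a quick equivalent check (a small sketch, assuming pandas is installed and the file is in the working directory) is to read just the header row and count the columns:

import pandas as pd

# nrows=0 parses only the column names without loading any data rows
header = pd.read_csv('habsos_20200310.csv', nrows=0)
print(len(header.columns))  # expect 21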
We’ll explore the rest of the data in subsequent tutorials, but of these 21 columns, the only ones we’re interested in for now are:
- sample_date
- sample_depth
- water_temp
In addition to only needing a subset of the columns in the data, there are other issues to deal with in order to get the data ready for analysis. We need to:
- Remove rows with NaN values (i.e. empty values) in the water_temp column,
- Select only the measurements made at a depth of 0.5 meters (to remove temperature variability due to ocean depth), and
- Regularize the data periods by turning the datetime values into date values.
import pandas as pd

# error_bad_lines was removed in pandas 2.0; use on_bad_lines='skip' there instead
df = pd.read_csv('habsos_20200310.csv', sep=',', error_bad_lines=False, index_col=False, dtype='unicode')
pd.set_option('display.max_rows', None)
# Get only the columns we care about
dfSub = df[['sample_date', 'sample_depth', 'water_temp']]
# Remove the NaN values
dfClean = dfSub.dropna()
# Select 0.5 depth measurements only (filter on the cleaned frame so the boolean mask aligns)
dfClean2 = dfClean.loc[dfClean['sample_depth'] == '0.5'].copy()
# Split the datetime values, keeping only the date portion
dfClean2['sample_date'] = dfClean2['sample_date'].str.split(expand=True)[0]
dfClean2.to_csv(r'/PATH/TO/YOUR/OUTPUT/out.csv', index=False)
There’s another big problem with this data: on certain days, there are multiple sensor readings; on other days, there are no sensor readings. Sometimes there are entire months without readings.
These problems are quicker to address in spreadsheets by using pivot tables. And, now that we have reduced the size of the data with the preceding Python script, we are able to load it into a Google Sheet.
What we ended up doing was making a pivot table for each month of each year (1954 to 2020) and taking the median water temperature for that month. We used median instead of average values in case there were wild outlier measurements that might skew our summarized data.
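If you'd rather do the same summarization in Python instead of a spreadsheet, a minimal pandas sketch (assuming the out.csv written by the cleaning script above) could look like this; months with no readings simply come out as empty values:

import pandas as pd

# Load the cleaned data produced by the script above
df = pd.read_csv('out.csv', parse_dates=['sample_date'])
df['water_temp'] = pd.to_numeric(df['water_temp'], errors='coerce')

# Median water temperature per calendar month; the median is less sensitive to outliers than the mean
monthly = df.set_index('sample_date')['water_temp'].resample('MS').median()
print(monthly.tail())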
Our results are available for viewing in the third tab of this Google Sheet.
Let’s take those results and bring them into Monument!
Step 2: Chart the Data & Use No-Code Machine Learning to Generate a Forecast
To chart the data, we’re first going to load it into Monument (www.monument.ai). Monument is an artificial intelligence/machine learning platform that allows you to use advanced algorithms without touching a line of code.
First, we’re going to import our freshly cleaned data into Monument as a CSV file. In the INPUT tab, you’ll see the data as it exists in the source file on the top and the data as it will be imported into Monument on the bottom. If you’re satisfied with how it will be imported, click OK in the bottom right.
Load the data!
When you click OK, you’ll be brought into the MODEL tab. You can drag the “data pills” from the far left into the COLS(X) and ROWS(Y) areas to chart the data. You will clearly see the gaps in the data, where there were months with no temperature readings.
Monument’s algorithms can handle missing data.
This data has a visually recognizable pattern: it resembles a sine wave. In general, and especially when data has a repetitive pattern, it’s good to start an analysis with AutoRegression (AR). AR is one of the more “primitive” algorithms, but it often learns obvious patterns quickly.
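To give a feel for what AR does under the hood, here is a rough statsmodels sketch of fitting an autoregressive model to a monthly series and forecasting five months ahead. This is only an illustration under assumptions: it is not Monument's implementation, the lags=12 order is a guess aimed at the yearly cycle, and gaps are filled by simple interpolation.

import pandas as pd
from statsmodels.tsa.ar_model import AutoReg

# Rebuild the monthly median series from the cleaned out.csv (see the earlier sketch)
df = pd.read_csv('out.csv', parse_dates=['sample_date'])
df['water_temp'] = pd.to_numeric(df['water_temp'], errors='coerce')
monthly = df.set_index('sample_date')['water_temp'].resample('MS').median()

# AR models need a gap-free series, so fill missing months for this sketch
series = monthly.interpolate()

# AR(p): each value is modeled as a linear combination of the previous p values
model = AutoReg(series, lags=12).fit()

# Forecast the next five monthly periods
print(model.predict(start=len(series), end=len(series) + 4))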
When we apply AR to the water temperature data by dragging it into the chart, we see a spiked divergence from the actual historical data early in the training period, but the algorithm quickly gets a handle on what is occurring in the dataset.
By the end of the training data, it almost perfectly overlays onto the training set. When an algorithm does a good job anticipating known historical data in the training period, it can be an indication that the algorithm will do well forecasting the future. (However, a concern is “overfitting,” which we will explore in future articles.)
Off to a good start!
Now, let’s try a Dynamic Linear Model (DLM). DLM is a slightly more complex algorithm; let’s see if it gets us even better results. When we drag DLM into the chart, we notice immediately that something seems off: DLM appears out of sync with the training data. It has trouble anticipating where the peaks and troughs are in the historical data.
Uh oh…
If we zoom in by dragging the windowing widget below the chart and mute the AR results by clicking the color box above the chart, the effect is even more pronounced. The historical data and DLM are out of sync, so it’s unlikely that the forecasted results beyond the historical data will be reliable.
Not looking good…
Let’s try Time-Varying AutoRegression (TVAR). It looks like it produces similar results to AR.
Looking good.
Now, let’s try Long Short-Term Memory (LSTM). This is way off! An LSTM often produces great results for “noisier” data that has less regular patterns. However, on highly patterned data like this dataset, it has trouble.
There are ways to improve the performance of the LSTM (and any algorithm) by adjusting the algorithm’s parameters, but we already have algorithms performing well, so it doesn’t seem worth the effort.
The LSTM has forsaken us…
Now, let’s zoom in to see what we are working with by using the windowing widget on the bottom of the chart. Let’s also click the circles icon in the top right of Monument and select “forecast” to remove the training period and only show the prediction.
TVAR had looked good when zoomed out, but up close all of our other algorithms seem to agree with one another, while TVAR does not. Let’s drop TVAR.
TVAR does not look so good up close.
Let’s bring back “training+forecast,” remove everything but AR, and apply the Gaussian Dynamic Boltzmann Machine (G-DyBM). Things are looking pretty good now :)
The sweet spot.
Let’s flip over to the OUTPUT tab and scroll to the bottom to see our forecasts. Because we made our data periods monthly, p1, p2, p3, p4, and p5 are Month-1, Month-2, Month-3, Month-4, and Month-5 into the future.
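If you want to attach calendar months to those period labels, a tiny pandas sketch like the following does it. The March 2020 value is only a placeholder; substitute the final month that actually appears in your cleaned data:

import pandas as pd

# Hypothetical example: if the last month in the cleaned data were March 2020,
# p1..p5 map to the next five month-starts
last_month = pd.Timestamp('2020-03-01')
future = pd.date_range(last_month + pd.offsets.MonthBegin(1), periods=5, freq='MS')
print(list(future.strftime('%Y-%m')))  # ['2020-04', '2020-05', '2020-06', '2020-07', '2020-08']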
In this tutorial, we took open source data from the internet, cleaned it, loaded it into Monument, and — in minutes! — used advanced data science methods to get forecasts for future median monthly water temperatures in the Gulf of Mexico at a depth of 0.5 meters.
You can download the .mai file of our results from this link.
In the next tutorial, we’ll look deeper at the error rates for each of the algorithms we tried above and discuss why we might select one algorithm over another. We’ll also calculate the standard deviation for the outliers and discuss why this is important.
Translated from: https://medium.com/swlh/using-open-source-data-machine-learning-to-predict-ocean-temperatures-2c8d65165665