Predicting Hazardous Seismic Bumps, Part I: EDA, Feature Engineering, and Train-Test Split for Imbalanced Datasets

Published: 2023/12/15 42 豆豆

Introduction

The seismic bumps dataset is one of the lesser-known binary classification datasets. It captures geological conditions in longwall coal mines using seismic and seismo-acoustic systems, to assess whether the mine workings are prone to rockbursts that cause seismic hazards.

Link to the dataset: https://archive.ics.uci.edu/ml/datasets/seismic-bumps

This is a good dataset for getting practical exposure to imbalanced datasets, working with different kinds of data splits, and assessing a classifier’s performance metrics, including the accuracy paradox.

The dataset also has both categorical and numerical features, which provides a playground for data wrangling and for trying out different feature transformation methods.

This article is not code-heavy but a bit more intuitive in understanding what went right and wrong! The code can be found in my GitHub repository.

Note — I haven’t elaborated on each feature in the EDA and feature engineering steps since they are repetitive. I only have samples in this blog and the full code is available on GitHub in this link.

Exploratory Data Analysis

Photo by Andrew Neel on Unsplash

Data facts:

This dataset has 2584 instances with 19 columns: 4 categorical features, 8 discrete features, and 6 numeric features. The last column is the label, which contains 1 for hazardous and 0 for non-hazardous seismic bumps. For ease of use, I categorized and saved the feature names as follows:

col_list_categorical = ['seismic', 'seismoacoustic', 'shift', 'ghazard']
col_list_numerical = ['genergy', 'gpuls', 'gdenergy', 'gdpuls', 'energy', 'maxenergy']
col_list_discrete = ['nbumps', 'nbumps2', 'nbumps3', 'nbumps4', 'nbumps5', 'nbumps6', 'nbumps7', 'nbumps89']
label = 'class'

Attribute information [Source]:
1. seismic: the result of shift seismic hazard assessment in the mine working obtained by the seismic method (a: lack of hazard, b: low hazard, c: high hazard, d: danger state);
2. seismoacoustic: the result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method;
3. shift: information about the type of a shift (W: coal-getting, N: preparation shift);
4. genergy: seismic energy recorded within the previous shift by the most active geophone (GMax) out of the geophones monitoring the longwall;
5. gpuls: the number of pulses recorded within the previous shift by GMax;
6. gdenergy: the deviation of energy recorded within the previous shift by GMax from the average energy recorded during eight previous shifts;
7. gdpuls: the deviation of the number of pulses recorded within the previous shift by GMax from the average number of pulses recorded during eight previous shifts;
8. ghazard: the result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method based on registrations coming from GMax only;
9. nbumps: the number of seismic bumps recorded within the previous shift;
10. nbumps2: the number of seismic bumps (in energy range [10^2, 10^3)) registered within the previous shift;
11. nbumps3: the number of seismic bumps (in energy range [10^3, 10^4)) registered within the previous shift;
12. nbumps4: the number of seismic bumps (in energy range [10^4, 10^5)) registered within the previous shift;
13. nbumps5: the number of seismic bumps (in energy range [10^5, 10^6)) registered within the last shift;
14. nbumps6: the number of seismic bumps (in energy range [10^6, 10^7)) registered within the previous shift;
15. nbumps7: the number of seismic bumps (in energy range [10^7, 10^8)) registered within the previous shift;
16. nbumps89: the number of seismic bumps (in energy range [10^8, 10^10)) registered within the previous shift;
17. energy: the total energy of seismic bumps registered within the previous shift;
18. maxenergy: the maximum energy of the seismic bumps registered within the previous shift;
19. class: the decision attribute. ‘1’ means that a high-energy seismic bump occurred in the next shift (‘hazardous state’), ‘0’ means that no high-energy seismic bumps occurred in the next shift (‘non-hazardous state’).
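
As a quick sanity check, the feature groups defined earlier can be compared against the 19 attributes listed above. This is only a minimal sketch; the names `all_columns` and `has_duplicates` are introduced here for illustration and are not part of the original notebook.

```python
# Feature groups as defined in the article
col_list_categorical = ['seismic', 'seismoacoustic', 'shift', 'ghazard']
col_list_numerical = ['genergy', 'gpuls', 'gdenergy', 'gdpuls', 'energy', 'maxenergy']
col_list_discrete = ['nbumps', 'nbumps2', 'nbumps3', 'nbumps4', 'nbumps5',
                     'nbumps6', 'nbumps7', 'nbumps89']
label = 'class'

# Hypothetical helper names, used only for this check
all_columns = col_list_categorical + col_list_numerical + col_list_discrete + [label]

# True if any name appears in more than one group
has_duplicates = len(all_columns) != len(set(all_columns))
print(len(all_columns), has_duplicates)
```

If the groups are complete and disjoint, the count should equal the 19 columns of the dataset and no name should repeat.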

Target Class Distribution

Out of 2584 records, only 6.5% are hazardous instances. In other words, roughly 3 out of 50 seismic bumps are hazardous.

sns.countplot(x=label, data=df, palette=colors)
plt.xlabel('CLASS')
plt.ylabel('COUNT')
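
The imbalance above is what makes the accuracy paradox possible. The sketch below is plain arithmetic on the 6.5% figure quoted in this post, not a result from the notebook: a model that always predicts the majority class looks accurate while detecting nothing. The variable names are illustrative.

```python
n_records = 2584
hazardous_share = 0.065  # share of class 1 quoted in the article

# Approximate number of hazardous records implied by the quoted share
approx_hazardous = round(n_records * hazardous_share)

# A "classifier" that always predicts class 0 is right on every non-hazardous record
baseline_accuracy = 1 - hazardous_share
# ...but it finds none of the hazardous ones
baseline_recall = 0.0

print(f"approx. hazardous records: {approx_hazardous}")
print(f"accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.1f}")
```

This is why accuracy alone is a poor yardstick for this dataset; metrics that focus on the minority class are more informative.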

Categorical Features

First, I looked at the categorical features to understand whether any of them is biased towards one of the label categories. For this, I constructed contingency tables using the pandas.crosstab function and compared the ratios and proportions of each category in the categorical features against the class labels.

data_crosstab = pd.crosstab(df['seismoacoustic'], df[label], colnames=['class'])

For the seismoacoustic feature, the ratio of class 1 to class 0 is about 0.06 in every category, so it shows no strong bias. The “seismic” and “shift” features do show some distribution bias: category ‘b’ of ‘seismic’ contains a greater fraction of hazardous seismic bumps, and category “W” of “shift” contains more hazardous seismic bumps than category “N”. The contingency tables for these categorical features are below:

Left: Contingency table for ‘shift’ vs ‘class’; Center: Contingency table for ‘seismic’ vs ‘class’; Right: Contingency table for ‘ghazard’ vs ‘class’
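
The ratios discussed above can be read off a contingency table directly. Here is a minimal sketch on made-up toy data (the values are not from the real dataset), showing both the row-wise class share via `normalize='index'` and the class 1 to class 0 ratio per category.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    'shift': ['W', 'W', 'W', 'W', 'N', 'N', 'N', 'N'],
    'class': [1, 1, 0, 0, 0, 0, 0, 1],
})

counts = pd.crosstab(toy['shift'], toy['class'])
# Share of each class within a category (each row sums to 1)
row_share = pd.crosstab(toy['shift'], toy['class'], normalize='index')
# Ratio of class 1 to class 0 per category, as discussed in the text
ratio = counts[1] / counts[0]

print(row_share)
print(ratio)
```

A category whose ratio sits far from the others is the kind of bias the tables above are meant to reveal.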

Numeric Features

Next, I looked at the data and picked out the numeric columns to understand their descriptive stats, correlations, and distribution plots.

Descriptive Statistics

In pandas, the describe method only provides the count, mean, min, max, std, and percentiles, from which we can assess the skewness of the data to some extent. To better understand the distributions, visuals such as distribution plots and histograms are essential.

df[col_list_numerical].describe()
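
To complement describe, skewness can also be computed directly with DataFrame.skew. The sketch below uses a synthetic log-normal column (`toy_energy` is a made-up name, not a column of the real dataset) to show the two signs of right skew: a mean above the median and a positive skewness value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed column (log-normal), not the real genergy values
toy = pd.DataFrame({'toy_energy': rng.lognormal(mean=10, sigma=1, size=5000)})

summary = toy.describe()
skewness = toy.skew()

# For right-skewed data the mean sits above the median and skewness is positive
print(summary.loc[['mean', '50%']])
print(skewness)
```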

Correlation & Heatmap

I assessed the correlation among these features using pandas.DataFrame.corr and visualized using seaborn.heatmap .

df_corr = df[col_list_numerical].corr()
plt.figure(figsize=[8, 8])
sns.heatmap(data=df_corr, vmin=-1, vmax=1, cmap='gist_earth_r', annot=True, square=True, linewidths=1)
plt.xticks(rotation=90)
plt.yticks(rotation=0)

Observations from Descriptive Stats and Correlations:

The distribution of energy and max energy looks similar. genergy looks right-skewed. To better understand the distributions, I used seaborn.distplot to visualize each feature as demoed below.

In the heatmap, it is evident that genergy and gpuls are strongly correlated, as are gdenergy and gdpuls. energy and maxenergy are almost perfectly correlated. Later in the EDA, I constructed scatterplots for these highly correlated pairs to visualize their spread for each target class.

More EDA visualizations: Distribution Plot & Scatterplots

As mentioned, I wanted to visualize the distributions. In the plot below, it is evident that genergy is right-skewed, like all other numeric features (shown in the GitHub notebook).

sns.distplot(df['genergy'], hist=True)

Most instances of genergy are in the range 0 to 200000, and the distribution then tapers off towards higher energy values. The pattern is the same for all the other numeric features, but the ranges for gdenergy and gdpuls are smaller than those of the other numeric features.

Additionally, I used a scatterplot to visualize the correlated features with a correlation coefficient of more than 0.70. I constructed the scatterplot to see how these numerical values are scattered for each target class.

plt.figure(figsize=[10, 8])
sns.scatterplot(x='genergy', y='gpuls', hue='class', data=df)

Left: Scatterplot for genergy vs gpuls; Right: Scatterplot for energy vs maxenergy

In the left figure, the seismic bumps are more concentrated at higher values of gpuls and genergy. The scatterplot on the right shows a strong linear relationship between energy and maxenergy, consistent with the descriptive stats and their distplots.
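
Rather than reading the heatmap by eye, the pairs above the 0.70 threshold can be pulled out of the correlation matrix programmatically. The following is a sketch on synthetic columns (`a`, `b`, `c` are made up, with `a` and `b` constructed to be strongly correlated), not the author's code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
base = rng.normal(size=500)
# Synthetic columns: 'a' and 'b' are built to be strongly correlated, 'c' is independent noise
toy = pd.DataFrame({
    'a': base,
    'b': base + rng.normal(scale=0.1, size=500),
    'c': rng.normal(size=500),
})

corr = toy.corr()
threshold = 0.70
cols = list(corr.columns)
# Walk the upper triangle so each pair is reported once and the diagonal is skipped
high_pairs = [
    (cols[i], cols[j], corr.iloc[i, j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > threshold
]
print(high_pairs)
```

Applied to `df[col_list_numerical].corr()`, the same loop would list the candidate pairs for scatterplots.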

Discrete Features

The ‘nbumpsX’ are discrete features that contain integers from 0 to 9. As can be seen here, ‘nbumps6’ is all zeros, the same as ‘nbumps7’ and ‘nbumps89’. Also, they are all right-skewed or positively skewed distributions.

I dropped ‘nbumps6’, ‘nbumps7’, and ‘nbumps89’ using the code below. Everything else looks fine, but there is a high possibility of any ML model treating ‘nbumpsX’ information as ordinal.

df.drop(columns=['nbumps6', 'nbumps7', 'nbumps89'], inplace=True)
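
All-zero columns like these can also be found without inspecting each one by hand. A minimal sketch on a toy frame (the values are made up; only the column names echo the dataset) that flags columns with a single distinct value:

```python
import pandas as pd

# Toy frame: 'nbumps6' mimics an all-zero column, 'nbumps' varies
toy = pd.DataFrame({
    'nbumps': [0, 1, 2, 0],
    'nbumps6': [0, 0, 0, 0],
})

# Columns with a single distinct value carry no information for a model
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
toy_reduced = toy.drop(columns=constant_cols)
print(constant_cols)
```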

I also constructed contingency tables again to see how the nbumpsX are jointly distributed with the target class using the following code:

for each_col in col_list_discrete:
    # Skip the columns that were dropped above
    if each_col not in df.columns:
        continue
    data_crosstab = pd.crosstab(df[each_col], df[label], colnames=['class'])
    print(data_crosstab)
    print('-----')

In the tables, there was no stark difference in the distributions for hazardous and non-hazardous instances. They occur proportionally throughout the nbumpsX count in all these discrete features.

Since all of these look just fine, I moved on to ‘Feature Engineering’, the next section, where I dealt with the categorical and numeric features.

Feature Engineering

I one-hot encoded the categorical variables and transformed the numeric columns for scaling the range down. I kept the discrete features the same.

One-hot Encoding for Categorical Features

I encoded each two-class categorical feature as a single column of 0s and 1s, and converted each feature with more than two classes into one binary column per category. Each one-hot encoded column indicates whether an instance belongs to that particular category.

For example, the seismoacoustic feature contains three classes, a, b and c (as in the left figure), which were expanded into binary features (as in the right figure). I used drop='first' to avoid the dummy variable trap; with this setting the first category is dropped, so a feature with k categories yields k - 1 binary columns.

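from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Assumed loop over the categorical columns; the original snippet showed only the body
for col in col_list_categorical:
    label_encoder = LabelEncoder()
    # Note: scikit-learn 1.2+ renames the sparse argument to sparse_output
    onehot_encoder = OneHotEncoder(drop='first', sparse=False)
    # Map the category letters to integers, then reshape to a single column
    encoded_array = label_encoder.fit_transform(df[col])
    encoded_array_reshaped = encoded_array.reshape(len(encoded_array), 1)
    one_hot_encoded_array = onehot_encoder.fit_transform(encoded_array_reshaped)
    num_features = one_hot_encoded_array.shape[1]
    new_enc_col_names = [col + '_enc_' + str(num) for num in range(0, num_features)]
    df_enc = pd.DataFrame(one_hot_encoded_array)
    df_enc.columns = new_enc_col_names
    # Append the encoded columns and drop the original categorical column
    df = pd.concat([df, df_enc], axis=1)
    df.drop(columns=col, inplace=True)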
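
For comparison, the same kind of encoding can be sketched with pandas.get_dummies. This is a different technique from the LabelEncoder and OneHotEncoder combination used above, shown here on a toy column only to illustrate what dropping the first category does to the column count.

```python
import pandas as pd

toy = pd.DataFrame({'seismoacoustic': ['a', 'b', 'c', 'a', 'b']})

# drop_first=True drops the first category ('a'), which becomes the implicit baseline
encoded = pd.get_dummies(toy, columns=['seismoacoustic'], drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Rows belonging to category 'a' end up with zeros in both remaining columns, which is how the dropped category stays recoverable.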

Transformation of Numeric Columns

I then transformed all numeric features using either a log transform or a z-score transform and compared the transformed distributions with the originals. The distributions now look a bit more “normalized” and the ranges are much smaller. Refer to the code and visualization below to compare.

sns.distplot(np.log(df['genergy']), hist=True)
# Results in the figure on the right

Left: Actual distribution of genergy; Right: Log-transformed distribution of genergy
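
The effect of the log transform on skewness can also be checked numerically. The sketch below uses a synthetic log-normal sample as a stand-in for a right-skewed column such as genergy (it is not the real data) and compares scipy.stats.skew before and after the transform.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic log-normal sample standing in for a right-skewed column such as genergy
raw = rng.lognormal(mean=10, sigma=1, size=5000)
logged = np.log(raw)

skew_raw = stats.skew(raw)
skew_logged = stats.skew(logged)
print(f"skewness before: {skew_raw:.2f}, after log: {skew_logged:.2f}")
```

A skewness close to zero after the transform is the numeric counterpart of the more symmetric shape in the right-hand plot.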

I basically created a dictionary to define what transformation I’d like to apply on these columns and then looped through the dictionary items like below:

import numpy as np
from scipy import stats

def shifted_log_func(df_col):
    # log(1 + x) keeps zero-valued entries finite
    return np.log(1 + df_col)

dict_num_cols_trnsfm = {'genergy': np.log,
                        'gpuls': np.log,
                        'gdenergy': stats.zscore,
                        'gdpuls': stats.zscore,
                        'energy': shifted_log_func}
for col_names, transfm_func in dict_num_cols_trnsfm.items():
    df['scaled_' + col_names] = transfm_func(df[col_names])
# Drop the original numeric columns (including maxenergy) once all transforms are done
df.drop(columns=col_list_numerical, inplace=True)
df[[col for col in df.columns if 'scaled_' in col]].describe()

Final descriptive stats for the numeric features after dropping maxenergy, since its data are similar to those of energy.

Left: The descriptive stats for scaled numeric features; Right: The descriptive stats for original numeric data

The range seems good in the left table. The right table contains some zero values for energy and maxenergy. To tackle that, I added a 1 to the ‘energy’ column (since I dropped maxenergy) and then applied log-transform.
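
The reason for adding 1 before taking the log shows up on a small made-up array containing a zero. In this sketch, np.log1p is NumPy's built-in equivalent of log(1 + x); the `energy` values are illustrative, not taken from the dataset.

```python
import numpy as np

energy = np.array([0.0, 100.0, 1000.0, 50000.0])  # made-up values containing a zero

with np.errstate(divide='ignore'):
    plain_log = np.log(energy)      # log(0) evaluates to -inf
shifted_log = np.log1p(energy)      # log(1 + x), so 0 maps to 0

print(plain_log[0], shifted_log[0])
```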

Splitting Training and Test Data

Since this is an imbalanced dataset, I used Stratified Shuffle Split to split and proportionally distribute the target classes.

from sklearn.model_selection import StratifiedShuffleSplit

stratified_split = StratifiedShuffleSplit(n_splits=1, test_size=0.20)
for train_idx, test_idx in stratified_split.split(X, y):
    # .iloc selects rows by position; this assumes X is a DataFrame and y a Series
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]
print("Training Set Target Class Distribution:")
print(y_train.value_counts()/len(y_train))
print("Test Set Target Class Distribution:")
print(y_test.value_counts()/len(y_test))

Had I not used this technique, I would have ended up with slightly more imbalanced proportions of hazardous seismic bumps, like below. With a stratified split, the machine learning algorithm learns from a fair proportion of the class labels; with a purely random split, the minority class can end up with too few examples to learn from, which brings down the model’s performance.

Left: Distribution on splitting using train_test_split with shuffle=False; Right: Distribution on splitting using train_test_split with shuffle=True

These slight differences can affect the performance of machine learning algorithms in both binary and multi-class classification problems. It is therefore advisable to use stratified splits for classification problems, especially with an imbalanced dataset, to preserve the percentage of samples for each class and avoid sampling bias.
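
The same idea is available through train_test_split via its stratify argument. The sketch below runs on synthetic labels built to match the roughly 6.5% positive rate quoted earlier (the arrays `X` and `y` here are made up, not the real features) and checks that both parts keep the overall class share.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the 6.5% positive rate quoted in the article
y = np.array([1] * 65 + [0] * 935)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# With stratify=y both parts keep (almost exactly) the overall class-1 share
print(y.mean(), y_train.mean(), y_test.mean())
```

Leaving out `stratify=y` would let the class-1 share in the test set drift with the random seed, which is the effect shown in the figures above.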

To be continued …

Thanks for visiting. I hope you enjoyed reading this blog. I will be posting the second part of this blog in the next few days.

GitHub Link to this Notebook:

My Links: Medium | LinkedIn | GitHub

Translated from: https://towardsdatascience.com/predicting-hazardous-seismic-bumps-using-supervised-classification-algorithms-part-i-2c5d21f379bc