Handling Imbalanced Data in Machine Learning: Implementing the SMOTE Algorithm

Published: 2023/11/28

20201125

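When the majority and minority classes differ too much in size, the minority class does not have to be oversampled all the way up to the majority count.

Otherwise, the best approach is to oversample every class up to the record count of the largest class.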

Tuning
SMOTE: oversampling only
SMOTEENN: oversampling combined with undersampling

To tune the ENN parameters, first import ENN at the top of the script.

https://blog.csdn.net/Li_yi_chao/article/details/94630920
Borderline-SMOTE procedure
https://blog.csdn.net/weixin_37801695/article/details/86243998

https://www.cnblogs.com/massquantity/p/9382710.html

https://www.cnblogs.com/Determined22/p/5772538.html
Detailed SMOTE procedure
Key reference
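The core step of the SMOTE procedure described in the posts linked above can be sketched in a few lines of NumPy: pick a minority sample, pick one of its k nearest minority neighbors, and place a new point at a random position on the segment between them. The function name `smote_sketch` and the toy data are hypothetical, and this is an illustration of the idea rather than the imbalanced-learn implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority samples X_min (requires k < len(X_min))."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbors because each point is returned as its own nearest neighbor
    # (assumes no exact duplicate rows in X_min)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neigh_idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]

    base = rng.integers(0, len(X_min), size=n_new)            # random base samples
    picked = neigh_idx[base, rng.integers(0, k, size=n_new)]  # one random neighbor per base
    gap = rng.random((n_new, 1))                              # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy minority set, made up for illustration
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_sketch(X_min, n_new=30, k=5)
print(X_new.shape)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points never leave the bounding box of the original minority data.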

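Borderline-SMOTE
This algorithm first divides all minority-class samples into three categories:

"noise": all of the k nearest neighbors belong to the majority class
"danger": more than half of the k nearest neighbors belong to the majority class
"safe": more than half of the k nearest neighbors belong to the minority class

Are these k nearest neighbors computed against all of the samples?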
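On the question above: as far as I can tell from the original Borderline-SMOTE description, the neighbors used for this three-way split are searched among all training samples (both classes), while the neighbors used later for interpolation come from the minority class. The sketch below implements only the split; the function name `classify_minority` and the one-dimensional toy data are hypothetical, and the exact-half case is assigned to "danger", which is my reading of the paper's convention.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def classify_minority(X, y, minority_label, m=5):
    """Label each minority sample as 'noise', 'danger', or 'safe'."""
    # Neighbors are searched in the WHOLE dataset; m + 1 because each
    # point is returned as its own nearest neighbor (assumes no duplicates)
    nn = NearestNeighbors(n_neighbors=m + 1).fit(X)
    X_min = X[y == minority_label]
    neigh_idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    n_majority = (y[neigh_idx] != minority_label).sum(axis=1)
    return np.where(n_majority == m, "noise",
                    np.where(n_majority >= m / 2, "danger", "safe"))


# Toy 1-D data: label 0 is the majority class, label 1 the minority class
X = np.array([[0.0], [0.1], [0.2], [0.3], [0.4], [4.8], [4.9],    # majority
              [0.25],                                             # minority inside the majority cluster
              [5.0], [5.1],                                       # minority next to two majority points
              [10.0], [10.1], [10.2], [10.3]])                    # minority cluster on its own
y = np.array([0] * 7 + [1] * 7)

labels = classify_minority(X, y, minority_label=1, m=3)
print(labels.tolist())
```

Only the "danger" samples are used as seeds for new synthetic points; "noise" samples are skipped and "safe" samples are left alone.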

https://blog.csdn.net/a358463121/article/details/52304670
Key reference

https://juejin.im/post/5e181578f265da3e1e0567c6#heading-27
Key reference, with source code

https://juejin.im/post/5e181578f265da3e1e0567c6

https://www.cnblogs.com/43726581Gavin/archive/2018/05/16/9043993.html

https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.combine.SMOTEENN.html
API reference

sampling_type：
The type of sampling. Can be either 'over-sampling',
'under-sampling', or 'clean-sampling'.

sampling_strategy

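imblearn.over_sampling.SMOTE(
    sampling_strategy='auto',
    random_state=None,      ## random seed
    k_neighbors=5,          ## a new minority sample is generated from one of the 5 nearest neighbors
    m_neighbors=10,         ## used when kind is 'borderline1', 'borderline2', or 'svm'
    out_step=0.5,           ## used when kind='svm'
    kind='regular',         ## 'regular': base minority samples are picked at random
                            ## 'borderline1': the random neighbor b belongs to the same (minority) class as sample a
                            ## 'borderline2': the random neighbor b may belong to any class
                            ## 'svm': an SVM classifier finds the support vectors, and new minority samples are generated from them
    svm_estimator=SVC(),    ## the SVM classifier to use
    n_jobs=1,               ## number of parallel jobs; -1 uses all CPUs
    ratio=None
)

Note: this signature comes from an older imbalanced-learn release. In recent releases, kind, m_neighbors, out_step, svm_estimator, and ratio are no longer arguments of SMOTE: the borderline and SVM variants are separate classes (BorderlineSMOTE and SVMSMOTE), and ratio was replaced by sampling_strategy.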

https://blog.csdn.net/yeziyezi1986/article/details/103202012

https://zhuanlan.zhihu.com/p/81857985

There is a lot of discussion online about handling imbalanced data. Broadly speaking, there are three approaches: undersampling, oversampling, and adjusting class weights.
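The third approach, adjusting weights, leaves the data untouched and instead makes errors on the rare class cost more during training. A minimal scikit-learn sketch, assuming a made-up 90/10 dataset and logistic regression as the classifier (neither comes from the original notes):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic, illustrative data: 90 samples of class 0, 10 samples of class 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(90, 2)),
               rng.normal(2.0, 1.0, size=(10, 2))])
y = np.array([0] * 90 + [1] * 10)

# 'balanced' weights are n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))

# Most scikit-learn classifiers accept the same setting directly
clf = LogisticRegression(class_weight="balanced").fit(X, y)
pred = clf.predict(X)
```

Here the minority class gets weight 100 / (2 * 10) = 5.0 and the majority class 100 / (2 * 90), about 0.56, so each class contributes equally to the loss overall.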

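Today's topic is SMOTE, one of the oversampling algorithms. There is a Python package, imbalanced-learn, built specifically for handling imbalanced data; it is available at https://pypi.python.org/pypi/imbalanced-learn#id27
Once it is installed following the installation instructions, it is ready to use. Here is a simple example: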

from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

# Generate an imbalanced two-class dataset (10% class 0, 90% class 1)
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=100, random_state=10)
print(y)
print(y.shape)

# SMOTE oversampling followed by ENN cleaning
sm = SMOTEENN()
# fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide
X_resampled, y_resampled = sm.fit_resample(X, y)
print(y_resampled)
print(y_resampled.shape)

Output:

[1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1]
(100,)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
(177,)

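As the output shows, the resampled set has 177 samples, 77 more than the original 100. Counting the printed labels, class 0 grew from 10 to 89 samples, while class 1 dropped from 90 to 88 because the ENN step also removes samples.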

![Image](https://img-blog.csdnimg.cn/20200409090434964.jpg)
