

ML/FE: The Five Common Dataset-Splitting Methods in Feature Engineering (Special Data Splits, e.g. Time-Series Splitting), Explained with Code

Published: 2025/3/21


Contents

Special-type data splitting

5.1 Time-series data splitting: TimeSeriesSplit

Special-type data splitting

5.1 Time-series data splitting: TimeSeriesSplit

class TimeSeriesSplit, found at: sklearn.model_selection._split

class TimeSeriesSplit(_BaseKFold):
    """Time Series cross-validator

    .. versionadded:: 0.18

    Provides train/test indices to split time series data samples that are observed at fixed time intervals, in train/test sets. In each split, test indices must be higher than before, and thus shuffling in cross validator is inappropriate. This cross-validation object is a variation of :class:`KFold`. In the kth split, it returns first k folds as train set and the (k+1)th fold as test set.

    Note that unlike standard cross-validation methods, successive training sets are supersets of those that come before them.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5. Number of splits. Must be at least 2. (Changed in version 0.22: the ``n_splits`` default value changed from 3 to 5.)

    max_train_size : int, default=None. Maximum size for a single training set.


Provides train/test indices to split time-series data samples that are observed at fixed time intervals into train/test sets. In each split, the test indices must be higher than before, so shuffling in the cross-validator is inappropriate. This cross-validation object is a variation of KFold: in the kth split, it returns the first k folds as the training set and the (k+1)th fold as the test set.

Note that, unlike standard cross-validation methods, successive training sets are supersets of the ones that come before them.

See the User Guide <cross_validation> for more information.

Parameters
----------
n_splits : int, default=5. Number of splits; must be at least 2. (Changed in version 0.22: the default value of ``n_splits`` changed from 3 to 5.)

max_train_size : int, default=None. Maximum size of a single training set.
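The two parameters just described can be seen in action with a short sketch (toy data assumed here): with ``max_train_size=2``, every training window is capped at the two most recent samples instead of growing into a superset.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 6 samples observed at fixed time intervals (toy data)
X = np.arange(12).reshape(6, 2)

# Cap each training set at the 2 most recent samples
tscv = TimeSeriesSplit(n_splits=3, max_train_size=2)
for train_idx, test_idx in tscv.split(X):
    print("TRAIN:", train_idx, "TEST:", test_idx)
# TRAIN: [1 2] TEST: [3]
# TRAIN: [2 3] TEST: [4]
# TRAIN: [3 4] TEST: [5]
```

Without ``max_train_size``, the training windows would have been [0..2], [0..3], and [0..4]; the cap turns the expanding window into a rolling one.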

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import TimeSeriesSplit
    >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
    >>> y = np.array([1, 2, 3, 4, 5, 6])
    >>> tscv = TimeSeriesSplit()
    >>> print(tscv)
    TimeSeriesSplit(max_train_size=None, n_splits=5)
    >>> for train_index, test_index in tscv.split(X):
    ...     print("TRAIN:", train_index, "TEST:", test_index)
    ...     X_train, X_test = X[train_index], X[test_index]
    ...     y_train, y_test = y[train_index], y[test_index]
    TRAIN: [0] TEST: [1]
    TRAIN: [0 1] TEST: [2]
    TRAIN: [0 1 2] TEST: [3]
    TRAIN: [0 1 2 3] TEST: [4]
    TRAIN: [0 1 2 3 4] TEST: [5]

    Notes
    -----
    The training set has size ``i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1)`` in the ``i``th split, with a test set of size ``n_samples // (n_splits + 1)``, where ``n_samples`` is the number of samples.
    """

    @_deprecate_positional_args
    def __init__(self, n_splits=5, *, max_train_size=None):
        super().__init__(n_splits, shuffle=False, random_state=None)
        self.max_train_size = max_train_size

    def split(self, X, y=None, groups=None):
        """Generate indices to split data into training and test set.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features). Training data, where n_samples is the number of samples and n_features is the number of features.

        y : array-like of shape (n_samples,). Always ignored, exists for compatibility.

        groups : array-like of shape (n_samples,). Always ignored, exists for compatibility.

        Yields
        ------
        train : ndarray. The training set indices for that split.

        test : ndarray. The testing set indices for that split.
        """
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        n_folds = n_splits + 1
        if n_folds > n_samples:
            raise ValueError(
                ("Cannot have number of folds ={0} greater than the "
                 "number of samples: {1}.").format(n_folds, n_samples))
        indices = np.arange(n_samples)
        test_size = n_samples // n_folds
        test_starts = range(test_size + n_samples % n_folds, n_samples,
                            test_size)
        for test_start in test_starts:
            if self.max_train_size and self.max_train_size < test_start:
                yield (indices[test_start - self.max_train_size:test_start],
                       indices[test_start:test_start + test_size])
            else:
                yield indices[:test_start], indices[test_start:test_start + test_size]
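As a sanity check on the size formula from the Notes, here is a small sketch with an assumed ``n_samples=10`` and the default ``n_splits=5``. The formula is read with grouping ``i * (n_samples // (n_splits + 1)) + n_samples % (n_splits + 1)``, which matches how `split` computes `test_starts` above: the base fold size times `i`, plus the leftover samples that are folded into training.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples, n_splits = 10, 5
X = np.zeros((n_samples, 1))

fold = n_samples // (n_splits + 1)      # base fold size: 1
remainder = n_samples % (n_splits + 1)  # leftover samples absorbed by training: 4

for i, (train, test) in enumerate(TimeSeriesSplit(n_splits=n_splits).split(X), start=1):
    # i-th training set: i base folds plus the remainder; test set: one base fold
    assert len(train) == i * fold + remainder
    assert len(test) == fold
    print(i, len(train), len(test))
# 1 5 1
# 2 6 1
# 3 7 1
# 4 8 1
# 5 9 1
```

When ``n_samples`` is not divisible by ``n_splits + 1``, the remainder always goes to the earliest training set, so even the first split has a reasonably sized training window.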

