

ML/FE: The Five Common Dataset-Splitting Methods in Feature Engineering (Special Data Splits, e.g. Time-Series Splitting), Explained with Code

Published: 2025/3/21


Contents

Special-type data splitting

5.1 Time-series data splitting: TimeSeriesSplit

Special-type data splitting

5.1 Time-series data splitting: TimeSeriesSplit

class TimeSeriesSplit, found at: sklearn.model_selection._split

class TimeSeriesSplit(_BaseKFold):
    """Time Series cross-validator

    .. versionadded:: 0.18

    Provides train/test indices to split time series data samples that are observed at fixed time intervals, in train/test sets. In each split, test indices must be higher than before, and thus shuffling in cross validator is inappropriate. This cross-validation object is a variation of :class:`KFold`. In the kth split, it returns first k folds as train set and the (k+1)th fold as test set.

    Note that unlike standard cross-validation methods, successive training sets are supersets of those that come before them.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5. Number of splits. Must be at least 2. (Changed in version 0.22: the ``n_splits`` default value changed from 3 to 5.)

    max_train_size : int, default=None. Maximum size for a single training set.


Provides train/test indices to split time-series data samples that are observed at fixed time intervals into train/test sets. In each split, the test indices must be higher than before, so shuffling in the cross-validator is inappropriate. This cross-validation object is a variation of KFold: in the kth split, it returns the first k folds as the training set and the (k+1)th fold as the test set.

Note that, unlike standard cross-validation methods, successive training sets are supersets of the ones that come before them.

See the User Guide <cross_validation> for more information.

Parameters
----------
n_splits : int, default=5. Number of splits; must be at least 2. (Changed in version 0.22: the default value of ``n_splits`` changed from 3 to 5.)

max_train_size : int, default=None. Maximum size of a single training set.
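The two parameters just described can be seen in action with a short sketch (toy data assumed here): with ``max_train_size=2``, every training window is capped at the two most recent samples instead of growing into a superset.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 6 samples observed at fixed time intervals (toy data)
X = np.arange(12).reshape(6, 2)

# Cap each training set at the 2 most recent samples
tscv = TimeSeriesSplit(n_splits=3, max_train_size=2)
for train_idx, test_idx in tscv.split(X):
    print("TRAIN:", train_idx, "TEST:", test_idx)
# TRAIN: [1 2] TEST: [3]
# TRAIN: [2 3] TEST: [4]
# TRAIN: [3 4] TEST: [5]
```

Without ``max_train_size``, the training windows would have been [0..2], [0..3], and [0..4]; the cap turns the expanding window into a rolling one.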

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import TimeSeriesSplit
    >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
    >>> y = np.array([1, 2, 3, 4, 5, 6])
    >>> tscv = TimeSeriesSplit()
    >>> print(tscv)
    TimeSeriesSplit(max_train_size=None, n_splits=5)
    >>> for train_index, test_index in tscv.split(X):
    ...     print("TRAIN:", train_index, "TEST:", test_index)
    ...     X_train, X_test = X[train_index], X[test_index]
    ...     y_train, y_test = y[train_index], y[test_index]
    TRAIN: [0] TEST: [1]
    TRAIN: [0 1] TEST: [2]
    TRAIN: [0 1 2] TEST: [3]
    TRAIN: [0 1 2 3] TEST: [4]
    TRAIN: [0 1 2 3 4] TEST: [5]

    Notes
    -----
    The training set has size ``i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1)`` in the ``i``th split, with a test set of size ``n_samples // (n_splits + 1)``, where ``n_samples`` is the number of samples.
    """

    @_deprecate_positional_args
    def __init__(self, n_splits=5, *, max_train_size=None):
        super().__init__(n_splits, shuffle=False, random_state=None)
        self.max_train_size = max_train_size

    def split(self, X, y=None, groups=None):
        """Generate indices to split data into training and test set.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features). Training data, where n_samples is the number of samples and n_features is the number of features.

        y : array-like of shape (n_samples,). Always ignored, exists for compatibility.

        groups : array-like of shape (n_samples,). Always ignored, exists for compatibility.

        Yields
        ------
        train : ndarray. The training set indices for that split.

        test : ndarray. The testing set indices for that split.
        """
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        n_folds = n_splits + 1
        if n_folds > n_samples:
            raise ValueError(
                ("Cannot have number of folds ={0} greater than the "
                 "number of samples: {1}.").format(n_folds, n_samples))
        indices = np.arange(n_samples)
        test_size = n_samples // n_folds
        test_starts = range(test_size + n_samples % n_folds, n_samples,
                            test_size)
        for test_start in test_starts:
            if self.max_train_size and self.max_train_size < test_start:
                yield (indices[test_start - self.max_train_size:test_start],
                       indices[test_start:test_start + test_size])
            else:
                yield indices[:test_start], indices[test_start:test_start + test_size]
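As a sanity check on the size formula from the Notes, here is a small sketch with an assumed ``n_samples=10`` and the default ``n_splits=5``. The formula is read with grouping ``i * (n_samples // (n_splits + 1)) + n_samples % (n_splits + 1)``, which matches how `split` computes `test_starts` above: the base fold size times `i`, plus the leftover samples that are folded into training.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples, n_splits = 10, 5
X = np.zeros((n_samples, 1))

fold = n_samples // (n_splits + 1)      # base fold size: 1
remainder = n_samples % (n_splits + 1)  # leftover samples absorbed by training: 4

for i, (train, test) in enumerate(TimeSeriesSplit(n_splits=n_splits).split(X), start=1):
    # i-th training set: i base folds plus the remainder; test set: one base fold
    assert len(train) == i * fold + remainder
    assert len(test) == fold
    print(i, len(train), len(test))
# 1 5 1
# 2 6 1
# 3 7 1
# 4 8 1
# 5 9 1
```

When ``n_samples`` is not divisible by ``n_splits + 1``, the remainder always goes to the earliest training set, so even the first split has a reasonably sized training window.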

