Stacking

Published: 2024/6/21

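The basic idea of ensemble learning is to combine several base learners into a model that generalizes better than any single one of them.
There are many combination strategies, including voting, averaging, and stacking. In stacking, a learner (the meta-learner) is chosen to do the combining: it is trained on the outputs of the base learners.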

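Since the goal is to combine the strengths of several base learners, the base learners should ideally be "accurate but diverse". The meta-learner is usually a fairly simple model (logistic regression, for example) to guard against overfitting.
The simplest idea is to train the base learners on the full training set and use their predictions on that same set as the training set for the meta-learner, which yields the complete model.
The problem is that the base learners end up fitting the training set very well. If their predictions on that set become the second-level training set, the meta-learner will also look very good on it, but the model will not necessarily generalize well, so there is a risk of overfitting.
K-fold cross-validation is therefore used: the second-level training set is built from predictions on samples that the base learner did not see during training.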

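Concretely:
In the training phase (assume a 400×10 training set), each base learner (assume there are 3) goes through 5 rounds of training and validation, one per fold, which produces a 400×1 column of validation predictions. The final second-level training set is therefore 400×3, with the original class labels kept as the labels. This data is used to train the second-level learner. Afterwards, all base learners are retrained on the full training set (optional; it improves the base learners' performance).
In the testing phase, if the last training step was performed, each base learner produces one result directly, which gives 3 test results that are fed into the second-level learner to obtain the final prediction. If the last step was skipped, each base learner consists of 5 small models; the test sample is run through all 5 and the 5 results are averaged to give that base learner's prediction, which again gives 3 test results.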
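
The training-phase procedure above can be sketched in plain Java. This is only an illustration under stated assumptions: the `Learner` interface, the toy `MeanLearner`, and the `i % folds` fold assignment are all hypothetical stand-ins and are not part of Weka.

```java
import java.util.Arrays;

/** Hypothetical sketch of out-of-fold meta-feature generation; not Weka's code. */
final class OutOfFoldSketch {

    /** Minimal stand-in for a base learner (an assumption for this sketch). */
    interface Learner {
        void fit(double[][] x, double[] y);
        double predict(double[] row);
    }

    /** Toy learner: predicts the mean training target plus a fixed offset. */
    static final class MeanLearner implements Learner {
        private final double offset;
        private double mean;

        MeanLearner(double offset) { this.offset = offset; }

        public void fit(double[][] x, double[] y) {
            mean = Arrays.stream(y).average().orElse(0.0);
        }

        public double predict(double[] row) { return mean + offset; }
    }

    /**
     * Returns an n x m matrix whose column k holds learner k's predictions,
     * each made by a model that did not see that row while fitting.
     */
    static double[][] outOfFold(double[][] x, double[] y, Learner[] learners, int folds) {
        int n = x.length;
        double[][] meta = new double[n][learners.length];
        for (int f = 0; f < folds; f++) {
            // Rows with i % folds == f are held out; all other rows are used for fitting.
            int trainSize = 0;
            for (int i = 0; i < n; i++) {
                if (i % folds != f) trainSize++;
            }
            double[][] trainX = new double[trainSize][];
            double[] trainY = new double[trainSize];
            for (int i = 0, t = 0; i < n; i++) {
                if (i % folds != f) {
                    trainX[t] = x[i];
                    trainY[t] = y[i];
                    t++;
                }
            }
            for (int k = 0; k < learners.length; k++) {
                learners[k].fit(trainX, trainY);
                // Predict only the held-out rows of this fold.
                for (int i = f; i < n; i += folds) {
                    meta[i][k] = learners[k].predict(x[i]);
                }
            }
        }
        return meta;
    }

    public static void main(String[] args) {
        double[][] x = new double[400][10]; // 400 samples, 10 features
        double[] y = new double[400];
        for (int i = 0; i < 400; i++) y[i] = i % 2;
        Learner[] learners = {
            new MeanLearner(0.0), new MeanLearner(0.1), new MeanLearner(-0.1)
        };
        double[][] meta = outOfFold(x, y, learners, 5);
        System.out.println(meta.length + " x " + meta[0].length); // prints "400 x 3"
    }
}
```

The 400×3 matrix, paired with the original labels `y`, is what the second-level learner would be trained on.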

The implementation in Weka:

/**
   * Buildclassifier selects a classifier from the set of classifiers
   * by minimising error on the training data.
   *
   * @param data the training data to be used for generating the
   * boosted classifier.
   * @throws Exception if the classifier could not be built successfully
   */
  // Build the whole model
  public void buildClassifier(Instances data) throws Exception {
    if (m_MetaClassifier == null) {
      throw new IllegalArgumentException("No meta classifier has been set");
    }
    // Check that the classifier is able to handle this dataset
    getCapabilities().testWithFail(data);
    // Work on a copy and remove instances whose class label is missing
    Instances newData = new Instances(data);
    m_BaseFormat = new Instances(data, 0);
    newData.deleteWithMissingClass();
    
    Random random = new Random(m_Seed);
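    newData.randomize(random); // shuffle the whole dataset
    // For a classification problem, stratify the data:
    // instances are grouped by class label and then re-drawn with a stride of m_NumFolds, so the training and
    // validation splits keep the same class distribution and the split itself introduces no extra bias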
    if (newData.classAttribute().isNominal()) {
      newData.stratify(m_NumFolds);
    }

    // Transform the original data into meta-level data and build the meta classifier
    generateMetaLevel(newData, random);
  
    // restart the executor pool because at the end of processing
    // a set of classifiers it gets shutdown to prevent the program
    // executing as a server
    // Creates the thread pool in preparation for training the base learners below
    super.buildClassifier(newData);

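    // Train the base learners on the full training set so that they are more accurate on test data
    // (To save time, one could skip this step and, at test time, average the predictions of the per-fold base models instead)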
    // Rebuild all the base classifiers on the full training data
    buildClassifiers(newData);
  }

/**
   * Generates the meta data
   * 
   * @param newData the data to work on
   * @param random the random number generator to use for cross-validation
   * @throws Exception if generation fails
   */
  protected void generateMetaLevel(Instances newData, Random random) 
    throws Exception {
    // First derive the meta-data format m_MetaFormat from newData;
    // this determines the attributes the meta classifier needs
    Instances metaData = metaFormat(newData);
    m_MetaFormat = new Instances(metaData, 0);

    for (int j = 0; j < m_NumFolds; j++) {
      // Get the training split for fold j
      Instances train = newData.trainCV(m_NumFolds, j, random);
      
      // start the executor pool (if necessary)
      // has to be done after each set of classifiers as the
      // executor pool gets shut down in order to prevent the
      // program executing as a server (and not returning to
      // the command prompt when run from the command line
      // Thread pool: the base learners are built in parallel by multiple threads
      super.buildClassifier(train);

      // Build the base learners
      buildClassifiers(train);
      
      // Classify test instances and add to meta data
      // Run the held-out training instances of fold j through the base learners and add the results to metaData as the new training set
      Instances test = newData.testCV(m_NumFolds, j);
      for (int i = 0; i < test.numInstances(); i++) {
        metaData.add(metaInstance(test.instance(i)));
      }
    }
    // Build the meta classifier from the meta data
    m_MetaClassifier.buildClassifier(metaData);    
  }

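Because the base learners are trained independently of one another, a thread pool trains them in parallel each time the cross-validation split has been made.
When the training and validation sets are split on top of stratified data, the training set extracted by trainCV() has to be shuffled again so that it stays independent and identically distributed. This is why the random number generator is passed to trainCV().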
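
The stride-based stratification described in the code comments (group by class, then re-draw with a step of the fold count) can be sketched as follows. This is a simplified illustration working on plain index arrays; the class and method names are hypothetical and the code is not Weka's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of stride-based stratification; a simplification, not Weka's code. */
final class StratifySketch {

    /**
     * Reorders row indices so that contiguous blocks of the result have
     * roughly the same class mix: group by class, then read with stride = folds.
     */
    static int[] stratifiedOrder(int[] labels, int numClasses, int folds) {
        // Step 1: collect indices class by class so that equal labels are adjacent.
        List<Integer> grouped = new ArrayList<>();
        for (int c = 0; c < numClasses; c++) {
            for (int i = 0; i < labels.length; i++) {
                if (labels[i] == c) grouped.add(i);
            }
        }
        // Step 2: walk the grouped list with stride `folds`, once per start offset.
        int[] order = new int[labels.length];
        int pos = 0;
        for (int start = 0; start < folds; start++) {
            for (int j = start; j < grouped.size(); j += folds) {
                order[pos++] = grouped.get(j);
            }
        }
        return order;
    }

    public static void main(String[] args) {
        int[] labels = {0, 0, 0, 0, 0, 0, 1, 1, 1, 1}; // 6 rows of class 0, 4 of class 1
        int[] order = stratifiedOrder(labels, 2, 2);
        // The first half of `order` is fold 0 and the second half is fold 1;
        // each fold ends up with 3 rows of class 0 and 2 rows of class 1.
        System.out.println(Arrays.toString(order)); // prints "[0, 2, 4, 6, 8, 1, 3, 5, 7, 9]"
    }
}
```

After this reordering, cutting the data into contiguous blocks gives folds with matching class proportions, which is the property the comments in buildClassifier() rely on.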

/**
   * Makes the format for the level-1 data.
   *
   * @param instances the level-0 format
   * @return the format for the meta data
   * @throws Exception if the format generation fails
   */
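  // Build the meta-data format
  protected Instances metaFormat(Instances instances) throws Exception {
    // If the class attribute of m_BaseFormat is numeric, one attribute is added per base classifier (m_Classifiers.length in total)
    // If it is nominal, each base classifier adds one attribute per level-0 class value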
    ArrayList<Attribute> attributes = new ArrayList<Attribute>();
    Instances metaFormat;

    // Iterate over the base learners
    for (int k = 0; k < m_Classifiers.length; k++) {
      Classifier classifier = (Classifier) getClassifier(k);
      String name = classifier.getClass().getName() + "-" + (k+1);
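      if (m_BaseFormat.classAttribute().isNumeric()) {
        attributes.add(new Attribute(name));
      } else {
        // For a nominal class, the meta level works with the probability of each class value. Two distinct values
        // (say "variegated" and "round-flowered") cannot be represented by one attribute, so each value gets its own attribute to hold its probability
        for (int j = 0; j < m_BaseFormat.classAttribute().numValues(); j++) {
          attributes.add(
            new Attribute(
              name + ":" + m_BaseFormat.classAttribute().value(j)));
        }
      }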
    }
    // Append the original class attribute
    attributes.add((Attribute) m_BaseFormat.classAttribute().copy());
    // Assemble the meta-data format
    metaFormat = new Instances("Meta format", attributes, 0);
    metaFormat.setClassIndex(metaFormat.numAttributes() - 1);
    return metaFormat;
  }
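
To make the resulting layout concrete, the sketch below reproduces only the attribute-naming logic of metaFormat() on plain strings. The learner names and class values are made up for illustration, and passing `null` for a numeric class is a convention of this sketch, not of Weka.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the meta-attribute layout; names and values are made up. */
final class MetaFormatSketch {

    /**
     * Lists meta-attribute names. A null classValues array stands for a numeric
     * class (one column per learner); otherwise each learner contributes one
     * column per class value. The class attribute always comes last.
     */
    static List<String> metaAttributeNames(String[] learnerNames, String[] classValues) {
        List<String> names = new ArrayList<>();
        for (int k = 0; k < learnerNames.length; k++) {
            String prefix = learnerNames[k] + "-" + (k + 1);
            if (classValues == null) {
                names.add(prefix); // numeric class: a single prediction column
            } else {
                for (String value : classValues) {
                    names.add(prefix + ":" + value); // nominal class: one probability column per value
                }
            }
        }
        names.add("class");
        return names;
    }

    public static void main(String[] args) {
        String[] learners = {"learnerA", "learnerB", "learnerC"};
        List<String> nominal = metaAttributeNames(learners, new String[] {"yes", "no"});
        List<String> numeric = metaAttributeNames(learners, null);
        System.out.println(nominal.size()); // prints "7"  (3 learners x 2 values + class)
        System.out.println(numeric.size()); // prints "4"  (3 learners + class)
        System.out.println(nominal.get(0)); // prints "learnerA-1:yes"
    }
}
```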

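When the meta-data format is generated for a classification problem, every value of the class attribute becomes a new attribute, once for each base learner.

My own reading is this: some base classifiers can output the probability of belonging to each class (logistic regression, for example). Using those probabilities as meta-attributes, instead of the base learners' hard class predictions, reduces the impact that a base learner's classification errors have on the meta-learner, and the overall model becomes more accurate: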

/**
   * Makes a level-1 instance from the given instance.
   * 
   * @param instance the instance to be transformed
   * @return the level-1 instance
   * @throws Exception if the instance generation fails
   */
  // Create a meta-level instance
  protected Instance metaInstance(Instance instance) throws Exception {

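    // values holds the base learners' outputs: a numeric prediction is stored directly; for a nominal class the distribution
    // is computed first and the probability of each class value is stored. The result is returned in the m_MetaFormat format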
    double[] values = new double[m_MetaFormat.numAttributes()];
    Instance metaInstance;
    int i = 0;
    for (int k = 0; k < m_Classifiers.length; k++) {
      Classifier classifier = getClassifier(k);
      if (m_BaseFormat.classAttribute().isNumeric()) {
        values[i++] = classifier.classifyInstance(instance);
      } else {
        // Class probability distribution predicted by this base learner for the instance; sum(dist) = 1
        double[] dist = classifier.distributionForInstance(instance);
        // Write each predicted probability into its corresponding meta-attribute
        for (int j = 0; j < dist.length; j++) {
          values[i++] = dist[j];
        }
      }
    }
    // The class value goes into the last meta-attribute
    values[i] = instance.classValue();
    metaInstance = new DenseInstance(1, values);
    metaInstance.setDataset(m_MetaFormat);
    return metaInstance;
  }
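
As a small worked example of what metaInstance() produces for a nominal class, the sketch below flattens three made-up probability distributions into one meta-level row. The numbers and the `metaValues` helper are hypothetical; only the copying order mirrors the listing above.

```java
import java.util.Arrays;

/** Hypothetical sketch of how one meta-level row is filled; the numbers are made up. */
final class MetaInstanceSketch {

    /**
     * Concatenates each base learner's class distribution in learner order
     * and appends the true class value as the last entry.
     */
    static double[] metaValues(double[][] distributions, double classValue) {
        int total = 1; // one slot for the class value
        for (double[] dist : distributions) {
            total += dist.length;
        }
        double[] values = new double[total];
        int i = 0;
        for (double[] dist : distributions) {
            for (double p : dist) {
                values[i++] = p;
            }
        }
        values[i] = classValue;
        return values;
    }

    public static void main(String[] args) {
        // Three base learners, two class values; each row sums to 1.
        double[][] distributions = {
            {0.7, 0.3},
            {0.6, 0.4},
            {0.2, 0.8}
        };
        double[] row = metaValues(distributions, 0.0); // true class has index 0
        System.out.println(Arrays.toString(row)); // prints "[0.7, 0.3, 0.6, 0.4, 0.2, 0.8, 0.0]"
    }
}
```

With hard predictions the same row would shrink to the three labels 0, 0, 1 plus the class value, which hides how confident each learner was. The probabilities keep that information for the meta-learner.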

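On real datasets the results are actually not necessarily better than those of other models. Maybe I just did not tune the parameters well (just kidding)~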