Feature Engineering: What to Keep and What to Remove

Published: 2023/11/29

The next step after exploring the patterns in the data is feature engineering. Any operation performed on the features (columns) that could help us make a prediction from the data can be termed feature engineering. At a high level, this includes the following:

adding new features

eliminating some of the features which tell the same story

combining several features together

breaking down a feature into multiple features

Adding new features

Suppose you want to predict sales of ice-cream, gloves, or umbrellas. What do these items have in common? The sales of all of them depend on “weather” and “location”. Ice-cream sells more during summer or in hotter areas, gloves sell more in colder weather (winter) or colder regions, and we definitely need an umbrella when there’s rain. So if you have historical sales data for all these items, adding the weather and the selling area to each sales record would help your model learn the patterns better.
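
As a rough illustration of the idea above, here is a minimal pandas sketch that attaches weather information to sales records. The table layout, the column names (date, city, units_sold, avg_temp_c, rained), and all the values are made up for this example; they are not from the original dataset.

```python
import pandas as pd

# Hypothetical daily sales records (all values are made up for illustration).
sales = pd.DataFrame({
    "date": pd.to_datetime(["2019-07-28", "2019-07-28", "2019-12-15"]),
    "city": ["Dallas", "Seattle", "Dallas"],
    "item": ["ice-cream", "umbrella", "gloves"],
    "units_sold": [120, 45, 60],
})

# Hypothetical weather observations keyed by the same date and city.
weather = pd.DataFrame({
    "date": pd.to_datetime(["2019-07-28", "2019-07-28", "2019-12-15"]),
    "city": ["Dallas", "Seattle", "Dallas"],
    "avg_temp_c": [35.0, 22.0, 8.0],
    "rained": [False, True, False],
})

# A left join keeps every sales row and adds the weather columns to it.
sales_with_weather = sales.merge(weather, on=["date", "city"], how="left")
print(sales_with_weather)
```

A left join is used here so that no sales row is lost when a weather record is missing; those rows would simply get empty weather values.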

Eliminating some of the features which tell the same story

For the purpose of explanation, I made up a sample dataset with data on different phone brands, something like the one below. Let us analyze this data and figure out why we should remove some columns.

Image by Author

In this dataset, if we look carefully, there is a column for the brand name, a column for the model name, and another column named Phone, which contains both the brand and the model name. We don’t need the Phone column, because its data is already present in the other columns, and split data is better than aggregated data in this case.

Another column adds no value to the dataset: Memory scale. All the memory values are in “GB”, so there is no need to keep an extra column that shows no variation, because it is not going to help our model learn different patterns.
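
The two checks described in this section can be sketched in pandas as follows. The sample table is shown only as an image in the original, so the rows below, and the Memory and Price columns, are assumptions made up for this sketch; only the Phone and Memory scale column names come from the text.

```python
import pandas as pd

# Hypothetical stand-in for the sample phone dataset (values are made up).
phones = pd.DataFrame({
    "Brand": ["Apple", "Apple", "Samsung"],
    "Model": ["iPhone 11", "iPhone 11 Pro", "Galaxy S10"],
    "Phone": ["Apple iPhone 11", "Apple iPhone 11 Pro", "Samsung Galaxy S10"],
    "Memory": [64, 256, 128],
    "Memory scale": ["GB", "GB", "GB"],
    "Price": [699, 1149, 749],
})

# "Phone" is redundant if it can be rebuilt from "Brand" and "Model".
rebuilt_phone = phones["Brand"] + " " + phones["Model"]
phone_is_redundant = bool((rebuilt_phone == phones["Phone"]).all())

# Columns with a single distinct value show no variation to learn from.
constant_columns = [col for col in phones.columns if phones[col].nunique() == 1]

columns_to_drop = constant_columns + (["Phone"] if phone_is_redundant else [])
phones_reduced = phones.drop(columns=columns_to_drop)
print(phones_reduced.columns.tolist())
```

Checking before dropping, rather than dropping by name, guards against the case where a column that looks redundant actually carries extra information in some rows.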

Combining several features to create new features

This means we can use two or three features (or rows) to create a new feature that explains the data better. For example, in the dataset above, some of the features we could create are: count of phones in each brand, % share of each phone within its brand, count of phones available in each memory size, price per unit of memory, etc. This will help the model understand the data at a granular level.
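
A minimal pandas sketch of these combined features is shown here. The rows and the "Units sold" column are made up for illustration, and "% share of each phone within its brand" is interpreted as a share of the brand's units sold, which is an assumption about what the author meant.

```python
import pandas as pd

# Hypothetical phone data (values and the "Units sold" column are made up).
phones = pd.DataFrame({
    "Brand": ["Apple", "Apple", "Samsung", "Samsung", "Samsung"],
    "Model": ["iPhone 11", "iPhone 11 Pro", "Galaxy S10", "Galaxy S10+", "Galaxy Note 10"],
    "Memory": [64, 256, 128, 128, 256],
    "Price": [699, 1149, 749, 849, 949],
    "Units sold": [500, 300, 400, 250, 150],
})

# Count of phones in each brand, repeated on every row of that brand.
phones["phones_in_brand"] = phones.groupby("Brand")["Model"].transform("count")

# Share of each phone in its brand's total units sold, in percent.
brand_units = phones.groupby("Brand")["Units sold"].transform("sum")
phones["share_in_brand_pct"] = 100 * phones["Units sold"] / brand_units

# Count of phones available at each memory size.
phones["phones_with_same_memory"] = phones.groupby("Memory")["Model"].transform("count")

# Price per unit of memory (here: price per GB).
phones["price_per_gb"] = phones["Price"] / phones["Memory"]

print(phones)
```

Using transform instead of a plain aggregation keeps the result aligned with the original rows, so each new feature can be added directly as a column.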

Breaking down a feature into multiple features

The most common examples in this segment are Date and Address. A date mostly consists of a year, a month, and a day, let’s say in the form ‘07/28/2019’. If we break the Date column down into 2019, 7 (or July), and 28, it becomes easier to join the table to other tables and to manipulate the data, because instead of a date format we now deal with numbers, which are a lot easier to handle.
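
The date breakdown could look like this in pandas. The DataFrame and its second row are made up for the example; the month/day/year format is taken from the ‘07/28/2019’ example in the text.

```python
import pandas as pd

# Hypothetical table with a Date column stored as month/day/year text.
orders = pd.DataFrame({"Date": ["07/28/2019", "12/15/2019"]})

# Parse the text once, then pull out the individual parts.
parsed = pd.to_datetime(orders["Date"], format="%m/%d/%Y")
orders["Year"] = parsed.dt.year
orders["Month"] = parsed.dt.month
orders["Month name"] = parsed.dt.month_name()
orders["Day"] = parsed.dt.day

print(orders)
```

Passing an explicit format avoids ambiguity between month-first and day-first dates.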

For the same reasons (easier data manipulation and easier data joins), we break down the Address data (721 Main St., Apt 24, Dallas, TX-75432) into: street name (721 Main St.), apartment number/house number (Apt 24), city (Dallas), state (TX/Texas), and zip code (75432).
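
One possible way to do the address split is a regular expression with named groups, sketched below. This assumes every address follows exactly the "street, unit, city, ST-zip" layout of the example above; real address data is usually messier and may need a dedicated parser. The second row and the output column names are made up.

```python
import pandas as pd

customers = pd.DataFrame({
    "Address": [
        "721 Main St., Apt 24, Dallas, TX-75432",
        "10 Elm Ave., Unit 3, Austin, TX-73301",  # made-up second row
    ]
})

# One named group per address part; commas separate the first three parts.
address_pattern = (
    r"^(?P<street>[^,]+),\s*"
    r"(?P<unit>[^,]+),\s*"
    r"(?P<city>[^,]+),\s*"
    r"(?P<state>[A-Z]{2})-(?P<zip_code>\d{5})$"
)

# str.extract returns one column per named group (NaN where a row does not match).
address_parts = customers["Address"].str.extract(address_pattern)
customers = customers.join(address_parts)

print(customers)
```

The zip code is kept as text on purpose, so that codes with leading zeros are not altered.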

Now that we know what feature engineering is, let’s go through some of the techniques we can use to do it. There are many methods for feature engineering, but I will discuss some of the most common techniques and practices that I use on the problems I regularly work on.

Lags: this means creating columns for records from previous timestamps (sales 1 day back, sales 1 month back, etc., depending on the use case). This feature helps us see, for example, what the iPhone sales were 1 day back, 2 days back, and so on. This is important because most machine learning algorithms look at the data row-wise, and unless we have the previous days' records in the same row, the model will not be able to find patterns between current and previous records efficiently.

Count of categories: this could be anything as simple as the count of phones in each brand, the count of people buying an iPhone 11 Pro, or the count of people in different age groups buying a Samsung Galaxy vs an iPhone.

Sum / mean / median / cumulative sum / aggregate sum: of any numeric feature, such as salary, sales, profit, age, weight, etc.

Categorical transformation techniques (replacing values, one-hot encoding, label encoding, etc.): these techniques convert categorical features to numerically encoded values, because some algorithms (like xgboost) do not recognize categorical features. The right technique depends on the number of categories in each column, the number of categorical columns, etc. To learn more about the different techniques, check this blog and this blog.

Standardization/normalization techniques (min-max, standard scaler, etc.): some datasets have numerical features that are measured in different units and on different scales (kg, $, inch, sq. ft., etc.). For some machine learning methods, such as clustering, it is important to have all the numbers on one scale (we will discuss clustering more in later blogs; for now, understand it as creating groups of data points in space based on their similarity). To know more about this topic, check out these blogs: Feature Scaling Analytics Vidhya, Handling Numerical Data O'Reilly, Standard Scaler/MinMax Scaler.
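
The lag, count, and aggregate features from the list above can be sketched with pandas groupby operations. The daily sales table below, including its column names and values, is made up for this example. Note that shift(1) moves back by one row, which equals one day only if every phone has a row for every date.

```python
import pandas as pd

# Hypothetical daily sales per phone (all values are made up).
daily_sales = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-07-26", "2019-07-27", "2019-07-28",
        "2019-07-26", "2019-07-27", "2019-07-28",
        "2019-07-27", "2019-07-28",
    ]),
    "brand": ["Samsung"] * 3 + ["Apple"] * 5,
    "phone": ["Galaxy S10"] * 3 + ["iPhone 11"] * 3 + ["iPhone 11 Pro"] * 2,
    "units_sold": [7, 8, 11, 10, 12, 9, 4, 6],
})

# Lags only make sense on time-ordered rows, so sort within each phone first.
daily_sales = daily_sales.sort_values(["phone", "date"]).reset_index(drop=True)
units_by_phone = daily_sales.groupby("phone")["units_sold"]

# Lags: sales of the same phone one and two rows (days) back.
daily_sales["units_lag_1"] = units_by_phone.shift(1)
daily_sales["units_lag_2"] = units_by_phone.shift(2)

# Count of categories: number of distinct phones in each brand.
daily_sales["models_in_brand"] = daily_sales.groupby("brand")["phone"].transform("nunique")

# Aggregates: mean and running total of units sold for each phone.
daily_sales["mean_units_for_phone"] = units_by_phone.transform("mean")
daily_sales["cumulative_units"] = units_by_phone.cumsum()

print(daily_sales)
```

The first rows of each phone have no earlier record, so their lag columns are empty; how to fill or drop those rows depends on the use case.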

These are some very general methods of creating new features, but most feature engineering depends largely on brainstorming about the dataset at hand. For example, feature engineering will be done differently for a dataset of employees than for a dataset of general transactions.

We can create these columns manually using various pandas functions. Besides this, there is a package called FeatureTools, which can also be explored for creating new columns by combining datasets at different levels.
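
As one example of doing this manually in pandas, the sketch below applies the categorical encoding and scaling techniques mentioned earlier. The table and the new column names are made up for illustration, and the scaling formulas are written out by hand rather than taken from a scaler class.

```python
import pandas as pd

# Hypothetical phone data (values are made up).
phones = pd.DataFrame({
    "Brand": ["Apple", "Samsung", "Apple", "Google"],
    "Memory": [64, 128, 256, 64],
    "Price": [699.0, 749.0, 1149.0, 799.0],
})

# One-hot encoding: one 0/1 column per brand.
one_hot = pd.get_dummies(phones["Brand"], prefix="Brand", dtype=int)

# Label encoding: each brand mapped to an integer code (alphabetical order here).
phones["Brand_code"] = phones["Brand"].astype("category").cat.codes

for col in ["Memory", "Price"]:
    # Min-max scaling: rescale the column to the range [0, 1].
    col_min, col_max = phones[col].min(), phones[col].max()
    phones[f"{col}_minmax"] = (phones[col] - col_min) / (col_max - col_min)
    # Standardization: zero mean and unit (population) standard deviation.
    phones[f"{col}_standard"] = (phones[col] - phones[col].mean()) / phones[col].std(ddof=0)

encoded = pd.concat([phones, one_hot], axis=1)
print(encoded)
```

In a real project the minimum, maximum, mean, and standard deviation would normally be computed on the training data only and then reused on new data.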

Image by Author

This brings us to the end (somewhat) of the data preprocessing stages. Once we have the data preprocessed, we need to start looking into different ML techniques for our problem statement. We will be discussing those in the upcoming blogs. Hope y’all found this blog interesting and useful! :)

Translated from: https://medium.com/swlh/what-to-keep-and-what-to-remove-74ba1b3cb04