ML Programming Hacks That Every Data Engineer Should Know, Part 2

Published: 2023/12/20

DS IN THE REAL WORLD

This post is a continuation of the one mentioned below.

In the post above, I presented some important programming takeaways to keep in mind when doing machine learning work, to make your implementation faster and more effective. Here we look at more of these hacks. Let us begin.

11. Manipulating Wide & Long DataFrames:

To convert wide data to long, use pandas.melt(); to convert long data to wide, use pandas.pivot_table(). Between them, these two functions cover most wide/long reshaping. Note that pivot_table() aggregates duplicate index/column pairs (by mean, by default).

a. Wide to Long (Melt)

>>> import pandas as pd
>>> # create wide dataframe
>>> df_wide = pd.DataFrame(
...     {"student": ["Andy", "Bernie", "Cindy", "Deb"],
...      "school": ["Z", "Y", "Z", "Y"],
...      "english": [66, 98, 61, 67],  # eng grades
...      "math": [87, 48, 88, 47],     # math grades
...      "physics": [50, 30, 59, 54]   # physics grades
...     }
... )
>>> df_wide
  student school  english  math  physics
0    Andy      Z       66    87       50
1  Bernie      Y       98    48       30
2   Cindy      Z       61    88       59
3     Deb      Y       67    47       54
>>> df_wide.melt(id_vars=["student", "school"],
...              var_name="subject",   # rename
...              value_name="score")   # rename
   student school  subject  score
0     Andy      Z  english     66
1   Bernie      Y  english     98
2    Cindy      Z  english     61
3      Deb      Y  english     67
4     Andy      Z     math     87
5   Bernie      Y     math     48
6    Cindy      Z     math     88
7      Deb      Y     math     47
8     Andy      Z  physics     50
9   Bernie      Y  physics     30
10   Cindy      Z  physics     59
11     Deb      Y  physics     54

b. Long to Wide (Pivot Table)

>>> import pandas as pd
>>> # create long dataframe
>>> df_long = pd.DataFrame({
...     "student":
...         ["Andy", "Bernie", "Cindy", "Deb",
...          "Andy", "Bernie", "Cindy", "Deb",
...          "Andy", "Bernie", "Cindy", "Deb"],
...     "school":
...         ["Z", "Y", "Z", "Y",
...          "Z", "Y", "Z", "Y",
...          "Z", "Y", "Z", "Y"],
...     "class":
...         ["english", "english", "english", "english",
...          "math", "math", "math", "math",
...          "physics", "physics", "physics", "physics"],
...     "grade":
...         [66, 98, 61, 67,
...          87, 48, 88, 47,
...          50, 30, 59, 54]
... })
>>> df_long
   student school    class  grade
0     Andy      Z  english     66
1   Bernie      Y  english     98
2    Cindy      Z  english     61
3      Deb      Y  english     67
4     Andy      Z     math     87
5   Bernie      Y     math     48
6    Cindy      Z     math     88
7      Deb      Y     math     47
8     Andy      Z  physics     50
9   Bernie      Y  physics     30
10   Cindy      Z  physics     59
11     Deb      Y  physics     54
>>> df_long.pivot_table(index=["student", "school"],
...                     columns='class',
...                     values='grade')
class           english  math  physics
student school
Andy    Z            66    87       50
Bernie  Y            98    48       30
Cindy   Z            61    88       59
Deb     Y            67    47       54
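
The two reshaping calls can also be chained into a round trip, which is a quick way to check that nothing was lost along the way. The following is a minimal sketch on the same toy grades; the names df_back, "subject" and "score" are illustrative choices, and reset_index() is used here only to flatten the MultiIndex that pivot_table() produces.

```python
import pandas as pd

# Same toy data as the wide example above.
df_wide = pd.DataFrame({
    "student": ["Andy", "Bernie", "Cindy", "Deb"],
    "school": ["Z", "Y", "Z", "Y"],
    "english": [66, 98, 61, 67],
    "math": [87, 48, 88, 47],
    "physics": [50, 30, 59, 54],
})

# Wide -> long: one row per (student, school, subject).
df_long = df_wide.melt(id_vars=["student", "school"],
                       var_name="subject", value_name="score")

# Long -> wide: pivot_table aggregates duplicates (mean by default);
# reset_index() turns the (student, school) MultiIndex back into columns.
df_back = (df_long
           .pivot_table(index=["student", "school"],
                        columns="subject", values="score")
           .reset_index())
df_back.columns.name = None  # drop the leftover "subject" axis label

print(df_back)
```

Because each (student, school, subject) combination appears exactly once here, the default mean aggregation leaves the scores unchanged.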

12. Cross Tabulation:

When you need to summarise data, cross tabulation aggregates two or more factors and computes a frequency table of their values. It can be implemented with the pandas.crosstab() function, whose 'normalize' parameter returns proportions instead of raw counts.

>>> import numpy as np
>>> import pandas as pd
>>> p = np.array(["s1", "s1", "s1", "s1", "b1", "b1",
...               "b1", "b1", "s1", "s1", "s1"], dtype=object)
>>> q = np.array(["one", "one", "one", "two", "one", "one",
...               "one", "two", "two", "two", "one"], dtype=object)
>>> r = np.array(["x", "x", "y", "x", "x", "y",
...               "y", "x", "y", "y", "y"], dtype=object)
>>> pd.crosstab(p, [q, r], rownames=['p'], colnames=['q', 'r'])
q  one    two
r    x  y   x  y
p
b1   1  2   1  0
s1   2  2   1  2
>>> # get normalized output values
>>> pd.crosstab(p, [q, r], rownames=['p'], colnames=['q', 'r'], normalize=True)
q        one                 two
r          x         y         x         y
p
b1  0.090909  0.181818  0.090909  0.000000
s1  0.181818  0.181818  0.090909  0.181818
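
normalize=True divides every cell by the grand total. When the question is "within each row, how are the values distributed?", normalize="index" may be the more useful setting. Here is a small sketch using only the p and q factors from the example; the variable names counts and row_share are illustrative.

```python
import pandas as pd

p = pd.Series(["s1", "s1", "s1", "s1", "b1", "b1",
               "b1", "b1", "s1", "s1", "s1"], name="p")
q = pd.Series(["one", "one", "one", "two", "one", "one",
               "one", "two", "two", "two", "one"], name="q")

# Raw frequency table of p against q.
counts = pd.crosstab(p, q)

# normalize="index" divides each row by its own total,
# so every row sums to 1.
row_share = pd.crosstab(p, q, normalize="index")

print(counts)
print(row_share)
```

Passing normalize="columns" does the same per column.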

13. Jupyter Themes:

jupyterthemes is a handy Python library that lets you change and control the style of the Jupyter notebook interface, where many ML practitioners work. Many programmers prefer a particular theme, such as dark mode or light mode, or custom styling, and jupyterthemes makes this possible in Jupyter notebooks.

# pip install
$ pip install jupyterthemes

# conda install
$ conda install -c conda-forge jupyterthemes

# list available themes
$ jt -l
Available Themes:
   chesterish
   grade3
   gruvboxd
   gruvboxl
   monokai
   oceans16
   onedork
   solarizedd
   solarizedl

# apply the theme
$ jt -t chesterish

# reset to the default theme (prefix with ! to run from a notebook cell)
$ jt -r

You can find more about it on GitHub: https://github.com/dunovank/jupyter-themes

14. Convert Categorical to Dummy Variable:

Using the pandas.get_dummies() function, you can directly convert the categorical features in a DataFrame to dummy variables. Pass drop_first=True to drop the first level of each feature, which is redundant.

>>> import pandas as pd
>>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
...                    'C': [1, 2, 3]})
>>> df
   A  B  C
0  a  b  1
1  b  a  2
2  a  c  3
>>> pd.get_dummies(df[['A','B']])
   A_a  A_b  B_a  B_b  B_c
0    1    0    0    1    0
1    0    1    1    0    0
2    1    0    0    0    1
>>> dummy = pd.get_dummies(df[['A','B']], drop_first=True)
>>> dummy
   A_b  B_b  B_c
0    0    1    0
1    1    0    0
2    0    0    1
>>> # concat dummy features to existing df
>>> df = pd.concat([df, dummy], axis=1)
>>> df
   A  B  C  A_b  B_b  B_c
0  a  b  1    0    1    0
1  b  a  2    1    0    0
2  a  c  3    0    0    1
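
As a possible shortcut for the encode-then-concat steps, get_dummies() also accepts the whole DataFrame together with a columns argument, in which case it replaces the listed columns with their dummies and leaves the rest alone. The sketch below also passes dtype=int, on the assumption that 0/1 output is wanted; recent pandas versions return True/False by default. The name encoded is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a"],
                   "B": ["b", "a", "c"],
                   "C": [1, 2, 3]})

# columns= restricts the encoding to "A" and "B"; "C" passes through.
# drop_first=True drops the first level of each encoded column.
# dtype=int forces 0/1 values instead of booleans.
encoded = pd.get_dummies(df, columns=["A", "B"],
                         drop_first=True, dtype=int)

print(encoded)
```

Unlike the concat approach, the original "A" and "B" columns are not kept in the result.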

15. Convert into Numeric:

When loading a dataset into pandas, a numeric column is sometimes read as object type, and numeric operations cannot be performed on it. To convert such a column to numeric, use the pandas.to_numeric() function and assign the result back to the existing Series or DataFrame column.

>>> import pandas as pd
>>> s = pd.Series(['1.0', '2', -3, '12', 5])
>>> s
0    1.0
1      2
2     -3
3     12
4      5
dtype: object
>>> pd.to_numeric(s)
0     1.0
1     2.0
2    -3.0
3    12.0
4     5.0
dtype: float64
>>> pd.to_numeric(s, downcast='signed')
0     1
1     2
2    -3
3    12
4     5
dtype: int8
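
Real columns often contain entries that cannot be parsed at all, in which case to_numeric() raises by default. One common way to handle this is errors="coerce", which turns unparseable entries into NaN. A minimal sketch, with a made-up "n/a" placeholder standing in for dirty data:

```python
import pandas as pd

# "n/a" is a hypothetical bad entry that cannot be parsed as a number.
s = pd.Series(["1.0", "2", "n/a", "12"])

# errors="coerce" replaces unparseable values with NaN instead of raising.
clean = pd.to_numeric(s, errors="coerce")

print(clean)
print(clean.isna().sum(), "value(s) could not be converted")
```

Counting the NaN values afterwards shows how many entries were silently coerced, which is worth checking before any further analysis.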

16. Stratified Sampling/Splitting:

When splitting a dataset, we sometimes need each split to preserve the class proportions of the full data. This matters most when the classes are imbalanced. In sklearn.model_selection.train_test_split(), set the “stratify” parameter to the target labels so that each split keeps the same class ratios as the unsplit dataset.

from sklearn.model_selection import train_test_split

# X (features) and y (target labels) are assumed to be defined already
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    test_size=0.25)
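
To see the effect of stratify, it helps to compare class ratios before and after the split. The sketch below uses made-up, deliberately imbalanced labels (90 samples of class 0 and 10 of class 1); the data, test_size=0.2 and random_state=42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 10% of samples belong to class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)

# With stratify=y, both splits keep the 10% share of class 1.
print("train share of class 1:", y_train.mean())
print("test share of class 1: ", y_test.mean())
```

Without stratify, a small test set could easily end up with very few or no minority-class samples.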

17. Selecting Features By Type:

Most datasets contain both numerical and non-numerical columns. We often need to extract only the numerical or only the categorical columns, to run visualizations or custom manipulations on them. In pandas, DataFrame.select_dtypes() returns the subset of columns whose dtypes match those specified.

>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2] * 3,
...                    'b': [True, False] * 3,
...                    'c': [1.0, 2.0] * 3})
>>> df
   a      b    c
0  1   True  1.0
1  2  False  2.0
2  1   True  1.0
3  2  False  2.0
4  1   True  1.0
5  2  False  2.0
>>> df.select_dtypes(include='bool')
       b
0   True
1  False
2   True
3  False
4   True
5  False
>>> df.select_dtypes(include=['float64'])
     c
0  1.0
1  2.0
2  1.0
3  2.0
4  1.0
5  2.0
>>> df.select_dtypes(exclude=['int64'])
       b    c
0   True  1.0
1  False  2.0
2   True  1.0
3  False  2.0
4   True  1.0
5  False  2.0
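
For the common task of separating numerical from non-numerical columns, the generic "number" selector avoids listing each integer and float dtype by hand. A small sketch follows; the extra string column "d" and the variable names are added here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2],
                   "b": [True, False],
                   "c": [1.0, 2.0],
                   "d": ["x", "y"]})  # "d" is an illustrative text column

# "number" matches all integer and float dtypes; booleans are not included.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```

The resulting column lists can then be passed to plotting or preprocessing code that should only see one kind of feature.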

18. RandomizedSearchCV:

RandomizedSearchCV is a class from the sklearn.model_selection module that searches over random combinations of hyperparameters for a given estimator. It randomly samples a value for each hyperparameter from the lists or distributions provided, cross-validates each sampled combination, and picks the best one according to the scoring metric supplied.

>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import RandomizedSearchCV
>>> from scipy.stats import uniform
>>> iris = load_iris()
>>> logistic = LogisticRegression(solver='saga', tol=1e-2,
...                               max_iter=300, random_state=12)
>>> distributions = dict(C=uniform(loc=0, scale=4),
...                      penalty=['l2', 'l1'])
>>> clf = RandomizedSearchCV(logistic, distributions, random_state=0)
>>> search = clf.fit(iris.data, iris.target)
>>> search.best_params_
{'C': 2..., 'penalty': 'l1'}
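
The number of sampled combinations and the cross-validation setup are controlled by the n_iter and cv arguments, and the fitted search object exposes the cross-validated score alongside the best parameters. The following is a minimal sketch rather than a tuned setup: it searches only over C with a log-uniform distribution, and the values n_iter=5, cv=3 and the range 1e-2 to 1e2 are arbitrary choices.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Sample C on a log scale, a common choice for regularization strength.
param_distributions = {"C": loguniform(1e-2, 1e2)}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=5,           # number of random combinations to try
    cv=3,               # 3-fold cross-validation per combination
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated accuracy of the best candidate
```

search.cv_results_ holds the score of every sampled combination, which is useful for seeing how sensitive the model is to each hyperparameter.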

19. Magic Function (%history):

Previously run commands in a notebook can be accessed with the ‘%history’ magic function. It lists all previously executed commands, and accepts options to select specific ones, which you can check with ‘%history?’ in a Jupyter notebook.

In [1]: import math

In [2]: math.sin(2)
Out[2]: 0.9092974268256817

In [3]: math.cos(2)
Out[3]: -0.4161468365471424

In [16]: %history -n 1-3
   1: import math
   2: math.sin(2)
   3: math.cos(2)

20. Underscore Shortcuts (_):

In the interactive Python interpreter, the underscore variable _ holds the last output, so print(_) prints it. This might not seem that helpful, but IPython (Jupyter notebook) extends the feature: _, __ and ___ hold the last, second-to-last and third-to-last outputs. For example, print(__) with two underscores gives the second-to-last output, skipping any commands that produced no output.

In addition, an underscore followed by a cell number (for example, _2) returns the output of that cell.

In [1]: import math

In [2]: math.sin(2)
Out[2]: 0.9092974268256817

In [3]: math.cos(2)
Out[3]: -0.4161468365471424

In [4]: print(_)
-0.4161468365471424

In [5]: print(__)
0.9092974268256817

In [6]: _2
Out[6]: 0.9092974268256817

That’s all for now. I will present more of these important hacks/functions that every data engineer should know in the next few parts.

Stay tuned.

Photo by Howie R on Unsplash

Thanks for reading. You can find my other Machine Learning related posts here.

I hope this post has been useful. I appreciate feedback and constructive criticism. If you want to talk about this article or other related topics, you can drop me a message here or on LinkedIn.

Translated from: https://towardsdatascience.com/ml-programming-hacks-that-every-data-engineer-should-know-part-2-61c0df0f215c