Python Data Analysis and Machine Learning (NumPy, Pandas, Matplotlib)

Published: 2024/7/5


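How should you study machine learning?

Machine learning combines mathematical derivation with practical application skills, so you need to understand both how an algorithm is derived and how to apply it.
Deep learning extends the neural-network family of machine learning algorithms and is especially effective in computer vision and natural language processing.
Take your own notes from the very beginning.

How do you get hands-on, and where do you find examples?

The best resources: GitHub and Kaggle.
Accumulating worked examples pays off, because projects are rarely written from scratch. Learn to imitate first, then create.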

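NumPy, the scientific computing library

NumPy (Numerical Python extensions) is a third-party Python package for scientific computing. Its predecessor was an array-computation library whose development began in 1995. After a long period of development it has become the foundation of most scientific computing in Python, including the deep learning frameworks that provide a Python interface.
The numpy.genfromtxt function
Loads data from a text file and handles missing values in the way you specify.

delimiter: the string used to separate values. It can be a str, an int, or a sequence. By default, any run of consecutive whitespace acts as the delimiter.
dtype: the data type of the resulting array. If None, the dtype of each column is determined from that column's contents.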

import numpy
world_alcohol = numpy.genfromtxt("world_alcohol.txt", delimiter=",", dtype=str)
print(type(world_alcohol))
print(world_alcohol)
help(numpy.genfromtxt)  # use help() to read the documentation for numpy.genfromtxt

Output:
<class 'numpy.ndarray'>  # every NumPy array is an ndarray
[['Year' 'WHO region' 'Country' 'Beverage Types' 'Display Value']
 ['1986' 'Western Pacific' 'Viet Nam' 'Wine' '0']
 ['1986' 'Americas' 'Uruguay' 'Other' '0.5']
 ...,
 ['1987' 'Africa' 'Malawi' 'Other' '0.75']
 ['1989' 'Americas' 'Bahamas' 'Wine' '1.5']
 ['1985' 'Africa' 'Malawi' 'Spirits' '0.31']]
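
The description above says genfromtxt handles missing values, but the example reads everything as strings, so that behaviour never shows. A minimal sketch, assuming a small made-up in-memory table (the numbers are illustrative, not taken from world_alcohol.txt):

```python
import io
import numpy as np

# Hypothetical two-column data; the empty field in the last row is a missing value.
raw = "1986,0.5\n1987,0.75\n1989,\n"

# With the default float dtype, a missing field becomes nan.
with_nan = np.genfromtxt(io.StringIO(raw), delimiter=",")

# filling_values replaces missing fields with a value of your choice.
with_fill = np.genfromtxt(io.StringIO(raw), delimiter=",", filling_values=0.0)

print(with_nan)
print(with_fill)
```

Reading from io.StringIO keeps the sketch self-contained; with a real file you would pass the file name instead.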

numpy.array
Creates a vector or a matrix (a multidimensional array).

import numpy as np
a = [1, 2, 4, 3]   # a plain Python list used as a vector
b = np.array(a)    # array([1, 2, 4, 3])
type(b)            # <class 'numpy.ndarray'>

Array operations, part 1

b.shape     # (4,): (rows, columns) for a matrix, or the element count for a vector
b.argmax()  # 2, the index of the maximum
b.max()     # 4, the maximum
b.min()     # 1, the minimum
b.mean()    # 2.5, the mean

NumPy requires every element of a numpy.array to have the same data type. The dtype attribute reports that type.

>>> a = [1, 2, 3, 5]
>>> b = np.array(a)
>>> b.dtype
dtype('int64')

Array operations, part 2

c = [[1, 2], [3, 4]]  # 2-D list
d = np.array(c)       # 2-D NumPy array
d.shape               # (2, 2)
d[1, 1]               # 4, indexed matrix-style by row and column
d.size                # 4, the total number of elements
d.max(axis=0)         # maximum along axis 0 (the first axis, i.e. down each column): array([3, 4])
d.max(axis=1)         # maximum along axis 1 (the last axis, i.e. across each row): array([2, 4])
d.mean(axis=0)        # mean along axis 0 (the first axis): array([2., 3.])
d.flatten()           # flatten a NumPy array to 1-D: array([1, 2, 3, 4])
np.ravel(c)           # flatten any array-like structure to 1-D: array([1, 2, 3, 4])
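
The axis argument is easier to read on a non-square array, where the shape of the result shows which axis was collapsed. A small sketch with made-up values:

```python
import numpy as np

m = np.array([[1, 5, 3],
              [4, 2, 6]])       # shape (2, 3): 2 rows, 3 columns

col_max = m.max(axis=0)         # collapses the rows: one value per column
row_max = m.max(axis=1)         # collapses the columns: one value per row
col_mean = m.mean(axis=0)

print(col_max.shape, row_max.shape)
```

A rule of thumb: the axis you name is the one that disappears from the result's shape.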

Array operations, part 3

import numpy as np
matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])
print(matrix.sum(axis=1))  # axis=1 sums each row
Output:
[ 30  75 120]

import numpy as np
matrix = np.array([
    [5, 10, 15],
    [20, 25, 30],
    [35, 40, 45]
])
print(matrix.sum(axis=0))  # axis=0 sums each column
Output:
[60 75 90]

Slicing also works on matrices

import numpy as np
vector = [1, 2, 4, 3]
print(vector[0:3])     # [1, 2, 4]: every element with index >= 0 and < 3
matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])
print(matrix[:, 1])    # [10 25 40]: all rows of the second column (index 1)
print(matrix[:, 0:2])  # all rows of the first and second columns
# [[ 5 10]
#  [20 25]
#  [35 40]]

A comparison applied to an array is applied to every element of that array

import numpy as np
matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])
print(matrix == 25)
Output:
[[False False False]
 [False  True False]
 [False False False]]

second_column_25 = matrix[:, 1] == 25
print(second_column_25)
print(matrix[second_column_25, :])  # a boolean array can itself be used as an index
Output:
[False  True False]
[[20 25 30]]

Element-wise AND and OR on arrays

import numpy as np
vector = np.array([5, 10, 15, 20])
equal_to_ten_and_five = (vector == 10) & (vector == 5)
print(equal_to_ten_and_five)
Output:
[False False False False]

import numpy as np
vector = np.array([5, 10, 15, 20])
equal_to_ten_or_five = (vector == 10) | (vector == 5)
print(equal_to_ten_or_five)
vector[equal_to_ten_or_five] = 50  # with a boolean index, only the True positions are selected
print(vector)
Output:
[ True  True False False]
[50 50 15 20]
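
Combining the two ideas above (a boolean index and the & operator) gives a compact way to filter rows on several conditions at once. A minimal sketch; the thresholds are arbitrary, and np.where is shown as one non-destructive alternative to assigning through a mask:

```python
import numpy as np

scores = np.array([[5, 10, 15],
                   [20, 25, 30],
                   [35, 40, 45]])

# Rows whose first column is above 10 AND whose last column is below 45.
row_mask = (scores[:, 0] > 10) & (scores[:, 2] < 45)
selected = scores[row_mask]

# np.where builds a new array instead of overwriting the original in place.
capped = np.where(scores > 30, 30, scores)

print(selected)
print(capped)
```

The parentheses around each comparison are required, because & binds more tightly than > and <.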

Converting the element type of an array

import numpy as np
vector = np.array(['5', '10', '15', '20'])  # numeric strings; non-numeric strings such as 'lucy' cannot be converted to float
vector = vector.astype(float)  # astype converts the type of every value in the vector
print(vector.dtype)
print(vector)
Output:
float64
[ 5. 10. 15. 20.]
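
Two details of astype are worth checking by hand: converting floats to int truncates rather than rounds, and strings that do not look like numbers raise an error. A short sketch with illustrative values:

```python
import numpy as np

text_values = np.array(["5", "10", "15", "20"])
as_float = text_values.astype(float)

# Float-to-int conversion drops the fractional part (truncation toward zero).
as_int = np.array([1.7, -1.2, 3.0]).astype(int)

# Non-numeric strings cannot be converted and raise ValueError.
try:
    np.array(["lucy", "ch"]).astype(float)
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(as_float, as_int, conversion_failed)
```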

Commonly used NumPy functions

The reshape method changes the dimensions of a matrix

import numpy as np
print(np.arange(15))
a = np.arange(15).reshape(3, 5)  # turn the vector into a matrix with 3 rows and 5 columns
print(a)
print(a.shape)  # the shape attribute gives (rows, columns)

Output:
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]
(3, 5)

Initializing a matrix with zeros or ones

>>> import numpy as np
>>> np.zeros((3, 4))  # a 3-row, 4-column matrix filled with 0
Output:
array([[ 0., 0., 0., 0.],
       [ 0., 0., 0., 0.],
       [ 0., 0., 0., 0.]])

>>> import numpy as np
>>> np.ones((3, 4), dtype=np.int32)  # request an integer element type
Output:
array([[1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]], dtype=int32)

Building sequences

np.arange(10, 30, 5)  # start at 10, stop before 30, step 5
Output:
array([10, 15, 20, 25])

np.arange(0, 2, 0.3)
Output:
array([ 0. , 0.3, 0.6, 0.9, 1.2, 1.5, 1.8])

The random module

np.random.random((2, 3))  # the random function of the random module returns a 2-row, 3-column matrix of random values in the half-open interval [0, 1)
Output:
array([[ 0.40130659, 0.45452825, 0.79776512],
       [ 0.63220592, 0.74591134, 0.64130737]])

The linspace function returns a given number of evenly spaced values between a start and an end value, both included

from numpy import pi
np.linspace(0, 2*pi, 100)
Output (100 values, middle of the array elided):
array([ 0. , 0.06346652, 0.12693304, ..., 6.15625227, 6.21971879, 6.28318531])
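
linspace and arange are easy to confuse: linspace takes the number of points and includes the end value by default, while arange takes a step and excludes the end value. A small sketch on the interval from 0 to 1:

```python
import numpy as np

closed = np.linspace(0, 1, 5)                     # 5 points, end value included
half_open = np.linspace(0, 1, 5, endpoint=False)  # 5 points, end value excluded
stepped = np.arange(0, 1, 0.25)                   # step of 0.25, end value excluded

print(closed)
print(half_open)
print(stepped)
```

When the step is not an integer, linspace is usually the safer choice, because the number of elements arange returns can be affected by floating-point rounding.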

Arithmetic on arrays operates on whole arrays, element by element

import numpy as np
a = np.array([20, 30, 40, 50])
b = np.arange(4)  # [0 1 2 3]
c = a - b
print(c)       # [20 29 38 47]
print(b**2)    # [0 1 4 9]
print(a < 35)  # [ True  True False False]

Matrix multiplication

A = np.array([[1, 1], [0, 1]])
B = np.array([[2, 0], [3, 4]])
print(A.dot(B))      # matrix product, first way
print(np.dot(A, B))  # matrix product, second way
Output:
[[5 4]
 [3 4]]
[[5 4]
 [3 4]]
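
A common mistake is to use * for a matrix product. On NumPy arrays * multiplies element by element; the @ operator (Python 3.5 and later) is a third way to write the true matrix product. A quick comparison using the same A and B:

```python
import numpy as np

A = np.array([[1, 1], [0, 1]])
B = np.array([[2, 0], [3, 4]])

matrix_product = A @ B   # same result as A.dot(B) and np.dot(A, B)
elementwise = A * B      # multiplies matching entries only

print(matrix_product)
print(elementwise)
```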

Exponentials with base e, and square roots

import numpy as np
B = np.arange(3)
print(np.exp(B))   # [ 1.          2.71828183  7.3890561 ]  e raised to each element of B
print(np.sqrt(B))  # [ 0.          1.          1.41421356]

floor rounds down

import numpy as np
a = np.floor(10 * np.random.random((3, 4)))  # floor rounds each value down
print(a)
print(a.ravel())  # flatten the matrix into a single row
a.shape = (6, 2)  # the same as a = a.reshape(6, -1); a -1 tells NumPy to infer that dimension from the others
print(a)
print(a.T)        # the transpose of a (rows and columns swapped)

Output:

[[ 8. 7. 2. 1.]
[ 5. 2. 5. 1.]
[ 8. 7. 7. 2.]]
[ 8. 7. 2. 1. 5. 2. 5. 1. 8. 7. 7. 2.]
[[ 8. 7.]
[ 2. 1.]
[ 5. 2.]
[ 5. 1.]
[ 8. 7.]
[ 7. 2.]]
[[ 8. 2. 5. 5. 8. 7.]
[ 7. 1. 2. 1. 7. 2.]]

hstack and vstack concatenate matrices (often used when joining data)

a = np.floor(10 * np.random.random((2, 2)))
b = np.floor(10 * np.random.random((2, 2)))
print(a)
print(b)
print(np.hstack((a, b)))  # join side by side
print(np.vstack((a, b)))  # stack one on top of the other
Output:
[[ 8. 6.]
 [ 7. 6.]]
[[ 3. 4.]
 [ 8. 1.]]
[[ 8. 6. 3. 4.]
 [ 7. 6. 8. 1.]]
[[ 8. 6.]
 [ 7. 6.]
 [ 3. 4.]
 [ 8. 1.]]

hsplit and vsplit split matrices

a = np.floor(10 * np.random.random((2, 12)))
print(a)
print(np.hsplit(a, 3))       # split the columns into 3 equal parts
print(np.hsplit(a, (3, 4)))  # split at given positions: after the third and after the fourth column
Output:
[[ 7. 1. 4. 9. 8. 8. 5. 9. 6. 6. 9. 4.]
 [ 1. 9. 1. 2. 9. 9. 5. 0. 5. 4. 9. 6.]]
[array([[ 7., 1., 4., 9.],
       [ 1., 9., 1., 2.]]), array([[ 8., 8., 5., 9.],
       [ 9., 9., 5., 0.]]), array([[ 6., 6., 9., 4.],
       [ 5., 4., 9., 6.]])]
[array([[ 7., 1., 4.],
       [ 1., 9., 1.]]), array([[ 9.],
       [ 2.]]), array([[ 8., 8., 5., 9., 6., 6., 9., 4.],
       [ 9., 9., 5., 0., 5., 4., 9., 6.]])]

a = np.floor(10 * np.random.random((12, 2)))
print(a)
print(np.vsplit(a, 3))  # split the rows into 3 equal parts
Output:
[[ 6. 4.]
[ 0. 1.]
[ 9. 0.]
[ 0. 0.]
[ 0. 4.]
[ 1. 1.]
[ 0. 4.]
[ 1. 6.]
[ 9. 7.]
[ 0. 9.]
[ 6. 1.]
[ 3. 0.]]
[array([[ 6., 4.],
[ 0., 1.],
[ 9., 0.],
[ 0., 0.]]), array([[ 0., 4.],
[ 1., 1.],
[ 0., 4.],
[ 1., 6.]]), array([[ 9., 7.],
[ 0., 9.],
[ 6., 1.],
[ 3., 0.]])]

Assigning one array to another name makes both names refer to the same memory, so an operation through one name affects the other

a = np.arange(12)
b = a  # a and b are two names for the same array object
print(b is a)
b.shape = 3, 4
print(a.shape)
print(id(a))  # id identifies the object; equal ids mean a and b refer to the same object
print(id(b))
Output:
True
(3, 4)
4382560048
4382560048

The view method creates a new array object that shares its element data with the original

import numpy as np
a = np.arange(12)
c = a.view()
print(id(a))  # the ids differ
print(id(c))
print(c is a)
c.shape = 2, 6
print(a.shape)  # changing the shape of c leaves the shape of a unchanged
c[0, 4] = 1234  # change an element through c
print(a)        # the element in a changes as well
Output:
4382897216
4382897136
False
(12,)
[   0    1    2    3 1234    5    6    7    8    9   10   11]

The copy method (deep copy) creates a complete, independent copy of the array and its element values

d = a.copy()
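
The three cases above (plain assignment, view, copy) can be told apart with np.shares_memory, which reports whether two arrays use the same underlying data. A compact sketch:

```python
import numpy as np

a = np.arange(12)
alias = a            # same object, just another name
view = a.view()      # new object, shared data
deep = a.copy()      # new object, independent data

view[0] = 100        # visible through a, because the data is shared
deep[1] = -1         # not visible through a

print(alias is a, view is a, deep is a)
print(np.shares_memory(a, view), np.shares_memory(a, deep))
print(a[:2])
```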

Finding the maximum along rows or columns of a matrix, and the index of the maximum

import numpy as np
data = np.sin(np.arange(20)).reshape(5, 4)
print(data)
ind = data.argmax(axis=0)  # row index of the maximum in each column
print(ind)
data_max = data[ind, range(data.shape[1])]  # pick the values using (row, column) index pairs
print(data_max)
Output:
[[ 0.          0.84147098  0.90929743  0.14112001]
 [-0.7568025  -0.95892427 -0.2794155   0.6569866 ]
 [ 0.98935825  0.41211849 -0.54402111 -0.99999021]
 [-0.53657292  0.42016704  0.99060736  0.65028784]
 [-0.28790332 -0.96139749 -0.75098725  0.14987721]]
[2 0 3 1]
[ 0.98935825  0.84147098  0.99060736  0.6569866 ]

The tile function repeats an array along rows and columns

import numpy as np
a = np.arange(0, 40, 10)
b = np.tile(a, (2, 3))  # repeat 2 times along the rows and 3 times along the columns
print(b)
Output:
[[ 0 10 20 30  0 10 20 30  0 10 20 30]
 [ 0 10 20 30  0 10 20 30  0 10 20 30]]

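Two ways to sort
sort orders the values themselves; argsort returns the indices that would order the elements from smallest to largest, and indexing with them gives the sorted result

a = np.array([[4, 3, 5], [1, 2, 1]])
b = np.sort(a, axis=1)  # sort each row of a in ascending order and assign the result to b
print(b)
a.sort(axis=1)          # sort each row of a in place
print(a)
a = np.array([4, 3, 1, 2])
j = np.argsort(a)       # indices of the elements in ascending order
print(j)
print(a[j])             # index a with them to get the sorted values
Output:
[[3 4 5]
 [1 1 2]]
-------
[[3 4 5]
 [1 1 2]]
-------
[2 3 1 0]
-------
[1 2 3 4]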
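
argsort is most useful when one array has to be ordered by the values of another, or when descending order is wanted. A minimal sketch; the names and numbers are made up for illustration:

```python
import numpy as np

# Hypothetical labels and values, kept in two parallel arrays.
names = np.array(["wine", "beer", "spirits", "other"])
litres = np.array([1.5, 4.0, 0.31, 0.75])

order = np.argsort(litres)      # indices for ascending order
descending = order[::-1]        # reverse them for descending order
names_by_litres = names[descending]

print(order)
print(names_by_litres)
```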

Pandas, the data analysis library built on NumPy

The read_csv function reads a CSV file

import pandas as pd
food_info = pd.read_csv("food_info.csv")
print(type(food_info))   # read_csv returns a DataFrame, which can be treated like a matrix
print(food_info.dtypes)  # the data type of each column
Output:
<class 'pandas.core.frame.DataFrame'>
NDB_No              int64
Shrt_Desc          object
Water_(g)         float64
Energ_Kcal          int64
......
Cholestrl_(mg)    float64
dtype: object

Inspecting the loaded data

print(food_info.head(3))   # head() without an argument returns the first 5 rows
print(food_info.tail())    # tail() returns the last 5 rows
print(food_info.columns)   # columns holds all column names
print(food_info.shape)     # the dimensions of the data: (8618, 36)

Selecting specific rows

print(food_info.loc[0])  # the row with index label 0
# food_info.loc[8620] raises an error when the label is beyond the last index:
# "KeyError: 'the label [8620] is not in the [index]'"
food_info.loc[3:6]       # rows 3 through 6, end included: 3, 4, 5, 6
two_five_ten = [2, 5, 10]
food_info.loc[two_five_ten]  # rows 2, 5 and 10
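
loc selects by index label, not by position, which only becomes visible when the index is not 0, 1, 2, and so on. A small sketch contrasting loc with the position-based iloc, using a made-up table (the values are illustrative, not from food_info.csv):

```python
import pandas as pd

# Hypothetical data with a non-default index.
df = pd.DataFrame(
    {"food": ["butter", "cheese", "milk", "egg"], "kcal": [717, 353, 61, 143]},
    index=[10, 20, 30, 40],
)

by_label = df.loc[20]            # the row whose label is 20
by_position = df.iloc[1]         # the second row, whatever its label
label_slice = df.loc[10:30]      # label slices include both ends
position_slice = df.iloc[0:2]    # position slices exclude the end

print(label_slice)
print(position_slice)
```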

Selecting specific columns

ndb_col = food_info["NDB_No"]  # the data in the first column, NDB_No
print(ndb_col)

columns = ["Zinc_(mg)", "Copper_(mg)"]  # to select several columns, list their names
zinc_copper = food_info[columns]
print(zinc_copper)

Selecting the columns whose names end with (g)

col_names = food_info.columns.tolist()  # tolist() puts the column names into a list
gram_columns = []
for c in col_names:
    if c.endswith("(g)"):
        gram_columns.append(c)
gram_df = food_info[gram_columns]
print(gram_df.head())  # first 5 rows
Output:
   Water_(g)  Protein_(g)  Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  \
0      15.87         0.85          81.11     2.11            0.06
1      15.87         0.85          81.11     2.11            0.06
2       0.24         0.28          99.48     0.00            0.00
3      42.41        21.40          28.74     5.11            2.34
4      41.11        23.24          29.68     3.18            2.79

   Fiber_TD_(g)  Sugar_Tot_(g)  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)
0           0.0           0.06      51.368       21.021        3.043
1           0.0           0.06      50.489       23.426        3.012
2           0.0           0.00      61.924       28.732        3.694
3           0.0           0.50      18.669        7.778        0.800
4           0.0           0.51      18.764        8.598        0.784

Arithmetic on the data in a column

import pandas
food_info = pandas.read_csv("food_info.csv")
iron_grams = food_info["Iron_(mg)"] / 1000  # divide every value in the column by 1000
food_info["Iron_(g)"] = iron_grams          # add a new column Iron_(g) to hold the result

water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]  # multiply two columns element by element

Maximum, minimum and mean of a column

max_calories = food_info["Energ_Kcal"].max()
print(max_calories)
min_calories = food_info["Energ_Kcal"].min()
print(min_calories)
mean_calories = food_info["Energ_Kcal"].mean()
print(mean_calories)
Output:
902
0
226.438616848

Sorting by a column with sort_values()

food_info.sort_values("Sodium_(mg)", inplace=True)  # ascending by default; inplace=True sorts food_info itself instead of returning a new DataFrame
print(food_info["Sodium_(mg)"])

food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)
# ascending=False sorts from largest to smallest
print(food_info["Sodium_(mg)"])
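
The effect of inplace is easy to verify on a tiny table: without it sort_values returns a sorted copy and leaves the original alone, and with inplace=True it modifies the original and returns None. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"name": ["a", "b", "c"], "sodium": [30, 10, 20]})

sorted_copy = df.sort_values("sodium")        # new DataFrame, df untouched
order_before = df["sodium"].tolist()

result = df.sort_values("sodium", ascending=False, inplace=True)  # sorts df itself
order_after = df["sodium"].tolist()

print(order_before, order_after, result)
```

Assigning the returned copy to a name (as with sorted_copy) is the more common style, since it keeps the original available.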

Exercises on titanic_train.csv (including the pivot_table() method)

import pandas as pd
import numpy as np
titanic_survival = pd.read_csv("titanic_train.csv")
titanic_survival.head()

age = titanic_survival["Age"]
print(age.loc[0:20])          # rows 0 through 20 of one column
age_is_null = pd.isnull(age)  # isnull() flags missing values: True if missing, False otherwise
print(age_is_null)
age_null_true = age[age_is_null]  # all rows where this column is missing
print(age_null_true)
age_null_count = len(age_null_true)
print(age_null_count)         # the number of missing rows

# With missing values present, a naive mean cannot be computed
mean_age = sum(titanic_survival["Age"]) / len(titanic_survival["Age"])  # sum() adds up the elements of the column
print(mean_age)  # nan

# Missing values have to be removed before computing the mean
good_ages = titanic_survival["Age"][~age_is_null]  # keep only the rows that are not missing
correct_mean_age = sum(good_ages) / len(good_ages)
print(correct_mean_age)  # 29.6991176471

# This much effort is not needed: missing values are common, and the pandas mean() method skips them automatically
correct_mean_age = titanic_survival["Age"].mean()
print(correct_mean_age)  # 29.6991176471
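
The difference between the built-in sum and the pandas mean can be reproduced on a four-element Series, which makes the arithmetic easy to check by hand. A sketch with made-up ages; skipna is the pandas parameter that controls whether missing values are skipped:

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 38.0, 26.0])   # illustrative values, one missing

builtin_mean = sum(ages) / len(ages)           # nan: any nan poisons the sum
pandas_mean = ages.mean()                      # (22 + 38 + 26) / 3, nan skipped
strict_mean = ages.mean(skipna=False)          # nan again when skipping is disabled
missing_count = ages.isnull().sum()

print(builtin_mean, pandas_mean, strict_mean, missing_count)
```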

# Average ticket price for each passenger class
passenger_classes = [1, 2, 3]
fares_by_class = {}
for this_class in passenger_classes:
    pclass_rows = titanic_survival[titanic_survival["Pclass"] == this_class]
    pclass_fares = pclass_rows["Fare"]  # the fare column for passengers in this class
    fare_for_class = pclass_fares.mean()
    fares_by_class[this_class] = fare_for_class
print(fares_by_class)
Result:
{1: 84.154687499999994, 2: 20.662183152173913, 3: 13.675550101832993}

# pandas offers a more convenient tool for this kind of summary, the pivot_table() method
# index tells pivot_table which column to group by
# values names the column to compute on
# aggfunc names the calculation to apply
passenger_survival = titanic_survival.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)
print(passenger_survival)
Result:
Pclass  Survived
1       0.629630
2       0.472826
3       0.242363

# Average age of the passengers in each class
passenger_age = titanic_survival.pivot_table(index="Pclass", values="Age")  # aggfunc defaults to the mean
print(passenger_age)
Result:
Pclass  Age
1       38.233441
2       29.877630
3       25.140620

# index groups by one column
# values can name several columns to compute on
port_stats = titanic_survival.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc=np.sum)
print(port_stats)
Result:
Embarked  Fare        Survived
C         10072.2962  93
Q         1022.2543   30
S         17439.3988  217
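
For a single grouping column, pivot_table gives the same numbers as a groupby followed by an aggregation, which can help when reading other people's code. A minimal sketch on a made-up six-row table (not the real Titanic data); the column names mirror the ones used above:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature data set for illustration.
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
    "Fare":     [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

by_pivot = df.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
by_groupby = df.groupby("Pclass")["Survived"].mean()

print(by_pivot)
print(by_groupby)
```

pivot_table returns a DataFrame while this groupby returns a Series, so the printed layout differs even though the values agree.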

# Drop the rows that contain missing values
new_titanic_survival = titanic_survival.dropna(axis=0, subset=["Age", "Cabin"])  # with subset, a row is dropped if either Age or Cabin is missing
print(new_titanic_survival)

# Locate a single element by row and column and read its value
row_index_103_age = titanic_survival.loc[103, "Age"]
row_index_766_pclass = titanic_survival.loc[766, "Pclass"]
print(row_index_103_age)
print(row_index_766_pclass)

# sort_values() sorts, reset_index() renumbers the rows
new_titanic_survival = titanic_survival.sort_values("Age", ascending=False)  # ascending=False: largest first
print(new_titanic_survival[0:10])  # the index still shows the original row labels
titanic_reindexed = new_titanic_survival.reset_index(drop=True)  # reset_index(drop=True) replaces them with 0, 1, 2, ...
print(titanic_reindexed.iloc[0:10])  # iloc selects rows by position

# Wrap an operation in a function, then pass the function to apply()
def hundredth_row(column):  # returns the 100th item of a column
    # Extract the hundredth item
    hundredth_item = column.iloc[99]
    return hundredth_item
hundredth_row_values = titanic_survival.apply(hundredth_row)  # apply() calls the function on every column
print(hundredth_row_values)
Result:
PassengerId 100
Survived 0
Pclass 2
Name Kantor, Mr. Sinai
Sex male
Age 34
SibSp 1
Parch 0
Ticket 244367
Fare 26
Cabin NaN
Embarked S
dtype: object

# Count the missing values in every column
def null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)
column_null_count = titanic_survival.apply(null_count)
print(column_null_count)
Output:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64

# Convert the passenger class to a readable label
def which_class(row):
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    elif pclass == 3:
        return "Third Class"
classes = titanic_survival.apply(which_class, axis=1)  # with axis=1, DataFrame.apply() iterates over rows instead of columns
print(classes)

# Use a custom function and pivot_table to compute the survival rate for each age label
def generate_age_label(row):
    age = row["Age"]
    if pd.isnull(age):
        return "unknown"
    elif age < 18:
        return "minor"
    else:
        return "adult"

age_labels = titanic_survival.apply(generate_age_label, axis=1)

titanic_survival['age_labels'] = age_labels
age_group_survival = titanic_survival.pivot_table(index="age_labels", values="Survived", aggfunc=np.mean)
print(age_group_survival)
Result:

age_labels  Survived
adult       0.381032
minor       0.539823
unknown     0.293785
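
Row-wise apply calls a Python function once per row, which can be slow on large tables. One possible vectorized alternative, not used in the original exercise, is np.select, which evaluates the conditions on whole columns. A sketch on a few made-up ages:

```python
import numpy as np
import pandas as pd

ages = pd.Series([4.0, np.nan, 30.0, 17.0, 18.0])   # illustrative values

# Conditions are checked in order; the first one that holds decides the label.
labels = np.select(
    [ages.isnull(), ages < 18],
    ["unknown", "minor"],
    default="adult",
)

print(labels)
```

The order of the conditions matters only for readability here, since a missing age never satisfies ages < 18.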

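The Series structure

Series (a collection of values): a single row or a single column of a DataFrame is a Series.
DataFrame (a collection of Series objects): the table returned by read_csv().
Panel (a collection of DataFrame objects). Panel was removed in pandas 0.25, so it is not available in current versions.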

import pandas as pd
fandango = pd.read_csv('fandango_score_comparison.csv')  # movie data, loaded as a DataFrame
series_film = fandango['FILM']  # select the "FILM" column
print(type(series_film))        # <class 'pandas.core.series.Series'>
print(series_film[0:5])         # slice by position
series_rt = fandango['RottenTomatoes']
print(series_rt[0:5])

from pandas import Series  # Import the Series object from pandas
film_names = series_film.values  # extract the values held by the Series
print(type(film_names))          # <class 'numpy.ndarray'>: a Series keeps its values in an ndarray
rt_scores = series_rt.values
series_custom = Series(rt_scores, index=film_names)  # build a Series of scores indexed by film name
series_custom[['Minions (2015)', 'Leviathan (2014)']]  # names work as index labels
fiveten = series_custom[5:10]  # positional slices work as well
print(fiveten)

Sorting a Series

original_index = series_custom.index.tolist()  # put the index values into a list
sorted_index = sorted(original_index)
sorted_by_index = series_custom.reindex(sorted_index)  # reindex reorders the Series to follow the given index
print(sorted_by_index)

sc2 = series_custom.sort_index()   # sort by index
sc3 = series_custom.sort_values()  # sort by value
print(sc3)

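A Series stores its values in an ndarray, the core NumPy data type, so NumPy functions can be applied to it directly

import numpy as np
print(np.add(series_custom, series_custom))  # add two Series element by element
np.sin(series_custom)  # apply sin to every value
np.max(series_custom)  # the maximum of the column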

Selecting the values of series_custom between 50 and 75
A comparison on a column is applied to every value and returns booleans

criteria_one = series_custom > 50  # a boolean Series
criteria_two = series_custom < 75
both_criteria = series_custom[criteria_one & criteria_two]  # keep only the values that satisfy both conditions
print(both_criteria)

Arithmetic on two columns that share the same index

# data alignment: same index
rt_critics = Series(fandango['RottenTomatoes'].values, index=fandango['FILM'])
rt_users = Series(fandango['RottenTomatoes_User'].values, index=fandango['FILM'])
rt_mean = (rt_critics + rt_users) / 2
print(rt_mean)
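
Alignment matters most when the two indexes do not match exactly: pandas pairs values by label, and a label present on only one side produces NaN. A short sketch with hypothetical film names and scores:

```python
import numpy as np
import pandas as pd

# Hypothetical scores; "Film B" is the only label present in both Series.
critics = pd.Series([74, 85], index=["Film A", "Film B"])
users = pd.Series([86, 60], index=["Film B", "Film C"])

mean_score = (critics + users) / 2          # NaN where a label is missing on one side
filled = critics.add(users, fill_value=0)   # treat a missing side as 0 instead

print(mean_score)
print(filled)
```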

Working with a DataFrame
Setting 'FILM' as the index

fandango = pd.read_csv('fandango_score_comparison.csv')
print(type(fandango))  # <class 'pandas.core.frame.DataFrame'>
fandango_films = fandango.set_index('FILM', drop=False)  # returns a new DataFrame indexed by 'FILM'; drop=False keeps the original FILM column

Slicing a DataFrame

# Either [] or loc[] can be used to slice
fandango_films["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]  # an index made of strings can be sliced too
fandango_films.loc["Avengers: Age of Ultron (2015)":"Hot Tub Time Machine 2 (2015)"]
fandango_films[0:3]  # positional slicing is still available
# Select specific rows by label
# movies = ['Kumiko, The Treasure Hunter (2015)', 'Do You Believe? (2015)', 'Ant-Man (2015)']

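Matplotlib, the visualization library

Matplotlib is one of the most widely used visualization tools in Python. It makes it easy to create a wide variety of 2D charts and some basic 3D charts.

2D charts: line charts

The most basic Matplotlib module is pyplot. We start with the simplest point and line plots.
More options are described on the official site: http://matplotlib.org/api/pyplot_api.html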

import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

unrate = pd.read_csv('unrate.csv')
unrate['DATE'] = pd.to_datetime(unrate['DATE'])  # pd.to_datetime normalizes the date format

first_twelve = unrate[0:12]  # the first 12 rows (index 0 through 11)
plt.plot(first_twelve['DATE'], first_twelve['VALUE'])  # plot(x values, y values) draws the line
plt.xticks(rotation=45)  # rotate the x-axis tick labels
plt.xlabel('Month')  # x-axis label
plt.ylabel('Unemployment Rate')  # y-axis label
plt.title('Monthly Unemployment Trends, 1948')  # chart title
plt.show()  # display the chart

Subplots

Adding a subplot: add_subplot(first, second, index)
first is the number of rows, second the number of columns, and index the position of the subplot in that grid.

import matplotlib.pyplot as plt
fig = plt.figure()  # Creates a new figure.
ax1 = fig.add_subplot(3, 2, 1)  # position 1 of a 3x2 grid
ax2 = fig.add_subplot(3, 2, 2)  # position 2 of a 3x2 grid
ax3 = fig.add_subplot(3, 2, 6)  # position 6 of a 3x2 grid
plt.show()

import numpy as np
# fig = plt.figure()
fig = plt.figure(figsize=(3, 6))  # size of the figure as (width, height) in inches
ax1 = fig.add_subplot(2, 1, 1)
ax2 = fig.add_subplot(2, 1, 2)

ax1.plot(np.random.randint(1, 5, 5), np.arange(5))  # draw on the first subplot
ax2.plot(np.arange(10) * 3, np.arange(10))          # draw on the second subplot
plt.show()
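
plt.subplots is a commonly used shortcut that creates the figure and all of its axes in one call, instead of one add_subplot call per position. A minimal sketch; the Agg backend is chosen here only so that the code runs without a display, and the plotted data is arbitrary:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# One call returns the figure and an array holding both axes.
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(3, 6))

x = np.arange(10)
axes[0].plot(x, x ** 2)
axes[0].set_title("squares")
axes[1].plot(x, x * 3)
axes[1].set_title("triples")

fig.tight_layout()  # keep titles and tick labels from overlapping
```

In an interactive session you would drop the matplotlib.use line and finish with plt.show().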

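Drawing two lines in the same chart (call plot twice)

unrate['MONTH'] = unrate['DATE'].dt.month  # assumption: MONTH is the month number derived from DATE; the original never defines this column
fig = plt.figure(figsize=(6, 3))
plt.plot(unrate[0:12]['MONTH'], unrate[0:12]['VALUE'], c='red')
plt.plot(unrate[12:24]['MONTH'], unrate[12:24]['VALUE'], c='blue')
plt.show()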

Labelling the plotted lines

fig = plt.figure(figsize=(10, 6))
colors = ['red', 'blue', 'green', 'orange', 'black']
for i in range(5):
    start_index = i * 12
    end_index = (i + 1) * 12
    subset = unrate[start_index:end_index]
    label = str(1948 + i)  # the label text
    plt.plot(subset['MONTH'], subset['VALUE'], c=colors[i], label=label)  # x values, y values, colour, label
plt.legend(loc='upper left')  # loc sets the position of the legend box, e.g. 'best', 'upper right', 'lower left'; see help(plt.legend)
plt.xlabel('Month, Integer')
plt.ylabel('Unemployment Rate, Percent')
plt.title('Monthly Unemployment Trends, 1948-1952')
plt.show()

2D charts: bar charts and scatter plots

Bar charts

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
reviews = pd.read_csv('fandango_scores.csv')  # movie rating table
cols = ['FILM', 'RT_user_norm', 'Metacritic_user_nom', 'IMDB_norm', 'Fandango_Ratingvalue', 'Fandango_Stars']
norm_reviews = reviews[cols]
num_cols = ['RT_user_norm', 'Metacritic_user_nom', 'IMDB_norm', 'Fandango_Ratingvalue', 'Fandango_Stars']
bar_heights = norm_reviews.loc[0, num_cols].values  # bar heights (.ix was removed from pandas; .loc replaces it)
bar_positions = np.arange(5) + 0.75  # distance of each bar's left edge from the left
tick_positions = range(1, 6)         # x-axis tick positions 1, 2, 3, 4, 5
fig, ax = plt.subplots()

ax.bar(bar_positions, bar_heights, 0.5, align='edge')  # left edge, height, width; align='edge' matches the left-edge positions above
ax.set_xticks(tick_positions)  # x-axis tick positions
ax.set_xticklabels(num_cols, rotation=45)  # x-axis tick labels

ax.set_xlabel('Rating Source')
ax.set_ylabel('Average Rating')
ax.set_title('Average User Rating For Avengers: Age of Ultron (2015)')
plt.show()

Scatter plots

fig, ax = plt.subplots()  # fig controls the figure as a whole (its size, for example); ax does the actual drawing
ax.scatter(norm_reviews['Fandango_Ratingvalue'], norm_reviews['RT_user_norm'])  # scatter(x values, y values)
ax.set_xlabel('Fandango')
ax.set_ylabel('Rotten Tomatoes')
plt.show()

Scatter plots in subplots

fig = plt.figure(figsize=(8, 3))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)
ax1.scatter(norm_reviews['Fandango_Ratingvalue'], norm_reviews['RT_user_norm'])
ax1.set_xlabel('Fandango')
ax1.set_ylabel('Rotten Tomatoes')
ax2.scatter(norm_reviews['RT_user_norm'], norm_reviews['Fandango_Ratingvalue'])
ax2.set_xlabel('Rotten Tomatoes')
ax2.set_ylabel('Fandango')
plt.show()
