[Learning Data Mining with Python] Chapter 1: Getting Started with Data Mining

Published: 2024/4/19

1. Introduction to data mining (omitted)

2. Using Python and IPython Notebook

2.1. Installing Python

2.2. Installing IPython

2.3. Installing scikit-learn

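scikit-learn is a machine learning library written in Python. It contains a large number of machine learning algorithms, datasets, utilities, and frameworks. It is built on the Python scientific computing stack, in which numpy and scipy are optimized for data processing tasks, so scikit-learn is fast and scalable, which makes it practical for data mining.

scikit-learn can be installed with the pip tool that ships with Python 3. If NumPy and SciPy are not installed yet, they are installed along with it. The install command is: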

pip install scikit-learn

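3. An affinity analysis example

3.1 What is affinity analysis?

Affinity analysis determines how closely related samples (objects) are, based on the similarity between them. Typical applications include:

(1) Offering varied services or targeted advertising to website users.

(2) Selling related items to users in order to recommend movies or products.

(3) Finding people who are related to each other based on their genes.

......

How can affinity be measured?

(1) Count how often two products are sold together, or the proportion of customers who buy product 2 after buying product 1.

(2) Compute the similarity between two individuals.

......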

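3.2 Product recommendations

Once product sales moved from offline to online, much of the work that used to be done by hand had to be automated before a business could scale. Up-selling means selling an additional product to a customer who is already buying something. Product recommendation, which used to be done manually offline, can now be handled with data mining. It is highly profitable and helps drive the e-commerce revolution.

Let's look at a simple recommendation service: two products that people have often bought together in the past are likely to be bought together again.

As an introductory data mining example, we want to obtain rules of the following form:

If a person buys product X, then they are likely to buy product Y.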

3.3 Loading the dataset with NumPy

import numpy as np

dataset_filename = 'affinity_dataset.txt'

X = np.loadtxt(dataset_filename)

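3.4 Implementing a simple ranking of rules

There are many ways to judge how good a rule is. The most common are support and confidence.

Support is the number of times a rule holds in the dataset. Sometimes the support needs to be normalized.

Support measures how often a given rule holds, while confidence measures how accurate it is: among all samples that satisfy the rule's premise, the fraction that also match its conclusion. To compute it, first count how many times the rule holds, then divide by the number of samples in which the premise holds.

Below we work toward the support and confidence of the rule "if a customer buys apples, they also buy bananas".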

num_apple_purchases = 0
for sample in X:
    if sample[3] == 1:  # This person bought apples
        num_apple_purchases += 1
        # print('{0} people bought apples'.format(num_apple_purchases))  # you can try printing here, inside the loop, to see the difference
print('{0} people bought apples'.format(num_apple_purchases))


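Similarly, checking whether sample[4] equals 1 tells us whether the customer also bought bananas, which is enough to compute the support and confidence.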
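As a minimal sketch of the two measures, assuming a tiny hand-written dataset (the five rows below are invented for illustration and are not the contents of affinity_dataset.txt), the support and confidence of the rule "apples -> bananas" could be computed like this:

```python
# Columns: bread, milk, cheese, apples, bananas.
# These five rows are made-up illustration data.
X = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0],
]
APPLES, BANANAS = 3, 4

# How many samples satisfy the premise (bought apples)?
num_apple_purchases = sum(1 for sample in X if sample[APPLES] == 1)
# How many samples satisfy both premise and conclusion?
rule_valid = sum(1 for sample in X if sample[APPLES] == 1 and sample[BANANAS] == 1)

support = rule_valid                           # raw count, not normalized
confidence = rule_valid / num_apple_purchases  # fraction of apple buyers who bought bananas
print("support = {0}, confidence = {1:.3f}".format(support, confidence))
```

Here three people bought apples and two of them also bought bananas, so the support is 2 and the confidence is 2/3.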

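We need these statistics for every rule in the dataset. First, build one dictionary for the cases where a rule holds and one for the cases where it fails. Each key is a (premise, conclusion) tuple whose elements are the indices of the features in the feature list, not the actual feature names. For example, "if a customer bought apples, they also bought bananas" is written (3, 4). If a sample matches both the premise and the conclusion of a rule, the rule holds for that sample. If it matches the premise but not the conclusion, the rule fails.

To compute the confidence and support of every rule, first create a few dictionaries to hold the results. We use defaultdict, which returns a default value when the key being looked up does not exist. Three quantities need to be counted: how often a rule holds, how often it fails, and how often each premise occurs.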

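from collections import defaultdict

valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurrences = defaultdict(int)
n_samples, n_features = X.shape

# Process every sample and every feature of each sample. The first feature is the rule's premise: the customer bought a given item.
for sample in X:
    for premise in range(n_features):
        # Check whether the sample satisfies the premise; if not, move on to the next premise.
        if sample[premise] == 0: continue
        # The premise holds (the value is 1), so increment its occurrence count.
        num_occurrences[premise] += 1
        for conclusion in range(n_features):
            # Skip rules whose premise and conclusion are the same, such as "if a customer buys apples, they also buy apples"; such rules are useless.
            if premise == conclusion: continue
            # If the rule applies to the sample, increment valid_rules (keyed by the (premise, conclusion) tuple); otherwise increment invalid_rules.
            if sample[conclusion] == 1:
                valid_rules[(premise, conclusion)] += 1
            else:
                invalid_rules[(premise, conclusion)] += 1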
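The defaultdict behaviour that the counting loop relies on can be seen in isolation. This is a small standalone sketch, and the keys are arbitrary example tuples:

```python
from collections import defaultdict

counts = defaultdict(int)   # missing keys default to int(), which is 0
counts[(3, 4)] += 1         # no KeyError even though (3, 4) was never set
counts[(3, 4)] += 1
print(counts[(3, 4)])       # 2
print(counts[(0, 1)])       # 0; note that reading a missing key also inserts it
print(len(counts))          # 2
```

With a plain dict, the first `+= 1` would raise a KeyError unless the key were initialized first.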

https://github.com/datawhalechina/joyful-pandas Datawhale pandas tutorial

https://space.bilibili.com/631186842?from=search&seid=16882960572917617056 Rachel's english

https://www.liulishuo.com/liulishuo.html Liulishuo English (there is an app that can be downloaded directly)

https://github.com/fengdu78/lihang-code Python implementations for Li Hang's book

[ch1_affinity_create]

X = np.zeros((100, 5), dtype='bool')

# dtype can be changed, e.g. to int or float

X.shape[0]

# index 0 of shape is the number of rows, index 1 the number of columns

# arrays are indexed the same way as lists

np.savetxt("affinity_dataset.txt", X, fmt='%d')

# Parameters (from the np.savetxt docstring):
#   fmt : str or sequence of strs, optional
#       A single format (%10.5f), a sequence of formats, or a
#       multi-format string, e.g. 'Iteration %d -- %10.5f', in which
#       case `delimiter` is ignored. For complex `X`, the legal options
#       for `fmt` are:

#create a random float from 0 to 1

a = np.random.random()

print(X[:5].astype(int))  # np.int was removed in NumPy 1.24; the builtin int is equivalent
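The notes above only show fragments of how the dataset file is produced. The following is a rough sketch of one way to generate a random 0/1 purchase matrix and round-trip it through a text file. The seed, the 0.3 purchase probability, and the temporary file location are all assumptions made for illustration and are not taken from the book's notebook:

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(14)  # seed chosen arbitrarily
# Assumption: each of the 5 items is bought independently with probability 0.3.
X = (rng.random((100, 5)) < 0.3).astype(int)

# Write to a temporary directory so the sketch does not overwrite a real dataset.
path = os.path.join(tempfile.mkdtemp(), "affinity_dataset.txt")
np.savetxt(path, X, fmt="%d")   # fmt='%d' writes integers, one sample per row
X_loaded = np.loadtxt(path)     # loadtxt returns floats by default

print(X_loaded.shape)
print(X_loaded[:5].astype(int))
```

Because `np.loadtxt` returns floats, comparisons such as `sample[3] == 1` still work on the loaded array, and `astype(int)` is only needed for tidier printing.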

[ch1_affinity]

n_samples, n_features = X.shape

print("This dataset has {0} samples and {1} features".format(n_samples, n_features))

# print formatting used here: print('{0},{1}'.format(a, b))

# count the people who bought apples
num_apple_purchases = 0
for sample in X:
    if sample[3] == 1:
        num_apple_purchases += 1
print('{0} people bought apples'.format(num_apple_purchases))

####################################################################################################

##bought 3 but not bought 4
rule_valid = 0
rule_invalid = 0
for sample in X:
    if sample[3] == 1:  # This person bought Apples
        if sample[4] == 1:
            # This person bought both Apples and Bananas
            rule_valid += 1
        else:
            # This person bought Apples, but not Bananas
            rule_invalid += 1
print("{0} cases of the rule being valid were discovered".format(rule_valid))
print("{0} cases of the rule being invalid were discovered".format(rule_invalid))

####################################################################################################

## bought 3, then check 4 (same rule as above)
rule_valid = 0
rule_invalid = 0
for sample in X:
    if sample[3] == 1:
        if sample[4] == 1:
            rule_valid += 1
        else:
            rule_invalid += 1
print('{0} rule_valid'.format(rule_valid))
print('{0} rule_invalid'.format(rule_invalid))

####################################################################################################

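The rule: if someone bought apples, they probably also bought bananas.

The rule is invalid when someone bought apples but did not buy bananas.

support = rule_valid  # the number of times the rule holds
confidence = rule_valid / num_apple_purchases  # fraction of apple buyers who also bought bananas
print("The support is {0} and the confidence is {1:.3f}.".format(support, confidence))
# Confidence can be thought of as a percentage using the following:
print("As a percentage, that is {0:.1f}%.".format(100 * confidence))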

####################################################################################################

from collections import defaultdict

# Now compute for all possible rules
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurrences = defaultdict(int)

for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0: continue
        # Record that the premise was bought in another transaction
        num_occurrences[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:  # It makes little sense to measure if X -> X.
                continue
            if sample[conclusion] == 1:
                # This person also bought the conclusion item
                valid_rules[(premise, conclusion)] += 1
            else:
                # This person bought the premise, but not the conclusion
                invalid_rules[(premise, conclusion)] += 1

support = valid_rules
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():
    confidence[(premise, conclusion)] = valid_rules[(premise, conclusion)] / num_occurrences[premise]

####################################################################################################

from collections import defaultdict

rule_valid = defaultdict(int)
rule_invalid = defaultdict(int)
num_premise = defaultdict(int)
n_features = X.shape[1]

for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0: continue
        if sample[premise] == 1:
            num_premise[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion: continue
            if sample[conclusion] == 1:
                rule_valid[(premise, conclusion)] += 1
            else:
                rule_invalid[(premise, conclusion)] += 1

support = rule_valid
confidence = defaultdict(float)
for premise, conclusion in rule_valid.keys():
    confidence[(premise, conclusion)] = rule_valid[(premise, conclusion)] / num_premise[premise]

####################################################################################################

features = ["bread", "milk", "cheese", "apples", "bananas"]  # feature names, in column order

for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")

####################################################################################################

features = ["bread", "milk", "cheese", "apples", "bananas"]  # defined once, outside the loop

for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print('If someone buys {0} then they may buy {1}'.format(premise_name, conclusion_name))
    print('confidence is {0:.3f}'.format(confidence[(premise, conclusion)]))
    print('support is {0}'.format(support[(premise, conclusion)]))

# pprint prints data structures in a tidy, readable layout

from pprint import pprint

pprint(list(support.items()))

#example

import pprint

data = ("test", [1, 2, 3,'test', 4, 5], "This is a string!",

{'age':23, 'gender':'F'})

print(data)

pprint.pprint(data)


Note

# [Python: lists of dicts: the itemgetter function: sorting a list by one or more dict fields](https://www.cnblogs.com/baxianhua/p/8182627.html)

from operator import itemgetter

sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)

# [Using sorted vs. sort and reversed vs. reverse in Python](https://www.cnblogs.com/shengguorui/p/10863988.html)
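The same itemgetter pattern can rank rules by confidence and print only the best ones. This is a small sketch under stated assumptions: the support and confidence values below are hypothetical placeholders, not results computed from the real dataset:

```python
from operator import itemgetter

features = ["bread", "milk", "cheese", "apples", "bananas"]
# Hypothetical values, keyed by (premise, conclusion) index pairs.
support = {(3, 4): 21, (0, 1): 14, (2, 4): 27}
confidence = {(3, 4): 0.583, (0, 1): 0.519, (2, 4): 0.659}

# items() yields ((premise, conclusion), value) pairs; itemgetter(1) sorts by the value.
sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)

# Print the two rules with the highest confidence.
for (premise, conclusion), conf in sorted_confidence[:2]:
    print("{0} -> {1}: confidence {2:.3f}, support {3}".format(
        features[premise], features[conclusion], conf,
        support[(premise, conclusion)]))
```

`sorted` returns a new list and leaves the dictionary untouched, which is why the original `support` mapping can still be used for lookups inside the loop.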

[OneR]

[ch1_oner_application]

# Load the iris dataset and print its description
from sklearn.datasets import load_iris
dataset = load_iris()
print(dataset.DESCR)

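The data must be discretized before applying the OneR algorithm.

# Compute the mean for each attribute
attribute_means = X.mean(axis=0)
assert attribute_means.shape == (n_features,)  # assert: raises AssertionError if the condition is false
X_d = np.array(X >= attribute_means, dtype='int')

# X.mean(axis=0): axis=0 takes the mean down each column
# assert 1 == 1  # condition is true: execution continues normally
# assert 1 == 2  # condition is false: raises an exception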
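To see what the mean-threshold discretization does, here is a minimal sketch on a tiny array. The four rows are made-up measurements chosen for illustration, not actual iris samples:

```python
import numpy as np

# Made-up measurements: 4 samples, 2 features.
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.3, 3.3],
              [5.8, 2.7]])

attribute_means = X.mean(axis=0)                    # one mean per column
X_d = np.array(X >= attribute_means, dtype='int')   # 1 if at or above the column mean, else 0

print(attribute_means)
print(X_d)
```

Each continuous feature becomes a binary one, which is the categorical form OneR needs. Thresholding at the mean is a simple choice, and other thresholds or binning schemes are possible.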

# sklearn deprecated (and later removed) cross_validation; its contents were moved into model_selection

# replace sklearn.cross_validation with sklearn.model_selection

##origin

# Now, we split into a training and test set

from sklearn.cross_validation import train_test_split

# Set the random state to the same number to get the same results as in the book

random_state = 14

X_train, X_test, y_train, y_test = train_test_split(X_d, y, random_state=random_state)

print("There are {} training samples".format(y_train.shape))

print("There are {} testing samples".format(y_test.shape))

##new

# Now, we split into a training and test set

from sklearn.model_selection import train_test_split

# Set the random state to the same number to get the same results as in the book

random_state = 14

X_train, X_test, y_train, y_test = train_test_split(X_d, y, random_state=random_state)

print("There are {} training samples".format(y_train.shape))

print("There are {} testing samples".format(y_test.shape))

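# train_X, test_X, train_y, test_y = train_test_split(train_data, train_target, test_size=0.3, random_state=5)

# train_test_split() randomly splits the samples into a training set and a test set; you could also slice the data by hand

# Advantage: the split is random and objective, which reduces human bias

# test_size: the fraction of the samples used for testing; if an integer, the number of test samples

# zip() takes iterables as arguments and packs their corresponding elements into tuples. In Python 3 it returns an iterator over these tuples (Python 2 returned a list), so wrap it in list() to see them.

# If the iterables differ in length, the result is as long as the shortest one. With the * operator, zip(*zipped) reverses the packing.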

>>> a = [1, 2, 3]
>>> b = [4, 5, 6]
>>> c = [4, 5, 6, 7, 8]
>>> zipped = list(zip(a, b))  # pack into a list of tuples
>>> zipped
[(1, 4), (2, 5), (3, 6)]
>>> list(zip(a, c))  # the length matches the shortest list
[(1, 4), (2, 5), (3, 6)]
>>> list(zip(*zipped))  # the inverse of zip: *zipped unpacks the pairs again
[(1, 2, 3), (4, 5, 6)]

class_counts = defaultdict(int)
# Iterate through each sample and count the frequency of each class/value pair
for sample, y in zip(X, y_true):
    if sample[feature] == value:
        class_counts[y] += 1

a = zip(X, y)
for b in a:
    print(b)  # each b is a (sample, label) tuple

a = zip(X, y)
for b, c in a:
    print(b)  # b is the sample

a = zip(X, y)
for b, c in a:
    print(c)  # c is the label

error = sum([class_count for class_value, class_count in class_counts.items()
             if class_value != most_frequent_class])
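The counting fragment and the error sum above are pieces of one step of OneR: for a given feature value, find the most frequent class and count how many samples would be mispredicted. The following sketch shows how they might fit together. The function name and the tiny dataset are assumptions made for illustration, and the function assumes at least one sample has the requested value:

```python
from collections import defaultdict
from operator import itemgetter


def train_feature_value(X, y_true, feature, value):
    """Return (most frequent class, error count) for samples where sample[feature] == value."""
    class_counts = defaultdict(int)
    # Count how often each class occurs among samples with this feature value.
    for sample, y in zip(X, y_true):
        if sample[feature] == value:
            class_counts[y] += 1
    # The prediction for this feature value is the class seen most often.
    most_frequent_class = max(class_counts.items(), key=itemgetter(1))[0]
    # Every sample belonging to any other class counts as one error.
    error = sum(class_count for class_value, class_count in class_counts.items()
                if class_value != most_frequent_class)
    return most_frequent_class, error


# Made-up discretized data: 5 samples, 2 binary features.
X = [[0, 1], [0, 0], [1, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1, 0]

for value in (0, 1):
    print(value, train_feature_value(X, y, feature=0, value=value))
```

For feature 0, the value 0 always goes with class 0 (no errors), while the value 1 goes with class 1 twice and class 0 once, giving one error. OneR would sum these errors per feature and keep the feature with the lowest total.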
