Predicting Gender from a Name: A Naive Bayes Classifier

Published: 2024/9/27


Name and gender data: train.txt and test.txt. Download: "NLP學習——人名和性別相關對應數據" (a CSDN download, listed under natural language processing document resources).

1. First, read the data and do some simple preprocessing

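import pandas as pd
from collections import defaultdict

# You need to know the file encoding up front, otherwise decoding fails.
# sep=r'\s+' treats any run of whitespace as the delimiter.
train = pd.read_csv("D:/train.txt", sep=r'\s+', names=['name', 'sex'], encoding='utf-8')
test = pd.read_csv("D:/test.txt", sep=r'\s+', names=['name', 'sex'], encoding='utf-8')

# Data cleaning: drop rows with missing values, then map 男 (male) to 1 and 女 (female) to 0.
# dropna() returns a new DataFrame, so the result must be assigned back;
# the default axis=0 drops rows, whereas axis=1 would drop columns.
train = train.dropna().reset_index(drop=True)
test = test.dropna().reset_index(drop=True)
test.loc[test['sex'] == '男', 'sex'] = 1
test.loc[test['sex'] == '女', 'sex'] = 0
train.loc[train['sex'] == '男', 'sex'] = 1
train.loc[train['sex'] == '女', 'sex'] = 0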

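2. Split the training data by gender and compute how often each character occurs within each gender (the class-conditional frequency)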

# Split the training data by gender
names_female = train[train['sex'] == 0]  # female group
names_male = train[train['sex'] == 1]    # male group
# Free memory
del train

# Number of female names and number of male names
totals = {'f': len(names_female), 'm': len(names_male)}

# Frequency of each character within each gender:
# occurrences of the character / number of names in that gender
nums_list_f = defaultdict(float)
for name in names_female['name']:
    for ch in name:
        nums_list_f[ch] += 1.0 / totals['f']

nums_list_m = defaultdict(float)
for name in names_male['name']:
    for ch in name:
        nums_list_m[ch] += 1.0 / totals['m']
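To see what these frequency tables look like, here is a minimal sketch on a handful of made-up names. The toy data and the helper name `char_frequencies` are assumptions for illustration only; they are not part of the original dataset or code. It uses `Counter` to count first and divide once, which gives the same numbers as the incremental `+= 1.0 / total` approach.

```python
from collections import Counter

# Hypothetical toy data: (name, label) pairs, 1 = male, 0 = female.
toy_train = [("建国", 1), ("建军", 1), ("国华", 1), ("丽娟", 0), ("丽华", 0)]

def char_frequencies(names):
    """Return {char: occurrences / number_of_names} for one class."""
    counts = Counter(ch for name in names for ch in name)
    return {ch: c / len(names) for ch, c in counts.items()}

male_names = [name for name, sex in toy_train if sex == 1]
female_names = [name for name, sex in toy_train if sex == 0]

freq_m = char_frequencies(male_names)
freq_f = char_frequencies(female_names)

print(freq_m)  # 建 and 国 each appear in 2 of the 3 male names
print(freq_f)  # 丽 appears in both female names
```

Note that 丽 never shows up in `freq_m`: any name containing it would get a male score of zero without smoothing.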

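3. The training set contains only some of the characters that can appear in a name, so sooner or later a name will include a character that never occurred in training. That character's frequency is 0, the product for the whole name becomes 0, and the code above effectively treats such a person as impossible. Laplace smoothing solves this problem.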

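# Laplace smoothing
def laplace(char, nums_list, totals, alpha=1):
    # char: the character; nums_list: character-frequency table; totals: number of names in the class
    # .get() avoids inserting unseen characters into the defaultdict, which would change len(nums_list)
    count = nums_list.get(char, 0) * totals  # number of times the character occurred
    charnums = len(nums_list)                # number of distinct characters in the table
    smooth = (count + alpha) / (totals + alpha * charnums)
    return smooth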
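A small worked example may make the effect of smoothing more concrete. The sketch below uses the same quantities as the function above (a character's count, the number of names in the class, and the number of distinct characters seen in that class), but the function name `laplace_smooth` and the numbers are assumptions chosen for illustration. A textbook multinomial Naive Bayes would typically use the total character count in the denominator rather than the number of names.

```python
def laplace_smooth(count, total, vocab_size, alpha=1.0):
    """Additive (Laplace) smoothing: (count + alpha) / (total + alpha * vocab_size)."""
    return (count + alpha) / (total + alpha * vocab_size)

# Hypothetical class with 3 names and 4 distinct characters.
total_names = 3
vocab_size = 4

unseen = laplace_smooth(0, total_names, vocab_size)  # (0 + 1) / (3 + 4) = 1/7
seen = laplace_smooth(2, total_names, vocab_size)    # (2 + 1) / (3 + 4) = 3/7

print(unseen, seen)
```

The unseen character now gets a small but non-zero value, so one unfamiliar character no longer wipes out the score of the whole name.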

4. For a given name, compute its smoothed score under the male model and under the female model, and return both in a dictionary

# After Laplace smoothing: the male and female scores for a name, returned as a dictionary
def Prob(name, totals, nums_list_m, nums_list_f):
    prob_m = 1
    prob_f = 1
    for ch in name:
        prob_m *= laplace(ch, nums_list_m, totals['m'])
        prob_f *= laplace(ch, nums_list_f, totals['f'])
    prob = {'male': prob_m, 'female': prob_f}
    return prob
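The function above multiplies per-character values and leaves out the class prior P(gender). A common variant, sketched below, sums logarithms instead (which avoids underflow on long inputs) and adds the log prior. This is an alternative formulation, not the article's method: the function `log_score`, the toy counts, and the use of one shared vocabulary for both classes are all assumptions made for this example.

```python
import math

def log_score(name, char_counts, n_names, vocab_size, prior, alpha=1.0):
    """log P(class) + sum over characters of log smoothed P(char | class)."""
    score = math.log(prior)
    for ch in name:
        count = char_counts.get(ch, 0)
        score += math.log((count + alpha) / (n_names + alpha * vocab_size))
    return score

# Hypothetical toy counts: 3 male names, 2 female names.
counts_m = {"建": 2, "国": 2, "军": 1, "华": 1}
counts_f = {"丽": 2, "娟": 1, "华": 1}
n_m, n_f = 3, 2
vocab_size = len(set(counts_m) | set(counts_f))  # shared vocabulary of 6 characters
prior_m = n_m / (n_m + n_f)
prior_f = n_f / (n_m + n_f)

def predict(name):
    """Return 1 for male, 0 for female, matching the article's label encoding."""
    lm = log_score(name, counts_m, n_m, vocab_size, prior_m)
    lf = log_score(name, counts_f, n_f, vocab_size, prior_f)
    return 1 if lm > lf else 0

print(predict("建华"), predict("丽娟"))
```

Because the logarithm is monotonic, comparing log scores picks the same winner as comparing the products, as long as the same prior and smoothing are used on both sides.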

5. Compare the two values in the prob dictionary and return the prediction

# Compare the two scores and return the prediction (1 = male, 0 = female)
def result(prob):
    if prob['male'] > prob['female']:
        return 1
    else:
        return 0

6. Run the model on the test set to measure its accuracy

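# Compute accuracy on the test set
i = 0  # number of correct predictions
n = 0  # number of names evaluated
for nam, sex in zip(test['name'], test['sex']):
    predict = result(Prob(nam, totals, nums_list_m, nums_list_f))
    if predict == sex:
        i += 1
    n += 1
accuracy = i / n  # correct predictions / total predictions
print('Accuracy:', accuracy)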
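Accuracy is simply the number of correct predictions divided by the number of predictions. As a rough sketch, the same calculation can be wrapped in a small helper that works on plain lists; the name `accuracy_score` and the sample labels are made up for illustration.

```python
def accuracy_score(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0  # avoid dividing by zero on an empty test set
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical predictions and true labels (1 = male, 0 = female).
preds = [1, 0, 1, 1]
truth = [1, 0, 0, 1]
print(accuracy_score(preds, truth))  # 3 of 4 correct -> 0.75
```

Pairing predictions with labels through `zip` also avoids depending on the DataFrame index, which may no longer run from 0 to n-1 after rows have been dropped.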

Complete code:

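# coding: utf-8
import pandas as pd
from collections import defaultdict

# You need to know the file encoding up front, otherwise decoding fails.
# sep=r'\s+' treats any run of whitespace as the delimiter.
train = pd.read_csv("D:/train.txt", sep=r'\s+', names=['name', 'sex'], encoding='utf-8')
test = pd.read_csv("D:/test.txt", sep=r'\s+', names=['name', 'sex'], encoding='utf-8')

# Data cleaning: drop rows with missing values, then map 男 (male) to 1 and 女 (female) to 0.
# dropna() returns a new DataFrame, so the result must be assigned back.
train = train.dropna().reset_index(drop=True)
test = test.dropna().reset_index(drop=True)
test.loc[test['sex'] == '男', 'sex'] = 1
test.loc[test['sex'] == '女', 'sex'] = 0
train.loc[train['sex'] == '男', 'sex'] = 1
train.loc[train['sex'] == '女', 'sex'] = 0

# Split the training data by gender
names_female = train[train['sex'] == 0]  # female group
names_male = train[train['sex'] == 1]    # male group
# Free memory
del train

# Number of female names and number of male names
totals = {'f': len(names_female), 'm': len(names_male)}

# Frequency of each character within each gender
nums_list_f = defaultdict(float)
for name in names_female['name']:
    for ch in name:
        nums_list_f[ch] += 1.0 / totals['f']

nums_list_m = defaultdict(float)
for name in names_male['name']:
    for ch in name:
        nums_list_m[ch] += 1.0 / totals['m']

# Laplace smoothing
def laplace(char, nums_list, totals, alpha=1):
    # char: the character; nums_list: character-frequency table; totals: number of names in the class
    # .get() avoids inserting unseen characters into the defaultdict
    count = nums_list.get(char, 0) * totals  # number of times the character occurred
    charnums = len(nums_list)                # number of distinct characters in the table
    smooth = (count + alpha) / (totals + alpha * charnums)
    return smooth

# After Laplace smoothing: the male and female scores for a name, returned as a dictionary
def Prob(name, totals, nums_list_m, nums_list_f):
    prob_m = 1
    prob_f = 1
    for ch in name:
        prob_m *= laplace(ch, nums_list_m, totals['m'])
        prob_f *= laplace(ch, nums_list_f, totals['f'])
    prob = {'male': prob_m, 'female': prob_f}
    return prob

# Compare the two scores and return the prediction (1 = male, 0 = female)
def result(prob):
    if prob['male'] > prob['female']:
        return 1
    else:
        return 0

# Compute accuracy on the test set
i = 0  # number of correct predictions
n = 0  # number of names evaluated
for nam, sex in zip(test['name'], test['sex']):
    predict = result(Prob(nam, totals, nums_list_m, nums_list_f))
    if predict == sex:
        i += 1
    n += 1
accuracy = i / n  # correct predictions / total predictions
print('Accuracy:', accuracy)

# Interactive prediction loop; submit an empty line to quit
while True:
    nam = input("Enter a name to predict (empty to quit): ")
    if not nam:
        break
    predict = result(Prob(nam, totals, nums_list_m, nums_list_f))
    if predict == 1:
        print("Prediction: male")
    else:
        print("Prediction: female")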
