[Project in Practice] Feature Mining of P2P Telecom Operator Data with Python

Published: 2025/3/21

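###### [Risk-Control Modeling]

Feature mining of P2P telecom operator data with Python

**@author: sunyaowu** **@datetime: August 2018**

Note: platform data and third-party data are used to build anti-fraud rules based on users' communication information, and to assess how well that information and the call records serve as an early warning of a customer's potential overdue.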

I. Getting the data

1) Database: petty_loan

① User information
  • cl_user_base_info
② Operator data
  • Account information
    cl_operator_basic
  • Billing statements
    cl_operator_bills
  • Call records
    cl_operator_voices_1
    cl_operator_voices_2
③ Address book
  • cl_user_contacts_1
  • cl_user_contacts_2
④ Emergency contacts
  • cl_user_emer_contacts
⑤ Overdue data for loan customers
  • cl_borrow_repay
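2) Data items

  • User's phone number
  • Date the phone number was opened with the operator
  • Date the phone number was registered on the platform
  • Home location and dialing location of the phone number
  • Billing statements of the phone number
  • Call records of the phone number
    • Call date and time
    • Number called
    • Call duration
    • Caller location
  • Full address-book listing of the user
  • Phone numbers of the user's emergency contacts and their relationship to the user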
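3) Value of the information

| Table | Information | Value |
| --- | --- | --- |
| Phone number | Date the phone number was opened | ① Authenticity of the number ② How long the number has been in use ③ User stickiness |
| Call records | ① Call location (range) ② Groups of people called ③ Call duration | ① Value of the number ② Stickiness of the number ③ User activity |
| Address book | ① Size of the address book ② Contact frequency with address-book entries | ① Customer's social range ② Value of the address book |
| Emergency contacts | ① Whether they appear in the address book ② Whether there are recent call records | ① Authenticity of the contact ② Potential fraud risk |
| Overdue accounts | ① Whether overdue | ① Feedback on the communication profile of overdue customers |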
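
As an illustration of the "emergency contact appears in the address book" check from the table above, here is a minimal sketch with pandas. The two toy frames only imitate the shape of cl_user_emer_contacts and cl_user_contacts_*; the rows, the helper name `flag_emer_in_contacts` and the output column `emer_in_contacts` are assumptions made for this example, not part of the original project.

```python
import pandas as pd

# Toy data standing in for cl_user_emer_contacts and cl_user_contacts_*;
# all values are invented for illustration.
emer = pd.DataFrame({
    "user_id": [1, 2, 3],
    "phone": ["13800000001", "13800000002", "13800000003"],
})
contacts = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "phone": ["13800000001", "13900000009", "13700000007", "13800000003"],
})

def flag_emer_in_contacts(emer, contacts):
    """Add a 0/1 column: is the emergency contact in the user's own address book?"""
    pairs = contacts[["user_id", "phone"]].drop_duplicates()
    # indicator=True adds a _merge column telling where each row was found
    merged = emer.merge(pairs, on=["user_id", "phone"], how="left", indicator=True)
    merged["emer_in_contacts"] = (merged["_merge"] == "both").astype(int)
    return merged.drop(columns="_merge")

flags = flag_emer_in_contacts(emer, contacts)
print(flags)
```

A left merge keeps every emergency contact, so users whose contact is missing from the address book stay in the result with a flag of 0 instead of disappearing.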

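II. Data preprocessing + feature analysis + model building

1) Project outline

① Fetch the test data straight from the database with python/sql; the data can also be saved as excel, csv, or pkl files.
② Preprocess and analyze the data.
③ Build Logistic and random forest models to look for relationships between the variables.

2) Step-by-step logic

① Fetch the data from the database. To make the code reusable, three functions are defined:
  • DataBaseSql, dedicated to connecting to the mysql database
  • sql_query, dedicated to storing the sql statements
  • get_data, which combines the sql statements with the mysql connection to produce Dataframe data
② Preprocess the data, including missing values and outliers, in three steps:
  • Merge the data. The data is stored in separate excel files so that it can be handled differently in different situations; merging yields the data needed for this analysis.
  • Handle missing values: fill or drop missing values in the fields that need it.
  • Produce the data used for further analysis.
③ Data analysis + visual report:
  • Statistics on the fetched data: compute indicators for the sample as a whole and print a report.
  • Statistics on calls with emergency contacts: determine whether address-book permission and operator-information permission were granted, and chart the results.
  • Statistics on overdue among customers who received loans.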
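3) Key syntax

① Overall approach to python programming:
  • python: functional style, data flow, code reuse, instantiation, debugging
② Database:
  • mysql: multi-table queries with join, combining tables with union, timestamps gettime, filtering with where
  • python: database connection connect, cursor, fetching rows with fetchall, committing transactions with commit
③ Data processing:
  • Common methods: reading files read_excel, counting value_counts, resetting the index reset_index, joining merge, removing duplicates drop_duplicates, maximum max, dropping nulls dropna, replacing replace, strings str, array shape shape, integer conversion int
  • Special techniques: a for loop that checks whether a field contains a given string; a loop that fills values in batches
④ Visualization + report:
  • Common methods
  • xlsxwriter.Workbook, workbook.add_worksheet, time.sleep(0), worksheet.insert_image, set_column, zip, plt, echart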
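4) Code

① Import the python modules and define the class name:
"""
pro:AntiFraudRule.py
@author: sunyaowu
"""
import numpy as np
import pandas as pd
import time
import os
import pymysql
import sys
import matplotlib.pyplot as plt
import seaborn as sns
import pyecharts
import xlwt
import xlsxwriter

_path = r'C:\Users\A3\Desktop\2:项目\项目\项目22: Python数据库模块搭建,支持增删查改调用'
os.chdir(os.path.join(_path, 'data'))   # work inside the data folder
sys.path.append(_path)                  # make DataBaseSql.py importable
import DataBaseSql

# The methods in the following steps (sql_query, get_data, data_pro, ...)
# all belong to this class; the main block instantiates it as AntiFraudRule().
class AntiFraudRule: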
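② Query the data from the database
  • Write the database query function; it can be saved as DataBaseSql.py and imported
# -*- coding: utf-8 -*-
"""
Created on Mon Aug 20 10:27:06 2018
pro:DataBaseSql.py
@author: sunyaowu
"""
import pymysql
import pandas as pd


class DataBaseSql():
    def __init__(self):
        pass

    @staticmethod   # called as DataBaseSql.DataBaseSql.sql_Select(sql, config)
    def sql_Select(sql, config):
        """Run one SELECT statement and return the rows as a DataFrame."""
        conn = pymysql.connect(**config)
        try:
            with conn.cursor() as cur:   # the with-block closes the cursor
                cur.execute(sql)
                df = pd.DataFrame(cur.fetchall())
            conn.commit()
        except Exception:
            conn.rollback()
            raise                        # let the caller report the failure
        finally:
            conn.close()
        return df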
  • Write the sql statements
    # method of AntiFraudRule
    def sql_query(self, num):
        # emergency contacts joined with user and operator account information
        sql1 = '''
            select distinct a.user_id, gmt_modified, real_name, basic_phone_num,
                   b.phone as emer_phone, a.phone as user_phone, relation
            from cl_user_base_info a
            left join cl_user_emer_contacts b on b.user_id = a.user_id
            left join cl_operator_basic c on c.user_id = b.user_id
            where a.user_id < %i
        ''' % (num)
        # address book
        sql2 = '''
            select user_id, phone from cl_user_contacts_1 where user_id < %i
            union
            select user_id, phone from cl_user_contacts_2 where user_id < %i
            union
            select user_id, phone from cl_user_contacts_3 where user_id < %i
        ''' % (num, num, num)
        # call records
        sql3 = '''
            select user_id, voice_to_number, voice_place, voice_date, voice_duration
            from cl_operator_voices_1 where user_id < %i
            union
            select user_id, voice_to_number, voice_place, voice_date, voice_duration
            from cl_operator_voices_2 where user_id < %i
            union
            select user_id, voice_to_number, voice_place, voice_date, voice_duration
            from cl_operator_voices_3 where user_id < %i
        ''' % (num, num, num)
        # repayment and overdue data
        sql4 = '''
            select user_id, penalty_amout, penalty_day
            from cl_borrow_repay
            where user_id < %i
        ''' % (num)
        # emergency contact phone numbers
        sql5 = '''
            select distinct user_id, phone
            from cl_user_emer_contacts
            where user_id < %i
        ''' % (num)
        return sql1, sql2, sql3, sql4, sql5
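
The union queries over the numbered tables repeat the same select several times, and the ID limit is spliced in with `%i`. One possible alternative, sketched below, builds the union from a list of shard numbers and passes the limit as a bound parameter. The helper `union_query` is hypothetical, and the demo runs on an in-memory `sqlite3` database with invented rows; sqlite3 uses `?` as its placeholder, whereas pymysql expects `%s`.

```python
import sqlite3

def union_query(columns, table_prefix, shards, placeholder="?"):
    """Build 'select ... from <prefix>_1 where user_id < ? union select ...'."""
    # Table names cannot be bound as parameters, so they are assembled from a
    # fixed prefix and integer shard numbers, never from user input.
    parts = [
        "select {cols} from {prefix}_{n:d} where user_id < {ph}".format(
            cols=", ".join(columns), prefix=table_prefix, n=n, ph=placeholder)
        for n in shards
    ]
    return " union ".join(parts)

sql = union_query(["user_id", "phone"], "cl_user_contacts", [1, 2])

# In-memory demo tables with invented rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table cl_user_contacts_1 (user_id integer, phone text);
    create table cl_user_contacts_2 (user_id integer, phone text);
    insert into cl_user_contacts_1 values (1, '13800000001'), (50, '13800000050');
    insert into cl_user_contacts_2 values (2, '13800000002'), (1, '13800000001');
""")
limit = 10
rows = conn.execute(sql, (limit, limit)).fetchall()   # one value per placeholder
conn.close()
print(sorted(rows))
```

Because `union` (unlike `union all`) removes duplicate rows, the contact that appears in both shards is returned only once.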
  • Query the data from the database and save it as excel files
    # method of AntiFraudRule
    def get_data(self, name):
        config = {
            'host': 'XXXXXXXXXXX',
            'port': 3XX6,            # redacted in the original; replace with the real port
            'db': 'pXXXXX_loan',
            'user': 'cash_XXXXXyw_r',
            'password': 'IxCZIXXXXXXXXXXXXXXlext5G',
            'charset': 'utf8mb4',
            'cursorclass': pymysql.cursors.DictCursor,
        }
        # number is the module-level ID limit set in the main block
        sql_ = list(self.sql_query(number))
        for n, i in enumerate(sql_):
            try:
                data = DataBaseSql.DataBaseSql.sql_Select(i, config)
                print('Bingo,get:%i!' % (n + 1))
            except Exception:
                print('Bingo,error!')
                continue             # nothing to save when the query failed
            data.to_excel(os.path.join(_path, 'data', '%s.xlsx' % name[n]), index=False)
② Data processing
    # method of AntiFraudRule
    def data_pro(self, name):
        # base table: users with their emergency contacts
        data0 = pd.read_excel('%s.xlsx' % name[0])
        # address book: number of entries per user
        data1 = pd.read_excel('%s.xlsx' % name[1])
        _data1 = data1['user_id'].value_counts().reset_index()
        _data1.columns = 'user_id', 'phone_counts'
        data_mix = pd.merge(data0, _data1, on='user_id', how='left')
        # call records: number of calls per user
        data2 = pd.read_excel('%s.xlsx' % name[2])
        _data2 = data2['user_id'].value_counts().reset_index()
        _data2.columns = 'user_id', 'voice_counts'
        #data22.rename(columns = {'index':'user_id','user_id':'voice_counts'},inplace = True)
        data_mix = pd.merge(data_mix, _data2, on='user_id', how='left')
        # numbers called more than 5 times, and the highest count per user
        _data2 = data2['voice_to_number'].value_counts().reset_index()
        _data2.columns = 'voice_to_number', 'tel_counts'
        __data2 = (pd.merge(data2.dropna(), _data2, on='voice_to_number', how='left')
                   .drop_duplicates(subset=['voice_to_number'])
                   .astype({'tel_counts': 'int'}))
        __data2 = __data2[__data2['tel_counts'] > 5]
        _data2 = __data2['user_id'].value_counts().reset_index()
        _data2.columns = 'user_id', 'tel_5_counts'
        _data2 = pd.merge(_data2,
                          __data2[['user_id', 'tel_counts']].groupby('user_id').max().reset_index(),
                          on='user_id', how='left')
        _data2.rename(columns={'tel_counts': 'max_tel_count'}, inplace=True)
        data_mix = pd.merge(data_mix, _data2, on='user_id', how='left')
        # emergency contacts that appear in the call records
        data4 = pd.read_excel('%s.xlsx' % name[4])
        data24_mix = (pd.merge(data2, data4, on='user_id', how='left')
                      .drop_duplicates(subset=['voice_to_number'])
                      .dropna()
                      .reset_index())
        #data24_mix['voice_to_number'] = data24_mix['voice_to_number'].str.replace(')','').str.replace('(','').str.replace('*','').str.replace('+','')
        #for i in range(data24_mix.shape[0]):
        #    data24_mix['voice_to_number'][i] = int(''.join(list(filter(lambda ch: ch in '0123456789',data24_mix['voice_to_number'][i]))))
        #    print(i)
        #data24_mix = data24_mix.dropna()
        # vectorized comparison; replaces the row-by-row loop with chained assignment
        data24_mix['emer_in_voice'] = (data24_mix['voice_to_number'].astype(str)
                                       == data24_mix['phone'].astype(str)).astype(int)
        data24_mix = data24_mix[data24_mix['emer_in_voice'] == 1]
        data_mix = pd.merge(data_mix, data24_mix[['user_id', 'emer_in_voice']], on='user_id', how='left')
        # repayment data
        data3 = pd.read_excel('%s.xlsx' % name[3])
        data_mix = pd.merge(data_mix, data3, on='user_id', how='left')
        # users without a match get 0 instead of NaN
        for i in ['phone_counts', 'voice_counts', 'tel_5_counts', 'max_tel_count', 'emer_in_voice']:
            data_mix[i] = data_mix[i].fillna(0)
        data_mix.to_excel(os.path.join(_path, 'data', '%s.xlsx' % name[5]))
        emer_in_voice = data_mix[data_mix['emer_in_voice'] == 1]
        emer_in_voice.to_excel(os.path.join(_path, 'data', '%s.xlsx' % name[6]))
        emer_in_repay = data_mix[data_mix['penalty_day'] >= 0]
        '''
        emer_in_repay['penalty'] = 0
        for i in range(emer_in_repay.shape[0]):
            a = emer_in_repay['penalty_day'][i]
            print(a)
            if a > 0:
                emer_in_repay['penalty'][i] = 1
        '''
        emer_in_repay.to_excel(os.path.join(_path, 'data', '%s.xlsx' % name[7]))
        return data_mix, emer_in_voice, emer_in_repay
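
The commented-out lines in data_pro hint that the called numbers may contain characters such as `(`, `)`, `*` and `+`, which would make a plain string comparison with the emergency contact's number fail. A minimal sketch of one way to normalize both sides before comparing is shown below; the helper `digits_only` and the sample rows are assumptions for illustration, and country prefixes such as +86 are deliberately not handled.

```python
import pandas as pd

def digits_only(series):
    """Cast to str and drop every non-digit character ('+', '(', ')', '*', '-', spaces)."""
    return series.astype(str).str.replace(r"\D", "", regex=True)

# Invented rows: each call carries the user's emergency contact number in 'phone'.
calls = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "voice_to_number": ["(0571)8888-0001", "13800000001", "13900000002", "13700000003"],
    "phone": ["13800000001", "13800000001", "+13900000002", "13600000004"],
})

same = digits_only(calls["voice_to_number"]) == digits_only(calls["phone"])
calls["emer_in_voice"] = same.astype(int)

# One flag per user: 1 if any of the user's calls went to the emergency contact.
per_user = calls.groupby("user_id")["emer_in_voice"].max().reset_index()
print(per_user)
```

Aggregating with `max` per user gives one row per user_id, which keeps a later merge into the feature table from duplicating users.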
③ Data analysis + visual report
    # methods of AntiFraudRule; workbook is the module-level xlsxwriter.Workbook
    # created in the main block

    '''Feature description, analysis and visual report for all fetched data'''
    def data_query_description(self, num, name):
        data = pd.read_excel(os.path.join(_path, 'data', '%s.xlsx' % name[5]))
        worksheet1 = workbook.add_worksheet('query_data_report')
        worksheet1.set_column('A:B', 50)
        a = ['address book', 'operator information', 'call record', 'call record']
        b = ['phone_counts', 'voice_counts', 'tel_5_counts', 'emer_in_voice']
        c = ['granted address-book permission', 'granted operator-information permission',
             'have call records', 'have calls with the emergency contact']
        data_count = []
        m = 0
        for i, j, n in zip(a, b, c):
            _p = len(data)
            _q = len(data[data[j] > 0])
            data_count.append(_q)
            d = 'Fetched %i customer records, %i of them with valid %s data' % (num, _q, i)
            e = 'Share of users who %s: %.2f%%' % (n, _q / _p * 100)
            print(d, e)
            worksheet1.write(m, 0, d)   # write to the excel report
            worksheet1.write(m, 1, e)   # write to the excel report
            time.sleep(0)
            m += 1
        print('-----next-----')
        time.sleep(0)
        d = 'Of the %i users who granted operator permission, %i have call records (%.2f%%)' % (
            data_count[1], data_count[2], data_count[2] / data_count[1] * 100)
        print(d)
        e = 'Of the %i users with call records, %i called their emergency contact (%.2f%%)' % (
            data_count[2], data_count[3], data_count[3] / data_count[2] * 100)
        print('-----next-----')
        print(e)
        print('Bingo!')
        worksheet1.write(m + 1, 0, d)   # write to the excel report
        worksheet1.write(m + 2, 0, e)   # write to the excel report
        self.data_query_show(num, data, name, worksheet1)
        workbook.close()                # close the excel file
        return data

    def data_query_show(self, num, data, name, worksheet):
        box = ['phone_counts', 'voice_counts', 'tel_5_counts', 'max_tel_count']
        for item in box:
            #data[item].bar()
            #plt.show()   # plot directly from the Dataframe
            self.hist_show(data[data[item] > 0], item, 100)   # histogram of the variable
        for p, q, l in zip(box, [8, 8, 18, 18], ['A', 'B', 'A', 'B']):
            worksheet.insert_image('%s%i' % (l, q),
                                   os.path.join(_path, 'report+image', '%s.png' % p),
                                   {'x_scale': 0.5, 'y_scale': 0.5})   # write to the excel report
        #for item in ['relation','emer_in_voice']:
        #    data[item].bar()
        #    plt.show()
        #self.hist_show(data[data[item] > 0],item,100)   # box plot of the variable
        #self.data_report():
        #self.echart_show(data['relation'])

    '''Feature description, analysis and visual report for data with emergency contacts'''
    def data_emer_description(self, num, name):
        data = pd.read_excel(os.path.join(_path, 'data', '%s.xlsx' % name[5]))
        worksheet2 = workbook.add_worksheet('emer_data_report')
        a = ['address book', 'operator information', 'call record', 'call record']
        b = ['phone_counts', 'voice_counts', 'tel_5_counts', 'emer_in_voice']
        c = ['granted address-book permission', 'granted operator-information permission',
             'have call records', 'have calls with the emergency contact']
        data_count = []
        for i, j, n in zip(a, b, c):
            _p = len(data)
            _q = len(data[data[j] > 0])
            data_count.append(_q)
            print('Fetched %i customer records, %i of them with valid %s data' % (num, _q, i))
            print('Share of users who %s: %.2f%%' % (n, _q / _p * 100))
        print('-----next-----')
        time.sleep(0)
        #type(data_count[0])
        print('Of the %i users who granted operator permission, %i have call records (%.2f%%)' % (
            data_count[1], data_count[2], data_count[2] / data_count[1] * 100))
        print('-----next-----')
        time.sleep(0)
        print('Of the %i users with call records, %i called their emergency contact (%.2f%%)' % (
            data_count[2], data_count[3], data_count[3] / data_count[2] * 100))
        print('Bingo!')
        # the images produced along the way are written to the report as well
        self.data_query_show(num, data, name, worksheet2)
        return data

    def data_emer_show(self, name):
        data = pd.read_excel(os.path.join(_path, 'data', '%s.xlsx' % name[6]))
        for item in ['phone_counts', 'voice_counts', 'tel_5_counts']:
            # keep positive values below the maximum (the original used the undefined name data1)
            self.hist_show(data[(data[item] > 0) & (data[item] < 1 * data[item].max())], item, 20)
        return data

    '''Feature description, analysis and visual report for loan repayment data'''
    def data_repay_description(self, num, name):
        data = pd.read_excel(os.path.join(_path, 'data', '%s.xlsx' % name[5]))
        worksheet3 = workbook.add_worksheet('repay_data_report')
        a = ['address book', 'operator information', 'call record', 'call record']
        b = ['phone_counts', 'voice_counts', 'tel_5_counts', 'emer_in_voice']
        c = ['granted address-book permission', 'granted operator-information permission',
             'have call records', 'have calls with the emergency contact']
        data_count = []
        for i, j, n in zip(a, b, c):
            _p = len(data)
            _q = len(data[data[j] > 0])
            data_count.append(_q)
            print('Fetched %i customer records, %i of them with valid %s data' % (num, _q, i))
            print('Share of users who %s: %.2f%%' % (n, _q / _p * 100))
        print('-----next-----')
        time.sleep(0)
        #type(data_count[0])
        print('Of the %i users who granted operator permission, %i have call records (%.2f%%)' % (
            data_count[1], data_count[2], data_count[2] / data_count[1] * 100))
        print('-----next-----')
        time.sleep(0)
        print('Of the %i users with call records, %i called their emergency contact (%.2f%%)' % (
            data_count[2], data_count[3], data_count[3] / data_count[2] * 100))
        print('Bingo!')
        # the images produced along the way are written to the report as well
        self.data_query_show(num, data, name, worksheet3)
        return data

    def data_repay_show(self, name):
        data = pd.read_excel(os.path.join(_path, 'data', '%s.xlsx' % name[7]))
        for item in ['phone_counts', 'voice_counts', 'tel_5_counts']:
            self.hist_show(data[data[item] > 0], item, 20)
        for item in ['relation', 'phone_counts', 'voice_counts']:
            self.boxplot_show(data, item, 'penalty_day')
        return data

    # =============================================================================
    # Visualization and data report helpers
    # =============================================================================
    '''Frequency distribution histogram'''
    def hist_show(self, data, field, bin):
        '''
        data[field].hist(bins = bin,histtype = 'bar',align = 'mid',orientation = 'vertical',alpha = 0.5,normed = True )
        data[field].plot(kind = 'kde',style = 'k--')
        '''
        # density replaces the normed argument removed from matplotlib;
        # the original hard-coded bins=40 and ignored the bin argument
        plt.hist(data[field], bins=bin, density=False, facecolor="blue", edgecolor="black", alpha=0.7)
        plt.title(field)
        plt.savefig(os.path.join(_path, 'report+image', '%s.png' % field), dpi=100)
        plt.show()

    '''Bar chart'''
    '''Box plot'''
    '''Two-variable box plot'''
    def boxplot_show(self, data, field1, field2):
        data = pd.concat([data[field1], data[field2]], axis=1)
        fig = sns.boxplot(x=field1, y=field2, data=data)
        plt.title(field1)
        plt.show()
        #plt.savefig('%s.png' % field, dpi=200)

    '''Local echarts file'''
    def echart_show(self, data):
        # import paths of the old pyecharts 0.5.x releases
        from pyecharts import Bar, Line
        from pyecharts.engine import create_default_environment
        bar = Bar("Emergency contact distribution", "Subtitle")
        bar.add('Contacts', data)
        # bar.print_echarts_options()   # only prints the options, handy when debugging
        bar.render()                    # generate a local HTML file

    '''Write to an excel file and generate the report'''
    def data_report(self):
        pass
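
The description functions divide one count by another, for example data_count[2]/data_count[1], which raises ZeroDivisionError when no user granted operator permission. A small guard like the one below is one way to keep the report running; `pct` and `summary_line` are hypothetical helpers written for this sketch, not functions from the project.

```python
def pct(part, total):
    """Return part/total as a percentage, or 0.0 when total is 0."""
    return 100.0 * part / total if total else 0.0

def summary_line(label, part, total):
    """Format one report line in the style of the description functions."""
    return "%s: %i of %i users (%.2f%%)" % (label, part, total, pct(part, total))

line_a = summary_line("has call records", 30, 120)
line_b = summary_line("called emergency contact", 0, 0)   # empty denominator
print(line_a)
print(line_b)
```

Reporting 0.00% for an empty group is a design choice; returning None and printing "n/a" would be an equally reasonable alternative.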

④ Feature mining
  • To be continued!
    def model_result(self):
        pass
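
Since the feature-mining step is left open, the following is only a sketch of one simple first look that fits the stated goal: comparing the overdue rate between users with and without a given flag. The rows are invented, the helper `overdue_rate_by` is hypothetical, and treating penalty_day > 0 as "overdue" follows the commented-out code in data_pro.

```python
import pandas as pd

# Invented rows in the shape of data_mix: one row per user.
data_mix = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "emer_in_voice": [1, 1, 1, 0, 0, 0],
    "penalty_day": [0, 0, 2, 5, 0, 7],
})

def overdue_rate_by(data, feature, target="penalty_day"):
    """Number of users and share of overdue users (target > 0) per feature value."""
    overdue = (data[target] > 0).astype(int)
    out = overdue.groupby(data[feature]).agg(["count", "mean"])
    out.columns = ["users", "overdue_rate"]
    return out

rates = overdue_rate_by(data_mix, "emer_in_voice")
print(rates)
```

A clear gap between the two groups on real data would suggest the flag is worth testing as a rule or as a model input; with only a handful of rows, as here, the numbers carry no meaning.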
⑤ Main function
if __name__ == "__main__":
    AFR = AntiFraudRule()
    number = 10000   #int(input('Enter the number of IDs to query!\n'))
    name = ['emer_relation_data', 'phonebook_data', 'tel_records_data', 'repay_data',
            'emer_real_data', 'data_mix', 'emer_in_voice', 'emer_in_repay']
    _begin_time = time.time()
    ''' 1. Data acquisition'''
    print("---------- 1.get_data ----------")
    #AFR.get_data(name)
    ''' 2. Data processing'''
    print("---------- 2.data_pro ----------")
    #data_mix,emer_in_voice,emer_in_repay = AFR.data_pro(name)
    ''' 3. Data analysis'''
    print("---------- 3.data_show ----------")
    workbook = xlsxwriter.Workbook('report.xlsx')
    data_query = AFR.data_query_description(number, name)
    #data_emer = AFR.data_emer_description(number,name)
    #data_repay = AFR.data_repay_description(number,name)
    ''' 4. Feature mining'''
    #print("---------- 4.data_rule ----------")
    #print('finished!')
    ''' 5. Visual report'''
    #print("---------- 5.result_model ----------")
    #print('finished!')
    _end_time = time.time()
    print('You have finished!\nTotal time used: {x:.2f}s'.format(x=_end_time - _begin_time))

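III. Summary

  • Motivation: this project is a practice exercise for beginners. Its focus is to show the complete data-preparation workflow of a data mining project, including the data acquisition, processing, mining, and visualization modules. It took five working days, which is acceptable for a beginner.
  • Problems: because the data has few dimensions, there is little to visualize and few features to mine, so only the basic functionality was implemented without digging deeper; more will be added when time allows. The project looks simple, but it already uses a number of basic yet efficient python tools, for example pandas, matplotlib, sql, url, xlwt, echart, os, and path.
  • Reflection: code is only a tool for carrying out ideas, and it takes time to go from first contact to understanding, use, and fluency. What matters is whether the coder's reasoning is clear, and whether the work is carried out in a layered, structured, and logical way that follows the engineering workflow, the project flow, and the data flow.
  • Finally!
    • Two habits matter a great deal for efficient learning:
      ① Getting into a focused state quickly.
      ② Staying in a focused state for a long time.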
