Data Mining Workflow (2): Data Preprocessing
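- Data-processing notes:
  - Data types:
    - patient_id (patient id) and case_no (admission record id): these ids may be read in as int or float, which makes merge fail to match. Set the type when reading, e.g. dtype={'patient_id': str}.
    - age, weight and similar fields stored as str must be converted to float before computing BMI; use value=round(value,2) to keep 2 decimal places.
    - Check whether start_datetime, end_datetime and similar fields are str or datetime values: a str cannot take part in time arithmetic such as adding datetime.timedelta(days=7).
    - Time format: normalise to 2018-01-01 18:46:23 rather than 13/09/2018 18:46:23. While the values are still strings, sort_values() compares them character by character, so 12/04 is placed before 22/02.
  - Clarify how records map to each other: choose the inclusion/exclusion key (patient_id or case_no), and when merging, be clear about how each id relates to medication and admission records.
    - One-to-one: one patient corresponds to one id.
    - One-to-many: one patient may have several admission records (case_no); one admission record may have several medication records.
    - Grouping by daily dose at discharge: a patient may be admitted several times, and the dose may change at each discharge. To study readmission, group the patient by the daily dose at the first discharge and then analyse the later admissions. Otherwise the readmission records may fall into a different dose group, and readmissions can no longer be detected or compared between dose groups.
  - Before operating on a DataFrame:
    - Drop null values
    - Drop duplicates
    - Drop outliers: text, and extreme values (absolute value greater than 100 times the median)
    - Sort
  - Before saving a DataFrame:
    - Sort
    - Reset the index: df=df.reset_index(drop=True)
    - Print summary statistics: print(df.shape); print(df['patient_id'].nunique()); print(df['case_no'].nunique())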
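The notes on id types and time formats can be illustrated with a small sketch. The two-row CSV below is made up for the example; only the column names patient_id, case_no and start_datetime come from the text.

```python
import io
import pandas as pd

# Hypothetical two-row extract; a real file would be read from disk.
raw = io.StringIO(
    "patient_id,case_no,start_datetime\n"
    "0012,900001,22/02/2018 08:00:00\n"
    "0012,900002,12/04/2018 09:30:00\n"
)

# Read id columns as strings so leading zeros survive and merge keys match.
df = pd.read_csv(raw, dtype={"patient_id": str, "case_no": str})

# Sorting the raw strings is lexicographic: "12/04/..." sorts before "22/02/...".
by_string = df.sort_values("start_datetime")["case_no"].tolist()

# Parse the day-first timestamps, then sort chronologically.
df["start_datetime"] = pd.to_datetime(df["start_datetime"], format="%d/%m/%Y %H:%M:%S")
by_time = df.sort_values("start_datetime")["case_no"].tolist()

# Datetime values support arithmetic such as a 7-day offset.
week_later = df["start_datetime"] + pd.Timedelta(days=7)
```

Here by_string puts the April record first, while by_time puts the February record first, which is the ordering the analysis needs.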
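Contents
Medical DM data-processing workflow:
1. Preprocessing the raw data (raw_data)
Importing packages and helper functions
1.1 Preprocessing the raw medication data doctor_order
1.2 Preprocessing the raw diagnosis data diagnostic
1.3 Preprocessing the raw lab data test_record + test_result
2. Inclusion and exclusion (keys: patient_id, case_no)
Inclusion: patients taking rivaroxaban
Inclusion: patients with a discharge diagnosis of atrial fibrillation
Merging rivaroxaban orders with discharge atrial fibrillation diagnoses
Exclusion: valve replacement surgery
Exclusion: valvular atrial fibrillation in the diagnosis
3. Computing the rivaroxaban daily dose
TDM test information
Merging medication and TDM tests
Tacrolimus use within 7 days before a TDM test
Checking the 15-day interval between consecutive TDM tests of the same patient
4. Merging demographic data
4.1 Merging demographic features
4.2 Filling in missing gender, age and height
4.3 Imputation with a random forest
5. Adding medical history (diabetes, hypertension)
6. Adding concomitant medication
6.1 Extracting concomitant medication
6.2 Filling in concomitant medication times
6.3 Extracting concomitant medication within 7 days of a TDM test
6.4 Dropping columns with more than 50% missing values
7. Adding other tests
7.1 Creatinine (renal function)
7.2 Liver function
7.3 Blood cell analysis
7.4 Coagulation
7.5 Fecal occult blood
7.6 Dropping missing values
Admission and discharge times and admission diagnosis
High- and low-dose groups
Grouping into high- and low-dose groups
Statistics for the high- and low-dose groups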
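Medical DM data-processing workflow:
1. Preprocessing the raw data (raw_data)
Both the target drug and the concomitant drugs have to be extracted from the doctor_order medication table, and both the TDM tests and the other lab tests have to be extracted from df_test (df_test_record + df_test_result). Running a simple preprocessing pass over the raw data first makes everything downstream much easier: there is no need to clean the data once when extracting the target drug and again when extracting the concomitant drugs.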
Importing packages and helper functions
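The helpers defined in this section parse day-first timestamps and drop outliers row by row. As a hedged alternative, the same two ideas can be written with vectorised pandas calls. The function names parse_day_first and drop_outliers and the demo values are assumptions made for this sketch; the 100 × median rule is the one stated in the notes at the top.

```python
import pandas as pd

def parse_day_first(series):
    """Parse 'dd/mm/YYYY HH:MM:SS' strings; unparseable values become NaT."""
    return pd.to_datetime(series, format="%d/%m/%Y %H:%M:%S", errors="coerce")

def drop_outliers(df, feature, factor=100):
    """Keep rows whose numeric value is below factor * |median| in absolute value."""
    values = pd.to_numeric(df[feature], errors="coerce")  # free text becomes NaN
    median = values.median()
    keep = values.abs() < factor * abs(median)            # NaN compares as False
    return df[keep].reset_index(drop=True)

# Hypothetical height records: one text entry and one implausible value.
demo = pd.DataFrame({"record_content": ["170", "165", "bedridden", "170000", "172"]})
cleaned = drop_outliers(demo, "record_content")
parsed = parse_day_first(pd.Series(["13/09/2018 18:46:23", "not a date"]))
```

With errors="coerce", bad values turn into NaT or NaN instead of raising, so they can be counted and dropped explicitly afterwards.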
# _*_ coding: utf-8 _*_
# @Time: 2021/10/27 17:51
# @Author: yuyongsheng
# @Software: PyCharm
# @Description:
# Import packages
import pandas as pd
pd.set_option('mode.chained_assignment', None)
import numpy as np
import os
project_path=os.getcwd()
# Helper functions
# Convert a string to a datetime
import datetime
def str_to_datetime(x):
    try:
        a = datetime.datetime.strptime(x, "%d/%m/%Y %H:%M:%S")
        return a
    except (ValueError, TypeError):
        return np.nan
# Filter outliers
def filter_exce_value(df,feature):
    # Drop values that contain no digits (free text)
    df=df[df[feature].astype('str').str.contains(r'\d')]
    # Drop extreme values (absolute value of at least 100 times the median)
    values=pd.to_numeric(df[feature],errors='coerce')
    median_value=values.median()
    df=df[values.abs() < (100 * abs(median_value))]
    return df

1.1 Preprocessing the raw medication data doctor_order
# Preprocess the raw dataset: normalise the time format. Outlier removal should be done per drug rather than globally, because dose units are not yet unified at this point;
# computing the daily dose and unifying dose units is easier once a specific drug has been selected
#%% md
## Preprocess the raw medication data doctor_order
df_doctor_order=pd.read_csv(project_path+'/data/raw_data/2-doctor_order.csv')
print(df_doctor_order.shape)
print(df_doctor_order['patient_id'].nunique())
print(df_doctor_order['case_no'].nunique())
# Keep orders whose status is '停止' (stopped)
df_doctor_order=df_doctor_order[df_doctor_order['statusdesc']=='停止']
print(df_doctor_order.shape)
print(df_doctor_order['patient_id'].nunique())
print(df_doctor_order['case_no'].nunique())
# Drop samples whose medication route is '取药用' (dispensing only)
df_doctor_order=df_doctor_order[df_doctor_order['medication_way']!='取药用']
print(df_doctor_order.shape)
print(df_doctor_order['patient_id'].nunique())
print(df_doctor_order['case_no'].nunique())
# Drop rows with an empty dosage
df_doctor_order=df_doctor_order[(df_doctor_order['dosage'].notnull()) & (df_doctor_order['dosage'].astype('str')!='nan')]
df_doctor_order=df_doctor_order.reset_index(drop=True)
print(df_doctor_order.shape)
print(df_doctor_order['patient_id'].nunique())
print(df_doctor_order['case_no'].nunique())
# Drop duplicates
df_doctor_order=df_doctor_order.drop_duplicates(subset=['patient_id','case_no','drug_name','dosage','frequency','start_datetime','end_datetime'],keep='first')
df_doctor_order=df_doctor_order.reset_index(drop=True)
print(df_doctor_order.shape)
print(df_doctor_order['patient_id'].nunique())
print(df_doctor_order['case_no'].nunique())
#%%
# Keep the useful fields of doctor_order
df_doctor_order=df_doctor_order[['patient_id','case_no','long_d_order','drug_name','amount','drug_spec','dosage','frequency','medication_way','start_datetime','end_datetime']]
# Convert the medication start and end times
df_doctor_order['start_datetime']=df_doctor_order['start_datetime'].apply(str_to_datetime)
df_doctor_order['end_datetime']=df_doctor_order['end_datetime'].apply(str_to_datetime)
print(df_doctor_order.shape)
print(df_doctor_order['patient_id'].nunique())
print(df_doctor_order['case_no'].nunique())
#%%
# Save the preprocessed medication data doctor_order
df_doctor_order.to_excel(project_path+'/data/pre_processed_raw_data/df_doctor_order.xlsx')

1.2 Preprocessing the raw diagnosis data diagnostic
## Preprocess the raw diagnosis data diagnostic
#%%
# dtype keeps pandas from changing the type of a column while reading
df_diagnostic=pd.read_csv(project_path+'/data/raw_data/3-diagnostic_record.csv',dtype={'case_no':str})
print(df_diagnostic.shape)
print(df_diagnostic['patient_id'].nunique())
print(df_diagnostic['case_no'].nunique())
print(df_diagnostic)
#%%
# Drop rows with an empty diagnosis
df_diagnostic=df_diagnostic[(df_diagnostic['diagnostic_content'].notnull())& (df_diagnostic['diagnostic_content'].astype('str')!='nan')]
print(df_diagnostic.shape)
print(df_diagnostic['patient_id'].nunique())
print(df_diagnostic['case_no'].nunique())
# Drop rows with an empty admission record case_no
df_diagnostic=df_diagnostic[(df_diagnostic['case_no'].notnull()) & (df_diagnostic['case_no'].astype('str')!='nan')]
df_diagnostic=df_diagnostic.reset_index(drop=True)
print(df_diagnostic.shape)
print(df_diagnostic['patient_id'].nunique())
print(df_diagnostic['case_no'].nunique())
print(df_diagnostic)
# Drop duplicates
df_diagnostic=df_diagnostic.drop_duplicates(subset=['patient_id','case_no','record_date','diagnostic_type','diagnostic_content'],keep='first')
df_diagnostic=df_diagnostic.reset_index(drop=True)
print(df_diagnostic.shape)
print(df_diagnostic['patient_id'].nunique())
print(df_diagnostic['case_no'].nunique())
#%%
# Convert the time format in diagnostic
df_diagnostic['record_date']=df_diagnostic['record_date'].astype('str').apply(str_to_datetime)
# Keep the useful fields of diagnostic
df_diagnostic=df_diagnostic[['patient_id','case_no','record_date','diagnostic_type','diagnostic_content']]
print(df_diagnostic)
#%%
# Save the preprocessed diagnosis data diagnostic
df_diagnostic.to_excel(project_path+'/data/pre_processed_raw_data/df_diagnostic.xlsx')

1.3 Preprocessing the raw lab data test_record + test_result
## Preprocess the raw lab data test_record + test_result
#%%
# Build df_test by merging test_record and test_result. It is essential: it contains the TDM results and the safety indicators.
# Test records test_record
df_test_record=pd.read_csv(project_path+'/data/raw_data/4-test_record.csv',dtype={'case_no':str})
df_test_record=df_test_record[['test_record_id','patient_id','case_no','test_date','clinical_diagnosis']]
print(df_test_record.shape)
print(df_test_record['patient_id'].nunique())
print(df_test_record['case_no'].nunique())
# Drop rows with an empty test_date
df_test_record=df_test_record[df_test_record['test_date'].notnull()]
print(df_test_record.shape)
print(df_test_record['patient_id'].nunique())
print(df_test_record['case_no'].nunique())
# Drop rows with an empty case_no
df_test_record=df_test_record[df_test_record['case_no'].notnull()]
df_test_record=df_test_record.reset_index(drop=True)
print(df_test_record.shape)
print(df_test_record['patient_id'].nunique())
print(df_test_record['case_no'].nunique())
# Drop duplicates in test_record
df_test_record=df_test_record.drop_duplicates(subset=['test_record_id','patient_id','case_no','test_date','clinical_diagnosis'],keep='first')
df_test_record=df_test_record.reset_index(drop=True)
print(df_test_record.shape)
print(df_test_record['patient_id'].nunique())
print(df_test_record['case_no'].nunique())
# Convert the test time format
df_test_record['test_date']=df_test_record['test_date'].astype('str').apply(str_to_datetime)
print(df_test_record)
#%%
# Save the preprocessed test_record
df_test_record.to_excel(project_path+'/data/pre_processed_raw_data/df_test_record.xlsx')
#%%
# Test results test_result
df_test_result=pd.read_csv(project_path+'/data/raw_data/4-test_result.csv')
df_test_result=df_test_result[['test_record_id','project_name','test_result','refer_scope','synonym']]
print(df_test_result.shape)
# Drop rows with an empty project_name
df_test_result=df_test_result[df_test_result['project_name'].notnull()]
print(df_test_result.shape)
# Drop rows with an empty test_result
df_test_result=df_test_result[df_test_result['test_result'].notnull()]
df_test_result=df_test_result.reset_index(drop=True)
print(df_test_result.shape)
# Strip the < and > signs
df_test_result['test_result']=df_test_result['test_result'].astype('str').apply(lambda x:x.replace('<',''))
df_test_result['test_result']=df_test_result['test_result'].astype('str').apply(lambda x:x.replace('>',''))
print(df_test_result)
# Drop duplicates in test_result
df_test_result=df_test_result.drop_duplicates(subset=['test_record_id','project_name','test_result','refer_scope','synonym'],keep='first')
df_test_result=df_test_result.reset_index(drop=True)
print(df_test_result.shape)
#%%
# The preprocessed test_result is too large to save to Excel
# df_test_result.to_excel(project_path+'/data/pre_processed_raw_data/df_test_result.xlsx')
#%%
# Merge test_record and test_result on the unique test_record_id
df_test=pd.merge(df_test_record,df_test_result,on=['test_record_id'],how='inner')
print(df_test)

2. Inclusion and exclusion (keys: patient_id, case_no)
- Inclusion condition 1
- Inclusion condition 2
- Exclusion condition 1
- Exclusion condition 2
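The inclusion and exclusion conditions above amount to set membership on the chosen key. A minimal sketch, assuming case_no is the key and is stored as a string everywhere; the toy table and the three id sets are invented for illustration:

```python
import pandas as pd

# Hypothetical medication orders keyed by case_no (strings).
orders = pd.DataFrame({
    "case_no": ["1", "1", "2", "3", "4"],
    "drug_name": ["利伐沙班片", "利伐沙班片", "利伐沙班片", "利伐沙班片", "华法林钠片"],
})
af_cases = {"1", "2", "3"}    # inclusion: discharge diagnosis of atrial fibrillation
surgery_cases = {"2"}         # exclusion: valve replacement surgery
valvular_cases = {"3"}        # exclusion: valvular atrial fibrillation

# Inclusion condition 1: the target drug appears in the order.
cohort = orders[orders["drug_name"].str.contains("利伐沙班")]
# Inclusion condition 2: the admission has a qualifying diagnosis.
cohort = cohort[cohort["case_no"].isin(af_cases)]
# Exclusion conditions: remove admissions found in either exclusion set.
cohort = cohort[~cohort["case_no"].isin(surgery_cases | valvular_cases)]
cohort = cohort.reset_index(drop=True)
```

Compared with looping over every case_no and concatenating the pieces, isin filters the whole table in one step and keeps the row order of the input.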
Inclusion: patients taking rivaroxaban
# Inclusion/exclusion: patients with non-valvular atrial fibrillation taking rivaroxaban
#%% md
## Inclusion: patients taking rivaroxaban
#%%
# 1. Extract patients with non-valvular atrial fibrillation taking rivaroxaban
print('-------------------------1. Extract non-valvular AF patients taking rivaroxaban------------------------------')
# 1.1 Patients taking rivaroxaban with atrial fibrillation in the discharge record
print('-------------------------Extract patients taking rivaroxaban------------------------------')
# Extract the orders for rivaroxaban ('利伐沙班')
df_lfsb=df_doctor_order[df_doctor_order['drug_name'].str.contains('利伐沙班')]
df_lfsb=df_lfsb.reset_index(drop=True)
# Sort
df_lfsb=df_lfsb.sort_values(['patient_id','case_no','start_datetime'],ascending=[True,True,True])
df_lfsb=df_lfsb.reset_index(drop=True)
print(df_lfsb.shape)
print(df_lfsb['patient_id'].nunique())
print(df_lfsb['case_no'].nunique())
# print(df_lfsb)
#%%
# Save the rivaroxaban medication records
df_lfsb.to_excel(project_path+'/data/processed_data/df_1.1_利伐沙班用药记录.xlsx')
#%%
df_lfsb

Inclusion: patients with a discharge diagnosis of atrial fibrillation
#%% md
## Inclusion: patients with a discharge diagnosis of atrial fibrillation
#%%
# 1.2 Extract the case_no of patients with a discharge diagnosis of atrial fibrillation, using the terms in 郑-诊断.xlsx (already merged for inclusion)
import re
print('-------------------------Extract patients with a discharge diagnosis of AF------------------------------')
af_terms=['房颤射消融术后','心房扑动射频消融术后','心房颤动','阵发性心房颤动','持续性心房颤动','阵发性房颤','频发房性早搏',
          '阵发性心房扑动','心房扑动','持续性房颤','房颤伴快速心室率','房颤射频消融术后','射频消融术后','快慢综合征',
          '左心耳封堵术后','阵发性心房纤颤','心房颤动伴快速心室率','房颤','心房颤动射频消融术后','射频消融+左心耳封堵术后',
          '左心耳封闭术后','心房颤动射频消融术后+左心耳封堵术',
          '动态心电图异常:阵发性房颤、偶发房性早搏、偶发室性早搏、T波间歇性异常改变','左心房房颤射频消融+左心耳切除术后',
          '永久性房颤','阵发性房颤射频消融术后','冷冻射频消融术后','心房颤动药物复律后']
# Escape each term so that characters such as '+' are matched literally
af_pattern='|'.join(re.escape(t) for t in af_terms)
df_oup_fib=df_diagnostic[(df_diagnostic['diagnostic_type']=='出院诊断') & (df_diagnostic['diagnostic_content'].str.contains(af_pattern))]
df_oup_fib=df_oup_fib.sort_values(by=['patient_id','case_no','record_date'],ascending=[True,True,True])
df_oup_fib=df_oup_fib.reset_index(drop=True)
print(df_oup_fib.shape)
print(df_oup_fib['patient_id'].nunique())
print(df_oup_fib['case_no'].nunique())
print(df_oup_fib)
#%%
# Save the patients with a discharge diagnosis of atrial fibrillation
df_oup_fib.to_excel(project_path+'/data/processed_data/df_1.2_出院诊断房颤患者记录.xlsx')

Merging rivaroxaban orders with discharge atrial fibrillation diagnoses
print(type(df_lfsb.loc[0,'case_no']))
print(type(df_oup_fib.loc[0,'case_no']))
#%% md
## Merge rivaroxaban orders with discharge atrial fibrillation diagnoses
#%%
# Convert case_no in the rivaroxaban orders to str
df_lfsb['case_no']=df_lfsb['case_no'].astype('str')
# Discharge diagnoses
df_oup_fib=df_oup_fib.drop(['patient_id'],axis=1)
#%%
oup_fib_list=list(df_oup_fib['case_no'])
temp_list=[]
for i in np.unique(df_lfsb['case_no']):
    temp=df_lfsb[df_lfsb['case_no']==i]
    temp=temp.reset_index(drop=True)
    if i in oup_fib_list:
        temp_list.append(temp)
df_lfsb_oup=pd.concat(temp_list,axis=0)
df_lfsb_oup=df_lfsb_oup.reset_index(drop=True)
del temp_list
#%%
print(df_lfsb_oup.shape)
print(df_lfsb_oup['patient_id'].nunique())
print(df_lfsb_oup['case_no'].nunique())
#%%
print(df_lfsb_oup)

Exclusion: valve replacement surgery
#%% md
## Exclusion: valve replacement surgery and valvular atrial fibrillation
#%%
# 1.3 Identify valvular AF patients: valve replacement in the surgery records, or valvular AF in the diagnosis.
print('-------------------------Exclude AF-related surgery-----------------------------')
# Exclude valve replacement surgery, using the terms in 郑-手术.xlsx
df_surgical_record=pd.read_csv(project_path+'/data/raw_data/1-surgical_record.csv')
# df_surgical_valve=df_surgical_record[df_surgical_record['surgery_name'].str.contains('心脏病损腔内消融术|心脏病损腔内冷冻消融术|心电生理测定(EPS)|左心耳堵闭术|左心耳切除术|左心封堵术')]
df_surgical_valve=df_surgical_record[df_surgical_record['surgery_name'].str.contains('瓣膜置换')]
print(df_surgical_valve.shape)
print(df_surgical_valve['patient_id'].nunique())
print(df_surgical_valve['case_no'].nunique())
print(df_surgical_valve)
#%%
# Exclude the case_no with valve replacement surgery.
# The list is built from df_surgical_valve (the original used df_surgical_record, which would exclude every surgery);
# case_no is cast to str so that it is comparable with df_lfsb_oup
surgical_valve_list=list(df_surgical_valve['case_no'].astype('str'))
temp_list=[]
for i in np.unique(df_lfsb_oup['case_no']):
    temp=df_lfsb_oup[df_lfsb_oup['case_no']==i]
    temp=temp.reset_index(drop=True)
    if i in surgical_valve_list:
        continue
    else:
        temp_list.append(temp)
df_lfsb_not_surgery=pd.concat(temp_list,axis=0)
df_lfsb_not_surgery=df_lfsb_not_surgery.reset_index(drop=True)
del temp_list
#%%
print(df_lfsb_not_surgery.shape)
print(df_lfsb_not_surgery['patient_id'].nunique())
print(df_lfsb_not_surgery['case_no'].nunique())

Exclusion: valvular atrial fibrillation in the diagnosis
#%% md
## Exclusion: valvular atrial fibrillation in the diagnosis
#%%
# Exclude valvular AF in the clinical diagnosis: heart valve disease and rheumatic valve disease, but not venous valve disease of the lower limbs
print('-------------------------Exclude valvular AF patients-----------------------------')
# Drop empty clinical diagnoses
df_clinical_diagnosis=df_test_record[df_test_record['clinical_diagnosis'].notnull()]
df_heart_valve=df_clinical_diagnosis[df_clinical_diagnosis['clinical_diagnosis'].str.contains('瓣膜')]
df_heart_valve=df_heart_valve[df_heart_valve['clinical_diagnosis'].str.contains('心脏|风湿性')]
df_heart_valve['case_no']=df_heart_valve['case_no'].astype('str')
#%%
print(df_heart_valve.shape)
print(df_heart_valve['patient_id'].nunique())
print(df_heart_valve['case_no'].nunique())
#%%
# Exclude the case_no with valvular AF
diagnosis_valve_list=list(df_heart_valve['case_no'])
temp_list=[]
for i in np.unique(df_lfsb_not_surgery['case_no']):
    temp=df_lfsb_not_surgery[df_lfsb_not_surgery['case_no']==i]
    temp=temp.reset_index(drop=True)
    if i in diagnosis_valve_list:
        continue
    else:
        temp_list.append(temp)
df_lfsb_not_valve=pd.concat(temp_list,axis=0)
df_lfsb_not_valve=df_lfsb_not_valve.reset_index(drop=True)
del temp_list
#%%
print(df_lfsb_not_valve.shape)
print(df_lfsb_not_valve['patient_id'].nunique())
print(df_lfsb_not_valve['case_no'].nunique())
#%%
# Save the rivaroxaban cohort without valve replacement and without valvular AF
df_lfsb_not_valve.to_excel(project_path+'/data/processed_data/df_temp_利伐沙班非置换非瓣膜.xlsx')

3. Computing the rivaroxaban daily dose
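The daily dose in this step is the single dose multiplied by the number of administrations per day, and each admission keeps the dose of its last order. A plain-Python sketch of that calculation follows. The mapping FREQ_PER_DAY lists only a few of the frequency labels, and the function names and sample orders are assumptions for illustration; the 10 mg tablet strength is the value stated in the code of this section.

```python
# Hypothetical subset of frequency labels mapped to administrations per day.
FREQ_PER_DAY = {"1/72小时": 1 / 3, "1/2日": 0.5, "1/日": 1.0, "Qd": 1.0, "2/日": 2.0, "Tid": 3.0}
TABLET_MG = 10.0  # one rivaroxaban tablet is taken as 10 mg

def dose_mg(dosage):
    """Convert a dosage string such as '15mg' or '1片' to milligrams."""
    if "mg" in dosage:
        return float(dosage.replace("mg", ""))
    if "片" in dosage:
        return TABLET_MG  # same simplification as the original: any tablet entry counts as 10 mg
    return float(dosage)

def daily_dose(dosage, frequency):
    """Single dose in mg times administrations per day."""
    return dose_mg(dosage) * FREQ_PER_DAY[frequency]

def last_daily_dose(orders):
    """orders: (case_no, ISO start time, dosage, frequency) tuples.
    Returns {case_no: daily dose of the latest order}."""
    latest = {}
    for case_no, _start, dosage, freq in sorted(orders, key=lambda o: (o[0], o[1])):
        latest[case_no] = daily_dose(dosage, freq)  # later orders overwrite earlier ones
    return latest

orders = [
    ("A", "2018-01-01 08:00:00", "10mg", "2/日"),
    ("A", "2018-01-05 08:00:00", "1片", "Qd"),
    ("B", "2018-02-01 08:00:00", "15mg", "1/日"),
]
result = last_daily_dose(orders)
```

An unknown frequency label raises a KeyError here, which surfaces unmapped values early instead of letting them fail later in a float conversion.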
#%% md
## Compute the rivaroxaban daily dose
#%%
# 1.5 Compute the rivaroxaban daily dose
print('-------------------------Compute the rivaroxaban daily dose at discharge------------------------------')
print(np.unique(df_lfsb['frequency']))
# One rivaroxaban tablet is 10 mg
df_lfsb_not_valve['dosage']=df_lfsb_not_valve['dosage'].apply(lambda x: x.replace('mg', '') if 'mg' in x else 10 if '片' in x else x)
third=['1/72小时']
half=['1/2日','1/隔日']
one=['1/午','1/单日','1/日','1/日(餐前)','1/早','1/晚','Qd','Qd(8am)']
two=['1/12小时','12/日','2/日']
three=['Tid']
df_lfsb_not_valve['frequency']=df_lfsb_not_valve['frequency'].apply(
    lambda x: 0.33 if x in third else
              0.5 if x in half else
              1 if x in one else
              2 if x in two else
              3 if x in three else x)
#%%
# print(df_lfsb_not_valve.to_string())
# df_lfsb_not_valve.to_excel(project_path+'/data/processed_data/df_temp_利伐沙班frequency处理.xlsx')
#%%
df_lfsb_not_valve['日剂量']=df_lfsb_not_valve['dosage'].astype('float') * df_lfsb_not_valve['frequency'].astype('float')
#%%
print(df_lfsb_not_valve.shape)
print(df_lfsb_not_valve['patient_id'].nunique())
print(df_lfsb_not_valve['case_no'].nunique())
#%%
df_lfsb_not_valve
#%%
# Collapse the multiple orders of one case_no, taking the last daily dose as the final daily dose
temp_list=[]
for i in np.unique(df_lfsb_not_valve['case_no']):
    temp=df_lfsb_not_valve[df_lfsb_not_valve['case_no']==i]
    temp=temp.reset_index(drop=True)
    if temp.shape[0]>1:
        temp.loc[0,'日剂量']=temp.loc[(temp.shape[0]-1),'日剂量']
    temp=temp.drop_duplicates(['case_no'],keep='first')
    temp_list.append(temp)
df_lfsb_drug=pd.concat(temp_list,axis=0)
del temp_list
df_lfsb_drug=df_lfsb_drug.reset_index(drop=True)
# Keep the useful rivaroxaban fields
df_lfsb_drug=df_lfsb_drug[['patient_id','case_no','start_datetime','end_datetime','日剂量']]
#%%
print(df_lfsb_drug.shape)
print(df_lfsb_drug['patient_id'].nunique())
print(df_lfsb_drug['case_no'].nunique())
#%%
df_lfsb_drug.to_excel(project_path+'/data/processed_data/df_1.3_计算出院时利伐沙班日剂量.xlsx')

TDM test information
# Save the test records of the target drug (tacrolimus, '他克莫司')
df_test_tcms = df_test[df_test['project_name'].str.contains('他克莫司')]
df_test_tcms.to_excel(project_path+'/data/processed_data/df_temp_他克莫司检测结果.xlsx')

Merging medication and TDM tests
# Merge tacrolimus medication and TDM test data for autoimmune-disease patients
print('----------------------Merge tacrolimus medication and TDM test data for autoimmune-disease patients------------------------------')
# drug_tcms_frequency_l and test_record_result_tdm come from earlier steps that are not shown here
drug_test_tcms = pd.merge(drug_tcms_frequency_l,test_record_result_tdm, on=['patient_id'], how='inner')  # or on=['case_no'], depending on the key
# Watch the dtype of time fields: some are str and some are Timestamp, and mixing them raises errors
drug_test_tcms['start_datetime'] = drug_test_tcms['start_datetime'].astype('str').apply(str_to_datetime)
drug_test_tcms['test_date'] = drug_test_tcms['test_date'].astype('str').apply(str_to_datetime)
# Parse end_datetime first, then fill missing values with start_datetime
# (filling before parsing would turn the filled values back into NaN)
drug_test_tcms['end_datetime'] = drug_test_tcms['end_datetime'].astype('str').apply(str_to_datetime)
drug_test_tcms['end_datetime'] = drug_test_tcms['end_datetime'].fillna(drug_test_tcms['start_datetime'])
drug_test_tcms = drug_test_tcms.sort_values(by=['patient_id'],ascending=True)
drug_test_tcms = drug_test_tcms.reset_index(drop=True)
print(drug_test_tcms.shape) # (3125,15)
print(len(np.unique(drug_test_tcms['patient_id']))) # 149
drug_test_tcms.to_excel(project_path + '/result/df_6_合并自身免疫病人的他克莫司用药和tdm检测数据.xlsx')

Tacrolimus use within 7 days before a TDM test
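The filter in this step keeps a medication order only if its period overlaps a look-back window that ends the day before the test. A small sketch of that overlap test with the window length as a parameter; the function name overlaps_window and the sample dates are assumptions for illustration:

```python
from datetime import datetime, timedelta

def overlaps_window(start, end, test_date, window_days=7):
    """True if the order period [start, end] overlaps
    [test_date - window_days, test_date - 1 day]."""
    window_start = test_date - timedelta(days=window_days)
    window_end = test_date - timedelta(days=1)
    return end >= window_start and start <= window_end

test_date = datetime(2018, 1, 20)
in_window = overlaps_window(datetime(2018, 1, 1), datetime(2018, 1, 14), test_date)
ended_too_early = overlaps_window(datetime(2018, 1, 1), datetime(2018, 1, 10), test_date)
started_on_test_day = overlaps_window(datetime(2018, 1, 20), datetime(2018, 1, 25), test_date)
```

Keeping window_days as an argument makes it explicit which look-back period a given result was produced with.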
drug_test_tcms = drug_test_tcms.sort_values(by=['patient_id', 'test_date', 'start_datetime'],ascending=[True, True, False])  # use 'case_no' instead of 'patient_id' if that is the key
drug_test_tcms = drug_test_tcms.reset_index(drop=True)
# Note: the window below is 15 days, although the heading and the output file name say 7 days; adjust days= to the intended window
drug_test_tcms_frequency = drug_test_tcms[(drug_test_tcms['test_date'] - datetime.timedelta(days=15) <= drug_test_tcms['end_datetime'])&
                                          (drug_test_tcms['start_datetime'] <= drug_test_tcms['test_date'] - datetime.timedelta(days=1))]
drug_test_tcms_frequency = drug_test_tcms_frequency.reset_index(drop=True)
print(drug_test_tcms_frequency.shape) # 7-day window: (384,20)
print(len(np.unique(drug_test_tcms_frequency['patient_id']))) # 88
drug_test_tcms_frequency =drug_test_tcms_frequency.sort_values(by=['patient_id','start_datetime'],ascending=[True,False])
drug_test_tcms_frequency=drug_test_tcms_frequency.reset_index(drop=True)
drug_test_tcms_frequency.to_excel(project_path + '/result/df_8_tdm检测前7天的他克莫司用药数据.xlsx')

Checking the 15-day interval between consecutive TDM tests of the same patient
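The rule in this step is greedy: keep the first test of a patient, skip every later test that falls within 15 days of the last kept test, and keep the next one that is more than 15 days later. A compact plain-Python sketch of that rule for a single patient; the function name independent_tests and the sample dates are assumptions for illustration:

```python
from datetime import datetime, timedelta

def independent_tests(test_dates, min_gap_days=15):
    """Return the test dates kept by the greedy rule: a test is kept only if it is
    more than min_gap_days after the previously kept test."""
    kept = []
    for current in sorted(test_dates):
        if not kept or current - kept[-1] > timedelta(days=min_gap_days):
            kept.append(current)
    return kept

dates = [
    datetime(2018, 1, 1),
    datetime(2018, 1, 10),   # 9 days after Jan 1: skipped
    datetime(2018, 1, 16),   # exactly 15 days after Jan 1: skipped
    datetime(2018, 1, 17),   # 16 days after Jan 1: kept
    datetime(2018, 2, 5),    # 19 days after Jan 17: kept
]
kept = independent_tests(dates)
```

Each candidate is compared with the last kept test rather than with its immediate predecessor, so a chain of closely spaced tests cannot stretch the gap indefinitely.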
# Sort test_date ascending and start_datetime descending, which simplifies the 7-15 day filtering below: the first row per test is then the most recent medication.
drug_test_tcms_frequency = drug_test_tcms_frequency.sort_values(by=['patient_id', 'test_date', 'start_datetime'],ascending=[True, True, False])
drug_test_tcms_frequency['test_date']=drug_test_tcms_frequency['test_date'].astype('str').apply(str_to_datetime)
all_id = []
for i in np.unique(drug_test_tcms_frequency['patient_id']):
    temp = drug_test_tcms_frequency[drug_test_tcms_frequency['patient_id'] == i]
    temp = temp.reset_index(drop=True)
    between_id = []
    j = 0
    while j < temp.shape[0]:
        # Keep the first qualifying TDM test; because medication times are sorted descending, it carries the most recent medication.
        between_id.append(temp.iloc[[j]])  # .iloc[[i]] returns a DataFrame; .loc[i] returns a Series
        k = j + 1
        # Compare the j-th and k-th TDM tests of the same patient against the 15-day interval
        while k < temp.shape[0]:
            # Tests within 15 days of each other: keep only the first
            if temp.loc[j, 'test_date'] >= temp.loc[k, 'test_date'] - datetime.timedelta(days=15):
                k += 1
                continue
            else:
                break
        # Tests more than 15 days apart are treated as independent: assign k to j so the next pass stores test k in between_id
        j = k
    temp_between = pd.concat(between_id, axis=0)  # list to DataFrame
    temp_between = temp_between.reset_index(drop=True)
    all_id.append(temp_between)
drug_test_tcms_15 = pd.concat(all_id, axis=0)
drug_test_tcms_15 = drug_test_tcms_15.reset_index(drop=True)
print(drug_test_tcms_15.shape) # 7-day window: (102,20)
print(len(np.unique(drug_test_tcms_15['patient_id']))) # 88
drug_test_tcms_15.to_excel(project_path + '/result/df_9_两次他克莫司检测间隔15天判断.xlsx')

4. Merging demographic data
4.1 Merging demographic features
#%% md
## Merge demographic data
#%%
# 1.5 Merge demographic data
print('-------------------------Merge demographic data-----------------------------')
df_popu=pd.read_excel(project_path+'/data/raw_data/1.基本信息(诊断非瓣膜房颤用利伐沙班).xlsx')
if 'Unnamed: 0' in df_popu.columns:
    df_popu = df_popu.drop(['Unnamed: 0'], axis=1)
df_popu=df_popu[['case_no','gender','age','height','weight','BMI']]
# Drop duplicate demographic rows, keeping the first
df_popu=df_popu.drop_duplicates(subset=['case_no'],keep='first')
#%%
print(type(df_popu.loc[0,'case_no']))
print(type(df_lfsb_drug.loc[0,'case_no']))
#%%
# Convert case_no in df_popu to str
df_popu['case_no']=df_popu['case_no'].astype('str')
df_lfsb_popu=pd.merge(df_lfsb_drug,df_popu,on=['case_no'],how='left')
#%%
print(df_lfsb_popu.shape)
print(df_lfsb_popu['patient_id'].nunique())
print(df_lfsb_popu['case_no'].nunique())
#%%
print(df_lfsb_popu)

4.2 Filling in missing gender, age and height
# Fill in missing gender, age and height
# patient_info contains gender and year of birth; patient_sign_record contains height and weight
df_patient_info=pd.read_csv(project_path+'/data/raw_data/1-patient_info.csv')
df_patient_info = df_patient_info.set_index('patient_id')
df_patient_sign_record=pd.read_csv(project_path+'/data/raw_data/1-patient_sign_record.csv')
df_height = df_patient_sign_record[df_patient_sign_record['sign_type'] == '身高(cm)']
# Drop null values
df_height = df_height[df_height['record_content'].notnull()]
# Drop duplicates
df_height = df_height.drop_duplicates(subset=['patient_id','case_no','sign_type','record_content'])
# Drop outliers
df_height = filter_exce_value(df_height,'record_content')
df_weight = df_patient_sign_record[df_patient_sign_record['sign_type'] == '体重(kg)']
# Drop null values
df_weight = df_weight[df_weight['record_content'].notnull()]
# Drop duplicates
df_weight = df_weight.drop_duplicates(subset=['patient_id','case_no','sign_type','record_content'])
# Drop outliers
df_weight = filter_exce_value(df_weight,'record_content')
aaa=df_lfsb_popu[df_lfsb_popu['gender'].isnull()]
bbb=df_lfsb_popu[df_lfsb_popu['gender'].notnull()]
aaa_list=[]
for i in np.unique(aaa['patient_id']):
    # print(i)
    temp=aaa[aaa['patient_id']==i]
    temp=temp.reset_index(drop=True)
    # Fill in the missing gender
    gender=df_patient_info.loc[i,'gender']
    if gender =='男':
        gender_value=1
    else:
        gender_value=0
    temp['gender']=gender_value
    # Fill in the missing age: year of the first order minus year of birth
    age=df_patient_info.loc[i,'birth_year']
    age_year=str(age).split('-')[0]
    start_datetime=temp.loc[0,'start_datetime']
    start_year=str(start_datetime).split('-')[0]
    age_value=int(start_year)-int(age_year)
    temp['age']=age_value
    # Fill in the height (left missing if the patient has no height record)
    temp_height= df_height[df_height['patient_id']==i]
    temp_height=temp_height.reset_index(drop=True)
    if not temp_height.empty:
        temp['height']=temp_height.loc[0,'record_content']
    # Fill in the weight (left missing if the patient has no weight record)
    temp_weight= df_weight[df_weight['patient_id']==i]
    temp_weight=temp_weight.reset_index(drop=True)
    if not temp_weight.empty:
        temp['weight']=temp_weight.loc[0,'record_content']
    aaa_list.append(temp)
aaa=pd.concat(aaa_list,axis=0)
df_lfsb_popu=pd.concat([aaa,bbb],axis=0)
df_lfsb_popu=df_lfsb_popu.sort_values(['patient_id'])
df_lfsb_popu=df_lfsb_popu.reset_index(drop=True)
print(df_lfsb_popu.shape)
print(df_lfsb_popu['patient_id'].nunique())
print(df_lfsb_popu['case_no'].nunique())
df_lfsb_popu

4.3 Imputation with a random forest
# Impute missing values with a random forest
import pandas as pd
pd.set_option('mode.chained_assignment', None)
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
def missing_value_interpolation(df,missing_list=None):
    df = df.reset_index(drop=True)
    # Collect the names of the columns that contain missing values
    if not missing_list:
        missing_list = [i for i in df.columns if df[i].isnull().sum() > 0]
    missing_list = list(missing_list)  # work on a copy so the caller's list is not modified
    missing_list_copy = missing_list.copy()
    # Train a random forest on the rows where the column is present, then predict the missing rows
    for i in range(len(missing_list_copy)):
        name=missing_list[0]
        df_missing = df[missing_list_copy]
        # Represent missing values in the other columns as 0
        missing_list.remove(name)
        for j in missing_list:
            df_missing[j]=df_missing[j].fillna(0)
        df_missing_is = df_missing[df_missing[name].isnull()]
        df_missing_not = df_missing[df_missing[name].notnull()]
        if df_missing_is.empty:
            continue  # nothing to impute in this column
        y = df_missing_not[name]
        x = df_missing_not.drop([name],axis=1)
        # Parameter grid
        tree_grid_parameter = {'n_estimators': list((10, 50, 100, 150, 200))}
        # Search over the parameter combinations
        grid = GridSearchCV(RandomForestRegressor(),param_grid=tree_grid_parameter,cv=3)
        # rfr=RandomForestRegressor(random_state=0,n_estimators=100,n_jobs=-1)
        # Fit the random forest on the available data
        grid.fit(x, y)
        rfr = RandomForestRegressor(n_estimators=grid.best_params_['n_estimators'])
        rfr.fit(x, y)
        # Predict the missing values
        predict = rfr.predict(df_missing_is.drop([name],axis=1))
        # Fill in the missing values
        df.loc[df[name].isnull(),name] = predict
    return df
# Impute gender, age, height, weight and BMI
df_lfsb_popu=missing_value_interpolation(df_lfsb_popu,['gender','age','height','weight','BMI'])
#%%
# Age distribution
df_age_stats=df_lfsb_popu.drop_duplicates(subset=['patient_id'],keep='first')
print(df_age_stats['age'].describe())
#%%
# Save the demographic features
df_lfsb_popu.to_excel(project_path+'/data/processed_data/df_1.4_合并人口信息学特征的非瓣膜房颤患者.xlsx')

5. Adding medical history (diabetes, hypertension)
## Add diabetes and hypertension history
#%%
# Diagnoses that mention diabetes ('糖尿病') or hypertension ('高血压')
df_diagnostic_dm=df_diagnostic[df_diagnostic['diagnostic_content'].str.contains('糖尿病|高血压')]
# Drop repeated diagnoses of the same case_no
df_diagnostic_dm=df_diagnostic_dm.drop_duplicates(['case_no','diagnostic_content'],keep='first')
df_diagnostic_dm=df_diagnostic_dm.reset_index(drop=True)
#%%
print(df_diagnostic_dm.shape)
print(df_diagnostic_dm['patient_id'].nunique())
print(df_diagnostic_dm['case_no'].nunique())
#%%
df_diagnostic_dm
#%%
# Lists of case_no with diabetes and with hypertension.
# Substring matching is used so that entries such as '2型糖尿病' are counted (the original compared for exact equality)
dm_list=list(df_diagnostic_dm[df_diagnostic_dm['diagnostic_content'].str.contains('糖尿病')]['case_no'])
htn_list=list(df_diagnostic_dm[df_diagnostic_dm['diagnostic_content'].str.contains('高血压')]['case_no'])
print(dm_list[0])
print(type(dm_list[0]))
#%%
# Add the flags to the inclusion/exclusion data
temp_list=[]
for i in np.unique(df_lfsb_popu['case_no']):
    temp=df_lfsb_popu[df_lfsb_popu['case_no']==i]
    temp=temp.reset_index(drop=True)
    if i in dm_list:
        temp['糖尿病']=1
    else:
        temp['糖尿病']=0
    if i in htn_list:
        temp['高血压']=1
    else:
        temp['高血压']=0
    temp_list.append(temp)
#%%
df_lfsb_merge_dm=pd.concat(temp_list,axis=0)
df_lfsb_merge_dm=df_lfsb_merge_dm.sort_values(by=['patient_id','case_no','start_datetime'])
df_lfsb_merge_dm=df_lfsb_merge_dm.reset_index(drop=True)
del temp_list
#%%
print(df_lfsb_merge_dm.shape)
print(df_lfsb_merge_dm['patient_id'].nunique())
print(df_lfsb_merge_dm['case_no'].nunique())
#%%
df_lfsb_merge_dm.to_excel(project_path+'/data/processed_data/df_1.7_增加糖尿病检验信息.xlsx')

6. Adding concomitant medication
6.1 Extracting concomitant medication
# 1.6 Add concomitant medication. The drug list follows the business requirements; a 0/1 flag per drug group is enough.
# Unify drug names per group; the first matching group wins, and names that match no group are left unchanged.
# The groups are listed as in the source: corticosteroids, proton pump inhibitors, calcium channel blockers,
# other immunosuppressants, clarithromycin, azithromycin
drug_groups={
    '糖皮质激素':['地塞米松','甲泼尼龙','泼尼松','可的松'],
    '质子泵抑制剂':['奥美拉唑','泮托拉唑','艾普拉唑','雷贝拉唑','兰索拉唑','雷尼替丁'],
    '钙离子阻抗剂':['硝苯地平','氨氯地平','尼群地平','非洛地平','地尔硫卓'],
    '其他免疫抑制剂':['环孢素','吗替麦考酚酯','环磷酰胺','硫唑嘌呤','甲氨蝶呤'],
    '克拉霉素':['克拉霉素'],
    '阿奇霉素':['阿奇霉素'],
}
def unify_drug_name(x):
    for group,keywords in drug_groups.items():
        if any(k in x for k in keywords):
            return group
    return x
# df_doctor_order is the preprocessed medication table from section 1.1 (the original snippet called it doctor_order)
df_doctor_order['drug_name']=df_doctor_order['drug_name'].astype('str').apply(unify_drug_name)
# Extract the concomitant medication records
drug_other=df_doctor_order[df_doctor_order['drug_name'].str.contains('|'.join(drug_groups))]
drug_other=drug_other.reset_index(drop=True)
drug_other.to_excel(project_path+'/data/processed_data/df_提取联合用药.xlsx')

6.2 Filling in concomitant medication times
# Replace a missing end_datetime of the concomitant drugs with start_datetime.
# Parse end_datetime first, then fill: filling before parsing would turn the filled values back into NaN
drug_other['end_datetime'] = drug_other['end_datetime'].astype('str').apply(str_to_datetime)
drug_other['end_datetime'] = drug_other['end_datetime'].fillna(drug_other['start_datetime'])
drug_other = drug_other.sort_values(by=['patient_id'],ascending=True)
drug_other = drug_other.reset_index(drop=True)
# print(drug_other.shape) # (19728,9)
# print(drug_other['patient_id'].nunique()) # 948
drug_other.to_excel(project_path + '/processed_data/df_补充联合用药时间.xlsx')

6.3 Extracting concomitant medication within 7 days of a TDM test
all_id = []
for i in np.unique(tdm_7_other_interpolation['patient_id']):
    # First grouping: the TDM test records of this patient
    tdm_time = tdm_7_other_interpolation[tdm_7_other_interpolation['patient_id'] == i]
    # Sort by test time
    tdm_time = tdm_time.sort_values(by=['test_date'], ascending=True)
    tdm_time = tdm_time.reset_index(drop=True)

    # Select this patient's other co-medications and rename the fields
    # so they do not clash with the TDM columns when merged
    temp = drug_other[drug_other['patient_id'] == i]
    temp_drug_other = temp.rename(columns={'drug_name': 'drug_name_other',
                                           'start_datetime': 'start_datetime_other',
                                           'end_datetime': 'end_datetime_other'})
    # Sort by start time
    temp_drug_other = temp_drug_other.sort_values(by=['start_datetime_other'], ascending=True)
    temp_drug_other = temp_drug_other.reset_index(drop=True)

    # 5.1 Second grouping: one pass per TDM test
    between_id = []
    for j in range(tdm_time.shape[0]):
        tdm_time_1 = tdm_time.iloc[[j]].copy()
        time_1 = tdm_time.loc[j, 'test_date']
        last_id = []
        for k in range(temp_drug_other.shape[0]):
            # Keep other drugs used in the 7 days before the TDM test
            if (time_1 - datetime.timedelta(days=8) <= temp_drug_other.loc[k, 'end_datetime_other']) and \
               (time_1 - datetime.timedelta(days=1) >= temp_drug_other.loc[k, 'start_datetime_other']):
                last_id.append(temp_drug_other.iloc[[k]])
        if last_id:
            temp_last = pd.concat(last_id, axis=0)
            # 5.2 Keep one row per (patient_id, drug_name_other); only the drug name is used below
            temp_last = temp_last.drop_duplicates(subset=['patient_id', 'drug_name_other'], keep='first')
            drug_other_list = list(temp_last['drug_name_other'])
            # 5.3 Turn each drug found in the window into a 0/1 column of the modeling data
            for drug_name in drug_other_list:  # not `drug_other`, which would overwrite the DataFrame
                tdm_time_1[drug_name] = 1
        between_id.append(tdm_time_1)

    # Combine all rows for this patient_id
    temp_between = pd.concat(between_id, axis=0).reset_index(drop=True)
    all_id.append(temp_between)

# Combine the co-medication data of all patient_ids
drug_other_7_select = pd.concat(all_id, axis=0)
drug_other_7_select = drug_other_7_select.reset_index(drop=True)

print(drug_other_7_select.shape)  # (106, 27)
print(len(np.unique(drug_other_7_select['patient_id'])))  # 88

6.4 Drop columns with >50% missing values
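The column filter in this section can also be written without an explicit loop. The sketch below is illustrative only: `drop_sparse_columns` and the toy columns are hypothetical names, and the 0.5 threshold mirrors the `>= 0.5` test in the text. It also shows that `fillna` returns a new DataFrame, so its result has to be assigned.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, threshold=0.5):
    """Drop columns whose fraction of missing values is >= threshold."""
    missing_fraction = df.isnull().mean()  # per-column share of NaN
    keep = missing_fraction[missing_fraction < threshold].index
    return df[keep]


# Hypothetical toy data
toy = pd.DataFrame({
    'patient_id': ['1', '2', '3', '4'],
    'drug_a': [1, np.nan, 1, 1],            # 25% missing: kept
    'drug_b': [np.nan, np.nan, 1, np.nan],  # 75% missing: dropped
})

# fillna returns a new object; assign it to keep the result
cleaned = drop_sparse_columns(toy).fillna(0)
print(cleaned)
```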
# Drop co-medication columns in which at least 50% of the values are missing
for i in list(drug_other_7_select.columns):
    other_up = drug_other_7_select[i].isnull().sum()
    other_down = drug_other_7_select[i].shape[0]
    if other_up / other_down >= 0.5:
        del drug_other_7_select[i]

# Glucocorticoids | proton pump inhibitors | calcium channel blockers | other immunosuppressants | clarithromycin | azithromycin
# Replace missing values with 0 (fillna returns a new DataFrame, so assign it back)
drug_other_7_select = drug_other_7_select.fillna(0)
# For a single column:
# drug_other_7_select['糖皮質激素'] = drug_other_7_select['糖皮質激素'].fillna(0)

print(drug_other_7_select.shape)  # (106, 23)
print(len(np.unique(drug_other_7_select['patient_id'])))  # 88

drug_other_7_select.to_excel(project_path + '/result/df_提取tdm檢測7天內最近的其他聯合用藥.xlsx')

7. Add other tests
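Each subsection here attaches one laboratory indicator to the cohort with a left merge on `patient_id` and `case_no`. A minimal sketch of that pattern follows, assuming a long-format test table with `synonym` and `test_result` columns; the helper `add_indicator` and the toy values are hypothetical. Renaming `test_result` to the indicator name before merging is one way to keep repeated merges from producing clashing column names.

```python
import pandas as pd

# Hypothetical toy data
base = pd.DataFrame({'patient_id': ['p1', 'p2'], 'case_no': ['1', '2']})
tests = pd.DataFrame({
    'patient_id': ['p1', 'p1', 'p2'],
    'case_no': ['1', '1', '2'],
    'synonym': ['DBIL', 'Cys-C', 'DBIL'],
    'test_result': [3.1, 0.9, 4.2],
})


def add_indicator(base, tests, synonym):
    """Left-merge one indicator onto base as a column named after it."""
    picked = tests.loc[tests['synonym'] == synonym,
                       ['patient_id', 'case_no', 'test_result']]
    picked = picked.rename(columns={'test_result': synonym})
    return base.merge(picked, on=['patient_id', 'case_no'], how='left')


wide = add_indicator(add_indicator(base, tests, 'DBIL'), tests, 'Cys-C')
print(wide)
```

Because the merge is a left join, a case with several results for the same indicator would yield several rows; the toy data has at most one result per case and indicator.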
7.1 Creatinine (renal function)
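import re

# Creatinine (renal function)
# The purpose names contain parentheses, so escape them before building the regex
cr_purposes = ['肌酐(Cr)', '腎功五項(UREA,CR,UA,TCO2,CysC)', '腎功四項(UREA,CR,UA,TCO2)']
cr_pattern = '|'.join(re.escape(p) for p in cr_purposes)
df_test_cr = df_test[df_test['test_purpose'].str.contains(cr_pattern, na=False)]
# Keep the Cys-C rows of those panels
df_result_cr = df_test_cr[df_test_cr['synonym'] == 'Cys-C']
df_result_cr = df_result_cr[['patient_id', 'case_no', 'test_result']]
df_lfsb_cr = pd.merge(df_lfsb_merge_dm, df_result_cr, on=['patient_id', 'case_no'], how='left')
df_lfsb_cr.to_excel(r'Cys-C.xlsx')

7.2 Liver function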
df_test_liver = df_test[df_test['test_purpose'] == '肝功1(7項;ALT,AST,TP,ALB,G,TBIL,DBIL)']
df_test_liver = df_test_liver[df_test_liver['synonym'] == 'DBIL']
# Explicit suffixes keep repeated columns such as test_result distinct across successive merges
df_lfsb_liver = pd.merge(df_lfsb_cr, df_test_liver, on=['patient_id', 'case_no'], how='left',
                         suffixes=('', '_liver'))
df_lfsb_liver.to_excel(r'DBIL.xlsx')

7.3 Blood cell analysis
# Platelets, red blood cells, white blood cells, hemoglobin
df_test_blood = df_test[df_test['test_purpose'] == '血細胞分析(五分類)']
df_test_blood = df_test_blood[df_test_blood['project_name'] == '血紅蛋白測定']
df_lfsb_blood = pd.merge(df_lfsb_liver, df_test_blood, on=['patient_id', 'case_no'], how='left',
                         suffixes=('', '_blood'))
df_lfsb_blood.to_excel(r'血紅蛋白測定.xlsx')

7.4 Coagulation
df_test_bc = df_test[df_test['test_purpose'].str.contains('凝血', na=False)]
df_test_bc = df_test_bc[df_test_bc['project_name'].str.contains('凝血', na=False)]
df_lfsb_bc = pd.merge(df_lfsb_blood, df_test_bc, on=['patient_id', 'case_no'], how='left',
                      suffixes=('', '_bc'))
df_lfsb_bc.to_excel(r'凝血.xlsx')

7.5 Fecal occult blood
# Routine urine and stool tests
# %%
df_test_sb = df_test[df_test['test_purpose'].str.contains('糞便', na=False)]
df_test_sb = df_test_sb[df_test_sb['project_name'].str.contains('隱血', na=False)]
df_lfsb_sb = pd.merge(df_lfsb_bc, df_test_sb, on=['patient_id', 'case_no'], how='left',
                      suffixes=('', '_sb'))
df_lfsb_sb.to_excel(r'糞便-隱血.xlsx')

7.5 Drop missing values
# Drop columns with too many missing values (at least 50%)
# %%
for i in list(df_lfsb_merge_dm.columns):
    other_up = df_lfsb_merge_dm[i].isnull().sum()
    other_down = df_lfsb_merge_dm[i].shape[0]
    if other_up / other_down >= 0.5:
        del df_lfsb_merge_dm[i]
# %%
print(df_lfsb_merge_dm.shape)
print(df_lfsb_merge_dm['patient_id'].nunique())
print(df_lfsb_merge_dm['case_no'].nunique())
# %%
# Sort
df_lfsb_merge_dm = df_lfsb_merge_dm.sort_values(['patient_id', 'case_no', 'start_datetime'])
# %%
# Save the data after dropping the sparse columns
df_lfsb_merge_dm.to_excel(project_path + '/data/processed_data/df_1.8_刪除缺失過多的列.xlsx')

Admission and discharge times and admission diagnoses
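This section collapses the several diagnosis rows of one `case_no` into a single semicolon-separated string. A minimal sketch of that step with `groupby`, on hypothetical toy data (the diagnosis labels are placeholders, not values from the dataset):

```python
import pandas as pd

# Hypothetical toy data: two diagnosis rows for case 100, one for case 200
toy_diag = pd.DataFrame({
    'patient_id': ['p1', 'p1', 'p2'],
    'case_no': ['100', '100', '200'],
    'diagnostic_content': ['AF', 'HTN', 'stroke'],
})

# One row per case_no, with the diagnoses joined in their original order
merged = (
    toy_diag.groupby(['patient_id', 'case_no'])['diagnostic_content']
    .agg(';'.join)
    .reset_index()
)
print(merged)
```

The join assumes every `diagnostic_content` value is a string; missing values would need to be dropped or converted first.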
# %% md
# # Multiple admissions and discharges
# %% md
# ## Extract admission diagnoses
# %%
# Admission diagnoses: supplementary, preliminary, outpatient, revised, and final diagnoses
# (the pattern also matches discharge diagnoses)
df_diagnostic_inp = df_diagnostic[df_diagnostic['diagnostic_type'].str.contains(
    '補充診斷|初步診斷|門診診斷|修正診斷|最后診斷|出院診斷', na=False)]
# Drop rows without a case_no
df_diagnostic_inp = df_diagnostic_inp[df_diagnostic_inp['case_no'].notnull()].copy()
# Convert case_no from float to str
df_diagnostic_inp['case_no'] = df_diagnostic_inp['case_no'].astype('int').astype('str')
# Select from the filtered frame, not from df_diagnostic, so the steps above are kept
df_diagnostic_inp = df_diagnostic_inp[['patient_id', 'case_no', 'record_date', 'diagnostic_type', 'diagnostic_content']]
# %%
print(df_diagnostic_inp.shape)
# %%
# Merge the admission diagnoses that share a case_no
temp_list = []
for i in np.unique(df_diagnostic_inp['case_no']):
    temp = df_diagnostic_inp[df_diagnostic_inp['case_no'] == i]
    temp = temp.reset_index(drop=True)
    temp_diagnostic_list = list(temp['diagnostic_content'])
    temp = temp.drop_duplicates(subset=['case_no'], keep='first')
    temp_diagnostic_str = ';'.join(temp_diagnostic_list)
    temp['diagnostic_content'] = temp_diagnostic_str
    # print(temp)
    temp_list.append(temp)
    # j=0
    # while j < temp.shape[0]-1:
    #     # print(i)
    #     temp.loc[j+1,'diagnostic_content']=temp.loc[j,'diagnostic_content'] +';'+temp.loc[j+1,'diagnostic_content']
    #     temp=temp.drop(index=[j],axis=0)
    #     temp=temp.reset_index(drop=True)
    #     temp=temp.drop_duplicates(subset=['case_no'],keep='last')
    #     j+=1
    # temp_list.append(temp)
# %%
df_diagnostic_inp_merge = pd.concat(temp_list, axis=0)
del temp_list
df_diagnostic_inp_merge = df_diagnostic_inp_merge.reset_index(drop=True)
# %%
df_diagnostic_inp_merge
# %%
print(df_diagnostic_inp_merge.shape)
print(df_diagnostic_inp_merge['patient_id'].nunique())
print(df_diagnostic_inp_merge['case_no'].nunique())
# %%
df_diagnostic_inp_merge.to_excel(project_path + '/data/processed_data/df_2.1_提取入院診斷.xlsx')
# %% md
# ## Extract admission and discharge times
# %%
# 2. Admission and discharge times for multiple stays, keyed by case_no
df_inp_record = pd.read_csv(project_path + '/data/raw_data/1-inp_record.csv', dtype={'case_no': str})
# %%
# Drop rows with missing dates
df_inp_record = df_inp_record[df_inp_record['adm_date'].notnull() & df_inp_record['dis_date'].notnull()]
# %%
print(df_inp_record.shape)
print(df_inp_record['patient_id'].nunique())
print(df_inp_record['case_no'].nunique())
# %%
# Convert the admission and discharge times to datetime
df_inp_record['adm_date'] = df_inp_record['adm_date'].astype('str').apply(str_to_datetime)
df_inp_record['dis_date'] = df_inp_record['dis_date'].astype('str').apply(str_to_datetime)
# %%
# Keep the relevant admission and discharge fields
df_inp_record = df_inp_record[['patient_id', 'case_no', 'adm_date', 'care_area', 'dis_date']]
df_inp_record = df_inp_record.sort_values(by=['patient_id', 'case_no', 'adm_date'])
df_inp_record = df_inp_record.reset_index(drop=True)
# %%
print(df_inp_record.shape)
print(df_inp_record['patient_id'].nunique())
print(df_inp_record['case_no'].nunique())
# %%
# Save the admission and discharge times
df_inp_record.to_excel(project_path + '/data/processed_data/df_temp_保存多次出入院時間.xlsx')
# %% md
# ## Dose grouping; second-admission statistics for the high- and low-dose groups
Assigning the high- and low-dose groups
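The rule in this section is that every record of a patient inherits the group of the dose at the first discharge. As a compact illustration, the same rule can be sketched with `groupby(...).transform('first')`. The English column names `daily_dose` and `dose_group` and the toy rows are hypothetical stand-ins for the columns used in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: p1 changes dose at the second stay, p3 has an unlisted dose
toy = pd.DataFrame({
    'patient_id': ['p1', 'p1', 'p2', 'p3'],
    'start_datetime': pd.to_datetime(['2018-01-01', '2018-06-01', '2018-02-01', '2018-03-01']),
    'daily_dose': [10, 20, 15, 5],
})

# Sort first so that "first" means the earliest record of each patient
toy = toy.sort_values(['patient_id', 'start_datetime']).reset_index(drop=True)
first_dose = toy.groupby('patient_id')['daily_dose'].transform('first')

# 10 mg -> 0, 15 mg -> 1, 20 mg -> 2; any other dose becomes NaN
toy['dose_group'] = first_dose.map({10: 0, 15: 1, 20: 2})
grouped = toy[toy['dose_group'].notnull()]
print(grouped)
```

Both records of p1 land in group 0 even though the second stay used 20 mg, which is the behavior the text asks for.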
# %%
# 3. Group by dose (10, 15, 20). The group must come from the daily dose at the first
# discharge, so a plain row-wise lambda is not enough; readmission rates are computed afterwards.
# Sort first
df_lfsb_merge_dm = df_lfsb_merge_dm.sort_values(['patient_id', 'case_no', 'start_datetime'])
# Assign groups
temp_list = []
for i in np.unique(df_lfsb_merge_dm['patient_id']):
    temp = df_lfsb_merge_dm[df_lfsb_merge_dm['patient_id'] == i]
    temp = temp.reset_index(drop=True)
    dosage = temp.loc[0, '日劑量']
    if dosage == 10:
        temp['劑量分組'] = 0
    elif dosage == 15:
        temp['劑量分組'] = 1
    elif dosage == 20:
        temp['劑量分組'] = 2
    else:
        temp['劑量分組'] = np.nan
    temp_list.append(temp)
# %%
df_lfsb_group = pd.concat(temp_list, axis=0)
df_lfsb_group = df_lfsb_group.reset_index(drop=True)
del temp_list
# %%
print(df_lfsb_group.shape)
print(df_lfsb_group['patient_id'].nunique())
print(df_lfsb_group['case_no'].nunique())
# %%
# Keep only the rows that received a group
df_lfsb_group = df_lfsb_group[df_lfsb_group['劑量分組'].notnull()]
# %%
print(df_lfsb_group.shape)
print(df_lfsb_group['patient_id'].nunique())
print(df_lfsb_group['case_no'].nunique())
# %%
# Save the grouped data
df_lfsb_group.to_excel(project_path + '/data/processed_data/df_2.2_保存分組數據.xlsx')

High- and low-dose group statistics
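The per-group readmission counts in this section can be sketched in one pass with `groupby`. This is an illustration under assumptions: the column name `dose_group` and the toy rows are hypothetical, and a patient is treated as readmitted when more than one distinct `case_no` exists, whereas the loops in the text count rows per patient.

```python
import pandas as pd

# Hypothetical toy data: p1 and p3 each have two stays
toy = pd.DataFrame({
    'patient_id': ['p1', 'p1', 'p2', 'p3', 'p3', 'p4'],
    'case_no': ['1', '2', '3', '4', '5', '6'],
    'dose_group': [0, 0, 0, 1, 1, 1],
})

# Number of distinct stays per patient within each dose group
per_patient = toy.groupby(['dose_group', 'patient_id'])['case_no'].nunique()
readmitted = per_patient[per_patient > 1]

summary = pd.DataFrame({
    'patients': per_patient.groupby(level='dose_group').size(),
    'readmitted': readmitted.groupby(level='dose_group').size(),
}).fillna(0)  # groups without any readmission get 0
summary['rate'] = summary['readmitted'] / summary['patients']
print(summary)
```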
# %%
# Split out each dose group
df_lfsb_10 = df_lfsb_group[df_lfsb_group['劑量分組'] == 0]
df_lfsb_15 = df_lfsb_group[df_lfsb_group['劑量分組'] == 1]
df_lfsb_20 = df_lfsb_group[df_lfsb_group['劑量分組'] == 2]
# %%
# Group sizes
num_10_patient = df_lfsb_10['patient_id'].nunique()
num_10_case = df_lfsb_10['case_no'].nunique()
num_15_patient = df_lfsb_15['patient_id'].nunique()
num_15_case = df_lfsb_15['case_no'].nunique()
num_20_patient = df_lfsb_20['patient_id'].nunique()
num_20_case = df_lfsb_20['case_no'].nunique()

print('Patients per group', num_10_patient, num_15_patient, num_20_patient)
print('Case records per group', num_10_case, num_15_case, num_20_case)
# %% md
# ### Readmissions in the 10 mg group
# %%
# Count readmitted patients in the 10 mg group
count_10 = 0
list_10_again = []
for i in np.unique(df_lfsb_10['patient_id']):
    temp = df_lfsb_10[df_lfsb_10['patient_id'] == i]
    if temp.shape[0] > 1:
        count_10 += 1
        list_10_again.append(i)
print('10 mg readmitted patients', count_10, count_10 / num_10_patient)
print(list_10_again)
# %% md
# ### Readmissions in the 15 mg group
# %%
# Count readmitted patients in the 15 mg group
count_15 = 0
list_15_again = []
for i in np.unique(df_lfsb_15['patient_id']):
    temp = df_lfsb_15[df_lfsb_15['patient_id'] == i]
    temp = temp.reset_index(drop=True)
    if temp.shape[0] > 1:
        count_15 += 1
        list_15_again.append(i)
print('15 mg readmitted patients', count_15, count_15 / num_15_patient)
print(list_15_again)
# %% md
# ### Readmissions in the 20 mg group
# %%
# Count readmitted patients in the 20 mg group
count_20 = 0
list_20_again = []
for i in np.unique(df_lfsb_20['patient_id']):
    temp = df_lfsb_20[df_lfsb_20['patient_id'] == i]
    temp = temp.reset_index(drop=True)
    if temp.shape[0] > 1:
        count_20 += 1
        list_20_again.append(i)
print('20 mg readmitted patients', count_20, count_20 / num_20_patient)
print(list_20_again)
# %% md
# ### Extract the records of readmitted patients in each group
# %%
# patient_id list of readmitted patients
list_again = list_10_again + list_15_again + list_20_again
print(type(list_again))
print(list_again)
df_lfsb_group_again = df_lfsb_group[df_lfsb_group['patient_id'].isin(list_again)]
df_lfsb_group_again = df_lfsb_group_again.reset_index(drop=True)
# %%
print(df_lfsb_group_again.shape)
print(df_lfsb_group_again['patient_id'].nunique())
print(df_lfsb_group_again['case_no'].nunique())
# %%
# Save the grouped data of readmitted patients
df_lfsb_group_again.to_excel(project_path + '/data/processed_data/df_2.3_保存再次入院的分組數據.xlsx')

PSM data for the high- and low-dose groups
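For the PSM table, each readmitted patient contributes the second admission record, but with the daily dose taken from the first discharge. A minimal sketch of that selection using `cumcount` and `transform('first')` follows; the column name `daily_dose` and the toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: every patient has at least two stays
toy = pd.DataFrame({
    'patient_id': ['p1', 'p1', 'p1', 'p2', 'p2'],
    'case_no': ['1', '2', '3', '4', '5'],
    'daily_dose': [10, 20, 20, 15, 10],
})
toy = toy.sort_values(['patient_id', 'case_no']).reset_index(drop=True)

# Dose at the first discharge, repeated on every row of the patient
toy['first_dose'] = toy.groupby('patient_id')['daily_dose'].transform('first')

# cumcount() numbers each patient's rows 0, 1, 2, ...; 1 is the second stay
second = toy[toy.groupby('patient_id').cumcount() == 1].copy()
second['daily_dose'] = second['first_dose']
second = second.drop(columns='first_dose').reset_index(drop=True)
print(second)
```

Patients with a single record simply produce no row, so no index lookup can fail.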
# %% md
# ## Extract a subset of baseline features for PSM analysis
# %%
# Baseline features for PSM analysis, one row per patient
df_lfsb_group_PSM = df_lfsb_group_again[['patient_id', 'case_no', 'start_datetime', 'end_datetime', '日劑量',
                                         'gender', 'age', 'BMI', '糖尿病', '高血壓', '劑量分組']]
# %%
# Sort before extracting
df_lfsb_group_PSM = df_lfsb_group_PSM.sort_values(['patient_id', 'case_no', 'start_datetime'])
# PSM uses the second admission record. Note that the daily dose for a patient's second admission
# should be the dose at the first discharge, not the dose recorded at the second discharge.
temp_list = []
for i in np.unique(df_lfsb_group_PSM['patient_id']):
    temp = df_lfsb_group_PSM[df_lfsb_group_PSM['patient_id'] == i]
    temp = temp.reset_index(drop=True)
    # The daily dose at readmission is the dose at the first discharge
    temp.loc[1, '日劑量'] = temp.loc[0, '日劑量']
    temp = temp.iloc[1:2, :]
    temp_list.append(temp)
# %%
df_lfsb_group_PSM = pd.concat(temp_list, axis=0)
df_lfsb_group_PSM = df_lfsb_group_PSM.reset_index(drop=True)
# %%
# Save the readmission data used for PSM analysis
df_lfsb_group_PSM.to_excel(project_path + '/data/processed_data/df_2.4_再次入院分組數據做PSM分析.xlsx')

Readmission statistics for the high- and low-dose groups
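The event counting in this section depends on a helper that reports whether a patient's diagnosis list shares any element with a list of event terms. A minimal sketch of that test with set operations; `has_common_element` and the English terms are hypothetical placeholders, not the lists used in the text.

```python
def has_common_element(items, reference):
    """Return True if at least one element of items also appears in reference."""
    return not set(items).isdisjoint(reference)


# Hypothetical toy data
diagnoses = 'atrial fibrillation;cerebral infarction'.split(';')
stroke_terms = ['cerebral infarction', 'stroke']
bleeding_terms = ['gastrointestinal bleeding']

print(has_common_element(diagnoses, stroke_terms))
print(has_common_element(diagnoses, bleeding_terms))
```

The match is exact per element, so a diagnosis string that merely contains a term as a substring is not counted.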
# %% md
# ## Add admission and discharge times and diagnoses
# %%
df_inp_record
print(type(df_inp_record.loc[0, 'case_no']))
# %%
print('-------------------------Admission and discharge times-----------------------------')
temp_list = []
for i in np.unique(df_lfsb_group_again['case_no']):
    print(i)
    # print(type(i))
    temp = df_lfsb_group_again[df_lfsb_group_again['case_no'] == i].copy()
    temp_inp_time = df_inp_record[df_inp_record['case_no'] == i]
    temp_inp_time = temp_inp_time.reset_index(drop=True)
    # print(temp_inp_time)
    # print(temp_inp_time.loc[0,'adm_date'])
    temp_inp_diagnostic = df_diagnostic_inp_merge[df_diagnostic_inp_merge['case_no'] == i]
    temp_inp_diagnostic = temp_inp_diagnostic.reset_index(drop=True)
    # print(temp_inp_diagnostic)
    # Add admission and discharge times
    temp['adm_date'] = temp_inp_time.loc[0, 'adm_date']
    temp['dis_date'] = temp_inp_time.loc[0, 'dis_date']
    # print(temp)
    # Add admission diagnoses
    temp['diagnostic_content'] = temp_inp_diagnostic.loc[0, 'diagnostic_content']
    print(temp)
    temp_list.append(temp)
# %%
df_lfsb_merge_inp_diagnostic = pd.concat(temp_list)
df_lfsb_merge_inp_diagnostic = df_lfsb_merge_inp_diagnostic.sort_values(['patient_id', 'case_no', 'adm_date'])
df_lfsb_merge_inp_diagnostic = df_lfsb_merge_inp_diagnostic.reset_index(drop=True)
del temp_list
# %%
print(df_lfsb_merge_inp_diagnostic.shape)
print(df_lfsb_merge_inp_diagnostic['patient_id'].nunique())
print(df_lfsb_merge_inp_diagnostic['case_no'].nunique())
# %%
df_lfsb_merge_inp_diagnostic
# %%
# Save the data with admission and discharge times and diagnoses
df_lfsb_merge_inp_diagnostic.to_excel(project_path + '/data/processed_data/df_2.5_并入出入院時間和診斷.xlsx')
# %% md
# ## Count bleeding and stroke events at readmission
# %%
# Sort by time
df_lfsb_merge_inp_diagnostic = df_lfsb_merge_inp_diagnostic.sort_values(['patient_id', 'adm_date'])
# %%
# Return True if the two lists share at least one element, otherwise False
def judge_list_element(list1, list2):
    return any(x in list2 for x in list1)
# %%
# Count bleeding and stroke events at readmission, based on the follow-up diagnosis list (鄭-隨訪診斷)
# Stroke events
stroke_event = ['腦梗死', '腦梗死后遺癥', '腔隙性腦梗死', '大腦動脈栓塞引起的腦梗死', '腦梗塞', '中風', '腦梗死個人史', '腦干梗死(康復期)', '多發性腦梗死',
                '左心耳封堵術后', '左心耳封堵術', '左心房栓子形成', '多發腔隙性腦梗死', '左側基底節區陳舊性腔隙性腦梗塞', '心耳血栓', '小腦梗死', '短暫性腦缺血',
                '陳舊性腦梗死', '左心耳附壁血栓', '腦栓塞', '基底動脈血栓形成腦梗死', '左心耳血栓形成', '腦梗死(基底節大動脈粥樣硬化性)', '多發性腦梗塞', '陳舊性腦梗塞',
                '腦梗死(大腦中動脈心源性)', '大腦動脈狹窄腦梗死', '短暫性腦缺血發作', '腦梗塞后遺癥', '右側小腦半球陳舊性腦梗死', '腦血管取栓術后', '陳舊性腔隙性腦梗死',
                '大腦動脈血栓形成引起的腦梗死', '腎缺血和腎梗死', '左側大腦中動脈支架取栓術后', '多發腔隙性腦梗塞', '胸主動脈附壁血栓', '起搏器血栓形成', '左心耳切除術后',
                '左側頸內動脈血管內抽吸術后', '左心耳血栓']
# Bleeding events
bleeding_event = ['腦梗死后出血轉化', '消化道出血', '出血性腦梗死', '失血性休克', '出血性內痔', '腦出血后遺癥', '胃潰瘍伴有穿孔', '血尿,持續性',
                  '腦內出血', '下消化道出血', '肺泡出血可能', '蛛網膜下腔出血', '女性盆腔血腫', '皮下出血']
# %%
# Sort
df_lfsb_merge_inp_diagnostic = df_lfsb_merge_inp_diagnostic.sort_values(['patient_id', 'case_no', 'adm_date'])
df_lfsb_merge_inp_diagnostic = df_lfsb_merge_inp_diagnostic.reset_index(drop=True)
# Count new bleeding and stroke events at readmission
group_0_num = 0
group_1_num = 0
group_2_num = 0
# Loop over patient ids
for j in np.unique(df_lfsb_merge_inp_diagnostic['patient_id']):
    # print(type(j))
    # Hospital records (case_no) of this patient
    temp = df_lfsb_merge_inp_diagnostic[df_lfsb_merge_inp_diagnostic['patient_id'] == j]
    temp = temp.reset_index(drop=True)
    for k in range(temp.shape[0]):
        temp_diagnostic_list = str(temp.loc[k, 'diagnostic_content']).split(';')
        # Skip the patient if a bleeding or stroke event already exists at the first discharge
        if k == 0:
            if judge_list_element(temp_diagnostic_list, stroke_event) or judge_list_element(temp_diagnostic_list, bleeding_event):
                # if j==7664380:
                #     print('probably misread')
                break
        # Otherwise count the first bleeding or stroke event at a readmission
        if judge_list_element(temp_diagnostic_list, stroke_event) or judge_list_element(temp_diagnostic_list, bleeding_event):
            group_id = temp.loc[(k - 1), '劑量分組']
            if group_id == 0:
                group_0_num += 1
            elif group_id == 1:
                group_1_num += 1
            elif group_id == 2:
                group_2_num += 1
            break
# %%
print('New bleeding/stroke rate at readmission, 10 mg group:', group_0_num, group_0_num / count_10)
print('New bleeding/stroke rate at readmission, 15 mg group:', group_1_num, group_1_num / count_15)
print('New bleeding/stroke rate at readmission, 20 mg group:', group_2_num, group_2_num / count_20)