Model Comparison for Predicting Diabetes Outcomes


Subject of the Project

The dataset is primarily used for predicting, from medical details, the onset of diabetes within five years in females of Pima Indian heritage over the age of 21. It corresponds to a binary (two-class) classification machine learning problem.

We have a dependent variable that indicates diabetes status. Our goal is to model the relationship between the other variables and whether a person has diabetes.

When the various features of a person are entered, we want to establish a machine learning model that predicts whether that person will have diabetes or not. This is a classification problem.

Dataset Information

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

We have 9 columns and 768 instances (rows). The column names are provided as follows:

- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skinfold thickness (mm)
- Insulin: 2-hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg / (height in m)²)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 = non-diabetic, 1 = diabetic)

Data Understanding

# installation / import of libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, mean_squared_error, r2_score, roc_auc_score, roc_curve, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import KFold

# warnings that do not significantly impact the project are ignored
import warnings
warnings.simplefilter(action="ignore")

# read the dataset
df = pd.read_csv("diabetes.csv")

# select the first 5 observations
df.head()

# return a random sample of items from an axis of the object
df.sample(3)

# take a random sample of the dataset at the given fraction
df.sample(frac=0.01)

# size information
df.shape
(768, 9)

# index dtype and column dtypes, non-null values and memory usage information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

# descriptive statistics at the specified percentiles; transposed (.T) to make it easier to read
df.describe([0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]).T

# correlation between variables
df.corr()

Our eventual goal is to exploit patterns in the data to predict the onset of diabetes. Let us visualize some of the differences between those who developed diabetes and those who did not.

#get a histogram of the Glucose column for both classes
col = 'Glucose'
plt.hist(df[df['Outcome']==0][col], 10, alpha=0.5, label='non-diabetes')
plt.hist(df[df['Outcome']==1][col], 10, alpha=0.5, label='diabetes')
plt.legend(loc='upper right')
plt.xlabel(col)
plt.ylabel('Frequency')
plt.title('Histogram of {}'.format(col))
plt.show()

This histogram shows a pretty big difference in Glucose between the two prediction classes.

for col in ['BMI', 'BloodPressure']:
    plt.hist(df[df['Outcome']==0][col], 10, alpha=0.5, label='non-diabetes')
    plt.hist(df[df['Outcome']==1][col], 10, alpha=0.5, label='diabetes')
    plt.legend(loc='upper right')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.title('Histogram of {}'.format(col))
    plt.show()

These histograms show the distributions of 'BMI', 'BloodPressure', and 'Glucose' for the two classes (non-diabetes and diabetes).

There seems to be a large jump in 'Glucose' for those who will eventually develop diabetes. To solidify this, we can visualize the correlation matrix in an attempt to quantify the relationships between these variables.

def plot_corr(df, size=9):
    corr = df.corr()  # assign the correlation matrix to a variable
    fig, ax = plt.subplots(figsize=(size, size))  # figsize determines the size of the chart
    cax = ax.matshow(corr, interpolation='nearest')  # draw the correlation matrix as a colored grid
    fig.colorbar(cax)  # add the color scale
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=65)  # rotation tilts the column labels
    plt.yticks(range(len(corr.columns)), corr.columns)

# draw the correlation matrix of the dataframe using the function
plot_corr(df)

# the correlation matrix can also be drawn with the seaborn library
import seaborn as sb
sb.heatmap(df.corr());

# annot=True also prints the correlation coefficient in each cell
sb.heatmap(df.corr(), annot=True);

Conclusion: the highest correlations with Outcome were observed for Glucose, BMI, Age, and Pregnancies.
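As a quick numeric check (a sketch, not in the original notebook), the correlations with Outcome can be sorted directly:

# sort the correlations of every feature with Outcome (sketch; not in the original notebook)
df.corr()["Outcome"].drop("Outcome").sort_values(ascending=False)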

# proportions of classes 0 and 1 in Outcome
df["Outcome"].value_counts() * 100 / len(df)
0    65.104167
1    34.895833
Name: Outcome, dtype: float64

# how many observations are in classes 0 and 1
df.Outcome.value_counts()
0    500
1    268
Name: Outcome, dtype: int64

# histogram of the Age variable
df["Age"].hist(edgecolor="black");

# Age, Glucose and BMI means grouped by the Outcome variable
df.groupby("Outcome").agg({"Age": "mean", "Glucose": "mean", "BMI": "mean"})

Data Pre-Processing

Missing Data Analysis

# at first glance there is no missing data in the dataset
df.isnull().sum()
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

# zeros in these variables actually mean NA, so NaN is assigned in place of 0
df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']] = df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']].replace(0, np.NaN)

At first glance there appear to be no missing values in the dataset, but when the variables are examined, the zeros in these variables actually represent missing measurements.
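One can verify this directly (a sketch, not in the original notebook; it should be run before the replace(0, np.NaN) step above):

# count the zeros per column before they are converted to NaN (hypothetical check)
(df[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']] == 0).sum()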

# missing values per column after the replacement
df.isnull().sum()
Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

def median_target(var):
    # select the non-null rows of df, ignoring the observation units already filled
    temp = df[df[var].notnull()]
    # group by Outcome and take the median of the variable; reset_index repairs the index
    temp = temp[[var, 'Outcome']].groupby(['Outcome'])[[var]].median().reset_index()
    return temp

The independent and dependent variables are selected from the dataframe, a groupby operation is applied on the dependent variable (Outcome), and then the median of the independent variable is taken for each class.

# median of Glucose according to Outcome values 0 and 1
median_target("Glucose")

# fill the incomplete observations with the class medians (diabetic vs. non-diabetic)
columns = df.columns
columns = columns.drop("Outcome")
for col in columns:
    # the expression before the comma filters the rows, the part after the comma selects the column
    df.loc[(df['Outcome'] == 0) & (df[col].isnull()), col] = median_target(col)[col][0]
    df.loc[(df['Outcome'] == 1) & (df[col].isnull()), col] = median_target(col)[col][1]

Feature Engineering

# according to BMI, some ranges were determined and categorical labels were assigned
NewBMI = pd.Series(["Underweight", "Normal", "Overweight", "Obesity 1", "Obesity 2", "Obesity 3"], dtype="category")
df["NewBMI"] = NewBMI
df.loc[df["BMI"] < 18.5, "NewBMI"] = NewBMI[0]
df.loc[(df["BMI"] > 18.5) & (df["BMI"] <= 24.9), "NewBMI"] = NewBMI[1]
df.loc[(df["BMI"] > 24.9) & (df["BMI"] <= 29.9), "NewBMI"] = NewBMI[2]
df.loc[(df["BMI"] > 29.9) & (df["BMI"] <= 34.9), "NewBMI"] = NewBMI[3]
df.loc[(df["BMI"] > 34.9) & (df["BMI"] <= 39.9), "NewBMI"] = NewBMI[4]
df.loc[df["BMI"] > 39.9, "NewBMI"] = NewBMI[5]
df.head()

# categorical variable creation according to the insulin value (16-166 is taken as the normal range)
def set_insulin(row):
    if row["Insulin"] >= 16 and row["Insulin"] <= 166:
        return "Normal"
    else:
        return "Abnormal"

# NewInsulinScore variable added with set_insulin
df["NewInsulinScore"] = df.apply(set_insulin, axis=1)
df.head()

# some intervals were determined according to the glucose variable and assigned categorical labels
NewGlucose = pd.Series(["Low", "Normal", "Overweight", "Secret", "High"], dtype="category")
df["NewGlucose"] = NewGlucose
df.loc[df["Glucose"] <= 70, "NewGlucose"] = NewGlucose[0]
df.loc[(df["Glucose"] > 70) & (df["Glucose"] <= 99), "NewGlucose"] = NewGlucose[1]
df.loc[(df["Glucose"] > 99) & (df["Glucose"] <= 126), "NewGlucose"] = NewGlucose[2]
df.loc[df["Glucose"] > 126, "NewGlucose"] = NewGlucose[3]
df.head()
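The same BMI binning could be written more compactly with pd.cut (a sketch, not part of the original notebook; note that pd.cut treats interval edges as right-inclusive, whereas the chained conditions above leave a BMI of exactly 18.5 unassigned):

# hypothetical alternative: bin BMI with pd.cut instead of chained .loc assignments
bmi_bins = [0, 18.5, 24.9, 29.9, 34.9, 39.9, np.inf]
bmi_labels = ["Underweight", "Normal", "Overweight", "Obesity 1", "Obesity 2", "Obesity 3"]
df["NewBMI"] = pd.cut(df["BMI"], bins=bmi_bins, labels=bmi_labels)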

One-Hot Encoding

# categorical variables are converted into numerical values with a One-Hot Encoding transform
# drop_first=True also protects against the dummy variable trap
df = pd.get_dummies(df, columns=["NewBMI", "NewInsulinScore", "NewGlucose"], drop_first=True)
df.head()

Variable Standardization

# categorical variables
categorical_df = df[['NewBMI_Obesity 1', 'NewBMI_Obesity 2', 'NewBMI_Obesity 3', 'NewBMI_Overweight', 'NewBMI_Underweight',
                     'NewInsulinScore_Normal', 'NewGlucose_Low', 'NewGlucose_Normal', 'NewGlucose_Overweight', 'NewGlucose_Secret']]

# the categorical dummy columns are dropped from df before scaling
y = df["Outcome"]
X = df.drop(["Outcome", 'NewBMI_Obesity 1', 'NewBMI_Obesity 2', 'NewBMI_Obesity 3', 'NewBMI_Overweight', 'NewBMI_Underweight',
             'NewInsulinScore_Normal', 'NewGlucose_Low', 'NewGlucose_Normal', 'NewGlucose_Overweight', 'NewGlucose_Secret'], axis=1)
cols = X.columns
index = X.index

y.head()
0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

X.head()

# standardizing the variables increases the performance of the models
from sklearn.preprocessing import RobustScaler
transformer = RobustScaler().fit(X)
X = transformer.transform(X)
X = pd.DataFrame(X, columns=cols, index=index)
X.head()

# combine the non-categorical and categorical variables again
X = pd.concat([X, categorical_df], axis=1)
X.head()
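RobustScaler centers each feature on its median and scales by the interquartile range, so the standardization is less sensitive to the outliers present in variables such as Insulin than StandardScaler would be.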

Modelling
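The loop that produced the scores below is not shown in this write-up; the following is a minimal sketch of how such a 10-fold cross-validated comparison could be run (the model list and fold settings are assumptions, and the "XGB" entry uses sklearn's GradientBoostingClassifier, matching the tuning section later):

# a sketch of the base-model comparison (model choices and CV settings are assumptions)
models = [("LR", LogisticRegression()),
          ("KNN", KNeighborsClassifier()),
          ("CART", DecisionTreeClassifier(random_state=12345)),
          ("RF", RandomForestClassifier(random_state=12345)),
          ("SVM", SVC(random_state=12345)),
          ("XGB", GradientBoostingClassifier(random_state=12345)),
          ("LightGBM", LGBMClassifier(random_state=12345))]
for name, model in models:
    kfold = KFold(n_splits=10, shuffle=True, random_state=12345)
    cv_results = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))

The reported accuracies (mean and standard deviation over the 10 folds):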

LR: 0.847539 (0.032028)
KNN: 0.837235 (0.031427)
CART: 0.838602 (0.026456)
RF: 0.878947 (0.030074)
SVM: 0.848855 (0.035492)
XGB: 0.880297 (0.029243)
LightGBM: 0.885526 (0.035487)

RF, XGB, and LightGBM gave the best results, so we focused on optimizing these three models.

Model Optimization

Model Tuning

Random Forests Tuning

rf_params = {"n_estimators": [100, 200, 500, 1000],
             "max_features": [3, 5, 7],
             "min_samples_split": [2, 5, 10, 30],
             "max_depth": [3, 5, 8, None]}

rf_model = RandomForestClassifier(random_state=12345)

gs_cv = GridSearchCV(rf_model,
                     rf_params,
                     cv=10,
                     n_jobs=-1,
                     verbose=2).fit(X, y)

gs_cv.best_params_
{'max_depth': None,
 'max_features': 7,
 'min_samples_split': 5,
 'n_estimators': 500}

Fitting the Final Model

rf_tuned = RandomForestClassifier(**gs_cv.best_params_)
rf_tuned = rf_tuned.fit(X, y)
cross_val_score(rf_tuned, X, y, cv=10).mean()
0.8867737525632261

# feature importances of the tuned random forest
feature_imp = pd.Series(rf_tuned.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
sns.barplot(x=feature_imp, y=feature_imp.index, palette="Blues_d")
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Feature Severity Levels")
plt.show()

XGBoost Tuning

# note: despite the XGB name, this section uses sklearn's GradientBoostingClassifier
xgb = GradientBoostingClassifier(random_state=12345)

xgb_params = {"learning_rate": [0.01, 0.1, 0.2, 1],
              "min_samples_split": np.linspace(0.1, 0.5, 3),
              "max_depth": [3, 5, 8],
              "subsample": [0.5, 0.9, 1.0],
              "n_estimators": [100, 500]}

xgb_cv = GridSearchCV(xgb, xgb_params, cv=10, n_jobs=-1, verbose=2).fit(X, y)

xgb_cv.best_params_
{'learning_rate': 0.1,
 'max_depth': 8,
 'min_samples_split': 0.1,
 'n_estimators': 100,
 'subsample': 0.9}

Fitting the Final Model

xgb_tuned = GradientBoostingClassifier(**xgb_cv.best_params_).fit(X, y)
cross_val_score(xgb_tuned, X, y, cv=10).mean()
0.8867737525632263

feature_imp = pd.Series(xgb_tuned.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
sns.barplot(x=feature_imp, y=feature_imp.index, palette="Blues_d")
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Feature Severity Levels")
plt.show()

LightGBM Tuning

lgbm = LGBMClassifier(random_state=12345)

lgbm_params = {"learning_rate": [0.01, 0.03, 0.05, 0.1, 0.5],
               "n_estimators": [500, 1000, 1500],
               "max_depth": [3, 5, 8]}

gs_cv = GridSearchCV(lgbm,
                     lgbm_params,
                     cv=10,
                     n_jobs=-1,
                     verbose=2).fit(X, y)

gs_cv.best_params_
{'learning_rate': 0.01, 'max_depth': 8, 'n_estimators': 500}

Fitting the Final Model

lgbm_tuned = LGBMClassifier(**gs_cv.best_params_).fit(X, y)
cross_val_score(lgbm_tuned, X, y, cv=10).mean()
0.8959330143540669

feature_imp = pd.Series(lgbm_tuned.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
sns.barplot(x=feature_imp, y=feature_imp.index, palette="Blues_d")
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Feature Severity Levels")
plt.show()

Comparison of Final Models
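As before, the comparison loop itself is not shown; a minimal sketch, assuming the tuned models are simply re-evaluated with 10-fold cross-validation:

# sketch: re-evaluate the tuned models with 10-fold CV (settings are assumptions)
final_models = [("RF", rf_tuned), ("XGB", xgb_tuned), ("LightGBM", lgbm_tuned)]
for name, model in final_models:
    scores = cross_val_score(model, X, y, cv=10)
    print("%s: %f (%f)" % (name, scores.mean(), scores.std()))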

RF: 0.886791 (0.028298)
XGB: 0.886757 (0.021597)
LightGBM: 0.892003 (0.033222)

Conclusion

- Machine learning models were established to predict whether people will develop diabetes from a set of medical variables.
- The 3 classification models that best describe the dataset were selected and compared according to their success rates: Random Forests, XGBoost, and LightGBM.
- As a result of this comparison, LightGBM was determined to be the model that best describes the dataset and gives the best results.

You can find the Kaggle link of this project here.

Resources

- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html
- https://www.udemy.com/course/python-egitimi/
- https://github.com/omarozt/MachineLearningWorkshop
- https://www.kaggle.com/ibrahimyildiz/pima-indians-diabetes-pred-0-9078-acc
- https://seaborn.pydata.org/examples/color_palettes.html
- https://www.jonobacon.com/2017/08/06/joining-data-world-advisory-board/
- https://towardsdatascience.com/data-preprocessing-concepts-fa946d11c825
- https://becominghuman.ai/data-preprocessing-a-basic-guideline-c0842b7883fa
- Feature Engineering Made Easy, Sinan Ozdemir and Divya Susarla

Translated from: https://medium.com/swlh/model-comparison-for-predicting-diabetes-outcomes-ddcd06384743
