Titanic Dataset with Postgres and Jupyter Notebook
PostgreSQL is a powerful, open source object-relational database system with over 30 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance.
Why use Postgres?
Postgres has a lot of capability. Built using an object-relational model, it supports complex structures and a breadth of built-in and user-defined data types. It provides extensive data capacity and is trusted for its data integrity.
It comes with many features aimed at helping developers build applications, helping administrators protect data integrity and build fault-tolerant environments, and helping you manage your data no matter how big or small the dataset.
We will be using the famous Titanic dataset from Kaggle to predict whether the people aboard were likely to survive the sinking of the world’s greatest ship.
First, make sure that you have valid Postgres credentials and a database with the data already imported. Download the CSV files from the Kaggle website: https://www.kaggle.com/c/titanic/data. The data should look something like this:
We’ll first import the proper libraries. I’m using a local Jupyter environment. Apart from the obvious ones, psycopg2 and sqlalchemy are crucial for creating a connection to Postgres. Make sure you pip install all of them. :)
Next, we’ll be using create_engine from sqlalchemy. It’s very simple to use.
Replace <enter yours> with your own credentials. The default port is 5432 and the default username is ‘postgres’. If the code prints ‘Connected to database’, you have successfully made a connection to your Postgres database.
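A minimal sketch of the connection step, assuming SQLAlchemy 1.4 or later with the psycopg2 driver installed. The helper names build_url and connect, the host localhost, and the database name titanic are assumptions made for illustration; they are not taken from the original notebook.

```python
from sqlalchemy import create_engine, text
from sqlalchemy.engine import URL


def build_url(user, password, host="localhost", port=5432, database="titanic"):
    """Assemble a PostgreSQL connection URL (host and database name are assumed defaults)."""
    return URL.create(
        drivername="postgresql+psycopg2",
        username=user,
        password=password,
        host=host,
        port=port,
        database=database,
    )


def connect(url):
    """Create an engine and run a trivial query to confirm the connection works."""
    engine = create_engine(url)
    with engine.connect() as connection:
        connection.execute(text("SELECT 1"))
    print("Connected to database")
    return engine
```

Calling connect(build_url("postgres", "<enter yours>")) should print the message once the database is reachable; if it is not, SQLAlchemy raises an exception instead.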
Next, let’s convert the query result set to a pandas dataframe.
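One way this step might look, assuming the imported data lives in a table called titanic (a hypothetical name) and that the connection argument is a SQLAlchemy engine or any other connection pandas accepts:

```python
import pandas as pd


def load_table(connection, table_name="titanic"):
    """Read every row of one table into a pandas DataFrame.

    The table name is an assumed default; pass only trusted names,
    because it is interpolated directly into the SQL string.
    """
    query = f"SELECT * FROM {table_name}"
    return pd.read_sql_query(query, connection)
```

After df = load_table(engine), df.shape gives the number of rows and columns.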
As you can see, the dataframe has 887 rows and 9 columns, the first being id.
In the next section, let’s try to figure out whether any of the data is directly associated with the survival rate. We’ll check whether sex, passenger class, and having family aboard have anything to do with a passenger’s chance of surviving.
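A survival rate per group can be computed with a groupby. This is only a sketch: it assumes the label column is named survived and holds 0/1 values, and the column names passed in (for example sex, pclass, or a siblings/spouses count) are assumptions about how the table was imported.

```python
import pandas as pd


def survival_rate_by(df: pd.DataFrame, column: str, target: str = "survived") -> pd.Series:
    """Return the mean of the 0/1 target for each distinct value of `column`."""
    return df.groupby(column)[target].mean()
```

For example, survival_rate_by(df, "sex") returns one survival fraction per sex.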
As you can see, 74% of women aboard survived and only 19% of men did. Passenger class also has an enormous effect. Having siblings or spouses aboard is not correlated with survival. Let’s take a look at a visual correlation between age and survival.
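The original plot is not reproduced here. One possible way to draw it is a pair of overlaid histograms, sketched below with matplotlib. The column names age and survived and the bin count are assumptions.

```python
import matplotlib.pyplot as plt


def plot_age_by_survival(df, age_col="age", target_col="survived", bins=20):
    """Overlay age histograms for passengers who died (0) and who survived (1)."""
    fig, ax = plt.subplots(figsize=(8, 4))
    for outcome, label in [(0, "died"), (1, "survived")]:
        # Rows with a missing age are skipped for the plot only
        ages = df.loc[df[target_col] == outcome, age_col].dropna()
        ax.hist(ages, bins=bins, alpha=0.5, label=label)
    ax.set_xlabel(age_col)
    ax.set_ylabel("passengers")
    ax.legend()
    return ax
```

In a Jupyter notebook the figure is displayed when the cell finishes; elsewhere, call plt.show().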
A significant number of toddlers died in the accident. Most of the passengers were middle-aged.
Since computers like numbers more than words, I have converted sex into a binary (0/1) variable.
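A minimal sketch of such an encoding, assuming the column is named sex and contains the strings male and female. Which value becomes 0 and which becomes 1 is an arbitrary choice made here for illustration.

```python
import pandas as pd


def encode_sex(df: pd.DataFrame, column: str = "sex") -> pd.DataFrame:
    """Return a copy of `df` with male/female replaced by 0/1 (mapping direction is assumed)."""
    encoded = df.copy()
    encoded[column] = encoded[column].map({"male": 0, "female": 1})
    return encoded
```

Any value other than the two mapped strings becomes NaN, so it is worth checking the column for missing values afterwards.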
The data itself remains the same.
Finally, let’s dive into preprocessing for classification.
I used sklearn’s train_test_split to create a training and test dataset.
We have to drop the ‘survived’ column from the training features; otherwise the model would be handed the very answer it is supposed to predict.
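The split and the dropped label column can be sketched together as follows. The 80/20 split and the fixed random seed are assumptions, since the original settings are not stated.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_features_and_target(df: pd.DataFrame, target: str = "survived",
                              test_size: float = 0.2, seed: int = 42):
    """Separate the label from the features, then split both into train and test parts."""
    X = df.drop(columns=[target])  # features only, label removed
    y = df[target]                 # the label the model should predict
    return train_test_split(X, y, test_size=test_size, random_state=seed)
```

The function returns X_train, X_test, y_train, y_test, in that order.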
Finally, we fit the model on the training data and got an accuracy of 74.33%, which is not great, but not bad either. Let’s save the predicted values to a CSV file called ‘submission.csv’. It will only have two columns: passengerId and a boolean indicating survival.
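The text does not say which model was used, so the sketch below swaps in scikit-learn’s LogisticRegression purely as a stand-in. The helper names and the output column name survived are also assumptions. It expects numeric features with no missing values.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def fit_and_score(X_train, X_test, y_train, y_test):
    """Fit a stand-in classifier and return it with its test predictions and accuracy."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return model, predictions, accuracy_score(y_test, predictions)


def save_submission(passenger_ids, predictions, path="submission.csv"):
    """Write a two-column CSV: the passenger id and a boolean survival flag."""
    submission = pd.DataFrame({
        "passengerId": list(passenger_ids),
        "survived": [bool(p) for p in predictions],
    })
    submission.to_csv(path, index=False)
    return submission
```

accuracy_score returns a fraction between 0 and 1, so multiply by 100 to compare it with a percentage. If the test features still carry the id column, save_submission(X_test["id"], predictions) writes the file.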
Summary:
- Use Postgres as a transactional database management system for data pipelines.
- Have fun manipulating data with pandas and visualisation libraries such as matplotlib and seaborn.
- Make predictions using the machine learning algorithms provided by scikit-learn and tensorflow.
Thanks ;)
Translated from: https://medium.com/@cvetko.tim/titanic-dataset-with-postgres-and-jupyter-notebook-69073c4a67e6