Analyzing CitiBike Data (EDA)

Published: 2023/11/29

Data Science

CitiBike is New York City's famous bike rental company and the largest in the USA. CitiBike launched in May 2013 and has become an essential part of the city's transportation network. It makes commuting fun, efficient, and affordable, not to mention healthy and good for the environment.

I obtained the CitiBike trip data for June 2013 from Kaggle. I will walk you through a complete exploratory data analysis that answers questions such as:

  • Where do CitiBikers ride?
  • When do they ride?
  • How far do they go?
  • Which stations are most popular?
  • What days of the week are most rides taken on?
  • And many more

    Key learning:

    I have used many parameters to tweak the plotting functions of Matplotlib and Seaborn, so this article is a practical way to learn them.

    Note:

    This article is best viewed on a larger screen, such as a tablet or desktop. If anything is unclear, I link to my Kaggle notebook at the end of this article, and you can leave your queries in its comment section.

    Let's get started

    We start by importing the necessary libraries and reading the data.

    # importing necessary libraries
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # setting plot style to seaborn
    # (on Matplotlib 3.6 and later this style is named 'seaborn-v0_8')
    plt.style.use('seaborn')

    # reading data
    df = pd.read_csv('../input/citibike-system-data/201306-citibike-tripdata.csv')
    df.head()

    Let's get some more information on the data.

    df.info()

    # sum of missing values in each column
    df.isna().sum()

    We have a whopping 577,703 rows and 15 columns to crunch, along with quite a few missing values. Let's deal with the missing values first.

    Handling missing values

    Let's first look at the percentage of missing values, which will help us decide whether or not to drop them.

    # calculating the percentage of missing values:
    # number of missing values in the column, divided by the total number of rows, multiplied by 100
    data_loss1 = round((df['end station id'].isna().sum() / df.shape[0]) * 100)
    data_loss2 = round((df['birth year'].isna().sum() / df.shape[0]) * 100)

    print(data_loss1, '% of data loss if NaN rows of end station id, \nend station name, end station latitude and end station longitude dropped.\n')
    print(data_loss2, '% of data loss if NaN rows of birth year dropped.')

    We cannot afford to drop the rows with a missing 'birth year', because that would discard too much data. Instead, we drop the entire 'birth year' column and drop only the rows with missing values in 'end station id', 'end station name', 'end station latitude', and 'end station longitude'. Fortunately, the missing values in these four columns all fall on exactly the same rows, so dropping the NaN rows across all four columns still results in only 3% data loss.
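The two checks behind this decision can be sketched on a toy frame. This is an illustration only: the small DataFrame and its values are made up, and only the column names come from the dataset. It computes the share of missing values for every column in one call and confirms that the end-station columns are missing on exactly the same rows.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration); only the column names mirror the dataset.
toy = pd.DataFrame({
    'end station id': [72.0, np.nan, 79.0, np.nan],
    'end station name': ['W 52 St', None, 'Franklin St', None],
    'birth year': [1983.0, np.nan, np.nan, 1990.0],
})

# Percentage of missing values per column: mean of the boolean mask times 100.
missing_pct = toy.isna().mean() * 100
print(missing_pct)

# Do the end-station columns go missing on exactly the same rows?
# If so, "any column missing" and "all columns missing" agree on every row.
end_cols = ['end station id', 'end station name']
missing_mask = toy[end_cols].isna()
same_rows = bool((missing_mask.any(axis=1) == missing_mask.all(axis=1)).all())
print(same_rows)
```

If `same_rows` were `False` on the real data, dropping NaN rows across the four columns would lose more rows than any single column suggests.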

    # dropping NaN values
    rows_before_dropping = df.shape[0]

    # drop the entire birth year column
    df.drop('birth year', axis=1, inplace=True)

    # now left with end station id, end station name, end station latitude and end station longitude;
    # these four columns have missing values in exactly the same rows,
    # so dropping NaN from all four columns still results in only 3% data loss
    df.dropna(axis=0, inplace=True)
    rows_after_dropping = df.shape[0]

    # total data loss
    print('% of data lost: ', ((rows_before_dropping - rows_after_dropping) / rows_before_dropping) * 100)

    # checking for NaN
    df.isna().sum()

    What gender tells us about our data

    # plotting total number of males and females
    splot = sns.countplot(x='gender', data=df)

    # adding value above each bar: annotation
    for p in splot.patches:
        # bar value is nothing but the height of the bar
        an = splot.annotate(format(p.get_height(), '.2f'),
                            (p.get_x() + p.get_width() / 2., p.get_height()),
                            ha='center',
                            va='center',
                            xytext=(0, 10),
                            textcoords='offset points')
        an.set_size(20)  # text size

    splot.axes.set_title("Gender distribution", fontsize=30)
    splot.axes.set_xlabel("Gender", fontsize=20)
    splot.axes.set_ylabel("Count", fontsize=20)

    # adding x tick values
    splot.axes.set_xticklabels(['Unknown', 'Male', 'Female'])
    plt.show()

    We see more male riders than female riders in New York City, but because of the large number of riders with unknown gender, we cannot draw any concrete conclusion. Filling in the unknown gender values is possible, but we will not do it, since those riders chose not to disclose their gender.
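To put numbers on such a comparison, the counts can be turned into shares. The sketch below uses made-up values and assumes that gender is coded as 0, 1 and 2 in the order of the tick labels used in the plot above; that coding is an assumption, not something the data description here confirms.

```python
import pandas as pd

# Assumption: gender is coded 0 = Unknown, 1 = Male, 2 = Female,
# matching the tick labels used in the plot above.
gender_labels = {0: 'Unknown', 1: 'Male', 2: 'Female'}

# Toy values, made up for illustration.
toy_gender = pd.Series([1, 1, 2, 0, 1, 2, 0, 1])

# Share of each category in percent.
gender_share = (
    toy_gender.map(gender_labels)
    .value_counts(normalize=True)
    .mul(100)
    .round(1)
)
print(gender_share)
```

Seeing the share of 'Unknown' next to the other two makes it easier to judge how much weight the male/female comparison can carry.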

    Subscribers vs Customers

    Subscribers are the users who bought the annual pass, and customers are the ones who bought either a 24-hour pass or a 3-day pass. Let's see which option riders choose more often.

    user_type_count = df['usertype'].value_counts()
    plt.pie(user_type_count.values,
            labels=user_type_count.index,
            autopct='%1.2f%%',
            textprops={'fontsize': 15})
    plt.title('Subscribers vs Customers', fontsize=20)
    plt.show()

    There are more annual subscribers than 1-3 day customers. However, the difference is not large, so the company should focus on converting customers into subscribers with offers or sales.

    How long do riders typically use the bike

    We have a column called 'tripduration', which gives the duration of each trip in seconds. First, we convert it to minutes, then create bins to group the trips by ride time: 0–30 min, 30–60 min, 60–120 min, and 120 min and above. Then we plot a graph to see how long riders typically use the bike.

    # converting trip duration from seconds to minutes
    df['tripduration'] = df['tripduration'] / 60

    # creating bins (0-30min, 30-60min, 60-120min, 120 and above)
    max_limit = df['tripduration'].max()
    df['tripduration_bins'] = pd.cut(df['tripduration'], [0, 30, 60, 120, max_limit])

    # estimator=np.size makes each bar show the number of trips in the bin
    sns.barplot(x='tripduration_bins', y='tripduration', data=df, estimator=np.size)
    plt.title('Usual riding time', fontsize=30)
    plt.xlabel('Trip duration group', fontsize=20)
    plt.ylabel('Number of trips', fontsize=20)
    plt.show()
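The binning step can also be tried in isolation. The following is a minimal sketch on made-up durations; the bin labels are hypothetical names, and `np.inf` is used as the last edge as an alternative to the column maximum so that the top bin stays open-ended. Note that `pd.cut` bins are right-inclusive by default, so a trip of exactly 30 minutes lands in the first bin.

```python
import numpy as np
import pandas as pd

# Toy trip durations in minutes (made up for illustration).
minutes = pd.Series([5.0, 29.9, 30.0, 45.0, 90.0, 300.0])

# np.inf as the last edge keeps the top bin open-ended;
# the labels are hypothetical names chosen for readability.
edges = [0, 30, 60, 120, np.inf]
labels = ['0-30 min', '30-60 min', '60-120 min', '120+ min']
binned = pd.cut(minutes, bins=edges, labels=labels)

# Number of trips per bin, kept in bin order.
trips_per_bin = binned.value_counts().sort_index()
print(trips_per_bin)
```

Counting with `value_counts()` gives the same numbers the bar chart displays, without going through a plotting call.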

    A large number of riders ride for less than half an hour per trip, and most trips last less than an hour.

    Same start and end location vs different start and end location

    The data contains some trips that start and end at the same location. Let's see how many.

    # number of trips that started and ended at the same station
    start_end_same = df[df['start station name'] == df['end station name']].shape[0]

    # number of trips that started and ended at different stations
    start_end_diff = df.shape[0] - start_end_same

    plt.pie([start_end_same, start_end_diff],
            labels=['Same start and end location',
                    'Different start and end location'],
            autopct='%1.2f%%',
            textprops={'fontsize': 15})
    plt.title('Same start and end location vs Different start and end location', fontsize=20)
    plt.show()
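The same share can be read off directly from a boolean mask, which is handy when only the percentage is needed. A small sketch on made-up station names:

```python
import pandas as pd

# Toy trips (station names made up for illustration).
toy_trips = pd.DataFrame({
    'start station name': ['A', 'B', 'C', 'A'],
    'end station name':   ['A', 'C', 'C', 'B'],
})

# Boolean mask: True where a trip ends where it started.
round_trip = toy_trips['start station name'] == toy_trips['end station name']

# The mean of a boolean Series is the fraction of True values.
round_trip_pct = round_trip.mean() * 100
print(round_trip_pct)
```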

    Riding pattern of the month

    This is the part where I spent the most time and effort. The graph below says a lot, and it takes a fair amount of code to produce. Before looking at the code, here is an overview of what we are doing: we plot a time series to see the trend in the number of rides taken per day, together with the trend in the total time the bikes were in use per day. Let's look at the code first, then I will break it down for you.

    # converting string to datetime object
    df['starttime'] = pd.to_datetime(df['starttime'])

    # since we are dealing with a single month, we group by day,
    # using count aggregation to get the number of occurrences, i.e. total trips per day
    start_time_count = df.set_index('starttime').groupby(pd.Grouper(freq='D')).count()

    # the last row holds a single day of July data, so let's drop it
    start_time_count.drop(start_time_count.tail(1).index, axis=0, inplace=True)

    # again grouping by day, aggregating with sum to get the total trip duration per day,
    # which will be used while plotting (numeric_only skips the text columns)
    trip_duration_count = df.set_index('starttime').groupby(pd.Grouper(freq='D')).sum(numeric_only=True)

    # again dropping the last row for the same reason
    trip_duration_count.drop(trip_duration_count.tail(1).index, axis=0, inplace=True)

    # plotting total rides per day,
    # using start station id to get the count
    fig, ax = plt.subplots(figsize=(25, 10))
    ax.bar(start_time_count.index, 'start station id', data=start_time_count, label='Total riders')
    # bbox_to_anchor positions the legend box
    ax.legend(loc="lower left", bbox_to_anchor=(0.01, 0.89), fontsize=20)
    ax.set_xlabel('Days of the month June 2013', fontsize=30)
    ax.set_ylabel('Riders', fontsize=40)
    ax.set_title('Bikers trend for the month June', fontsize=50)

    # creating a twin x axis to plot the line chart in the same figure
    ax2 = ax.twinx()
    # plotting total trip duration of all users per day
    ax2.plot('tripduration', data=trip_duration_count, color='y', label='Total trip duration', marker='o', linewidth=5, markersize=12)
    ax2.set_ylabel('Time duration', fontsize=40)
    ax2.legend(loc="upper left", bbox_to_anchor=(0.01, 0.9), fontsize=20)

    ax.set_xticks(trip_duration_count.index)
    ax.set_xticklabels([i for i in range(1, 31)])

    # tweaking x and y tick labels of axes 1
    ax.tick_params(labelsize=30, labelcolor='#eb4034')
    # tweaking x and y tick labels of axes 2
    ax2.tick_params(labelsize=30, labelcolor='#eb4034')

    plt.show()

    You may have understood the basic idea from the comments, but let me explain the process step by step:

  1. The date-time is stored as a string, so we convert it into a datetime object.
  2. Group the data by day of the month and count the number of occurrences to get the rides per day.
  3. Only one row holds information for the month of July. It does not belong to June, so we drop it.
  4. Repeat steps 2 and 3, but this time sum the data instead of counting it, to get the total trip duration per day.
  5. Plot both series on a single graph using the twin axis method.

    I have used a lot of Matplotlib tweaks here, so make sure to go through each of them. If you have any doubts, leave a comment on the Kaggle notebook linked at the end of this article.
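The grouping steps above can be sketched on a handful of made-up trips. This is not the notebook's code: it filters on the month instead of dropping the last row by position, and it uses `resample('D')` with two aggregations at once, which is one possible alternative to the two separate `groupby` calls.

```python
import pandas as pd

# Toy trips (made up): start time as text and duration in minutes.
toy = pd.DataFrame({
    'starttime': ['2013-06-01 08:00:00', '2013-06-01 17:30:00',
                  '2013-06-02 09:15:00', '2013-07-01 00:05:00'],
    'tripduration': [10.0, 20.0, 15.0, 5.0],
})

# Convert the text column to datetime.
toy['starttime'] = pd.to_datetime(toy['starttime'])

# Keep only June rows (an alternative to dropping the last row by position).
june = toy[toy['starttime'].dt.month == 6]

# Count and sum per calendar day in a single pass.
daily = june.set_index('starttime')['tripduration'].resample('D').agg(['count', 'sum'])
print(daily)
```

The `count` column corresponds to rides per day and the `sum` column to total riding time per day, the two series drawn on the twin axes.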

    The number of rides increases considerably towards the end of the month, and there are almost no rides on the first Sunday of the month. The total time riders spend on the bikes decreases towards the end of the month.

    Top 10 start stations

    This is pretty straightforward: we count the occurrences of each start station using value_counts(), slice the first 10 values, and plot them.

    # top 10 start stations
    # (assumed setup, mirroring the end-station code in the next section)
    top_start_station = df['start station name'].value_counts()[:10]

    fig, ax = plt.subplots(figsize=(20, 8))
    ax.bar(x=top_start_station.index, height=top_start_station.values, width=0.5)

    # adding value above each bar: annotation
    for p in ax.patches:
        an = ax.annotate(format(p.get_height(), '.2f'),
                         (p.get_x() + p.get_width() / 2., p.get_height()),
                         ha='center',
                         va='center',
                         xytext=(0, 10),
                         textcoords='offset points')
        an.set_size(20)

    ax.set_title("Top 10 start locations in NY", fontsize=30)
    ax.set_xlabel("Station name", fontsize=20)

    # rotating the x tick labels by 45 degrees
    ax.set_xticklabels(top_start_station.index, rotation=45, ha="right")
    ax.set_ylabel("Count", fontsize=20)
    # tweaking x and y tick labels
    ax.tick_params(labelsize=15)
    plt.show()

    Top 10 end stations

    # top 10 end stations
    top_end_station = df['end station name'].value_counts()[:10]

    fig, ax = plt.subplots(figsize=(20, 8))
    ax.bar(x=top_end_station.index, height=top_end_station.values, color='#edde68', width=0.5)

    # adding value above each bar: annotation
    for p in ax.patches:
        an = ax.annotate(format(p.get_height(), '.2f'),
                         (p.get_x() + p.get_width() / 2., p.get_height()),
                         ha='center',
                         va='center',
                         xytext=(0, 10),
                         textcoords='offset points')
        an.set_size(20)

    ax.set_title("Top 10 end locations in NY", fontsize=30)
    ax.set_xlabel("Station name", fontsize=20)

    # rotating the x tick labels by 45 degrees
    ax.set_xticklabels(top_end_station.index, rotation=45, ha="right")
    ax.set_ylabel("Count", fontsize=20)
    # tweaking x and y tick labels
    ax.tick_params(labelsize=15)
    plt.show()
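Since the start-station and end-station rankings differ only in the column they read, the counting step can be shared. Below is a minimal sketch with a hypothetical helper, `top_stations`, which is not part of the original notebook; the station names are made up.

```python
import pandas as pd

def top_stations(trips: pd.DataFrame, column: str, n: int = 10) -> pd.Series:
    """Return the n most frequent values of `column`, most frequent first."""
    return trips[column].value_counts().head(n)

# Toy trips (station names made up for illustration).
toy_trips = pd.DataFrame({
    'start station name': ['A', 'B', 'A', 'C', 'A', 'B'],
    'end station name':   ['B', 'B', 'C', 'B', 'A', 'C'],
})

top_start = top_stations(toy_trips, 'start station name', n=2)
top_end = top_stations(toy_trips, 'end station name', n=2)
print(top_start)
print(top_end)
```

The returned Series can be passed to the same plotting code for either chart, with its index as the bar labels and its values as the bar heights.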

    The full analysis is worked out in my Kaggle notebook. Feel free to leave queries in its comment section.

    Translated from: https://medium.com/towards-artificial-intelligence/analyzing-citibike-data-eda-e657409f007a
