Airbnb Data Analysis

Published: 2024/1/18

A simple Airbnb data analysis in Python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Import the required libraries.
1. Analysis of the calendar dataset

calendar = pd.read_csv(r'C:\Users\12435\Desktop\shujufenxi\數分清華\aibiying\calendar_detail.csv')
calendar.head()

Load the dataset and preview it.

calendar.info()

First, convert the prices to floating-point numbers.

# Strip "$" and "," from the price columns and cast to float
calendar['price'] = calendar['price'].str.replace(r'[$,]', '', regex=True).astype(np.float32)
calendar['adjusted_price'] = calendar['adjusted_price'].str.replace(r'[$,]', '', regex=True).astype(np.float32)
# Convert the date column to datetime
calendar['date'] = pd.to_datetime(calendar['date'], format='%Y-%m-%d')
# Add month and weekday (1 = Monday, ..., 7 = Sunday)
calendar['month'] = calendar['date'].dt.month
calendar['weekday'] = calendar['date'].dt.weekday + 1
calendar.head()
# Mean price by month
month_price = calendar.groupby('month')['price'].mean()
sns.barplot(x=month_price.index, y=month_price.values)
plt.ylim(600, 700)


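Prices are lower in the off-season months of March and April, and higher during the summer holiday in July and August and the National Day holiday in October.

# Mean price by weekday
weekday_price = calendar.groupby('weekday')['price'].mean()
sns.barplot(x=weekday_price.index, y=weekday_price.values)
plt.ylim(600, 700)

Prices are higher on Fridays and Saturdays.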
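
The two preprocessing steps used above, stripping the currency formatting and numbering the weekdays, can be checked on a few toy values. This is only a sketch: the price strings and dates below are made up for illustration and are not taken from the calendar file.

```python
import pandas as pd

# Made-up price strings in the "$1,234.00" style (assumed format)
raw_price = pd.Series(["$1,234.00", "$85.00", "$600.00"])
# Remove "$" and "," and cast to float
price = raw_price.str.replace(r"[$,]", "", regex=True).astype("float64")
print(price.tolist())  # [1234.0, 85.0, 600.0]

# 2019-04-05 was a Friday. dt.weekday counts from Monday = 0,
# so adding 1 gives Monday = 1 ... Sunday = 7.
dates = pd.to_datetime(pd.Series(["2019-04-05", "2019-04-06", "2019-04-07", "2019-04-08"]))
weekday = (dates.dt.weekday + 1).tolist()
print(weekday)  # [5, 6, 7, 1]
```

With this numbering, the Friday and Saturday bars in the weekday chart are the ones labelled 5 and 6.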
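Next, look at the distribution of prices.

# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable chart
sns.histplot(calendar[calendar['price'] < 1000]['price'], kde=True, stat='density')

After excluding some outliers (prices of 1000 or more), the histogram shows a right-skewed distribution. Most listings are priced at around 200-400 yuan.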
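
Right skew can also be checked numerically rather than only by eye. The sketch below uses made-up prices, not the real data: in a right-skewed sample the mean is pulled above the median by a few expensive listings, and the sample skewness is positive.

```python
import pandas as pd

# Made-up prices: most values sit around 200-400, one is much higher
prices = pd.Series([200, 250, 300, 300, 350, 400, 900], dtype="float64")

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()  # sample skewness; positive means a longer right tail

print(mean_price > median_price)  # True
print(skewness > 0)               # True
```

The same three calls could be applied to `calendar['price']` after filtering, assuming the column has already been converted to floats.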
2. Analysis of the listings dataset

listings = pd.read_csv(r'C:\Users\12435\Desktop\shujufenxi\數分清華\aibiying\listings_detail.csv')
listings.head()

# The dataset has many features (106). Convert the column names to a list to see them all, then pick the interesting ones to process.
listings.columns.to_list()

Convert the data type of the monetary columns.

listings['price'] = listings['price'].str.replace(r'[$,]', '', regex=True).astype(np.float32)
listings['cleaning_fee'] = listings['cleaning_fee'].str.replace(r'[$,]', '', regex=True).astype(np.float32)
listings['cleaning_fee'].head()
# Null values mean that some listings charge no cleaning fee, so fill them with 0
listings['cleaning_fee'] = listings['cleaning_fee'].fillna(0)
# Add a new field, the minimum cost: (price + cleaning fee) * minimum nights
listings['min_cost'] = (listings['price'] + listings['cleaning_fee']) * listings['minimum_minimum_nights']
listings['min_cost'].head()

# Add the number of amenities
listings['amenities'].head()

listings['n_amenities'] = listings['amenities'].str[1:-1].str.split(',').apply(len)
listings['n_amenities'].head()
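
One edge case in the counting step is worth a quick check. The sketch below assumes the amenities column holds strings such as `{TV,Wifi,Kitchen}` (which is what the `str[1:-1]` slice suggests); the toy values and the names `inner`, `naive` and `fixed` are hypothetical. Splitting an empty string still yields one element, so a listing with no amenities would be counted as having one.

```python
import pandas as pd

# Made-up amenities strings; the last listing has no amenities at all
amenities = pd.Series(['{TV,Wifi,Kitchen}', '{"Air conditioning",Heating}', '{}'])

inner = amenities.str[1:-1]                # drop the surrounding braces
naive = inner.str.split(',').apply(len)    # same logic as the line above
print(naive.tolist())                      # [3, 2, 1]  ('' splits into [''])

# One possible adjustment: report 0 when nothing is left between the braces
fixed = naive.where(inner != '', 0)
print(fixed.tolist())                      # [3, 2, 0]
```

A quoted amenity name that itself contains a comma would still be over-counted by either version.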

# Add a new column that classifies each listing by the number of guests it accommodates: single (1), couple (2), family (3-4), group (5 or more)
listings['accommodates_type'] = pd.cut(listings['accommodates'], bins=[1, 2, 3, 5, 100], include_lowest=True, right=False, labels=['single', 'couple', 'family', 'group'])
listings['accommodates_type']
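
Because `right=False` makes every bin closed on the left and open on the right, it helps to confirm where the boundary values land. A minimal sketch with made-up guest counts, using the same bins and labels as above:

```python
import pandas as pd

# Made-up guest capacities covering each bin boundary
accommodates = pd.Series([1, 2, 3, 4, 5, 16])

# Bins are [1, 2), [2, 3), [3, 5), [5, 100)
kind = pd.cut(
    accommodates,
    bins=[1, 2, 3, 5, 100],
    include_lowest=True,
    right=False,
    labels=['single', 'couple', 'family', 'group'],
)
labels = kind.astype(str).tolist()
print(labels)  # ['single', 'couple', 'family', 'family', 'group', 'group']
```

So a listing for exactly 5 guests is labelled "group", not "family".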


Select the features that have a larger influence on price for analysis.

listings_df = listings[['id', 'host_id', 'listing_url', 'room_type', 'neighbourhood_cleansed', 'price', 'cleaning_fee', 'amenities', 'n_amenities', 'accommodates', 'accommodates_type', 'minimum_minimum_nights', 'min_cost']]

Processing applied to the dataset so far:

Convert the price and cleaning fee to floats, fill missing cleaning fees with 0, and combine them with the minimum number of nights to compute the minimum cost.
Count the amenities from the amenities string: listings['amenities'].str[1:-1].str.split(',').apply(len)
Add a new column that classifies listings by guest capacity, using pd.cut.
Extract the required features into a new DataFrame.

listings_df.head()

# Room types
listings_df['room_type'].unique()

There are three room types.

room_type_counts = listings_df['room_type'].value_counts()
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
# Pie chart on the left, bar chart on the right
axes[0].pie(room_type_counts.values, autopct='%.2f%%', labels=room_type_counts.index)
sns.barplot(x=room_type_counts.index, y=room_type_counts.values, ax=axes[1])
plt.tight_layout()
plt.show()

Share of each room type:
Entire homes/apartments account for about 60%, private rooms for 35%, and shared rooms for less than 6%.
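
The percentages can also be read as numbers instead of from the pie chart. A small sketch with made-up room types (the 6/3/1 split is invented, not the real proportions), using `value_counts(normalize=True)`:

```python
import pandas as pd

# Made-up room types: 6 entire homes, 3 private rooms, 1 shared room
room_type = pd.Series(
    ['Entire home/apt'] * 6 + ['Private room'] * 3 + ['Shared room']
)

# normalize=True returns fractions instead of counts
share = (room_type.value_counts(normalize=True) * 100).round(2)
print(share)
```

On the real data the same call would be `listings_df['room_type'].value_counts(normalize=True)`.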

# Distribution of listings across districts
plt.rcParams['font.sans-serif'] = ['SimHei']  # a font that can render Chinese labels
neighbourhood_counts = listings_df['neighbourhood_cleansed'].value_counts()
sns.barplot(x=neighbourhood_counts.values, y=neighbourhood_counts.index, orient='h')

Chaoyang District ranks first with the most listings, followed by Dongcheng District and Haidian District, each of which has only about one third as many listings as Chaoyang.

# Share of each room type within each district
neighbourhood_room_type = (
    listings_df.groupby(['neighbourhood_cleansed', 'room_type'])
    .size()
    .unstack('room_type')
    .fillna(0)
    .apply(lambda row: row / row.sum(), axis=1)
    .sort_values('Entire home/apt')
)
neighbourhood_room_type
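
The group, unstack and normalise chain above can be followed step by step on a tiny table. This is a sketch on made-up rows with hypothetical district names "A" and "B"; it also shows `pd.crosstab` with `normalize='index'`, which gives the same row shares in one call.

```python
import pandas as pd

# Made-up listings: district A has 3 listings, district B has 2
toy = pd.DataFrame({
    'neighbourhood_cleansed': ['A', 'A', 'A', 'B', 'B'],
    'room_type': ['Entire home/apt', 'Private room', 'Entire home/apt',
                  'Shared room', 'Entire home/apt'],
})

# Count listings per (district, room type), pivot room types into columns
counts = (
    toy.groupby(['neighbourhood_cleansed', 'room_type'])
    .size()
    .unstack('room_type')
    .fillna(0)
)
# Divide each row by its total so every district sums to 1
share = counts.div(counts.sum(axis=1), axis=0)

# Equivalent one-liner
alt = pd.crosstab(toy['neighbourhood_cleansed'], toy['room_type'], normalize='index')
print(share)
```

Dividing with `div(..., axis=0)` does the same job as the row-wise `apply` in the original chain.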

columns = neighbourhood_room_type.columns
plt.figure(figsize=(10, 8))
# Build the stacked bars by hand: each room type starts where the previous one ended
plt.barh(neighbourhood_room_type.index, neighbourhood_room_type[columns[0]])
left = neighbourhood_room_type[columns[0]].copy()
plt.barh(neighbourhood_room_type.index, neighbourhood_room_type[columns[1]], left=left)
left += neighbourhood_room_type[columns[1]]
plt.barh(neighbourhood_room_type.index, neighbourhood_room_type[columns[2]], left=left)
plt.legend(columns)

# A simpler approach
fig, ax = plt.subplots(figsize=(10, 5))
neighbourhood_room_type.plot(kind='barh', stacked=True, width=0.75, ax=ax)

Distribution of the number of listings per host.

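host_num = listings_df.groupby('host_id').size()
sns.histplot(host_num, kde=True, stat='density')
# Most hosts have only a few listings, although some have more than two hundred.
# Exclude these outliers and consider only hosts with fewer than 10 listings.

# Only hosts with fewer than 10 listings, outliers excluded
sns.histplot(host_num[host_num < 10], kde=True, stat='density')

# Classify hosts by number of listings
host_num_bins = pd.cut(host_num, bins=[1, 2, 3, 5, 1000], include_lowest=True, right=False, labels=['1', '2', '3-4', '5+'])
host_bin_counts = host_num_bins.value_counts()
plt.pie(host_bin_counts.values, autopct='%.2f%%', labels=host_bin_counts.index)
plt.show()

Hosts with one listing account for 60%, hosts with two listings for 15%, and hosts with 3-4 listings for 12%.

# Cumulative number of listings, with hosts sorted from fewest to most listings
host_num_cumsum = host_num.sort_values().cumsum()
h = host_num_cumsum.reset_index()
sns.lineplot(x=h.index, y=h[0])

The chart above shows that 80% of hosts hold less than 40% of the listings, while the remaining 20% hold more than half, which is roughly in line with the 80/20 rule.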
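
The 80/20 reading can be computed directly instead of being estimated from the curve. The sketch below uses made-up listing counts for ten hypothetical hosts, so the resulting shares are illustrative only.

```python
import pandas as pd

# Made-up number of listings for 10 hosts (40 listings in total)
host_num = pd.Series([1, 1, 1, 1, 1, 1, 2, 2, 5, 25])

# Sort hosts from fewest to most listings and accumulate their share
sorted_counts = host_num.sort_values()
cum_share = sorted_counts.cumsum() / sorted_counts.sum()

# Share of listings held by the bottom 80% of hosts
n_bottom = int(len(sorted_counts) * 0.8)
bottom_share = cum_share.iloc[n_bottom - 1]
top_share = 1 - bottom_share
print(bottom_share, top_share)  # 0.25 0.75
```

Running the same lines on the real `host_num` would give the exact split behind the chart.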

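3. Analysis of the reviews dataset

reviews = pd.read_csv(r'C:\Users\12435\Desktop\shujufenxi\數分清華\aibiying\reviews_detail.csv', parse_dates=['date'])
reviews.head()

reviews['year'] = reviews['date'].dt.year
reviews['month'] = reviews['date'].dt.month
# Number of reviews per year
n_reviews = reviews.groupby('year').size()
sns.barplot(x=n_reviews.index, y=n_reviews.values)

reviews.date.max()

# Number of reviews per calendar month
month_reviews = reviews.groupby('month').size()
sns.barplot(x=month_reviews.index, y=month_reviews.values)

Review counts are higher around the Spring Festival in February and March, the summer holiday in July and August, and National Day in October. May and June have no holidays, and review counts are lower.

# Rows are years, columns are months
year_month_reviews = reviews.groupby(['year', 'month']).size().unstack().fillna(0)
year_month_reviews

fig, ax = plt.subplots(figsize=(10, 5))
# One line per year, with the month on the x-axis
for index in year_month_reviews.index:
    series = year_month_reviews.loc[index]
    sns.lineplot(x=series.index, y=series.values, ax=ax)
ax.legend(labels=year_month_reviews.index)
ax.grid()
ax.set_xticks(range(1, 13))
plt.show()

The chart above shows that review counts have risen steadily at a roughly constant growth rate. The count at the start of each year is almost level with the count at the end of the previous year, so the series keeps rising with fluctuations. Declines occur only in a few periods, such as February and March 2018 and September and October 2017, and these need to be analysed together with what the business was doing at the time.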

4. Predicting listing prices

ml_listings = listings[listings['price'] < 300][[
    'host_is_superhost',
    'host_identity_verified',
    'neighbourhood_cleansed',
    'latitude',
    'longitude',
    'property_type',
    'room_type',
    'accommodates',
    'bathrooms',
    'bedrooms',
    'cleaning_fee',
    'minimum_minimum_nights',
    'maximum_maximum_nights',
    'availability_90',
    'number_of_reviews',
    # 'review_scores_rating',
    'is_business_travel_ready',
    'n_amenities',
    'price',
]]
# Drop rows with missing values
ml_listings = ml_listings.dropna(axis=0)
ml_listings.isnull().sum()

from sklearn.preprocessing import StandardScaler
# Split features and target
features = ml_listings.drop(columns=['price'])
target = ml_listings['price']
# Process the features
# One-hot encode the categorical features
disperse_columns = [
    'host_is_superhost',
    'host_identity_verified',
    'neighbourhood_cleansed',
    'property_type',
    'room_type',
    'is_business_travel_ready',
]
disperse = features[disperse_columns]
disperse = pd.get_dummies(disperse)
# Standardize the continuous features (their values do not differ much in scale, so the effect on the predictions may be small)
continuouse_features = features.drop(columns=disperse_columns)
scaler = StandardScaler()
continuouse_features = scaler.fit_transform(continuouse_features)
# Combine the processed features
feature_array = np.hstack([disperse, continuouse_features])
# Predict with a linear regression model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
xtrain, xtest, ytrain, ytest = train_test_split(feature_array, target, test_size=0.3)
line = LinearRegression()
line = line.fit(xtrain, ytrain)
ypredict = line.predict(xtest)
print('Mean absolute error', mean_absolute_error(ytest, ypredict))
print('R2 score', r2_score(ytest, ypredict))

Mean absolute error 3009977153.5992937
R2 score -5.155611729394443e+18
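
To read these two numbers, it helps to see how the metrics are defined. The sketch below recomputes them by hand with NumPy on made-up values (the helper names `mae` and `r2` are hypothetical): an R2 of 0 means the model does no better than always predicting the mean of the target, and a negative R2 means it does worse, which is the situation in the output above.

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error: average size of the prediction error
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true, y_pred):
    # R2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Made-up prices and three sets of predictions
y_true = np.array([100.0, 200.0, 300.0])
good = np.array([110.0, 190.0, 310.0])   # close to the truth
mean_only = np.full(3, y_true.mean())    # always predicts the mean
bad = np.array([300.0, 100.0, 0.0])      # far from the truth

print(mae(y_true, good), r2(y_true, good))
print(r2(y_true, mean_only))  # 0.0
print(r2(y_true, bad))        # -6.0
```

An R2 on the order of -5e18 therefore indicates predictions that are off by many orders of magnitude on at least some test rows.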

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
regressor = RandomForestRegressor(n_estimators=10)
# 5-fold cross-validation; the default score for a regressor is R2
score = cross_val_score(regressor, feature_array, target, cv=5)
score.mean()

0.39684796601257405

The random forest's predictions are not good: the mean cross-validated R2 is only about 0.40.

5. Predicting review counts

# Number of reviews for each (year, month) pair
ym_reviews = reviews.groupby(['year', 'month']).size().reset_index().rename(columns={0: 'count'})
feature = ym_reviews[['year', 'month']]
target = ym_reviews['count']
cross_val_score(regressor, feature, target, cv=6).mean()
xtrain, xtest, ytrain, ytest = train_test_split(feature, target, test_size=0.3)
regressor = RandomForestRegressor(n_estimators=100)
regressor.fit(xtrain, ytrain)
ypredict = regressor.predict(xtest)
print(mean_absolute_error(ytest, ypredict))
print(r2_score(ytest, ypredict))

170.16655172413795
0.9863879670160787
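
One caveat when reading this score: train_test_split shuffles rows by default, so the test months are scattered among the training months rather than lying after them. An alternative, offered only as a sketch and not as what this analysis did, is to hold out the most recent months. The table below is made up, and `n_test` is a hypothetical choice.

```python
import pandas as pd

# Made-up monthly counts: all of 2018 followed by the first 4 months of 2019
ym = pd.DataFrame({
    'year': [2018] * 12 + [2019] * 4,
    'month': list(range(1, 13)) + [1, 2, 3, 4],
    'count': [100 + 10 * i for i in range(16)],
})

# Sort chronologically, then keep the last n_test months for testing
ym = ym.sort_values(['year', 'month']).reset_index(drop=True)
n_test = 4
train, test = ym.iloc[:-n_test], ym.iloc[-n_test:]

print(len(train), len(test))  # 12 4
```

A model evaluated this way is scored only on months it has not seen, which is closer to the forecasting task in the next step.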

Predict the review counts for the following months of 2019.

# Months 5 to 12 of 2019
a = list()
for i in range(5, 13):
    a.append([2019, i])
# Refit on all available data before predicting
regressor.fit(feature, target)
y_predict = regressor.predict(pd.DataFrame(a, columns=['year', 'month']))
ypredict = pd.DataFrame([[2019, 5 + index, x] for index, x in enumerate(y_predict)], columns=['year', 'month', 'count'])
final_reviews = pd.concat([ym_reviews, ypredict], axis=0).reset_index()
years = final_reviews['year'].unique()
fig, ax = plt.subplots(figsize=(15, 5))
# One line per year, including the predicted months of 2019
for year in years:
    df = final_reviews[final_reviews['year'] == year]
    sns.lineplot(x='month', y='count', data=df, ax=ax)
ax.legend(labels=year_month_reviews.index)
ax.grid()
ax.set_xticks(list(range(1, 13)))
plt.show()

The predicted review counts after April 2019 show only a slight upward trend, which does not quite match the earlier expectation.
