

Kaggle TMDB 5000: movie data analysis and a movie recommendation model

Published: 2023/12/31

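The data comes from the TMDB 5000 movie dataset on Kaggle. This analysis covers visualizing the movie data and building a simple movie recommendation model, including:
1. The distribution of movie genres and how it changes over time
2. The relationships among profit, rating, and popularity
3. Which directors' films sell well or are well received
4. The most prolific cast members
5. Movie keyword analysis
6. Similarity-based movie recommendation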

Data analysis

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import json
import warnings
warnings.filterwarnings('ignore')  # suppress warnings
```

```python
movie = pd.read_csv('tmdb_5000_movies.csv')
credit = pd.read_csv('tmdb_5000_credits.csv')
```

```python
movie.head(1)
```

| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam… | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":… | en | Avatar | In the 22nd century, a paraplegic Marine is di… | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289… | [{"iso_3166_1": "US", "name": "United States o… | 2009-12-10 | 2787965087 | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso… | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 |

```python
movie.tail(3)
```

| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4800 | 0 | [{"id": 35, "name": "Comedy"}, {"id": 18, "nam… | http://www.hallmarkchannel.com/signedsealeddel… | 231617 | [{"id": 248, "name": "date"}, {"id": 699, "nam… | en | Signed, Sealed, Delivered | "Signed, Sealed, Delivered" introduces a dedic… | 1.444476 | [{"name": "Front Street Pictures", "id": 3958}… | [{"iso_3166_1": "US", "name": "United States o… | 2013-10-13 | 0 | 120.0 | [{"iso_639_1": "en", "name": "English"}] | Released | NaN | Signed, Sealed, Delivered | 7.0 | 6 |
| 4801 | 0 | [] | http://shanghaicalling.com/ | 126186 | [] | en | Shanghai Calling | When ambitious New York attorney Sam is sent t… | 0.857008 | [] | [{"iso_3166_1": "US", "name": "United States o… | 2012-05-03 | 0 | 98.0 | [{"iso_639_1": "en", "name": "English"}] | Released | A New Yorker in Shanghai | Shanghai Calling | 5.7 | 7 |
| 4802 | 0 | [{"id": 99, "name": "Documentary"}] | NaN | 25975 | [{"id": 1523, "name": "obsession"}, {"id": 224… | en | My Date with Drew | Ever since the second grade when he first saw … | 1.929883 | [{"name": "rusty bear entertainment", "id": 87… | [{"iso_3166_1": "US", "name": "United States o… | 2005-08-05 | 0 | 90.0 | [{"iso_639_1": "en", "name": "English"}] | Released | NaN | My Date with Drew | 6.3 | 16 |

```python
movie.info()  # 4803 samples; some features have missing values
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title                   4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64
dtypes: float64(3), int64(4), object(13)
memory usage: 750.5+ KB
```

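The dataset has 4,803 samples, and some features have missing values. homepage and tagline are missing in many rows, but neither matters for the basic analysis; release_date and runtime can be filled in. On closer inspection, some samples have `[]` as the value of genres, keywords, or production_companies, which needs attention.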
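As a minimal sketch of how such JSON-encoded columns can be inspected, the snippet below parses one column and counts the rows whose list is empty. The helper name `names_from_json` and the two-row sample frame are assumptions made for illustration; the real values come from `tmdb_5000_movies.csv`.

```python
import json

import pandas as pd


def names_from_json(cell):
    """Return the 'name' values from a JSON-encoded list of dicts (hypothetical helper)."""
    return [item['name'] for item in json.loads(cell)]


# Hypothetical two-row sample that mimics the layout of the genres column.
sample = pd.DataFrame({
    'genres': [
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
        '[]',
    ],
})

parsed = sample['genres'].apply(names_from_json)
empty_rows = int((parsed.apply(len) == 0).sum())  # rows whose list is []
print(parsed.tolist(), empty_rows)
```

Counting the empty lists up front gives a sense of how many rows would later show up as an empty genre or keyword entry.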

```python
credit.info()
```

## Data cleaning

Many features are stored as JSON, that is, as key-value structures similar to dictionaries. To simplify later processing, we convert them to str or list forms that are easy to handle in Python and to extract useful information from.

```python
# genres: parse the JSON string so movies can be grouped by genre later
movie['genres'] = movie['genres'].apply(json.loads)  # apply runs a function on every element of the column
```

```python
movie['genres'].head(1)
```

```
0    [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
Name: genres, dtype: object
```

```python
list(zip(movie.index, movie['genres']))[:2]
```

```
[(0,
  [{'id': 28, 'name': 'Action'},
   {'id': 12, 'name': 'Adventure'},
   {'id': 14, 'name': 'Fantasy'},
   {'id': 878, 'name': 'Science Fiction'}]),
 (1,
  [{'id': 12, 'name': 'Adventure'},
   {'id': 14, 'name': 'Fantasy'},
   {'id': 28, 'name': 'Action'}])]
```

```python
for index, i in zip(movie.index, movie['genres']):
    list1 = []
    for j in range(len(i)):
        list1.append(i[j]['name'])  # the genre name, e.g. 'Action'
    movie.loc[index, 'genres'] = str(list1)
```

```python
movie.head(1)  # genres is no longer JSON: the value of each 'name' key (the genre) has been extracted and written back
```

| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | ['Action', 'Adventure', 'Fantasy', 'Science Fi… | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":… | en | Avatar | In the 22nd century, a paraplegic Marine is di… | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289… | [{"iso_3166_1": "US", "name": "United States o… | 2009-12-10 | 2787965087 | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso… | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 |

```python
# apply the same method to the keywords column
movie['keywords'] = movie['keywords'].apply(json.loads)
for index, i in zip(movie.index, movie['keywords']):
    list2 = []
    for j in range(len(i)):
        list2.append(i[j]['name'])
    movie.loc[index, 'keywords'] = str(list2)
```

```python
# likewise for production_companies
movie['production_companies'] = movie['production_companies'].apply(json.loads)
for index, i in zip(movie.index, movie['production_companies']):
    list3 = []
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index, 'production_companies'] = str(list3)
```

```python
movie['production_countries'] = movie['production_countries'].apply(json.loads)
for index, i in zip(movie.index, movie['production_countries']):
    list3 = []
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index, 'production_countries'] = str(list3)
```

```python
movie['spoken_languages'] = movie['spoken_languages'].apply(json.loads)
for index, i in zip(movie.index, movie['spoken_languages']):
    list3 = []
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index, 'spoken_languages'] = str(list3)
```

```python
movie.head(1)
```

| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | ['Action', 'Adventure', 'Fantasy', 'Science Fi… | http://www.avatarmovie.com/ | 19995 | ['culture clash', 'future', 'space war', 'spac… | en | Avatar | In the 22nd century, a paraplegic Marine is di… | 150.437577 | ['Ingenious Film Partners', 'Twentieth Century… | ['United States of America', 'United Kingdom'] | 2009-12-10 | 2787965087 | 162.0 | ['English', 'Español'] | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 |

```python
credit.head(1)
```

| | movie_id | title | cast | crew |
|---|---|---|---|---|
| 0 | 19995 | Avatar | [{"cast_id": 242, "character": "Jake Sully", "… | [{"credit_id": "52fe48009251416c750aca23", "de… |

```python
credit['cast'] = credit['cast'].apply(json.loads)
for index, i in zip(credit.index, credit['cast']):
    list3 = []
    for j in range(len(i)):
        list3.append(i[j]['name'])
    credit.loc[index, 'cast'] = str(list3)
```

```python
credit['crew'] = credit['crew'].apply(json.loads)

# extract the director from crew and keep it as a director column for later analysis
def director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']

credit['crew'] = credit['crew'].apply(director)
credit.rename(columns={'crew': 'director'}, inplace=True)
```

```python
credit.head(1)
```

| | movie_id | title | cast | director |
|---|---|---|---|---|
| 0 | 19995 | Avatar | ['Sam Worthington', 'Zoe Saldana', 'Sigourney … | James Cameron |

The id column in movie matches movie_id in credit, so the two tables can be merged to keep all the information in one table.

```python
fulldf = pd.merge(movie, credit, left_on='id', right_on='movie_id', how='left')
```

```python
fulldf.head(1)
```

| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | … | spoken_languages | status | tagline | title_x | vote_average | vote_count | movie_id | title_y | cast | director |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | ['Action', 'Adventure', 'Fantasy', 'Science Fi… | http://www.avatarmovie.com/ | 19995 | ['culture clash', 'future', 'space war', 'spac… | en | Avatar | In the 22nd century, a paraplegic Marine is di… | 150.437577 | ['Ingenious Film Partners', 'Twentieth Century… | … | ['English', 'Español'] | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 | 19995 | Avatar | ['Sam Worthington', 'Zoe Saldana', 'Sigourney … | James Cameron |

1 rows × 24 columns

```python
fulldf.shape
```

```
(4803, 24)
```

```python
# both tables have a title column; the merge names them title_x and title_y
fulldf.rename(columns={'title_x': 'title'}, inplace=True)
fulldf.drop('title_y', axis=1, inplace=True)
```

```python
# missing values
NAs = pd.DataFrame(fulldf.isnull().sum())
NAs[NAs.sum(axis=1) > 0].sort_values(by=[0], ascending=False)
```

| | 0 |
|---|---|
| homepage | 3091 |
| tagline | 844 |
| director | 30 |
| overview | 3 |
| runtime | 2 |
| release_date | 1 |

```python
# find the movie with a missing release_date
fulldf.loc[fulldf['release_date'].isnull(), 'title']
```

```
4553    America Is Still the Place
Name: title, dtype: object
```

```python
# fill in the date found by searching online
fulldf['release_date'] = fulldf['release_date'].fillna('2014-06-01')
```

```python
# runtime is the movie length; fill missing values with the mean
fulldf['runtime'] = fulldf['runtime'].fillna(fulldf['runtime'].mean())
```

```python
# for convenience, convert release_date (object) to datetime and extract year and month
fulldf['release_year'] = pd.to_datetime(fulldf['release_date'], format='%Y-%m-%d').dt.year
fulldf['release_month'] = pd.to_datetime(fulldf['release_date'], format='%Y-%m-%d').dt.month
```

## Data exploration

```python
# genres is a string such as "['Action', 'Adventure']", so it needs some str processing:
# strip the square brackets, remove the spaces between entries,
# then remove the single quotes so the value can be split on ','
fulldf['genres'] = fulldf['genres'].str.strip('[]').str.replace(" ", "").str.replace("'", "")
```

```python
# the genres are now separated by ','
fulldf['genres'] = fulldf['genres'].str.split(',')
```

```python
list1 = []
for i in fulldf['genres']:
    list1.extend(i)
gen_list = pd.Series(list1).value_counts()[:10].sort_values(ascending=False)
# name the count column 'Total' explicitly, so this works whatever pandas calls the counts
gen_df = gen_list.rename('Total').to_frame()
```

```python
fulldf.loc[4801]  # .ix in the original; .ix has been removed from pandas
```

```
budget                                                                  0
genres                                                                 []
homepage                                      http://shanghaicalling.com/
id                                                                 126186
keywords                                                               []
original_language                                                      en
original_title                                           Shanghai Calling
overview                When ambitious New York attorney Sam is sent t...
popularity                                                       0.857008
production_companies                                                   []
production_countries                ['United States of America', 'China']
release_date                                                   2012-05-03
revenue                                                                 0
runtime                                                                98
spoken_languages                                              ['English']
status                                                           Released
tagline                                          A New Yorker in Shanghai
title                                                    Shanghai Calling
vote_average                                                          5.7
vote_count                                                              7
movie_id                                                           126186
cast                    ['Daniel Henney', 'Eliza Coupe', 'Bill Paxton'...
director                                                      Daniel Hsia
release_year                                                         2012
release_month                                                           5
Name: 4801, dtype: object
```

```python
plt.subplots(figsize=(10, 8))
sns.barplot(y=gen_df.index, x='Total', data=gen_df, palette='GnBu_d')
plt.xticks(fontsize=15)  # tick label font size
plt.yticks(fontsize=15)
plt.xlabel('Total', fontsize=15)
plt.ylabel('Genres', fontsize=15)
plt.title('Top 10 Genres', fontsize=20)
plt.show()
```

![png](https://img-blog.csdn.net/20180523132516220?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

The ten most common genres include Drama, Comedy, Thriller, and Action, which are also the genres most often seen in theaters today. What lies behind the large numbers for these genres?
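The genre count above can also be sketched without an explicit loop, assuming the genres column already holds Python lists. The three-row sample frame is hypothetical; `Series.explode` (pandas 0.25 or later) and `value_counts` are standard pandas calls.

```python
import pandas as pd

# Hypothetical sample: each movie carries a list of genres.
sample = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genres': [['Drama', 'Comedy'], ['Drama'], ['Action', 'Drama']],
})

# explode() gives every genre its own row; value_counts() then tallies them
# and sorts the result from most to least frequent.
genre_counts = sample['genres'].explode().value_counts()
top = genre_counts.head(2)
print(top)
```

This is only an alternative formulation of the same count, not the method used in the article.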
Next, let's look at how the number of movies relates to time.

```python
# deduplicate the genres
l = []
for i in list1:
    if i not in l:
        l.append(i)
# l.remove("")  # some movies have an empty genre
len(l)  # l is the deduplicated list of genres
```

```
21
```

```python
year_min = fulldf['release_year'].min()
year_max = fulldf['release_year'].max()
# DataFrame with genres as the index and years as the columns, holding the count of each genre per year
year_genr = pd.DataFrame(index=l, columns=range(year_min, year_max + 1))
year_genr.fillna(value=0, inplace=True)  # initial value 0
intil_y = np.array(fulldf['release_year'])  # release year of each movie, in row order
z = 0
for i in fulldf['genres']:
    splt_gen = list(i)  # all genres of one movie
    for j in splt_gen:
        # count this genre for the movie's release year
        year_genr.loc[j, intil_y[z]] = year_genr.loc[j, intil_y[z]] + 1
    z += 1
year_genr = year_genr.sort_values(by=2006, ascending=False)
year_genr = year_genr.iloc[0:10, -49:-1]
year_genr
```

| | 1969 | 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1976 | 1977 | 1978 | … | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Drama | 7 | 8 | 3 | 6 | 4 | 3 | 2 | 4 | 8 | 5 | … | 97 | 106 | 122 | 115 | 99 | 79 | 110 | 110 | 95 | 37 |
| Comedy | 3 | 4 | 1 | 3 | 3 | 3 | 3 | 3 | 3 | 1 | … | 67 | 82 | 97 | 87 | 82 | 80 | 71 | 62 | 52 | 26 |
| Thriller | 3 | 2 | 3 | 1 | 2 | 1 | 1 | 2 | 4 | 5 | … | 53 | 55 | 59 | 56 | 69 | 58 | 53 | 66 | 67 | 27 |
| Action | 4 | 4 | 4 | 1 | 2 | 1 | 1 | 2 | 6 | 5 | … | 44 | 46 | 51 | 49 | 58 | 43 | 56 | 54 | 46 | 39 |
| Romance | 2 | 1 | 3 | 0 | 1 | 2 | 1 | 2 | 2 | 3 | … | 37 | 38 | 57 | 45 | 30 | 39 | 25 | 24 | 23 | 9 |
| Family | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | … | 20 | 29 | 28 | 29 | 28 | 17 | 22 | 23 | 17 | 9 |
| Crime | 3 | 0 | 2 | 3 | 2 | 2 | 0 | 2 | 0 | 0 | … | 28 | 33 | 32 | 30 | 24 | 27 | 37 | 27 | 26 | 10 |
| Adventure | 2 | 3 | 1 | 2 | 1 | 2 | 2 | 2 | 5 | 4 | … | 25 | 37 | 36 | 30 | 32 | 25 | 36 | 37 | 35 | 23 |
| Fantasy | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 2 | 2 | … | 19 | 20 | 22 | 21 | 15 | 19 | 21 | 16 | 10 | 13 |
| Horror | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 4 | … | 27 | 21 | 30 | 27 | 24 | 33 | 25 | 21 | 33 | 20 |

10 rows × 48 columns

```python
plt.subplots(figsize=(10, 8))
plt.plot(year_genr.T)
plt.title('Genres vs Time', fontsize=20)
plt.xticks(range(1969, 2020, 5))
plt.legend(year_genr.T)
plt.show()
```

![png](https://img-blog.csdn.net/20180523132553536?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

As the plot shows, the film industry entered a boom period around 1994. Every genre grew substantially, and Drama, Comedy, Thriller, and Action grew the most. The large numbers for these genres are therefore related to the overall growth of the film industry.

```python
# for convenience, build a new dataframe with a subset of features and analyze how they relate to genre
partdf = fulldf[['title', 'vote_average', 'vote_count', 'release_year', 'popularity', 'budget', 'revenue']].reset_index(drop=True)
```

```python
partdf.head(2)
```

| | title | vote_average | vote_count | release_year | popularity | budget | revenue |
|---|---|---|---|---|---|---|---|
| 0 | Avatar | 7.2 | 11800 | 2009 | 150.437577 | 237000000 | 2787965087 |
| 1 | Pirates of the Caribbean: At World's End | 6.9 | 4500 | 2007 | 139.082615 | 300000000 | 961000000 |

Because a movie can belong to several genres, we add each genre as a column: for every movie, the column is 1 if the movie has that genre and 0 otherwise.

```python
for per in l:
    partdf[per] = 0
    z = 0
    for gen in fulldf['genres']:
        if per in list(gen):
            partdf.loc[z, per] = 1
        else:
            partdf.loc[z, per] = 0
        z += 1
partdf.head(2)
```

| | title | vote_average | vote_count | release_year | popularity | budget | revenue | Action | Adventure | Fantasy | … | Romance | Horror | Mystery | History | War | Music | Documentary | Foreign | TVMovie | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Avatar | 7.2 | 11800 | 2009 | 150.437577 | 237000000 | 2787965087 | 1 | 1 | 1 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Pirates of the Caribbean: At World's End | 6.9 | 4500 | 2007 | 139.082615 | 300000000 | 961000000 | 1 | 1 | 1 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

2 rows × 28 columns

Now we want the average of several features for each genre. We create a new dataframe whose index is the genre and whose columns are the averaged features, such as the rating (vote_average), revenue, and popularity.

```python
mean_gen = pd.DataFrame(l)
```

```python
# mean rating per genre
newArray = []
for genre in l:
    newArray.append(partdf.groupby(genre, as_index=True)['vote_average'].mean())
# each entry of newArray holds the mean for flag 0 and the mean for flag 1; we only need the value for 1
newArray2 = []
for i in range(len(l)):
    newArray2.append(newArray[i][1])
mean_gen['mean_votes_average'] = newArray2
mean_gen.head(2)
```

| | 0 | mean_votes_average |
|---|---|---|
| 0 | Action | 5.989515 |
| 1 | Adventure | 6.156962 |

```python
# apply the same idea to the other features
# budget
newArray = []
for genre in l:
    newArray.append(partdf.groupby(genre, as_index=True)['budget'].mean())
newArray2 = []
for i in range(len(l)):
    newArray2.append(newArray[i][1])
mean_gen['mean_budget'] = newArray2
```

```python
# revenue
newArray = []
for genre in l:
    newArray.append(partdf.groupby(genre, as_index=True)['revenue'].mean())
newArray2 = []
for i in range(len(l)):
    newArray2.append(newArray[i][1])
mean_gen['mean_revenue'] = newArray2
```

```python
# popularity: number of views of the related pages
newArray = []
for genre in l:
    newArray.append(partdf.groupby(genre, as_index=True)['popularity'].mean())
newArray2 = []
for i in range(len(l)):
    newArray2.append(newArray[i][1])
mean_gen['mean_popular'] = newArray2
```

```python
# vote_count: use count() here, i.e. the number of movies in each genre
newArray = []
for genre in l:
    newArray.append(partdf.groupby(genre, as_index=True)['vote_count'].count())
newArray2 = []
for i in range(len(l)):
    newArray2.append(newArray[i][1])
mean_gen['vote_count'] = newArray2
```

```python
mean_gen.rename(columns={0: 'genre'}, inplace=True)
mean_gen.replace('', 'none', inplace=True)
# 'none' marks movies whose genre (or another feature) is missing; there are very few, so we drop that row
mean_gen.drop(20, inplace=True)
```
```python
mean_gen['vote_count'].describe()
```

```
count      20.000000
mean      608.000000
std       606.931974
min         8.000000
25%       174.750000
50%       468.500000
75%       816.000000
max      2297.000000
Name: vote_count, dtype: float64
```

```python
mean_gen['mean_votes_average'].describe()
```

```
count    20.000000
mean      6.173921
std       0.278476
min       5.626590
25%       6.009644
50%       6.180978
75%       6.344325
max       6.719797
Name: mean_votes_average, dtype: float64
```

```python
# The original used sns.factorplot(..., ax=...); factorplot is no longer available in
# newer seaborn releases, so the axes-level pointplot (factorplot's default kind) is used here.
f, ax1 = plt.subplots(figsize=(10, 6))
ax2 = ax1.twinx()
sns.pointplot(x='genre', y='mean_votes_average', data=mean_gen, ax=ax1)
ax1.set_ylabel('votes_average')
ax1.set_ylim((4, 7))
sns.pointplot(x='genre', y='mean_popular', data=mean_gen, ax=ax2, color='blue')
ax2.set_ylabel('popularity')
ax2.set_ylim((0, 40))
ax1.set_xticklabels(mean_gen['genre'], rotation=90)
plt.show()
```

![png](https://img-blog.csdn.net/20180523132655277?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

The plot shows that foreign films are not popular; their ratings are not low, but that is because very few people rated them. Animation, Science Fiction, Fantasy, and Action are fairly popular and also rate reasonably well. Drama, the most numerous genre, rates highly but is less popular, perhaps because most dramas are not commercial films.

```python
mean_gen['profit'] = mean_gen['mean_revenue'] - mean_gen['mean_budget']
```

```python
s = mean_gen['profit'].sort_values(ascending=False)[:10]
pdf = mean_gen.loc[s.index]  # .ix in the original; .ix has been removed from pandas
plt.subplots(figsize=(10, 6))
sns.barplot(x='profit', y='genre', data=pdf, palette='BuGn_r')
plt.xticks(fontsize=15)  # tick label font size
plt.yticks(fontsize=15)
plt.xlabel('Profit', fontsize=15)
plt.ylabel('Genres', fontsize=15)
plt.title('Top 10 Profit of Genres', fontsize=20)
plt.show()
```

![png](https://img-blog.csdn.net/20180523132747502?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

Animation, Adventure, Family, and Science Fiction are the most profitable genres. They suit the big screen and are also among the popular genres. Next, let's look at the relationships between the variables.

```python
cordf = partdf.drop(l, axis=1)
cordf.columns  # contains the features we want to study
```

```
Index(['title', 'vote_average', 'vote_count', 'release_year', 'popularity',
       'budget', 'revenue'],
      dtype='object')
```

```python
# title is a string column, so it is left out of the correlation matrix
corrmat = cordf.drop('title', axis=1).corr()
f, ax = plt.subplots(figsize=(10, 7))
sns.heatmap(corrmat, cbar=True, annot=True, vmax=.8, cmap='PuBu', square=True)
```

![png](https://img-blog.csdn.net/20180523132843861?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

The heatmap shows a fairly strong relationship between vote count and popularity, which suggests that the more people watch a film, the more of them engage with it. Budget and revenue are also strongly related, and revenue is fairly strongly related to popularity and vote count, so promoting a film well matters. Let's look more closely.

```python
# budget and revenue contain zeros; remove these dirty rows
partdf = partdf[partdf['budget'] > 0]
partdf = partdf[partdf['revenue'] > 0]
partdf = partdf[partdf['vote_count'] > 3]
plt.subplots(figsize=(6, 5))
plt.xlabel('Budget', fontsize=15)
plt.ylabel('Revenue', fontsize=15)
plt.title('Budget vs Revenue', fontsize=20)
sns.regplot(x='budget', y='revenue', data=partdf, ci=None)
```

![png](https://img-blog.csdn.net/20180523132916443?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

```python
plt.subplots(figsize=(6, 5))
plt.xlabel('vote_average', fontsize=15)
plt.ylabel('popularity', fontsize=15)
plt.title('Score vs Popular', fontsize=20)
sns.regplot(x='vote_average', y='popularity', data=partdf)
```

![png](https://img-blog.csdn.net/20180523132946451?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

Budget and revenue, and rating and popularity, are roughly linearly related. For low-budget films, however, the budget has little effect on revenue, and highly rated films are mostly popular as well. Now let's see which films earned the most, were the most popular, and were rated best.

```python
print(partdf.loc[partdf['revenue'] == partdf['revenue'].max()]['title'])
print(partdf.loc[partdf['popularity'] == partdf['popularity'].max()]['title'])
print(partdf.loc[partdf['vote_average'] == partdf['vote_average'].max()]['title'])
```

```
0    Avatar
Name: title, dtype: object
546    Minions
Name: title, dtype: object
1881    The Shawshank Redemption
Name: title, dtype: object
```

```python
partdf['profit'] = partdf['revenue'] - partdf['budget']
print(partdf.loc[partdf['profit'] == partdf['profit'].max()]['title'])
```

```
0    Avatar
Name: title, dtype: object
```

Minions is the most popular, Avatar the most profitable, and The Shawshank Redemption the best rated.

```python
s1 = cordf.groupby(by='release_year').budget.sum()
s2 = cordf.groupby(by='release_year').revenue.sum()
sdf = pd.concat([s1, s2], axis=1)
sdf = sdf.iloc[-39:-2]
plt.plot(sdf)
plt.xticks(range(1979, 2020, 5))
plt.legend(sdf)
plt.show()
```

![png](https://img-blog.csdn.net/20180523133047409?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

The film industry is indeed booming, and there is evidently a reason why big-budget productions are more and more common. For science fiction fans, here are the most popular science fiction films:

```python
# most popular science fiction movies
s = partdf.loc[partdf['ScienceFiction'] == 1, 'popularity'].sort_values(ascending=False)[:10]
sdf = partdf.loc[s.index]  # .ix in the original
sns.barplot(x='popularity', y='title', data=sdf)
plt.show()
```

![png](https://img-blog.csdn.net/20180523133107982?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

Interstellar is the most popular, closely followed by Guardians of the Galaxy. The same approach works for the other genres.

Now let's look at how filmmakers influence the market. A good film depends on the work of everyone in front of and behind the camera. Here we focus on directors and actors.

```python
# directors with the highest average revenue
rev_d = fulldf.groupby('director')['revenue'].mean()
top_rev_d = rev_d.sort_values(ascending=False).head(20)
top_rev_d = pd.DataFrame(top_rev_d)

plt.subplots(figsize=(10, 6))
sns.barplot(x='revenue', y=top_rev_d.index, data=top_rev_d, palette='BuGn_r')
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Average Revenue', fontsize=15)
plt.ylabel('Director', fontsize=15)
plt.title('Top 20 Revenue by Director', fontsize=20)
plt.show()
```

![png](https://img-blog.csdn.net/20180523133135909?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

The chart shows the directors who do well at the box office. Which directors are the most prolific, or are both well reviewed and commercially successful?

```python
list2 = fulldf[fulldf['director'] != ''].director.value_counts()[:10].sort_values(ascending=True)
list2 = pd.Series(list2)
list2
```

```
Oliver Stone         14
Renny Harlin         15
Steven Soderbergh    15
Robert Rodriguez     16
Spike Lee            16
Ridley Scott         16
Martin Scorsese      20
Clint Eastwood       20
Woody Allen          21
Steven Spielberg     27
Name: director, dtype: int64
```

```python
plt.subplots(figsize=(10, 6))
ax = list2.plot.barh(width=0.85, color='y')
for i, v in enumerate(list2.values):
    ax.text(.5, i, v, fontsize=12, color='white', weight='bold')
ax.patches[9].set_facecolor('g')
plt.title('Directors with highest movies')
plt.show()
```

![png](https://img-blog.csdn.net/2018052313320860?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

```python
top_vote_d = fulldf[fulldf['vote_average'] >= 8].sort_values(by='vote_average', ascending=False)
top_vote_d = top_vote_d.dropna()
top_vote_d = top_vote_d.loc[:, ['director', 'vote_average']]
```

```python
tmp = rev_d.sort_values(ascending=False)
vote_rev_d = tmp[tmp.index.isin(list(top_vote_d['director']))]
vote_rev_d = vote_rev_d.sort_values(ascending=False)
vote_rev_d = pd.DataFrame(vote_rev_d)
```

```python
plt.subplots(figsize=(10, 6))
sns.barplot(x='revenue', y=vote_rev_d.index, data=vote_rev_d, palette='BuGn_r')
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.xlabel('Average Revenue', fontsize=15)
plt.ylabel('Director', fontsize=15)
plt.title('Revenue by vote above 8 Director', fontsize=20)
plt.show()
```

![png](https://img-blog.csdn.net/2018052313323079?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

Now for the cast. Each movie's cast feature lists many people. Fortunately, cast is ordered by importance, so the names near the top can be treated as the lead actors.

```python
fulldf['cast'] = fulldf['cast'].str.strip('[]').str.replace(' ', '').str.replace("'", '').str.replace('"', '')
fulldf['cast'] = fulldf['cast'].str.split(',')

list1 = []
for i in fulldf['cast']:
    list1.extend(i)
list1 = pd.Series(list1)
list1 = list1.value_counts()[:15].sort_values(ascending=True)

plt.subplots(figsize=(10, 6))
ax = list1.plot.barh(width=0.9, color='green')
for i, v in enumerate(list1.values):
    ax.text(.8, i, v, fontsize=10, color='white', weight='bold')
plt.title('Actors with highest appearance')
ax.patches[14].set_facecolor('b')
plt.show()
```

![png](https://img-blog.csdn.net/2018052313325049?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

```python
fulldf['keywords'][2]
```

```
"['spy', 'based on novel', 'secret agent', 'sequel', 'mi6', 'british secret service', 'united kingdom']"
```

```python
from wordcloud import WordCloud, STOPWORDS
import nltk
from nltk.corpus import stopwords
# If stopwords is reported as not installed, run `import nltk; nltk.download()` from the Anaconda prompt,
# then choose Corpora > stopwords in the window that opens, refresh, and download.
import io
from PIL import Image
```

```python
plt.subplots(figsize=(12, 12))
stop_words = set(stopwords.words('english'))
stop_words.update(',', ';', '!', '?', '.', '(', ')', '$', '#', '+', ':', '...', ' ', '')
img1 = Image.open('timg1.jpg')  # image used as the word cloud mask
hcmask1 = np.array(img1)
words = fulldf['keywords'].dropna().apply(nltk.word_tokenize)
word = []
for i in words:
    word.extend(i)
word = pd.Series(word)
word = [i for i in word.str.lower() if i not in stop_words]
wc = WordCloud(background_color="black", max_words=4000, mask=hcmask1,
               stopwords=STOPWORDS, max_font_size=60)
wc.generate(" ".join(word))
plt.imshow(wc, interpolation="bilinear")
plt.axis('off')
plt.figure()
plt.show()
```

![png](https://img-blog.csdn.net/20180523133325401?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2lhbV9lbWlseQ==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

This gives a rough picture of the keywords: woman director and independent film account for a large share, which may also reflect a trend in filmmaking.

## Movie recommendation model

Based on the analysis above, we can build a movie recommender. When people search for a movie, they usually look for movies of the same genre, movies by the same director or with the same actors, or highly rated movies. The features we need are therefore genres, cast, director, and score.

```python
l[:5]
```

```
['Action', 'Adventure', 'Fantasy', 'ScienceFiction', 'Crime']
```

### Feature vectorization

#### genre

```python
def binary(genre_list):
    binaryList = []
    for genre in l:
        if genre in genre_list:
            binaryList.append(1)
        else:
            binaryList.append(0)
    return binaryList
```

```python
fulldf['genre_vec'] = fulldf['genres'].apply(lambda x: binary(x))
```

```python
fulldf['genre_vec'][0]
```

```
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

#### cast

```python
# keep the first four cast members of each movie, sorted by name
for i, j in zip(fulldf['cast'], fulldf.index):
    list2 = []
    list2 = i[:4]
    list2.sort()
    fulldf.loc[j, 'cast'] = str(list2)
fulldf['cast'][0]
```

```
"['SamWorthington', 'SigourneyWeaver', 'StephenLang', 'ZoeSaldana']"
```

```python
fulldf['cast'] = fulldf['cast'].str.strip('[]').str.replace(' ', '').str.replace("'", '')
fulldf['cast'] = fulldf['cast'].str.split(',')
fulldf['cast'][0]
```

```
['SamWorthington', 'SigourneyWeaver', 'StephenLang', 'ZoeSaldana']
```

```python
castList = []
for index, row in fulldf.iterrows():
    cast = row["cast"]
    for i in cast:
        if i not in castList:
            castList.append(i)
```

```python
len(castList)
```

```
7515
```

```python
def binary(cast_list):
    binaryList = []
    for genre in castList:
        if genre in cast_list:
            binaryList.append(1)
        else:
            binaryList.append(0)
    return binaryList
```

```python
fulldf['cast_vec'] = fulldf['cast'].apply(lambda x: binary(x))
fulldf['cast_vec'].head(2)
```

```
0    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
1    [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
Name: cast_vec, dtype: object
```

#### director

```python
fulldf['director'][0]
```

```
'James Cameron'
```

```python
def xstr(s):
    if s is None:
        return ''
    return str(s)

fulldf['director'] = fulldf['director'].apply(xstr)
```

```python
directorList = []
for i in fulldf['director']:
    if i not in directorList:
        directorList.append(i)
```

```python
def binary(director_list):
    binaryList = []
    for direct in directorList:
        if direct in director_list:
            binaryList.append(1)
        else:
            binaryList.append(0)
    return binaryList
```

```python
fulldf['director_vec'] = fulldf['director'].apply(lambda x: binary(x))
```

#### keywords

```python
fulldf['keywords'][0]
```

```
"['culture clash', 'future', 'space war', 'space colony', 'society', 'space travel', 'futuristic', 'romance', 'space', 'alien', 'tribe', 'alien planet', 'cgi', 'marine', 'soldier', 'battle', 'love affair', 'anti war', 'power relations', 'mind and soul', '3d']"
```

```python
# change keywords to type list
fulldf['keywords'] = fulldf['keywords'].str.strip('[]').str.replace(' ', '').str.replace("'", '').str.replace('"', '')
fulldf['keywords'] = fulldf['keywords'].str.split(',')
```

```python
for i, j in zip(fulldf['keywords'], fulldf.index):
    list2 = []
    list2 = i
    list2.sort()
    fulldf.loc[j, 'keywords'] = str(list2)
fulldf['keywords'][0]
```

```
"['3d', 'alien', 'alienplanet', 'antiwar', 'battle', 'cgi', 'cultureclash', 'future', 'futuristic', 'loveaffair', 'marine', 'mindandsoul', 'powerrelations', 'romance', 'society', 'soldier', 'space', 'spacecolony', 'spacetravel', 'spacewar', 'tribe']"
```

```python
fulldf['keywords'] = fulldf['keywords'].str.strip('[]').str.replace(' ', '').str.replace("'", '').str.replace('"', '')
fulldf['keywords'] = fulldf['keywords'].str.split(',')
```

```python
words_list = []
for index, row in fulldf.iterrows():
    genres = row["keywords"]
    for genre in genres:
        if genre not in words_list:
            words_list.append(genre)
len(words_list)
```

```
9772
```

```python
def binary(words):
    binaryList = []
    for genre in words_list:
        if genre in words:
            binaryList.append(1)
        else:
            binaryList.append(0)
    return binaryList
```

```python
fulldf['words_vec'] = fulldf['keywords'].apply(lambda x: binary(x))
```

#### recommend model

We use the cosine as the similarity measure and compute the similarity between movies from the selected feature vectors. The 10 movies at the smallest distance are returned as recommendations.

```python
# remove movies with a score of 0 and movies without a director name
fulldf = fulldf[(fulldf['vote_average'] != 0)]
fulldf = fulldf[fulldf['director'] != '']
```

```python
from scipy import spatial

def Similarity(movieId1, movieId2):
    a = fulldf.iloc[movieId1]
    b = fulldf.iloc[movieId2]
    genresA = a['genre_vec']
    genresB = b['genre_vec']
    genreDistance = spatial.distance.cosine(genresA, genresB)
    castA = a['cast_vec']
    castB = b['cast_vec']
    castDistance = spatial.distance.cosine(castA, castB)
    directA = a['director_vec']
    directB = b['director_vec']
    directDistance = spatial.distance.cosine(directA, directB)
    wordsA = a['words_vec']
    wordsB = b['words_vec']
    # As run for the outputs shown below, this line passes the director vectors again,
    # so the keyword vectors are not compared; (wordsA, wordsB) was presumably intended.
    wordsDistance = spatial.distance.cosine(directA, directB)
    return genreDistance + directDistance + castDistance + wordsDistance
```

```python
Similarity(3, 160)
```

```
2.7958758547680684
```

```python
columns = ['original_title', 'genres', 'vote_average', 'genre_vec', 'cast_vec', 'director', 'director_vec', 'words_vec']
tmp = fulldf.copy()
tmp = tmp[columns]
tmp['id'] = list(range(0, fulldf.shape[0]))
tmp.head()
```

| | original_title | genres | vote_average | genre_vec | cast_vec | director | director_vec | words_vec | id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Avatar | [Action, Adventure, Fantasy, ScienceFiction] | 7.2 | [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | James Cameron | [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … | 0 |
| 1 | Pirates of the Caribbean: At World's End | [Adventure, Fantasy, Action] | 6.9 | [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, … | Gore Verbinski | [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | 1 |
| 2 | Spectre | [Action, Adventure, Crime] | 6.3 | [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, … | Sam Mendes | [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | 2 |
| 3 | The Dark Knight Rises | [Action, Crime, Drama, Thriller] | 7.6 | [1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, … | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, … | Christopher Nolan | [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | 3 |
| 4 | John Carter | [Action, Adventure, ScienceFiction] | 6.1 | [1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | Andrew Stanton | [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … | 4 |

```python
tmp.isnull().sum()
```

```
original_title    0
genres            0
vote_average      0
genre_vec         0
cast_vec          0
director          0
director_vec      0
words_vec         0
id                0
dtype: int64
```

```python
import operator

def recommend(name):
    # first movie whose original title contains `name`
    film = tmp[tmp['original_title'].str.contains(name)].iloc[0].to_frame().T
    print('Selected Movie: ', film.original_title.values[0])

    def getNeighbors(baseMovie):
        distances = []
        for index, movie in tmp.iterrows():
            if movie['id'] != baseMovie['id'].values[0]:
                dist = Similarity(baseMovie['id'].values[0], movie['id'])
                distances.append((movie['id'], dist))
        distances.sort(key=operator.itemgetter(1))  # smallest distance first
        neighbors = []
        for x in range(10):
            neighbors.append(distances[x])
        return neighbors

    neighbors = getNeighbors(film)
    print('\nRecommended Movies: \n')
    for nei in neighbors:
        row = tmp.iloc[nei[0]]
        print(row['original_title'] + " | Genres: " + str(row['genres']).strip('[]').replace(' ', '')
              + " | Rating: " + str(row['vote_average']))
    print('\n')
```

```python
recommend('Godfather')
```

```
Selected Movie:  The Godfather: Part III

Recommended Movies:

The Godfather: Part II | Genres: 'Drama','Crime' | Rating: 8.3
The Godfather | Genres: 'Drama','Crime' | Rating: 8.4
The Rainmaker | Genres: 'Drama','Crime','Thriller' | Rating: 6.7
The Outsiders | Genres: 'Crime','Drama' | Rating: 6.9
The Conversation | Genres: 'Crime','Drama','Mystery' | Rating: 7.5
The Cotton Club | Genres: 'Music','Drama','Crime','Romance' | Rating: 6.6
Apocalypse Now | Genres: 'Drama','War' | Rating: 8.0
Twixt | Genres: 'Horror','Thriller' | Rating: 5.0
New York Stories | Genres: 'Comedy','Drama','Romance' | Rating: 6.2
Peggy Sue Got Married | Genres: 'Comedy','Drama','Fantasy','Romance' | Rating: 5.9
```

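The recommendation model above boils down to two steps: encode each film as 0/1 vectors and rank the other films by cosine distance. The following is only a minimal, dependency-free sketch of that idea, not the notebook's own code; the toy data and the helper names `multi_hot`, `cosine_distance` and `nearest` are assumptions made for illustration.

```python
import math

def multi_hot(items, vocabulary):
    """Return a 0/1 vector marking which vocabulary entries occur in items."""
    present = set(items)
    return [1 if entry in present else 0 for entry in vocabulary]

def cosine_distance(u, v):
    """1 - cosine similarity; defined here as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy data, not taken from the TMDB dataset.
genre_vocab = ["Action", "Adventure", "Crime", "Drama"]
movies = {
    "A": ["Action", "Adventure"],
    "B": ["Action", "Adventure", "Crime"],
    "C": ["Drama"],
}
vectors = {title: multi_hot(genres, genre_vocab) for title, genres in movies.items()}

def nearest(title, k=2):
    """Return the k other titles with the smallest cosine distance to `title`."""
    others = [(other, cosine_distance(vectors[title], vec))
              for other, vec in vectors.items() if other != title]
    return sorted(others, key=lambda pair: pair[1])[:k]

print(nearest("A"))  # "B" shares two genres with "A", "C" shares none
```

With several feature groups (genres, cast, director, keywords), one distance per group can be computed this way and the results summed, which is the approach the `Similarity` function takes.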
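# Explanation of related functions

## JSON handling

JSON is a text-based data-interchange format. Its values are objects (key-value pairs), arrays, strings, numbers, booleans and null, which Python's `json` module maps to `dict`, `list`, `str`, `int`/`float`, `bool` and `None`.

  • `json.loads()` decodes a JSON string into the corresponding Python object (a `dict` for a JSON object);
  • its inverse, `json.dumps()`, encodes a Python object as a JSON string; to store the result in a JSON file, convert with `dumps` first and then write the string;
  • `json.dump()` serialises a Python object such as a `dict` and writes it straight to a JSON file: `json.dump(obj, file)`;
  • `json.load()` reads data back from a JSON file: `json.load(file)`.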
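The four functions can be seen together in one round trip. This is a small sketch under stated assumptions: the `raw` string is a hypothetical sample shaped like the `genres` column shown earlier, and the temporary file path is chosen only for the demonstration.

```python
import json
import os
import tempfile

# Hypothetical sample, shaped like one cell of the `genres` column.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

genres = json.loads(raw)              # JSON string -> list of dicts
names = [g["name"] for g in genres]   # keep only the genre names

encoded = json.dumps(names)           # list -> JSON string

# Write to a file with json.dump and read it back with json.load.
path = os.path.join(tempfile.mkdtemp(), "genres.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(names, f)
with open(path, encoding="utf-8") as f:
    restored = json.load(f)

print(names)
print(encoded)
print(restored == names)
```

`loads`/`dumps` work on strings, while `load`/`dump` work on open file objects; the trailing "s" is the only difference in the names.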
```python
import json

exam = {'a': '1111', 'b': '2222', 'c': '3333', 'd': '4444'}
file = 'exam.json'

# solution 1: serialise to a string, then write the string
jsobj = json.dumps(exam)
with open(file, 'w') as f:
    f.write(jsobj)  # the with block closes the file

# solution 2: let json.dump write to the file directly
with open(file, 'w') as f:
    json.dump(exam, f)
```

## zip()

* `zip()` takes iterables as arguments and packs their corresponding elements into tuples. In Python 3 it returns a lazy iterator over those tuples rather than a list, and it stops at the shortest input.
* Its inverse is `zip(*zipped)`, for example:

```python
a = [1, 2, 3]
b = [4, 5, 6]
c = [4, 5, 6, 7, 8]

zipped = zip(a, b)
for i in zipped:
    print(i)
print('\n')

shor_z = zip(a, c)
for j in shor_z:  # stops at the shortest input
    print(j)
```

(1, 4)
(2, 5)
(3, 6)

(1, 4)
(2, 5)
(3, 6)

```python
z = list(zip(a, b))
z
```

[(1, 4), (2, 5), (3, 6)]

```python
list(zip(*z))  # convert to a list to see the contents
```

[(1, 2, 3), (4, 5, 6)]

## pandas merge/rename

`pd.merge()` joins two DataFrames on key columns.

```python
a = pd.DataFrame({'lkey': ['foo', 'foo', 'bar', 'bar'], 'value': [1, 2, 3, 4]})
a
```

|   | lkey | value |
|---|---|---|
| 0 | foo | 1 |
| 1 | foo | 2 |
| 2 | bar | 3 |
| 3 | bar | 4 |

```python
for index, row in a.iterrows():
    print(index)
    print('*****')
    print(row)
```

0
*****
lkey foo
value 1
Name: 0, dtype: object
1
*****
lkey foo
value 2
Name: 1, dtype: object
2
*****
lkey bar
value 3
Name: 2, dtype: object
3
*****
lkey bar
value 4
Name: 3, dtype: object

```python
b = pd.DataFrame({'rkey': ['foo', 'foo', 'bar', 'bar'], 'value': [5, 6, 7, 8]})
b
```

|   | rkey | value |
|---|---|---|
| 0 | foo | 5 |
| 1 | foo | 6 |
| 2 | bar | 7 |
| 3 | bar | 8 |

```python
pd.merge(a, b, left_on='lkey', right_on='rkey', how='left')
```

|   | lkey | value_x | rkey | value_y |
|---|---|---|---|---|
| 0 | foo | 1 | foo | 5 |
| 1 | foo | 1 | foo | 6 |
| 2 | foo | 2 | foo | 5 |
| 3 | foo | 2 | foo | 6 |
| 4 | bar | 3 | bar | 7 |
| 5 | bar | 3 | bar | 8 |
| 6 | bar | 4 | bar | 7 |
| 7 | bar | 4 | bar | 8 |

```python
pd.merge(a, b, left_on='lkey', right_on='rkey', how='inner')
```

|   | lkey | value_x | rkey | value_y |
|---|---|---|---|---|
| 0 | foo | 1 | foo | 5 |
| 1 | foo | 1 | foo | 6 |
| 2 | foo | 2 | foo | 5 |
| 3 | foo | 2 | foo | 6 |
| 4 | bar | 3 | bar | 7 |
| 5 | bar | 3 | bar | 8 |
| 6 | bar | 4 | bar | 7 |
| 7 | bar | 4 | bar | 8 |

`DataFrame.rename()` renames row or column labels. It returns a new DataFrame and leaves the original unchanged unless `inplace=True` is passed.

```python
dframe = pd.DataFrame(np.arange(12).reshape((3, 4)),
                      index=['NY', 'LA', 'SF'],
                      columns=['A', 'B', 'C', 'D'])
dframe
```

|   | A | B | C | D |
|---|---|---|---|---|
| NY | 0 | 1 | 2 | 3 |
| LA | 4 | 5 | 6 | 7 |
| SF | 8 | 9 | 10 | 11 |

```python
dframe.rename(columns={'A': 'alpha'})
```

|   | alpha | B | C | D |
|---|---|---|---|---|
| NY | 0 | 1 | 2 | 3 |
| LA | 4 | 5 | 6 | 7 |
| SF | 8 | 9 | 10 | 11 |

```python
dframe
```

|   | A | B | C | D |
|---|---|---|---|---|
| NY | 0 | 1 | 2 | 3 |
| LA | 4 | 5 | 6 | 7 |
| SF | 8 | 9 | 10 | 11 |

```python
dframe.rename(columns={'A': 'alpha'}, inplace=True)
dframe
```

|   | alpha | B | C | D |
|---|---|---|---|---|
| NY | 0 | 1 | 2 | 3 |
| LA | 4 | 5 | 6 | 7 |
| SF | 8 | 9 | 10 | 11 |

## pandas datetime format

`pd.to_datetime()` converts values to datetime format.

## Wordcloud

The wordcloud word-cloud module:

1. Install: in the conda command prompt, run `conda install -c conda-forge wordcloud`.
2. Steps: read the background image and the text, instantiate a `WordCloud` object `wc`, call `wc.generate(text)` to build the cloud, and show it with `plt.imshow()`.

Parameters:

* `mask`: mask image; words are drawn only inside the shape it defines (word colours follow the image only if the cloud is recoloured, for example with `ImageColorGenerator`)
* `background_color`: background colour, black by default
* `max_font_size`: largest font size
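A word cloud sizes each word by how often it occurs, so the counting step can be sketched with the standard library alone. The keyword lists below are hypothetical, shaped like the per-movie `keywords` column; the resulting word-to-count mapping is the kind of input that `WordCloud.generate_from_frequencies` accepts, though that call is not made here.

```python
from collections import Counter

# Hypothetical keyword lists, one per movie.
keyword_lists = [
    ["space", "alien", "future"],
    ["space", "marine"],
    ["alien", "space"],
]

# Count how many times each keyword appears across all movies.
frequencies = Counter(word for words in keyword_lists for word in words)

# The most frequent keywords would be drawn largest in the cloud.
top = frequencies.most_common(2)
print(top)
```

Counting first also makes it easy to inspect or filter the vocabulary before any image is rendered.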
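## Brief introduction to nltk

`from nltk.corpus import stopwords`

If this import fails because the stopwords corpus is not installed, open Python from the Anaconda prompt and run `import nltk; nltk.download()`. In the window that pops up, open the Corpora tab, select stopwords, then refresh and download. In the same way, select Punkt Tokenizer Model on the Models tab, refresh and download; this installs what `nltk.word_tokenize()` needs.

Tokenizing:

* `nltk.sent_tokenize(text)`: splits a text into sentences
* `nltk.word_tokenize(sent)`: splits a sentence into words

stopwords: as I understand it, these are words that do not affect the meaning, occur in large numbers, and can simply be filtered out.

# References

[ what’s my score ](https://www.kaggle.com/ash316/what-s-my-score)
[ TMDB means per genre ](https://www.kaggle.com/kkooijman/tmdb-means-per-genre)

* * *

_I am a beginner and still learning; corrections are welcome!_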
