Getting Started with Kaggle - TMDB 5000 Movie Recommendation Data Analysis

Published: 2023/12/31 52 豆豆

This article analyzes the TMDB 5000 Movie Dataset from Kaggle.

The dataset can be downloaded from https://www.kaggle.com/tmdb/tmdb-movie-metadata

![](https://img-blog.csdn.net/2018071616175174?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

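This article walks through the analysis in the following steps:

1. Pose the questions and state the goal of the analysis;

2. Clean the data;

3. Build models for the questions;

4. Visualize the data;

5. Analyze the results and write up a data analysis report.

1. Pose the questions and state the goal of the analysis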

Start by looking at the data. The tmdb_5000_credits file has columns for the movie id, title, cast, and crew.

![](https://img-blog.csdn.net/20180721153512569?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

tmdb_5000_movies has many columns. The useful ones are id, title, tagline, runtime, rating, budget, genres, keywords, production companies, release date, and revenue.

As someone who has watched a great many films, I pose the following questions from the point of view of movie recommendation:

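  • Recommend by genre. Everyone has their own tastes, and films are no exception: find the 20 highest-rated films in each genre and show each film's tagline and overview.

  • Recommend by production country. Maybe on a whim you want to watch an American blockbuster, or a Disney animation would be nice, or a gentle Japanese art-house film is not a bad idea either.

  • Recommend by popularity: sort by the popularity value from high to low.

  • Recommend by rating: the score should be high and the number of votes above some threshold.

  • Recommend by viewer mood. For someone feeling depressed, recommend niche art-house films of the uplifting kind, which start from everyday life, end at the soul, and find meaning in the ordinary; for someone bored, recommend comedies, and science and exploration films are a good choice too; for someone happy, recommend mind-bending drama films that make you forget you were happy [smiley].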

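2. Data cleaning

Data cleaning has three main steps: 1. data preprocessing; 2. feature extraction; 3. feature selection [1].

2.1 Data preprocessing

Data preprocessing includes finding and filling missing values, converting data types, removing outliers, and so on.

First merge the two tables, drop the duplicated movie_id column, and drop the columns this analysis does not need.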

![](https://img-blog.csdn.net/20180723171712467?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)
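The merge step can be sketched as follows. This is only an illustration, not the code in the screenshot: the two small DataFrames are made-up stand-ins for the CSV files, and the choice of `status` as an "unused" column is an assumption.

```python
import pandas as pd

# Made-up stand-ins for tmdb_5000_movies.csv and tmdb_5000_credits.csv;
# in practice both would be loaded with pd.read_csv(...).
movies = pd.DataFrame({
    "id": [1, 2],
    "title": ["Film A", "Film B"],
    "status": ["Released", "Released"],
})
credits = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["Film A", "Film B"],
    "cast": ["[]", "[]"],
    "crew": ["[]", "[]"],
})

# Join on the film id. The key is called id in one table and movie_id
# in the other; title is dropped from credits to avoid a duplicate column.
merge_df = movies.merge(
    credits.drop(columns=["title"]),
    left_on="id",
    right_on="movie_id",
    how="inner",
)

# movie_id now repeats id, so drop it together with columns that the
# analysis does not use (status is an assumed example).
merge_df = merge_df.drop(columns=["movie_id", "status"])

print(merge_df.columns.tolist())
```

An inner join keeps only films present in both files, which seems a reasonable default here; `how="left"` would be the alternative if films without credits should be kept.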

Inspect the data info to see which values are missing.

![](https://img-blog.csdn.net/2018072317182088?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

The release_date column is missing 1 value and the runtime column is missing 2. Use indexing to find out which films these are, look up the correct values online, and fill them in. The homepage, overview, and tagline columns are filled with the string null. The release_date column must then be converted to a date type so that the year can be extracted.

Find the row where release_date is missing and fill in the value found online; do the same for the runtime column.

![](https://img-blog.csdn.net/20180723174600561?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

Fill homepage, overview, and tagline with the string null.

![](https://img-blog.csdn.net/20180723180534987?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)
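A minimal sketch of these preprocessing steps, assuming a merged DataFrame named `merge_df`. The rows and the filled-in values below are placeholders invented for the example, not the real films or the values the author looked up.

```python
import pandas as pd

# Made-up rows; the None entries stand in for the gaps that info() reveals.
merge_df = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "release_date": ["2009-12-10", None, "2015-06-17"],
    "runtime": [162.0, 98.0, None],
    "homepage": [None, "http://example.com", None],
    "tagline": ["A tagline", None, None],
})

# Count missing values per column (merge_df.info() gives a similar view).
print(merge_df.isnull().sum())

# Locate the films with gaps so the real values can be looked up.
print(merge_df.loc[merge_df["release_date"].isnull(), "title"])

# Fill in the looked-up values (placeholders here).
merge_df.loc[merge_df["release_date"].isnull(), "release_date"] = "2014-06-01"
merge_df.loc[merge_df["runtime"].isnull(), "runtime"] = 94.0

# Free-text columns are filled with the literal string 'null'.
text_cols = ["homepage", "tagline"]
merge_df[text_cols] = merge_df[text_cols].fillna("null")

# Convert release_date to a datetime and derive the year.
merge_df["release_date"] = pd.to_datetime(merge_df["release_date"])
merge_df["year"] = merge_df["release_date"].dt.year
```

Filling text columns with the string `null` keeps later string operations from failing on missing values, at the cost of having to treat `null` as a sentinel when displaying taglines.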

2.2 Feature extraction

For each question, select suitable features to study and build a DataFrame for visualization.

In the credits data, cast and crew are stored as JSON, so the actors and the director have to be read out and shown as strings. In the movies data, genres, keywords, and production_companies are also JSON and need to be converted to strings. Use json.loads to turn each JSON string into a list of dicts of the form "[{},{},{}]", then iterate over the dicts, take the value stored under the key 'name', and join these values with ",". [1]

![](https://img-blog.csdn.net/20180724160642160?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)
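The JSON-to-string conversion described above can be sketched with the standard library alone. The helper name `names_from_json` is hypothetical and the sample string is a hand-written example in the shape of a genres cell.

```python
import json


def names_from_json(raw):
    """Turn a JSON list of objects into a comma-separated string of names."""
    items = json.loads(raw)  # JSON string -> list of dicts
    return ",".join(item["name"] for item in items)


# Hand-written sample in the shape of a genres cell.
sample = '[{"id": 18, "name": "Drama"}, {"id": 80, "name": "Crime"}]'
print(names_from_json(sample))  # Drama,Crime
```

With pandas, the same helper could be applied to a whole column, for example `merge_df["genres"].apply(names_from_json)`, assuming the column holds JSON strings like the sample.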

Next, extract the director and the lead actors.

![](https://img-blog.csdn.net/20180724162728172?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

![](https://img-blog.csdn.net/2018072416282182?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)
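One possible way to pull out the director and the lead actors is sketched below. It assumes that crew entries carry a `job` field and cast entries an `order` field giving the billing position; the function names, the cutoff `n`, and the sample strings are all made up for the example and may differ from the code in the screenshots.

```python
import json


def get_director(crew_raw):
    """Return the first crew member whose job is 'Director', or ''."""
    for member in json.loads(crew_raw):
        if member.get("job") == "Director":
            return member["name"]
    return ""


def get_lead_actors(cast_raw, n=2):
    """Return the first n cast names by billing order, comma-separated."""
    cast = sorted(json.loads(cast_raw), key=lambda m: m.get("order", 0))
    return ",".join(m["name"] for m in cast[:n])


# Placeholder data in the assumed shape of the crew and cast cells.
crew_sample = (
    '[{"job": "Editor", "name": "Person A"},'
    ' {"job": "Director", "name": "Person B"}]'
)
cast_sample = (
    '[{"order": 1, "name": "Actor 1"},'
    ' {"order": 0, "name": "Actor 0"},'
    ' {"order": 2, "name": "Actor 2"}]'
)

print(get_director(crew_sample))
print(get_lead_actors(cast_sample))
```

Sorting by `order` before slicing guards against cast lists that are not stored in billing order; if they always are, the sort is harmless.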

3. Data visualization

3.1 Question 1. Recommendation by genre

Everyone has their own tastes, and films are no exception: find the 20 highest-rated films in each genre and show each film's tagline, overview, and homepage.

First look at which genre has the most films and how the genres change over time. Extract all genres and one-hot encode them: if a film's genres value contains the given genre, encode it as 1, otherwise as 0.

![](https://img-blog.csdn.net/20180724185031644?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

![](https://img-blog.csdn.net/20180724185139909?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

![](https://img-blog.csdn.net/20180724210110975?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

![](https://img-blog.csdn.net/20180724210133915?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

![](https://img-blog.csdn.net/20180724210451879?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

![](https://img-blog.csdn.net/20180724210706187?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)
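The one-hot encoding of genres might look roughly like this. It is a sketch on made-up rows and assumes the genres column already holds comma-separated names and that a `year` column exists; the screenshots above may do it differently.

```python
import pandas as pd

# Made-up rows; genres is assumed to be a comma-separated string.
merge_df = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "year": [2014, 2014, 2015],
    "genres": ["Drama,Crime", "Comedy", "Drama,Comedy"],
})

# Collect every distinct genre label.
genre_set = set()
for entry in merge_df["genres"]:
    genre_set.update(g for g in entry.split(",") if g)
genre_list = sorted(genre_set)

# One column per genre: 1 if the film carries that label, otherwise 0.
# Matching whole labels avoids one genre name matching inside another.
genres_df = pd.DataFrame(index=merge_df.index)
for genre in genre_list:
    genres_df[genre] = merge_df["genres"].apply(
        lambda entry, g=genre: int(g in entry.split(","))
    )

# Films per genre, and films per genre per year.
genre_counts = genres_df.sum().sort_values(ascending=False)
genre_by_year = genres_df.groupby(merge_df["year"]).sum()

print(genre_counts)
print(genre_by_year)
```

pandas also provides `Series.str.get_dummies(sep=",")`, which builds the same 0/1 table in one call; the explicit loop is shown because it mirrors the description in the text.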

```python
import matplotlib.pyplot as plt

# Trend of film genres over time.
# genre_year60 is the per-year genre count table built in the steps above.
fig = plt.figure(figsize=(10, 8))   # set the figure size
ax1 = plt.subplot(1, 1, 1)          # set the position of the plot
plt.plot(genre_year60)              # draw the line chart
# format the figure
plt.title('Film genres over time', fontsize=18)
plt.xlabel('Year', fontsize=18)
plt.ylabel('Count', fontsize=18)
plt.xticks(range(1960, 2017, 10))   # set the x-axis ticks
plt.legend(genre_year60)
# save before show() so the saved file is not blank on some backends
fig.savefig('film genre by year.png', dpi=600)
plt.show()
```

![](https://img-blog.csdn.net/20180724210815139?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

When recommending films by genre, first drop the films with too few votes to reduce subjectivity; the code keeps only films with more than 100 votes.

```python
genres_df['id'] = merge_df['id']
genres_df['title'] = merge_df['title']
genres_df['vote_average'] = merge_df['vote_average']
genres_df['vote_count'] = merge_df['vote_count']
# keep only films with more than 100 votes
genres_df = genres_df[genres_df['vote_count'] > 100]
```

For example, for Drama films, find the records in genres_df whose 'Drama' value is 1, build a new DataFrame with each film's id, title, vote_average, tagline, and overview, sort by vote_average in descending order, and take the first 20 records.

![](https://img-blog.csdn.net/20180726205534678?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

The 20 highest-rated Drama films are shown in the figure below; other genres work the same way. Recommended films in the Drama genre:

![](https://img-blog.csdn.net/20180726205244798?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

### 3.2 Question 2. Recommendation by production country

Maybe on a whim you want to watch an American blockbuster, or a Disney animation would be nice, or a gentle Japanese art-house film is not a bad idea either.

First look at which country has the most films and how each country's output changes over time. Many films have more than one production country, so follow the approach from Question 1: extract all production countries and one-hot encode them, with 1 if a film's value contains the given country and 0 otherwise.

![](https://img-blog.csdn.net/20180727120115178?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

To see which country has the most films, use a pie chart of each country's share of the total.

![](https://img-blog.csdn.net/20180727120138886?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

![](https://img-blog.csdn.net/20180727120337821?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

Each country's film output over time:

![](https://img-blog.csdn.net/20180727122635854?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

![](https://img-blog.csdn.net/20180727132516609?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

Taking the United States as an example, recommend films by production country.

![](https://img-blog.csdn.net/20180727134151281?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

Top 20 US films by rating:

id | title | vote_average | tagline
---|---|---|---
278 | The Shawshank Redemption | 8.5 | Fear can hold you prisoner. Hope can set you f...
238 | The Godfather | 8.4 | An offer you can't refuse.
550 | Fight Club | 8.3 | Mischief. Mayhem. Soap.
240 | The Godfather: Part II | 8.3 | I don't feel I have to wipe everybody out, Tom...
424 | Schindler's List | 8.3 | Whoever saves one life, saves the world entire.
244786 | Whiplash | 8.3 | The road to greatness can take you to the edge.
680 | Pulp Fiction | 8.3 | Just because you are a character doesn't mean ...
510 | One Flew Over the Cuckoo's Nest | 8.2 | If he's crazy, what does that make you?
497 | The Green Mile | 8.2 | Miracles do happen.
769 | GoodFellas | 8.2 | Three Decades of Life in the Mafia.
73 | American History X | 8.2 | Some Legacies Must End.
13 | Forrest Gump | 8.2 | The world will never be the same, once you've ...
311 | Once Upon a Time in America | 8.2 | Crime, passion and lust for power - Sergio Leo...
1891 | The Empire Strikes Back | 8.2 | The Adventure Continues...
539 | Psycho | 8.2 | The master of suspense moves his cameras into ...
155 | The Dark Knight | 8.2 | Why So Serious?
389 | 12 Angry Men | 8.2 | Life is in their hands. Death is on their minds.
27205 | Inception | 8.1 | Your mind is the scene of the crime.
11 | Star Wars | 8.1 | A long time ago in a galaxy far, far away...
77 | Memento | 8.1 | Some memories are best forgotten.

### 3.3 Question 3. Recommendation by popularity

Sort by the popularity value from high to low.

![](https://img-blog.csdn.net/20180726150624599?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

The output is shown below. Recommending by popularity surfaces fairly recent films that attract a lot of attention, such as Minions, Interstellar, Deadpool, and Guardians of the Galaxy.

id | title | year | popularity
---|---|---|---
211672 | Minions | 2015 | 875.581305
157336 | Interstellar | 2014 | 724.247784
293660 | Deadpool | 2016 | 514.569956
118340 | Guardians of the Galaxy | 2014 | 481.098624
76341 | Mad Max: Fury Road | 2015 | 434.278564
135397 | Jurassic World | 2015 | 418.708552
22 | Pirates of the Caribbean: The Curse of the Bla... | 2003 | 271.972889
119450 | Dawn of the Planet of the Apes | 2014 | 243.791743
131631 | The Hunger Games: Mockingjay - Part 1 | 2014 | 206.227151
177572 | Big Hero 6 | 2014 | 203.734590
87101 | Terminator Genisys | 2015 | 202.042635
271110 | Captain America: Civil War | 2016 | 198.372395
244786 | Whiplash | 2014 | 192.528841
155 | The Dark Knight | 2008 | 187.322927
286217 | The Martian | 2015 | 167.932870
27205 | Inception | 2010 | 167.583710
109445 | Frozen | 2013 | 165.125366
209112 | Batman v Superman: Dawn of Justice | 2016 | 155.790452
19995 | Avatar | 2009 | 150.437577
550 | Fight Club | 1999 | 146.757391

### 3.4 Question 4. Recommendation by rating

The score should be high and the number of votes above some threshold: keep the records with more than 100 votes and show the top 20.

![](https://img-blog.csdn.net/20180726145239661?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

The output is shown below. Recommending by rating surfaces classics such as The Shawshank Redemption, The Godfather, and Schindler's List.

id | title | year | vote_average
---|---|---|---
278 | The Shawshank Redemption | 1994 | 8.5
238 | The Godfather | 1972 | 8.4
424 | Schindler's List | 1993 | 8.3
680 | Pulp Fiction | 1994 | 8.3
129 | Spirited Away | 2001 | 8.3
240 | The Godfather: Part II | 1974 | 8.3
244786 | Whiplash | 2014 | 8.3
550 | Fight Club | 1999 | 8.3
510 | One Flew Over the Cuckoo's Nest | 1975 | 8.2
13 | Forrest Gump | 1994 | 8.2
155 | The Dark Knight | 2008 | 8.2
389 | 12 Angry Men | 1957 | 8.2
128 | Princess Mononoke | 1997 | 8.2
497 | The Green Mile | 1999 | 8.2
539 | Psycho | 1960 | 8.2
346 | Seven Samurai | 1954 | 8.2
1891 | The Empire Strikes Back | 1980 | 8.2
73 | American History X | 1998 | 8.2
4935 | Howl's Moving Castle | 2004 | 8.2
769 | GoodFellas | 1990 | 8.2

Compared with the result of Question 2, the US top 20 contains most of the overall top 20 (16 of the 20 films appear in both lists), which is consistent with the United States being a major film-producing country in both quality and quantity.

### 3.5 Question 5. Recommendation by viewer mood

For someone feeling depressed, recommend niche art-house films of the uplifting kind, which start from everyday life, end at the soul, and find meaning in the ordinary; for someone bored, recommend comedies, and science and exploration films are a good choice too; for someone happy, recommend mind-bending drama films that make you forget you were happy [smiley].

Viewer mood and the corresponding film genres:

Viewer mood | Recommended genres
---|---
Happy | Drama, adventure, thriller, horror
Sad | Comedy, science fiction, family, fantasy
Lost (exhausted) | Romance, adventure, family, mystery
Bored | Comedy, science fiction, thriller, crime
Relaxed | Drama, comedy, romance, music
Lonely | Comedy, family, mystery, documentary
Angry | Comedy, adventure, family, crime

Taking sad as an example, recommend the top 10 films across these four genres: comedy, science fiction, family, and fantasy.

![](https://img-blog.csdn.net/20180727141307705?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

![](https://img-blog.csdn.net/20180727141335999?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FpbmxpdTA5MDE=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)

There is one problem I have not yet worked out: how to find the top 20 by rating for every genre at the same time. Each film belongs to several genres, so there are duplicate records. With only a few genres they can be picked out one at a time, but that does not work when there are many. If anyone is reading this, I would welcome discussion and the chance to learn from each other.

[1] Kaggle: TMDB 5000 Movie Dataset movie data analysis. [https://blog.csdn.net/zhuoyue65/article/details/80285875](https://blog.csdn.net/zhuoyue65/article/details/80285875)
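On the open question at the end of the post, one possible approach is to give every (film, genre) pair its own row and then take the top rows within each genre group. The sketch below uses made-up rows and a small `TOP_N`, and assumes the genres column holds comma-separated names; the `mood_genres` mapping copies one row of the mood table, with capitalization chosen for the example.

```python
import pandas as pd

# Made-up rows standing in for the merged table.
merge_df = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "genres": ["Drama,Crime", "Drama", "Comedy,Drama", "Comedy"],
    "vote_average": [8.5, 7.0, 7.9, 6.5],
    "vote_count": [900, 500, 50, 300],
})

TOP_N = 2  # the post uses 20

# Keep well-voted films, then give each (film, genre) pair its own row.
long_df = (
    merge_df[merge_df["vote_count"] > 100]
    .assign(genre=lambda d: d["genres"].str.split(","))
    .explode("genre")
)

# Sort once by rating, then take the first TOP_N rows of every genre.
top_per_genre = (
    long_df.sort_values("vote_average", ascending=False)
    .groupby("genre")
    .head(TOP_N)
    .sort_values(["genre", "vote_average"], ascending=[True, False])
)
print(top_per_genre[["genre", "title", "vote_average"]])

# Mood lookup: pick the rows whose genre is mapped to the mood.
mood_genres = {"sad": ["Comedy", "Science Fiction", "Family", "Fantasy"]}
sad_picks = top_per_genre[top_per_genre["genre"].isin(mood_genres["sad"])]
print(sad_picks[["genre", "title"]])
```

A film with several genres appears once per genre in `long_df`, so the repetition is no longer a problem: each copy is ranked only against the other films in its own genre group.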
