TMDB 5000 Movie Dataset
Original text:
Data Source Transfer Summary
We (Kaggle) have removed the original version of this dataset per a DMCA takedown request from IMDB. In order to minimize the impact, we're replacing it with a similar set of films and data fields from The Movie Database (TMDb) in accordance with their terms of use. The bad news is that kernels built on the old dataset will most likely no longer work.
The good news is that:
- You can port your existing kernels over with a bit of editing. This kernel offers functions and examples for doing so. You can also find a general introduction to the new format here.
- The new dataset contains full credits for both the cast and the crew, rather than just the first three actors.
- Actors and actresses are now listed in the order they appear in the credits. It's unclear what ordering the original dataset used; for the movies I spot-checked, it didn't line up with either the credits order or IMDB's stars order.
- The revenues appear to be more current. For example, IMDB's figures for Avatar seem to be from 2010 and understate the film's global revenues by over $2 billion.
- Some of the movies that we weren't able to port over (a couple of hundred) were just bad entries. For example, this IMDB entry has basically no accurate information at all. It lists Star Wars Episode VII as a documentary.
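For kernels that relied on the old three-actor columns, the full cast list can be cut back down to the top-billed names. The following is a minimal sketch, not the porting kernel's actual code: it assumes each cast cell is a JSON array whose objects carry `name` and `order` fields, and the sample cell and the `top_billed` helper are hypothetical.

```python
import json

# Hypothetical cast cell. The "name" and "order" field names are assumptions
# about the credits layout, not something stated in the description above.
cast_cell = (
    '[{"name": "Actor C", "order": 2}, {"name": "Actor A", "order": 0}, '
    '{"name": "Actor D", "order": 3}, {"name": "Actor B", "order": 1}]'
)


def top_billed(cell, count=3):
    """Return the first `count` cast names, sorted by credits order."""
    cast = json.loads(cell)
    ordered = sorted(cast, key=lambda member: member["order"])
    return [member["name"] for member in ordered[:count]]


print(top_billed(cast_cell))
```

Sorting explicitly on the order field, rather than trusting the position in the array, keeps the result stable even if a cell happens to be stored out of order.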
Data Source Transfer Details
- Several of the new columns contain JSON. You can save a bit of time by porting the load data functions from this kernel.
- Even simple fields like runtime may not be consistent across versions. For example, the previous dataset shows the duration of Avatar's extended cut, while TMDb shows the runtime of the original version.
- There's now a separate file containing the full credits for both the cast and crew.
- All fields are filled out by users, so don't expect them to agree on keywords, genres, ratings, or the like.
- Your existing kernels will continue to render normally until they are re-run.
- If you are curious about how this dataset was prepared, the code to access TMDb's API is posted here.
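Because several columns hold JSON text, a loader has to decode those cells after reading the CSV. The sketch below shows one way to do that with only the standard library. The inline sample rows, the choice of `genres` as a JSON column, and the `load_movies` and `names` helpers are all assumptions for illustration; this is not the load function from the kernel mentioned above.

```python
import csv
import io
import json

# Hypothetical two-row sample standing in for the movies file. The column
# names and the JSON layout are assumptions made for this example.
SAMPLE_CSV = '''id,original_title,genres
1,Example One,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""name"": ""Adventure""}]"
2,Example Two,[]
'''

JSON_COLUMNS = ("genres",)


def load_movies(handle, json_columns=JSON_COLUMNS):
    """Read CSV rows and decode the JSON-encoded columns into Python lists."""
    rows = []
    for row in csv.DictReader(handle):
        for column in json_columns:
            # Treat an empty cell as an empty list instead of failing.
            row[column] = json.loads(row[column]) if row[column] else []
        rows.append(row)
    return rows


def names(entries):
    """Pull the "name" field out of a decoded list of JSON objects."""
    return [entry["name"] for entry in entries]


movies = load_movies(io.StringIO(SAMPLE_CSV))
for movie in movies:
    print(movie["original_title"], names(movie["genres"]))
```

To run this against a real file, replace the `io.StringIO` object with an open file handle and extend `JSON_COLUMNS` to whichever columns actually contain JSON.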
New columns:
- homepage
- id
- original_title
- overview
- popularity
- production_companies
- production_countries
- release_date
- spoken_languages
- status
- tagline
- vote_average
Lost columns:
- actor_1_facebook_likes
- actor_2_facebook_likes
- actor_3_facebook_likes
- aspect_ratio
- cast_total_facebook_likes
- color
- content_rating
- director_facebook_likes
- facenumber_in_poster
- movie_facebook_likes
- movie_imdb_link
- num_critic_for_reviews
- num_user_for_reviews
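Translation:
TMDB 5000 Movie Dataset
Metadata on roughly 5,000 movies from TMDb
What can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over $100 million to produce can still flop, this question is more important than ever to the industry. Film aficionados might have different interests. Can we predict which films will be highly rated, whether or not they are a commercial success?
This is a great place to start digging into those questions, with data on the plot, cast, crew, budget, and revenues of several thousand films.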
The dataset can be downloaded from the official site; I have also shared a copy on Baidu Netdisk. Follow my WeChat official account and reply "2020101705" to get the download link.