Predicting a Movie's Revenue

Published: 2023/12/20

The CEO of the Motion Picture Association of America (MPAA), J. Valenti, said: “No one can tell you how a movie is going to do in the marketplace. Not until the film opens in darkened theater and sparks fly up between the screen and the audience.”

Cigdem Tuncer · Aug 9

The modern film industry, a business of nearly 10 billion dollars per year, is fiercely competitive.

Each year in the United States, hundreds of films are released to domestic audiences in the hope that they will become the next “blockbuster.” Predicting how well a movie will perform at the box office is hard because there are so many factors involved in success.

The goal of this project is to develop a computational model that predicts movie revenues from public data extracted from the Boxofficemojo.com online movie database.

The project has five phases. The first phase is web scraping: different types of features are extracted from Boxofficemojo.com, as described later. The second phase is data cleaning: after scraping the data from the source, I cleaned it, mainly by removing movies for which some features were unavailable. The third phase is exploratory data analysis, where graphics are created to understand the data. The fourth phase is feature engineering, where features for the machine learning model are created from the raw text data. The fifth phase is model analysis, where I applied one of the machine learning algorithms to the data set.

Web Scraping

Web scraping is the use of a program or algorithm to extract and process large amounts of data from the web. Whether you are a data scientist, an engineer, or anybody who analyzes large data sets, the ability to scrape data from the web is a useful skill to have.

It’s a good idea to do some research on your own and make sure that you’re not violating any Terms of Service before you start a large-scale project. To learn more about the legal aspects of web scraping, check out Legal Perspectives on Scraping Data from the Modern Web.

For this project:

· The BeautifulSoup library is used for data extraction from the web.

· The Pandas library is used for data manipulation and cleaning.

· Matplotlib and Seaborn are used for data visualization.
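
As a rough illustration of the extraction step, the sketch below parses one page with BeautifulSoup. The HTML fragment, its class names, and the `parse_movie` helper are hypothetical; the real Box Office Mojo markup is not shown in this article and will differ, so the selectors would need to be adapted.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a downloaded movie page.
# The real Box Office Mojo markup is not shown in the article and will differ.
SAMPLE_HTML = """
<div class="movie">
  <h1>Example Movie</h1>
  <div class="summary">
    <div><span>Distributor</span><span>Example Pictures</span></div>
    <div><span>Budget</span><span>$50,000,000</span></div>
    <div><span>MPAA</span><span>PG-13</span></div>
  </div>
</div>
"""


def parse_movie(html: str) -> dict:
    """Extract the title and the label/value pairs from one movie page."""
    soup = BeautifulSoup(html, "html.parser")
    record = {"Title": soup.find("h1").get_text(strip=True)}
    # Each summary row is assumed to hold exactly two spans: label and value.
    for row in soup.select("div.summary > div"):
        spans = row.find_all("span")
        if len(spans) == 2:
            label, value = (s.get_text(strip=True) for s in spans)
            record[label] = value
    return record


movie = parse_movie(SAMPLE_HTML)
print(movie)
```

In a real run the HTML string would come from an HTTP request for each movie page, and the resulting dictionaries would be collected into a Pandas DataFrame.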

My data set contains 8319 movies released between 2010 and 2019. More recent movies were not selected because few movies were released in 2020 due to Covid-19. I collected the Title, Distributor, Release, MPAA, Time, Genre, Domestic, International, Worldwide, Opening, Budget, and Actors information.

Data Cleaning

At the beginning, my data set had 8319 movies. I then noticed that many movies did not have all of their data available, so missing features were the main reason for eliminating movies from my data set.

Most of the movies have no budget data available, so rows with null values were deleted.

The dtype of the numeric columns was converted from “object” to “float”.

The “Release” data was checked for leap-year issues, and the affected entries were corrected. The dtype of the “Release” column was converted from “object” to “datetime”. Unrelated information was removed from the “Distributor” column.

Duplicate rows were deleted from the data set.

After removing those movies, I ended up with a data set of 1293 movies for which all information is available.
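
The cleaning steps above could look roughly like the following in Pandas. This is a minimal sketch on made-up rows: the string formats (`"$50,000,000"`, `"Feb 29, 2016"`) and the `money_to_float` helper are assumptions, not the article's actual code.

```python
import pandas as pd

# Hypothetical raw rows; column names follow the article, values are made up.
raw = pd.DataFrame(
    {
        "Title": ["A", "B", "C", "C"],
        "Release": ["Feb 29, 2016", "Jul 4, 2014", "Dec 25, 2018", "Dec 25, 2018"],
        "Budget": ["$50,000,000", None, "$1,200,000", "$1,200,000"],
        "Domestic": ["$120,500,000", "$3,000,000", "$900,000", "$900,000"],
    }
)


def money_to_float(series: pd.Series) -> pd.Series:
    """Turn strings such as '$1,200,000' into floats."""
    return series.str.replace(r"[$,]", "", regex=True).astype(float)


# Drop rows without a budget, then drop exact duplicate rows.
clean = raw.dropna(subset=["Budget"]).drop_duplicates().copy()

# Convert the money columns from object (string) to float.
for col in ["Budget", "Domestic"]:
    clean[col] = money_to_float(clean[col])

# Convert the release date from object (string) to datetime.
clean["Release"] = pd.to_datetime(clean["Release"], format="%b %d, %Y")
clean = clean.reset_index(drop=True)

print(clean.dtypes)
print(clean)
```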

Exploratory Data Analysis (EDA)

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

Let’s look at the relationship between “Domestic Total Gross” and “Budget” for each year.

While there is an almost overwhelming number of methods to use in EDA, one of the most effective starting tools is the pairs plot (also called a scatterplot matrix). A pairs plot allows us to see both the distribution of single variables and the relationships between two variables. Pairs plots are a great way to identify trends for follow-up analysis and, fortunately, are easily implemented in Python.

A heatmap is a graphical representation of data in which the individual values contained in a matrix are represented as colors. Heatmaps are well suited to exploring the correlation of features in a data set. We can use either Matplotlib or Seaborn to create the heatmap. To get the correlation of the features in a data set, we can call <dataset>.corr(), which is a Pandas DataFrame method. This gives us the correlation matrix.
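
A small sketch of the `corr()` call on invented numbers, just to show the shape of the result. The values are not from the article's data set.

```python
import pandas as pd

# Made-up numbers, only to show the shape of the result.
df = pd.DataFrame(
    {
        "Budget": [10.0, 50.0, 100.0, 150.0, 200.0],
        "Opening": [2.0, 15.0, 30.0, 55.0, 80.0],
        "Domestic": [8.0, 40.0, 95.0, 160.0, 230.0],
    }
)

corr = df.corr()  # Pearson correlation by default
print(corr.round(3))

# With Seaborn installed, the matrix can be drawn as a heatmap:
#   import seaborn as sns
#   sns.heatmap(corr, annot=True, cmap="coolwarm")
```

The result is a square, symmetric DataFrame with ones on the diagonal, which is exactly the matrix a heatmap colors cell by cell.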

Feature Engineering

Feature engineering means building additional features out of existing data which is often spread across multiple related tables. Feature engineering requires extracting the relevant information from the data and getting it into a single table which can then be used to train a machine learning model.

Machine learning fits mathematical models to the data in order to derive insights. The models take features as input. A feature is generally a numeric representation of an aspect of a real-world phenomenon or of the data. Just as there are dead ends in a maze, the path through the data is filled with noise and missing pieces. Our job as data scientists is to find a clear path to the end goal of insights.

Let’s look at the description of the data set and the distribution of the target column.

We want the target variable predicted by the model to have a normal distribution. When we examine the distribution of our target variable, we see that it is not normal but right-skewed. We can correct this by applying a logarithmic transformation to the target variable.
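
To illustrate the effect of the logarithmic transformation, the sketch below uses synthetic log-normal values as a stand-in for revenue (an assumption, these are not the article's data) and compares the skewness before and after taking the log.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "revenue" values (log-normal), not the article's data.
rng = np.random.default_rng(0)
revenue = pd.Series(rng.lognormal(mean=16.0, sigma=1.5, size=2000))

# All values are positive here; np.log1p is a safer choice if zeros can occur.
log_revenue = np.log(revenue)

print("skewness before:", round(revenue.skew(), 2))
print("skewness after: ", round(log_revenue.skew(), 2))
```

A skewness near zero after the transformation indicates a roughly symmetric distribution, which is closer to what a linear model expects of its target.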

Ordinary least-squares (OLS) regression fits a model of the relationship between one or more explanatory variables and a continuous (or at least interval) outcome variable by minimizing the sum of squared errors, where an error is the difference between the actual and the predicted value of the outcome variable.
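
In symbols, the OLS criterion can be written as follows. The notation is an assumption introduced here for illustration: y_i is the observed outcome for movie i, x_ij is its j-th explanatory variable, and the β values are the coefficients being estimated.

```latex
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\qquad
\hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}
```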

When I fit an OLS model with two numerical features from the data set, I got a low condition number (Cond. No.) but also a low R² score. To increase the R² score, I used feature engineering to add new features derived from the categorical variables in our data set.
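
The two quantities mentioned above, the R² score and the condition number, can be computed by hand for a small OLS fit. This is a sketch with NumPy on synthetic data, used here as a stand-in for whatever regression library produced the article's summary. The feature names and coefficients are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Two synthetic numeric features on a log scale (stand-ins, not real data).
log_budget = rng.normal(17.0, 1.0, n)
log_opening = 0.8 * log_budget + rng.normal(0.0, 0.8, n)
log_domestic = 1.0 + 0.3 * log_budget + 0.7 * log_opening + rng.normal(0.0, 0.5, n)

# Design matrix: an intercept column plus the two features.
X = np.column_stack([np.ones(n), log_budget, log_opening])

# Least-squares solution of X @ beta ~ log_domestic.
beta, *_ = np.linalg.lstsq(X, log_domestic, rcond=None)

pred = X @ beta
ss_res = np.sum((log_domestic - pred) ** 2)                  # residual sum of squares
ss_tot = np.sum((log_domestic - log_domestic.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot

# Ratio of the largest to the smallest singular value of the design matrix.
cond_no = np.linalg.cond(X)

print("coefficients:", beta.round(3))
print("R^2:", round(r2, 3), "condition number:", round(cond_no, 1))
```

A large condition number signals that the columns of the design matrix are nearly collinear or badly scaled, which is why it tends to grow as more correlated features are added.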

· The “year” column and four season columns were created from the “Release” column.

· Four dummy columns were created from the “MPAA” column.

· A running time (min) column was created from the “Time” column.

· New columns were created for all distributors with more than 49 rows.

· Log-transformed versions of the “Budget” and “Opening” columns were created.
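
The feature engineering steps listed above might be sketched in Pandas as follows. The rows are hypothetical, the month-to-season mapping and the `"1 hr 55 min"` time format are assumptions, and the distributor threshold is lowered from the article's 49 so that it does something on a four-row table.

```python
import numpy as np
import pandas as pd

# Hypothetical cleaned rows; the "Time" format ("1 hr 55 min") is an assumption.
df = pd.DataFrame(
    {
        "Release": pd.to_datetime(["2016-02-29", "2014-07-04", "2018-10-12", "2012-04-06"]),
        "MPAA": ["PG-13", "R", "PG", "G"],
        "Time": ["1 hr 55 min", "2 hr 10 min", "1 hr 32 min", "2 hr"],
        "Distributor": ["Studio A", "Studio B", "Studio A", "Studio A"],
        "Budget": [50e6, 20e6, 1.2e6, 90e6],
        "Opening": [30e6, 5e6, 0.4e6, 45e6],
    }
)


def month_to_season(month: int) -> str:
    """Map a month number to a season (Dec-Feb = winter; an assumed definition)."""
    return ("winter", "spring", "summer", "fall")[(month % 12) // 3]


# Year and season dummy columns from the release date.
df["year"] = df["Release"].dt.year
season = df["Release"].dt.month.map(month_to_season)
df = df.join(pd.get_dummies(season, prefix="season", dtype=int))

# Dummy columns from the MPAA rating.
df = df.join(pd.get_dummies(df["MPAA"], prefix="MPAA", dtype=int))

# Running time in minutes from strings such as "1 hr 55 min" or "2 hr".
parts = df["Time"].str.extract(r"(?:(\d+)\s*hr)?\s*(?:(\d+)\s*min)?")
df["runtime_min"] = parts[0].fillna("0").astype(int) * 60 + parts[1].fillna("0").astype(int)

# One indicator column per frequent distributor.
MIN_ROWS = 2  # the article uses "more than 49 rows"; lowered for this toy table
counts = df["Distributor"].value_counts()
for name in counts[counts > MIN_ROWS].index:
    df[f"dist_{name}"] = (df["Distributor"] == name).astype(int)

# Log-transformed budget and opening columns.
df["log_budget"] = np.log(df["Budget"])
df["log_opening"] = np.log(df["Opening"])

print(df.drop(columns=["Release", "Time"]).T)
```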

Model Analysis

Now it is time to split our data into training, validation, and test sets. Let’s rerun our model and then compare the Ridge, Lasso, and polynomial regression results.

The data set was split into training (60%), validation (20%), and test (20%) sets. The tuning parameters (alpha) of the Lasso and Ridge models were chosen from a wide range of values using 10-fold cross-validation.
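
One way to reproduce the 60/20/20 split and the alpha search with scikit-learn is sketched below. The data is synthetic, and the alpha grid, random seeds, and the use of `RidgeCV` and `LassoCV` are assumptions about how this could be done, not the article's exact code.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the movie features.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=500)

# 60% train, then split the remaining 40% evenly into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Regularized models are sensitive to feature scale, so standardize on the training set.
scaler = StandardScaler().fit(X_train)
X_train_s, X_val_s = scaler.transform(X_train), scaler.transform(X_val)

# Search a wide, log-spaced range of alphas with 10-fold cross-validation.
alphas = np.logspace(-3, 3, 50)
ridge = RidgeCV(alphas=alphas, cv=10).fit(X_train_s, y_train)
lasso = LassoCV(alphas=alphas, cv=10).fit(X_train_s, y_train)

print("ridge alpha:", ridge.alpha_, "validation R^2:", round(ridge.score(X_val_s, y_val), 3))
print("lasso alpha:", lasso.alpha_, "validation R^2:", round(lasso.score(X_val_s, y_val), 3))
```

The test set is left untouched here on purpose: it is only meant for a final check once the model and its alpha have been chosen.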

When we added the engineered features to the model, the R² score of the OLS model increased to 0.759, but the condition number increased as well. Lasso regression and Ridge regression gave the same results, and the result of linear regression was very close to them. The best result came from degree-2 polynomial regression, followed by Ridge polynomial regression.

Now it’s time to do cross-validation (CV) and look at the mean absolute error (MAE). When we cross-validate each model (kfold = 10), we see a small drop in scores.
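
A minimal sketch of comparing models by 10-fold cross-validated MAE with scikit-learn follows. The data is synthetic with a mild quadratic term, and the model set and `alpha=1.0` are illustrative assumptions, so the ranking printed here says nothing about the article's actual results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 3))
# Synthetic target with a mild quadratic term so degree-2 features can help.
y = 2.0 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2] ** 2 + rng.normal(scale=0.3, size=400)


def poly2():
    """Degree-2 polynomial feature expansion without the constant column."""
    return PolynomialFeatures(degree=2, include_bias=False)


models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "poly2": make_pipeline(poly2(), StandardScaler(), LinearRegression()),
    "ridge_poly2": make_pipeline(poly2(), StandardScaler(), Ridge(alpha=1.0)),
}

cv = KFold(n_splits=10, shuffle=True, random_state=0)
mae = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
    mae[name] = -scores.mean()  # scikit-learn returns negated MAE, so flip the sign

for name, value in sorted(mae.items(), key=lambda item: item[1]):
    print(f"{name:12s} MAE = {value:.3f}")
```

Putting the scaler and the polynomial expansion inside the pipeline means they are refit on each training fold, so no information from the held-out fold leaks into the preprocessing.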

Conclusion

Finally, when we look at the mean absolute errors on the established models, we can say that Ridge Polynomial Regression will bring us the most accurate results.

The five fundamental assumptions of linear regression analysis were checked; the checks can be seen in the Jupyter Notebook.

The GitHub repository for the web scraping and data processing is here.

Thank you for your time and for reading my article. Please feel free to contact me if you have any questions or would like to share your comments.

Source: https://medium.com/analytics-vidhya/predicting-a-movies-revenue-3709fb460604
