Investing in a New Eatery in California: A Data-Driven Approach
“It is difficult to make predictions, especially about the future.”
~Niels Bohr
Everything is better interpreted through data. And data-driven decision making is crucial for success in any industry.
This has been true since time immemorial. The difference now is that we have developed a healthier attitude toward data, we have far more of it than before, and we have at our disposal computing power that was previously unimaginable.
In this situation, the computing power and the data should be leveraged to make better decisions to solve business problems.
In my project, I chose to provide recommendations for opening new eateries in California. The output is a concrete list of investment recommendations: eatery types (such as Japanese restaurant, dessert shop, etc.) paired with the counties in which to open them.
In this post, I will go over the full process of a Data Science project.
Data Sources
To solve this problem, data from four sources were used:
Location data titled “California Counties” from the California Open Data Portal, provided by the Government of California, for the geographical locations.
The Foursquare API, for information about established restaurants and related venue details.
County-wise population data from the US Government Census site.
County-wise Real GDP data provided by the Bureau of Economic Analysis, U.S. Department of Commerce.
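As a rough illustration of how the county-level population and GDP tables might be combined into one table, here is a minimal pandas sketch. The county names, column names, and numbers are made-up placeholders, not the real Census or BEA figures, and the join key is an assumption about how the files are laid out.

```python
import pandas as pd

# Toy placeholder values, NOT real census or BEA figures.
population = pd.DataFrame({
    "county": ["Alpha", "Beta", "Gamma"],
    "population": [1_000_000, 250_000, 40_000],
})
gdp = pd.DataFrame({
    "county": ["Beta", "Alpha", "Gamma"],
    "real_gdp": [12_000_000, 60_000_000, 1_500_000],
})

# An inner join on the county name keeps only counties present in both tables;
# validate="one_to_one" raises an error if a county appears twice in either table.
counties = population.merge(gdp, on="county", how="inner", validate="one_to_one")
print(counties)
```

Checking the row count after the merge is a cheap way to spot counties whose names are spelled differently across sources.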
Exploratory Data Analysis
After cleaning the data (which is definitely more than 90% of a Data Scientist’s job), meaningful insights were gained from the data.
City Centers of California’s Counties, source: Author
It was also found that the GDPs of the counties are strongly correlated with their populations. This makes counties with high GDP and high population attractive destinations for investment.
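One way to quantify such a relationship is the Pearson correlation coefficient. The sketch below uses pandas on a few made-up rows; the column names and values are illustrative assumptions, not the project's data.

```python
import pandas as pd

# Toy placeholder values, NOT the real county figures.
counties = pd.DataFrame({
    "population": [40_000, 250_000, 1_000_000, 3_000_000],
    "real_gdp": [1_500_000, 12_000_000, 60_000_000, 200_000_000],
})

# Series.corr computes the Pearson correlation coefficient by default.
r = counties["population"].corr(counties["real_gdp"])
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 indicates a strong positive linear relationship between the two columns.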
Strong Correlation Between GDP and Population of Californian Counties, source: Author
Number of Eateries in Each County (capped at 50 by Foursquare), source: Author
With the information provided by the Foursquare API, a list of the ten most common venues was obtained for each county. This list is used later in decision making.
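A ranking like this can be produced by counting venue categories per county. The following is a minimal sketch, assuming the venue data has already been flattened into a table with one row per venue; the `county` and `category` column names, the helper name `most_common_venues`, and the rows themselves are hypothetical.

```python
import pandas as pd

# Toy placeholder rows: one row per venue (hypothetical layout).
venues = pd.DataFrame({
    "county": ["Alpha"] * 6 + ["Beta"] * 3,
    "category": [
        "Pizza Place", "Pizza Place", "Pizza Place", "Cafe", "Cafe", "Bakery",
        "Taco Place", "Taco Place", "Cafe",
    ],
})

def most_common_venues(venues: pd.DataFrame, n: int = 10) -> dict:
    """Return, for each county, its n most frequent venue categories."""
    ranked = {}
    for county, group in venues.groupby("county"):
        # value_counts sorts categories from most to least frequent.
        ranked[county] = group["category"].value_counts().head(n).index.tolist()
    return ranked

top = most_common_venues(venues, n=2)
print(top)
```

With real data, `n=10` would give the ten most common venues per county described above.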
Five Rows
Applying Machine Learning Model
Choosing Algorithm
The business problem is to look for eatery types and locations to invest in. The data is not labeled. This makes the problem a classic application of unsupervised learning.
The aim is not to predict a value or a class, nor to hand the stakeholders a single investment recommendation. The goal is to suggest a list of likely venues.
This can be achieved by clustering the counties based on GDP and population. KMeans clustering is a simple and well-suited statistical learning algorithm for this.
The scikit-learn implementation of the KMeans clustering algorithm was used.
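A minimal sketch of clustering counties on population and GDP with scikit-learn's `KMeans` could look like the following. The numbers are made-up placeholders, and the standardization step is an assumption added here because the two features are on very different scales; the original project may have handled scaling differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy placeholder rows of [population, real_gdp], NOT real county figures.
X = np.array([
    [30_000, 1_000_000],
    [50_000, 2_000_000],
    [45_000, 1_500_000],
    [3_000_000, 200_000_000],
    [3_200_000, 220_000_000],
])

# Assumption: standardize so that neither feature dominates the distance.
X_scaled = StandardScaler().fit_transform(X)

# n_init restarts the algorithm from several initializations and keeps the best run.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X_scaled)
print(labels)
```

The cluster numbers themselves are arbitrary; only which counties share a label is meaningful.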
Choosing k
For choosing the best k for clustering, the elbow method was employed.
Inertia vs. Values of k Plot, source: Author
As evident from the graph, the best k is 4. Hence, the clustering algorithm was applied with k = 4, forming 4 clusters of counties based on their population and GDP.
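The elbow method can be sketched as fitting KMeans for a range of k values and recording the inertia (the within-cluster sum of squared distances) for each. The example below runs on synthetic points drawn around four centers, purely for illustration; it is not the county data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: 25 points around each of four well-separated centers.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(25, 2)) for c in centers])

# Fit KMeans for each candidate k and store the resulting inertia.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting inertia against k and looking for the point where the curve stops dropping sharply gives the "elbow"; on this synthetic data the drop flattens after k = 4.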
Results
Four clusters of counties were formed. Upon examination, Los Angeles County formed a cluster by itself (cluster 2) because of its exceptionally high GDP and population. The counties in another cluster (cluster 3) also had high GDP and high population, though nowhere near those of Los Angeles County: Orange, Santa Clara, and San Diego. Counties with low GDP and low population, such as Plumas, Nevada, and Sierra, fell into one cluster (cluster 1), and counties with mid-range GDP and population, such as Sacramento and Riverside, into another (cluster 4).
Resulting Clusters on a Map, source: Author
Clusters 2 and 3 contain the counties with high population and high GDP. In these counties, investing in any eatery is likely to be profitable, but it is advisable to pick an eatery type that is not among the county's three most common venues.
In cluster 4, the population and GDP of the counties are higher than in cluster 1 but lower than in clusters 2 and 3. Investment in these counties is preferred after counties in cluster 2 and cluster 3, in that order. Investment should go to uncommon eatery types so that they face less competition.
Cluster 1 is dominated by counties with lower populations. Investment here should be considered only after counties in clusters 2, 3, and 4. Investing in the most common eatery types is not advised at all, and overall these counties are the least recommended for investment.
After suggesting investment options, tables were built for each cluster listing, per county, the eatery types that are not among its three most common.
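The filtering step behind such tables can be sketched as dropping the first three entries from each county's ranked venue list. The helper name `recommend_eateries`, the county names, and the categories below are hypothetical; the input is assumed to be ordered from most to least common.

```python
def recommend_eateries(ranked_venues: dict, skip_top: int = 3) -> dict:
    """Drop each county's skip_top most common venue types and keep the rest."""
    return {
        county: categories[skip_top:]
        for county, categories in ranked_venues.items()
    }

# Toy placeholder ranking, ordered from most to least common per county.
ranked = {
    "Alpha": ["Pizza Place", "Cafe", "Bakery", "Sushi Restaurant", "Dessert Shop"],
    "Beta": ["Taco Place", "Cafe", "Diner", "Thai Restaurant"],
}

recommendations = recommend_eateries(ranked)
print(recommendations)
```

The remaining types are still among the county's common venues, so there is evidence of demand, while the three most crowded categories are avoided.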
Table for Counties and Investment Recommendations in Cluster 3
Full Report Link: PDF in GitHub Repository
Notebook with Full Code: NB Viewer
Feel free to comment, provide feedback, or criticize.
Connect with me on LinkedIn or Twitter.
This blog post is related to the Applied Data Science Capstone Project offered by IBM through Coursera.
Translated from: https://medium.com/beginning-data-science/investing-in-a-new-eastery-in-california-a-data-driven-approach-e91229e0289e