How to Estimate Data Collection Costs for Your Data Science Project
(Note: all opinions are my own)
Introduction
Data collection is the first and most fundamental step in any Data Science or Analytics project: every subsequent activity, from data analysis to model deployment, relies on it.
With the pervasive presence of APIs and Cloud Computing, I am increasingly interested in maximizing the efficiency and level of automation of data collection for both work and personal projects.
In the latter category, I have been interested in collecting data from online home-rental platforms in the UK market (Zoopla, RightMove, OnTheMarket, and similar), with the aim of extracting image and text data to feed machine learning models. Example use cases include predicting a property's price, extracting key features from images to infer a listing's true value, and processing customer reviews with NLP techniques.
Below, I discuss how to go about:
Identifying the most critical data sources
Estimating data collection costs, should you want to put your solution to commercial use
I have given the article a broad scope: it touches on the market and regulatory considerations involved in collecting data for potentially commercial purposes, as well as the more technical considerations of working with APIs, since there are multiple layers to this very interesting topic.
I hope the key points below will prove useful in setting up the data collection block of your current and future Data Science projects, whatever your industry focus.
Do your market research and identify your key data sources
Online home-rental platforms are two-sided markets, driven by supply and demand agents: on the supply side, homeowners looking to rent out a property, either directly or through a real-estate agent; on the demand side, individuals looking to rent. In such markets you will find the most data, in both quantity and quality, on the platforms that drive the majority of traffic from both sides.
In this sense, you need to identify the platforms that hold the majority of market power, as they attract the most eyeballs. Knowing how overall traffic and data volume are distributed across the market is very useful if you want to pull large amounts of data over time without having to integrate multiple data streams from smaller market players.
In the UK’s online home-rental market, the majority of traffic and listings is concentrated among the top 1–5 players. Those companies (the left of the curve in the illustrative distribution below) are therefore the ones on which to focus your data collection efforts.
(Figure: illustrative distribution of market share across platforms. Image credit: Mode.com)
This is of course a double-edged sword, as the big players you will be sourcing from have high leverage when entering data-sharing agreements, which allows them to:
1) act as de facto gatekeepers to a particular market and set their own data usage policies, especially in a less regulated market
2) charge more for the same unit of data volume when entering data-sharing agreements
3) effectively monitor potential competitive threats to their core business from startups that require access to their data and are thus more dependent on their services
At the same time, given a skewed distribution of market share and the absence of enforced competition regulation, this is where the true value of the data resides. Aspiring Data Science teams that want to get their hands on this data therefore need to pay a price to cover the majority of the market and access high-volume, high-quality data points.
N.B. For non-commercial or research purposes, you are probably OK just scraping data off these websites, although the activity is not always appreciated when done at high frequency and volume. This is purely a practical consideration: I do not encourage web scraping on websites that have policies against it, and you are always better off respecting the terms and conditions of the data provider.
Always look for APIs first
Once you have identified the main data sources, your best first move is to look through their developer resources and figure out:
Whether they have an active API from which you can pull the data you need
What their overall data-sharing terms and conditions (T&Cs) are
Zoopla, for example, has an API page, which can be used to return a few features and some listings data. Zoopla’s API has not been updated in a while and has apparently drawn criticism, previously documented on Medium, but this is the type of information you want to look for when comparing different data sources.
RightMove’s official website, by contrast, directs you to their Data Services page. They do not seem to have or authorize any official API at the time of writing. OnTheMarket.com does not seem to have an API either.
Checking the main players is incredibly useful for determining the next steps in your data collection strategy. If you find an active API, you can get some sample data and decide:
Whether the data volume and quality are enough for your application
Whether you are in violation of their T&Cs
Whether you want to get in touch with the data providers (see next steps) to submit a formal data request for further, and hopefully richer, datasets
Whether to move on to smaller players in the market that may give you enough data to start with via their own API (for example other aggregators such as Nestoria, which does provide one)
Whatever the case, do not skip this step: it provides very valuable information even if you are not immediately given access to what you need.
Don’t be afraid to get in touch with data providers and discuss potential data-sharing agreements
In my case, I decided to dig a bit deeper and tentatively got in touch with RightMove and Zoopla via email and LinkedIn, searching for people in Analytics roles and reaching out to viable contacts.
I recommend doing this, as you can often find people on the other side who are interested in supporting developers and hearing about interesting use cases. You may also uncover information that you had not noticed while reading through the documentation.
In my case, I found RightMove to be very restrictive about the use of their data, and the only thing I really obtained from them was a cold shoulder. It was the same with Zoopla, who merely referred me back to their existing API, whose data richness I doubted after testing it briefly with a Python script.
At this point, I decided to search online for applications and platforms that already used data from either of the two main providers, to see whether I could find out how they had obtained it and at what cost.
I could also have doubled down on Zoopla and RightMove and proposed a data-sharing agreement, but as a single individual, how much leverage would I realistically have in such a conversation?
In similar cases, where you are trying to decide where and how to collect your data, I suggest you either:
Take your time researching the market and the various data providers, and give yourself as many potential data sources as possible, which will also allow you to compare their costs against the budget you are willing to allocate to your project
Take your time establishing a relationship with your few providers of choice (if, as in this case, they do not have a clear-cut API) and extract as much pricing and other information from them as you can, while being very transparent about the use you plan to make of their data (research, commercial, personal, etc.)
Leverage the expertise of others who have collected the data before you
After identifying your main data sources and checking their APIs and usage potential, you will also want to reach out to other market players who are using those same data sources and see if you can uncover further insights.
I found this to be an incredibly useful little step for getting good-quality contextual information on data collection costs.
For example, I found a great website, PropertyData, which cites the same data sources I was looking at, so I immediately sent them a message through their contact form.
To my surprise, the founder himself replied. He mentioned how much one provider was charging PropertyData for the data they needed, and confirmed that they had not been able to convince another provider to share their data at any price point, which matched my own negative experience of reaching out to these providers via email and LinkedIn.
(Below is an extract from the email response I got from PropertyData, sanitised where possible for confidentiality reasons.)
“We pay Source 1 £XX per month. That did the trick to get us what we needed!
Source 2, no amount of money makes them interested!
PropertyData”
This is great information, as it:
Gives you an actual figure from which to extrapolate data collection costs for similar providers, in the absence of any API or published price points
Gives you a further indication of which data sources might be more feasible to work with and which ones you might avoid altogether, using the experience of others as a compass
I always recommend taking the time to reach out to those who have done it before and simply asking: you might get surprisingly positive and helpful responses in return!
Run your estimations and check financial and technical feasibility
By this point, you should have collected all the information needed to calculate the monthly running costs of data collection, which can be estimated as:
Monthly data collection cost = Number of data sources × Avg. monthly subscription cost of API/data agreement
On top of this, you may want to factor in the cost of any Cloud Computing resources. This will depend on your data collection scripts and on the amount of processing (driven by run time and data size) needed to get your data into your data lake or data warehouse for later processing and analysis.
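As a rough illustration, the estimate can be sketched in a few lines of Python. This is only a minimal sketch under my own assumptions: the DataSource class, the monthly_collection_cost function, and the split of cloud costs into compute hours and stored gigabytes are hypothetical names and simplifications, and every figure in the example is a made-up placeholder rather than a real price from any provider. Summing the individual fees is equivalent to multiplying the number of sources by their average monthly fee.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    """A hypothetical data provider with a flat monthly fee."""
    name: str
    monthly_fee: float  # API subscription or data-agreement fee, in GBP per month


def monthly_collection_cost(sources, cloud_hours=0.0, cloud_rate_per_hour=0.0,
                            storage_gb=0.0, storage_rate_per_gb=0.0):
    """Estimate the total monthly data collection cost in GBP.

    Subscription cost is the sum of all provider fees, i.e.
    number of sources * average monthly fee. Cloud cost is simplified
    to compute time plus storage, both priced linearly.
    """
    subscription = sum(s.monthly_fee for s in sources)
    compute = cloud_hours * cloud_rate_per_hour
    storage = storage_gb * storage_rate_per_gb
    return subscription + compute + storage


if __name__ == "__main__":
    # Placeholder figures for illustration only, not real quotes.
    sources = [
        DataSource("Source 1", monthly_fee=100.0),
        DataSource("Source 2", monthly_fee=300.0),
    ]
    total = monthly_collection_cost(
        sources,
        cloud_hours=10, cloud_rate_per_hour=0.5,   # scripts running 10 h/month
        storage_gb=40, storage_rate_per_gb=0.25,   # 40 GB kept in the data lake
    )
    print(f"Estimated monthly cost: £{total:.2f}")
```

Swapping in the price points you gathered from providers, and from others who have bought the same data, lets you compare the total against the budget you are willing to allocate before committing to a source.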
Beyond the numbers, at this point you should also develop a sense of the overall technical feasibility of the approach given your project setup, and of whether it makes sense to proceed or to pivot your data collection strategy completely.
In summary
A sound data collection methodology can really get your data science project up and running in the best way, helping you obtain the best possible data at the best possible price given your knowledge of the market and the data providers available.
If you can:
Conduct solid market research and identify the best-quality sources
Thoroughly check for existing APIs and their (usually) rich documentation
Reach out to data providers to discuss potential data requests and their willingness to assist you
Broaden your knowledge by asking people and companies who have been given access to the data before you
Get a fair estimate of how much time and money you are realistically going to spend to capture all the data you need
you will greatly increase your chances of developing a sound approach to data collection and of getting great data in an efficient way. Thanks for reading!
Source: https://towardsdatascience.com/how-to-estimate-data-collection-costs-for-your-data-science-project-8938ca9acc5f