The One Component That Deserves More Attention in Data Science

Published: 2023/11/29
Introduction

Data science, machine learning, artificial intelligence: these terms are all over the news. They get everyone excited with promises of automation, new savings or higher earnings, and new features, markets or techniques. Some of those promises are well-founded, while others are still at inception or haven’t passed the proof-of-concept stage (another way of saying they are still at the pipe-dream stage).

There have been major improvements in the techniques we use to extract, transform and load data. New and refined algorithms and techniques, such as PCA and hyperparameter optimization, and designs, such as neural networks, have improved outcomes. But one aspect doesn’t get enough attention: the ugly duckling. If you’re accustomed to working with data, you might have already guessed it. If not, you’ll find out next. Let’s dive in.

The Unloved One In Data Science

At the heart of everything in business and research, aside from money, is data. Data is the new oil, or the new electricity, depending on who you ask. Because computers make it easy to collect, share and analyse, data is now a strategic asset.

But one aspect of data isn’t discussed enough: its quality. Quantity, whether Big Data or small data, doesn’t matter if the quality of the data is poor.

Garbage in, garbage out

No matter how good your data pipeline, your cleaning and your training/testing models are, no matter your hypothesis or the complexity of your algorithm, nothing valuable will result from your work if your data is of poor quality. That’s the famous “garbage in, garbage out”. You can’t bake a good cake with rotten ingredients.

This flow provides another way to look at data quality:

Data Quality → Information Quality → Decision Quality → Business Outcome

Image: original meme template from Reddit, question adjusted by the article’s author.

What We Talk About When We Talk About Poor Data

To find bad data, one needs to know what to look for. The industry mostly defines six to seven dimensions to quantify the quality of data, although the number can vary depending on needs, industry and focus.

A handy acronym is ACCCUT. Let’s review it:

Accuracy. Every data point should have correct values. Example: names are spelled properly and recorded events really happened.

Completeness. Data records should contain all required information. Optional elements are… optional. Example: name, surname and email are required, but the physical address is optional.

Consistency. Any given data point should be the same across the organization and all its systems. Example: records from the client relationship management tool should match those in the financial tool.

Conformity. Data records should follow standards (format, size, type, …). Example: rules for names and date formats.

Uniqueness. No duplicates. This is key to preventing confusion or the use of outdated records. When facing two records for the same entity, which one should you select? Or neither? This can be quite a dilemma and a waste of time.

Timeliness. Availability could also be used as a synonym. The data needs to be usable and available when the user needs it. Example: on online booking websites, users expect the data (e.g. free or booked rooms) to be updated in real time, to avoid confusion, frustration and an overall poor user experience.
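
Some of these dimensions can be turned into simple numeric scores. The sketch below is only an illustration: the record layout, the required fields, the email pattern and the seven-day freshness window are all assumptions made for the example, not rules from the article. Accuracy and consistency are left out because they need an external reference to compare against.

```python
import re
from datetime import datetime, timedelta

# Hypothetical rules: which fields are required and what an email looks like.
REQUIRED_FIELDS = ("name", "surname", "email")
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness(records):
    """Share of records in which every required field is filled in."""
    complete = sum(
        all(rec.get(field) not in (None, "") for field in REQUIRED_FIELDS)
        for rec in records
    )
    return complete / len(records)


def conformity(records):
    """Share of records whose email follows the expected format."""
    valid = sum(bool(EMAIL_PATTERN.match(rec.get("email") or "")) for rec in records)
    return valid / len(records)


def uniqueness(records):
    """Distinct emails (case-insensitive) divided by the number of records."""
    distinct = {(rec.get("email") or "").strip().lower() for rec in records}
    return len(distinct) / len(records)


def timeliness(records, as_of, max_age):
    """Share of records updated no longer than max_age before as_of."""
    fresh = sum(as_of - rec["updated"] <= max_age for rec in records)
    return fresh / len(records)


# Made-up sample data for the illustration.
records = [
    {"name": "Ada", "surname": "Lovelace", "email": "ada@example.com",
     "updated": datetime(2024, 1, 10)},
    {"name": "Ada", "surname": "Lovelace", "email": "ADA@example.com",
     "updated": datetime(2024, 1, 2)},      # duplicate of the first record
    {"name": "Alan", "surname": "Turing", "email": "alan.example.com",
     "updated": datetime(2024, 1, 9)},      # malformed email
    {"name": "Grace", "surname": "", "email": "grace@example.com",
     "updated": datetime(2023, 6, 1)},      # missing surname, stale
]

scores = {
    "completeness": completeness(records),
    "conformity": conformity(records),
    "uniqueness": uniqueness(records),
    "timeliness": timeliness(records, datetime(2024, 1, 15), timedelta(days=7)),
}
print(scores)
```

Scores like these are crude, but they give each dimension a number that can be reported and compared over time.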

How Did This Happen?

Photo by Emily Morter on Unsplash

Bad data is one thing, but finding the causes is another. Culprits can be divided into two main categories:

1. Systems

Because we’re using multiple systems and software tools to track and update our records, there is a risk of improper integration. Over time, this can lead to incomplete records, duplicates and a lack of consistency. Migrating data between platforms is another cause of deterioration (think degraded or lost records).

While platforms might be properly integrated at inception, time and updates might lead them to diverge and cause quality concerns. Humans might play an unfortunate role in this.

2. Humans

Typos when completing an input form (easy fix: use a drop-down list where possible), failure to follow guidelines, or a new entry created for an existing record. Misreported data also leads to poor quality. It is often caused by a lack of understanding of what should be reported, so state clearly what is expected, perhaps with examples. These errors can come from your co-workers as well as your customers.

3. Bonus: Data degrading over time

The data might be of good quality at first, but changes in the background can render it useless. Think of a customer who changes a phone number, email or address without informing you. It can also be a change in the methodology used to compute a given metric.
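
Integration drift between systems can be detected by comparing the records two tools hold for the same customer. The following is a minimal sketch under assumed conditions: both systems export a mapping from a shared customer id to a record, and the names `crm`, `finance` and the compared fields are hypothetical.

```python
def find_inconsistencies(crm, finance, fields):
    """Compare records that share a customer id across two systems."""
    report = {"missing_in_finance": [], "missing_in_crm": [], "mismatched": []}
    for customer_id in sorted(crm.keys() | finance.keys()):
        if customer_id not in finance:
            report["missing_in_finance"].append(customer_id)
        elif customer_id not in crm:
            report["missing_in_crm"].append(customer_id)
        else:
            # List every compared field whose value differs between the systems.
            differing = [
                field for field in fields
                if crm[customer_id].get(field) != finance[customer_id].get(field)
            ]
            if differing:
                report["mismatched"].append((customer_id, differing))
    return report


# Made-up exports from the two hypothetical systems.
crm = {
    "C1": {"email": "ada@example.com", "phone": "555-0100"},
    "C2": {"email": "alan@example.com", "phone": "555-0101"},
    "C3": {"email": "grace@example.com", "phone": "555-0102"},
}
finance = {
    "C1": {"email": "ada@example.com", "phone": "555-0100"},
    "C2": {"email": "a.turing@example.com", "phone": "555-0101"},  # email changed in one system only
    "C4": {"email": "edsger@example.com", "phone": "555-0103"},
}

report = find_inconsistencies(crm, finance, fields=("email", "phone"))
print(report)
```

Running such a comparison on a schedule makes it easier to tell whether a mismatch came from a migration, an update to one platform, or a manual edit.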

What Should I Do, Doc?

Photo by National Cancer Institute on Unsplash

  • Find out, aka the unpleasant discovery: first things first, find out about the poor quality. It seems trivial, but if you aren’t aware of the state of your data, you are already wasting valuable time and resources. You can be informed by your data provider (e.g. another department in your company, or your client), have initial suspicions based on hearsay, or discover it when doing your homework, by quickly eyeballing the data and/or doing your exploratory data analysis (EDA).

  • Define & Report: define the extent of the damage and try to quantify it meaningfully, by category, then summarize your findings so they can be reported. Defining also means finding the sources of poor data quality; more on this in the next part. When putting your report or deck together, don’t forget to present the ‘not-so-bad’ parts as well, to keep spirits high.

  • Inform: inform your stakeholders, but make sure your team has been informed as well, to prevent internal information asymmetries and keep future interactions with stakeholders from becoming awkward.

  • Get Feedback: based on your report on the data quality gaps, you’ll hopefully get some concrete feedback. You might, for instance, get a green light to move on with what you currently have. Surprising? Maybe a bit. If the gaps affect 1–5% of the data, or chunks that do not matter that much, the affected records could simply be tagged for removal from the dataset. On the other hand, if it is recognised that the state of the data is not sufficient, there’s some work ahead.

  • Return To Data: now that a decision has been taken, one can return to the data. If the decision was to go ahead, then that’s the end of this article. Alternatively, if there’s a need to ‘fix things’, that’s when the next part kicks in.
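
The "Define & Report" and "Get Feedback" steps above can be supported by a small per-field summary. This is a sketch, not a prescribed method: the field names are invented, and the 5% tolerance simply borrows the upper end of the 1–5% range mentioned above as an assumed threshold that stakeholders would have to confirm.

```python
def missing_rate_by_field(records, fields):
    """Fraction of records in which each field is empty or absent."""
    total = len(records)
    return {
        field: sum(1 for rec in records if rec.get(field) in (None, "")) / total
        for field in fields
    }


def triage(rates, tolerance=0.05):
    """Label each field by whether its missing rate is within the agreed tolerance."""
    return {
        field: "within tolerance" if rate <= tolerance else "needs work"
        for field, rate in rates.items()
    }


# Made-up dataset: 20 records, 1 missing email and 6 missing phone numbers.
records = [
    {
        "email": None if i == 0 else f"user{i}@example.com",
        "phone": "" if i < 6 else f"555-01{i:02d}",
    }
    for i in range(20)
]

rates = missing_rate_by_field(records, fields=("email", "phone"))
decision = triage(rates)
print(rates)
print(decision)
```

A table like this gives the report something concrete to show, including the ‘not-so-bad’ fields.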

Solving It

This is an imaginary dialogue, but one you might encounter in one form or another.

“Can it be solved?”

“…”

“Can it be?”

“Well, if you insist, let’s go for it”.

The question is ‘how to go for it, how to solve it?’, and there are multiple answers. The focus will be

1. The first solution is a bit brute-force and probably not always suitable for business environments. It rests on two main pillars:

  • Drop incomplete records

Drop rows with data quality concerns on a limited set of columns, or drop all rows where there’s even a single issue or missing value in any field.

  • Complete missing fields

Think statistical imputation. There exist multiple methods, each with its pros and cons: filling or fixing with the most frequent value, the mean or median, zero or some other constant. Alternatively, we can programmatically complete a record’s missing fields with the help of similar records. That is kNN imputation, which uses the nearest neighbours of the incomplete record.
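
Both pillars can be sketched in a few lines of standard-library Python. The column names and values are invented, and the kNN variant is a deliberately simplified version (unscaled Euclidean distance, mean of the neighbours) meant only to show the idea, not a substitute for a tested library implementation.

```python
import math
import statistics


def drop_incomplete(rows, fields):
    """Keep only rows in which every listed field has a value."""
    return [row for row in rows if all(row.get(f) is not None for f in fields)]


def impute_simple(rows, field, strategy="median"):
    """Fill missing values of `field` with the mean, median or most frequent value."""
    observed = [row[field] for row in rows if row.get(field) is not None]
    fill = {
        "mean": statistics.mean,
        "median": statistics.median,
        "most_frequent": statistics.mode,
    }[strategy](observed)
    return [{**row, field: fill} if row.get(field) is None else dict(row) for row in rows]


def impute_knn(rows, field, features, k=2):
    """Fill missing `field` with the mean over the k nearest complete rows.

    Distance is plain Euclidean over `features`; in practice the features
    would usually be scaled first.
    """
    complete = [row for row in rows if row.get(field) is not None]
    filled = []
    for row in rows:
        if row.get(field) is not None:
            filled.append(dict(row))
            continue
        nearest = sorted(
            complete,
            key=lambda other: math.dist(
                [row[f] for f in features], [other[f] for f in features]
            ),
        )[:k]
        filled.append({**row, field: statistics.mean([n[field] for n in nearest])})
    return filled


# Made-up rows; the last one is missing its income.
rows = [
    {"age": 25, "income": 30000},
    {"age": 27, "income": 34000},
    {"age": 52, "income": 80000},
    {"age": 26, "income": None},
]

dropped = drop_incomplete(rows, fields=("age", "income"))
by_median = impute_simple(rows, "income", strategy="median")
by_knn = impute_knn(rows, "income", features=("age",), k=2)
print(len(dropped), by_median[-1]["income"], by_knn[-1]["income"])
```

On this toy data the median fill uses the middle observed income, while the kNN fill averages the two rows closest in age, which illustrates how the choice of method changes the imputed value.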

2. Leverage data quality tools

Depending on your needs, multiple solutions exist. Problems such as duplication or inconsistent formats, languages or units can easily be flagged by software, then corrected.

3. Correct faulty records

Faulty records do not have to stay that way. If data quality tools cannot help, humans can always step in. It’s neither fancy nor pleasant, but human input can help correct what human (or system) input messed up in the first place. This should be approached case by case, with a cost-benefit analysis to ensure it makes sense.
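
To illustrate how software can flag and correct what is mechanical, and leave the rest to people, here is a small sketch. It is not modelled on any particular data quality tool: the list of date formats, the field names and the rule that an email identifies a record are assumptions for the example.

```python
from datetime import datetime

# Assumed date formats present in the data (day-first for the slash format).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def normalise_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean(records):
    """Auto-correct formats; route duplicates and unreadable dates to human review."""
    cleaned, needs_review, seen = [], [], set()
    for rec in records:
        iso_date = normalise_date(rec["signup"])
        key = rec["email"].strip().lower()
        if iso_date is None:
            needs_review.append((rec, "unrecognised date"))
        elif key in seen:
            needs_review.append((rec, "possible duplicate"))
        else:
            seen.add(key)
            cleaned.append({**rec, "email": key, "signup": iso_date})
    return cleaned, needs_review


# Made-up input records.
records = [
    {"email": "ada@example.com", "signup": "2024-01-10"},
    {"email": " Ada@Example.com ", "signup": "10/01/2024"},   # same person, other format
    {"email": "alan@example.com", "signup": "15.02.2024"},
    {"email": "grace@example.com", "signup": "Feb 3rd"},      # cannot be parsed
]

cleaned, needs_review = clean(records)
print(cleaned)
print([reason for _, reason in needs_review])
```

The review queue is where the cost-benefit question applies: each flagged record costs human time, so the queue should stay small enough to be worth working through.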

Sustainability

You’ve attained a decent level of data quality. This was quite a ride. What comes next?

By this point, you might be feeling that, despite all the fun, dealing with such quality issues is a process you’d gladly put aside for some time. To ensure this stays a distant memory, a few safeguards can be implemented with the help of data governance best practices:

  • Track your data quality over time

You have a clear idea of what is critical. Monitor quality with the help of KPIs. It should improve over time; if it does not, something is awry and should be dealt with: search for the root causes. What are examples of new poor records, where do they come from, how were they generated? There might have been rule changes upstream, or a reintroduction of human input that leads to incorrect records.

  • Monitor newly added data sources or new fields

New data sources should be inspected to ensure data quality rules exist. These rules should also match the standards set previously.

New fields should limit the room for human error, by offering a restricted set of options (for example through drop-down lists) and by accepting a record only if it is complete.

    • Audit your systems & teams

    If the systems are faulty, or the teams misunderstand certain dimensions or characteristics of the data, the poor data quality will be perpetuated.

    • Create a data quality team

    Because everyone’s too busy, having a team dedicated to data quality ensures this key asset is well maintained. The team would set data governance principles, and focus on the aforementioned points.
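
Two of the safeguards above, restricting what can be entered and tracking a KPI over time, can be combined in a short sketch. Everything specific here is an assumption for illustration: the allowed country codes stand in for a drop-down list, the required fields are invented, and the KPI is simply the share of valid records per month.

```python
# Hypothetical drop-down options and required fields.
ALLOWED_COUNTRIES = {"FR", "DE", "CH"}
REQUIRED_FIELDS = ("name", "email", "country")


def validate(record):
    """Return a list of problems; an empty list means the record is accepted."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if not record.get(field)]
    if record.get("country") and record["country"] not in ALLOWED_COUNTRIES:
        problems.append("country not in allowed list")
    return problems


def quality_kpi(batches):
    """Share of valid records per period."""
    return {
        period: sum(1 for rec in records if not validate(rec)) / len(records)
        for period, records in batches.items()
    }


def periods_with_decline(kpi):
    """Periods in which the KPI dropped compared with the previous period."""
    periods = list(kpi)
    return [cur for prev, cur in zip(periods, periods[1:]) if kpi[cur] < kpi[prev]]


# Made-up monthly batches built from three example records.
good = {"name": "Ada", "email": "ada@example.com", "country": "FR"}
bad_country = {**good, "country": "Atlantis"}
incomplete = {**good, "email": ""}

batches = {
    "2024-01": [good, good, bad_country, incomplete],
    "2024-02": [good, good, good, incomplete],
    "2024-03": [good, bad_country, incomplete, incomplete],
}

kpi = quality_kpi(batches)
declines = periods_with_decline(kpi)
print(kpi)
print(declines)
```

A period that shows up in the decline list is the cue to look at examples of the new poor records and trace where they came from.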

The Case Against Disregarding Poor Data Quality

If all of the above seems too much of a burden on both resources and time, here is a list of consequences of poor data quality:

  • Mistrust

If there is evidence that a chunk of data cannot be trusted, any record from that dataset or tool will be looked at very sceptically. Ultimately this mistrust could spread to other datasets or systems throughout the organization, or cause your customers to doubt everything else.

  • Reputation

Errors happen, but if they are blatant and left uncorrected, they do not show your team or organization in a good light. Your reputation will suffer.

  • Productivity

You, your team and your customers will waste time and resources on poor data. They might have to cross-check with other sources or call other departments for confirmation, which can have a domino effect on many people.

  • Decision-Making

Data can work as eyes. With poor eyesight, it’s difficult to get around. An organization with poor data is navigating blindly, or with a handicap that hinders its decision-making and strategy. To leave data at the door is to trust gut feelings, personal bias or personal agendas. At your own risk.

Conclusion

Data is a key asset, as long as it is of decent quality. Most organizations will deal with data quality issues, but these do not have to handicap the business forever. Frequent monitoring using KPIs, limiting the free-text input fields filled in by humans, and regular testing of system integrations can prevent unpleasant surprises. If the problem exists, it will not go away on its own, especially not if it is swept under the carpet. Prevention is better than cure, so that a minor discomfort does not turn into a massive migraine.

Original article: https://towardsdatascience.com/the-one-component-that-deserves-more-attention-in-data-science-22de7b596d83
