
Data: The Starting Point of a Data Science Journey

Published: 2023/12/15


What is Data Science?

The interesting thing about data science is that it is a young field, and its definition can differ from textbooks to newspapers to whitepapers. The general definition is that data science combines multiple tools, algorithms, and machine learning principles in order to discover hidden patterns in data. How is this different from statistics, which has existed and been used for years? The answer lies in the difference between explanation and prediction: broadly, statistics has emphasized explaining what the data shows, while data science also uses the data to predict what has not been observed yet.

The data science process

The data science process is composed of seven main steps, and each of them matters for the accuracy of the model. Let's see what each step contains.

Business understanding

If we want to create a data science project, we first need to understand the problem that we are trying to solve. In this step we have to answer the following questions:

- How many? (predicting a number)

- Which category? (classification)

- Which group? (clustering)

- Is this strange? (anomaly detection)

- Which option should be considered? (recommendation)

Based on the answers to these questions, we can decide which variable or variables should be predicted.

Data Mining

The next step is finding the right data. Data mining is the process of finding and collecting data from different sources. We need to answer the following questions:

- Which data is needed for the project?

- Where can I find that data?

- How can I obtain the data?

- What is the most effective way of storing and accessing the data?

If all the data is in one place, this process is easy. Usually, that is not the case.

Data Cleaning

This is the most complicated step, and it is often said to take 50 to 80 percent of a project's time. After the data is collected, we must clean it. The data might contain missing values, or the values within one column might be inconsistent. That is why we need to clean and organize our data.
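The two problems mentioned above, missing values and inconsistent values within one column, can be sketched with the Python standard library. The records, the column names, and the choice to fill gaps with the median are all assumptions made for illustration; a real project would pick a strategy that suits its data.

```python
from statistics import median

# Hypothetical raw records: one age is missing and "city" is spelled inconsistently.
raw_rows = [
    {"age": 34, "city": "Skopje"},
    {"age": None, "city": " skopje"},
    {"age": 29, "city": "SKOPJE "},
    {"age": 41, "city": "Ohrid"},
]


def clean(rows):
    """Fill missing ages with the median age and normalize city spelling."""
    known_ages = [r["age"] for r in rows if r["age"] is not None]
    fill_value = median(known_ages)  # assumption: the median is a sensible default
    cleaned = []
    for r in rows:
        age = r["age"] if r["age"] is not None else fill_value
        city = r["city"].strip().title()  # trim whitespace, unify capitalization
        cleaned.append({"age": age, "city": city})
    return cleaned


cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

Filling with the median rather than dropping the row keeps the observation available for later steps, at the cost of inventing a value; which trade-off is right depends on the project.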

Data Exploration

After the data is cleaned, we try to find hidden patterns in it. This step includes extracting a subset, then analyzing and visualizing that subset. Afterwards, we have a much fuller picture of what lies behind the data points.
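As a rough sketch of those three actions (extract a subset, analyze it, visualize it), the snippet below uses made-up coffee shop visits and a plain-text bar chart. The field names and the filter rule are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cleaned records of coffee shop visits.
visits = [
    {"city": "Skopje", "spend": 4.0},
    {"city": "Skopje", "spend": 6.0},
    {"city": "Ohrid", "spend": 3.0},
    {"city": "Skopje", "spend": 0.0},
    {"city": "Ohrid", "spend": 5.0},
]

# Extract a subset: only visits where something was bought.
subset = [v for v in visits if v["spend"] > 0]

# Analyze: average spend per city.
by_city = defaultdict(list)
for v in subset:
    by_city[v["city"]].append(v["spend"])
summary = {city: mean(values) for city, values in by_city.items()}

# Visualize: one text bar per city, one '#' per unit of average spend.
for city, avg in sorted(summary.items()):
    print(f"{city:<8}{'#' * round(avg)} {avg:.2f}")
```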

Feature Engineering

In machine learning, a feature is an attribute of an observed phenomenon. For example, if we are observing a student's results, a possible feature is the amount of sleep that the student gets. This step is divided into two sub-steps. The first is feature selection: we remove some features to reduce the dimensionality, which would otherwise add complexity to the model. The features we remove are usually the ones that bring more noise than useful information. The second is feature construction: we build new features based on the ones that we already have.
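Both sub-steps can be illustrated with a small sketch. The student records, the column names, the zero-variance rule used for selection, and the derived ratio are all assumptions chosen for the example, not a prescribed method.

```python
from statistics import pvariance

# Hypothetical student records; "school_id" is the same for everyone.
students = [
    {"sleep_hours": 7.0, "study_hours": 2.0, "school_id": 1, "score": 78},
    {"sleep_hours": 5.5, "study_hours": 1.0, "school_id": 1, "score": 61},
    {"sleep_hours": 8.0, "study_hours": 3.0, "school_id": 1, "score": 90},
]


def select_features(rows, candidates, min_variance=1e-9):
    """Feature selection: keep only columns whose values actually vary."""
    return [c for c in candidates if pvariance([r[c] for r in rows]) > min_variance]


def add_sleep_per_study_hour(rows):
    """Feature construction: derive a new column from two existing ones."""
    return [
        {**r, "sleep_per_study_hour": r["sleep_hours"] / r["study_hours"]}
        for r in rows
    ]


kept = select_features(students, ["sleep_hours", "study_hours", "school_id"])
enriched = add_sleep_per_study_hour(students)
print(kept)
print(enriched[0]["sleep_per_study_hour"])
```

A constant column such as `school_id` carries no information for telling students apart, so dropping it reduces dimensionality without losing signal.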

Predictive modeling

This is the step where we finally build the model. We decide which model to use based on the answers that we obtained in the first step. This is not an easy decision, and there is not always a single right answer. The model and its accuracy depend on the data: how much of it there is, what type it is, and what its quality is. After the model is trained, we must evaluate its accuracy and determine whether the model is successful.
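To make "train, then evaluate the accuracy" concrete, here is a minimal sketch that answers a "Which category?" question. The 1-nearest-neighbour rule is chosen here only because it fits in a few lines; the article does not prescribe a model, and the data points are invented.

```python
# Hypothetical data: (hours of sleep, hours of study) -> passed the exam?
train = [
    ((8.0, 3.0), True),
    ((7.5, 2.5), True),
    ((5.0, 0.5), False),
    ((4.5, 1.0), False),
]
held_out = [
    ((7.0, 3.0), True),
    ((5.5, 0.5), False),
    ((6.0, 2.5), True),
]


def predict(features):
    """1-nearest-neighbour: copy the label of the closest training example."""
    def squared_distance(example):
        return sum((a - b) ** 2 for a, b in zip(example[0], features))
    return min(train, key=squared_distance)[1]


def accuracy(examples):
    """Share of examples whose predicted label matches the true label."""
    correct = sum(predict(x) == y for x, y in examples)
    return correct / len(examples)


held_out_accuracy = accuracy(held_out)
print(held_out_accuracy)
```

The accuracy is measured on examples the model did not see during training, since accuracy on the training data itself would overstate how well the model generalizes.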

Data Visualization

After we have obtained information from the model, we need to visualize it in different ways so that everyone involved in the project can understand it.

Business understanding

Once everything is done, we return to the first step and check whether the model meets the initial requirements. If we came across new insights during the first iteration of the life cycle (and I am sure that we will), we can feed that knowledge into the next iteration to generate even more powerful insights and better results for the project.

What is data?

Almost every step needs data: most of the steps in the previous part are data related. So we can assume that data plays a crucial role in a data science project. What is data? How is it defined? This might seem like an unimportant definition to look at, but it matters. Whenever we use the word "data," we refer to a collection of information in either an organized or unorganized format.

Basic types of data

Based on the definition in the previous part, there are two types of format:

o Structured (organized) data: Data that is sorted into a row/column structure, where every row represents a single observation and the columns represent the characteristics of that observation.

o Unstructured (unorganized) data: Data that is in a free form, usually text or raw audio/signals, that must be parsed further to become organized.
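The step from unstructured to structured data can be sketched as parsing free text into rows and columns. The note format and the regular expression below are hypothetical; real free-form text is rarely this regular.

```python
import re

# Hypothetical free-form notes (unstructured data).
notes = [
    "Order 17: 3 coffees, paid 7.50",
    "Order 18: 1 coffee, paid 2.50",
]

# Assumed pattern for this made-up note format.
PATTERN = re.compile(r"Order (\d+): (\d+) coffees?, paid (\d+\.\d+)")


def to_row(text):
    """Parse one note into a row; return None when the text does not match."""
    match = PATTERN.fullmatch(text)
    if match is None:
        return None
    order_id, quantity, paid = match.groups()
    return {"order_id": int(order_id), "quantity": int(quantity), "paid": float(paid)}


# Structured data: every row is one observation, every key is one column.
rows = [to_row(note) for note in notes]
for row in rows:
    print(row)
```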

When we talk about data, the first thing we need to answer is whether the data is quantitative or qualitative. When we talk about quantitative data, we usually think of a structured dataset. These two data types can be defined as follows:

o Quantitative data: Data that can be described using numbers, and on which basic mathematical operations, including addition, are possible.

o Qualitative data: Data that cannot be described using numbers and basic mathematics. It is generally described using natural categories and language.

Quantitative data

Quantitative data can be:

o Discrete data: Data that is counted. It can only take on certain values. Examples of discrete quantitative data include a dice roll, because it can only take on six values, and the number of customers in a coffee shop, because you cannot have a fraction of a person.

o Continuous data: Data that is measured. It can take any value within a range, so it has infinitely many possible values, such as a height or a weight.

The four levels of data

It is generally understood that a specific characteristic (feature/column) of structured data can be classified into one of four levels of data. These levels are the following:

o The nominal level

o The ordinal level

o The interval level

o The ratio level

Let's go deeper and explain each level.

The nominal level

This level contains data that is described by name or category, for example gender, name, or species. The data cannot be described using numbers, so it is qualitative, and we cannot perform mathematical operations such as addition or division on it. The operations that we can perform at this level are equality and set membership. A measure of center is a number that shows what the data tends toward; it is sometimes called the balance point of the data. The usual measures of center are the mode, the median, and the mean. At the nominal level the median and the mean do not make sense, because they require ordering or arithmetic. Only the mode, the most frequent category, can be used, since finding it requires nothing more than counting. In conclusion, this level is composed of categorical data, and we should treat it with care, since it may contain very useful insights.
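The operations available at the nominal level can be sketched in a few lines. The species labels are made up for illustration; the sketch uses equality, set membership, and a frequency count, which needs nothing beyond testing whether two labels are equal.

```python
from collections import Counter

# Hypothetical nominal data: labels with no order and no arithmetic.
species = ["cat", "dog", "cat", "bird", "cat", "dog"]

# Equality: are two observations in the same category?
same_category = species[0] == species[2]

# Set membership: does a category occur in the data at all?
has_fish = "fish" in set(species)

# Counting categories gives the most frequent one (the mode).
counts = Counter(species)
most_common_species, frequency = counts.most_common(1)[0]

print(same_category, has_fish, most_common_species, frequency)
```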

The ordinal level

The nominal level is not very flexible when it comes to mathematical operations. Data at the ordinal level adds a rank order, but we still cannot apply operations such as addition or subtraction and get a meaningful result. For example, grades from 1 to 10 are ordinal data: adding them does not give us any useful information. Another example is a survey result. At this level we have more freedom than at the nominal level: the operations from the nominal level (equality and set membership) are inherited, and the additional operations that are allowed are ordering and comparison. At the ordinal level, the median is usually an appropriate way of defining the center of the data, and we can use the mode as well. The mean, however, is not meaningful, because addition and division are not allowed at this level.
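Ordering and comparison are enough to find a median, as this sketch with invented survey responses shows. The five-point scale and its labels are assumptions; for an even number of responses the sketch takes the lower of the two middle values, because averaging them would require arithmetic that this level does not support.

```python
# Hypothetical survey scale, listed from lowest to highest rank.
SCALE = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
RANK = {label: position for position, label in enumerate(SCALE)}

responses = ["agree", "neutral", "strongly agree", "agree", "disagree"]

# Comparison: one response ranks higher than another.
agree_outranks_neutral = RANK["agree"] > RANK["neutral"]

# Ordering: sort by rank, then take the middle element as the median.
ordered = sorted(responses, key=RANK.__getitem__)
median_response = ordered[(len(ordered) - 1) // 2]

print(ordered)
print(median_response)
```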

The interval level

Now we reach a level where the data can be summarized by a mean and we can use more complicated mathematical formulas. Data at the interval level supports subtraction between data points. For example, temperature in degrees Celsius or Fahrenheit belongs to the interval level. The operations from the lower levels (ordering, comparison, and so on) are inherited, and the additional operations that are allowed are addition and subtraction. As a measure of center we can use the median, the mode, or the mean, and usually the most accurate description of the center is the arithmetic mean. Let's look at an example. We want to find the measure of center for data that contains the temperatures of a fridge in which vaccines are stored. The temperature must stay under 29 degrees. Suppose the mean and the median both turn out to be close to 31, which is not acceptable. This is the point where we need another measure: a measure of variation, the standard deviation. It tells us how spread out the data is. To find it, we calculate the mean, subtract the mean from each point, square each difference, average the squared differences, and take the square root. Here is the formula:

$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

where $x_i$ are the data points, $\bar{x}$ is their mean, and $n$ is the number of points.

If we apply this formula to the temperature example, we get the standard deviation of our dataset, and from it we can see how low the temperature typically goes (the mean minus the standard deviation).
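The calculation described for the fridge example can be written out step by step. The five readings are invented so that the mean lands on 31; `statistics.pstdev` from the standard library is printed alongside as a cross-check of the hand-written version.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical fridge readings in degrees.
temperatures = [31, 29, 34, 30, 31]


def population_std(values):
    """Standard deviation: root of the average squared distance from the mean."""
    center = mean(values)
    squared_diffs = [(v - center) ** 2 for v in values]
    return sqrt(sum(squared_diffs) / len(values))


center = mean(temperatures)
spread = population_std(temperatures)
typical_low = center - spread  # the "mean minus standard deviation" from the text

print(center, spread, typical_low)
print(pstdev(temperatures))  # library result for comparison
```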

The ratio level

The last level is the ratio level. There are not many differences between the ratio and the interval level, and it is sometimes hard to tell which one applies. At the interval level we do not have a natural starting point, a natural zero, but at the ratio level we do. The mathematical operations from the lower levels are inherited, and the additional ones are multiplication and division. For example, money in a bank account belongs to this level, because a bank account has a natural zero. As a measure of center we can also use the geometric mean: the n-th root of the product of all n values. Data at this level should be non-negative, which is why this level is not always preferred.
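A short sketch of what the natural zero makes possible: ratios between values and the geometric mean. The account balances are hypothetical, and `statistics.geometric_mean` and `math.prod` require Python 3.8 or later.

```python
from math import prod
from statistics import geometric_mean

# Hypothetical account balances; zero means "no money", a natural zero.
balances = [100.0, 200.0, 400.0]

# Division is meaningful: one balance is some number of times another.
times_larger = balances[2] / balances[0]

# Geometric mean: the n-th root of the product of all n values.
manual_geometric_mean = prod(balances) ** (1 / len(balances))

print(times_larger)
print(manual_geometric_mean)
print(geometric_mean(balances))  # library result for comparison
```

The geometric mean only works for positive values, which is one practical reason the non-negativity requirement matters at this level.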

Conclusion

Data science can add value to any business; the important thing is to use the data well. Data science can also help us make better decisions based on measurable evidence, so data should always be available to us when we make decisions. Using data science methodologies, we can research historical data, make comparisons with the competition, analyze the market, and, most importantly, make recommendations on how a product or service would perform best. These analyses, which are part of data science, provide deep knowledge and understanding of the market as well as its feedback on the product or service. It is estimated that about 2.5 billion gigabytes of data are generated daily. With this growth in the amount of data, finding what is important for the target group can be difficult. Everything that a company collects from customers, whether social media likes, website visits, or email surveys, contains data that can be analyzed to understand customers more effectively. This means that services and products can be customized for certain groups. For example, finding correlations between age and income can help a company create new promotions or offers for groups that it may not have reached before.

If you are interested in this topic, do not hesitate to contact me.

LinkedIn profile: https://www.linkedin.com/in/ceftimoska/

Translated from: https://towardsdatascience.com/data-the-starting-point-of-a-data-science-journey-f7880f9f0eb7
