日韩性视频-久久久蜜桃-www中文字幕-在线中文字幕av-亚洲欧美一区二区三区四区-撸久久-香蕉视频一区-久久无码精品丰满人妻-国产高潮av-激情福利社-日韩av网址大全-国产精品久久999-日本五十路在线-性欧美在线-久久99精品波多结衣一区-男女午夜免费视频-黑人极品ⅴideos精品欧美棵-人人妻人人澡人人爽精品欧美一区-日韩一区在线看-欧美a级在线免费观看

歡迎訪問 生活随笔!

生活随笔

當前位置: 首頁 > 编程资源 > 编程问答 >内容正文

编程问答

大数据技术 学习之旅_如何开始您的数据科学之旅?

發布時間:2023/11/29 编程问答 34 豆豆
生活随笔 收集整理的這篇文章主要介紹了 大数据技术 学习之旅_如何开始您的数据科学之旅? 小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

大數據技術 學習之旅

Machine Learning seems to be fascinating to a lot of beginners but they often get lost into the pool of information available across different resources. This is true that we have a lot of different algorithms and steps to learn but starting with a strong base not only gives confidence but also the motivation to learn and explore further. In this story, we will go through the steps on how to set up your environment and start learning with the help of a known dataset i.e. IRIS dataset which is a multi-class classification problem in Machine Learning. We will also go through some helpful python libraries which can speed up the learning process which can help you even if you are a data scientist. If you have done the setup already, you can skip the setup steps. Let’s begin with the first step of your journey.

機器學習似乎使許多初學者著迷,但是他們常常迷失在跨不同資源的可用信息庫中。 的確,我們有很多不同的算法和步驟需要學習,但是從一個強大的基礎開始不僅會給人信心,而且還會激發學習和進一步探索的動力。 在這個故事中,我們將逐步了解如何設置環境并借助已知數據集(即IRIS數據集)開始學習,該數據集是機器學習中的多類分類問題。 我們還將介紹一些有用的python庫 ,這些可以加快學習過程,即使您是數據科學家 ,也可以為您提供幫助。 如果已經完成設置,則可以跳過設置步驟。 讓我們從旅程的第一步開始。

搭建環境 (Setting up the Environment)

We will use the Anaconda distribution to setup the data science environment. Download the latest version of Anaconda from here and open the anaconda prompt and run the following command-

我們將使用Anaconda發行版來設置數據科學環境。 從此處下載最新版本的Anaconda,然后打開anaconda提示符并運行以下命令:

jupyter notebook

The above command will start the Jupyter server and load the notebook directory in your browser.

上面的命令將啟動Jupyter服務器,并在瀏覽器中加載筆記本目錄。

創建一個虛擬環境 (Create a virtual environment)

I hope you are aware of virtual environments, if not you can read about them here. Although Anaconda comes with a base environment which is having most of the libraries already installed, it is recommended to use virtual environments as they help us to manage different environments that can have different packages and if something goes wrong with one environment it will not affect others. Here are the commands you can use to create, activate, and install packages in the virtual environment.

我希望您了解虛擬環境,如果沒有,您可以在這里閱讀有關它們的信息 。 盡管Anaconda帶有一個已安裝了大多數庫的基本環境 ,但建議使用虛擬環境,因為它們可以幫助我們管理可以具有不同程序包的不同環境,如果一個環境出了問題,則不會影響其他環境。 。 這是可用于在虛擬環境中創建,激活和安裝軟件包的命令。

將虛擬環境與筆記本鏈接 (Link virtual environment with the Notebook)

By default, the new environment would not show up in the Jupyter notebook. You need to run following commands to link your environment with Jupyter client

默認情況下,新環境不會顯示在Jupyter筆記本中。 您需要運行以下命令來將環境與Jupyter客戶端鏈接

啟動筆記本和有用的命令 (Starting notebook and useful commands)

Once you have a virtual environment, go to the browser and open a new notebook as shown below. Select the environment you just created.

擁有虛擬環境后,請轉到瀏覽器并打開一個新筆記本,如下所示。 選擇您剛剛創建的環境。

Jupyter notebook provides many handy shortcuts. Below 2 are my favorite-

Jupyter筆記本電腦提供了許多方便的快捷方式。 以下2是我的最愛-

  • Tab- this acts as a autocomplete.

    Tab-這是自動完成。
  • Shift + Tab- this will give you the command details and you do not need to go to the library documentation every time.

    Shift + Tab鍵-這將為您提供命令詳細信息,您無需每次都轉到庫文檔。
  • See, how these commands can be helpful

    看,這些命令有什么幫助

    Source: GIF by Author資料來源:作者提供的GIF

    探索Python庫并應用機器學習 (Exploring Python libraries and applying Machine Learning)

    We need to have different libraries for loading the data sets, visualizations, and modeling. We will go through each and install them in the environment. You can have a look at my notebook, feel free to download and import it in your environment and play around with it-

    我們需要有不同的庫來加載數據集,可視化和建模。 我們將逐一檢查并將它們安裝在環境中。 您可以看一下我的筆記本,可以在您的環境中隨意下載和導入它,并可以隨意使用-

    Jupyter Contrib Nbextensions (Jupyter Contrib Nbextensions)

    We often need to share our notebook with different stakeholders or might need to present them, this library provides us a lot of different extensions. I will not go through the extensions here, but I recommend using this. My favorite ones are-

    我們經常需要與不同的利益相關者共享我們的筆記本,或者可能需要展示他們,這個庫為我們提供了許多不同的擴展。 我不會在這里進行擴展,但是我建議使用它。 我最喜歡的是-

  • Collapsible headings.

    可折疊的標題。
  • Table of Contents.

    目錄。
  • Execution Time.

    執行時間處理時間。
  • You can install it using

    您可以使用安裝

    conda install -c conda-forge jupyter_contrib_nbextensions

    Here is a short demo on how it can help-

    這是一個簡短的演示,介紹了如何幫助您-

    Source: GIF by Author資料來源:作者提供的GIF

    熊貓-Python數據分析庫 (Pandas- Python Data Analysis Library)

    This is the heart of data science with python and provides many different capabilities like

    這是python數據科學的核心,并提供許多不同的功能,例如

    • Data structures to work with the data.

      使用數據的數據結構。
    • Operations you can perform on the data.

      您可以對數據執行的操作。
    • Load and save data in different formats.

      加載和保存不同格式的數據。

    and many more. Many other libraries we use for machine learning with python have pandas as a dependency. Install it using-

    還有很多。 我們用于python機器學習的許多其他庫都將pandas作為依賴項。 使用-安裝它

    conda install -c conda-forge pandas

    The above command will install other libraries like NumPy which pandas uses under the hood.

    上面的命令將安裝其他庫,例如Pandas在后臺使用的NumPy。

    斯克萊恩(Scikit-Learn) (Sklearn (Scikit-Learn))

    We will use this library to download test datasets and apply different machine learning algorithms. Install using the following command.

    我們將使用該庫下載測試數據集并應用不同的機器學習算法。 使用以下命令進行安裝。

    conda install -c conda-forge scikit-learn

    In machine learning classification problems, the problem can be understood as for X features (input variables) predict y (target value). Sklearn provides few test datasets which we can use to play with, we will take the IRIS dataset for this exercise but if you would like to play with others then you can refer to this.

    在機器學習分類問題中,該問題可以理解為X特征(輸入變量)預測y(目標值)。 Sklearn提供了一些可用于測試的測試數據集,我們將使用IRIS數據集進行此練習,但是如果您想與其他人一起玩,則可以參考此內容 。

    Scikit-learn 0.23.1 added a feature by which we can return the test dataset directly into the X and y dataframe. Make sure you are running version 0.23.1.

    Scikit-learn 0.23.1添加了一項功能,通過該功能,我們可以將測試數據集直接返回到X和y數據框中。 確保您正在運行0.23.1版。

    from sklearn.datasets import load_irisSource: Created by Author來源:作者創建

    We will now go through the other libraries and we will use Sklearn for modeling later

    現在,我們將介紹其他庫,稍后將使用Sklearn進行建模

    熊貓分析 (Pandas Profiling)

    I am sure many of you would be aware of this library but if you are not please do give it a try. It provides a rich profiling report for the data which gives a lot of information from missing values to correlations. You need to install it using pip as conda-package downloads old version of it.

    我敢肯定,你們中的許多人都會知道這個庫,但是如果您不知道,請嘗試一下。 它為數據提供了豐富的概要分析報告,該報告提供了從缺失值到相關性的大量信息。 您需要使用pip安裝它,因為conda-package會下載舊版本。

    pip install --user pandas-profilingSource: GIF by Author資料來源:作者提供的GIF

    This report provides many details, out of which few are-

    該報告提供了許多詳細信息,其中很少是-

  • Overview of different variables in the dataset.

    數據集中不同變量的概述。
  • Correlation between variables.

    變量之間的相關性。
  • Interactions between variables.

    變量之間的相互作用。
  • Details about each variable.

    有關每個變量的詳細信息。
  • The following commands can be used to generate and save the profile report-

    以下命令可用于生成和保存配置文件報告-

    情節快遞 (Plotly Express)

    Although pandas-profiling provides a lot of useful information, we still need to visualize different information like for example we need to find how the target variable is distributed among multiple input variables. There exist many libraries for visualization, Matlplotlib and Seaborn are the famous ones you would have heard about. The main thing where Plotly stands out is the interactive plots i.e. you can interact with the generated plots. Install it using the following command

    盡管pandas分析提供了許多有用的信息,但是我們仍然需要可視化不同的信息,例如我們需要找到目標變量如何在多個輸入變量之間分配。 有許多可視化庫,Matlplotlib和Seaborn是您會聽說的著名庫。 Plotly脫穎而出的主要功能是交互式繪圖,即您可以與生成的繪圖進行交互。 使用以下命令安裝

    conda install -c conda-forge plotly

    Below we plotted a scatter plot between sepal length with petal length and used ‘color’ to show how the target variable is related.

    下面我們在萼片長度和花瓣長度之間繪制了散點圖,并使用“顏色”顯示了目標變量之間的關系。

    You can see below how we can filter out different targets.

    您可以在下面看到我們如何過濾出不同的目標。

    Source: GIF by Author來源:作者提供的GIF

    This library provides a lot of additional functionality, maybe we can cover that in a different story.

    該庫提供了許多其他功能,也許我們可以在另一個故事中進行介紹。

    培訓和測試數據集 (Training and Test dataset)

    The idea of generating models is to predict values that are not known. If we learn the model on the entire dataset then we will not be able to evaluate how it performs on the unseen data. To achieve this, we split the dataset into training and test dataset. A training dataset is used to train the model and the test set is used to evaluate the model. Sklearn provides a function ‘train_test_split’ which split the dataset into train and test datasets. The following code can be used to split the datasets.

    生成模型的想法是預測未知的值。 如果我們在整個數據集上學習該模型,那么我們將無法評估該模型在看不見的數據上的表現。 為此,我們將數據集分為訓練和測試數據集。 訓練數據集用于訓練模型,而測試集用于評估模型。 Sklearn提供了一個函數“ train_test_split”,可將數據集分為訓練和測試數據集。 以下代碼可用于拆分數據集。

    from sklearn.model_selection import train_test_splitSource: Image by Author來源:作者提供的圖片

    調整超參數 (Tuning Hyperparameters)

    One of the important tasks in machine learning is to tune hyperparameters, these parameters are the different attributes of the algorithm which control the learning process. Different values are suitable for different learning problems and it is important to find out the best parameters. Sklearn provides mainly two ways ‘GridSearchCV’ and ‘RandomizedSearchCV’ to find the best parameters. For large training sets, we might need to use RandomizedSearchCV as it will take a lot of time to learn all parameters. In the IRIS dataset, we have only 150 rows and hence we used ‘GridSearchCV’.

    機器學習的重要任務之一是調整超參數,這些參數是控制學習過程的算法的不同屬性。 不同的值適用于不同的學習問題,因此找出最佳參數很重要。 Sklearn主要提供“ GridSearchCV”和“ RandomizedSearchCV”兩種查找最佳參數的方法。 對于大型訓練集,我們可能需要使用RandomizedSearchCV,因為它將花費大量時間來學習所有參數。 在IRIS數據集中,我們只有150行,因此我們使用了“ GridSearchCV”。

    For this story, we will train the LogisticsRegression model which is well-suited for classification problems and have different hyperparameters like ‘solver’, ‘C’, ‘penalty’, and ‘l1-ratio’. Not every solver supports all parameters and hence we create different dictionaries for all different solvers.

    對于這個故事,我們將訓練物流回歸模型,該模型非常適合分類問題,并且具有不同的超參數,例如“求解器”,“ C”,“罰分”和“ l1-比率”。 并非每個求解器都支持所有參數,因此我們為所有不同的求解器創建不同的字典。

    The above code would search for different combinations of parameters and find the best one which best generalizes the problem.

    上面的代碼將搜索參數的不同組合,并找到最能概括該問題的最佳組合。

    評估模型 (Evaluating the model)

    As we mentioned, we need to evaluate the model on the test dataset, many different metrics are available. The common one is accuracy for classification problems. Here we will show the accuracy, classification_report, and confusion matrix which Sklearn provides.

    如前所述,我們需要在測試數據集上評估模型,可以使用許多不同的指標。 常見的一種是分類問題的準確性。 在這里,我們將顯示Sklearn提供的準確性,classification_report和混淆矩陣。

    The IRIS dataset is classified as an easy dataset which means data is already suitable for machine learning purposes and hence we were able to get a perfect score i.e. accuracy score of 1.0 with our model. This means our model predicted all the samples in the test dataset correctly. It will vary with the different problems you are trying to solve.

    IRIS數據集被歸類為簡單數據集,這意味著數據已經適合于機器學習目的,因此我們能夠獲得理想的評分,即模型的準確性評分為1.0。 這意味著我們的模型可以正確預測測試數據集中的所有樣本。 您要解決的不同問題會有所不同。

    結論 (Conclusion)

    The idea of this story was to give you a head start on machine learning and a glimpse of different libraries that you can utilize to speed up the process. I provided a simple overview of many things to keep this story short and precise. There is still a lot to explore such as different types of machine learning problems, different models, different metrics, and where to use them. You can try different things, in the same way, I did here so that you can see how it works. I will try to add more stories to give you deep dive into specific areas to help accelerate your learning.

    這個故事的目的是讓您搶先學習機器學習,并瞥見可以用來加速過程的各種庫。 為了使這個故事簡短而準確,我提供了許多內容的簡單概述。 還有很多需要探索的東西,例如不同類型的機器學習問題,不同的模型,不同的度量標準以及在何處使用它們。 您可以用相同的方法嘗試不同的事情,我在這里做了,所以您可以了解它的工作原理。 我將嘗試添加更多故事,以使您深入了解特定領域,以幫助您加速學習。

    Update 16th June 2020: Recently I found a way to combine Sklearn Pipeline with GridSearchCV to search for best preprocessing steps. If interested, check out this- Are you using Pipeline in Scikit-Learn?

    2020年6月16日更新 :最近,我找到了一種將Sklearn Pipeline與GridSearchCV結合以搜索最佳預處理步驟的方法。 如果有興趣,請查看此- 您是否在Scikit-Learn中使用管道?

    翻譯自: https://medium.com/swlh/start-your-data-science-journey-today-37366ee463f

    大數據技術 學習之旅

    總結

    以上是生活随笔為你收集整理的大数据技术 学习之旅_如何开始您的数据科学之旅?的全部內容,希望文章能夠幫你解決所遇到的問題。

    如果覺得生活随笔網站內容還不錯,歡迎將生活随笔推薦給好友。

    主站蜘蛛池模板: 国产精品乱子伦 | 神马一区二区三区 | 青青草精品| 午夜九九 | 日韩亚洲欧美在线观看 | 亚洲综合在线视频 | 波多野结衣1区2区3区 | 97超碰人人澡人人爱学生 | 日韩一卡二卡 | 亚洲大成色 | 免费在线观看黄色网址 | v片在线观看 | 欧美日韩亚洲国产另类 | 日韩一区二区三区中文字幕 | 五月婷婷激情四射 | 久久久精品一区二区涩爱 | 国产女主播在线观看 | 国产高清视频免费在线观看 | 免费看黄在线观看 | 超碰超碰 | 亚洲AV无码片久久精品 | 秋霞视频在线观看 | 免费在线观看中文字幕 | 欧美老熟妇又粗又大 | 亚洲一区精品视频在线观看 | 国产精品高潮av | 娇妻第一次尝试交换的后果 | 91久久精品日日躁夜夜躁国产 | 国产在线麻豆精品观看 | 日韩欧美专区 | 欧美乱子伦 | 久久片| 在线h片 | 九九热在线观看 | 久久久免费毛片 | 亚洲国产无线乱码在线观看 | 永久免费av网站 | 97se亚洲| 妺妺窝人体色www在线下载 | 免费网站av | 非洲黑妞xxxxhd精品 | 国产精品色在线 | 成人av免费看 | 日本视频在线 | 九七伦理电影 | 一本色道综合久久欧美日韩精品 | 亚洲av无码国产精品久久 | 在线视频观看你懂得 | 一区二区伦理片 | 黄色小视频网 | 3p视频在线| 免费无遮挡无码永久在线观看视频 | 国产刺激对白 | 黄色资源网| 国产大奶在线 | 成人亚洲视频 | 就爱av | 日本熟妇一区二区 | 久久久久亚洲AV成人无在 | 日日干干 | 69xxx国产| 中文字幕在线播放 | 色爱亚洲 | 日本免费三级网站 | 国产aaaaa毛片 | 少妇被狂c下部羞羞漫画 | 成人免费在线看片 | av在线网站观看 | 久久久影院| 国产三级精品视频 | 国产精品女人久久久 | 青青草视频免费 | 少妇喷白浆 | 啪免费视频 | 打白嫩屁屁网站视频短裙 | 天天躁日日躁狠狠躁av麻豆 | 成人h片 | 狠狠操人人干 | 狠狠躁18三区二区一区传媒剧情 | 亚洲人妻电影一区 | 91操操操| 日干夜操| 成人tiktok黄短视频 | 日韩mv欧美mv国产网站 | 国产乱叫456在线 | 国产欧美一区二区 | 成人精品电影 | 男女激情视频网站 | 亚洲精品无码久久久 | 亚洲女人天堂成人av在线 | 欧美黄色片免费看 | 国产欧美日韩精品在线观看 | 欧美毛片在线 | 精品国产第一页 | 最新中文字幕在线观看 | 午夜精品一二三区 | 亚洲视频六区 | 日本一区二区三区四区视频 | 黄色三级免费网站 |