Comprehensive Guide to the Data Warehouse
As a data scientist, it’s valuable to have some idea of fundamental data warehouse concepts. Most of the work we do involves adding enterprise value on top of datasets that need to be clean and readily comprehensible. For a dataset to reach that stage of its lifecycle, it has already passed through many components of data architecture and, hopefully, many data quality filters. This is how we avoid the unfortunate situation wherein the data scientist ends up spending 80% of their time on data wrangling.
Let’s take a moment to deepen our appreciation of the data architecture process by learning about various considerations relevant to setting up a data warehouse.
The data warehouse is a specific infrastructure element that provides down-the-line users, including data analysts and data scientists, access to data that has been shaped to conform to business rules and is stored in an easy-to-query format.
The data warehouse typically connects information from multiple “source-of-truth” transactional databases, which may exist within individual business units. In contrast to information stored in a transactional database, the contents of a data warehouse are reformatted for speed and ease of querying.
The data must conform to specific business rules that validate quality. Then it is stored in a denormalized structure — that means storing together pieces of information that will likely be queried together. This serves to increase performance by decreasing the complexity of queries required to get data out of the warehouse (i.e., by reducing the number of data joins).
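To make the point about joins concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables and columns (customers, orders, orders_denormalized) are hypothetical and not from the original article; the sketch only shows that the same question needs a join in the normalized layout and none in the denormalized one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized layout, as a transactional system might store it (hypothetical tables)
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha', 'Kenya'), (2, 'Ben', 'Tanzania');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 100.0), (12, 2, 400.0);

    -- Denormalized layout: customer attributes are copied onto each order row
    CREATE TABLE orders_denormalized AS
    SELECT o.order_id AS order_id, o.amount AS amount,
           c.customer_name AS customer_name, c.country AS country
    FROM orders AS o JOIN customers AS c ON c.customer_id = o.customer_id;
""")

# Revenue by country against the normalized tables requires a join
normalized_result = conn.execute("""
    SELECT c.country, SUM(o.amount)
    FROM orders AS o JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.country ORDER BY c.country
""").fetchall()

# The same question against the denormalized table needs no join
denormalized_result = conn.execute("""
    SELECT country, SUM(amount)
    FROM orders_denormalized
    GROUP BY country ORDER BY country
""").fetchall()

print(denormalized_result)
```

The trade-off is that customer attributes are now repeated on every order row, which costs storage and makes updates to those attributes more expensive.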
In this guide:
Architecting the Data Warehouse
Enhancing Performance and Adjusting for Size
Related Data Storage Options
Working with Big Data
Extract, Transform, Load (ETL)
Getting Data Out of the Warehouse
Data Archiving
Summary
Architecting the Data Warehouse
In the process of developing the dimension model for the data warehouse, the design will typically pass through three stages: (1) business model, which generalizes the data based on business requirements, (2) logical model, which sets the column types, and (3) physical model, which represents the actual design blueprint of the relational data warehouse.
Because the data warehouse will contain information from across all aspects of the business, stakeholders must agree in advance on the grain (i.e., the level of granularity) of the data that will be stored.
Remember to validate the model across the various stakeholder groups before implementation.
[Figure: a sample star schema for a hypothetical safari tours business]
The underlying structure in the data warehouse is commonly referred to as the star schema. It classifies information as either a dimension or a fact (i.e., a measure). The fact table stores observations or events (e.g., sales, orders, stock balances). The dimension tables contain descriptive information about those facts (e.g., dates, locations).
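As an illustration, a star schema for the safari tours example could be sketched as follows with Python's built-in sqlite3 module. The table names, columns, and sample rows are assumptions made for this sketch; the original figure is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive context (hypothetical columns)
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, tour_date TEXT, season TEXT);
    CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, park_name TEXT, country TEXT);

    -- Fact table: one row per tour booking (the grain), holding the measures
    CREATE TABLE fact_tour_booking (
        booking_id INTEGER PRIMARY KEY,
        date_key INTEGER REFERENCES dim_date (date_key),
        location_key INTEGER REFERENCES dim_location (location_key),
        guests INTEGER,
        revenue REAL
    );

    INSERT INTO dim_date VALUES (20240105, '2024-01-05', 'high'), (20240412, '2024-04-12', 'low');
    INSERT INTO dim_location VALUES (1, 'Serengeti', 'Tanzania'), (2, 'Maasai Mara', 'Kenya');
    INSERT INTO fact_tour_booking VALUES
        (1, 20240105, 1, 4, 3200.0),
        (2, 20240105, 2, 2, 1500.0),
        (3, 20240412, 1, 6, 4100.0);
""")

# A typical query joins the fact table to one dimension and aggregates the measures
revenue_by_park = conn.execute("""
    SELECT l.park_name, SUM(f.guests) AS guests, SUM(f.revenue) AS revenue
    FROM fact_tour_booking AS f
    JOIN dim_location AS l ON l.location_key = f.location_key
    GROUP BY l.park_name
    ORDER BY l.park_name
""").fetchall()

print(revenue_by_park)
```

Each dimension sits one join away from the fact table, which is what gives the schema its star shape.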
There are three types of fact tables: (1) transactional, with one record per event at the standardized grain; (2) periodic, with records that summarize a given time frame; and (3) cumulative, with records that track a given business process from start to finish.
In addition to the star schema, there’s also the option to arrange data into the snowflake schema. The difference here is that each dimension is normalized.
Normalization is a database design technique for creating records that contain an atomic level of information.
However, the snowflake schema adds unnecessary complexity to the dimension model — usually the star schema will suffice.
Enhancing Performance and Adjusting for Size
In addition to understanding how to structure the data, the person designing the data warehouse should also be familiar with how to improve performance.
One performance-enhancing technique is to create a clustered index on the data in the order it is typically queried. For example, we might organize the fact table by TourDate descending, so that the most recent tour dates come first in the table. Setting up a clustered index reorders the way the records are physically stored, which speeds up retrieval. A table can have at most one clustered index, but it can also have multiple non-clustered indexes. These do not change how the table is physically stored; each one is a separate structure that holds a sorted copy of the indexed columns along with pointers back to the table rows.
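The following is a rough sketch of the two kinds of index, using Python's built-in sqlite3 module. SQLite has no CREATE CLUSTERED INDEX statement, so the sketch uses a WITHOUT ROWID table (whose rows are stored in primary-key order) as a stand-in for a clustered index; that substitution, the fact_tour table, and its columns are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- WITHOUT ROWID stores rows in a B-tree keyed by the primary key,
    -- which plays the role of the clustered index in this sketch.
    CREATE TABLE fact_tour (
        tour_date TEXT NOT NULL,
        tour_id INTEGER NOT NULL,
        guide_id INTEGER NOT NULL,
        revenue REAL,
        PRIMARY KEY (tour_date, tour_id)
    ) WITHOUT ROWID;

    -- A secondary (non-clustered) index is a separate structure on guide_id
    CREATE INDEX idx_fact_tour_guide ON fact_tour (guide_id);

    INSERT INTO fact_tour VALUES
        ('2024-03-01', 1, 7, 900.0),
        ('2024-05-20', 2, 8, 1200.0),
        ('2024-04-11', 3, 7, 750.0);
""")

# Reading by tour date follows the physical key order
latest_first = conn.execute(
    "SELECT tour_date FROM fact_tour ORDER BY tour_date DESC"
).fetchall()

# Ask SQLite how it would answer a lookup by guide; the last column is the plan detail
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT revenue FROM fact_tour WHERE guide_id = ?", (7,)
).fetchall()
plan_text = " ".join(row[-1] for row in plan)

print(latest_first)
print(plan_text)
```

The exact wording of the plan text varies between SQLite versions, but it should name idx_fact_tour_guide, showing that the lookup by guide goes through the secondary index rather than scanning the whole table.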
Another performance enhancement involves splitting up very large tables into multiple smaller parts. This is called partitioning. By splitting a large table into smaller, individual tables, queries that need access to only a fraction of the data can run faster. Partitioning can be either vertical (splitting up columns) or horizontal (splitting up rows). Here’s a link where you can download an .rtf file containing partitioning script for SQL along with other database architecture resources like a project launch and management checklist.
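Horizontal partitioning can be sketched in a few lines of Python with the built-in sqlite3 module. This is not the partitioning script mentioned above; the per-year naming scheme, the partition_name helper, and the sample rows are all hypothetical, and production databases usually provide their own partitioning features.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def partition_name(tour_date: str) -> str:
    """Route a row to a per-year table based on its ISO date (hypothetical scheme)."""
    return f"fact_tour_{tour_date[:4]}"


rows = [
    ("2023-11-02", 640.0),
    ("2024-01-15", 900.0),
    ("2024-06-30", 1100.0),
]

for tour_date, revenue in rows:
    # The table name is built from trusted sample data here; never build
    # identifiers from untrusted input like this.
    table = partition_name(tour_date)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (tour_date TEXT, revenue REAL)")
    conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (tour_date, revenue))

# A query about 2024 touches only the 2024 partition
revenue_2024 = conn.execute("SELECT SUM(revenue) FROM fact_tour_2024").fetchone()[0]

partitions = sorted(
    name
    for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(partitions, revenue_2024)
```

Vertical partitioning would follow the same idea, except that the split is by columns: rarely used or wide columns move to a second table that shares the same key.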
Taking total database size into account is another crucial component of tuning performance. Estimating the size of the resulting database when designing a data warehouse will help align performance with application requirements according to the service level agreement (SLA). It will also provide insight into the budgeted demand for physical disk space or the cost of cloud storage.
To conduct this calculation, simply aggregate the size of each table, which depends largely on the indexes. If database size is significantly larger than expected, you may need to normalize aspects of the database. Conversely, if your database ends up smaller, you can get away with more denormalization, which will increase query performance.
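A back-of-the-envelope version of this calculation might look like the sketch below. The row counts, average row widths, and per-row index overheads are made-up numbers for illustration, and the formula ignores page overhead, compression, and fill factor, so treat the result as a rough lower bound.

```python
def estimate_table_bytes(row_count, avg_row_bytes, index_bytes_per_row=0):
    """Rough size estimate: data bytes plus index bytes for every row."""
    return row_count * (avg_row_bytes + index_bytes_per_row)


# Hypothetical tables and sizes for the safari tours warehouse
tables = {
    "fact_tour_booking": estimate_table_bytes(50_000_000, 40, index_bytes_per_row=24),
    "dim_date": estimate_table_bytes(3_650, 32),
    "dim_location": estimate_table_bytes(200, 120, index_bytes_per_row=16),
}

total_bytes = sum(tables.values())
total_gib = total_bytes / 1024**3

for name, size in tables.items():
    print(f"{name}: {size:,} bytes")
print(f"total: {total_gib:.2f} GiB")
```

In this example the fact table dominates the total, and its indexes account for more than a third of its size, which is why index choices matter so much to the estimate.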
Related Data Storage Options
The data in a data warehouse can be reorganized into smaller databases to suit the needs of the organization. For example, a business unit might create a data mart, with information specific to their department. This read-only info source provides clarity and accessibility for business users who might be a little further from the technical details of data architecture. Here’s a planning strategy to deploy when creating a data mart.
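As a lightweight illustration of the idea, the sketch below uses a SQL view as a stand-in for a data mart. A real data mart is usually a separate database, so the view, the warehouse_bookings table, and the mart_kenya name are simplifying assumptions. The sketch uses Python's built-in sqlite3 module, where plain views are read-only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse_bookings (
        booking_id INTEGER PRIMARY KEY, country TEXT, park_name TEXT, revenue REAL
    );
    INSERT INTO warehouse_bookings VALUES
        (1, 'Kenya', 'Maasai Mara', 3200.0),
        (2, 'Tanzania', 'Serengeti', 1500.0),
        (3, 'Kenya', 'Amboseli', 800.0);

    -- The "mart" exposes only the rows and columns one business unit needs
    CREATE VIEW mart_kenya AS
    SELECT booking_id, park_name, revenue
    FROM warehouse_bookings
    WHERE country = 'Kenya';
""")

mart_rows = conn.execute(
    "SELECT booking_id, park_name, revenue FROM mart_kenya ORDER BY booking_id"
).fetchall()

# Writes through the view are rejected, so business users get a read-only source
try:
    conn.execute("INSERT INTO mart_kenya VALUES (4, 'Tsavo', 100.0)")
    write_blocked = False
except sqlite3.OperationalError:
    write_blocked = True

print(mart_rows, write_blocked)
```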
Similarly, an operational data store (ODS) can be set up for operational reporting. The Master Data Management (MDM) system stores information about unique business assets (e.g., customers, suppliers, employees, products).
Working with Big Data
To handle big data, a data architect might choose to implement a tool such as Apache Hadoop. Hadoop was based on the MapReduce technique developed by Google to index the world wide web and was released to the public in 2006. In contrast to the highly structured environment of the data warehouse, where information has already been validated upstream to conform to business rules, Hadoop is a software library that accepts a variety of data types and allows for distributed processing across clusters of computers. Hadoop is often used to process streaming data.
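The MapReduce idea itself fits in a few lines. The sketch below is a single-process word count in plain Python, not Hadoop's API: the map_phase, shuffle, and reduce_phase functions are hypothetical names, and on a real cluster each phase would run in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


documents = ["lion zebra lion", "zebra giraffe", "Lion"]

mapped = chain.from_iterable(map_phase(doc) for doc in documents)
word_counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(word_counts)
```

Because each document is mapped independently and each key is reduced independently, the work can be spread over a cluster without the workers needing to coordinate beyond the shuffle step.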
While Hadoop is able to quickly process streaming data, it struggles with query speed, complexity of queries, security, and orchestration. In recent years, Hadoop has been falling out of favor as cloud-based solutions (e.g., Amazon Kinesis) have risen to prominence. These offer the same gains in speed for processing unstructured data while integrating with other tools in the cloud ecosystem that address these potential weaknesses.
Extract, Transform, Load (ETL)
Extraction, transformation, and load define the process of moving the data out of its original location (E), doing some form of transformation (T), then loading it (L) into the data warehouse. Rather than approach the ETL pipeline in an ad hoc, piecemeal fashion, the database architect should implement a systematic approach that takes into account best practices around design considerations, operational issues, failure points, and recovery methods. See also this helpful resource for setting up an ETL pipeline.
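The three steps can be sketched as three small functions. This is a toy pipeline under stated assumptions: the source is an inline CSV string, dates arrive as DD/MM/YYYY, rows with missing revenue are dropped as an assumed business rule, and the target is an in-memory SQLite table named fact_booking. None of these details come from the original article.

```python
import csv
import io
import sqlite3

# Hypothetical source extract: dates are DD/MM/YYYY and one revenue is missing
SOURCE_CSV = """booking_id,tour_date,revenue
1,05/01/2024,3200
2,12/04/2024,
3,30/06/2024,4100
"""


def extract(text):
    """E: read raw rows out of the source system (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """T: apply business rules and convert types before loading."""
    clean = []
    for row in rows:
        if not row["revenue"]:
            continue  # assumed rule: skip rows with missing revenue
        day, month, year = row["tour_date"].split("/")
        clean.append((int(row["booking_id"]), f"{year}-{month}-{day}", float(row["revenue"])))
    return clean


def load(conn, rows):
    """L: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_booking "
        "(booking_id INTEGER PRIMARY KEY, tour_date TEXT, revenue REAL)"
    )
    conn.executemany("INSERT INTO fact_booking VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute("SELECT * FROM fact_booking ORDER BY booking_id").fetchall()
print(loaded)
```

Keeping each step in its own function makes it easier to test the steps separately and to retry only the step that failed.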
Documentation for ETL includes creating source-to-target mapping: the set of transformation instructions on how to convert the structure and content of data in the source system to the structure and content of the target system. Here’s a sample template for this step.
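One way to make a source-to-target mapping executable is to express it as data. The sketch below is not the template referred to above; the source column names (cust_nm, bk_dt, amt), the target names, and the conversion rules are invented for illustration.

```python
# Each entry maps a source column to (target column, conversion function).
# All names and rules here are hypothetical.
SOURCE_TO_TARGET = {
    "cust_nm": ("customer_name", str.strip),
    "bk_dt": ("booking_date", lambda value: value.replace("/", "-")),
    "amt": ("revenue", float),
}


def apply_mapping(source_row, mapping):
    """Build a target row by renaming and converting each mapped source column."""
    return {
        target: convert(source_row[source])
        for source, (target, convert) in mapping.items()
    }


source_row = {"cust_nm": "  Asha ", "bk_dt": "2024/01/05", "amt": "3200"}
target_row = apply_mapping(source_row, SOURCE_TO_TARGET)
print(target_row)
```

Keeping the mapping in one table-like structure means the documentation and the transformation code cannot drift apart.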
Your organization might also consider ELT — loading the data without any transformations, then using the power of the destination system (usually a cloud-based tool) to conduct the transform step.
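For contrast with ETL, an ELT flow can be sketched as follows: the raw values are loaded untouched into a staging table, and the transformation then runs as SQL inside the destination. The sketch uses an in-memory SQLite database as a stand-in for a cloud warehouse; the staging_bookings and fact_booking tables, the DD/MM/YYYY date format, and the rule that drops empty revenue are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: raw values go in as text, with no cleaning applied
conn.execute("CREATE TABLE staging_bookings (booking_id TEXT, tour_date TEXT, revenue TEXT)")
raw_rows = [
    ("1", "05/01/2024", "3200"),
    ("2", "12/04/2024", ""),
    ("3", "30/06/2024", "4100"),
]
conn.executemany("INSERT INTO staging_bookings VALUES (?, ?, ?)", raw_rows)

# Transform: the destination's SQL engine does the type casts and date reshaping
conn.execute("""
    CREATE TABLE fact_booking AS
    SELECT
        CAST(booking_id AS INTEGER) AS booking_id,
        substr(tour_date, 7, 4) || '-' || substr(tour_date, 4, 2) || '-' || substr(tour_date, 1, 2)
            AS tour_date,
        CAST(revenue AS REAL) AS revenue
    FROM staging_bookings
    WHERE revenue <> ''
""")

transformed = conn.execute(
    "SELECT booking_id, tour_date, revenue FROM fact_booking ORDER BY booking_id"
).fetchall()
print(transformed)
```

Because the raw staging table is kept, the transform can be rewritten and rerun later without going back to the source system.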
Getting Data Out of the Warehouse
Once the data warehouse is set up, users should be able to easily query data out of the system. A little education might be required to optimize queries, focusing on:
Tuning a complex query
Using an execution plan
Understanding join mechanisms
Understanding memory / disk / IO usage considerations
Using parallelism
Writing hierarchical queries
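To illustrate the last item, a hierarchical query can be written as a recursive common table expression. The sketch below runs one against a small, made-up employee table using Python's built-in sqlite3 module; the table, its columns, and the job titles are hypothetical, and other databases offer their own syntax for the same idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each employee points at their manager; the top of the tree has no manager
    CREATE TABLE employee (employee_id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employee VALUES
        (1, 'Director', NULL),
        (2, 'Lead Guide', 1),
        (3, 'Guide', 2),
        (4, 'Accountant', 1);
""")

hierarchy = conn.execute("""
    WITH RECURSIVE reports (employee_id, name, depth) AS (
        -- Anchor: start at the root of the hierarchy
        SELECT employee_id, name, 0 FROM employee WHERE manager_id IS NULL
        UNION ALL
        -- Recursive step: add everyone who reports to a row already found
        SELECT e.employee_id, e.name, r.depth + 1
        FROM employee AS e
        JOIN reports AS r ON e.manager_id = r.employee_id
    )
    SELECT name, depth FROM reports ORDER BY depth, employee_id
""").fetchall()

for name, depth in hierarchy:
    print("  " * depth + name)
```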
Data Archiving
Finally, let’s talk about optimizing your organization’s data archiving strategy. Archived data remains important to the organization and is of particular interest to data scientists looking to conduct regression using historical trends.
The data architect should plan for this demand by relocating historical data that is no longer actively used into a separate storage system with higher latency but also robust search capabilities. Moving the data to a less costly storage tier is an obvious benefit of this process. The organization can also gain from removing write access from the archived data, protecting it from modification.
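The relocation step can be sketched as a single transaction that copies old rows to an archive table and then deletes them from the active table. In this sketch both tables live in one in-memory SQLite database for simplicity, whereas the archive would normally sit on a separate, cheaper storage tier; the table names, the archive_before helper, and the cutoff date are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_booking (booking_id INTEGER PRIMARY KEY, tour_date TEXT, revenue REAL);
    CREATE TABLE fact_booking_archive (booking_id INTEGER PRIMARY KEY, tour_date TEXT, revenue REAL);
    INSERT INTO fact_booking VALUES
        (1, '2019-02-10', 800.0),
        (2, '2020-07-01', 950.0),
        (3, '2024-03-15', 1200.0);
""")


def archive_before(conn, cutoff_date):
    """Move rows older than cutoff_date into the archive table in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO fact_booking_archive SELECT * FROM fact_booking WHERE tour_date < ?",
            (cutoff_date,),
        )
        moved = conn.execute(
            "DELETE FROM fact_booking WHERE tour_date < ?", (cutoff_date,)
        ).rowcount
    return moved


moved = archive_before(conn, "2023-01-01")
active = conn.execute("SELECT COUNT(*) FROM fact_booking").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM fact_booking_archive").fetchone()[0]
print(moved, active, archived)
```

Running the copy and the delete in one transaction means a failure partway through cannot leave rows missing from both tables. Removing write access to the archive would be handled separately, through the permissions of whichever storage system holds it.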
Summary
This article covers tried and true practices for setting up a data warehouse. Let me know how you’re using this information in your work by dropping a comment.
Original article: https://towardsdatascience.com/data-warehouse-68ec63eecf78