Storage Systems
- Reference: Computer Architecture (6th Edition)
Contents
- Bus
- Disk Storage
- Use Arrays of Small Disks?
- RAID
- RAID 0: Striping
- RAID 1: Disk Mirroring/Shadowing
- RAID 2: Bit-interleaved Hamming-coded Array
- RAID 3: Bit-interleaved Parity Disk
- RAID 4: Block-interleaved Parity Disk
- RAID 5: Block-interleaved Distributed Parity
- RAID 6: Two-dimensional Parity, Independent-access Disk Array
- RAID Implementation
- Storage Environment
- Direct Attached Storage (DAS)
- Network Attached Storage (NAS)
- Storage Area Network (SAN)
- Memory: main memory (internal storage)
- Storage Systems: external storage (persistent, non-volatile)
Bus
- I/O buses tap into the processor-memory bus via bus adaptors: an adaptor matches speeds (by buffering) and provides the interface
Main components of Intel Chipset: Pentium 4
- Northbridge (adaptor for high-speed devices): handles memory and graphics
- Southbridge (adaptor for low-speed devices): I/O, PCI bus, disk controllers, USB controllers, audio, serial I/O, interrupt controller, timers
IMC (Integrated Memory Controller)
- CPU integration keeps increasing: the memory controller has moved inside the CPU and the northbridge has disappeared. The L1 and L2 caches are integrated into each core, and the L3 cache, shared by the four cores, is also integrated into the CPU
- QPI (QuickPath Interconnect): supports multiple system-bus connections and replaces the front-side bus (FSB)
Next step: integrate memory into the CPU as well…
The move from Parallel to Serial I/O
- Parallel I/O (ISA bus, PCI, SCSI, IDE)
- Parallel bus clock rate limited by clock skew across long bus (~100MHz)
- High power to drive large number of loaded bus lines
- Central bus arbiter adds latency to each transaction, sharing limits throughput
- Expensive parallel connectors and backplanes/cables (all devices pay costs)
- Dedicated Point-to-point Serial Links (Ethernet, Infiniband, PCI Express, SATA, USB, Firewire)
- Point-to-point links run at multi-gigabit speed using advanced clock/signal encoding (requires lots of circuitry at each end)
- Lower power since only one well-behaved load
- Multiple simultaneous transfers
- Cheap cables and connectors (trade greater endpoint transistor cost for lower physical wiring cost), customize bandwidth per device using multiple links in parallel
- Example: hard-disk interfaces moved from IDE (parallel) to SATA (serial)
Disk Storage
- Storage emphasizes reliability and scalability as well as cost-performance
- What is the "software king" that determines which hardware features are actually used?
- Compiler for processor
- Operating System for storage
Flash: The future of disks? (solid-state drives)
- Flash drive advantages: lower power, much faster seek time, 100X IOs per second, greater reliability, and lower noise, all because there are no moving parts (performs well when the data does not move)
- Flash disadvantages: cost (20-100x disk cost/GB), slow writes with current design (competitive with disks), and write endurance (a location that is written too many times wears out). Endurance is not an issue for most applications, since wear leveling spreads the writes around the blocks on the chip (the problem is handled in software)
Disk Figure of Merit: Areal Density
- Bits recorded along a track; Metric is Bits Per Inch (BPI)
- Number of tracks per surface; Metric is Tracks Per Inch (TPI)
- Bit density per unit area; metric is Bits Per Square Inch: $\textrm{Areal Density} = \textrm{BPI} \times \textrm{TPI}$
Disk Drive Performance
- Disk Service Time: the time a disk takes to complete an I/O request is the sum of
- Seek time, rotational latency, and data transfer time (determined by the data transfer rate in MB/s)
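The sum above can be sketched as a small calculation. This is a rough estimate under stated assumptions, not a model of any particular drive: the function name and the sample figures are hypothetical, the average rotational latency is taken as half a revolution, and 1 MB is treated as 1024 KB.

```python
def disk_service_time_ms(seek_ms: float, rpm: float,
                         transfer_mb_per_s: float, request_kb: float) -> float:
    """Estimate the time in milliseconds for one disk I/O request."""
    # Assumption: on average the platter must turn half a revolution.
    rotational_ms = 0.5 * 60_000.0 / rpm
    # Time to move the requested bytes at the sustained transfer rate.
    transfer_ms = (request_kb / 1024.0) / transfer_mb_per_s * 1000.0
    return seek_ms + rotational_ms + transfer_ms


# Hypothetical drive: 4 ms seek, 7200 RPM, 100 MB/s, one 4 KB request.
print(f"{disk_service_time_ms(4.0, 7200, 100.0, 4.0):.2f} ms")
```

For small requests the seek and rotational terms dominate, which is why the transfer rate alone says little about random-access performance.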
Utilization vs. Response time
- The higher the utilization (I/O request rate), the longer the response time
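The notes state only the qualitative trend. As a rough illustration, the sketch below assumes a simple M/M/1 queueing model (random arrivals, a single server), which is an assumption brought in here and not something the notes specify. Under that model the mean response time is the service time divided by one minus the utilization.

```python
def mm1_response_time_ms(service_time_ms: float, utilization: float) -> float:
    """Mean response time under an assumed M/M/1 queue."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)


# Hypothetical 8 ms service time at increasing utilization levels.
for u in (0.1, 0.5, 0.9):
    print(f"utilization {u:.0%}: {mm1_response_time_ms(8.0, u):.1f} ms")
```

Response time grows slowly at low utilization and very steeply as utilization approaches 100%.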
Parameters that characterize the reliability of storage devices
- Reliability: the ability of the system to deliver service continuously, starting from an initial state
- Measured by MTTF (Mean Time To Failure)
- Availability: the fraction of time the system works normally, taken over the interval between two consecutive periods of normal service
- Measured by $\frac{\textrm{MTTF}}{\textrm{MTTF} + \textrm{MTTR}}$, where MTTR (Mean Time To Repair) is the average repair time (repair here means data recovery)
- MTTF + MTTR = MTBF (Mean Time Between Failures)
- Dependability: the extent to which the service can justifiably be regarded as reliable
- Dependability cannot be measured
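The availability and MTBF formulas can be checked with a short calculation. The function names and the sample MTTF and MTTR values below are hypothetical, chosen only to show the arithmetic.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours


# Hypothetical device: fails once per 1,000,000 hours, 24 hours to repair.
print(f"availability = {availability(1_000_000, 24):.6f}")
print(f"MTBF = {mtbf(1_000_000, 24):,.0f} hours")
```

Because MTTR sits only in the denominator, shortening the repair time raises availability even when MTTF stays the same.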
Use Arrays of Small Disks?
Replace Small Number of Large Disks with Large Number of Small Disks!
- Disk Arrays have potential for large data and I/O rates, high MB per cu. ft., high MB per KW, but what about reliability?
Array Reliability
- Reliability of $N$ disks = Reliability of 1 disk $\div N$
- Arrays (without redundancy) too unreliable to be useful!
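A quick calculation shows why an array without redundancy is so fragile. The sketch assumes independent disk failures at a constant rate, which is what makes the division by $N$ valid; the MTTF figure and disk count are illustrative, not data from the notes.

```python
def array_mttf_hours(disk_mttf_hours: float, n_disks: int) -> float:
    """MTTF of an array that fails as soon as any one disk fails."""
    if n_disks < 1:
        raise ValueError("n_disks must be at least 1")
    return disk_mttf_hours / n_disks


# Hypothetical: 50,000-hour disks (about 5.7 years) in a 70-disk array.
hours = array_mttf_hours(50_000, 70)
print(f"{hours:.0f} hours, about {hours / 24:.0f} days")
```

A single disk that lasts years becomes an array that is expected to lose data within about a month.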
RAID
Redundant Arrays of (Inexpensive) Disks
- Files are "striped" across multiple disks
- Redundancy yields high data availability (disks will still fail)
- Availability: service still provided to user, even if some components failed
- Contents reconstructed from data redundantly stored in the array
- Capacity penalty to store redundant info
- Bandwidth penalty to update redundant info
RAID 0: Striping
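Data striping
- RAID 0: a non-redundant disk array that stores no redundant information
- Data is divided into stripes, and the stripes are interleaved across the disks, forming a larger-capacity disk whose members can work in parallel (Stripe0, Stripe1, … denote consecutive stripes; their size is called the stripe width)
- All disks can be read in parallel, so performance is high; but there is no data redundancy, so the failure of any single disk makes the whole system unusable
- Suited to applications that need high-bandwidth disk access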
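The interleaving described above amounts to a simple address mapping. The sketch below assumes plain round-robin placement, one common way to stripe; `raid0_locate` is a hypothetical helper name, not an API from any RAID implementation.

```python
from typing import Tuple


def raid0_locate(stripe_index: int, n_disks: int) -> Tuple[int, int]:
    """Map a logical stripe number to (disk, stripe slot on that disk)."""
    # Assumption: stripes are dealt out round-robin across the disks.
    return stripe_index % n_disks, stripe_index // n_disks


# Eight consecutive stripes on a hypothetical 4-disk array.
for s in range(8):
    disk, slot = raid0_locate(s, 4)
    print(f"Stripe{s} -> disk {disk}, slot {slot}")
```

Consecutive stripes land on different disks, so a large sequential transfer keeps every disk busy at once.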
RAID 1: Disk Mirroring/Shadowing
- Each disk is fully duplicated onto its “mirror”: Very high availability can be achieved
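- Bandwidth sacrifice on write: one logical write = two physical writes (the disk and its mirror are written in parallel and no parity has to be computed, so writes are faster than at the higher RAID levels)
- Reads may be optimized: the disk and its mirror can serve reads independently and at the same time, and whichever disk returns the data first supplies it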
- Most expensive solution: 100% capacity overhead
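RAID 2: Bit-interleaved Hamming-coded Array
- Each data disk stores one bit of every data word, bit-interleaved: Disk0 holds bit 0 of all data words, Disk1 holds bit 1, and so on. A Hamming code is computed over the corresponding bits of the data disks, and the code bits are stored at the corresponding positions on several check (ECC) disks
- When data is read from the data disks, the Hamming code is read as well and used to detect and correct errors (the Hamming code can correct 1-bit errors and detect 2-bit errors)
- Several disks are needed to hold the Hamming check information; the number of redundant disks is proportional to the logarithm of the number of data disks ($\log_2 m$, where $m$ is the number of data disks)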
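The logarithmic growth of the check disks can be made concrete. The sketch assumes a standard single-error-correcting Hamming code, for which the number of check bits $r$ is the smallest value satisfying $2^r \ge m + r + 1$; this bound is an assumption added here, since the notes give only the proportionality. Adding double-error detection typically costs one further parity bit, which the sketch does not count.

```python
def hamming_check_disks(m: int) -> int:
    """Smallest r with 2**r >= m + r + 1 (single-error-correcting code)."""
    if m < 1:
        raise ValueError("m must be at least 1")
    r = 1
    while 2 ** r < m + r + 1:
        r += 1
    return r


# Check disks needed for a few hypothetical data-disk counts.
for m in (4, 8, 16, 32):
    print(f"{m} data disks -> {hamming_check_disks(m)} check disks")
```

The relative overhead falls as the array widens, but it stays well above the single parity disk used by RAID 3.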
RAID 3: Bit-interleaved Parity Disk
Bit-interleaved parity disk array
- When a disk fails, the disk controller itself can tell which disk is at fault, so the complex Hamming code is unnecessary and simple parity is enough
- Logically, a single high-capacity, high-transfer-rate disk: good for large transfers, with single-disk fault tolerance and parallel transfer (this is a fine-grained array, meaning the stripe unit is small, 1 byte or 1 bit. Almost every I/O request therefore needs all disks in the array to serve it, which yields a very high data transfer rate)
- $1/N$ capacity cost for parity with $N$ data disks and 1 parity disk
- Wider arrays reduce capacity costs, but decrease reliability/availability
RAID 3 read/write characteristics
- Assume 4 data disks and one redundant disk
- Reading data takes 5 disk reads in total (the 4 data disks and the redundant disk are read at the same time)
- Writing data takes 3 disk reads and 2 disk writes
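Parity-based recovery can be sketched with XOR. This is a minimal illustration assuming even parity computed byte-wise over equal-length blocks; `xor_blocks` and the sample bytes are hypothetical. The parity block is the XOR of all data blocks, so a lost block equals the XOR of the surviving blocks and the parity.

```python
from functools import reduce
from operator import xor
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


# Four hypothetical data disks, two bytes each, plus one parity disk.
data = [b"\x0f\xf0", b"\x33\x55", b"\xaa\x01", b"\x00\xff"]
parity = xor_blocks(data)

# Disk 2 fails: rebuild its contents from the other disks and the parity.
survivors = data[:2] + data[3:]
recovered = xor_blocks(survivors + [parity])
print(recovered == data[2])
```

The controller only needs to know which disk failed; XOR then restores the missing bits without any Hamming code.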
RAID 4: Block-interleaved Parity Disk
Block-interleaved parity disk array
Inspiration for RAID 4
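- In RAID 3, a single access operates on every disk in the array. RAID 4 aims to involve fewer disks per operation, so that the array can carry out several independent disk operations in parallel
- RAID 4 stores data block-interleaved across the disks and keeps the parity on one dedicated disk (the parity disk); the redundancy cost is the same as RAID 3 (it is a coarse-grained array: interleaving and parity are computed over relatively large stripes, i.e. blocks); the way data is accessed differs from RAID 3
- Small read: every block has an error detection field, so each disk performs reads independently; this allows independent reads to different disks simultaneously (the parity disk is read only when a disk fails, in order to reconstruct the data)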
- To catch errors on read, rely on error detection field vs. the parity disk
- Large write: a write must recompute the parity, so nearly all disks are accessed
RAID 5: Block-interleaved Distributed Parity
Inspiration for RAID 5
- Small writes (write to one disk): since P has old sum, compare old data to new data, add the difference to P
Small Write Algorithm
- 1 Logical Write = 2 Physical Reads + 2 Physical Writes
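The small-write rule can be sketched as follows. The in-memory list of "disks" and the helper names are hypothetical; the point is that the new parity is the old parity XOR the old data XOR the new data, which takes two physical reads and two physical writes and leaves the other data disks untouched.

```python
from typing import List


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write(disks: List[bytes], data_idx: int, parity_idx: int,
                new_data: bytes) -> None:
    """Update one data block and the parity block in place."""
    old_data = disks[data_idx]        # physical read 1
    old_parity = disks[parity_idx]    # physical read 2
    # The difference between old and new data is folded into the parity.
    new_parity = xor_bytes(old_parity, xor_bytes(old_data, new_data))
    disks[data_idx] = new_data        # physical write 1
    disks[parity_idx] = new_parity    # physical write 2


# Hypothetical stripe: three data blocks followed by their parity block.
stripe = [b"\x01", b"\x02", b"\x04", b"\x07"]
small_write(stripe, data_idx=1, parity_idx=3, new_data=b"\x10")
print(stripe)
```

The alternative, rereading every other data block to recompute the parity from scratch, costs more reads as the array gets wider.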
Problems of Disk Arrays: Small Writes
- Small writes are limited by Parity Disk:
- Writes to $D_0$ and $D_5$ both also write to the P disk (so $D_0$ and $D_5$ still cannot be written at the same time)
RAID 5: High I/O Rate Interleaved Parity
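Block-interleaved distributed-parity disk array
- To solve the problem above, the parity information is spread across all disks in the array and there is no dedicated redundant disk: the parity block of each row of data blocks is staggered and placed cyclically on a different disk, so that parity is distributed evenly over all disks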
- Independent writes possible because of interleaved parity
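One possible rotation scheme is sketched below. The notes say only that parity is staggered cyclically, so the exact layout (parity starting on the last disk and moving one disk to the left each row) and the helper name `raid5_locate` are assumptions made for illustration.

```python
from typing import Tuple


def raid5_locate(block: int, n_disks: int) -> Tuple[int, int, int]:
    """Map a logical data block to (row, data disk, parity disk)."""
    data_per_row = n_disks - 1
    row, k = divmod(block, data_per_row)
    # Assumed layout: parity on the last disk in row 0, then rotating left.
    parity_disk = (n_disks - 1 - row) % n_disks
    # Data blocks fill the row in order, skipping the parity disk.
    data_disk = k if k < parity_disk else k + 1
    return row, data_disk, parity_disk


# Small writes to D0 and D5 on a hypothetical 5-disk array.
for b in (0, 5):
    row, data_disk, parity_disk = raid5_locate(b, 5)
    print(f"D{b}: row {row}, data disk {data_disk}, parity disk {parity_disk}")
```

Under this layout D0 and D5 use disks 0 and 4 and disks 1 and 3 respectively, four distinct disks, so the two small writes can proceed in parallel instead of queueing on a single parity disk.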
RAID 6: Two-dimensional Parity, Independent-access Disk Array
Inspiration:
- Recovering from 2 failures
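RAID 6 characteristics
- Two-dimensional parity, independent-access disk array: adds a second, independent set of check information to RAID 5, stored on another check disk; a write accesses 1 data disk and 2 redundant disks, and the array tolerates two disk failures
- Data is block-interleaved across the disks, and the error-detection and error-correction information is distributed evenly over all disks
RAID Implementation
- Software approach: the array management software runs on the host
- Advantage: low cost
- Drawback: takes up too much host time, and bandwidth cannot be pushed higher
- Array-card approach: the RAID management software is built into firmware on the I/O controller card, so it takes no host time; typically used in workstations and PCs
- Subsystem approach: an open platform based on a general-purpose interface bus, usable with a variety of host platforms and network systems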
Storage Environment
Direct Attached Storage (DAS)
Direct attachment
- Servers connect directly to the disk array typically via a SCSI interface.
Network Attached Storage (NAS)
Network-attached storage: a file system on the network
- The server provides services, while a separate dedicated system handles storage
- NAS Devices access the disks in an array via direct connection or through external connectivity
Storage Area Network (SAN)
Storage area network: a disk on the network
- Servers access the disk array through a dedicated network designated as SAN (consists of Fibre Channel switches), built specifically for traffic between the storage media and the servers