
Databricks Part 5: Databricks File System (DBFS)

Published: 2023/12/14 · 豆豆


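Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and available on Azure Databricks clusters. A storage object is a file with a specific format, and different formats have different read and write mechanisms.

DBFS is an abstraction on top of scalable object storage, so the amount of storage in use can grow and shrink with demand. DBFS as mounted in Azure Databricks offers the following benefits: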

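Mount storage objects so that data can be accessed seamlessly without credentials.
Interact with object storage using directory and file semantics instead of storage URLs.
Persist files to object storage, so data is not lost after a cluster is terminated.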

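Azure Databricks is a distributed computing system: a cluster supplies compute resources such as CPU, memory, and network, while DBFS supplies storage space for data and files and the ability to read and write them. DBFS is a core piece of infrastructure in Azure Databricks.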

I. DBFS root

The default storage location in DBFS is called the DBFS root. Several types of data are stored in the following DBFS root locations:

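/FileStore: imported data files, generated plots, and uploaded libraries.
/databricks-datasets: sample public datasets for learning Spark or testing algorithms.
/databricks-results: files generated by downloading the full results of a query.
/tmp: a directory for temporary data.
/user: files belonging to individual users.
/mnt: (not visible by default) storage mounted into DBFS; data written under a mount point path (/mnt) is stored outside the DBFS root.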
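The mapping from top-level folder to purpose can be sketched as a small lookup. This is only an illustration: the helper name `describe_dbfs_location` and the dictionary are hypothetical, and the descriptions simply restate the list above.

```python
# Descriptions restate the list above; names here are hypothetical.
DBFS_ROOT_LOCATIONS = {
    "FileStore": "imported data files, generated plots, uploaded libraries",
    "databricks-datasets": "sample public datasets",
    "databricks-results": "files generated by downloading full query results",
    "tmp": "temporary data",
    "user": "per-user files",
    "mnt": "mount points (data lives outside the DBFS root)",
}


def describe_dbfs_location(path: str) -> str:
    """Return a description of the DBFS root folder that `path` falls under."""
    # Drop an optional "dbfs:" scheme, then look at the first path component.
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    parts = [p for p in path.split("/") if p]
    if not parts:
        return "DBFS root"
    return DBFS_ROOT_LOCATIONS.get(parts[0], "unknown (not a default folder)")


print(describe_dbfs_location("dbfs:/FileStore/tables/data.csv"))
print(describe_dbfs_location("/mnt/raw/events"))
```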

In a new workspace, the DBFS root contains the folders listed above by default.

The DBFS root also contains data that is not visible and cannot be accessed directly, including mount point metadata, credentials, and certain types of logs.

DBFS also has two special root locations: FileStore and Azure Databricks Datasets.

FileStore is a storage area for files in many formats, chiefly csv, parquet, orc, and delta.
Datasets is a collection of sample data that users can use to test algorithms and Spark.

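DBFS is usually accessed through the pyspark.sql module, dbutils, and SQL.

II. Accessing DBFS with the pyspark.sql module

With the pyspark.sql module, a path without a scheme, such as "/tmp/foo", refers to a location under the DBFS root. The following example writes the Parquet file foo to the DBFS /tmp directory.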

# df.write.format("parquet").save("/tmp/foo", mode="overwrite")
df.write.parquet("/tmp/foo", mode="overwrite")

The file contents can then be read back through the Spark API:

# df = spark.read.format("parquet").load("/tmp/foo")
df = spark.read.parquet("/tmp/foo")
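To make the path handling explicit, the sketch below spells out the dbfs:/ form of a scheme-less path. It assumes the default Databricks behavior in which Spark resolves such paths against the DBFS root. The helper `to_dbfs_uri` is hypothetical and not part of any Databricks API.

```python
def to_dbfs_uri(path: str) -> str:
    """Spell out the dbfs:/ form of a path, assuming Spark's default on Databricks."""
    if path.startswith("dbfs:/"):
        return path  # already explicit
    if ":" in path.split("/", 1)[0]:
        # Some other scheme (for example "abfss:"); leave it untouched.
        return path
    if not path.startswith("/"):
        raise ValueError("expected an absolute path such as /tmp/foo")
    return "dbfs:" + path


print(to_dbfs_uri("/tmp/foo"))
```

Under that assumption, spark.read.parquet("/tmp/foo") and spark.read.parquet("dbfs:/tmp/foo") would point at the same files.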

III. Accessing DBFS with SQL

Files in delta or parquet format can be queried in SQL through delta.`file_path` or parquet.`file_path`:

select * from delta.`/tmp/delta_file`;
select * from parquet.`/tmp/parquet_file`;

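Note that the file's actual format must match the format prefix (delta or parquet), otherwise the query fails. The file path is enclosed in backticks (``), not single quotes.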
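When such queries are assembled from Python, the two rules above (format prefix, backtick quoting) can be checked in one place. A minimal sketch follows. `path_query` and `SUPPORTED_FORMATS` are hypothetical names, and the allowed set is limited to the two formats discussed here.

```python
SUPPORTED_FORMATS = {"delta", "parquet"}  # only the two formats discussed above


def path_query(fmt: str, path: str) -> str:
    """Build a SELECT over a file path, quoting the path with backticks."""
    fmt = fmt.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    if "`" in path:
        # A backtick inside the path would end the quoted identifier early.
        raise ValueError("path must not contain a backtick")
    return f"SELECT * FROM {fmt}.`{path}`"


print(path_query("delta", "/tmp/delta_file"))
```

In a notebook the resulting string could be passed to spark.sql(...).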

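IV. Accessing DBFS with dbutils

dbutils.fs provides file-system-like commands for accessing files in DBFS. This section gives a few examples of using dbutils.fs commands to write and read files in DBFS.

1. List a DBFS directory

In a Python environment, dbutils.fs can list the files under a path: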

display(dbutils.fs.ls("dbfs:/foobar"))

2. Read and write data

Files in the DBFS root can be written and read as if it were a local file system.

# create folder
dbutils.fs.mkdirs("/foobar/")
# write data
dbutils.fs.put("/foobar/baz.txt", "Hello, World!")
# view head
dbutils.fs.head("/foobar/baz.txt")
# remove file
dbutils.fs.rm("/foobar/baz.txt")
# copy file (assumes /foobar/a.txt already exists)
dbutils.fs.cp("/foobar/a.txt", "/foobar/b.txt")
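Because put refuses to replace an existing file unless overwrite is set, a write-then-read helper usually passes that flag explicitly. The sketch below shows the pattern. `InMemoryFs` is a hypothetical stand-in for dbutils.fs, written only so the example runs outside Databricks, and `write_then_read` is likewise an illustrative name.

```python
class InMemoryFs:
    """Hypothetical stand-in for dbutils.fs so the sketch runs outside Databricks."""

    def __init__(self):
        self._files = {}

    def put(self, file, contents, overwrite=False):
        # Mirror the documented default: fail if the file exists and overwrite is off.
        if file in self._files and not overwrite:
            raise FileExistsError(file)
        self._files[file] = contents
        return True

    def head(self, file, max_bytes=65536):
        # Return at most the first max_bytes bytes, decoded as UTF-8.
        data = self._files[file].encode("utf-8")[:max_bytes]
        return data.decode("utf-8", "ignore")


def write_then_read(fs, path, text):
    """Write `text` to `path`, replacing any existing file, then read it back."""
    fs.put(path, text, True)  # third positional argument is `overwrite`
    return fs.head(path)


fs = InMemoryFs()
print(write_then_read(fs, "/foobar/baz.txt", "Hello, World!"))
print(write_then_read(fs, "/foobar/baz.txt", "Hello again!"))
```

On a cluster, the same helper would be called as write_then_read(dbutils.fs, "/foobar/baz.txt", "Hello, World!").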

3. Command help

dbutils.fs.help()

dbutils.fs mainly consists of two groups of commands: fsutils for file operations and mount for mounting storage.

fsutils

cp(from: String, to: String, recurse: boolean = false): boolean -> Copies a file or directory, possibly across FileSystems
head(file: String, maxBytes: int = 65536): String -> Returns up to the first 'maxBytes' bytes of the given file as a String encoded in UTF-8
ls(dir: String): Seq -> Lists the contents of a directory
mkdirs(dir: String): boolean -> Creates the given directory if it does not exist, also creating any necessary parent directories
mv(from: String, to: String, recurse: boolean = false): boolean -> Moves a file or directory, possibly across FileSystems
put(file: String, contents: String, overwrite: boolean = false): boolean -> Writes the given String out to a file, encoded in UTF-8
rm(dir: String, recurse: boolean = false): boolean -> Removes a file or directory

mount

mount(source: String, mountPoint: String, encryptionType: String = "", owner: String = null, extraConfigs: Map = Map.empty[String, String]): boolean -> Mounts the given source directory into DBFS at the given mount point
mounts: Seq -> Displays information about what is mounted within DBFS
refreshMounts: boolean -> Forces all machines in this cluster to refresh their mount cache, ensuring they receive the most recent information
unmount(mountPoint: String): boolean -> Deletes a DBFS mount point
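As a rough illustration of how the mount arguments fit together, the sketch below assembles them for an Azure Blob Storage container accessed with an account key. Everything here is an assumption rather than something taken from the help text above: the helper name `wasbs_mount_args`, the container, account, and mount names, and the use of the snake_case keywords (mount_point, extra_configs) that the Python binding is understood to accept in place of mountPoint and extraConfigs.

```python
def wasbs_mount_args(container, account, mount_name, account_key):
    """Assemble keyword arguments for mounting a Blob Storage container (assumed layout)."""
    host = f"{account}.blob.core.windows.net"
    return {
        "source": f"wasbs://{container}@{host}",
        "mount_point": f"/mnt/{mount_name}",
        # The account key is passed through the extra configuration map.
        "extra_configs": {f"fs.azure.account.key.{host}": account_key},
    }


# Hypothetical names; in practice the key would come from a secret store.
args = wasbs_mount_args("raw", "mystorageacct", "raw", "<account-key>")
print(args["source"])
print(args["mount_point"])
```

On a cluster, these arguments would be passed as dbutils.fs.mount(**args), after which the mounted data would be reachable under /mnt/raw.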

Reference:

Databricks File System (DBFS)
