Community Share | A Detailed Look at the Last Line of Defense for Placement in TensorFlow: the Placer Algorithm...

Published: 2024/1/23

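About the author: Wang Siyu (王思宇) is an algorithm expert at Alibaba, working on deep learning platform development, TensorFlow distributed architecture design and large-scale distributed performance optimization, and a contributor to the open-source TensorFlow project.

Reposted from: 互聯網西門二少 (id: ximen_yushao)

Note: it helps to walk through the source code while reading this article.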

1. Introducing the Problem

When building a model with TensorFlow, you may write something like the following in order to use a GPU Device.

with tf.device('/gpu:0'):
    a = tf.get_variable(.....)
    b = .......
    c = .......

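So will a, b and c in the code above really end up on GPU:0? What happens if c has no GPU implementation? And, going further, are there other constraints that can override the user's setting?

In fact, if you turn on the log_device_placement option in the session config and check, one by one, where each Op was placed, you will find that some Ops were not placed where you asked. They were quietly moved to another Device.

This is not a bug. It is the Placer algorithm doing its protective job. Placer is the last line of defense for Placement settings in TensorFlow. It works in the lower layers of TensorFlow and, while satisfying the user's requests as far as possible, silently corrects Placements that are not reasonable.

Let me walk you through it, from the design motivation to the source code.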

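2. Why Placement Was Designed

Because a single Device is limited in compute and memory, model sharding is an important requirement. In essence it splits the model and its computation across different Devices. This solves the problem of a large model not fitting on one Device, and it may also speed up computation.

Among deep learning frameworks, sharding a model is clearly easier in TensorFlow than in Caffe, mainly thanks to TensorFlow's Placement mechanism. Placement is a concept specific to TensorFlow: it specifies the binding between an Op and a concrete Device. The model sharding problem is therefore really the problem of choosing a Placement for every Op in the model.

At the Python level there are two Placement-related APIs. They are used widely inside the framework code and can also be called directly by users.

User-specified Placement information is not fully reliable, though. It often conflicts with what an Op can actually support, and resolving that conflict is the job of TensorFlow's Placer module.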

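3. What Placer Does

After the graph has been built in Python, print the GraphDef and look at the NodeDef of each Node (shown in a figure in the original post). Two parts of it relate to Placement.

device attribute: explicitly specifies the Device on which this Node should be placed. It is set by the user through with tf.device.
The string marker loc:@xxxx: a Placement constraint that implicitly states which Nodes this Node's Placement must agree with. xxxx is the name of a Group, and this Node should share its Placement with every Node in the Group named xxxx.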
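To make the two fields concrete, here is a minimal sketch that models a NodeDef as a plain Python dict rather than the real protobuf. It assumes the constraint strings are stored in a list attribute named `_class` and carry the `loc:@` prefix; the helper `colocation_groups` and the sample node are hypothetical and only for illustration.

```python
# Sketch only: a NodeDef modelled as a plain dict, not the real protobuf.
COLOCATION_PREFIX = "loc:@"  # the marker described in the text


def colocation_groups(node_def):
    """Return the Colocation Group names a node asks for (hypothetical helper)."""
    class_attrs = node_def.get("attr", {}).get("_class", [])
    return [s[len(COLOCATION_PREFIX):] for s in class_attrs
            if s.startswith(COLOCATION_PREFIX)]


# Hypothetical node: an explicit device plus one colocation constraint.
node = {
    "name": "MatMul_1",
    "op": "MatMul",
    "device": "/device:GPU:0",
    "attr": {"_class": ["loc:@global_step"]},
}

print(node["device"])           # explicit Placement
print(colocation_groups(node))  # implicit Placement constraints
```

A node without any colocation constraint simply yields an empty list, which corresponds to the case where the Node forms a Group of its own.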

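As you can imagine, these two pieces of information can contradict each other.

Placer not only has to resolve conflicts between the two, it also applies some rules to avoid, as far as possible, performance problems caused by poor Placement. After Placer has processed a Node, the Node has its final Placement, which overwrites the device attribute in the NodeDef.

So, put simply, Placer's job is to infer and fill in the device attribute of every NodeDef.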

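4. Some Prerequisites

Working through the logic, you will run into terms coined for this problem as well as a classic algorithm. To avoid detours, make sure you understand the following before reading the Placer module.

Explicit Placement: Placement information the user specifies directly through with tf.device. It is written into the device attribute of the NodeDef described in the previous section.
Implicit Placement: Placement information specified indirectly, corresponding to loc:@xxxx in the NodeDef from the previous section. As noted there, xxxx is the name of a Group whose Nodes must all share the same Placement. Such a Group is called a Colocation Group and is a kind of constraint.
Find-Union algorithm: the union-find (disjoint-set) algorithm, the most important algorithm inside Placer. TensorFlow uses it to handle Node Colocation efficiently. Put simply, Nodes that share a Colocation Group are logically "unioned" into one set, so that a "find" on any Node quickly returns the Placement information of the whole set. Designing a good data structure and making both "union" and "find" efficient is the heart of the algorithm.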
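For readers who have not met union-find before, the following is a generic textbook sketch with path compression and union by rank. It is not TensorFlow's implementation; the class name `DisjointSet` and the toy `colocate_with` mapping are made up for illustration.

```python
class DisjointSet:
    """Union-find with path compression and union by rank."""

    def __init__(self, names):
        self.parent = {n: n for n in names}  # every node starts as its own root
        self.rank = {n: 0 for n in names}

    def find(self, x):
        # Walk up to the root, then point every visited node straight at it.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra  # keep ra as the root with the larger rank
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return ra


# Hypothetical graph: node name -> node it must be colocated with (or None).
colocate_with = {"w": None, "w/read": "w", "w/assign": "w", "x": None, "y": "x"}

ds = DisjointSet(colocate_with)
for node, target in colocate_with.items():
    if target is not None:
        ds.union(node, target)

groups = {}
for node in colocate_with:
    groups.setdefault(ds.find(node), []).append(node)
print(sorted(sorted(g) for g in groups.values()))
```

After the loop, every Node in a Colocation Group reports the same root, so any per-group information only needs to be stored once, at the root.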

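5. Basic Principles of Placer's Decisions

Placer analyzes the Graph to some extent and, combined with the user's requests, fine-tunes the Placement of each Node. The principles can be summed up in four points:

Satisfy the user's requests as far as possible (User Requirement First): the Placement of each Node follows the user's request whenever it can
Prefer faster compute devices (High Performance Device): if the user did not specify a Placement for a Node, the faster device is preferred
Keep the program runnable (Runnable): if a Node has no implementation for the Placement the user asked for, Placer falls back to another implementation so that the program can still run
Take locality into account where possible (Near When Possible): when fine-tuning Placement, consider which nodes are neighbors and avoid pointless copies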

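6. The Principles in Detail

1. Satisfy the user's requests as far as possible (User Requirement First)

User requests come in two forms. One is explicit, expressed as the device information set on the Node. The other is implicit, expressed as the loc:@xxxx attribute, that is, the Colocation Group.

Placer completes and fine-tunes the Placement information based on both kinds of request and on what is actually feasible.

The NodeDef screenshot referred to earlier shows a Node whose Op type is MatMul. The user explicitly placed it on '/device:GPU:0' and also asked for it to join the Colocation Group named global_step.

The device attribute and the loc:@xxxx attribute of a NodeDef are introduced by the two Python-level APIs below. Both are under the user's control, and some higher-level APIs use them internally.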

# device attributes
@tf_export("device")
def device(device_name_or_function):
    ...  # body omitted in this excerpt

# colocation attributes
@tf_export("colocate_with")
def colocate_with(op, ignore_existing=False):
    ...  # body omitted in this excerpt

2. Prefer faster compute devices (High Performance Device)

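If the device attribute of a Node does not contain a device_type (that is, GPU or CPU), Placer must decide which kind of Device to use. Every Device type carries a priority when it is registered with TensorFlow, and a higher-priority Device usually offers better compute performance.

When an Op has implementations for several Devices, Placer picks the one with the highest priority by setting device_type to the highest-priority Device among all implementations.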
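The selection rule can be sketched in a few lines. The priority numbers below are invented for the example (the real values are whatever each device type registers with), and `pick_device_type` is a hypothetical helper, not a TensorFlow function.

```python
# Hypothetical priorities: a higher number means the device type is preferred.
DEVICE_PRIORITY = {"GPU": 200, "CPU": 50}


def pick_device_type(supported_types, priority=DEVICE_PRIORITY):
    """Return the highest-priority device type among an op's implementations."""
    if not supported_types:
        raise ValueError("op has no registered implementation")
    return max(supported_types, key=lambda t: priority.get(t, 0))


print(pick_device_type(["CPU", "GPU"]))  # op implemented for both device types
print(pick_device_type(["CPU"]))         # op implemented for CPU only
```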

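3. Keep the program runnable (Runnable)

This is guaranteed by the Soft Placement mechanism, which can be set in the session config.

If a Node is explicitly pinned to a particular Device but no implementation exists for that Device, Soft Placement steps in to keep the program usable. It ignores the device type, picks another available implementation according to Device priority, and rewrites the Placement.

For example, suppose a Node's op is SparseToDense and its device_type is set to GPU, but SparseToDense currently has only a CPU implementation in TensorFlow. Soft Placement then rewrites the Node's device_type to CPU.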
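The fallback described above can be sketched as follows. This is a simplified model under my own assumptions: `resolve_device_type` is a hypothetical function, the priority table is invented, and the error raised when Soft Placement is off only stands in for whatever error the real system reports.

```python
def resolve_device_type(requested, supported, allow_soft_placement, priority):
    """Honour the request if an implementation exists; otherwise fall back
    by priority, but only when soft placement is enabled."""
    if requested in supported:
        return requested
    if not allow_soft_placement:
        raise RuntimeError(
            f"no {requested} implementation and soft placement is disabled")
    return max(supported, key=lambda t: priority[t])


priority = {"GPU": 2, "CPU": 1}     # hypothetical ordering
sparse_to_dense_impls = ["CPU"]     # per the text: CPU only

print(resolve_device_type("GPU", sparse_to_dense_impls, True, priority))
```

With the third argument set to False the same call raises instead of silently moving the Node, which is the behaviour you would expect when Soft Placement is turned off.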

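4. Take locality into account where possible (Near When Possible)

This part is more involved, but it stays manageable if you focus on the key point: three special kinds of Op whose nature means their neighbors need special handling.

Generator Ops: Ops with in-degree 0 and out-degree 1
MetaData Ops: Ops that operate directly on a Tensor's metadata without touching the Tensor's contents (for example Shape, Size or Rank)
Ref or Resource Ops: Ops such as Variable that can be assigned to (in other words, lvalues)

Placer uses the following three heuristic rules to deal with these three kinds of Op.

If a Node is a GeneratorNode, placing it on the same Device as its Consumer avoids pointless cross-Device copies. The algorithm calls this heuristic rule A;

If a Node is a MetaDataNode, placing it on the same Device as its Producer likewise avoids pointless cross-Device copies. The algorithm calls this heuristic rule B;

If an input of a Node is a Reference type or a Resource type, Placer tries to put the Node in the same Colocation Group as that input. Take Variable: an assign on it should clearly run wherever the Variable lives, and having the Variable at A and its assign at B would make no sense. The algorithm gives this step no name, so for convenience we call it heuristic rule C.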
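As a rough sketch of how rules A and B are triggered, the snippet below classifies toy nodes. Everything here is illustrative: the `Node` dataclass is my own, and the set of ops treated as MetaData ops is an assumption rather than a list taken from the source. Rule C depends on edge types, so it is not covered by this node-level check.

```python
from dataclasses import dataclass, field

# Assumption: these shape-query ops stand in for the MetaData class.
METADATA_OPS = {"Shape", "Size", "Rank"}


@dataclass
class Node:
    name: str
    op: str
    inputs: list = field(default_factory=list)     # producer node names
    consumers: list = field(default_factory=list)  # consumer node names


def is_generator(node):
    """In-degree 0 and out-degree 1, as defined in the text."""
    return len(node.inputs) == 0 and len(node.consumers) == 1


def is_metadata(node):
    """Works on tensor metadata only (the membership test is an assumption)."""
    return node.op in METADATA_OPS


def applicable_rule(node):
    if is_generator(node):
        return "A"  # follow the consumer
    if is_metadata(node):
        return "B"  # follow the producer
    return None


const = Node("c", "Const", consumers=["add"])
shape = Node("s", "Shape", inputs=["x"], consumers=["y"])
add = Node("add", "Add", inputs=["c", "x"], consumers=["out"])
print([applicable_rule(n) for n in (const, shape, add)])
```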

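7. Overall Flow of Placer's Decisions

The overall flow has four steps, which a flowchart in the original post shows at a high level. The last two steps are relatively complex, and the next section breaks them down further.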

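8. Placer Step by Step, with Key Code

Note: when reading the source in this section, focus on the structure rather than getting caught up in every detail.

Step 1: aggregate Groups according to externally specified Colocation

Normally, a Node for which the user gave no Colocation Group information is put into a Group of its own as the only member, and the Group is named after the Node. Every Node in the Graph therefore has its own Colocation Group.

Logically, merging several Groups is a simple problem. In this setting, however, a Group is not just a set of Nodes; it also carries attributes. For example, a Group's possible devices are the set of all Devices the Group can use.

We therefore need a data structure and algorithm that make it easy to produce the new Group and its attributes when two Groups are merged (convenient Union), and to look up all attributes of the Group a Node belongs to (fast Find). This is exactly where Find-Union shines.

The principle of Find-Union is not described here. I only show the basic data structure the code uses for it, Member, which describes the basic information of a Group. Before reading the comments in the code below, you need a basic understanding of what the tree structure in Find-Union means.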

// Represents a node in the disjoint node set forest, and the
// accumulated constraints on the device used by that node.
struct Member {
  Member() = default;
  // The id of the node that is the parent of this one, or its own
  // id if it is a root. parent <= 0 indicates that this member is invalid.
  int parent = -1;

  // A proxy for the depth of the tree that is used to prefer
  // connecting smaller trees to larger trees when merging disjoint
  // sets.
  int rank = 0;

  // The intersection of all device types supported by this node,
  // and those of all of its children, in priority order
  // of the preferred device.
  DeviceTypeVector supported_device_types;

  // The merged form of the device requested for this node, with
  // those of all of its children.
  DeviceNameUtils::ParsedName device_name;

  // If this node is a root, stores a list of Devices to which this node
  // and all of its children have been assigned, or nullptr if this
  // has not yet been computed.
  std::vector<Device*> possible_devices;
};
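What makes this union-find different from the textbook version is that a Union also has to merge the attributes stored at the roots. The sketch below shows the idea for supported_device_types only, using plain dicts in place of Member. The function names are hypothetical, and the real merge also has to reconcile device_name, which is left out here.

```python
def merge_supported_types(a, b):
    """Intersect two priority-ordered lists of device types, keeping a's order."""
    allowed = set(b)
    return [t for t in a if t in allowed]


def merge_groups(root, child):
    """Fold child's constraints into root (both are plain dicts in this sketch)."""
    merged = merge_supported_types(root["supported_device_types"],
                                   child["supported_device_types"])
    if not merged:
        raise ValueError("groups share no device type and cannot be colocated")
    root["supported_device_types"] = merged
    child["parent"] = root["id"]  # child now hangs under root in the forest
    return root


# Hypothetical members: one runs on GPU or CPU, the other on CPU only.
matmul = {"id": 1, "parent": 1, "supported_device_types": ["GPU", "CPU"]}
cpu_only = {"id": 2, "parent": 2, "supported_device_types": ["CPU"]}

merge_groups(matmul, cpu_only)
print(matmul["supported_device_types"], cpu_only["parent"])
```

The point of the example is that colocating a GPU-capable Node with a CPU-only Node narrows the whole Group to CPU, which is one way a user's GPU request can end up overridden.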

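The code below is the core of this step. It first creates a ColocationGraph object, a utility class for handling Colocation Groups that uses Find-Union internally to aggregate Groups.

After calling InitializeMembers to initialize the basic Find-Union data structures, it calls ColocateAllNodes directly to aggregate according to all the colocation information specified by the user.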

ColocationGraph colocation_graph(
    graph_, devices_,
    options_ == nullptr || options_->config.allow_soft_placement(),
    default_device_);

TF_RETURN_IF_ERROR(colocation_graph.InitializeMembers());

// 1. First add all of the nodes. Note that steps (1) and (2)
// requires two passes over the nodes because the graph (and hence
// the constraints) may not be acyclic.
TF_RETURN_IF_ERROR(colocation_graph.ColocateAllNodes());

Step 2: apply heuristic rule C (Placement of Ref Ops)

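This step adjusts the Colocation Groups. While traversing every Node in the Graph, Placer looks at the Node's inputs to decide whether the Group containing the Node should be merged with the Group containing the Source Node.

A note on DT_RESOURCE first: you normally only meet it when using ResourceVariable. Compared with Variable, ResourceVariable has many new features, which TF 2.0 promotes. Its advantages are not covered here; only its Op type matters for this discussion. At the C++ level a Variable has Op type VariableV2, while a ResourceVariable has Op type VarHandleOp, and the Tensor produced by the latter is a DT_RESOURCE.

If an input of the Node is a Reference type or DT_RESOURCE, Placer tries to merge the two Groups. Before merging it runs the necessary feasibility checks and raises an error where appropriate. For example, besides the edge between this pair of nodes, it also has to consider whether the Node's other inputs are Reference type or DT_RESOURCE. The code for this part is long but its logic is simple, so it is not shown here.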
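Since the original code is not shown, here is a stripped-down sketch of the merging idea only. It leaves out all the feasibility checks mentioned above, uses a minimal dict-based union-find, and labels each edge with a made-up `kind` string; none of the names come from TensorFlow.

```python
def find(parent, x):
    """Find the root of x, halving the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def apply_heuristic_c(nodes, edges):
    """edges: (src, dst, kind) with kind in {'ref', 'resource', 'value'}.
    Returns a mapping from node name to the root of its Group."""
    parent = {n: n for n in nodes}
    for src, dst, kind in edges:
        if kind in ("ref", "resource"):
            # The consumer joins the Group of the node producing the ref/resource.
            parent[find(parent, dst)] = find(parent, src)
    return {n: find(parent, n) for n in nodes}


# Hypothetical graph: an assign consumes a variable by reference.
nodes = ["var", "const", "assign", "add"]
edges = [("var", "assign", "ref"),
         ("const", "assign", "value"),
         ("const", "add", "value")]

group_of = apply_heuristic_c(nodes, edges)
print(group_of)
```

Only the ref edge causes a merge, so assign ends up in the Variable's Group while const and add keep Groups of their own.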

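Step 3: apply heuristic rule B (Placement of MetaData Ops)

From this step on, Placer actually starts assigning a Device to each Node. The logic of this step, drawn as a flowchart in the original post, is as follows.

If the Node's device attribute already has a value, Placer does not assign it again and simply skips the Node;

If the Node is a GeneratorNode, it is first put into a vector named second_pass;

If neither case applies, the Node is one that this step must handle. Placer first fetches the available Devices from the Node's Colocation Group as candidates (the result is affected by Soft Placement). If the Node is a MetaData node, Placer tries to apply heuristic rule B; otherwise it assigns the highest-priority Device in the candidate set.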

int assigned_device = -1;

// Heuristic B: If the node only operates on metadata, not data,
// then it is desirable to place that metadata node with its
// input.
if (IsMetadata(node)) {
  // Make sure that the input device type is in the list of supported
  // device types for this node.
  const Node* input = (*node->in_edges().begin())->src();
  // TODO(vrv): if the input is empty, consider postponing this
  // node's assignment to the second pass, so that we handle the
  // case where a metadata node's input comes from a backedge
  // of a loop.
  if (CanAssignToDevice(input->assigned_device_name(), *devices)) {
    assigned_device = input->assigned_device_name_index();
  }
}

// Provide the default, if necessary.
if (assigned_device == -1) {
  assigned_device = graph_->InternDeviceName((*devices)[0]->name());
}

AssignAndLog(assigned_device, node);

Step 4: apply heuristic rule A (Placement of Generator Ops)

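This step assigns a Device to every Node in the second_pass vector.

Everything in second_pass is a GeneratorNode, so only heuristic rule A needs to be applied. As in step 3, the rule is applied on a best-effort basis: if it cannot be satisfied, the Node gets the highest-priority Device among its candidates. The code below is the part that applies heuristic rule A.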

int assigned_device = -1;

// Heuristic A application.
if (IsGeneratorNode(node)) {
  const Node* output = (*node->out_edges().begin())->dst();
  int output_device_name = output->assigned_device_name_index();

  const bool consumers_on_same_device = std::all_of(
      node->out_edges().begin(), node->out_edges().end(),
      [output_device_name](const Edge* e) {
        return e->dst()->assigned_device_name_index() == output_device_name;
      });

  if (consumers_on_same_device &&
      CanAssignToDevice(output->assigned_device_name(), *devices)) {
    assigned_device = output_device_name;
  }
}

// Provide the default, if necessary.
if (assigned_device == -1) {
  assigned_device = graph_->InternDeviceName((*devices)[0]->name());
}

AssignAndLog(assigned_device, node);
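To see steps 3 and 4 working together, here is a simplified two-pass sketch in Python. It is a toy model under several assumptions: nodes are plain dicts visited in list order, `candidates` maps each node to its allowed devices with the highest priority first, the set of MetaData ops is assumed, and the function `place` is hypothetical.

```python
METADATA_OPS = {"Shape", "Size", "Rank"}  # assumption for this sketch


def place(nodes, candidates):
    """Return a mapping from node name to device after both passes."""
    assigned = {n["name"]: n["device"] for n in nodes if n.get("device")}
    second_pass = []

    # First pass: skip pre-assigned nodes, defer generators, apply rule B.
    for n in nodes:
        name = n["name"]
        if name in assigned:
            continue
        if not n["inputs"] and len(n["consumers"]) == 1:
            second_pass.append(n)
            continue
        device = None
        if n["op"] in METADATA_OPS and n["inputs"]:
            producer_device = assigned.get(n["inputs"][0])
            if producer_device in candidates[name]:
                device = producer_device
        assigned[name] = device or candidates[name][0]

    # Second pass: generators follow their single consumer (rule A).
    for n in second_pass:
        name = n["name"]
        consumer_device = assigned.get(n["consumers"][0])
        if consumer_device in candidates[name]:
            assigned[name] = consumer_device
        else:
            assigned[name] = candidates[name][0]
    return assigned


GPU, CPU = "/device:GPU:0", "/device:CPU:0"
nodes = [
    {"name": "x", "op": "Placeholder", "inputs": [],
     "consumers": ["shape", "other"], "device": CPU},
    {"name": "const", "op": "Const", "inputs": [], "consumers": ["cpu_op"]},
    {"name": "shape", "op": "Shape", "inputs": ["x"], "consumers": []},
    {"name": "cpu_op", "op": "SparseToDense", "inputs": ["const"], "consumers": []},
]
candidates = {"x": [GPU, CPU], "const": [GPU, CPU],
              "shape": [GPU, CPU], "cpu_op": [CPU]}

result = place(nodes, candidates)
print(result)
```

In this toy graph the default choice for shape and const would be the GPU, but rule B pulls shape next to its producer and rule A pulls const next to its CPU-only consumer.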

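At this point the Placement of every Node has been assigned and fine-tuned.

9. Summary

A GraphDef that has been through Placer has had all conflicts between explicit and implicit Placement information resolved, which is why Placer can be called the last line of defense.

After Placer, the GraphDef is handed to the GraphPartitioner module, which splits it into subgraphs according to each Node's device and inserts Send, Recv and any necessary ControlFlow nodes. The Placer step is therefore indispensable.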
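To show why the partitioner needs a final device on every Node, the sketch below groups nodes by device and lists the edges that cross a device boundary, which is where a Send/Recv pair would be inserted. It is a toy model, not the GraphPartitioner code; `partition` and the sample graph are hypothetical, and ControlFlow handling is ignored.

```python
from collections import defaultdict


def partition(assigned, edges):
    """assigned: node -> device. edges: (src, dst) pairs.
    Returns (subgraphs, cut_edges)."""
    subgraphs = defaultdict(list)
    for node, device in assigned.items():
        subgraphs[device].append(node)
    # An edge whose endpoints sit on different devices needs Send and Recv.
    cut_edges = [(s, d) for s, d in edges if assigned[s] != assigned[d]]
    return dict(subgraphs), cut_edges


assigned = {"a": "/device:CPU:0", "b": "/device:GPU:0", "c": "/device:GPU:0"}
edges = [("a", "b"), ("b", "c")]

subgraphs, cut_edges = partition(assigned, edges)
print(subgraphs)
print(cut_edges)
```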

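We can also see that the core of the Placer module is fine-tuning Placement. Because the heuristic rules are fairly simple, performance problems are not fully solved. It is easy to see that in distributed mode a crude Placement plan can make a job perform very badly, because it introduces communication overhead on top of the computation.

TensorFlow's highly flexible Placement control interface leaves a lot of room for designing model-parallel strategies, which is one of the hot topics in DL systems research. Automating the Placement strategy and hiding it inside the framework seems to be something users care about a great deal. It would make the framework easier to use, let users focus entirely on the model and algorithm, and help beginners avoid writing slow programs.

Searching automatically for the best Placement strategy is very hard, however. It has to account for cluster communication bandwidth and the compute cost of each Op, which makes it a complex problem tightly coupled to hardware and environment. On top of that, deep learning models usually contain thousands or tens of thousands of Nodes, so the search space is enormous.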
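A quick back-of-the-envelope calculation shows how large that space is. If every Node could go on any Device independently, there would be num_devices ** num_nodes placements. The snippet computes the number of decimal digits involved; the figures of 10,000 Nodes and 8 Devices are made up for the example, and real constraints such as Colocation Groups shrink the space somewhat.

```python
import math


def log10_num_placements(num_nodes, num_devices):
    """log10 of num_devices ** num_nodes, the count of unconstrained placements."""
    return num_nodes * math.log10(num_devices)


# Hypothetical model size and device count.
print(round(log10_num_placements(10_000, 8)))
```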

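There are many competing approaches to this problem. If you are interested in placement strategies, I recommend a paper published by Google at ICML that uses reinforcement learning to search for better sharding strategies: Device Placement Optimization with Reinforcement Learning.

Device Placement Optimization with Reinforcement Learning: https://arxiv.org/abs/1706.04972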
