How to NOT deploy Keras/TensorFlow models
While most articles about deep learning focus on the modeling part, few cover how to deploy such models to production. Some of them say "production", but they often simply use the un-optimized model and embed it into a Flask web server. In this post, I will explain why this approach does not scale well and wastes resources.
The "production" approach
If you search for how to deploy TensorFlow, Keras or PyTorch models to production, there are a lot of good tutorials, but sometimes you come across very simple examples claiming to be production-ready. These examples often use the raw Keras model, a Flask web server, and containerize everything into a Docker container. They use Python to serve predictions. The code for these "production" Flask web servers looks like this:
```python
from flask import Flask, jsonify, request
from tensorflow import keras

app = Flask(__name__)
model = keras.models.load_model("model.h5")

@app.route("/", methods=["POST"])
def index():
    data = request.json
    prediction = model.predict(preprocess(data))
    return jsonify({"prediction": str(prediction)})
```
Furthermore, they often show how to containerize the Flask server and bundle it with your model into a Docker image. These approaches also claim that they can easily scale by increasing the number of Docker instances.
Now let us recap what happens here and why it is not "production" grade.
Not optimizing models
First, the model is usually used as-is, which means the Keras model from the example was simply exported with model.save(). The model includes all the parameters and gradients that are necessary to train it, but not required for inference. Also, the model is neither pruned nor quantized. As a result, un-optimized models have higher latency, need more compute, and are larger in file size.
Example with EfficientNet-B5:

- h5 Keras model: 454 MByte
- Optimized TensorFlow model (no quantization): 222 MByte
Using Flask and the Python API
The next problem is that plain Python and Flask are used to load the model and serve predictions. This brings a lot of problems.
First, let's look at the worst thing you can do: loading the model for each request. In the code example above, the model is loaded when the script starts, but other tutorials move this part into the predict function. That loads the model every single time you make a prediction. Please do not do that.
That being said, let's look at Flask. Flask includes an easy-to-use web server for development. On the official website, you can read the following:
While lightweight and easy to use, Flask’s built-in server is not suitable for production as it doesn’t scale well.
That said, you can use Flask as a WSGI app, e.g. in Google App Engine. However, many tutorials do not use Google App Engine or NGINX; they just use the development server as-is and put it into a Docker container. But even when they use NGINX or any other web server, they usually turn off multi-threading completely.
Let's look a bit deeper into the problem here. If you use TensorFlow, it handles the compute resources (CPU, GPU) for you. When you load a model and call predict, TensorFlow uses those compute resources to make the prediction. While this happens, the resource is in use, i.e. locked. As long as your web server serves only a single request at a time, you are fine: the model was loaded in this thread, and predict is called from the same thread. But once you allow more than one request at a time, the web server stops working, because you simply cannot access a TensorFlow model from different threads. That being said, in this setup you cannot process more than one request at once. Doesn't really sound scalable, right?
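The effect can be sketched with plain Python threads. This is purely illustrative, not TensorFlow code: a sleep stands in for inference, and a shared lock plays the role of the single in-process model that only one thread may use at a time:

```python
import threading
import time

model_lock = threading.Lock()

def predict(_x):
    # Only one thread can "use the model" at a time
    with model_lock:
        time.sleep(0.2)   # stand-in for the actual inference work

start = time.perf_counter()
threads = [threading.Thread(target=predict, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
# Four "parallel" requests still take roughly 4 x 0.2 s,
# because they all serialize on the lock.
```

Adding worker threads to such a server adds no throughput at all; the requests simply queue up behind each other.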
Example:

- Flask development web server: 1 simultaneous request
- TensorFlowX model server: parallelism configurable
Scaling "low-load" instances with Docker
Ok, the web server does not scale, but what about scaling the number of web servers? In a lot of examples, this approach is the answer to the scaling problem of single instances. There is not much to say about it; sure, it works. But scaling this way wastes money, resources and energy. It's like having a truck and putting in one single parcel, and once there are more parcels, you get another truck, instead of using the existing truck more cleverly.
Example latency:

- Flask serving as shown above: ~2 s per image
- TensorFlow model server (no batching, no GPU): ~250 ms per image
- TensorFlow model server (no batching, GPU): ~100 ms per image
Not using GPUs/TPUs
GPUs made deep learning possible, as they can do operations massively in parallel. When using Docker containers to deploy deep learning models to production, most examples do NOT utilize GPUs; they don't even use GPU instances. The prediction time for each request is magnitudes slower on CPU machines, so latency is a big problem. Even with powerful CPU instances, you will not achieve results comparable to the small GPU instances.
Just a side note: in general, it is possible to use GPUs in Docker if the host has the correct drivers installed. Docker is completely fine for scaling up instances, but scale up the correct instances.
Example costs:

- 2 CPU instances (16 cores, 32 GByte, a1.4xlarge): 0.816 $/h
- 1 GPU instance (32 GByte RAM, 4 cores, Tesla M60, g3s.xlarge): 0.75 $/h
It's already solved
As you can see, loading a trained model and putting it into a Flask Docker container is not an elegant solution. If you want deep learning in production, start from the model, then think about servers, and finally about scaling instances.
Optimize the model
Unfortunately, optimizing a model for inference is not as straightforward as it should be. However, it can easily reduce inference time by multiples, so it's worth it without doubt. The first step is freezing the weights and removing all the training overhead. This can be achieved with TensorFlow directly, but requires you to convert your model into either an estimator or a TensorFlow graph (SavedModel format) if you come from a Keras model. TensorFlow itself has a tutorial for this. To optimize further, the next step is to apply model pruning and quantization, where insignificant weights are removed and the model size is reduced.
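Assuming TensorFlow 2 is installed, a minimal sketch of these steps could look like the following. The file names and both helper functions are placeholders for illustration, not code from any tutorial:

```python
from pathlib import Path

def export_for_serving(h5_path: str, export_root: str, version: int = 1) -> str:
    """Convert a Keras .h5 checkpoint into the versioned SavedModel layout
    (<export_root>/<version>/) that tensorflow_model_server expects."""
    from tensorflow import keras  # imported lazily: requires TensorFlow

    export_dir = str(Path(export_root) / str(version))
    model = keras.models.load_model(h5_path)
    # Dropping the optimizer state removes the training-only parts of the file.
    model.save(export_dir, include_optimizer=False)
    return export_dir

def quantize(saved_model_dir: str, tflite_path: str) -> None:
    """Optional second step: post-training quantization via TF Lite,
    which shrinks the model further at a small accuracy cost."""
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    Path(tflite_path).write_bytes(converter.convert())
```

For example, `export_for_serving("model.h5", "models/my_model")` would produce `models/my_model/1/`, which a model server can pick up directly.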
Use model servers
When you have an optimized model, you can look at different model servers, which are meant for deep learning models in production. For TensorFlow and Keras, TensorFlowX offers the TensorFlow model server. There are also others like TensorRT, Clipper, MLFlow and DeepDetect.
The TensorFlow model server offers several features. It serves multiple models at the same time while keeping the overhead to a minimum. It allows you to version your models, without downtime when deploying a new version, while still being able to serve the old one. Besides the gRPC API, it also has an optional REST API endpoint. The throughput is magnitudes higher than with a Flask API, as it is written in C++ and uses multi-threading. Additionally, you can even enable batching, where the server combines multiple single predictions into one batch for very high-load settings. And finally, you can put it into a Docker container and scale even further.
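A minimal client sketch for the REST endpoint, using only the Python standard library. The host, port 8501 (the model server's default REST port) and the model name "my_model" are assumptions for illustration:

```python
import json
from urllib import request

SERVER = "http://localhost:8501"  # assumed host; 8501 is the default REST port

def build_predict_request(model_name, instances):
    """Build a POST request for the model server's REST predict endpoint.
    The body format {"instances": [...]} follows the TF Serving REST API."""
    url = f"{SERVER}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return request.Request(
        url, data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def predict(model_name, instances):
    """Send the request and unpack the "predictions" field of the response."""
    with request.urlopen(build_predict_request(model_name, instances)) as resp:
        return json.load(resp)["predictions"]
```

Calling `predict("my_model", [[1.0, 2.0]])` against a running server would return the model's predictions for that batch; the preprocessing from the Flask example would happen on the client side or in a thin gateway.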
Hint: tensorflow_model_server is available on every AWS EC2 Deep Learning AMI image; with TensorFlow 2, it's called tensorflow2_model_server.
Use GPU instances
And lastly, I would recommend using GPUs or TPUs for inference environments. Latency and throughput are much better with such accelerators, while saving energy and money. Note that they are only utilized if your software stack can use the power of the GPU (optimized model + model server). In AWS, you can look into Elastic Inference or just use a GPU instance with a Tesla M60 (g3s.xlarge).
Originally posted on digital-thnking.de
Source: https://towardsdatascience.com/how-to-not-deploy-keras-tensorflow-models-4fa60b487682