Optimizing Shader Info Loading, or Look at Yer Data!
A story about a million shader variants, optimizing using Instruments and looking at the data to optimize some more.
The Bug Report
The bug report I was looking into was along the lines of "when we put these shaders into our project, then building a game becomes much slower – even if shaders aren't being used".
Indeed it was. A quick look revealed that for ComplicatedReasons(tm) we load information about all shaders during the game build – that explains why the slowdown was happening even if shaders were not actually used.
This issue must be fixed! There’s probably no really good reason we must know about all the shaders for a game build. But to fix it, I’ll need to pair up with someone who knows anything about game data build pipeline, our data serialization and so on. So that will be someday in the future.
Meanwhile… another problem was that loading the "information for a shader" was slow in this project. Did I say slow? It was very slow.
That’s a good thing to look at. Shader data is not only loaded while building the game; it’s also loaded when the shader is needed for the first time (e.g. clicking on it in Unity’s project view); or when we actually have a material that uses it etc. All these operations were quite slow in this project.
Turns out this particular shader had a massive internal variant count. In Unity, what looks like "a single shader" to the user often has many variants inside (to handle different lights, lightmaps, shadows, HDR and whatnot – typical ubershader setup). Usually shaders have from a few dozen to a few thousand variants. This shader had 1.9 million. And there were about ten shaders like that in the project.
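As a sanity check on that number: variant counts multiply across keyword axes, so a handful of independent on/off features is enough to reach millions. A minimal sketch (the feature count here is an illustration, not the actual shader's setup):

```cpp
#include <cassert>
#include <cstdint>

// n independent on/off keywords alone yield 2^n variants; 2^21 is already
// about 2.1 million, in the ballpark of the 1.9M-variant shader above.
// Multi-way keyword sets (pick one of k) multiply the count further.
uint64_t VariantCount(int onOffFeatures)
{
    return 1ull << onOffFeatures;
}
```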
The Setup
Let’s create several shaders with different variant counts for testing: 27 thousand, 111 thousand, 333 thousand and 1 million variants. I’ll call them 27k, 111k, 333k and 1M respectively. For reference, the new “Standard” shader in Unity 5.0 has about 33 thousand internal variants. I’ll do tests on MacBook Pro (2.3 GHz Core i7) using 64 bit Release build.
Things I’ll be measuring:
Import time. How much time it takes to reimport the shader in the Unity editor. Since Unity 4.5 this doesn't do much of actual shader compilation; it just extracts information about shader snippets that need compiling, the variants that are there, etc.
Imported data size. How large the imported shader data is (serialized representation of the actual shader asset; i.e. the files that live in the Library/metadata folder of a Unity project).
Load time. How much time it takes to load this imported shader data when it is needed.
So the data is:
Shader   Import    Load     Size
27k       420ms   120ms    6.4MB
111k     2013ms   492ms   27.9MB
333k     7779ms  1719ms   89.2MB
1M      16192ms  4231ms  272.4MB
Enter Instruments
Last time we used xperf to do some profiling. We're on a Mac this time, so let's use Apple Instruments. Just like xperf, Instruments can show a lot of interesting data. We're looking at the most simple one, "Time Profiler" (though profiling Zombies is very tempting!). You pick that instrument, attach to the executable, start recording, and get some results out.
You then select the time range you're interested in, and expand the stack trace. Protip: Alt-Click (ok ok, Option-Click you Mac peoples) expands the full tree.
So far the whole stack is just going deep into Cocoa stuff. “Hide System Libraries” is very helpful with that:
Another very useful feature is inverting the call tree, where the results are presented from the heaviest “self time” functions (we won’t be using that here though).
When hovering over an item, an arrow is shown on the right (see image above). Clicking on that does "focus on subtree", i.e. ignores everything outside of that item, and time percentages are shown relative to the item. Here we've focused on ShaderCompilerPreprocess (which does the majority of shader "importing" work).
Looks like we’re spending a lot of time appending to strings. That usually means strings did not have enough storage buffer reserved and are causing a lot of memory allocations. Code change:
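The actual Unity code isn't shown here, but the nature of the fix can be sketched like this (the function name and the capacity estimate are made up for illustration): reserve the final size once, instead of letting the string regrow on every append.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: concatenating many macro strings into one buffer.
// Without reserve(), each += may reallocate and copy the whole string;
// with a capacity estimate up front there is a single allocation.
std::string JoinMacros(const std::vector<std::string>& macros)
{
    size_t total = 0;
    for (const std::string& m : macros)
        total += m.size() + 1;   // +1 for the separator
    std::string result;
    result.reserve(total);       // the one-line fix
    for (const std::string& m : macros)
    {
        result += m;
        result += ' ';
    }
    return result;
}
```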
This small change has cut down shader importing time by 20-40%!?Very nice!
I did a couple of other small tweaks from looking at this profiling data – none of them resulted in any significant benefit though.
Profiling shader load time also shows that most of the time ends up being spent loading editor-related data, which is arrays of arrays of strings and so on:
I could have picked functions from the profiler results, gone through each of them and optimized, and perhaps achieved a solid 2-3x improvement over the initial results. Very often that's enough to be proud!
However…
Taking a step back
Or like Mike Acton would say, "look at your data!" (check his CppCon2014 slides or video). Another saying is also applicable: "think!"
Why do we have this problem to begin with?
For example, in 333k variant shader case, we end up sending 610560 lines of shader variant information between shader compiler process & editor, with macro strings in each of them. In total we’re sending 91 megabytes of data over RPC pipe during shader import.
One possible area for improvement: the data we send over and store in imported shader data is a small set of macro strings repeated over and over and over again. Instead of sending or storing the strings, we could just send the set of strings used by a shader once, assign numbers to them, and then send & store the full set as lists of numbers (or fixed size bitmasks). This should cut down on the amount of string operations we do (massively cut down on number of small allocations), size of data we send, and size of data we store.
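A sketch of that first idea, with hypothetical names (the real serialization code differs): intern each macro string once, then represent a variant as a fixed-size bitmask of indices into the table.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical macro string table: each distinct string is stored once and
// gets a small index; variants then travel as fixed-size bitmasks instead
// of repeated lists of strings.
struct MacroTable
{
    std::unordered_map<std::string, uint32_t> indexOf;
    std::vector<std::string> names;

    uint32_t Intern(const std::string& name)
    {
        auto it = indexOf.find(name);
        if (it != indexOf.end())
            return it->second;   // seen before: no new string stored
        uint32_t idx = (uint32_t)names.size();
        names.push_back(name);
        indexOf.emplace(name, idx);
        return idx;
    }
};

// A variant as a bitmask over interned macros (handles up to 64 macros here).
uint64_t VariantKey(MacroTable& table, const std::vector<std::string>& macros)
{
    uint64_t key = 0;
    for (const std::string& m : macros)
        key |= 1ull << table.Intern(m);
    return key;
}
```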
Another possible approach: right now we have source data in shader that indicate which variants to generate. This data is very small: just a list of on/off features, and some built-in variant lists (“all variants to handle lighting in forward rendering”). We do the full combinatorial explosion of that in the shader compiler process, send the full set over to the editor, and the editor stores that in imported shader data.
But the way we do the "explosion of source data into full set" is always the same. We could just send the source data from shader compiler to the editor (a very small amount!), and furthermore, just store that in imported shader data. We can rebuild the full set when needed at any time.
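The expansion itself is deterministic, so only its inputs need to be stored. Sketching it with hypothetical types: each source line contributes one keyword choice per variant, and the full set is the cartesian product of all lines, rebuildable at any time.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical expansion: each line is a set of mutually exclusive keywords
// (exactly one is picked per variant); the full variant set is the cartesian
// product of all lines, so the variant count is the product of line sizes.
std::vector<std::vector<std::string>> ExpandVariants(
    const std::vector<std::vector<std::string>>& lines)
{
    std::vector<std::vector<std::string>> result(1); // one empty variant
    for (const auto& line : lines)
    {
        std::vector<std::vector<std::string>> next;
        next.reserve(result.size() * line.size());
        for (const auto& partial : result)
            for (const auto& keyword : line)
            {
                auto v = partial;
                v.push_back(keyword);
                next.push_back(std::move(v));
            }
        result = std::move(next);
    }
    return result;
}
```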
Changing the data
So let’s try to do that. First let’s deal with RPC only, without changing serialized shader data. A few commits later…
This made shader importing over twice as fast!
Shader   Import
27k       419ms ->  200ms
111k     1702ms ->  791ms
333k     5362ms -> 2530ms
1M      16784ms -> 8280ms
Let's do the other part too, where we change the serialized shader variant data representation. Instead of storing the full set of possible variants, we only store the data needed to generate the full set:
Shader   Import               Load                 Size
27k       200ms ->   285ms    103ms ->   396ms     6.4MB -> 55kB
111k      791ms ->  1229ms    426ms ->  1832ms    27.9MB -> 55kB
333k     2530ms ->  3893ms   1410ms ->  5892ms    89.2MB -> 56kB
1M       8280ms -> 12416ms   4498ms -> 18949ms   272.4MB -> 57kB
Everything seems to work, and the serialized file size got massively decreased. But, both importing and loading got slower?! Clearly I did something stupid. Profile!
Right. So after importing or loading the shader (now a small file on disk), we generate the full set of shader variant data. Right now that results in a lot of string allocations, since it is generating arrays of arrays of strings or somesuch.
But we don't really need the strings at this point; for example, after loading the shader we only need the internal representation of the "shader variant key", which is a fairly small bitmask. A couple of tweaks to fix that, and we're at:
Shader  Import   Load
27k       42ms    7ms
111k      47ms   27ms
333k      94ms   76ms
1M       231ms  225ms
Look at that! Importing a 333k variant shader got 82 times faster; loading its metadata got 22 times faster, and the imported file size got over a thousand times smaller!
One final look at the profiler, just because:
Weird, time is spent in memory allocation, but there shouldn't be any at this point in that function; we aren't creating any new strings there. Ahh, implicit std::string to UnityStr (our own string class with better memory reporting) conversion operators (long story…). Fix that, and we've got another 2x improvement:
Shader  Import   Load
27k       42ms    5ms
111k      44ms   18ms
333k      53ms   46ms
1M       130ms  128ms
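The UnityStr class itself is internal, but the pitfall is a general C++ one and can be reproduced with a stand-in type: an implicit converting constructor silently copies (and allocates) every time a std::string is passed where the custom type is expected.

```cpp
#include <cassert>
#include <string>

static int g_hiddenCopies = 0;

// Stand-in for a custom string class. The non-explicit constructor lets the
// compiler convert std::string -> MyStr at every call site, copying the
// contents each time; marking it `explicit` surfaces those spots at
// compile time so they can be removed.
struct MyStr
{
    std::string data;
    MyStr(const std::string& s) : data(s) { ++g_hiddenCopies; } // implicit!
};

static size_t Length(const MyStr& s) { return s.data.size(); }

size_t SumLengths(const std::string& a, const std::string& b)
{
    // Looks allocation-free, but each call converts to MyStr behind our back.
    return Length(a) + Length(b);
}
```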
The code could still be optimized further, but there ain’t no easy fixes left I think. And at this point I’ll have more important tasks to do…
What we've got
So in total, here’s what we have so far:
Shader   Import                  Load                   Size
27k       420ms ->  42ms (10x)    120ms ->   5ms (24x)     6.4MB -> 55kB (119x)
111k     2013ms ->  44ms (46x)    492ms ->  18ms (27x)    27.9MB -> 55kB (519x)
333k     7779ms ->  53ms (147x)  1719ms ->  46ms (37x)    89.2MB -> 56kB (this is getting)
1M      16192ms -> 130ms (125x)  4231ms -> 128ms (33x)   272.4MB -> 57kB (ridiculous!)
And a fairly small pull request to achieve all this (~400 lines of code changed, ~400 new lines added – out of which half were new unit tests I wrote to feel safer before I started changing things):
Overall I've probably spent something like 8 hours on this – hard to say exactly, since I took some breaks and did other things. Also I was writing down notes & making screenshots for the blog too :) The fix/optimization is already in Unity 5.0 beta 20, by the way.
Conclusion
Apple's Instruments is a nice profiling tool (and unlike xperf, the UI is not intimidating…).
However, Profiler Is Not A Replacement For Thinking! I could have just looked at the profiling results and tried to optimize "what's at the top of the profiler" one by one, and maybe achieved 2-3x better performance. But by thinking about the actual problem and why it happens, I got a way, way better result.
Happy thinking!
Translated from: https://blogs.unity3d.com/2015/01/18/optimizing-shader-info-loading-or-look-at-yer-data/