Adaptive Pair Trading under COVID-19: A Reinforcement Learning Approach
Abstract
This is one of the articles in A.I. Capital Management’s Research Article Series, with the intro article here. This one is about applying RL to market-neutral strategies; specifically, optimizing a simple pair trading strategy with an RL agent acting as the capital allocator on a per-trade basis, while leaving the entry/exit signal untouched. The goal is to optimize an existing signal’s sequential trade-size allocation while letting the agent adapt its actions to market regimes and conditions.
Author: Marshall Chang is the founder and CIO of A.I. Capital Management, a quantitative trading firm built on the end-to-end application of Deep Reinforcement Learning to momentum and market-neutral trading strategies. The company primarily trades the Foreign Exchange markets at mid-to-high frequencies.
Overview
Pairs trading is the foundation of market-neutral strategy, one of the most sought-after quantitative trading strategies because it profits not from market direction but from the relative returns between a pair of assets, avoiding systematic risk and the complexity of the Random Walk. The profitability of market-neutral strategies lies in the assumed underlying relationship between pairs of assets; when that relationship no longer holds, often during volatile regime-shifting times such as this year with COVID-19, returns for such strategies generally diminish. In fact, according to HFR (Hedge Fund Research, Inc.), the HFRX Equity Hedge Index reported a YTD return of -9.74% by the end of July 2020[1]; its close relative, the HFRX Relative Value Arbitrage Index, reported a YTD return of -0.85%. It is no secret that for market-neutral quants, or perhaps any quants, the challenge is not just to find profitable signals, but to quickly detect regime shifts and adapt complex trading signals to them.
Within the field of market-neutral trading, most research has focused on uncovering correlations and refining signals, often using proprietary alternative data purchased at high cost to find an edge. However, optimization of capital allocation at the trade-size and portfolio level is often neglected. We found that many pair trading signals, though complex, still rely on fixed entry thresholds and linear allocations. With the recent advancement of complex models and learning algorithms such as Deep Reinforcement Learning (RL), this class of strategies is ripe for innovation through non-linear optimization.
Methodology — AlphaSpread RL Solution
To detect and adapt pair trading strategies through regime-shifting times, our approach is to solve trade-allocation optimization with a sequential agent-based solution trained directly on top of the existing signal generation process, with clearly tracked improvement and limited deployment overhead.
Internally named AlphaSpread, this project demonstrates the ROI (Return on Investment) improvement of RL sequential trade-size allocation over standard linear trade-size allocation on one-pair spread trading of U.S. S&P 500 equities. We take the existing pair trading strategy with a standard allocation per trade as the baseline, train an RL allocator represented by a deep neural network in our customized Spread Trading Gym environment, then test on out-of-sample data, aiming to outperform the baseline’s ending ROI.
Specifically, we select cointegrated pairs based on the stationarity of their spreads under our statistical models. Cointegrated pairs are usually within the same industry, but we also include cross-sectional pairs that show strong cointegration. Trading signals are generated when the z-score of the residuals predicted by the statistical model on daily close prices reaches a pre-defined threshold. The baseline for this example allocates a fixed 50% of the overall portfolio to each trading signal, whereas the RL allocator outputs a 0–100% allocation for each trading signal sequentially, based on the current market condition represented by a lookback window of z-scores.
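As a concrete contrast, here is a minimal sketch of the two allocation rules described above. The `placeholder_policy` and its volatility-scaling rule are hypothetical stand-ins for the trained network:

```python
import numpy as np

def baseline_allocation(signal: int) -> float:
    """Baseline: every trading signal receives a fixed 50% of the portfolio."""
    return 0.5 if signal != 0 else 0.0

def rl_allocation(zscore_lookback: np.ndarray, policy) -> float:
    """RL allocator: a policy maps the lookback window of z-scores to a
    continuous allocation in [0, 1]."""
    return float(np.clip(policy(zscore_lookback), 0.0, 1.0))

# Placeholder policy: shrink allocation as recent z-score volatility rises
# (purely illustrative; the article's allocator is a trained neural network).
def placeholder_policy(window: np.ndarray) -> float:
    return 1.0 / (1.0 + window.std())

window = np.array([0.2, -0.5, 1.1, -1.3, 0.8])
fixed = baseline_allocation(signal=1)                  # always 0.5
adaptive = rl_allocation(window, placeholder_policy)   # state-dependent
```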
AlphaSpread — In the video, the red NAV is a signal’s performance into the COVID months; the green one is the same strategy with our RL allocator. We learned that our RL agent can pick up regime shifts early on and allocate accordingly to avoid huge downturns.

Results Summary
We summarize our RL approach’s pairs trading ROI against the baseline linear allocation for 107 U.S. equity pairs traded. ROI is calculated from the ending NAV of the testing period against each pair’s $100,000 starting capital. The results come from back-testing on out-of-sample data between 2018 and April 2020 (COVID-19 months included). The RL allocators are trained on data between 2006 and 2017. In both cases, fees are not considered in the testing. We achieved an average per-pair ROI improvement of 9.82% over the baseline approach, with a maximum of 55.62% and a minimum of 0.08%.
In other words, with limited model tuning, this approach yields generalized ROI improvement through early detection of regime shifts and the corresponding capital-allocation adaptation by the RL allocator agent.
A snapshot of pair trading strategies’ ROIs, comparing base allocation and RL allocation

Discussions of Generalization
The goal of this project is to demonstrate out-of-sample generalization of the underlying improvements on a very simple one-pair trading signal, providing guidance on adapting this methodology to large-scale, complex market-neutral strategies to be deployed. Below is a discussion of the three goals we set out to achieve in this experiment.
Repeatability — This RL framework consists of a customized pairs-trading RL environment used to accurately train and test RL agents; RL training algorithms including DQN, DDPG, and Async Actor-Critic; and an RL automatic training roll-out mechanism that integrates prioritized experience replay, dynamic model tuning, exploration/exploitation scheduling, and so on, enabling repeatability across large datasets with minimal customization and hand tuning. The advantage of RL over other machine learning approaches is that it is an end-to-end system, from training data generation, reward function design, and model and learning algorithm selection through to the output of a sequential policy. A well-tuned system requires minimal maintenance, and retraining/readapting models to new data is done in the same environment.
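To make the environment idea concrete, here is a minimal reset/step sketch of a pairs-trading environment in the Gym style. The production environment described above is proprietary; the reward shaping and the toy P/L scaling below are illustrative assumptions:

```python
import numpy as np

class SpreadTradingEnv:
    """Minimal Gym-style pairs-trading environment sketch: the state is a
    lookback window of spread z-scores, the action is an allocation in
    [0, 1], and the reward is the step P/L relative to starting capital."""

    def __init__(self, zscores, lookback=20, threshold=1.0, capital=100_000.0):
        self.zscores = np.asarray(zscores, dtype=float)
        self.lookback = lookback
        self.threshold = threshold
        self.start_capital = capital

    def reset(self):
        self.t = self.lookback
        self.nav = self.start_capital
        return self._state()

    def _state(self):
        # State: lookback window of z-scores, as described in the article.
        return self.zscores[self.t - self.lookback:self.t]

    def step(self, allocation):
        z = self.zscores[self.t]
        # Entry signal: fade the spread when |z| crosses the threshold.
        signal = -np.sign(z) if abs(z) >= self.threshold else 0.0
        # Toy P/L: mean reversion of the spread, scaled by the allocation.
        pnl = allocation * signal * (self.zscores[self.t + 1] - z) * 0.01 * self.nav
        self.nav += pnl
        self.t += 1
        done = self.t >= len(self.zscores) - 1
        return self._state(), pnl / self.start_capital, done
```

A training loop would call `reset()`, then repeatedly feed the agent’s allocation into `step()` until `done`, accumulating transitions for replay.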
Sustainability — Under the one-pair trading example, the pairs’ cointegration tests and RL training were done using data from 2006 to 2017, and the trained agents were then tested from 2018 to early 2020; the training/testing data split is roughly 80:20. With RL automatic training roll-out, we generalize sustainable improvements over the baseline return for more than 2 years across hundreds of pairs. The RL agent learns to allocate according to a lookback of z-scores representing the pathway of the pair’s cointegration as well as its volatility, and is trained with exploration/exploitation to find a policy that maximizes ending ROI. Compared with traditional supervised and unsupervised learning on static input-output data, RL algorithms have built-in robustness for generalization in that they directly learn state-policy values with a reward function that reflects realized P/L. The RL training targets are non-static in that the training experience improves as the agent interacts with the environment and improves its policy, hence the reinforcement of good behavior and vice versa.
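The roughly 80:20 chronological split can be expressed as a simple date cutoff. A sketch, with the cutoff date following the article’s description:

```python
import numpy as np

def split_by_date(dates, values, cutoff=np.datetime64("2018-01-01")):
    """Chronological train/test split: everything before the cutoff is
    training data (2006-2017 here), the rest is held out for testing."""
    dates = np.asarray(dates, dtype="datetime64[D]")
    values = np.asarray(values)
    mask = dates < cutoff
    return values[mask], values[~mask]

# Example: daily dates spanning 2006 through April 2020.
dates = np.arange("2006-01-01", "2020-05-01", dtype="datetime64[D]")
train, test = split_by_date(dates, np.arange(len(dates)))
```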
Scalability — Training and deploying large-scale, end-to-end Deep RL trading algorithms is still in its infancy in quant trading, but we believe it is the future of alpha in our field, as RL has demonstrated dramatic improvement over traditional ML in the game space (AlphaGo, Dota, etc.). This RL framework is well suited to the different pair trading strategies deployed by market-neutral funds. With experience running RL systems in multiple avenues of quant trading, we can customize the environment, training algorithms, and reward function to effectively solve unique tasks in portfolio optimization, powered by RL’s agent-based sequential learning, which traditional supervised and unsupervised learning models cannot achieve.
Key Takeaway
If the signal makes money, it makes money with linear allocation (always trade x units). When it doesn’t, we obviously want to redo the signal and let it adapt to new market conditions. However, sometimes that is not easy to do, and a quick fix can be an RL agent/layer on top of the existing signal process. In our case, we let the agent observe a dataset that represents the volatility of the spreads and decide on the pertinent allocation based on past trades and P/L.
Background and More Details
Signal Generation Process — We first run a linear regression of one asset’s price on the other’s over the lookback history (2006–2017 daily prices), obtain the OLS residual, and run a unit-root test (Augmented Dickey–Fuller test) on it to check for cointegration. In this example, we set the p-value threshold at 0.5% to reject the unit-root hypothesis, which yields a universe of 2,794 S&P 500 pairs that pass the test. The next phase is setting the trigger conditions. First, we normalize the residual to obtain a vector that follows an assumed standard normal distribution. Most tests use the two-sigma level (covering 95%), which is relatively difficult to trigger; to generate enough trades for each pair, we set our threshold at one sigma. After normalization, we obtain white noise following N(0,1) and set +/- 1 as the threshold. Overall, the signal generation process is very straightforward: if the normalized residual goes above or below the threshold, we long the bearish leg and short the bullish leg, and vice versa. We only need to generate the trading signal for one asset; the other takes the opposite direction.
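A numpy-only sketch of this pipeline, using the basic (non-augmented) Dickey–Fuller statistic in place of the full ADF test and a critical-value cutoff instead of the 0.5% p-value. The helper names and the -3.43 critical value are our assumptions:

```python
import numpy as np

def dickey_fuller_t(resid):
    """Basic (non-augmented) Dickey-Fuller t-statistic: regress diff(r) on
    lagged r and return the t-stat of the lag coefficient. A sketch of the
    unit-root check; the article uses the full ADF test."""
    r = np.asarray(resid, dtype=float)
    dy, lag = np.diff(r), r[:-1]
    x = np.column_stack([np.ones_like(lag), lag])
    beta, res_ss = np.linalg.lstsq(x, dy, rcond=None)[:2]
    resid_var = res_ss[0] / (len(dy) - 2)
    se = np.sqrt(resid_var * np.linalg.inv(x.T @ x)[1, 1])
    return beta[1] / se

def pair_signal(price_a, price_b, z_threshold=1.0, crit=-3.43):
    """Hedge-ratio regression + unit-root check + one-sigma z-score signal,
    following the process described above (helper names are ours)."""
    a, b = np.asarray(price_a, float), np.asarray(price_b, float)
    x = np.column_stack([np.ones_like(b), b])
    coef = np.linalg.lstsq(x, a, rcond=None)[0]
    resid = a - x @ coef
    if dickey_fuller_t(resid) > crit:    # unit root not rejected: skip pair
        return None
    z = (resid - resid.mean()) / resid.std()
    # Long the cheap leg / short the rich leg when |z| crosses one sigma.
    return np.where(z > z_threshold, -1, np.where(z < -z_threshold, 1, 0))
```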
Deep Reinforcement Learning — The RL training regime starts by running a linearly annealed exploration-to-exploitation policy to generate training data in the training environment, which in this case runs the same 2006–2017 historical data used for the cointegration tests. The memory is stored in groups of
State, Action, Reward, next State, next Action (SARSA)
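A plain sketch of such a rollout memory; the article’s version adds prioritized replay, which is omitted here:

```python
import random
from collections import deque, namedtuple

# Transitions are stored exactly as the grouping above:
# (State, Action, Reward, next State, next Action).
Transition = namedtuple("Transition", "state action reward next_state next_action")

class ReplayMemory:
    """Fixed-size replay buffer; older memories roll out automatically
    as new experience is pushed."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, *args):
        self.buffer.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```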
Here we use a mixture of DQN and Policy Gradient learning targets, in that our action outputs are continuous (0–100%) yet sample-inefficient (at most hundreds of trades per pair, due to the daily frequency). Our training model updates iteratively with
Q(State, Action) = Reward + max Q(next State, next Action)
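In code, over a mini-batch, the bootstrapped target looks like the following. The discount factor gamma and the terminal masking are standard additions not written out in the formula above:

```python
import numpy as np

def q_targets(rewards, next_q_values, dones, gamma=0.99):
    """Bootstrapped targets Q(s, a) <- r + gamma * max_a' Q(s', a'),
    with the bootstrap term zeroed on terminal steps."""
    rewards = np.asarray(rewards, dtype=float)
    next_max = np.asarray(next_q_values, dtype=float).max(axis=1)
    return rewards + gamma * next_max * (1.0 - np.asarray(dones, dtype=float))
```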
Essentially, the RL agent learns the Q-value as in continuous DQN but is trained with policy gradients on the improvements of each policy, thereby avoiding both sample inefficiency (Q-learning is guaranteed to converge to the training global optimum) and the tendency to get stuck in a local minimum too quickly (avoiding all-0 or all-1 outputs from pure PG). Once the warm-up memories are stored, we train the model (in this case a 3-layer dense net outputting a single action) on the memory data as the agent continues to interact with the environment and rolls out older memories.
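A numpy sketch of such a 3-layer dense network; the hidden width, ReLU activations, and the sigmoid squashing to a single 0–100% action are our assumptions:

```python
import numpy as np

class DenseAllocator:
    """Three dense layers mapping the z-score lookback state to one
    continuous allocation in (0, 1)."""

    def __init__(self, state_dim, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [(state_dim, hidden), (hidden, hidden), (hidden, 1)]
        self.w = [rng.normal(0.0, 0.1, s) for s in sizes]
        self.b = [np.zeros(s[1]) for s in sizes]

    def forward(self, state):
        h = np.asarray(state, dtype=float)
        for w, b in zip(self.w[:-1], self.b[:-1]):
            h = np.maximum(0.0, h @ w + b)              # ReLU hidden layers
        logit = (h @ self.w[-1] + self.b[-1]).item()
        return 1.0 / (1.0 + np.exp(-logit))             # sigmoid -> (0, 1)
```

Usage: `DenseAllocator(state_dim=20).forward(zscore_window)` returns the allocation for the next trade.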
Sutton RS, Barto AG. Reinforcement learning: An introduction. MIT press; 2018.
HFRX? Indices Performance Tables. (n.d.). Retrieved August 03, 2020, from https://www.hedgefundresearch.com/family-indices/hfrx
Translated from: https://towardsdatascience.com/adaptive-pair-trading-under-covid-19-a-reinforcement-learning-approach-ff17e6a8f0d6