
【P9】Point to the Expression：Solving Algebraic Word Problems using the Expression-Pointer Transformer

Published: 2023/12/14

Point to the Expression: Solving Algebraic Word Problems using the Expression-Pointer Transformer Model

Abstract
1 Introduction
- 1.1 Task introduction
- 1.2 Two problems
2 Related work
- 2.1 expression fragmentation
- 2.2 operand-context separation
3 EPT: Expression-Pointer Transformer
- 3.1 Input vector of EPT’s decoder
- 3.2 Output layer of EPT’s decoder
4 Experimental Setup
- 4.1 Metric and Datasets
- 4.2 Baseline and ablated models
- 4.3 Implementation details
5 Result and Discussion
- 5.1 Comparison study
- 5.2 Ablation study
6 Conclusion

Proceedings of the 2020 Conference on EMNLP, pages 3768–3779, November 16–20, 2020.

Abstract

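For algebraic word problems in NLP, previous studies proposed neural models that use "Op" (operator/operand) tokens as the unit of input and output. Such models have to deal with two issues:

- expression fragmentation
- operand-context separation

To address them, the paper proposes a pure neural model, the Expression-Pointer Transformer (EPT), which uses (1) "Expression" tokens and (2) operand-context pointers to generate solution equations.

The main contributions are:

- It proposes EPT, which addresses both issues above.
- EPT performs comparably to models that use hand-crafted features and improves on existing pure neural models by up to 40%.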

1 Introduction

1.1 Task introduction

Existing neural models lag considerably behind models based on hand-crafted features.

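1.2 Two problems

expression fragmentation
This issue (the bold dashed box on the left of Figure 1) refers to the segmentation of the expression tree, which represents the computational structure of an equation.

How it arises
It occurs when Op tokens, rather than the whole expression tree, are used as the model's input/output unit. For example, in Figure 1(a), using Op tokens as input breaks the tree structure into an operator ($\times$) and operands ($x_1$ and $2$).
How EPT addresses it
The paper instead uses an "Expression" token ($\times(x_1, 2)$), which explicitly captures the tree structure as a whole, as in Figure 1(c).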
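The difference between the two token units can be made concrete with a small sketch. The equation `x1 * 2 = 8`, the postfix ordering of the Op tokens, the `Expression` class, and the `R0`-style result references are illustrative assumptions, not the paper's exact vocabulary.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Op-token view (hypothetical postfix order): the tree for "x1 * 2 = 8"
# is flattened, so operator and operands become separate tokens.
op_tokens: List[str] = ["x1", "2", "*", "8", "="]


@dataclass(frozen=True)
class Expression:
    """One Expression token: an operator bundled with all of its operands."""
    operator: str
    operands: Tuple[str, ...]


# Expression-token view: "R0" refers to the result of Expression 0.
expression_tokens: List[Expression] = [
    Expression("*", ("x1", "2")),  # R0 = x1 * 2
    Expression("=", ("R0", "8")),  # R1: R0 = 8
]


def to_infix(tokens: List[Expression]) -> str:
    """Rebuild an infix equation from a list of binary Expression tokens."""
    results: List[str] = []
    for tok in tokens:
        # Resolve "R<k>" operands to the text of the k-th earlier Expression.
        args = [
            results[int(a[1:])] if a.startswith("R") and a[1:].isdigit() else a
            for a in tok.operands
        ]
        left, right = args
        text = f"{left} {tok.operator} {right}"
        results.append(text if tok.operator == "=" else f"({text})")
    return results[-1]


print(to_infix(expression_tokens))
```

Each `Expression` keeps its operator and operands together, so the sub-tree never has to be reassembled from separate tokens.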

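operand-context separation
This issue refers to the link between an operand and the number it relates to being cut off, so that the operand is separated from its context.

How it arises
It occurs when the numbers stated in an algebraic word problem are replaced with abstract symbols for generalization. For example, in Figure 1(b), with Op tokens the number 8 becomes the abstract symbol $N_1$.
How EPT addresses it
With "Expression" tokens, the number 8 is not converted into a symbol. Instead, a pointer is built to the position where the number 8 appears in the word problem. Such an "operand-context pointer" lets the model use the operand's context when predicting it, as shown in Figure 1(c).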
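A minimal sketch of the contrast between symbol abstraction and pointing, on a made-up problem sentence. The sentence, the whitespace tokenizer, the 1-based `N_k` numbering, and the helper names are all assumptions for illustration; EPT itself points at encoder hidden states rather than raw tokens.

```python
import re
from typing import List

NUMBER = re.compile(r"\d+(\.\d+)?")

# Hypothetical word problem, tokenized on whitespace.
tokens: List[str] = "Twice a number is 8 . What is the number ?".split()


def number_positions(toks: List[str]) -> List[int]:
    """Positions of number tokens; these are what a pointer can refer to."""
    return [i for i, t in enumerate(toks) if NUMBER.fullmatch(t)]


def abstract_numbers(toks: List[str]) -> List[str]:
    """Op-token style: replace the k-th number with the symbol N_k."""
    out, k = [], 0
    for t in toks:
        if NUMBER.fullmatch(t):
            k += 1
            out.append(f"N_{k}")
        else:
            out.append(t)
    return out


def context_window(toks: List[str], position: int, radius: int = 2) -> List[str]:
    """Tokens around a pointed-to position, i.e. the operand's context."""
    return toks[max(0, position - radius): position + radius + 1]


pos = number_positions(tokens)[0]
print(abstract_numbers(tokens))     # the number is replaced by a symbol
print(context_window(tokens, pos))  # the pointer keeps its surroundings
```

After abstraction only the symbol is left, whereas a position pointer still gives access to the words around the number.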

2 Related work

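2.1 expression fragmentation

Researchers have tried to reflect the relational information between operators and operands either with a two-step procedure or with a single-step seq2seq model.

Two-step procedure (earlier work)
- step 1: select the operators by classifying over predefined templates
  step 2: apply operands to the template selected in step 1.
- Other models select the operands first and then build the expression tree with operators in the second step.
Single-step seq2seq models (recent work), which learn the implicit relationship between operators and operands
These seq2seq approaches consider relational information of operands when generating operators, but they still do not address the lack of relational information of operators when generating operands.
- Chiang and Chen (2019) built a seq2seq model that generates operator/operand tokens using push/pop actions on a stack.
- Amini et al. (2019) built a seq2seq model that generates an operator token right after generating the operand tokens it requires.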
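Emitting an operator right after its operands can be read as a postfix-style ordering, and a stack is the usual way to turn such a sequence back into a tree. The sketch below shows only that generic idea; it is not the implementation of either cited model, and the token names are hypothetical.

```python
from typing import List, Tuple, Union

Tree = Union[str, Tuple[str, "Tree", "Tree"]]
OPERATORS = {"+", "-", "*", "/"}


def postfix_to_tree(tokens: List[str]) -> Tree:
    """Rebuild a binary expression tree from postfix-ordered Op tokens."""
    stack: List[Tree] = []
    for tok in tokens:
        if tok in OPERATORS:
            # An operator pops the two operands generated just before it.
            right = stack.pop()
            left = stack.pop()
            stack.append((tok, left, right))
        else:
            # An operand is pushed and waits for its operator.
            stack.append(tok)
    if len(stack) != 1:
        raise ValueError("token sequence does not form a single tree")
    return stack[0]


print(postfix_to_tree(["x1", "2", "*", "3", "+"]))
```

While an operand is being pushed, the operator that will consume it has not been generated yet, which is one way to see why operator information is missing at operand-generation time.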

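2.2 operand-context separation

Building hand-crafted features to capture the semantic content of words
- For example, the unit of a given number or the dependency relations between numbers.
- Drawback: designing hand-crafted input features is time-consuming and requires domain expertise.
Using distributed representations and neural models to learn the numeric context of operands automatically
- Huang used a pointer-generator network that can point to the context of a number in the given math problem. Its drawback is that its performance is not comparable to state-of-the-art models that use hand-crafted features.
- This paper adds extra pointers, which can use the contextual information of operands and neighboring Expression tokens, and thereby improves the performance of pure neural models.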

3 EPT: Expression-Pointer Transformer

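EPT uses an encoder-decoder architecture overall:

encoder: the pre-trained model ALBERT
- input: the tokenized word problem
- output: the ALBERT encoder's hidden-state vectors (representing the numeric context of the given problem)
decoder: Transformer decoder
- input: Expression tokens and the ALBERT encoder's hidden-state vectors
- output: Expression tokens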

3.1 Input vector of EPT’s decoder

Symbol	Meaning	Dimension
$\mathbf{v}_{i}$	input vector of the $i$th Expression token	D
$\mathbf{f}_{i}$	operator embedding	D
$\mathbf{a}_{ij}$	embedding of the $j$th operand of the $i$th Expression	D

$\mathbf{v}_{i}$
$\mathbf{v}_{i}=\mathrm{FF}_{\text{in}}\left(\operatorname{Concat}\left(\mathbf{f}_{i}, \mathbf{a}_{i1}, \mathbf{a}_{i2}, \cdots, \mathbf{a}_{ip}\right)\right)$ where $\mathrm{FF}_{*}$ denotes a feed-forward linear layer and $\operatorname{Concat}(\cdot)$ denotes the concatenation of all vectors inside the parentheses.
$\mathbf{f}_{i}$
$\mathbf{f}_{i}=\mathrm{LN}_{\mathrm{f}}\left(c_{\mathrm{f}} \mathrm{E}_{\mathrm{f}}\left(f_{i}\right)+\mathrm{PE}(i)\right)$ where $\mathrm{E}_{*}(\cdot)$ denotes an embedding lookup table, $c_{*}$ a scalar parameter, $\mathrm{LN}_{*}(\cdot)$ layer normalization, and $\mathrm{PE}(\cdot)$ positional encoding.
$\mathbf{a}_{ij}$
To reflect the operand's contextual information, $\mathbf{a}_{ij}$ has three possible sources:
- problem-dependent numbers
  Numbers stated in the algebraic problem (e.g. "20" in Table 1). To compute $\mathbf{a}_{ij}$ for a number, the encoder hidden-state vector corresponding to that number's tokens is reused:
  $\mathbf{a}_{ij}=\mathrm{LN}_{\mathrm{a}}\left(c_{\mathrm{a}} \mathbf{u}_{\mathrm{num}}+\mathbf{e}_{a_{ij}}\right)$ where $\mathbf{u}_{*}$ is a vector representing the source and $\mathbf{e}_{a_{ij}}$ is the encoder hidden-state vector corresponding to the number $a_{ij}$.
- problem-independent constants
  Predefined numbers that are not stated in the problem (e.g. 100, which is often used for percentages). To compute $\mathbf{a}_{ij}$ for a constant, a lookup table $\mathrm{E}_{\mathrm{c}}$ is used:
  $\mathbf{a}_{ij}=\mathrm{LN}_{\mathrm{a}}\left(c_{\mathrm{a}} \mathbf{u}_{\mathrm{const}}+\mathrm{E}_{\mathrm{c}}\left(a_{ij}\right)\right)$ where $\mathrm{LN}_{\mathrm{a}}$ and $c_{\mathrm{a}}$ are shared across the different sources.
- the result of the prior Expression token
  An Expression generated before the $i$th Expression (e.g. R0). To compute $\mathbf{a}_{ij}$ for such a result, positional encoding is used:
  $\mathbf{a}_{ij}=\mathrm{LN}_{\mathrm{a}}\left(c_{\mathrm{a}} \mathbf{u}_{\mathrm{expr}}+\mathrm{PE}(k)\right)$ where $k$ is the index of the prior Expression $a_{ij}$.
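The formulas of this subsection can be traced with a toy, pure-Python sketch. Everything numeric here is made up: the size `D = 4`, the parameter values, the sinusoidal form of `PE`, and the weight matrix `W_in` are assumptions, and layer normalization is shown without its learned gain and bias.

```python
import math
from typing import List

Vector = List[float]
D = 4  # toy size; the paper sets D to the encoder hidden size


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Layer normalization without the learned gain/bias, for brevity."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def positional_encoding(pos: int, dim: int = D) -> Vector:
    """Sinusoidal PE (assumed form: the standard Transformer encoding)."""
    out = []
    for i in range(dim):
        angle = pos / (10000 ** ((i - i % 2) / dim))
        out.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return out


def add_scaled(scale: float, u: Vector, content: Vector) -> Vector:
    """Compute scale * u + content element-wise."""
    return [scale * a + b for a, b in zip(u, content)]


def linear(weight: List[Vector], x: Vector) -> Vector:
    """Bias-free linear layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# Hypothetical toy parameters.
c_a, c_f = 1.0, 1.0
u_num, u_const, u_expr = [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]
E_f = {"*": [0.1, 0.2, 0.3, 0.4]}       # operator lookup table
E_c = {"100": [0.5, -0.5, 0.5, -0.5]}   # constant lookup table
e_number = [0.3, 0.1, -0.2, 0.7]        # encoder state at the number's position

# a_ij for each of the three operand sources.
a_from_number = layer_norm(add_scaled(c_a, u_num, e_number))
a_from_constant = layer_norm(add_scaled(c_a, u_const, E_c["100"]))
a_from_result = layer_norm(add_scaled(c_a, u_expr, positional_encoding(0)))  # R0

# f_i = LN_f(c_f * E_f(f_i) + PE(i)), then v_i = FF_in(Concat(f_i, a_i1, a_i2)).
i = 1
f_i = layer_norm(add_scaled(c_f, E_f["*"], positional_encoding(i)))
concat = f_i + a_from_result + a_from_number
W_in = [[0.01 * (r + c) for c in range(len(concat))] for r in range(D)]
v_i = linear(W_in, concat)
print(len(concat), len(v_i))
```

The three operand embeddings differ only in the source vector and in where the content vector comes from; normalization and scaling are shared.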

3.2 Output layer of EPT’s decoder

Predicting the next operator $f_{i+1}$:
$f_{i+1}=\arg\max_{f} \sigma\left(f \mid \mathrm{FF}_{\text{out}}\left(\mathbf{d}_{i}\right)\right)$
Here $\mathbf{d}_{i}$ is the decoder output for the $i$th Expression, and $\sigma(k \mid \mathbf{x})$ is the probability of item $k$ under the softmax of $\mathbf{x}$.
Predicting the next operand $a_{i+1,j}$:
(1) The output layer applies operand-context pointers, inspired by pointer networks. In a pointer network, the output layer predicts the next token using attention over candidate vectors. EPT collects the candidate vectors for the next ($(i+1)$th) Expression in three different ways, depending on the source of the operand:
$\begin{aligned} &\mathbf{e}_{k} && \text{for the } k\text{th number in the problem}, \\ &\mathbf{d}_{k} && \text{for the } k\text{th Expression output}, \\ &\mathrm{E}_{\mathrm{c}}(x) && \text{for a constant } x \end{aligned}$
(2) EPT predicts the next $j$th operand $a_{i+1,j}$.
Let $\mathbf{A}_{ij}$ be the matrix whose row vectors are the candidate vectors above. The operand $a_{i+1,j}$ is then predicted by computing the attention of the query vector $Q_{ij}$ on the key matrix $K_{ij}$:
$\begin{aligned} Q_{ij} &=\mathrm{FF}_{\text{query},j}\left(\mathbf{d}_{i}\right) \\ K_{ij} &=\mathrm{FF}_{\text{key},j}\left(\mathbf{A}_{ij}\right) \\ a_{i+1,j} &=\arg\max_{a} \sigma\left(a \mid \frac{Q_{ij} K_{ij}^{\top}}{\sqrt{D}}\right) \end{aligned}$
Loss: the loss of an Expression is computed by summing the loss of its operator and the losses of its required arguments. All loss functions are computed with cross-entropy with label smoothing.
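The operand-pointing step can be sketched in a few lines of plain Python. The two-dimensional vectors, the identity projection matrices, and the candidate labels are toy assumptions chosen so that the result can be checked by hand; this is a sketch of scaled dot-product pointing, not the authors' code.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weight: List[Vector], x: Vector) -> Vector:
    """Bias-free linear layer standing in for FF_query / FF_key."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def point_to_operand(
    d_i: Vector, candidates: List[Vector],
    w_query: List[Vector], w_key: List[Vector],
) -> Tuple[int, Vector]:
    """Return the index of the most probable candidate and all probabilities."""
    dim = len(d_i)
    q = linear(w_query, d_i)                      # Q_ij = FF_query,j(d_i)
    keys = [linear(w_key, a) for a in candidates]  # K_ij = FF_key,j(A_ij)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projection weights
labels = ["number e_0", "Expression output d_0", "constant E_c(100)"]
A = [[0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]  # candidate rows, one per label
best, probs = point_to_operand([1.0, 0.0], A, identity, identity)
print(labels[best], [round(p, 3) for p in probs])
```

Because all candidates share one score vector, numbers, constants, and prior Expression results compete in a single softmax.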

4 Experimental Setup

4.1 Metric and Datasets

Three publicly available English algebraic word problem datasets are used:

ALG514: high complexity
DRAW-1K: high complexity
MAWPS: low complexity

4.2 Baseline and ablated models

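EPT is compared with five existing SoTA models, which fall into three types: models using hand-crafted features, pure neural models, and hybrid models.

Ablation experiments: see the paper for the ablated model variants.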

4.3 Implementation details

PyTorch 1.5
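encoder: three sizes of the ALBERT model were used: albert-base-v2, albert-large-v2, and albert-xlarge-v2. The encoder's embedding matrix was frozen during training to preserve the world knowledge in it and to stabilize learning.
decoder: six decoder layers are stacked, with parameters shared across layers to reduce memory usage.
The dimension $D$ of the input vectors is set to the dimension of the encoder's hidden-state vectors.
Teacher forcing is used during training, and beam search with 3 beams during evaluation.
EPT's hyperparameters follow those of the ALBERT model, except for the number of training epochs, batch size, warm-up epochs, and learning rate. See the paper for the exact settings.
optimizer: LAMB, with linear decay and warm-up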
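A linear-decay schedule with warm-up can be written as a small function of the step count. The function name, the step counts, and the peak learning rate below are hypothetical; the LAMB update itself is not shown, and the paper's actual values should be taken from the paper.

```python
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int, peak_lr: float) -> float:
    """Learning rate that rises linearly to peak_lr, then decays linearly to 0."""
    if step < warmup_steps:
        # Warm-up phase: scale up from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale down from peak_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical values, for illustration only.
for step in (0, 5, 10, 55, 100):
    print(step, linear_warmup_decay(step, warmup_steps=10, total_steps=100, peak_lr=1e-3))
```

The optimizer would multiply its update by this value at each step; the peak is reached exactly when warm-up ends.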

5 Result and Discussion

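5.1 Comparison study

One possible explanation for EPT outperforming existing pure neural models is its use of the operands' contextual information.

Symbols can be used in four ways: (1) generalizing common patterns, (2) representing unknowns in an equation, (3) representing an argument of a function, and (4) replacing arbitrary marks.

Existing neural models use symbols to abstract problem-dependent numbers or unknowns; that is, by applying template classification or machine-learning techniques, they apply (1) and (2).
The EPT model handles (3) with Expression tokens and (4) with operand-context pointers.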

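5.2 Ablation study

Table 6 gives the results of the error analysis:

On the high-complexity datasets, there are two possible explanations for the performance gain:

- Using Expression tokens addresses the expression fragmentation issue when generating solution equations.
- Using operand-context pointers addresses the operand-context separation issue when selecting operands.

The erroneous examples (case 3 and case 4) fall into two categories: comparison errors and temporal-order errors.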

6 Conclusion

This work demonstrates that it is possible to reduce the high cost of hand-crafted features when solving algebraic word problems.
