Legal Applications of Neural Word Embeddings
A fundamental issue with LegalTech is that words — the basic currency of all legal documentation — are a form of unstructured data that cannot be intuitively understood by machines. Therefore, in order to process textual documents, words have to be represented by vectors of real numbers.
Traditionally, methods like bag-of-words (BoW) map word tokens or n-grams to term-frequency vectors, which record the number of times each word appears in a document. Under this encoding, each word token/n-gram in the corpus vocabulary is represented by one vector element, marked 0, 1, 2, and so on according to the number of times the word appears in the document: a word absent from the document is marked 0, a word appearing once is marked 1, and so forth. (In the strict one-hot variant, the element is simply 1 if the word is present and 0 otherwise.)
The problem is that this produces very sparse vectors (i.e. mostly comprising zeros) in extremely high dimensions. For instance, a corpus with 30,000 unique word tokens would require a 30,000-dimensional vector for every document, which is extremely computationally taxing. Furthermore, this method fails to capture semantic meaning, context, and word relations, as it can only show how frequently a word occurs in a document. This inability to represent semantic meaning persisted even as BoW was complemented by the TF-IDF measure: while TF-IDF can represent how important a word is to a corpus (an improvement over plain BoW representations), it is still computed from the frequency with which a word token/n-gram appears in the corpus.
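The count-based encoding described above can be sketched in a few lines of plain Python (the vocabulary and legal-flavoured document below are made up purely for illustration):

```python
from collections import Counter

def bow_vector(document_tokens, vocabulary):
    """Map a tokenised document to a term-frequency vector over a fixed vocabulary."""
    counts = Counter(document_tokens)
    # Element i holds the number of times vocabulary[i] appears in the document;
    # words absent from the document are marked 0.
    return [counts.get(word, 0) for word in vocabulary]

vocabulary = ["contract", "breach", "party", "damages", "court"]
doc = ["the", "party", "in", "breach", "must", "pay", "damages", "to", "the", "other", "party"]

print(bow_vector(doc, vocabulary))  # → [0, 1, 2, 1, 0]
```

With a realistic vocabulary of tens of thousands of tokens, almost every element of such a vector is zero, which is exactly the sparsity problem described above.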
In contrast, modern word embeddings (word2vec, GloVe, fastText, etc.) rely on neural networks to map the semantic properties of words into dense vector representations with significantly fewer dimensions.
As a preliminary note, word embeddings are premised on the distributional semantics assumption: words that are used and occur in the same contexts tend to have similar meanings. This means that the neural network learns the vector representation of each word from the contexts in which the word is found.
“Context” here is represented by a sliding context window of size n: the n words before the target word and the n words after it fall within the window (e.g. n = 2). The model then trains by using one-hot encoded context vectors to predict one-hot encoded target vectors as the context window slides along the sentence.
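Generating the (context, target) pairs that such a model trains on can be sketched as follows (the sentence is illustrative; a real pipeline would then one-hot encode each word against the corpus vocabulary):

```python
def context_target_pairs(tokens, n=2):
    """Yield (context_words, target_word) pairs for a sliding window of size n."""
    pairs = []
    for i, target in enumerate(tokens):
        # The n words before and n words after the target fall within the window
        context = tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n]
        pairs.append((context, target))
    return pairs

sentence = "the court awarded damages to the claimant".split()
for context, target in context_target_pairs(sentence, n=2):
    print(f"{target!r} <- {context}")
```

For the word "awarded", for instance, the window contributes the pair (["the", "court", "damages", "to"], "awarded"), and it is from many such pairs across a large corpus that the network learns the dense vectors.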
In doing so, word embeddings can capture semantic associations and linguistic contexts not captured by BoW. This article will explore the impact of neural word embeddings in legal AI technologies.
Capturing Relationships between Legal Terms
An important implication of word embeddings is that they capture the semantic relationships between words.
The ability to map out the relationships between legal terms and objects has exciting implications for improving our understanding of legal reasoning. An interesting direction is the potential vectorisation of judicial opinions with doc2vec to identify or cluster judges with similar belief patterns (based on the conservativeness of their legal opinions, the precedents they cite, etc.).
Another function is that word embeddings can capture implicit racial and gender biases in judicial opinions, as measured by the Word Embedding Association Test (WEAT). Word embeddings are powerful because they can represent societal biases in mathematical or diagrammatic form. For instance, Bolukbasi (2016) showed that word embeddings trained on Google News articles exhibited significant gender bias, which can be captured geometrically by a direction in the embedding space (i.e. gender-definitional words are linearly separable from gender-neutral words).
Source: https://www.semanticscholar.org/paper/Man-is-to-Computer-Programmer-as-Woman-is-to-Word-Bolukbasi-Chang/ccf6a69a7f33bcf052aa7def176d3b9de495beb7

As such, word embeddings may reflect vector relationships like “man is to programmer as woman is to homemaker”: the word “man” in the Google News corpora co-occurs more frequently alongside words like “programmer” or “engineer”, while the word “woman” appears more frequently beside “homemaker” or “nurse”.
Applied to the legal domain, we can tabulate WEAT scores across judicial opinions, and preliminary research in this field has shown interesting trends, such as (i) male judges showing higher gender bias (i.e. higher WEAT scores) than female judges and (ii) white judges showing lower race bias than black judges. More remains to be explored in this domain.
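A minimal sketch of the core WEAT association score — a target word's mean cosine similarity to one attribute set minus its mean similarity to the other — is shown below. The 3-dimensional vectors are made up purely for illustration; a real test would use embeddings trained on the judicial opinions under study:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def weat_association(word_vec, attr_set_a, attr_set_b):
    """s(w, A, B): mean cosine similarity to set A minus mean similarity to set B."""
    mean_a = sum(cosine(word_vec, a) for a in attr_set_a) / len(attr_set_a)
    mean_b = sum(cosine(word_vec, b) for b in attr_set_b) / len(attr_set_b)
    return mean_a - mean_b

# Toy vectors (hypothetical): "programmer" is placed closer to the male attribute words
programmer = [0.9, 0.1, 0.2]
male_attrs = [[1.0, 0.0, 0.1], [0.8, 0.2, 0.0]]     # e.g. "he", "man"
female_attrs = [[0.0, 1.0, 0.1], [0.1, 0.9, 0.2]]   # e.g. "she", "woman"

score = weat_association(programmer, male_attrs, female_attrs)
print(round(score, 3))  # positive → "programmer" associates more with the male set
```

The full WEAT statistic aggregates such per-word scores over two target sets and tests them against a permutation-based null, but the per-word association above is the quantity that a positive or negative bias ultimately rests on.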
Improving Legal Research
Source: http://mlwiki.org/index.php/Information_Retrieval

Word embeddings also have fundamental implications for improving the technology behind legal research platforms (known in machine-learning parlance as Legal Information Retrieval (LIR) systems).
Currently, most LIR systems (e.g. Westlaw and LexisNexis) are still boolean-indexed systems primarily utilising keyword search functionality.
This means that the system looks for literal matches or variants of the query keywords, usually by using string-based algorithms to measure the similarity between two text strings. However, such searches fail to understand the intent behind the solicitor’s query, so results are often under-inclusive (missing relevant documents that contain only variants of the keyword rather than the keyword itself) or over-inclusive (returning irrelevant documents that happen to contain the keyword).
Word embeddings, however, enhance the potential of commercially available semantic-search LIR systems. Because they allow practitioners to capture semantic similarity mathematically, word embeddings can help LIR systems find not only exact matches of the query string, but also results that are relevant or semantically close while differing in certain words.
Source: https://www.ontotext.com/knowledgehub/fundamentals/what-is-semantic-search/

For instance, Landthaler shows that effective results can be produced by first summing the word vectors in each search phrase into a single search-phrase vector. The document is then parsed sequentially by a window of size n (where n is the number of words in the search phrase), and the cosine similarity between the search-phrase vector and each window’s accumulated vector is calculated. This returns not only exact keyword matches but also semantically related passages, which provides more intuitive results.
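Landthaler’s phrase-vector approach can be sketched as follows, using tiny made-up 3-dimensional embeddings (in practice the vectors would come from a model such as word2vec trained on a legal corpus, and the hypothetical values below are chosen so that "rescind" sits near "terminate" and "agreement" near "contract"):

```python
import math

def add_vectors(vectors):
    """Sum a list of word vectors element-wise into a single phrase vector."""
    return [sum(components) for components in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search(query_tokens, doc_tokens, embeddings):
    """Slide a window of len(query) words over the document and score each window
    by cosine similarity between the summed query and window vectors."""
    n = len(query_tokens)
    query_vec = add_vectors([embeddings[t] for t in query_tokens])
    scores = []
    for i in range(len(doc_tokens) - n + 1):
        window = doc_tokens[i:i + n]
        window_vec = add_vectors([embeddings[t] for t in window])
        scores.append((cosine(query_vec, window_vec), window))
    return max(scores)  # best-matching window and its score

# Hypothetical embeddings: synonyms share a direction, function words get their own
embeddings = {
    "terminate": [1.0, 0.0, 0.0],
    "rescind":   [0.95, 0.05, 0.0],
    "contract":  [0.0, 1.0, 0.0],
    "agreement": [0.05, 0.95, 0.0],
    "the": [0.0, 0.0, 1.0], "party": [0.0, 0.0, 1.0], "may": [0.0, 0.0, 1.0],
}

best_score, best_window = search(["terminate", "contract"],
                                 "the party may rescind agreement".split(),
                                 embeddings)
print(best_window)  # → ['rescind', 'agreement'], despite no exact keyword match
```

The highest-scoring window contains neither query word, which is exactly the behaviour a keyword-matching system cannot reproduce.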
Source: https://www.lawtechnologytoday.org/2017/11/legal-consumer-research/

This is especially important since research shows that participants using boolean-indexed LIR systems, which search for exact matches of query terms in full-text legal documents, can have recall rates as low as 20% (i.e. only 20% of relevant documents are retrieved by the LIR). Yet, on average, those same participants estimate their retrieval rates to be up to 75%, which is significantly higher. This means that solicitors can often overlook relevant precedents or case law that may bolster their case, simply because the LIR system prioritises string similarity over semantic similarity. Word embeddings hence have the potential to significantly address this shortfall.
Conclusion and Future Directions
Overall, the field of neural word embeddings is fascinating. Not only is the ability to mathematically capture semantic context and word relations academically intriguing; word embeddings have also been a hugely important driver behind many LegalTech products on the market.
However, word embeddings are not without limitations, and ML practitioners sometimes turn to newer pre-trained language-modelling techniques (e.g. ULMFiT, ELMo, the OpenAI Transformer, and BERT) to overcome some of their inherent problems (e.g. presuming monosemy — assigning one vector per word regardless of sense). Nevertheless, word embeddings remain one of the most fascinating NLP topics today, and the move from sparse, frequency-based vector representations to denser, semantically representative vectors is a crucial step in advancing the NLP subdomain and the field of legal AI.
Source: https://towardsdatascience.com/legal-applications-of-neural-word-embeddings-556b7515012f