The ANTLR Mega Tutorial
Parsers are powerful tools, and with ANTLR you can write all sorts of parsers, usable from many different languages.
In this complete tutorial we are going to:
- explain the basics: what a parser is and what it can be used for
- see how to set up ANTLR to be used from JavaScript, Python, Java and C#
- discuss how to test your parser
- present the most advanced and useful features of ANTLR: you will learn all you need to parse all possible languages
- show tons of examples
Maybe you have read some tutorial that was too complicated, or so partial that it seemed to assume you already knew how to use a parser. This is not that kind of tutorial. We just expect you to know how to code and how to use a text editor or an IDE. That's it.
At the end of this tutorial:
- you will be able to write a parser to recognize different formats and languages
- you will be able to create all the rules needed to build a lexer and a parser
- you will know how to deal with the common problems you will encounter
- you will understand errors, and you will know how to avoid them by testing your grammar.
In other words, we will start from the very beginning, and when we reach the end you will have learned all you could possibly need to know about ANTLR.
ANTLR Mega Tutorial Giant List of Content
What Is ANTLR?
ANTLR is a parser generator, a tool that helps you create parsers. A parser takes a piece of text and transforms it into an organized structure, such as an Abstract Syntax Tree (AST). You can think of the AST as a story describing the content of the code, or also as its logical representation, created by putting together the various pieces.
Graphical representation of an AST for the Euclidean algorithm
What you need to do to get an AST:
First of all, you have to define a lexer and parser grammar for the thing you are analyzing. Usually the "thing" is a language, but it could also be a data format, a diagram, or any kind of structure that is represented with text.
正則表達(dá)式不夠嗎?
如果您是典型的程序員,您可能會問自己: 為什么我不能使用正則表達(dá)式 ? 正則表達(dá)式非常有用,例如當(dāng)您想在文本字符串中查找數(shù)字時,它也有很多限制。
最明顯的是缺乏遞歸:除非您為每個級別手動編碼,否則您無法在另一個表達(dá)式中找到一個(正則)表達(dá)式。 很快就無法維持的事情。 但是更大的問題是它并不是真正可擴(kuò)展的:如果您只將幾個正則表達(dá)式放在一起,就將創(chuàng)建一個脆弱的混亂,將很難維護(hù)。
使用正則表達(dá)式不是那么容易
您是否嘗試過使用正則表達(dá)式解析HTML? 這是一個可怕的想法,因為您冒著召喚克蘇魯?shù)奈kU,但更重要的是, 它實際上并沒有奏效 。 你不相信我嗎 讓我們看一下,您想要查找表的元素,因此嘗試像這樣的常規(guī)擴(kuò)展: <table>(.*?)</table> 。 輝煌! 你做到了! 除非有人向其表中添加諸如style或id類的屬性。 沒關(guān)系,您執(zhí)行<table.*?>(.*?)</table> ,但實際上您關(guān)心表中的數(shù)據(jù),因此您需要解析tr和td ,但是它們已滿標(biāo)簽。
因此,您也需要消除這種情況。 而且甚至有人甚至敢使用<!—我的評論&gtl->之類的評論。 注釋可以在任何地方使用,并且使用正則表達(dá)式不容易處理。 是嗎?
因此,您禁止Internet使用HTML中的注釋:已解決問題。
或者,您也可以使用ANTLR,對您而言似乎更簡單。
ANTLR與手動編寫自己的解析器
好的,您確信需要一個解析器,但是為什么要使用像ANTLR這樣的解析器生成器而不是構(gòu)建自己的解析器呢?
ANTLR的主要優(yōu)勢是生產(chǎn)率
如果您實際上一直在使用解析器,因為您的語言或格式在不斷發(fā)展,則您需要能夠保持步伐,而如果您必須處理實現(xiàn)a的細(xì)節(jié),則無法做到這一點。解析器。 由于您不是為了解析而解析,因此您必須有機(jī)會專注于實現(xiàn)目標(biāo)。 而ANTLR使得快速,整潔地執(zhí)行此操作變得更加容易。
其次,定義語法后,您可以要求ANTLR生成不同語言的多個解析器。 例如,您可以使用C#獲得一個解析器,而使用Javascript獲得一個解析器,以在桌面應(yīng)用程序和Web應(yīng)用程序中解析相同的語言。
有人認(rèn)為,手動編寫解析器可以使其更快,并且可以產(chǎn)生更好的錯誤消息。 這有些道理,但以我的經(jīng)驗,ANTLR生成的解析器總是足夠快。 如果確實需要,您可以調(diào)整語法并通過處理語法來提高性能和錯誤處理。 只要對語法感到滿意,就可以這樣做。
目錄還是可以的
兩個小注意事項:
- 在本教程的配套存儲庫中,您將找到所有帶有測試的代碼,即使我們在本文中沒有看到它們
- 示例將使用不同的語言,但是知識通常適用于任何語言
設(shè)定
初學(xué)者
中級
高級
結(jié)束語
設(shè)定
在本節(jié)中,我們準(zhǔn)備使用ANTLR的開發(fā)環(huán)境:解析器生成器工具,每種語言的支持工具和運(yùn)行時。
1.設(shè)置ANTLR
ANTLR實際上由兩個主要部分組成:用于生成詞法分析器和解析器的工具,以及運(yùn)行它們所需的運(yùn)行時。
語言工程師將只需要您使用該工具,而運(yùn)行時將包含在使用您的語言的最終軟件中。
無論您使用哪種語言,該工具始終是相同的:這是開發(fā)計算機(jī)上所需的Java程序。 盡管每種語言的運(yùn)行時都不同,但是開發(fā)人員和用戶都必須可以使用它。
該工具的唯一要求是您已經(jīng)安裝了至少Java 1.7 。 要安裝Java程序,您需要從官方站點下載最新版本,當(dāng)前版本為:
http://www.antlr.org/download/antlr-4.6-complete.jar

Instructions
Executing the instructions on Linux/Mac OS
// 1.
sudo cp antlr-4.6-complete.jar /usr/local/lib/
// 2. and 3.
// add this to your .bash_profile
export CLASSPATH=".:/usr/local/lib/antlr-4.6-complete.jar:$CLASSPATH"
// simplify the use of the tool to generate lexer and parser
alias antlr4='java -Xmx500M -cp "/usr/local/lib/antlr-4.6-complete.jar:$CLASSPATH" org.antlr.v4.Tool'
// simplify the use of the tool to test the generated code
alias grun='java org.antlr.v4.gui.TestRig'

Executing the instructions on Windows
// 1. Go to System Properties dialog > Environment variables -> Create or append to the CLASSPATH variable
// 2. and 3. Option A: use doskey
doskey antlr4=java org.antlr.v4.Tool $*
doskey grun =java org.antlr.v4.gui.TestRig $*
// 2. and 3. Option B: use batch files
// create antlr4.bat
java org.antlr.v4.Tool %*
// create grun.bat
java org.antlr.v4.gui.TestRig %*
// put them in the system path or any of the directories included in %path%

Typical Workflow
使用ANTLR時,首先要編寫語法 ,即擴(kuò)展名為.g4的文件,其中包含要分析的語言規(guī)則。 然后,您可以使用antlr4程序來生成程序?qū)嶋H使用的文件,例如詞法分析器和解析器。
antlr4 <options> <grammar-file-g4>運(yùn)行antlr4時可以指定幾個重要選項。
首先,您可以指定目標(biāo)語言,以Python或JavaScript或任何其他不同于Java的目標(biāo)(默認(rèn)語言)生成解析器。 其他的用于生成訪問者和偵聽器(不要擔(dān)心,如果您不知道這些是什么,我們將在后面進(jìn)行解釋)。
缺省情況下,僅生成偵聽器,因此要創(chuàng)建訪問者,請使用-visitor命令行選項,如果不想生成-no-listener則使用-no-listener listener。 也有相反的選項-no-visitor和-listener ,但它們是默認(rèn)值。
antlr4 -visitor <grammar-file>

You can optionally test your grammar using a little utility named TestRig (although, as we have seen, it is usually aliased to grun).
grun <grammar-name> <rule-to-test> <input-filename(s)>

The filename(s) are optional; you can instead analyze the input that you type on the console.
If you want to use the testing tool, you need to generate a Java parser, even if your program is written in another language. This can be done just by selecting a different option with antlr4.
Grun is useful when testing the first draft of your grammar manually. As it becomes more stable, you may want to relay on automated tests (we will see how to write them).
Grun also has a few useful options: -tokens, to show the tokens detected, and -gui, to generate an image of the AST.
2. Javascript設(shè)置
您可以將語法與Javascript文件放在同一文件夾中。 包含語法的文件必須具有與語法相同的名稱,該名稱必須在文件頂部聲明。
在下面的示例中,名稱為Chat ,文件為Chat.g4 。
通過使用ANTLR4 Java程序指定正確的選項,我們可以創(chuàng)建相應(yīng)的Javascript解析器。
antlr4 -Dlanguage=JavaScript Chat.g4請注意,該選項區(qū)分大小寫,因此請注意大寫的“ S”。 如果您輸入有誤,則會收到類似以下的消息。
error(31): ANTLR cannot generate Javascript code as of version 4.6ANTLR可以與node.js一起使用,也可以在瀏覽器中使用。 對于瀏覽器,您需要使用webpack或require.js 。 如果您不知道如何使用兩者之一,可以查閱官方文檔尋求幫助或閱讀網(wǎng)絡(luò)上的antlr教程。 我們將使用node.js ,只需使用以下標(biāo)準(zhǔn)命令即可為之安裝ANTLR運(yùn)行時。
npm install antlr4

3. Python Setup
When you have a grammar, you put it in the same folder as your Python files. The file must have the same name as the grammar, which must be declared at the top of the file. In the following example the name is Chat and the file is Chat.g4.
We can create the corresponding Python parser simply by specifying the correct option with the ANTLR4 Java program. For Python, you also need to pay attention to the version of Python, 2 or 3.
antlr4 -Dlanguage=Python3 Chat.g4

The runtime is available from PyPi, so you can install it using pip.
pip install antlr4-python3-runtime

Again, you just have to remember to specify the proper python version.
4. Java設(shè)定
要使用ANTLR設(shè)置我們的Java項目,您可以手動執(zhí)行操作。 或者您可以成為文明的人并使用Gradle或Maven。
另外,您可以在IDE中查看ANTLR插件。
4.1使用Gradle進(jìn)行Java設(shè)置
這就是我通常設(shè)置Gradle項目的方式。
我使用Gradle插件調(diào)用ANTLR,也使用IDEA插件生成IntelliJ IDEA的配置。
dependencies {
    antlr "org.antlr:antlr4:4.5.1"
    compile "org.antlr:antlr4-runtime:4.5.1"
    testCompile 'junit:junit:4.12'
}

generateGrammarSource {
    maxHeapSize = "64m"
    arguments += ['-package', 'me.tomassetti.mylanguage']
    outputDirectory = new File("generated-src/antlr/main/me/tomassetti/mylanguage".toString())
}
compileJava.dependsOn generateGrammarSource
sourceSets {
    generated {
        java.srcDir 'generated-src/antlr/main/'
    }
}
compileJava.source sourceSets.generated.java, sourceSets.main.java

clean {
    delete "generated-src"
}

idea {
    module {
        sourceDirs += file("generated-src/antlr/main")
    }
}

I put my grammars under src/main/antlr/, and the gradle configuration makes sure they are generated in the directory corresponding to their package. For example, if I want the parser to be in the package me.tomassetti.mylanguage, it has to be generated into generated-src/antlr/main/me/tomassetti/mylanguage.
此時,我可以簡單地運(yùn)行:
# Linux/Mac ./gradlew generateGrammarSource# Windows gradlew generateGrammarSource然后我從語法中生成了詞法分析器和解析器。
然后我也可以運(yùn)行:
# Linux/Mac ./gradlew idea# Windows gradlew idea我已經(jīng)準(zhǔn)備好要打開一個IDEA項目。
4.2 Java Setup Using Maven
First of all, we are going to specify in our POM that we need antlr4-runtime as a dependency. We will also use a Maven plugin to run ANTLR through Maven.
We can also specify whether we want ANTLR to generate visitors or listeners. To do that, we define a couple of corresponding properties.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

  <modelVersion>4.0.0</modelVersion>

  [..]

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <antlr4.visitor>true</antlr4.visitor>
    <antlr4.listener>true</antlr4.listener>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.antlr</groupId>
      <artifactId>antlr4-runtime</artifactId>
      <version>4.6</version>
    </dependency>
    [..]
  </dependencies>

  <build>
    <plugins>
      [..]
      <!-- Plugin to compile the g4 files ahead of the java files
           See https://github.com/antlr/antlr4/blob/master/antlr4-maven-plugin/src/site/apt/examples/simple.apt.vm
           Except that the grammar does not need to contain the package declaration as stated in the documentation (I do not know why)
           To use this plugin, type:
             mvn antlr4:antlr4
           In any case, Maven will invoke this plugin before the Java source is compiled
      -->
      <plugin>
        <groupId>org.antlr</groupId>
        <artifactId>antlr4-maven-plugin</artifactId>
        <version>4.6</version>
        <executions>
          <execution>
            <goals>
              <goal>antlr4</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      [..]
    </plugins>
  </build>
</project>

Now you have to put the *.g4 file of your grammar under src/main/antlr4/me/tomassetti/examples/MarkupParser.
編寫完語法后,您只需運(yùn)行mvn package ,所有奇妙的事情就會發(fā)生:ANTLR被調(diào)用,它會生成詞法分析器和解析器,并將它們與其余代碼一起編譯。
// use mwn to generate the package mvn package如果您從未使用過Maven,則可以查看Java目標(biāo)的官方ANTLR文檔或Maven網(wǎng)站來入門。
使用Java開發(fā)ANTLR語法有一個明顯的優(yōu)勢:有多個IDE的插件,這是該工具的主要開發(fā)人員實際使用的語言。 因此,它們是org.antlr.v4.gui.TestRig類的工具,可以輕松地集成到您的工作流中,如果您想輕松地可視化輸入的AST,這些工具將非常有用。
5. C#設(shè)置
支持.NET Framework和Mono 3.5,但不支持.NET Core。 我們將使用Visual Studio創(chuàng)建我們的ANTLR項目,因為由C#目標(biāo)的同一作者為Visual Studio創(chuàng)建了一個不錯的擴(kuò)展,稱為ANTLR語言支持 。 您可以通過進(jìn)入工具->擴(kuò)展和更新來安裝它。 當(dāng)您構(gòu)建項目時,此擴(kuò)展將自動生成解析器,詞法分析器和訪問者/偵聽器。
此外,該擴(kuò)展名將允許您使用眾所周知的菜單添加新項目來創(chuàng)建新的語法文件。 最后但并非最不重要的一點是,您可以在每個語法文件的屬性中設(shè)置用于生成偵聽器/訪問者的選項。
另外,如果您更喜歡使用編輯器,則需要使用常規(guī)的Java工具生成所有內(nèi)容。 您可以通過指示正確的語言來做到這一點。 在此示例中,語法稱為“電子表格”。
antlr4 -Dlanguage=CSharp Spreadsheet.g4請注意,CSharp中的“ S”為大寫。
您仍然需要項目的ANTLR4運(yùn)行時,并且可以使用良好的nu'nuget安裝它。
初學(xué)者
在本節(jié)中,我們?yōu)槭褂肁NTLR奠定了基礎(chǔ):什么是詞法分析器和解析器,在語法中定義它們的語法以及可用于創(chuàng)建它們的策略。 我們還將看到第一個示例,以展示如何使用所學(xué)知識。 如果您不記得ANTLR的工作原理,可以回到本節(jié)。
6.詞法分析器
在研究解析器之前,我們需要首先研究詞法分析器,也稱為令牌化器。 它們基本上是解析器的第一個墊腳石,當(dāng)然ANTLR也允許您構(gòu)建它們。 詞法分析器將各個字符轉(zhuǎn)換為令牌 (解析器用來創(chuàng)建邏輯結(jié)構(gòu)的原子)。
想象一下,此過程適用于自然語言,例如英語。 您正在閱讀單個字符,將它們放在一起直到形成一個單詞,然后將不同的單詞組合成一個句子。
讓我們看下面的示例,并想象我們正在嘗試解析數(shù)學(xué)運(yùn)算。
437 + 734詞法分析器掃描文本,然后找到“ 4”,“ 3”,“ 7”,然后找到空格“”。 因此,它知道第一個字符實際上代表一個數(shù)字。 然后,它找到一個“ +”符號,因此知道它代表一個運(yùn)算符,最后找到另一個數(shù)字。
它怎么知道的? 因為我們告訴它。
/*
 * Parser Rules
 */
operation  : NUMBER '+' NUMBER ;

/*
 * Lexer Rules
 */
NUMBER     : [0-9]+ ;
WHITESPACE : ' ' -> skip ;

This is not a complete grammar, but we can already see that lexer rules are all uppercase, while parser rules are all lowercase. Technically the rule about case applies only to the first character of their names, but usually they are all uppercase or lowercase for clarity.
規(guī)則通常按以下順序編寫:首先是解析器規(guī)則,然后是詞法分析器規(guī)則,盡管在邏輯上它們是按相反的順序應(yīng)用的。 同樣重要的是要記住, 詞法分析器規(guī)則是按照它們出現(xiàn)的順序進(jìn)行分析的 ,它們可能是不明確的。
典型的例子是標(biāo)識符:在許多編程語言中,它可以是任何字母字符串,但是某些組合(例如“ class”或“ function”)被禁止,因為它們表示一個class或function 。 因此,規(guī)則的順序通過使用第一個匹配項來解決歧義,這就是為什么首先定義標(biāo)識關(guān)鍵字(例如類或函數(shù))的令牌,而最后一個用于標(biāo)識符的令牌的原因。
規(guī)則的基本語法很容易: 有一個名稱,一個冒號,該規(guī)則的定義和一個終止分號
NUMBER的定義包含一個典型的數(shù)字范圍和一個“ +”符號,表示允許一個或多個匹配項。 這些都是我認(rèn)為您熟悉的非常典型的指示,否則,您可以閱讀有關(guān)正則表達(dá)式的語法的更多信息。
最后,最有趣的部分是定義WHITESPACE令牌的詞法分析器規(guī)則。 這很有趣,因為它顯示了如何指示ANTLR忽略某些內(nèi)容。 考慮一下忽略空白如何簡化解析器規(guī)則:如果我們不能說忽略WHITESPACE,則必須將其包括在解析器的每個子規(guī)則之間,以便用戶在所需的地方放置空格。 像這樣:
operation : WHITESPACE* NUMBER WHITESPACE* '+' WHITESPACE* NUMBER;注釋通常也是如此:它們可以出現(xiàn)在任何地方,并且我們不想在語法的每個部分中都專門處理它們,因此我們只是忽略它們(至少在解析時)。
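To see what the lexer does with a skipped token, here is a miniature hand-written tokenizer for the "437 + 734" example (our own sketch, much simpler than a real generated lexer; "PLUS" is just a name we invent here for the literal '+'). Whitespace is matched, but never handed to the parser:

```python
import re

# Mirror of the grammar above: NUMBER, '+' and a skipped WHITESPACE.
SPEC = [
    ("NUMBER", re.compile(r"[0-9]+")),
    ("PLUS", re.compile(r"\+")),
    ("WHITESPACE", re.compile(r" ")),
]
SKIP = {"WHITESPACE"}  # tokens matched but not emitted to the parser

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        for name, pattern in SPEC:
            m = pattern.match(text, pos)
            if m:
                if name not in SKIP:
                    tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
    return tokens

print(tokenize("437 + 734"))
# [('NUMBER', '437'), ('PLUS', '+'), ('NUMBER', '734')]
```

Because WHITESPACE is consumed here, the parser rule can stay as simple as `operation : NUMBER '+' NUMBER ;`.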
7.創(chuàng)建語法
現(xiàn)在,我們已經(jīng)了解了規(guī)則的基本語法,下面我們來看看定義語法的兩種不同方法:自頂向下和自底向上。
自上而下的方法
這種方法包括從以您的語言編寫的文件的一般組織開始。
文件的主要部分是什么? 他們的順序是什么? 每個部分中包含什么?
例如,Java文件可以分為三個部分:
- 包裝聲明
- 進(jìn)口
- 類型定義
當(dāng)您已經(jīng)知道要為其設(shè)計語法的語言或格式時,此方法最有效。 具有良好理論背景的人或喜歡從“大計劃”入手的人可能會首選該策略。
使用這種方法時,首先要定義代表整個文件的規(guī)則。 它可能會包括其他規(guī)則,以代表主要部分。 然后,您定義這些規(guī)則,然后從最一般的抽象規(guī)則過渡到底層的實用規(guī)則。
Bottom-up approach
The bottom-up approach consists of focusing on the small elements first: defining how the tokens are captured, how the basic expressions are defined, and so on. Then we move to higher-level constructs until we define the rule representing the whole file.
I personally prefer to start from the bottom, the basic items, which are analyzed with the lexer. And then you grow naturally from there to the structure, which is dealt with by the parser. This approach permits focusing on a small piece of the grammar, building a test for it, ensuring it works as expected, and then moving on to the next bit.
This approach mimics the way we learn. Furthermore, there is the advantage of starting with real code, which is actually quite common among many languages. In fact, most languages have things like identifiers, comments, whitespace, etc. Obviously you might have to tweak something; for example, a comment in HTML is functionally the same as a comment in C#, but it has different delimiters.
The disadvantage of a bottom-up approach rests on the fact that the parser is the thing you actually care about. You weren't asked to build a lexer, you were asked to build a parser that could provide a specific functionality. So by starting on the last part, the lexer, you might end up doing some refactoring if you don't already know how the rest of the program will work.
8.設(shè)計數(shù)據(jù)格式
為新語言設(shè)計語法是困難的。 您必須創(chuàng)建一種對用戶來說簡單而直觀的語言,同時又要明確地使語法易于管理。 它必須簡潔,清晰,自然,并且不會妨礙用戶。
因此,我們從有限的內(nèi)容開始:一個簡單的聊天程序的語法。
讓我們從對目標(biāo)的更好描述開始:
- 不會有段落,因此我們可以使用換行符作為消息之間的分隔符
- 我們要允許表情符號,提及和鏈接。 我們將不支持HTML標(biāo)簽
- 由于我們的聊天將針對??討厭的青少年,因此我們希望為用戶提供一種簡單的方法來喊叫和設(shè)置文本顏色的格式。
最終,少年們可能會大喊大叫,全是粉紅色。 多么活著的時間。
9. Lexer規(guī)則
我們首先為聊天語言定義詞法分析器規(guī)則。 請記住,詞法分析器規(guī)則實際上位于文件的末尾。
/*
 * Lexer Rules
 */
fragment A : ('A'|'a') ;
fragment S : ('S'|'s') ;
fragment Y : ('Y'|'y') ;
fragment H : ('H'|'h') ;
fragment O : ('O'|'o') ;
fragment U : ('U'|'u') ;
fragment T : ('T'|'t') ;

fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;

SAYS       : S A Y S ;
SHOUTS     : S H O U T S ;
WORD       : (LOWERCASE | UPPERCASE | '_')+ ;
WHITESPACE : (' ' | '\t') ;
NEWLINE    : ('\r'? '\n' | '\r')+ ;
TEXT       : ~[\])]+ ;

In this example we use rule fragments: they are reusable building blocks for lexer rules. You define them and then you refer to them in lexer rules. If you define them but do not include them in lexer rules, they simply have no effect.
我們?yōu)橐陉P(guān)鍵字中使用的字母定義一個片段。 這是為什么? 因為我們要支持不區(qū)分大小寫的關(guān)鍵字。 除了避免重復(fù)字符的情況以外,在處理浮點數(shù)時也使用它們。 為了避免重復(fù)數(shù)字,請在點/逗號之前和之后。 如下面的例子。
fragment DIGIT : [0-9] ; NUMBER : DIGIT+ ([.,] DIGIT+)? ;TEXT令牌顯示如何捕獲所有內(nèi)容,除了波浪號('?')之后的字符以外。 我們不包括右方括號']',但是由于它是用于標(biāo)識一組字符結(jié)尾的字符,因此必須在其前面加上反斜杠'\'來對其進(jìn)行轉(zhuǎn)義。
換行規(guī)則是用這種方式制定的,因為操作系統(tǒng)實際上指示換行的方式不同,有些包括carriage return ('\r') ,有些包括newline ('\n') ,或者二者結(jié)合。
10.解析器規(guī)則
我們繼續(xù)解析器規(guī)則,這些規(guī)則是我們的程序?qū)⑴c之最直接交互的規(guī)則。
/*
 * Parser Rules
 */
chat     : line+ EOF ;
line     : name command message NEWLINE ;
message  : (emoticon | link | color | mention | WORD | WHITESPACE)+ ;
name     : WORD ;
command  : (SAYS | SHOUTS) ':' WHITESPACE ;
emoticon : ':' '-'? ')'
         | ':' '-'? '('
         ;
link     : '[' TEXT ']' '(' TEXT ')' ;
color    : '/' WORD '/' message '/' ;
mention  : '@' WORD ;

The first interesting part is message, not so much for what it contains, but for the structure it represents. We are saying that a message could be anything of the listed rules in any order. This is a simple way to solve the problem of dealing with whitespace without repeating it every time. Since we, as users, find whitespace irrelevant, we see something like WORD WORD mention, but the parser actually sees WORD WHITESPACE WORD WHITESPACE mention WHITESPACE.
當(dāng)您無法擺脫空白時,處理空白的另一種方法是更高級的:詞法模式。 基本上,它允許您指定兩個詞法分析器部分:一個用于結(jié)構(gòu)化部分,另一個用于簡單文本。 這對于解析XML或HTML之類的內(nèi)容很有用。 我們將在稍后展示。
很明顯, 命令規(guī)則很明顯,您只需要注意命令和冒號這兩個選項之間不能有空格,但是之后需要一個WHITESPACE 。 表情符號規(guī)則顯示了另一種表示多種選擇的符號,您可以使用豎線字符“ |” 沒有括號。 我們僅支持帶有或不帶有中間線的兩個表情符號,快樂和悲傷。
就像我們已經(jīng)說過的那樣, 鏈接規(guī)則可能被認(rèn)為是錯誤或執(zhí)行不佳,實際上, TEXT捕獲了除某些特殊字符之外的所有內(nèi)容。 您可能只想在括號內(nèi)使用WORD和WHITESPACE,或者在方括號內(nèi)強(qiáng)制使用正確的鏈接格式。 另一方面,這允許用戶在編寫鏈接時犯錯誤,而不會使解析器抱怨。
您必須記住,解析器無法檢查語義
例如,它不知道指示顏色的WORD是否實際代表有效顏色。 也就是說,它不知道使用“ dog”是錯誤的,但是使用“ red”是正確的。 必須通過程序的邏輯進(jìn)行檢查,該邏輯可以訪問可用的顏色。 您必須找到在語法和您自己的代碼之間劃分執(zhí)行力的正確平衡。
解析器應(yīng)僅檢查語法。 因此,經(jīng)驗法則是,如果有疑問,則讓解析器將內(nèi)容傳遞給程序。 然后,在程序中,檢查語義并確保規(guī)則實際上具有正確的含義。
讓我們看一下規(guī)則顏色:它可以包含一條消息 ,它本身也可以是消息的一部分; 這種歧義將通過使用的上下文來解決。
11.錯誤與調(diào)整
在嘗試新語法之前,我們必須在文件開頭添加一個名稱。 名稱必須與文件名相同,文件擴(kuò)展名應(yīng)為.g4 。
grammar Chat;您可以在官方文檔中找到如何為您的平臺安裝所有內(nèi)容 。 安裝完所有內(nèi)容后,我們創(chuàng)建語法,編譯生成的Java代碼,然后運(yùn)行測試工具。
// lines preceded by $ are commands
// > are input to the tool
// - are output from the tool
$ antlr4 Chat.g4
$ javac Chat*.java
// grun is the testing tool, Chat is the name of the grammar, chat the rule that we want to parse
$ grun Chat chat
> john SAYS: hello @michael this will not work
// CTRL+D on Linux, CTRL+Z on Windows
> CTRL+D/CTRL+Z
- line 1:0 mismatched input 'john SAYS: hello @michael this will not work\n' expecting WORD

Okay, it doesn't work. Why is it expecting WORD? It's right there! Let's try to find out, using the option -tokens to make grun show the tokens it recognizes.
$ grun Chat chat -tokens
> john SAYS: hello @michael this will not work
- [@0,0:44='john SAYS: hello @michael this will not work\n',<TEXT>,1:0]
- [@1,45:44='<EOF>',<EOF>,2:0]

So it only sees the TEXT token. But we put it at the end of the grammar, so what happens? The problem is that it always tries to match the largest possible token. And all this text is a valid TEXT token. How do we solve this problem? There are many ways; the first, of course, is just getting rid of that token. But for now we are going to see the second easiest.
[..]
link : TEXT TEXT ;
[..]
TEXT : ('['|'(') ~[\])]+ (']'|')') ;

We changed the problematic token to make it include a preceding parenthesis or square bracket. Note that this isn't exactly the same thing, because it would allow two series of parentheses or square brackets. But it is a first step, and we are learning here, after all.
Let's check if it works:
$ grun Chat chat -tokens
> john SAYS: hello @michael this will not work
- [@0,0:3='john',<WORD>,1:0]
- [@1,4:4=' ',<WHITESPACE>,1:4]
- [@2,5:8='SAYS',<SAYS>,1:5]
- [@3,9:9=':',<':'>,1:9]
- [@4,10:10=' ',<WHITESPACE>,1:10]
- [@5,11:15='hello',<WORD>,1:11]
- [@6,16:16=' ',<WHITESPACE>,1:16]
- [@7,17:17='@',<'@'>,1:17]
- [@8,18:24='michael',<WORD>,1:18]
- [@9,25:25=' ',<WHITESPACE>,1:25]
- [@10,26:29='this',<WORD>,1:26]
- [@11,30:30=' ',<WHITESPACE>,1:30]
- [@12,31:34='will',<WORD>,1:31]
- [@13,35:35=' ',<WHITESPACE>,1:35]
- [@14,36:38='not',<WORD>,1:36]
- [@15,39:39=' ',<WHITESPACE>,1:39]
- [@16,40:43='work',<WORD>,1:40]
- [@17,44:44='\n',<NEWLINE>,1:44]
- [@18,45:44='<EOF>',<EOF>,2:0]

Using the option -gui we can also have a nicer and easier-to-understand graphical representation.
The dot in mid-air represents whitespace.
This works, but it isn't very smart, nice, or organized. But don't worry, later we are going to see a better way. One positive aspect of this solution is that it allows us to show another trick.
TEXT : ('['|'(') .*? (']'|')') ;

This is an equivalent formulation of the token TEXT: the '.' matches any character, '*' says that the preceding match can be repeated any number of times, and '?' indicates that the previous match is non-greedy. That is to say, the previous sub-rule matches everything except what follows it, allowing the closing parenthesis or square bracket to be matched.
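The greedy/non-greedy distinction exists in ordinary regular expressions too, so we can illustrate it in Python, where `.*?` has the same meaning (our own example text, unrelated to the chat grammar):

```python
import re

text = "[label](url) trailing ) bracket"

# Greedy: .* grabs as much as possible, so the final ')' matched is
# the LAST one in the string.
greedy = re.match(r"\[.*\]\(.*\)", text)
print(greedy.group())   # '[label](url) trailing )'

# Non-greedy: .*? stops at the first closing delimiter it can.
lazy = re.match(r"\[.*?\]\(.*?\)", text)
print(lazy.group())     # '[label](url)'
```

This is why the non-greedy `?` is essential in the TEXT rule: without it, a single token could swallow several links at once.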
中級
在本節(jié)中,我們將了解如何在程序中使用ANTLR,需要使用的庫和函數(shù),如何測試解析器等。 我們了解什么是監(jiān)聽器以及如何使用監(jiān)聽器。 通過查看更高級的概念(例如語義謂詞),我們還基于對基礎(chǔ)知識的了解。 盡管我們的項目主要使用Javascript和Python,但該概念通常適用于每種語言。 當(dāng)您需要記住如何組織項目時,可以回到本節(jié)。
12.使用Java腳本設(shè)置聊天項目
在前面的部分中,我們逐段地介紹了如何為聊天程序構(gòu)建語法。 現(xiàn)在,讓我們復(fù)制剛在Javascript文件的同一文件夾中創(chuàng)建的語法。
grammar Chat;

/*
 * Parser Rules
 */
chat     : line+ EOF ;
line     : name command message NEWLINE ;
message  : (emoticon | link | color | mention | WORD | WHITESPACE)+ ;
name     : WORD WHITESPACE ;
command  : (SAYS | SHOUTS) ':' WHITESPACE ;
emoticon : ':' '-'? ')'
         | ':' '-'? '('
         ;
link     : TEXT TEXT ;
color    : '/' WORD '/' message '/' ;
mention  : '@' WORD ;

/*
 * Lexer Rules
 */
fragment A : ('A'|'a') ;
fragment S : ('S'|'s') ;
fragment Y : ('Y'|'y') ;
fragment H : ('H'|'h') ;
fragment O : ('O'|'o') ;
fragment U : ('U'|'u') ;
fragment T : ('T'|'t') ;

fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;

SAYS       : S A Y S ;
SHOUTS     : S H O U T S ;
WORD       : (LOWERCASE | UPPERCASE | '_')+ ;
WHITESPACE : (' ' | '\t')+ ;
NEWLINE    : ('\r'? '\n' | '\r')+ ;
TEXT       : ('['|'(') ~[\])]+ (']'|')') ;

We can create the corresponding JavaScript parser simply by specifying the correct option with the ANTLR4 Java program.
antlr4 -Dlanguage=JavaScript Chat.g4

Now you will find some new files in the folder, with names such as ChatLexer.js and ChatParser.js, and there are also *.tokens files, none of which contains anything interesting for us, unless you want to understand the inner workings of ANTLR.
The file you want to look at is ChatListener.js. You are not going to modify anything in it, but it contains the methods and functions that we will override with our own listener. We are not going to modify it, because changes would be overwritten every time the grammar is regenerated.
Looking into it, you can see several enter/exit functions, a pair for each of our parser rules. These functions will be invoked when a piece of code matching the rule is encountered. This is the default implementation of the listener, and it allows you to just override the functions that you need on your derived listener and leave the rest untouched.
var antlr4 = require('antlr4/index');

// This class defines a complete listener for a parse tree produced by ChatParser.
function ChatListener() {
    antlr4.tree.ParseTreeListener.call(this);
    return this;
}

ChatListener.prototype = Object.create(antlr4.tree.ParseTreeListener.prototype);
ChatListener.prototype.constructor = ChatListener;

// Enter a parse tree produced by ChatParser#chat.
ChatListener.prototype.enterChat = function(ctx) {
};

// Exit a parse tree produced by ChatParser#chat.
ChatListener.prototype.exitChat = function(ctx) {
};

[..]

The alternative to creating a Listener is creating a Visitor. The main differences are that you can neither control the flow of a listener nor return anything from its functions, while you can do both with a visitor. So if you need to control how the nodes of the AST are entered, or to gather information from several of them, you probably want to use a visitor. This is useful, for example, with code generation, where some information that is needed to create new source code is spread around many parts. Both the listener and the visitor use depth-first search.
深度優(yōu)先搜索意味著當(dāng)一個節(jié)點將被訪問時,其子節(jié)點將被訪問,如果一個子節(jié)點具有自己的子節(jié)點,則在繼續(xù)第一個節(jié)點的其他子節(jié)點之前,將對其進(jìn)行訪問。 下圖將使您更容易理解該概念。
因此,對于偵聽器,在與該節(jié)點的第一次相遇時將觸發(fā)enter事件,并且在退出所有子節(jié)點之后將觸發(fā)出口。 在下圖中,您可以看到在偵聽器遇到線路節(jié)點時將觸發(fā)哪些功能的示例(為簡單起見,僅顯示與線路相關(guān)的功能)。
對于標(biāo)準(zhǔn)的訪問者,其行為將是相似的,當(dāng)然,對于每個單個節(jié)點都只會觸發(fā)單個訪問事件。 在下圖中,您可以看到訪問者遇到線路節(jié)點時將觸發(fā)哪些功能的示例(為簡單起見,僅顯示與線路相關(guān)的功能)。
請記住, 這對于訪問者的默認(rèn)實現(xiàn)是正確的 , 這是通過返回每個函數(shù)中每個節(jié)點的子代來完成的 。 如果您忽略了訪問者的方法,則有責(zé)任使訪問者繼續(xù)旅行或在此停留。
13. Antlr.js
It is finally time to see what a typical ANTLR program looks like.
const http = require('http');
const antlr4 = require('antlr4/index');
const ChatLexer = require('./ChatLexer');
const ChatParser = require('./ChatParser');
const HtmlChatListener = require('./HtmlChatListener').HtmlChatListener;

http.createServer((req, res) => {
   res.writeHead(200, {
       'Content-Type': 'text/html',
   });

   res.write('<html><head><meta charset="UTF-8"/></head><body>');

   var input = "john SHOUTS: hello @michael /pink/this will work/ :-) \n";
   var chars = new antlr4.InputStream(input);
   var lexer = new ChatLexer.ChatLexer(chars);
   var tokens = new antlr4.CommonTokenStream(lexer);
   var parser = new ChatParser.ChatParser(tokens);
   parser.buildParseTrees = true;
   var tree = parser.chat();
   var htmlChat = new HtmlChatListener(res);
   antlr4.tree.ParseTreeWalker.DEFAULT.walk(htmlChat, tree);

   res.write('</body></html>');
   res.end();

}).listen(1337);

At the beginning of the main file we import (using require) the necessary libraries and files: antlr4 (the runtime) and our generated parser, plus the listener that we are going to see later.
為簡單起見,我們從字符串中獲取輸入,而在實際情況下,它將來自編輯器。
第16-19行顯示了每個ANTLR程序的基礎(chǔ):您從輸入創(chuàng)建字符流,將其提供給詞法分析器,然后將其轉(zhuǎn)換為令牌,然后由解析器對其進(jìn)行解釋。
花一點時間思考一下是很有用的:詞法分析器處理輸入的字符,準(zhǔn)確地說是輸入的副本,而解析器處理解析器生成的標(biāo)記。 詞法分析器無法直接處理輸入,解析器甚至看不到字符 。
記住這一點很重要,以防您需要執(zhí)行一些高級操作(如操縱輸入)。 在這種情況下,輸入是字符串,但當(dāng)然可以是任何內(nèi)容流。
第20行是多余的,因為該選項已經(jīng)默認(rèn)為true,但是在以后的運(yùn)行時版本中可能會更改,因此最好指定它。
然后,在第21行,將樹的根節(jié)點設(shè)置為聊天規(guī)則。 您要調(diào)用解析器,指定一個通常是第一條規(guī)則的規(guī)則。 但是,實際上您可以直接調(diào)用任何規(guī)則,例如color 。
通常,一旦從解析器中獲取AST,我們就想使用偵聽器或訪問者來處理它。 在這種情況下,我們指定一個偵聽器。 我們特定的偵聽器采用一個參數(shù):響應(yīng)對象。 我們希望使用它在響應(yīng)中放入一些文本以發(fā)送給用戶。 設(shè)置好聽眾之后,我們最終與聽眾一起走到樹上。
14. HtmlChatListener.js
We continue by looking at the listener of our Chat project.
const antlr4 = require('antlr4/index');
const ChatLexer = require('./ChatLexer');
const ChatParser = require('./ChatParser');
var ChatListener = require('./ChatListener').ChatListener;

HtmlChatListener = function(res) {
    this.Res = res;
    ChatListener.call(this); // inherit default listener
    return this;
};

// inherit default listener
HtmlChatListener.prototype = Object.create(ChatListener.prototype);
HtmlChatListener.prototype.constructor = HtmlChatListener;

// override default listener behavior
HtmlChatListener.prototype.enterName = function(ctx) {
    this.Res.write("<strong>");
};

HtmlChatListener.prototype.exitName = function(ctx) {
    this.Res.write(ctx.WORD().getText());
    this.Res.write("</strong> ");
};

HtmlChatListener.prototype.exitEmoticon = function(ctx) {
    var emoticon = ctx.getText();

    if(emoticon == ':-)' || emoticon == ':)'){
        this.Res.write("??");
    }

    if(emoticon == ':-(' || emoticon == ':('){
        this.Res.write("??");
    }
};

HtmlChatListener.prototype.enterCommand = function(ctx) {
    if(ctx.SAYS() != null)
        this.Res.write(ctx.SAYS().getText() + ':' + '<p>');

    if(ctx.SHOUTS() != null)
        this.Res.write(ctx.SHOUTS().getText() + ':' + '<p style="text-transform: uppercase">');
};

HtmlChatListener.prototype.exitLine = function(ctx) {
    this.Res.write("</p>");
};

exports.HtmlChatListener = HtmlChatListener;

After the require function calls we make our HtmlChatListener extend ChatListener. The interesting stuff starts at line 17.
The ctx argument is an instance of a specific class context for the node that we are entering/exiting. So for enterName it is NameContext, for exitEmoticon it is EmoticonContext, etc. This specific context will have the proper elements for the rule, which makes it possible to easily access the respective tokens and sub-rules. For example, NameContext will contain fields like WORD() and WHITESPACE(); CommandContext will contain fields like WHITESPACE(), SAYS() and SHOUTS().
These functions, enter* and exit*, are called by the walker every time the corresponding nodes are entered or exited while it's traversing the AST that represents the program. A listener allows you to execute some code, but it's important to remember that you can't stop the execution of the walker and the execution of the functions.
On line 18, we start by printing a strong tag because we want the name to be bold; then, on exitName, we take the text from the token WORD and close the tag. Note that we ignore the WHITESPACE token — nothing says that we have to show everything. In this case we could have done everything either in the enter or in the exit function.
In the function exitEmoticon we simply transform the emoticon text into an emoji character. We get the text of the whole rule because there are no tokens defined for this parser rule. In enterCommand, instead, there could be either of two tokens, SAYS or SHOUTS, so we check which one is defined. And then we alter the following text, by transforming it to uppercase, if it's a SHOUT. Note that we close the p tag at the exit of the line rule, because the command, semantically speaking, alters all the text of the message.
All we have to do now is launch node, with nodejs antlr.js, and point our browser at its address, usually http://localhost:1337/, and we will be greeted with the following image.
So all is good, we just have to add all the different listeners to handle the rest of the language. Let's start with color and message .
15. Working with a Listener
We have seen how to start defining a listener. Now let's get serious and see how to evolve it into a complete, robust listener. Let's start by adding support for color and checking the results of our hard work.
HtmlChatListener.prototype.enterColor = function(ctx) {
    var color = ctx.WORD().getText();
    this.Res.write('<span style="color: ' + color + '">');
};

HtmlChatListener.prototype.exitColor = function(ctx) {
    this.Res.write("</span>");
};

HtmlChatListener.prototype.exitMessage = function(ctx) {
    this.Res.write(ctx.getText());
};

exports.HtmlChatListener = HtmlChatListener;

Except that it doesn't work. Or maybe it works too much: we are writing some part of message twice ("this will work"): first when we check the specific nodes, children of message, and then at the end.
Luckily with JavaScript we can dynamically alter objects, so we can take advantage of this fact to change the *Context objects themselves.
HtmlChatListener.prototype.exitColor = function(ctx) {
    ctx.text += ctx.message().text;
    ctx.text += '</span>';
};

HtmlChatListener.prototype.exitEmoticon = function(ctx) {
    var emoticon = ctx.getText();

    if(emoticon == ':-)' || emoticon == ':)'){
        ctx.text = "??";
    }

    if(emoticon == ':-(' || emoticon == ':('){
        ctx.text = "??";
    }
};

HtmlChatListener.prototype.exitMessage = function(ctx) {
    var text = '';

    for (var index = 0; index < ctx.children.length; index++ ) {
        if(ctx.children[index].text != null)
            text += ctx.children[index].text;
        else
            text += ctx.children[index].getText();
    }

    if(ctx.parentCtx instanceof ChatParser.ChatParser.LineContext == false){
        ctx.text = text;
    } else {
        this.Res.write(text);
        this.Res.write("</p>");
    }
};

Only the modified parts are shown in the snippet above. We add a text field to every node that transforms its text, and then at the exit of every message we print the text if it's the primary message, the one that is directly a child of the line rule. If it's a message that is also a child of color, we append the text field to the node we are exiting and let color print it. We check this on line 30, where we look at the parent node to see whether it's an instance of the object LineContext. This is also further evidence of how each ctx argument corresponds to the proper type.
Between lines 23 and 27 we can see another field of every node of the generated tree: children, which obviously contains the child nodes. You can observe that if a field text exists we add it to the proper variable; otherwise we use the usual function to get the text of the node.
16. Solving Ambiguities with Semantic Predicates
So far we have seen how to build a parser for a chat language in JavaScript. Let's continue working on this grammar, but switch to Python. Remember that all the code is available in the repository. Before that, we have to solve an annoying problem: the TEXT token. The solution we have is terrible, and furthermore, if we tried to get the text of the token, we would have to trim the edges, parentheses or square brackets. So what can we do?
We can use a particular feature of ANTLR called semantic predicates. As the name implies, they are expressions that produce a boolean value. They selectively enable or disable the following rule and thus permit solving ambiguities. Another reason they could be used is to support different versions of the same language, for instance a version with a new construct or an old one without it.
Technically they are part of the larger group of actions, which allow you to embed arbitrary code into the grammar. The downside is that the grammar is no longer language independent, since the code in the action must be valid for the target language. For this reason, it's usually considered a good idea to only use semantic predicates when they can't be avoided, and to leave most of the code to the visitor/listener.
link : '[' TEXT ']' '(' TEXT ')' ;

TEXT : {self._input.LA(-1) == ord('[') or self._input.LA(-1) == ord('(')}? ~[\])]+ ;

We restored link to its original formulation, but we added a semantic predicate to the TEXT token, written inside curly brackets and followed by a question mark. We use self._input.LA(-1) to check the character before the current one; if this character is a square bracket or an open parenthesis, we activate the TEXT token. It's important to repeat that this must be valid code in our target language; it's going to end up in the generated lexer or parser, in our case in ChatLexer.py.
This matters not just for the syntax itself, but also because different targets might have different fields or methods; for instance, LA returns an int in Python, so we have to convert the char to an int.
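The ord conversion is easy to verify in plain Python: the lookahead returns the character as an integer code point, so comparing it with a one-character string would always be false.

```python
# What the Python lexer's predicate sees for the previous character:
# an integer code point, not a one-character string.
previous_char = ord('[')

print(previous_char)              # 91
print(previous_char == '[')       # False: an int never equals a str
print(previous_char == ord('['))  # True: compare code point to code point
```

This is why the Python predicate is written `self._input.LA(-1) == ord('[')`, while the Java and JavaScript versions can compare against `'['` directly.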
Let's look at the equivalent form in other languages.
// C#. Notice that is .La and not .LA
TEXT : {_input.La(-1) == '[' || _input.La(-1) == '('}? ~[\])]+ ;
// Java
TEXT : {_input.LA(-1) == '[' || _input.LA(-1) == '('}? ~[\])]+ ;
// Javascript
TEXT : {this._input.LA(-1) == '[' || this._input.LA(-1) == '('}? ~[\])]+ ;

If you want to test for the preceding token, you can use _input.LT(-1), but you can only do that for parser rules. For example, say you want to enable the mention rule only if it is preceded by a WHITESPACE token.
// C#
mention: {_input.Lt(-1).Type == WHITESPACE}? '@' WORD ;
// Java
mention: {_input.LT(1).getType() == WHITESPACE}? '@' WORD ;
// Python
mention: {self._input.LT(-1).text == ' '}? '@' WORD ;
// Javascript
mention: {this._input.LT(1).text == ' '}? '@' WORD ;

17. Continuing the Chat in Python
Before seeing the Python example, we must modify our grammar and put the TEXT token before the WORD one. Otherwise ANTLR might assign the incorrect token, in cases where the characters between parentheses or brackets are all valid for WORD, for instance if it were [this](link).
Using ANTLR in python is not more difficult than with any other platform, you just need to pay attention to the version of Python, 2 or 3.
antlr4 -Dlanguage=Python3 Chat.g4

That's it. So, when you have run the command inside the directory of your python project, there will be a newly generated parser and lexer. You may find it interesting to look at ChatLexer.py, and in particular at the function TEXT_sempred (sempred stands for semantic predicate).
def TEXT_sempred(self, localctx:RuleContext, predIndex:int):
    if predIndex == 0:
        return self._input.LA(-1) == ord('[') or self._input.LA(-1) == ord('(')

You can see our predicate right in the code. This also means that you have to check that the correct libraries, for the functions used in the predicate, are available to the lexer.
18. The Python Way of Working with a Listener
The main file of a Python project is very similar to a Javascript one, mutatis mutandis of course. That is to say we have to adapt libraries and functions to the proper version for a different language.
import sys
from antlr4 import *
from ChatLexer import ChatLexer
from ChatParser import ChatParser
from HtmlChatListener import HtmlChatListener

def main(argv):
    input = FileStream(argv[1])
    lexer = ChatLexer(input)
    stream = CommonTokenStream(lexer)
    parser = ChatParser(stream)
    tree = parser.chat()

    output = open("output.html","w")

    htmlChat = HtmlChatListener(output)
    walker = ParseTreeWalker()
    walker.walk(htmlChat, tree)

    output.close()

if __name__ == '__main__':
    main(sys.argv)

We have also changed the input and output to files; this avoids the need to launch a server in Python, as well as the problem of using characters that are not supported in the terminal.
import sys
from antlr4 import *
from ChatParser import ChatParser
from ChatListener import ChatListener

class HtmlChatListener(ChatListener) :
    def __init__(self, output):
        self.output = output
        self.output.write('<html><head><meta charset="UTF-8"/></head><body>')

    def enterName(self, ctx:ChatParser.NameContext) :
        self.output.write("<strong>")

    def exitName(self, ctx:ChatParser.NameContext) :
        self.output.write(ctx.WORD().getText())
        self.output.write("</strong> ")

    def enterColor(self, ctx:ChatParser.ColorContext) :
        color = ctx.WORD().getText()
        ctx.text = '<span style="color: ' + color + '">'

    def exitColor(self, ctx:ChatParser.ColorContext):
        ctx.text += ctx.message().text
        ctx.text += '</span>'

    def exitEmoticon(self, ctx:ChatParser.EmoticonContext) :
        emoticon = ctx.getText()

        if emoticon == ':-)' or emoticon == ':)' :
            ctx.text = "??"

        if emoticon == ':-(' or emoticon == ':(' :
            ctx.text = "??"

    def enterLink(self, ctx:ChatParser.LinkContext):
        ctx.text = '<a href="%s">%s</a>' % (ctx.TEXT()[1], (ctx.TEXT()[0]))

    def exitMessage(self, ctx:ChatParser.MessageContext):
        text = ''

        for child in ctx.children:
            if hasattr(child, 'text'):
                text += child.text
            else:
                text += child.getText()

        if isinstance(ctx.parentCtx, ChatParser.LineContext) is False:
            ctx.text = text
        else:
            self.output.write(text)
            self.output.write("</p>")

    def enterCommand(self, ctx:ChatParser.CommandContext):
        if ctx.SAYS() is not None :
            self.output.write(ctx.SAYS().getText() + ':' + '<p>')

        if ctx.SHOUTS() is not None :
            self.output.write(ctx.SHOUTS().getText() + ':' + '<p style="text-transform: uppercase">')

    def exitChat(self, ctx:ChatParser.ChatContext):
        self.output.write("</body></html>")

Apart from lines 35-36, where we introduce support for links, there is nothing new. Though you might notice that Python syntax is cleaner and, while having dynamic typing, it is not loosely typed as JavaScript is. The different types of *Context objects are explicitly written out. If only Python tools were as easy to use as the language itself.
But of course we cannot just fly over Python like this, so we also introduce testing.
19. Testing with Python
While Visual Studio Code has a very nice extension for Python that also supports unit testing, we are going to use the command line for the sake of compatibility.
```shell
python3 -m unittest discover -s . -p ChatTests.py
```

That's how you run the tests, but before that we have to write them. Actually, even before that, we have to write an ErrorListener to manage the errors that we could find. While we could simply read the text output by the default error listener, there is an advantage in using our own implementation: we can more easily control what happens.
```python
import sys
from antlr4 import *
from ChatParser import ChatParser
from ChatListener import ChatListener
from antlr4.error.ErrorListener import *
import io

class ChatErrorListener(ErrorListener):
    def __init__(self, output):
        self.output = output
        self._symbol = ''

    def syntaxError(self, recognizer, offendingSymbol, line, column, msg, e):
        self.output.write(msg)
        self._symbol = offendingSymbol.text

    @property
    def symbol(self):
        return self._symbol
```

Our class derives from ErrorListener and we simply have to implement syntaxError. We also add a property symbol to easily check which symbol might have caused an error.
```python
from antlr4 import *
from ChatLexer import ChatLexer
from ChatParser import ChatParser
from HtmlChatListener import HtmlChatListener
from ChatErrorListener import ChatErrorListener
import unittest
import io

class TestChatParser(unittest.TestCase):
    def setup(self, text):
        lexer = ChatLexer(InputStream(text))
        stream = CommonTokenStream(lexer)
        parser = ChatParser(stream)
        self.output = io.StringIO()
        self.error = io.StringIO()
        parser.removeErrorListeners()
        errorListener = ChatErrorListener(self.error)
        parser.addErrorListener(errorListener)
        self.errorListener = errorListener
        return parser

    def test_valid_name(self):
        parser = self.setup("John ")
        tree = parser.name()
        htmlChat = HtmlChatListener(self.output)
        walker = ParseTreeWalker()
        walker.walk(htmlChat, tree)
        # let's check that there aren't any symbols in errorListener
        self.assertEqual(len(self.errorListener.symbol), 0)

    def test_invalid_name(self):
        parser = self.setup("Joh-")
        tree = parser.name()
        htmlChat = HtmlChatListener(self.output)
        walker = ParseTreeWalker()
        walker.walk(htmlChat, tree)
        # let's check the symbol in errorListener
        self.assertEqual(self.errorListener.symbol, '-')

if __name__ == '__main__':
    unittest.main()
```

The setup method is used to ensure that everything is properly set up: we also register our ChatErrorListener, but first we remove the default one, otherwise it would still print errors on the standard output. We are listening to errors in the parser, but we could also catch errors generated by the lexer. It depends on what you want to test. You may want to check both.
The two proper test methods check for a valid and an invalid name. The checks are linked to the property symbol that we defined earlier: if it's empty, everything is fine; otherwise it contains the symbol that caused the error. Notice that the input in test_valid_name ends with a space, because we have defined the rule name to end with a WHITESPACE token.
20. Parsing Markup
ANTLR can parse many things, including binary data; in that case tokens are made up of non-printable characters. But a more common problem is parsing markup languages such as XML or HTML. Markup is also a useful format to adopt for your own creations, because it allows you to mix unstructured text content with structured annotations. Such documents fundamentally represent a form of smart document, containing both text and structured data. The technical term that describes them is island languages. This category is not restricted to markup, and sometimes the classification is a matter of perspective.
For example, you may have to build a parser that ignores preprocessor directives. In that case, you have to find a way to distinguish proper code from directives, which obey different rules.
In any case, the problem with parsing such languages is that there is a lot of text that we don't actually have to parse, but that we cannot ignore or discard, because the text contains useful information for the user and is a structural part of the document. The solution is lexical modes: a way to parse structured content inside a larger sea of free text.
21. Lexical Modes
We are going to see how to use lexical modes, by starting with a new grammar.
```
lexer grammar MarkupLexer;

OPEN   : '[' -> pushMode(BBCODE) ;
TEXT   : ~('[')+ ;

// Parsing content inside tags
mode BBCODE;

CLOSE  : ']' -> popMode ;
SLASH  : '/' ;
EQUALS : '=' ;
STRING : '"' .*? '"' ;
ID     : LETTERS+ ;
WS     : [ \t\r\n] -> skip ;

fragment LETTERS : [a-zA-Z] ;
```

Looking at the first line you can notice a difference: we are defining a lexer grammar, instead of the usual (combined) grammar. You simply can't define a lexical mode together with a parser grammar; you can use lexical modes only in a lexer grammar, not in a combined grammar. The rest is not surprising: as you can see, we are defining a sort of BBCode markup, with tags delimited by square brackets.
The OPEN rule, the mode BBCODE declaration and the CLOSE rule contain basically all that you need to know about lexical modes: you define one or more tokens that can delimit the different modes and activate them.
The default mode is already implicitly defined; if you need to define your own you simply use mode followed by a name. Other than for markup languages, lexical modes are typically used to deal with string interpolation: cases where a string literal can contain not just simple text, but things like arbitrary expressions.
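To make the string-interpolation use case concrete, here is a minimal, hypothetical sketch (our own illustration, not one of the tutorial's grammars) of a lexer that enters a string mode on an opening quote and temporarily returns to the default mode inside an interpolated expression:

```
lexer grammar InterpolationLexer;

OPEN_STRING : '"' -> pushMode(STR) ;
// ...the other default-mode tokens of the language would go here...

mode STR;

INTERP_START : '${' -> pushMode(DEFAULT_MODE) ; // tokenize the embedded expression normally
CLOSE_STRING : '"'  -> popMode ;
STR_TEXT     : ~["$]+ ;
```

A complete grammar would also need a way to pop back out of the interpolated expression when the closing brace is met; this sketch only shows the mode-switching idea.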
When we used a combined grammar we could define tokens implicitly: that is what we did whenever we used a string like '=' in a parser rule. Now that we are using separate lexer and parser grammars we cannot do that, which means that every single token has to be defined explicitly. So we have definitions like SLASH or EQUALS, which otherwise could just be used directly in a parser rule. The concept is simple: in the lexer grammar we need to define all tokens, because they cannot be defined later in the parser grammar.
22. Parser Grammars
We look at the other side of a lexer grammar, so to speak.
```
parser grammar MarkupParser;

options { tokenVocab=MarkupLexer; }

file      : element* ;
attribute : ID '=' STRING ;
content   : TEXT ;
element   : (content | tag) ;
tag       : '[' ID attribute? ']' element* '[' '/' ID ']' ;
```

On the first line we define a parser grammar. Since the tokens we need are defined in the lexer grammar, we need to use an option to tell ANTLR where it can find them. This is not necessary in combined grammars, since the tokens are defined in the same file.
There are many other options available; you can find them in the documentation.
There is almost nothing else to add, except that we define a content rule so that we can more easily manage the text that we find later in the program.
I just want to point out that, as you can see, we don't need to explicitly use the tokens every time (e.g., SLASH); instead we can use the corresponding text (e.g., '/').
ANTLR will automatically transform the text into the corresponding token, but this can happen only if the tokens are already defined. In short, it is as if we had written:
```
tag : OPEN ID attribute? CLOSE element* OPEN SLASH ID CLOSE ;
```

But we could not have used the implicit way if we hadn't already explicitly defined the tokens in the lexer grammar. Another way to look at this is: when we define a combined grammar, ANTLR defines for us all the tokens that we have not explicitly defined ourselves. When we use a separate lexer and parser grammar, we have to define every token explicitly ourselves. Once we have done that, we can use them in whichever way we want.
Before moving to actual Java code, let's see the AST for a sample input.
You can easily notice that the element rule is sort of transparent: where you would expect to find it, there is always going to be a tag or content instead. So why did we define it? There are two advantages: it avoids repetition in our grammar and it simplifies managing the results of the parsing. We avoid repetition because without the element rule we would have to repeat (content|tag) everywhere it is used. What if one day we add a new type of element? In addition to that, it simplifies the processing of the AST because it makes the nodes that represent tag and content extend a common ancestor.
Advanced
In this section we deepen our understanding of ANTLR. We will look at more complex examples and situations we may have to handle in our parsing adventures. We will learn how to perform more advanced testing, to catch more bugs and ensure better quality for our code. We will see what a visitor is and how to use it. Finally, we will see how to deal with expressions and the complexity they bring.
You can come back to this section when you need to deal with complex parsing problems.
23. The Markup Project in Java
You can follow the instructions in Java Setup or just copy the antlr-java folder of the companion repository. Once the file pom.xml is properly configured, this is how you build and execute the application.
```shell
# use mvn to generate the package
mvn package
# every time you need to execute the application
java -cp target/markup-example-1.0-jar-with-dependencies.jar me.tomassetti.examples.MarkupParser.App
```

As you can see, it isn't any different from any typical Maven project, although it's indeed more complicated than a typical Javascript or Python project. Of course, if you use an IDE you don't need to do anything different from your typical workflow.
24. The Main App.java
We are going to see how to write a typical ANTLR application in Java.
```java
package me.tomassetti.examples.MarkupParser;

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;

public class App
{
    public static void main( String[] args )
    {
        ANTLRInputStream inputStream = new ANTLRInputStream(
            "I would like to [b][i]emphasize[/i][/b] this and [u]underline [b]that[/b][/u] ." +
            "Let's not forget to quote: [quote author=\"John\"]You're wrong![/quote]");
        MarkupLexer markupLexer = new MarkupLexer(inputStream);
        CommonTokenStream commonTokenStream = new CommonTokenStream(markupLexer);
        MarkupParser markupParser = new MarkupParser(commonTokenStream);

        MarkupParser.FileContext fileContext = markupParser.file();
        MarkupVisitor visitor = new MarkupVisitor();
        visitor.visit(fileContext);
    }
}
```

At this point the main Java file should not come as a surprise; the only new development is the visitor. Of course, there are the obvious little differences in the names of the ANTLR classes and such. This time we are building a visitor, whose main advantage is the chance to control the flow of the program. While we are still dealing with text, we don't want to display it: we want to transform it from pseudo-BBCode to pseudo-Markdown.
25. Transforming Code with ANTLR
The first issue to deal with in our translation from pseudo-BBCode to pseudo-Markdown is a design decision. Our two languages are different, and frankly neither of the two original ones is that well designed.
BBCode was created as a safety precaution, to make it possible to disallow the use of HTML while giving some of its power to users. Markdown was created to be an easy-to-read and easy-to-write format that could be translated into HTML. So they both mimic HTML, and you can actually use HTML in a Markdown document. Let's start looking into how messy a real conversion would be.
```java
package me.tomassetti.examples.MarkupParser;

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.misc.*;
import org.antlr.v4.runtime.tree.*;

public class MarkupVisitor extends MarkupParserBaseVisitor<String>
{
    @Override
    public String visitFile(MarkupParser.FileContext context)
    {
        visitChildren(context);
        System.out.println("");
        return null;
    }

    @Override
    public String visitContent(MarkupParser.ContentContext context)
    {
        System.out.print(context.TEXT().getText());
        return visitChildren(context);
    }
}
```

The first version of our visitor prints all the text and ignores all the tags.
You can see how to control the flow, either by calling visitChildren or any other visit* function, and by deciding what to return. We just need to override the methods that we want to change. Otherwise, the default implementation would behave just like visitContent: it visits the child nodes and allows the visitor to continue. Just like for a listener, the argument is the proper context type. If you want to stop the visitor, just return null, as in visitFile.
26. Joy and Pain of Transforming Code
Transforming code, even at a very simple level, comes with some complications. Let's start easy with some basic visitor methods.
```java
@Override
public String visitContent(MarkupParser.ContentContext context)
{
    return context.getText();
}

@Override
public String visitElement(MarkupParser.ElementContext context)
{
    if(context.parent instanceof MarkupParser.FileContext)
    {
        if(context.content() != null)
            System.out.print(visitContent(context.content()));
        if(context.tag() != null)
            System.out.print(visitTag(context.tag()));
    }

    return null;
}
```

Before looking at the main method, let's look at the supporting ones. Foremost, we have changed visitContent by making it return its text instead of printing it. Second, we have overridden visitElement so that it prints the text of its child, but only if it's a top-level element, and not inside a tag. In both cases, it achieves this by calling the proper visit* method. It knows which one to call because it checks whether it actually has a tag or content node.
```java
@Override
public String visitTag(MarkupParser.TagContext context)
{
    String text = "";
    String startDelimiter = "", endDelimiter = "";

    String id = context.ID(0).getText();

    switch(id)
    {
        case "b":
            startDelimiter = endDelimiter = "**";
            break;
        case "u":
            startDelimiter = endDelimiter = "*";
            break;
        case "quote":
            String attribute = context.attribute().STRING().getText();
            attribute = attribute.substring(1, attribute.length() - 1);
            startDelimiter = System.lineSeparator() + "> ";
            endDelimiter = System.lineSeparator() + "> "
                         + System.lineSeparator() + "> – " + attribute
                         + System.lineSeparator();
            break;
    }

    text += startDelimiter;

    for (MarkupParser.ElementContext node : context.element())
    {
        if(node.tag() != null)
            text += visitTag(node.tag());
        if(node.content() != null)
            text += visitContent(node.content());
    }

    text += endDelimiter;

    return text;
}
```

VisitTag contains more code than any other method, because a tag can also contain other elements, including other tags that have to be managed themselves, and thus they cannot simply be printed. We save the content of the ID at the beginning; of course, we don't need to check that the corresponding end tag matches, because the parser will ensure that, as long as the input is well formed.
The first complication starts with the "u" case: as often happens when transforming one language into another, there isn't a perfect correspondence between the two. While BBCode tries to be a smarter and safer replacement for HTML, Markdown wants to accomplish the same objective as HTML: creating a structured document. So BBCode has an underline tag, while Markdown does not.
So we have to make a decision
Do we want to discard the information, or directly print HTML, or something else? We choose something else and convert the underline to an italic. That might seem completely arbitrary, and indeed there is an element of choice in this decision. But the conversion forces us to lose some information, and since both are used for emphasis, we choose the closest thing in the new language.
The quote case forces us to make another choice. We can't maintain the information about the author of the quote in a structured way, so we choose to print the information in a way that will make sense to a human reader.
In the final loop we do our "magic": we visit the children and gather their text, then we close with the endDelimiter. Finally, we return the text that we have created.
That's how the visitor works
- if it's a content node, it directly returns the text
- if it's a tag , it sets up the correct delimiters and then checks its children. It repeats step 2 for each child and then returns the gathered text
It's obviously a simple example, but it shows how much freedom you have in managing the visitor once you have launched it. Together with the patterns that we have seen at the beginning of this section, you can see all of the options: return null to stop the visit, call visitChildren to continue, or return something to perform an action decided at a higher level of the tree.
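The same two-step logic can be sketched with a minimal hand-rolled visitor in Python. The Tag and Content classes below are hypothetical stand-ins for the ANTLR-generated contexts, just to make the control flow explicit:

```python
# Minimal sketch of the visitor logic described above.
# Tag/Content are hypothetical stand-ins for ANTLR-generated contexts.

class Content:
    def __init__(self, text):
        self.text = text

class Tag:
    def __init__(self, name, children):
        self.name = name
        self.children = children

DELIMITERS = {"b": "**", "u": "*"}  # BBCode tag name -> Markdown delimiter

def visit(node):
    if isinstance(node, Content):   # content node: directly return the text
        return node.text
    delim = DELIMITERS.get(node.name, "")
    # tag node: set up the delimiters, then gather the text of every child
    return delim + "".join(visit(child) for child in node.children) + delim

tree = Tag("b", [Content("bold "), Tag("u", [Content("and underlined")])])
print(visit(tree))  # -> **bold *and underlined***
```

The recursion mirrors what visitTag does in Java: each nested tag manages its own delimiters and hands the gathered text back to its parent.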
27. Advanced Testing
The use of lexical modes permits the parsing of island languages, but it complicates testing.
We are not going to show MarkupErrorListener.java because we did not change it; if you need it, you can see it on the repository.
You can run the tests by using the following command.
```shell
mvn test
```

Now we are going to look at the test code. We are skipping the setup part, because that is also obvious: we just copy the process seen in the main file, and simply add our error listener to intercept the errors.
```java
// private variables inside the class AppTest
private MarkupErrorListener errorListener;
private MarkupLexer markupLexer;

public void testText() {
    MarkupParser parser = setup("anything in here");

    MarkupParser.ContentContext context = parser.content();

    assertEquals("", this.errorListener.getSymbol());
}

public void testInvalidText() {
    MarkupParser parser = setup("[anything in here");

    MarkupParser.ContentContext context = parser.content();

    assertEquals("[", this.errorListener.getSymbol());
}

public void testWrongMode() {
    MarkupParser parser = setup("author=\"john\"");

    MarkupParser.AttributeContext context = parser.attribute();
    TokenStream ts = parser.getTokenStream();

    assertEquals(MarkupLexer.DEFAULT_MODE, markupLexer._mode);
    assertEquals(MarkupLexer.TEXT, ts.get(0).getType());
    assertEquals("author=\"john\"", this.errorListener.getSymbol());
}

public void testAttribute() {
    MarkupParser parser = setup("author=\"john\"");
    // we have to manually push the correct mode
    this.markupLexer.pushMode(MarkupLexer.BBCODE);

    MarkupParser.AttributeContext context = parser.attribute();
    TokenStream ts = parser.getTokenStream();

    assertEquals(MarkupLexer.ID, ts.get(0).getType());
    assertEquals(MarkupLexer.EQUALS, ts.get(1).getType());
    assertEquals(MarkupLexer.STRING, ts.get(2).getType());

    assertEquals("", this.errorListener.getSymbol());
}

public void testInvalidAttribute() {
    MarkupParser parser = setup("author=/\"john\"");
    // we have to manually push the correct mode
    this.markupLexer.pushMode(MarkupLexer.BBCODE);

    MarkupParser.AttributeContext context = parser.attribute();

    assertEquals("/", this.errorListener.getSymbol());
}
```

The first two methods are exactly as before: we simply check that there are no errors, or that the correct one is present because the input itself is erroneous. In testWrongMode things start to get interesting: the issue is that by testing the rules one by one we don't give the parser the chance to switch automatically to the correct mode.
So it remains in DEFAULT_MODE, which in our case makes everything look like TEXT. This obviously makes the correct parsing of an attribute impossible.
The same test also shows how you can check the current mode and the exact types of the tokens found by the parser, which we use to confirm that indeed everything goes wrong in this case.
While we could use a string of text to trigger the correct mode each time, that would make each test intertwined with several pieces of code, which is a no-no. So the solution appears in testAttribute: we trigger the correct mode manually. Once you have done that, you can see that our attribute is recognized correctly.
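To see why the active mode changes everything, here is a toy moded lexer in Python (a simplification for illustration, not the ANTLR runtime), with rules loosely inspired by the MarkupLexer: the very same input yields completely different tokens depending on the mode.

```python
import re

# Toy moded lexer: each mode has its own ordered list of (name, pattern) rules.
RULES = {
    "DEFAULT_MODE": [
        ("TEXT", re.compile(r"[^\[]+")),   # everything that is not '[' is TEXT
        ("OPEN", re.compile(r"\[")),
    ],
    "BBCODE": [
        ("ID", re.compile(r"[a-zA-Z]+")),
        ("EQUALS", re.compile(r"=")),
        ("STRING", re.compile(r'"[^"]*"')),
        ("CLOSE", re.compile(r"\]")),
    ],
}

def tokenize(text, mode="DEFAULT_MODE"):
    tokens, pos = [], 0
    while pos < len(text):
        for name, pattern in RULES[mode]:
            m = pattern.match(text, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError("no rule matches at position %d" % pos)
    return tokens

# In the default mode the whole attribute is swallowed as a single TEXT token...
print(tokenize('author="john"'))
# ...while in BBCODE mode it is split into ID, EQUALS and STRING.
print(tokenize('author="john"', mode="BBCODE"))
```

This is exactly what testWrongMode and testAttribute verify against the real lexer: without pushing BBCODE, the attribute can never be tokenized correctly.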
28. Dealing with Expressions
So far we have written simple parser rules; now we are going to see one of the most challenging parts of analyzing a real (programming) language: expressions. While rules for statements are usually larger, they are quite simple to deal with: you just need to write a rule that encapsulates the structure with all the different optional parts. For instance, a for statement can include all other kinds of statements, but we can simply include them with something like statement*. An expression, instead, can be combined in many different ways.
An expression usually contains other expressions. For example, the typical binary expression is composed of an expression on the left, an operator in the middle and another expression on the right. This can lead to ambiguities. Think, for example, of the expression 5 + 3 * 2: for ANTLR this expression is ambiguous because there are two ways to parse it. It could either be parsed as 5 + (3 * 2) or as (5 + 3) * 2.
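A quick way to see that the two parses really are different is to evaluate both groupings explicitly (plain arithmetic, just to illustrate the ambiguity):

```python
# The two possible parses of 5 + 3 * 2, evaluated explicitly.
right_grouping = 5 + (3 * 2)   # multiplication first: the conventional result
left_grouping = (5 + 3) * 2    # addition first

print(right_grouping)  # -> 11
print(left_grouping)   # -> 16
```

Since the two trees produce different values, the parser cannot be left free to pick either one: the grammar has to encode which grouping is intended.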
Until this moment we have avoided the problem simply because markup constructs surround the object to which they are applied. So there is no ambiguity in choosing which one to apply first: it's the outermost one. Imagine if this expression were written as:
```
<add><int>5</int><mul><int>3</int><int>2</int></mul></add>
```

That would make it obvious to ANTLR how to parse it.
These types of rules are called left-recursive rules. You might say: just parse whatever comes first. The problem with that is semantic: the addition comes first, but we know that multiplication has precedence over addition. Traditionally, the way to solve this problem was to create a complex cascade of specific expression rules like this:
```
expression     : addition ;
addition       : multiplication ('+' multiplication)* ;
multiplication : atom ('*' atom)* ;
atom           : NUMBER ;
```

This way ANTLR would know to search first for a number, then for multiplications and finally for additions. This is cumbersome and also counterintuitive, because the last expression is the first to be actually recognized. Luckily, ANTLR4 can create a similar structure automatically, so we can use a much more natural syntax.
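The cascade maps directly onto a recursive-descent evaluator. This hand-written Python sketch (our own illustration, not ANTLR output) shows how the lowest rule, atom, is matched first, while addition binds last, which is exactly how precedence falls out of the cascade:

```python
import re

def lex(text):
    # numbers and the two operators; whitespace is implicitly skipped
    return re.findall(r"\d+|[+*]", text)

def parse_atom(tokens):
    return int(tokens.pop(0))                 # atom : NUMBER

def parse_multiplication(tokens):
    value = parse_atom(tokens)                # multiplication : atom ('*' atom)*
    while tokens and tokens[0] == "*":
        tokens.pop(0)
        value *= parse_atom(tokens)
    return value

def parse_addition(tokens):
    value = parse_multiplication(tokens)      # addition : multiplication ('+' multiplication)*
    while tokens and tokens[0] == "+":
        tokens.pop(0)
        value += parse_multiplication(tokens)
    return value

def evaluate(text):
    return parse_addition(lex(text))          # expression : addition

print(evaluate("5 + 3 * 2"))  # -> 11: multiplication correctly binds tighter
```

Because addition can only ever combine whole multiplications, the grouping 5 + (3 * 2) is the only one the cascade can produce.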
```
expression : expression '*' expression
           | expression '+' expression
           | NUMBER
           ;
```

In practice, ANTLR considers the order in which we defined the alternatives to decide the precedence. By writing the rule in this way we are telling ANTLR that multiplication takes precedence over addition.
29. Parsing Spreadsheets
Now we are prepared to create our last application, in C#. We are going to build the parser of an Excel-like application. In practice, we want to manage the expressions you write in the cells of a spreadsheet.
```
grammar Spreadsheet;

expression : '(' expression ')'                      #parenthesisExp
           | expression (ASTERISK|SLASH) expression  #mulDivExp
           | expression (PLUS|MINUS) expression      #addSubExp
           | <assoc=right> expression '^' expression #powerExp
           | NAME '(' expression ')'                 #functionExp
           | NUMBER                                  #numericAtomExp
           | ID                                      #idAtomExp
           ;

fragment LETTER : [a-zA-Z] ;
fragment DIGIT  : [0-9] ;

ASTERISK : '*' ;
SLASH    : '/' ;
PLUS     : '+' ;
MINUS    : '-' ;

ID : LETTER DIGIT ;

NAME : LETTER+ ;

NUMBER : DIGIT+ ('.' DIGIT+)? ;

WHITESPACE : ' ' -> skip;
```

With all the knowledge you have acquired so far, everything should be clear, except for possibly three things:
The parentheses come first because their only role is to give the user a way to override the precedence of operators, if they need to do so. This graphical representation of the AST should make it clear.
The things on the right are labels: they are used to make ANTLR generate specific functions for the visitor or listener. So there will be a VisitFunctionExp, a VisitPowerExp, etc. This makes it possible to avoid one giant visitor method for the expression rule.
The rule for exponentiation is different because there are two possible ways to group two sequential expressions of the same type: either execute the one on the left first and then the one on the right, or the inverse. This is called associativity. Usually the one that you want is left-associativity, which is the default option. Nonetheless, exponentiation is right-associative, so we have to signal this to ANTLR.
Another way to look at this is: if there are two expressions of the same type, which one has the precedence: the left one or the right one? Again, an image is worth a thousand words.
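Plain arithmetic is enough to see why the associativity choice matters for exponentiation (an illustration independent of the grammar above):

```python
# Right-associative grouping, what <assoc=right> produces for 2^3^2:
right = 2 ** (3 ** 2)   # 2^(3^2) = 2^9

# Left-associative grouping, the default, which is wrong for exponentiation:
left = (2 ** 3) ** 2    # (2^3)^2 = 8^2

print(right)  # -> 512
print(left)   # -> 64
```

Incidentally, Python's own ** operator is right-associative, so 2 ** 3 ** 2 also evaluates to 512: the language designers made the same choice we are encoding in the grammar.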
We also have support for functions, alphanumeric variables that represent cells, and real numbers.
30. The Spreadsheet Project in C#
You just need to follow the C# Setup: install a nuget package for the runtime and an ANTLR4 extension for Visual Studio. The extension will automatically generate everything whenever you build your project: parser, listener and/or visitor.
After you have done that, you can also add grammar files just by using the usual menu Add -> New Item. Do exactly that to create a grammar called Spreadsheet.g4 and put in it the grammar we have just created. Now let's see the main Program.cs .
```csharp
using System;
using Antlr4.Runtime;

namespace AntlrTutorial
{
    class Program
    {
        static void Main(string[] args)
        {
            string input = "log(10 + A1 * 35 + (5.4 - 7.4))";

            AntlrInputStream inputStream = new AntlrInputStream(input);
            SpreadsheetLexer spreadsheetLexer = new SpreadsheetLexer(inputStream);
            CommonTokenStream commonTokenStream = new CommonTokenStream(spreadsheetLexer);
            SpreadsheetParser spreadsheetParser = new SpreadsheetParser(commonTokenStream);

            SpreadsheetParser.ExpressionContext expressionContext = spreadsheetParser.expression();
            SpreadsheetVisitor visitor = new SpreadsheetVisitor();

            Console.WriteLine(visitor.Visit(expressionContext));
        }
    }
}
```

There is not much to say, apart from the fact that, of course, you have to pay attention to yet another slight variation in the naming of things: pay attention to the casing. For instance, AntlrInputStream in the C# program was ANTLRInputStream in the Java program.
You can also notice that, this time, we output the result of our visitor on the screen, instead of writing it to a file.
31. Excel is Doomed
We are going to take a look at our visitor for the Spreadsheet project.
```csharp
public class SpreadsheetVisitor : SpreadsheetBaseVisitor<double>
{
    private static DataRepository data = new DataRepository();

    public override double VisitNumericAtomExp(SpreadsheetParser.NumericAtomExpContext context)
    {
        return double.Parse(context.NUMBER().GetText(), System.Globalization.CultureInfo.InvariantCulture);
    }

    public override double VisitIdAtomExp(SpreadsheetParser.IdAtomExpContext context)
    {
        String id = context.ID().GetText();

        return data[id];
    }

    public override double VisitParenthesisExp(SpreadsheetParser.ParenthesisExpContext context)
    {
        return Visit(context.expression());
    }

    public override double VisitMulDivExp(SpreadsheetParser.MulDivExpContext context)
    {
        double left = Visit(context.expression(0));
        double right = Visit(context.expression(1));
        double result = 0;

        if (context.ASTERISK() != null)
            result = left * right;
        if (context.SLASH() != null)
            result = left / right;

        return result;
    }

    [..]

    public override double VisitFunctionExp(SpreadsheetParser.FunctionExpContext context)
    {
        String name = context.NAME().GetText();
        double result = 0;

        switch(name)
        {
            case "sqrt":
                result = Math.Sqrt(Visit(context.expression()));
                break;
            case "log":
                result = Math.Log10(Visit(context.expression()));
                break;
        }

        return result;
    }
}
```

VisitNumericAtomExp and VisitIdAtomExp return the actual numbers that are represented either by the literal number or by the variable. In a real scenario DataRepository would contain methods to access the data in the proper cell, but in our example it is just a Dictionary with some keys and numbers. The other methods all work in the same way: they visit the contained expression(s). The only difference is what they do with the results.
Some perform an operation on the result, the binary operations combine two results in the proper way and finally VisitParenthesisExp just reports the result higher on the chain. Math is simple, when it's done by a computer.
32. Testing Everything
Up until now we have only tested the parser rules, that is to say, we have tested only whether we have created the correct rules to parse our input. Now we are also going to test the visitor functions. This is the ideal case because our visitor returns values that we can check individually. In other cases, for instance if your visitor prints something to the screen, you may want to rewrite the visitor to write to a stream. Then, at testing time, you can easily capture the output.
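The stream trick mentioned above is easy to apply in any language. A minimal Python sketch (the render function is hypothetical, not part of the project's code):

```python
import io
import sys

def render(message, out=sys.stdout):
    # hypothetical visitor-like function that writes instead of returning a value
    out.write("<p>" + message + "</p>")

# In production the function writes to stdout; in a test we pass an
# in-memory buffer and assert on its content instead.
buffer = io.StringIO()
render("hello", out=buffer)
print(buffer.getvalue())  # -> <p>hello</p>
```

The same idea applies to the Java listener from the chat project, which already takes an output stream in its constructor precisely so tests can hand it a StringWriter-like object.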
We are not going to show SpreadsheetErrorListener.cs because it's the same as the previous one we have already seen; if you need it you can see it on the repository.
To perform unit testing on Visual Studio you need to create a specific project inside the solution. You can choose different formats, we opt for the xUnit version. To run them there is an aptly named section “TEST” on the menu bar.
```csharp
[Fact]
public void testExpressionPow()
{
    setup("5^3^2");

    PowerExpContext context = parser.expression() as PowerExpContext;

    CommonTokenStream ts = (CommonTokenStream)parser.InputStream;

    Assert.Equal(SpreadsheetLexer.NUMBER, ts.Get(0).Type);
    Assert.Equal(SpreadsheetLexer.T__2, ts.Get(1).Type);
    Assert.Equal(SpreadsheetLexer.NUMBER, ts.Get(2).Type);
    Assert.Equal(SpreadsheetLexer.T__2, ts.Get(3).Type);
    Assert.Equal(SpreadsheetLexer.NUMBER, ts.Get(4).Type);
}

[Fact]
public void testVisitPowerExp()
{
    setup("4^3^2");

    PowerExpContext context = parser.expression() as PowerExpContext;

    SpreadsheetVisitor visitor = new SpreadsheetVisitor();
    double result = visitor.VisitPowerExp(context);

    Assert.Equal(double.Parse("262144"), result);
}

[..]

[Fact]
public void testWrongVisitFunctionExp()
{
    setup("logga(100)");

    FunctionExpContext context = parser.expression() as FunctionExpContext;

    SpreadsheetVisitor visitor = new SpreadsheetVisitor();
    double result = visitor.VisitFunctionExp(context);

    CommonTokenStream ts = (CommonTokenStream)parser.InputStream;

    Assert.Equal(SpreadsheetLexer.NAME, ts.Get(0).Type);
    Assert.Equal(null, errorListener.Symbol);
    Assert.Equal(0, result);
}

[Fact]
public void testCompleteExp()
{
    setup("log(5+6*7/8)");

    ExpressionContext context = parser.expression();

    SpreadsheetVisitor visitor = new SpreadsheetVisitor();
    double result = visitor.Visit(context);

    Assert.Equal("1.01072386539177", result.ToString(System.Globalization.CultureInfo.GetCultureInfo("en-US").NumberFormat));
}
```

The first test function is similar to the ones we have already seen; it checks that the correct tokens are selected. In testExpressionPow you may be surprised to see the weird T__2 token type: this happens because we didn't explicitly create a token for the '^' symbol, so one was automatically created for us. If you need to, you can see all the tokens by looking at the *.tokens file generated by ANTLR.
In testVisitPowerExp we visit our test node, get the result and then check it. It's all very simple because our visitor is simple, and while unit testing should always be easy and made up of small parts, it really can't be easier than this.
The only thing to pay attention to is the format of the number. It's not a problem in most of these tests, but look at testCompleteExp, where we test the result of a whole expression. There we need to make sure that the correct format is selected, because different countries use different symbols as the decimal mark.
There are some things that depend on the cultural context
If your computer were already set to the American English culture this wouldn't be necessary, but to guarantee the correct testing results for everybody we have to specify it. Keep that in mind if you are testing things that are culture-dependent, such as grouping of digits, temperatures, etc.
In testWrongVisitFunctionExp you can see that when we check for the wrong function the parser actually works. That's because "logga" is indeed syntactically valid as a function name, but it's not semantically correct. The function "logga" doesn't exist, so our program doesn't know what to do with it. So when we visit it we get 0 as a result. As you recall, this was our choice: since we initialize the result to 0 and we don't have a default case in VisitFunctionExp, if there is no matching function the result remains 0. A possible alternative would be to throw an exception.
Final Remarks
In this section we present tips and tricks that never came up in our examples, but can be useful in your programs. We also suggest more resources you may find useful if you want to know more about ANTLR, both the practice and the theory, or if you need to deal with the most complex problems.
33. Tips and Tricks
Let's see a few tricks that could be useful from time to time. These were never needed in our examples, but they have been quite useful in other scenarios.
Catchall Rule
The first one is the ANY lexer rule. This is simply a rule in the following format.
```
ANY : . ;
```

This is a catchall rule that should be put at the end of your grammar. It matches any character that wasn't matched by any other lexer rule. So creating this rule can help you during development, when your grammar still has many holes that could cause distracting error messages. It's even useful in production, where it acts as a canary in a coal mine: if it shows up in your program, you know that something is wrong.
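As a sketch of where it goes, here is a hypothetical lexer grammar fragment (the other rules are just examples): the catchall must come last, because ANTLR gives priority to the rules defined first.

```antlr
lexer grammar ExampleLexer;

NUMBER     : [0-9]+ ('.' [0-9]+)? ;
WHITESPACE : [ \t\r\n]+ -> skip ;
// catchall: must be the last rule, so it only matches
// characters that no other rule matched
ANY        : . ;
```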
Channels
There is also something that we haven't talked about: channels . Their typical use case is handling comments. You don't really want to check for comments inside every one of your statements or expressions, so you usually throw them away with -> skip . But there are some cases where you may want to preserve them, for instance if you are translating a program into another language. When this happens you use channels . There is already one called HIDDEN that you can use, but you can declare more of them at the top of your lexer grammar.
```
channels { UNIQUENAME }
// and you use them this way
COMMENTS : '//' ~[\r\n]+ -> channel(UNIQUENAME) ;
```

Rule Element Labels
There is another use of labels, other than distinguishing among different cases of the same rule: they can be used to give a specific name, usually but not always of semantic value, to a common rule or to parts of a rule. The format is label=rule , to be used inside another rule.
```
expression : left=expression (ASTERISK|SLASH) right=expression ;
```

This way left and right become fields in the ExpressionContext nodes, and instead of using context.expression(0) you can refer to the same entity using context.left .
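Element labels combine naturally with the alternative labels we used throughout the tutorial. A hypothetical grammar fragment (the token names ASTERISK, SLASH and NUMBER are assumptions) could look like this:

```antlr
expression : left=expression ASTERISK right=expression # MultiplyExp
           | left=expression SLASH    right=expression # DivideExp
           | NUMBER                                    # NumberExp
           ;
```

With this rule, the generated MultiplyExpContext and DivideExpContext classes each expose left and right fields, so a visitor can read context.left and context.right instead of context.expression(0) and context.expression(1).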
Problematic Tokens
In many real languages some symbols are reused in different ways, some of which may lead to ambiguities. A common problematic example is the angle bracket, used both for bitshift expressions and to delimit parameterized types.
```csharp
// bitshift expression, it assigns to x the value of y shifted by three bits
x = y >> 3;
// parameterized types, it defines x as a list of dictionaries
List<Dictionary<string, int>> x;
```

The natural way of defining the bitshift operator token is as a single token for the double angle brackets, '>>'. But this might lead to confusing the end of a nested parameterized definition with the bitshift operator, as in the second example shown here. While a simple way of solving the problem would be using semantic predicates, an excessive number of them would slow down the parsing phase. The solution is to avoid defining the bitshift operator token and instead use the angle bracket twice in the parser rule, so that the parser itself can choose the best candidate for every occasion.
```
// from this
RIGHT_SHIFT : '>>' ;
expression  : ID RIGHT_SHIFT NUMBER ;
// to this
expression  : ID '>' '>' NUMBER ;
```

34. Conclusions
We have learned a lot today:
- what a lexer and a parser are
- how to create lexer and parser rules
- how to use ANTLR to generate parsers in Java, C#, Python and JavaScript
- the fundamental kinds of problems you will encounter while parsing, and how to solve them
- how to understand errors
- how to test your parsers
That's all you need to know to use ANTLR on your own. And I mean it literally: you may want to know more, but now you have a solid basis to explore on your own.
Where to look if you need more information about ANTLR:
- On this very website there is a whole category dedicated to ANTLR .
- The official ANTLR website is a good starting point to learn the general status of the project, the specialized development tools and related projects, like StringTemplate .
- The ANTLR documentation on GitHub ; especially useful are the information on targets and how to set it up for different languages .
- The ANTLR 4.6 API ; it's related to the Java version, so there might be some differences in other languages, but it's the best place to settle your doubts about the inner workings of this tool.
- For those interested in the science behind ANTLR4, there is an academic paper: Adaptive LL(*) Parsing: The Power of Dynamic Analysis
- The Definitive ANTLR 4 Reference , by the man himself, Terence Parr , the creator of ANTLR. The resource you need if you want to know everything about ANTLR and a good deal about parsing languages in general.
The book is also the only place where you can find answers to questions like these:

ANTLR v4 is the result of a minor detour (twenty-five years) I took in graduate school. I guess I'm going to have to change my motto slightly.

Why program by hand in five days what you can spend twenty-five years of your life automating?
We worked quite hard to build the largest tutorial on ANTLR: the mega-tutorial! A post over 13,000 words long, or more than 30 pages, trying to answer all your questions about ANTLR. Missing something? Contact us and let us know, we are here to help.
Translated from: https://www.javacodegeeks.com/2017/03/antlr-mega-tutorial.html