當(dāng)前位置：首頁(yè) > 编程资源 > 编程问答 >内容正文

编程问答

Go语言的分词器（sego）

發(fā)布時(shí)間：2024/4/11 编程问答 27 豆豆

生活随笔收集整理的這篇文章主要介紹了 Go语言的分词器（sego）小編覺(jué)得挺不錯(cuò)的,現(xiàn)在分享給大家,幫大家做個(gè)參考.

今天，主要來(lái)介紹一個(gè)Go語(yǔ)言的中文分詞器，即sego。本分詞器是由陳輝寫(xiě)的，他的微博在這里，github詳

見(jiàn)此處。由于之前他在Google，所以對(duì)Go語(yǔ)言特別熟悉。sego的介紹如下

?? sego是Go語(yǔ)言的中文分詞器，詞典用前綴樹(shù)實(shí)現(xiàn)，分詞器算法為基于詞頻的最短路徑加動(dòng)態(tài)規(guī)劃。

?? 支持普通和搜索引擎兩種分詞模式，支持用戶詞典、詞性標(biāo)注，可運(yùn)行JSON RPC服務(wù)。

?? 分詞速度單線程2.7MB/s，goroutines并發(fā)13MB/s, 處理器Core i7-3615QM 2.30GHz 8核。

接下來(lái)，以如下幾個(gè)方面來(lái)介紹sego

?? 1. sego的安裝

?? 2. sego的原理

?? 3. sego的使用

1. sego的安裝

???首先，在Go語(yǔ)言中，有很多第三方包，可以幫助我們實(shí)現(xiàn)某些特定的功能。比如這里。而sego的項(xiàng)目在這里。

?? 先把工程根目錄加入GOPATH，然后執(zhí)行如下命令

????

???然后在sego項(xiàng)目根目錄下就得到了src和pkg，這是一個(gè)Go語(yǔ)言項(xiàng)目，可以在src中創(chuàng)建文件進(jìn)行使用。

?? sego的源文件如下

???

???其中詞庫(kù)在data目錄下，至此，sego就可以直接使用了。接下來(lái)會(huì)介紹sego的原理。

2. sego的原理

?? 之前，我用過(guò)中科院的分詞器ICTCLAS2014，它是一個(gè)非常優(yōu)秀的分詞器，支持C++，Java，C#，Hadoop等

?? 等，使用起來(lái)非常方便，分詞效果也不錯(cuò)，但是它不是開(kāi)源的。今天介紹的sego分詞器是開(kāi)源Go語(yǔ)言分詞器，

??? 它的原理是：詞典用前綴樹(shù)實(shí)現(xiàn)，而分詞器算法為基于詞頻的最短路徑加動(dòng)態(tài)規(guī)劃。具體參考這兩部分的代碼

???詞典：dictionary.go

package segoimport ("bytes" )// Dictionary結(jié)構(gòu)體實(shí)現(xiàn)了一個(gè)字串前綴樹(shù)，一個(gè)分詞可能出現(xiàn)在葉子節(jié)點(diǎn)也有可能出現(xiàn)在非葉節(jié)點(diǎn) type Dictionary struct {root node // 根節(jié)點(diǎn)maxTokenLength int // 詞典中最長(zhǎng)的分詞numTokens int // 詞典中分詞數(shù)目tokens []*Token // 詞典中所有的分詞，方便遍歷totalFrequency int64 // 詞典中所有分詞的頻率之和 }// 前綴樹(shù)節(jié)點(diǎn) type node struct {word Text // 該節(jié)點(diǎn)對(duì)應(yīng)的字元token *Token // 當(dāng)此節(jié)點(diǎn)沒(méi)有對(duì)應(yīng)的分詞時(shí)值為nilchildren []*node // 該字元后繼的所有可能字元，當(dāng)為葉子節(jié)點(diǎn)時(shí)為空 }// 詞典中最長(zhǎng)的分詞 func (dict *Dictionary) MaxTokenLength() int {return dict.maxTokenLength }// 詞典中分詞數(shù)目 func (dict *Dictionary) NumTokens() int {return dict.numTokens }// 詞典中所有分詞的頻率之和 func (dict *Dictionary) TotalFrequency() int64 {return dict.totalFrequency }// 向詞典中加入一個(gè)分詞 func (dict *Dictionary) addToken(token *Token) {current := &dict.rootfor _, word := range token.text {// 一邊向深處移動(dòng)一邊添加節(jié)點(diǎn)（如果需要的話）current = upsert(¤t.children, word)}// 當(dāng)這個(gè)分詞不存在詞典中時(shí)添加此分詞，否則忽略if current.token == nil {current.token = tokenif len(token.text) > dict.maxTokenLength {dict.maxTokenLength = len(token.text)}dict.numTokens++dict.tokens = append(dict.tokens, token)dict.totalFrequency += int64(token.frequency)} }// 在詞典中查找和字元組words可以前綴匹配的所有分詞 // 返回值為找到的分詞數(shù) func (dict *Dictionary) lookupTokens(words []Text, tokens []*Token) int {// 特殊情況if len(words) == 0 {return 0}current := &dict.rootnumTokens := 0for _, word := range words {// 如果已經(jīng)抵達(dá)葉子節(jié)點(diǎn)則不再繼續(xù)尋找if len(current.children) == 0 {break}// 否則在該節(jié)點(diǎn)子節(jié)點(diǎn)中進(jìn)行下個(gè)字元的匹配index, found := binarySearch(current.children, word)if !found {break}// 匹配成功，則跳入匹配的子節(jié)點(diǎn)中current = current.children[index]if current.token != nil {tokens[numTokens] = current.tokennumTokens++}}return numTokens }// 二分法查找字元在子節(jié)點(diǎn)中的位置 // 如果查找成功，第一個(gè)返回參數(shù)為找到的位置，第二個(gè)返回參數(shù)為true // 如果查找失敗，第一個(gè)返回參數(shù)為應(yīng)當(dāng)插入的位置，第二個(gè)返回參數(shù)false func binarySearch(nodes []*node, word Text) (int, bool) {start := 0end := len(nodes) - 1// 特例：if len(nodes) == 0 {// 當(dāng)slice為空時(shí)，插入第一位置return 0, false}compareWithFirstWord := bytes.Compare(word, nodes[0].word)if compareWithFirstWord < 0 {// 當(dāng)要查找的元素小于首元素時(shí)，插入第一位置return 0, false} else if compareWithFirstWord == 0 {// 當(dāng)首元素等于node時(shí)return 0, true}compareWithLastWord := bytes.Compare(word, nodes[end].word)if compareWithLastWord == 0 {// 當(dāng)尾元素等于node時(shí)return end, true} else if compareWithLastWord > 0 {// 當(dāng)尾元素小于node時(shí)return end + 1, false}// 二分current := end / 2for end-start > 1 {compareWithCurrentWord := bytes.Compare(word, nodes[current].word)if compareWithCurrentWord == 0 {return current, true} else if compareWithCurrentWord < 0 {end = currentcurrent = (start + current) / 2} else {start = currentcurrent = (current + end) / 2}}return end, false }// 將字元加入節(jié)點(diǎn)數(shù)組中，并返回插入的節(jié)點(diǎn)指針 // 如果字元已經(jīng)存在則返回存在的節(jié)點(diǎn)指針 func upsert(nodes *[]*node, word Text) *node {index, found := binarySearch(*nodes, word)if found {return (*nodes)[index]}*nodes = append(*nodes, nil)copy((*nodes)[index+1:], (*nodes)[index:])(*nodes)[index] = &node{word: word}return (*nodes)[index] }

分詞器算法：segmenter.go

//Go中文分詞 package segoimport ("bufio""fmt""log""math""os""strconv""strings""unicode""unicode/utf8" )const (minTokenFrequency = 2 // 僅從字典文件中讀取大于等于此頻率的分詞 )// 分詞器結(jié)構(gòu)體 type Segmenter struct {dict *Dictionary }// 該結(jié)構(gòu)體用于記錄Viterbi算法中某字元處的向前分詞跳轉(zhuǎn)信息 type jumper struct {minDistance float32token *Token }// 返回分詞器使用的詞典 func (seg *Segmenter) Dictionary() *Dictionary {return seg.dict }// 從文件中載入詞典 // // 可以載入多個(gè)詞典文件，文件名用","分隔，排在前面的詞典優(yōu)先載入分詞，比如 // "用戶詞典.txt,通用詞典.txt" // 當(dāng)一個(gè)分詞既出現(xiàn)在用戶詞典也出現(xiàn)在通用詞典中，則優(yōu)先使用用戶詞典。 // // 詞典的格式為（每個(gè)分詞一行）： // 分詞文本頻率詞性 func (seg *Segmenter) LoadDictionary(files string) {seg.dict = new(Dictionary)for _, file := range strings.Split(files, ",") {log.Printf("載入sego詞典 %s", file)dictFile, err := os.Open(file)defer dictFile.Close()if err != nil {log.Fatalf("無(wú)法載入字典文件 \"%s\" \n", file)}reader := bufio.NewReader(dictFile)var text stringvar freqText stringvar frequency intvar pos string// 逐行讀入分詞for {size, _ := fmt.Fscanln(reader, &text, &freqText, &pos)if size == 0 {// 文件結(jié)束break} else if size < 2 {// 無(wú)效行continue} else if size == 2 {// 沒(méi)有詞性標(biāo)注時(shí)設(shè)為空字符串pos = ""}// 解析詞頻var err errorfrequency, err = strconv.Atoi(freqText)if err != nil {continue}// 過(guò)濾頻率太小的詞if frequency < minTokenFrequency {continue}// 將分詞添加到字典中words := splitTextToWords([]byte(text))token := Token{text: words, frequency: frequency, pos: pos}seg.dict.addToken(&token)}}// 計(jì)算每個(gè)分詞的路徑值，路徑值含義見(jiàn)Token結(jié)構(gòu)體的注釋logTotalFrequency := float32(math.Log2(float64(seg.dict.totalFrequency)))for _, token := range seg.dict.tokens {token.distance = logTotalFrequency - float32(math.Log2(float64(token.frequency)))}// 對(duì)每個(gè)分詞進(jìn)行細(xì)致劃分，用于搜索引擎模式，該模式用法見(jiàn)Token結(jié)構(gòu)體的注釋。for _, token := range seg.dict.tokens {segments := seg.segmentWords(token.text, true)// 計(jì)算需要添加的子分詞數(shù)目numTokensToAdd := 0for iToken := 0; iToken < len(segments); iToken++ {if len(segments[iToken].token.text) > 1 {// 略去字元長(zhǎng)度為一的分詞// TODO: 這值得進(jìn)一步推敲，特別是當(dāng)字典中有英文復(fù)合詞的時(shí)候numTokensToAdd++}}token.segments = make([]*Segment, numTokensToAdd)// 添加子分詞iSegmentsToAdd := 0for iToken := 0; iToken < len(segments); iToken++ {if len(segments[iToken].token.text) > 1 {token.segments[iSegmentsToAdd] = &segments[iSegmentsToAdd]iSegmentsToAdd++}}}log.Println("sego詞典載入完畢") }// 對(duì)文本分詞 // // 輸入?yún)?shù)： // bytes UTF8文本的字節(jié)數(shù)組 // // 輸出： // []Segment 劃分的分詞 func (seg *Segmenter) Segment(bytes []byte) []Segment {return seg.internalSegment(bytes, false) }func (seg *Segmenter) internalSegment(bytes []byte, searchMode bool) []Segment {// 處理特殊情況if len(bytes) == 0 {return []Segment{}}// 劃分字元text := splitTextToWords(bytes)return seg.segmentWords(text, searchMode) }func (seg *Segmenter) segmentWords(text []Text, searchMode bool) []Segment {// 搜索模式下該分詞已無(wú)繼續(xù)劃分可能的情況if searchMode && len(text) == 1 {return []Segment{}}// jumpers定義了每個(gè)字元處的向前跳轉(zhuǎn)信息，包括這個(gè)跳轉(zhuǎn)對(duì)應(yīng)的分詞，// 以及從文本段開(kāi)始到該字元的最短路徑值jumpers := make([]jumper, len(text))tokens := make([]*Token, seg.dict.maxTokenLength)for current := 0; current < len(text); current++ {// 找到前一個(gè)字元處的最短路徑，以便計(jì)算后續(xù)路徑值var baseDistance float32if current == 0 {// 當(dāng)本字元在文本首部時(shí)，基礎(chǔ)距離應(yīng)該是零baseDistance = 0} else {baseDistance = jumpers[current-1].minDistance}// 尋找所有以當(dāng)前字元開(kāi)頭的分詞numTokens := seg.dict.lookupTokens(text[current:minInt(current+seg.dict.maxTokenLength, len(text))], tokens)// 對(duì)所有可能的分詞，更新分詞結(jié)束字元處的跳轉(zhuǎn)信息for iToken := 0; iToken < numTokens; iToken++ {location := current + len(tokens[iToken].text) - 1if !searchMode || current != 0 || location != len(text)-1 {updateJumper(&jumpers[location], baseDistance, tokens[iToken])}}// 當(dāng)前字元沒(méi)有對(duì)應(yīng)分詞時(shí)補(bǔ)加一個(gè)偽分詞if numTokens == 0 || len(tokens[0].text) > 1 {updateJumper(&jumpers[current], baseDistance,&Token{text: []Text{text[current]}, frequency: 1, distance: 32, pos: "x"})}}// 從后向前掃描第一遍得到需要添加的分詞數(shù)目numSeg := 0for index := len(text) - 1; index >= 0; {location := index - len(jumpers[index].token.text) + 1numSeg++index = location - 1}// 從后向前掃描第二遍添加分詞到最終結(jié)果outputSegments := make([]Segment, numSeg)for index := len(text) - 1; index >= 0; {location := index - len(jumpers[index].token.text) + 1numSeg--outputSegments[numSeg].token = jumpers[index].tokenindex = location - 1}// 計(jì)算各個(gè)分詞的字節(jié)位置bytePosition := 0for iSeg := 0; iSeg < len(outputSegments); iSeg++ {outputSegments[iSeg].start = bytePositionbytePosition += textSliceByteLength(outputSegments[iSeg].token.text)outputSegments[iSeg].end = bytePosition}return outputSegments }// 更新跳轉(zhuǎn)信息: // 1. 當(dāng)該位置從未被訪問(wèn)過(guò)時(shí)(jumper.minDistance為零的情況)，或者 // 2. 當(dāng)該位置的當(dāng)前最短路徑大于新的最短路徑時(shí) // 將當(dāng)前位置的最短路徑值更新為baseDistance加上新分詞的概率 func updateJumper(jumper *jumper, baseDistance float32, token *Token) {newDistance := baseDistance + token.distanceif jumper.minDistance == 0 || jumper.minDistance > newDistance {jumper.minDistance = newDistancejumper.token = token} }// 取兩整數(shù)較小值 func minInt(a, b int) int {if a > b {return b}return a }// 取兩整數(shù)較大值 func maxInt(a, b int) int {if a > b {return a}return b }// 將文本劃分成字元 func splitTextToWords(text Text) []Text {output := make([]Text, len(text))current := 0currentWord := 0inAlphanumeric := truealphanumericStart := 0for current < len(text) {r, size := utf8.DecodeRune(text[current:])if size <= 2 && (unicode.IsLetter(r) || unicode.IsNumber(r)) {// 當(dāng)前是拉丁字母或數(shù)字（非中日韓文字）if !inAlphanumeric {alphanumericStart = currentinAlphanumeric = true}} else {if inAlphanumeric {inAlphanumeric = falseif current != 0 {output[currentWord] = toLower(text[alphanumericStart:current])currentWord++}}output[currentWord] = text[current : current+size]currentWord++}current += size}// 處理最后一個(gè)字元是英文的情況if inAlphanumeric {if current != 0 {output[currentWord] = toLower(text[alphanumericStart:current])currentWord++}}return output[:currentWord] }// 將英文詞轉(zhuǎn)化為小寫(xiě) func toLower(text []byte) []byte {output := make([]byte, len(text))for i, t := range text {if t >= 'A' && t <= 'Z' {output[i] = t - 'A' + 'a'} else {output[i] = t}}return output }

3. sego的使用

?? 接下來(lái)介紹sego的使用。首先介紹網(wǎng)頁(yè)版的分詞器，在源文件的server目錄下面，運(yùn)行server.go，如下

????

???server.go是一個(gè)輕量級(jí)的web服務(wù)器，端口為8080，在瀏覽器打開(kāi)后如下圖

????

接下來(lái)，對(duì)如下文章進(jìn)行分詞

近年結(jié)識(shí)了一位警察朋友，好槍法。不單單在射擊場(chǎng)上百發(fā)百中，更在解救人質(zhì)的現(xiàn)場(chǎng)，次次百步穿楊。當(dāng)然了，這個(gè)“楊”不是楊樹(shù)的楊，而是匪徒的代稱。我向他請(qǐng)教射擊的要領(lǐng)。他說(shuō)，很簡(jiǎn)單，就是極端的平靜。我說(shuō)這個(gè)要領(lǐng)所有打槍的人都知道，可是做不到。他說(shuō)，記住，你要像煙灰一樣松散。只有放松，全部潛在的能量才會(huì)釋放出來(lái)，協(xié)同你達(dá)到完美。他的話我似懂非懂，但從此我開(kāi)始注意以前忽略了的煙灰。煙灰，尤其是那些優(yōu)質(zhì)香煙燃燒后的煙灰，非常松散，幾乎沒(méi)有重量和形狀，真一個(gè)大象無(wú)形。它們懶洋洋地趴在那里，好像在冬眠。其實(shí)，在煙灰的內(nèi)部，棲息著高度警覺(jué)和機(jī)敏的鳥(niǎo)群，任何一陣微風(fēng)掠過(guò)，哪怕只是極輕微的嘆息，它們都會(huì)不失時(shí)機(jī)地騰空而起馭風(fēng)而行。它們的力量來(lái)自放松，來(lái)自一種飄揚(yáng)的本能。松散的反面是緊張。幾乎每個(gè)人都有過(guò)由于緊張而慘敗的經(jīng)歷。比如，考試的時(shí)候，全身肌肉僵直，心跳得好像無(wú)數(shù)個(gè)小炸彈在身體的深淺部位依次爆破。手指發(fā)抖頭冒虛汗，原本記得滾瓜爛熟的知識(shí)，改頭換面潛藏起來(lái)，原本涇渭分明的答案變得似是而非，泥鰍一樣滑走……面試的時(shí)候，要么扭扭捏捏不夠大方，無(wú)法表現(xiàn)自己的真實(shí)實(shí)力，要么口若懸河躁動(dòng)不安，拿捏不準(zhǔn)問(wèn)題的實(shí)質(zhì)，只得用不停的述說(shuō)掩飾自己的緊張，適得其反……相信每個(gè)人都儲(chǔ)存了一大堆這類不堪回首的往事。在最危急的時(shí)刻能保持極端的放松，不是一種技術(shù)，而是一種修養(yǎng)，是一種長(zhǎng)期潛移默化修煉提升的結(jié)果。我們常說(shuō)，某人勝就勝在心理上，或是說(shuō)某人敗就敗在心理上。這其中的差池不是指在理性上，而是這種心靈張弛的韌性上。沒(méi)事的時(shí)候看看煙灰吧。他們?cè)?jīng)是火焰，燃燒過(guò)，沸騰過(guò)，但它們此刻安靜了。它們毫不張揚(yáng)地聚精會(huì)神地等待著下一次的乘風(fēng)而起，攜帶著全部的能量，抵達(dá)陽(yáng)光能到的任何地方。

那么代碼如下

package mainimport ("os""fmt""github.com/huichen/sego" )func main() {// 載入詞典var segmenter sego.Segmentersegmenter.LoadDictionary("../src/github.com/huichen/sego/data/dictionary.txt")//讀取文件內(nèi)容到buf中filename := "../src/test/Text"fp, err := os.Open(filename)defer fp.Close()if err != nil {fmt.Println(filename, err)return}buf := make([]byte, 4096)n, _ := fp.Read(buf)if n == 0 {return}// 分詞segments := segmenter.Segment(buf)// 處理分詞結(jié)果// 支持普通模式和搜索模式兩種分詞，見(jiàn)utils.go代碼中SegmentsToString函數(shù)的注釋。// 如果需要詞性標(biāo)注，用SegmentsToString(segments, false)，更多參考utils.go文件output := sego.SegmentsToSlice(segments, false)//輸出分詞后的成語(yǔ)for _, v := range output {if len(v) == 12 {fmt.Println(v)}} }

運(yùn)行結(jié)果如下

整個(gè)工程的結(jié)構(gòu)如下

??????

總結(jié)

以上是生活随笔為你收集整理的Go语言的分词器（sego）的全部?jī)?nèi)容，希望文章能夠幫你解決所遇到的問(wèn)題。

如果覺(jué)得生活随笔網(wǎng)站內(nèi)容還不錯(cuò)，歡迎將生活随笔推薦給好友。

上一篇： SVD原理及其应用导论
下一篇： Logistic回归与梯度下降法