Building an Autocomplete System with N-grams


An n-gram language model computes the probability of a sentence; informally, it judges how likely it is that the sentence is natural language. The n refers to splitting the sentence into groups of n words.

How do we compute the probability of a sentence? Using conditional probability and the chain rule:

P(B|A) = P(A,B) / P(A)  ⟹  P(A,B) = P(A) P(B|A)

So P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C).

If the sentence is long, this expression becomes very long and the computation gets complicated. To simplify it, we introduce the Markov assumption: the probability of any word depends only on a limited number of the words that precede it.

If, in the example above, the probability of each word depends only on the single preceding word, then

P(A,B,C,D) ≈ P(A) P(B|A) P(C|B) P(D|C)

Written as a formula: if the probability of a word depends only on the preceding k words (the n in n-gram), the probability of a sentence w_1 … w_T is computed as

P(w_1, …, w_T) ≈ ∏_{t=1}^{T} P(w_t | w_{t-k}, …, w_{t-1})
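To make this concrete, here is a small sketch of a bigram (n = 2) model scoring a sentence by multiplying estimated conditional probabilities. The counts below are invented purely for illustration:

import math

# Toy counts (made up for this illustration, not taken from real data)
bigram_counts = {("<s>", "i"): 3, ("i", "like"): 2, ("like", "cats"): 1}
unigram_counts = {("<s>",): 5, ("i",): 4, ("like",): 2}

# P(sentence) ≈ P(i|<s>) * P(like|i) * P(cats|like)
sentence = ["<s>", "i", "like", "cats"]
p = 1.0
for prev, word in zip(sentence, sentence[1:]):
    # each factor is estimated from counts as C(prev, word) / C(prev)
    p *= bigram_counts.get((prev, word), 0) / unigram_counts[(prev,)]

print(p)  # 3/5 * 2/4 * 1/2 = 0.15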

What follows is a translation; the original notebook is:

https://github.com/tsuirak/deeplearning.ai/blob/master/Natural%20Language%20Processing/Course%202%20-%20Probabilistic%20Models/Labs/Week%203/C2-W3-assginment-Auto%20Complete.ipynb

In this article, you will build an autocomplete system. Autocomplete systems are something you see every day:

  • When you search on Google, you often get suggestions that help you complete your search.
  • When you write an email, you get suggestions for the words that finish your sentence.

By the end of this assignment, you will have built a prototype of such a system.

Outline

1. Load and process the data

1.1 Load the data

1.2 Process the data

2. Develop an n-gram based language model

3. Perplexity

4. Build an autocomplete system

A key building block of an autocomplete system is a language model. A language model assigns a probability to a sequence of words; in other words, the most likely sentences receive higher scores.

"I have a pen" should receive a higher probability than "I am a pen", because it is the more natural sentence in practice.

You can use this probability calculation to develop an autocomplete system. For example, if the user types:

"I eat scrambled", you can find a word x such that "I eat scrambled x" has the highest probability. If x = "eggs", the sentence becomes "I eat scrambled eggs".

Many kinds of language models have been developed. In this assignment we use N-grams, a simple but powerful approach to language modeling.

  • N-grams are also used in machine translation and speech recognition.

Here are the steps of this assignment:

  1. Load and preprocess the data
    • Load the data, then tokenize it.
    • Split the sentences into a training set and a test set.
    • Replace low-frequency words with <unk>.

  2. Develop an n-gram based language model
    • Count the n-grams in a given dataset.
    • Estimate the conditional probability of the next word using k-smoothing.

  3. Evaluate the n-gram model by computing its perplexity score.

  4. Use your model to predict the next word of a sentence.

First, import the required libraries:

import math
import random
import numpy as np
import nltk
import pandas as pd

nltk.data.path.append('.')

Part 1: Load and preprocess the data

Part 1.1: Load the data

You will use Twitter data. Run the code below to load it and look at the first few characters.

Note that the data is one long string containing many, many tweets, separated by newline characters "\n".

with open("en_US.twitter.txt", "r") as f:
    data = f.read()

print("Data type:", type(data))
print("Number of letters:", len(data))
print("First 300 letters of the data")
print("-------")
display(data[0:300])
print("-------")

print("Last 300 letters of the data")
print("-------")
display(data[-300:])
print("-------")

Output:

Data type: <class 'str'>
Number of letters: 3335477
First 300 letters of the data
-------

"How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.\nWhen you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.\nthey've decided its more fun if I don't.\nSo Tired D; Played Lazer Tag & Ran A "

-------
Last 300 letters of the data
-------

"ust had one a few weeks back....hopefully we will be back soon! wish you the best yo\nColombia is with an 'o'...“: We now ship to 4 countries in South America (fist pump). Please welcome Columbia to the Stunner Family”\n#GutsiestMovesYouCanMake Giving a cat a bath.\nCoffee after 5 was a TERRIBLE idea.\n"

-------

Part 1.2 Preprocess the data

Preprocess the data with the following steps:

  • Split the data into sentences using "\n" as the delimiter.
  • Tokenize the sentences. Note that in this article, "token" and "word" are used to mean the same thing.
  • Split the sentences into a training set and a test set.
  • In the training set, find the words that appear at least N times.
  • Replace words that appear fewer than N times with <unk>.
  • Note: we omit a validation set in this exercise.
    • In a real application, we would hold out part of the data as a validation set and use it to tune the training.
    • For simplicity, we skip that step here.

    Exercise 01

Split the data into sentences.

def split_to_sentences(data):
    """Split data by linebreak "\n"

    Args:
        data: str

    Returns:
        A list of sentences
    """
    sentences = data.split('\n')

    # Additional cleaning (this part is already implemented)
    # - Remove leading and trailing spaces from each sentence
    # - Drop sentences if they are empty strings.
    sentences = [s.strip() for s in sentences]
    sentences = [s for s in sentences if len(s) > 0]

    return sentences
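For example, a quick check of the function above (the input string here is just an illustrative example):

x = "I have a pen.\nI have an apple. \nAh\nApple pen.\n"
print(split_to_sentences(x))
# ['I have a pen.', 'I have an apple.', 'Ah', 'Apple pen.']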

    Exercise 02

The next step is to tokenize the sentences (split each sentence into a list of words).

    • Convert all words to lowercase, so that capitalized and lowercase forms (e.g. "Big" and "big") are treated as the same word.
    • Append each sentence's list of tokens to one overall list.
def tokenize_sentences(sentences):
    """Tokenize sentences into tokens (words)

    Args:
        sentences: List of strings

    Returns:
        List of lists of tokens
    """
    # Initialize the list of lists of tokenized sentences
    tokenized_sentences = []

    # Go through each sentence
    for sentence in sentences:
        # Convert to lowercase letters
        sentence = sentence.lower()

        # Convert to a list of words
        tokenized = nltk.word_tokenize(sentence)

        # Append the list of words to the list of lists
        tokenized_sentences.append(tokenized)

    return tokenized_sentences
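A quick check (this assumes the NLTK 'punkt' tokenizer data has been downloaded, e.g. via nltk.download('punkt')):

sentences = ["Sky is blue.", "Leaves are green.", "Roses are red."]
print(tokenize_sentences(sentences))
# [['sky', 'is', 'blue', '.'], ['leaves', 'are', 'green', '.'], ['roses', 'are', 'red', '.']]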

    Exercise 03

Use the two functions defined above to get the tokenized data.

    • Split the data into sentences.
    • Tokenize those sentences.
def get_tokenized_data(data):
    """Make a list of tokenized sentences

    Args:
        data: String

    Returns:
        List of lists of tokens
    """
    # Get the sentences by splitting up the data
    sentences = split_to_sentences(data)

    # Get the list of lists of tokens by tokenizing the sentences
    tokenized_sentences = tokenize_sentences(sentences)

    return tokenized_sentences

Split the data into a training set and a test set.

tokenized_data = get_tokenized_data(data)
random.seed(87)
random.shuffle(tokenized_data)

train_size = int(len(tokenized_data) * 0.8)
train_data = tokenized_data[0:train_size]
test_data = tokenized_data[train_size:]

    Exercise 04

    并非有所的詞在訓練里你都會用到,你只會用到高頻詞

    • 你只會關注在數(shù)據(jù)集里出現(xiàn)n次的詞
    • 首先需要計算詞出現(xiàn)的此時

    你需要嵌套兩個循環(huán),一個是sentences ,一個是sentences里的詞

def count_words(tokenized_sentences):
    """Count the number of word appearances in the tokenized sentences

    Args:
        tokenized_sentences: List of lists of strings

    Returns:
        dict that maps word (str) to the frequency (int)
    """
    word_counts = {}

    # Loop through the sentences
    for sentence in tokenized_sentences:
        for token in sentence:
            # If the token is not in the dictionary yet, set the count to 1
            if token not in word_counts.keys():
                word_counts[token] = 1
            # If the token is already in the dictionary, increment the count by 1
            else:
                word_counts[token] += 1

    return word_counts
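For example, on three short tokenized sentences:

tokenized_sentences = [['sky', 'is', 'blue', '.'],
                       ['leaves', 'are', 'green', '.'],
                       ['roses', 'are', 'red', '.']]
print(count_words(tokenized_sentences))
# {'sky': 1, 'is': 1, 'blue': 1, '.': 3, 'leaves': 1, 'are': 2, 'green': 1, 'roses': 1, 'red': 1}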

Handling 'out of vocabulary' words

If your model is performing autocomplete but encounters a word it never saw during training, it cannot suggest a next word: the model cannot make a prediction because there are no counts for the current (unseen) word.

    • Such a new word is called an 'unknown word', or out of vocabulary (OOV) word.
    • The percentage of unknown words in the test set is called the OOV rate.

To handle unknown words during prediction, we represent them with a special token '<unk>'.

Modify the training data so that there are some 'unknown' words to train on:

    • Convert words that occur only rarely into "unknown" words.
    • Create a list of the most frequent words in the training set, called the closed vocabulary.
    • Convert every word that is not in the closed vocabulary into '<unk>'.

    Exercise 05

Create a function that takes the documents and a count threshold 'count_threshold'.

    • Any word whose count is at least 'count_threshold' belongs to the closed vocabulary.
def get_words_with_nplus_frequency(tokenized_sentences, count_threshold):
    """Find the words that appear N times or more

    Args:
        tokenized_sentences: List of lists of sentences
        count_threshold: minimum number of occurrences for a word to be in the closed vocabulary.

    Returns:
        List of words that appear N times or more
    """
    # Initialize an empty list to contain the words that
    # appear at least N times
    closed_vocab = []

    # Get the word counts of the tokenized sentences
    # (use the function that you defined earlier to count the words)
    word_counts = count_words(tokenized_sentences)

    for word, cnt in word_counts.items():
        # Check that the word's count is at least the minimum count
        if cnt >= count_threshold:
            closed_vocab.append(word)

    return closed_vocab

    Exercise 06

    • Every word other than these frequent words is treated as 'unknown'.
    • Represent the 'unknown' words with the token "<unk>".
def replace_oov_words_by_unk(tokenized_sentences, vocabulary, unknown_token="<unk>"):
    """Replace words not in the given vocabulary with the '<unk>' token.

    Args:
        tokenized_sentences: List of lists of strings
        vocabulary: List of strings that we will use
        unknown_token: A string representing unknown (out-of-vocabulary) words

    Returns:
        List of lists of strings, with words not in the vocabulary replaced
    """
    # Place vocabulary into a set for faster search
    vocabulary = set(vocabulary)

    # Initialize a list that will hold the sentences
    # after less frequent words are replaced by the unknown token
    replaced_tokenized_sentences = []

    for sentence in tokenized_sentences:
        # Initialize the list that will contain
        # a single sentence with unk replacements
        replaced_sentence = []

        for token in sentence:
            if token in vocabulary:
                replaced_sentence.append(token)
            else:
                replaced_sentence.append(unknown_token)

        replaced_tokenized_sentences.append(replaced_sentence)

    return replaced_tokenized_sentences
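A small illustration of the two functions together (the threshold of 2 is chosen arbitrarily for this example):

tokenized_sentences = [['sky', 'is', 'blue', '.'],
                       ['leaves', 'are', 'green', '.'],
                       ['roses', 'are', 'red', '.']]

# Only words that appear at least twice make it into the closed vocabulary
vocab = get_words_with_nplus_frequency(tokenized_sentences, count_threshold=2)
print(vocab)  # ['.', 'are']

# Every other word is replaced by '<unk>'
print(replace_oov_words_by_unk(tokenized_sentences, vocab))
# [['<unk>', '<unk>', '<unk>', '.'], ['<unk>', 'are', '<unk>', '.'], ['<unk>', 'are', '<unk>', '.']]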

    Exercise 07

Now we are ready to process the data by combining the functions implemented above.

  • In the training set, find the words that appear at least count_threshold times.
  • In both the training set and the test set, replace words that appear fewer than count_threshold times with "<unk>".
def preprocess_data(train_data, test_data, count_threshold):
    """Preprocess data, i.e.,
        - Find tokens that appear at least N times in the training data.
        - Replace tokens that appear less than N times by "<unk>" both for training and test data.

    Args:
        train_data, test_data: List of lists of strings.
        count_threshold: Words whose count is less than this are treated as unknown.

    Returns:
        Tuple of
        - training data with low-frequency words replaced by "<unk>"
        - test data with low-frequency words replaced by "<unk>"
        - vocabulary of words that appear N times or more in the training data
    """
    # Get the closed vocabulary using the train data
    vocabulary = get_words_with_nplus_frequency(train_data, count_threshold)

    # For the train data, replace less common words with "<unk>"
    train_data_replaced = replace_oov_words_by_unk(train_data, vocabulary)

    # For the test data, replace less common words with "<unk>"
    test_data_replaced = replace_oov_words_by_unk(test_data, vocabulary)

    return train_data_replaced, test_data_replaced, vocabulary
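The later cells assume that the preprocessed data and vocabulary are available under the names train_data_processed, test_data_processed and vocabulary. A minimal call might look like this (the threshold of 2 is an assumption; any small positive integer works):

minimum_freq = 2  # assumed count threshold
train_data_processed, test_data_processed, vocabulary = preprocess_data(
    train_data, test_data, minimum_freq)

print("Size of vocabulary:", len(vocabulary))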

Part 2: Develop an n-gram based language model

In this section, you will develop the n-gram language model.

    • Assume that the probability of the current word depends only on the previous n-gram.
    • The previous n-gram is the sequence of the previous n words.

For the word at position t in a sentence, whose preceding words are w_{t-1}, w_{t-2}, …, w_{t-n}, the conditional probability is:

P(w_t | w_{t-1} … w_{t-n})

You can estimate this probability by counting how often these sequences of words occur in the training data.

In this estimate, the numerator is the number of times the word at position t (w_t) appears after the sequence w_{t-n}, …, w_{t-1}, and the denominator is the number of times the sequence w_{t-n}, …, w_{t-1} appears:

P(w_t | w_{t-n} … w_{t-1}) ≈ C(w_{t-n} … w_{t-1}, w_t) / C(w_{t-n} … w_{t-1})

If a count is 0 (whether in the numerator or the denominator), the estimate can be modified by adding k-smoothing.

Based on this formula, for the denominator we need to count sequences of n words, and for the numerator we need to count sequences of n+1 words.

    Exercise 08

Below, you will write a function that counts n-grams for any value of n.

Before counting, prepend n start tokens <s> to each sentence to mark its beginning; for example, with n = 2 the sentence "I like food" becomes "<s> <s> I like food". Also append an end token <e> to mark the end of the sentence.

Technical note: in this function, you will use a dictionary to store the counts.

    • The key of the dictionary is a tuple of n words (not a list).
    • The value of the dictionary is the number of occurrences.
    • The reason a tuple is used as the key rather than a list is that a list is mutable in Python (it can be modified), whereas a tuple is immutable: once created it cannot be changed. See the short demo after this list.
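A quick demonstration of why a tuple is used as the key:

counts = {}
counts[("i", "like")] = 1       # works: tuples are immutable and hashable
try:
    counts[["i", "like"]] = 1   # fails: lists are mutable and unhashable
except TypeError as e:
    print(e)                    # unhashable type: 'list'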
def count_n_grams(data, n, start_token='<s>', end_token='<e>'):
    """Count all n-grams in the data

    Args:
        data: List of lists of words
        n: number of words in a sequence

    Returns:
        A dictionary that maps a tuple of n words to its frequency
    """
    # Initialize dictionary of n-grams and their counts
    n_grams = {}

    for sentence in data:
        # Prepend the start token n times, and append <e> once
        sentence = [start_token] * n + sentence + [end_token]

        # Convert list to tuple so that the sequence of words
        # can be used as a key in the dictionary
        sentence = tuple(sentence)

        # Use i to indicate the start of the n-gram,
        # from index 0 to the last index where the end
        # of the n-gram is within the sentence
        m = len(sentence) if n == 1 else len(sentence) - 1
        for i in range(m):
            # Get the n-gram starting at position i
            n_gram = sentence[i: i + n]

            # Check if the n-gram is already in the dictionary
            if n_gram in n_grams.keys():
                n_grams[n_gram] += 1
            else:
                n_grams[n_gram] = 1

    return n_grams
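A quick sanity check on two toy sentences; for instance, the bigram ('a', 'cat') appears in both sentences, so its count should be 2:

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
print("Uni-gram counts:")
print(count_n_grams(sentences, 1))
print("Bi-gram counts:")
print(count_n_grams(sentences, 2))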

    Exercise 09

Next, estimate the probability of a word given the previous n words.

According to the count-ratio formula above, if the previous n-gram never appears in the training data its count is 0, the denominator is 0, and the formula cannot be used. To handle these zero counts, we add k-smoothing:

P(w_t | w_{t-n} … w_{t-1}) ≈ (C(w_{t-n} … w_{t-1}, w_t) + k) / (C(w_{t-n} … w_{t-1}) + k|V|)

A constant k is added to the numerator and k|V| to the denominator, so any n-gram with a count of 0 gets probability 1/|V|, where |V| is the size of the vocabulary.

def estimate_probability(word, previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    """Estimate the probability of a next word using the n-gram counts with k-smoothing

    Args:
        word: next word
        previous_n_gram: A sequence of words of length n
        n_gram_counts: Dictionary of counts of n-grams
        n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
        vocabulary_size: number of words in the vocabulary
        k: positive constant, smoothing parameter

    Returns:
        A probability
    """
    # Convert list to tuple to use it as a dictionary key
    previous_n_gram = tuple(previous_n_gram)

    # Set the denominator:
    # if the previous n-gram exists in the dictionary of n-gram counts,
    # get its count; otherwise set the count to zero
    previous_n_gram_count = n_gram_counts[previous_n_gram] if previous_n_gram in n_gram_counts else 0

    # Calculate the denominator using the count of the previous n-gram
    # and apply k-smoothing
    denominator = previous_n_gram_count + k * vocabulary_size

    # Define the (n+1)-gram as the previous n-gram plus the current word, as a tuple
    n_plus1_gram = previous_n_gram + (word,)

    # Set the count to the count in the dictionary, otherwise 0 if not in the dictionary
    # (use the dictionary that has counts for the n-gram plus current word)
    n_plus1_gram_count = n_plus1_gram_counts[n_plus1_gram] if n_plus1_gram in n_plus1_gram_counts else 0

    # Define the numerator using the count of the n-gram plus current word,
    # and apply smoothing
    numerator = n_plus1_gram_count + k

    # Calculate the probability as numerator divided by denominator
    probability = numerator / denominator

    return probability
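For example, with the two toy sentences from before, the probability of "cat" following "a" under a bigram model with k = 1 is (2 + 1) / (2 + 1 × 7) ≈ 0.3333, since C(a, cat) = 2, C(a) = 2 and there are 7 unique words:

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)

tmp_prob = estimate_probability("cat", ["a"], unigram_counts, bigram_counts, len(unique_words), k=1)
print(f"P(cat | a) = {tmp_prob:.4f}")  # (2 + 1) / (2 + 1 * 7) = 0.3333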

Estimating the probabilities of all words

The function below loops over every word in the vocabulary and estimates its probability of being the next word.

def estimate_probabilities(previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0):
    """Estimate the probabilities of next words using the n-gram counts with k-smoothing

    Args:
        previous_n_gram: A sequence of words of length n
        n_gram_counts: Dictionary of counts of n-grams
        n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
        vocabulary: List of words
        k: positive constant, smoothing parameter

    Returns:
        A dictionary mapping from next words to the probability.
    """
    # Convert list to tuple to use it as a dictionary key
    previous_n_gram = tuple(previous_n_gram)

    # Add <e> and <unk> to the vocabulary
    # (<s> is not needed since it should not appear as the next word)
    vocabulary = vocabulary + ['<e>', '<unk>']
    vocabulary_size = len(vocabulary)

    probabilities = {}
    for word in vocabulary:
        probability = estimate_probability(word, previous_n_gram,
                                           n_gram_counts, n_plus1_gram_counts,
                                           vocabulary_size, k=k)
        probabilities[word] = probability

    return probabilities

    Count and probability matrices

Using the functions defined above, we can build count and probability matrices.

def make_count_matrix(n_plus1_gram_counts, vocabulary):
    # Add <e> and <unk> to the vocabulary
    # (<s> is omitted since it should not appear as the next word)
    vocabulary = vocabulary + ["<e>", "<unk>"]

    # Obtain the unique n-grams
    n_grams = []
    for n_plus1_gram in n_plus1_gram_counts.keys():
        n_gram = n_plus1_gram[0:-1]
        n_grams.append(n_gram)
    n_grams = list(set(n_grams))

    # Mapping from n-gram to row
    row_index = {n_gram: i for i, n_gram in enumerate(n_grams)}
    # Mapping from next word to column
    col_index = {word: j for j, word in enumerate(vocabulary)}

    nrow = len(n_grams)
    ncol = len(vocabulary)
    count_matrix = np.zeros((nrow, ncol))
    for n_plus1_gram, count in n_plus1_gram_counts.items():
        n_gram = n_plus1_gram[0:-1]
        word = n_plus1_gram[-1]
        if word not in vocabulary:
            continue
        i = row_index[n_gram]
        j = col_index[word]
        count_matrix[i, j] = count

    count_matrix = pd.DataFrame(count_matrix, index=n_grams, columns=vocabulary)
    return count_matrix


sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

print('\ntrigram counts')
trigram_counts = count_n_grams(sentences, 3)
display(make_count_matrix(trigram_counts, unique_words))

This displays the trigram count matrix.

Computing the probability matrix

def make_probability_matrix(n_plus1_gram_counts, vocabulary, k):
    count_matrix = make_count_matrix(n_plus1_gram_counts, vocabulary)
    count_matrix += k
    prob_matrix = count_matrix.div(count_matrix.sum(axis=1), axis=0)
    return prob_matrix

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

print("trigram probabilities")
trigram_counts = count_n_grams(sentences, 3)
display(make_probability_matrix(trigram_counts, unique_words, k=1))

This displays the trigram probability matrix.

Part 3: Perplexity

Perplexity is used to measure how good a model is: the lower the perplexity, the better the model.

In this section, we compute the perplexity on the test set.

The perplexity of a sentence W = w_1 … w_N is

PP(W) = ( ∏_{t=n+1}^{N} 1 / P(w_t | w_{t-n} … w_{t-1}) )^(1/N)

where

    • N is the length of the sentence
    • n is the number of words in the n-gram (e.g. 2 for a bigram)
    • in the math notation, word positions are numbered starting from 1, not 0

In the code, array indices start at 0, so the loop index t ranges from n to N-1 while using the same formula.

The higher the estimated probabilities, the lower the perplexity.

The more information the n-grams give us about the sentence, the lower the perplexity score.

    Exercise 10

def calculate_perplexity(sentence, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    """Calculate perplexity for a single tokenized sentence

    Args:
        sentence: List of strings
        n_gram_counts: Dictionary of counts of n-grams
        n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
        vocabulary_size: number of unique words in the vocabulary
        k: Positive smoothing constant

    Returns:
        Perplexity score
    """
    # Length of the previous words (the n in n-gram)
    n = len(list(n_gram_counts.keys())[0])

    # Prepend <s> n times and append <e>
    sentence = ["<s>"] * n + sentence + ["<e>"]

    # Cast the sentence from a list to a tuple
    sentence = tuple(sentence)

    # Length of the sentence (after adding <s> and <e> tokens)
    N = len(sentence)

    # The variable product_pi will hold the product
    # that is calculated inside the N-th root
    product_pi = 1.0

    # Index t ranges from n to N - 1
    for t in range(n, N):
        # Get the n-gram preceding the word at position t
        n_gram = sentence[t - n:t]

        # Get the word at position t
        word = sentence[t]

        # Estimate the probability of the word given the n-gram,
        # using the n-gram counts, (n+1)-gram counts,
        # vocabulary size, and smoothing constant
        probability = estimate_probability(word, n_gram,
                                           n_gram_counts, n_plus1_gram_counts,
                                           vocabulary_size, k=k)

        # product_pi is a cumulative product of the (1/P) factors
        product_pi *= 1 / probability

    # Take the N-th root of the product
    perplexity = product_pi ** (1 / float(N))

    return perplexity
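For example, reusing the toy counts from above, a sentence whose bigrams were all seen during counting should get a lower perplexity than a slightly different, unseen sentence:

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)

perplexity_seen = calculate_perplexity(sentences[0], unigram_counts, bigram_counts, len(unique_words), k=1.0)
perplexity_unseen = calculate_perplexity(['i', 'like', 'a', 'dog'], unigram_counts, bigram_counts, len(unique_words), k=1.0)

print(f"Perplexity of the seen sentence:   {perplexity_seen:.4f}")   # ≈ 2.80
print(f"Perplexity of an unseen sentence:  {perplexity_unseen:.4f}") # ≈ 3.97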

Part 4: Build an autocomplete system

In this section, we use the functions developed above to build an autocomplete system.

The function below accepts an optional start_with argument, which restricts the suggestion to words beginning with those letters.

def suggest_a_word(previous_tokens, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0, start_with=None):
    """Get a suggestion for the next word

    Args:
        previous_tokens: The sentence you input where each token is a word. Must have length > n
        n_gram_counts: Dictionary of counts of n-grams
        n_plus1_gram_counts: Dictionary of counts of (n+1)-grams
        vocabulary: List of words
        k: positive constant, smoothing parameter
        start_with: If not None, specifies the first few letters of the next word

    Returns:
        A tuple of
        - string of the most likely next word
        - corresponding probability
    """
    # Length of the previous words (the n in n-gram)
    n = len(list(n_gram_counts.keys())[0])

    # From the words that the user already typed,
    # get the most recent n words as the previous n-gram
    previous_n_gram = previous_tokens[-n:]

    # Estimate the probability that each word in the vocabulary is the next word,
    # given the previous n-gram, the n-gram counts,
    # the (n+1)-gram counts, and the smoothing constant
    probabilities = estimate_probabilities(previous_n_gram,
                                           n_gram_counts, n_plus1_gram_counts,
                                           vocabulary, k=k)

    # Initialize the suggested word to None;
    # this will be set to the word with the highest probability
    suggestion = None

    # Initialize the highest word probability to 0;
    # this will be set to the highest probability of all candidate words
    max_prob = 0

    # For each word and its probability in the probabilities dictionary:
    for word, prob in probabilities.items():
        # If the optional start_with string is set,
        # skip words that do not start with those letters
        if start_with is not None:
            if not word.startswith(start_with):
                continue

        # If this word's probability is greater than the current maximum,
        # save it as the best suggestion (so far)
        if prob > max_prob:
            suggestion = word
            max_prob = prob

    return suggestion, max_prob

A quick test:

sentences = [['i', 'like', 'a', 'cat'],
             ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)

previous_tokens = ["i", "like"]
tmp_suggest1 = suggest_a_word(previous_tokens, unigram_counts, bigram_counts, unique_words, k=1.0)
print(f"The previous words are 'i like',\n\tand the suggested word is `{tmp_suggest1[0]}` with a probability of {tmp_suggest1[1]:.4f}")

print()

# Test your code when setting start_with
tmp_starts_with = 'c'
tmp_suggest2 = suggest_a_word(previous_tokens, unigram_counts, bigram_counts, unique_words, k=1.0, start_with=tmp_starts_with)
print(f"The previous words are 'i like', the suggestion must start with `{tmp_starts_with}`\n\tand the suggested word is `{tmp_suggest2[0]}` with a probability of {tmp_suggest2[1]:.4f}")

Output:

    The previous words are 'i like',
    and the suggested word is `a` with a probability of 0.2727
    The previous words are 'i like', the suggestion must start with `c`
    and the suggested word is `cat` with a probability of 0.0909

Getting multiple suggestions

def get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0, start_with=None):
    model_counts = len(n_gram_counts_list)
    suggestions = []
    for i in range(model_counts - 1):
        n_gram_counts = n_gram_counts_list[i]
        n_plus1_gram_counts = n_gram_counts_list[i + 1]

        suggestion = suggest_a_word(previous_tokens, n_gram_counts,
                                    n_plus1_gram_counts, vocabulary,
                                    k=k, start_with=start_with)
        suggestions.append(suggestion)
    return suggestions

Get multiple suggestions using n-grams of varying length

Congratulations! You have now built all the components needed for your own autocomplete system.

Let's look at models based on n-grams of varying length (unigrams, bigrams, trigrams, 4-grams and 5-grams).

n_gram_counts_list = []
for n in range(1, 6):
    print("Computing n-gram counts with n =", n, "...")
    n_model_counts = count_n_grams(train_data_processed, n)
    n_gram_counts_list.append(n_model_counts)

previous_tokens = ["i", "am", "to"]
tmp_suggest4 = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0)

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest4)

Output:

    The previous words are ['i', 'am', 'to'], the suggestions are:
    [('be', 0.027665685098338604), ('have', 0.00013487086115044844), ('have', 0.00013490725126475548), ('i', 6.746272684341901e-05)]

previous_tokens = ["hey", "how", "are", "you"]
tmp_suggest7 = get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0)

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest7)

Output:

    The previous words are ['hey', 'how', 'are', 'you'], the suggestions are:
    [("'re", 0.023973994311255586), ('?', 0.002888465830762161), ('?', 0.0016134453781512605), ('<e>', 0.00013491635186184566)]
