An Overview of Natural Language Processing
Weren't we all surprised the first time a smart device understood what we were telling it — and answered in the friendliest manner, too? Apple's Siri and Amazon's Alexa comprehend when we ask for the weather, for directions, or for a certain genre of music. Ever since, I have wondered how these computers make sense of our language. That long-overdue curiosity finally pushed me to write this post as a newcomer to the field.
In this article, I will be using a popular NLP library called NLTK. The Natural Language Toolkit (NLTK) is one of the most powerful and probably the most popular natural language processing libraries. Not only does it offer a comprehensive set of tools for Python-based programming, it also supports a wide range of human languages.
What is Natural Language Processing?
Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human languages, in particular how to train computers to process and analyze large amounts of natural language data.
Why is handling unstructured data so important?
With every tick of the clock, the world generates an overwhelming amount of data — and the majority of it is unstructured. Formats such as text, audio, video, and images are classic examples of unstructured data. Unstructured data has no fixed dimensions or structure like the traditional rows and columns of a relational database, so it is harder to analyze and not easily searchable. Even so, business organizations must find ways to address this challenge and embrace the opportunity to derive insights from such data if they want to prosper in highly competitive environments. With the help of natural language processing and machine learning, this is changing fast.
Are computers confused by our natural language?
Human language is one of our most powerful tools of communication. The words, the tone, the sentences, and the gestures we use all carry information. There are countless ways of assembling words into a phrase, and words can have many shades of meaning, so comprehending human language with its intended meaning is a challenge. A linguistic paradox is a phrase or sentence that contradicts itself — for example, "oh, this is my open secret" or "can you please act naturally". Though these sound pointedly foolish, we humans can understand and use them in everyday speech; for machines, the ambiguity and imprecision of natural language are the hurdles to clear.
Most used NLP Libraries
In the past, only pioneers with superior knowledge of mathematics, machine learning, and linguistics could be part of NLP projects. Now developers can use ready-made libraries that simplify the pre-processing of text, so that they can concentrate on building machine learning models. These libraries enable text comprehension, interpretation, and sentiment analysis with only a few lines of code. The most popular NLP libraries are:
Spark NLP, NLTK, PyTorch-Transformers, TextBlob, spaCy, Stanford CoreNLP, Apache OpenNLP, AllenNLP, Gensim, NLP Architect, scikit-learn.
The question is: where should we start, and how?
Have you ever observed how kids start to understand and learn a language? Right — by picking up individual words first, and then sentence formation. Making computers understand our language is more or less similar.
Pre-processing steps:
1. Sentence Tokenization (Sentence Segmentation)
To make computers understand natural language, the first step is to break paragraphs into sentences. Punctuation marks are an easy way to split the sentences apart.
```python
import nltk

nltk.download('punkt')

text = "Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland. However, the link between Home Farm and the senior team was severed in the late 1990s. The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area."

sentences = nltk.sent_tokenize(text)
print("The number of sentences in the paragraph:", len(sentences))
for sentence in sentences:
    print(sentence)
```

OUTPUT:

```
The number of sentences in the paragraph: 3
Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland.
However, the link between Home Farm and the senior team was severed in the late 1990s.
The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area.
```
2. Word Tokenization (Word Segmentation)
By now we have the separated sentences with us; the next step is to break each sentence into words, which are often called tokens.
Just as creating space in one's own life helps for the good, the spaces between words help break a phrase apart into individual words. We can treat punctuation marks as separate tokens as well, since punctuation has a purpose too.
```python
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    print("The number of words in a sentence:", len(words))
    print(words)
```

OUTPUT:

```
The number of words in a sentence: 32
['Home', 'Farm', 'is', 'one', 'of', 'the', 'biggest', 'junior', 'football', 'clubs', 'in', 'Ireland', 'and', 'their', 'senior', 'team', ',', 'from', '1970', 'up', 'to', 'the', 'late', '1990s', ',', 'played', 'in', 'the', 'League', 'of', 'Ireland', '.']
The number of words in a sentence: 18
['However', ',', 'the', 'link', 'between', 'Home', 'Farm', 'and', 'the', 'senior', 'team', 'was', 'severed', 'in', 'the', 'late', '1990s', '.']
The number of words in a sentence: 22
['The', 'senior', 'side', 'was', 'briefly', 'known', 'as', 'Home', 'Farm', 'Fingal', 'in', 'an', 'effort', 'to', 'identify', 'it', 'with', 'the', 'north', 'Dublin', 'area', '.']
```
A prerequisite for using the word_tokenize() or sent_tokenize() functions is that the punkt package has been downloaded.
3. Stemming and Text Lemmatization
In every text document, we usually come across different forms of a word — write, writes, writing — with a similar meaning and the same base word. But how do we make a computer analyze such words? That's where text lemmatization and stemming come into the picture.
Stemming and text lemmatization are normalization techniques built on the same idea: chopping the ends of a word down to a core word. While both aim to solve the same problem, they go about it in entirely different ways. Stemming is often a crude heuristic process, whereas lemmatization reduces a word to a vocabulary-based morphological base form. Let's take a closer look!
Stemming - Words are reduced to their stem. A word stem need not be the same as a dictionary-based morphological (smallest-unit) root; it is simply an equal or shorter form of the word.
```python
from nltk.stem import PorterStemmer

# Create an object of class PorterStemmer
porter = PorterStemmer()

# A list of words to be stemmed
word_list = ['running', ',', 'driving', 'sung', 'between', 'lasted', 'was', 'paticipated', 'before', 'severed', '1990s', '.']

print("{0:20}{1:20}".format("Word", "Porter Stemmer"))
for word in word_list:
    print("{0:20}{1:20}".format(word, porter.stem(word)))
```

OUTPUT:

```
Word                Porter Stemmer
running             run
,                   ,
driving             drive
sung                sung
between             between
lasted              last
was                 wa
paticipated         paticip
before              befor
severed             sever
1990s               1990
.                   .
```
Stemming is not as easy as it looks :( — we can run into two issues, under-stemming and over-stemming of a word.
Lemmatization - Where stemming is a best-estimate method that snips a word based on how it appears, lemmatization is a more deliberate way of pruning the word: it resolves each word against a dictionary. Indeed, a word's lemma is its dictionary or canonical form.
```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

wordnet_lemmatizer = WordNetLemmatizer()

# A list of words to lemmatize
word_list = ['running', ',', 'drives', 'sung', 'between', 'lasted', 'was', 'paticipated', 'before', 'severed', '1990s', '.']

print("{0:20}{1:20}".format("Word", "Lemma"))
for word in word_list:
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word)))
```

OUTPUT:

```
Word                Lemma
running             running
,                   ,
drives              drive
sung                sung
between             between
lasted              lasted
was                 wa
paticipated         paticipated
before              before
severed             severed
1990s               1990s
.                   .
```
If speed is needed, resorting to stemming is better; when accuracy is needed, it's better to use lemmatization.
4. Stop Words
Words such as 'in', 'at', 'on', 'so', etc. are considered stop words. Stop words don't play an important role on their own in NLP, but how they are removed plays an important role during sentiment analysis.
NLTK ships with stop word lists for 16 different languages.
```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
print("The stop words in the NLTK lib are:", stop_words)

para = """Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland. However, the link between Home Farm and the senior team was severed in the late 1990s. The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area."""

tokenized_para = word_tokenize(para)
modified_token_list = [word for word in tokenized_para if word not in stop_words]
print("After removing the stop words in the sentence:")
print(modified_token_list)
```

OUTPUT:

```
The stop words in the NLTK lib are: {'about', 'ma', "shouldn't", 's', 'does', 't', 'our', 'mightn', 'doing', 'while', 'ourselves', 'themselves', 'will', 'some', 'you', "aren't", 'by', "needn't", 'in', 'can', 'he', 'into', 'as', 'being', 'between', 'very', 'after', 'couldn', 'himself', 'herself', 'had', 'its', 've', 'him', 'll', "isn't", 'through', 'should', 'was', 'now', 'them', "you'll", 'again', 'who', 'don', 'been', 'they', 'weren', "you're", 'both', 'd', 'me', 'didn', "won't", "you'd", 'only', 'itself', 'hadn', "should've", 'than', 'how', 'few', 're', 'down', 'these', 'y', "haven't", "mightn't", 'won', "hadn't", 'other', 'above', 'all', "doesn't", 'isn', "that'll", 'not', 'yourselves', 'at', 'mustn', "it's", 'on', 'the', 'for', "didn't", 'what', "mustn't", 'his', 'haven', 'doesn', "you've", 'are', 'out', 'hers', 'with', 'has', 'she', 'most', 'ain', 'those', 'when', 'myself', 'before', 'their', 'during', 'there', 'or', 'until', 'that', 'more', "hasn't", 'o', 'we', 'and', "shan't", 'which', 'because', "don't", 'why', 'shan', 'an', 'my', 'if', 'did', 'having', "couldn't", 'your', 'theirs', 'aren', 'just', 'further', 'here', 'of', "wouldn't", 'be', 'too', 'her', 'no', 'same', 'it', 'is', 'were', 'yourself', 'have', 'off', 'this', 'needn', 'once', "wasn't", 'against', 'wouldn', 'up', 'a', 'i', 'below', "weren't", 'over', 'own', 'then', 'so', 'do', 'from', 'shouldn', 'am', 'under', 'any', 'yours', 'ours', 'hasn', 'such', 'nor', 'wasn', 'to', 'where', 'm', "she's", 'each', 'whom', 'but'}
After removing the stop words in the sentence:
['Home', 'Farm', 'one', 'biggest', 'junior', 'football', 'clubs', 'Ireland', 'senior', 'team', ',', '1970', 'late', '1990s', ',', 'played', 'League', 'Ireland', '.', 'However', ',', 'link', 'Home', 'Farm', 'senior', 'team', 'severed', 'late', '1990s', '.', 'The', 'senior', 'side', 'briefly', 'known', 'Home', 'Farm', 'Fingal', 'effort', 'identify', 'north', 'Dublin', 'area', '.']
```
5. POS Tagging
Down the memory lane of our early English grammar classes: can we all remember how our teachers gave relevant instruction on the basic parts of speech for effective communication? Yeah, good old days! Let's teach parts of speech to our computers too. :)
The eight parts of speech are nouns, verbs, pronouns, adjectives, adverbs, prepositions, conjunctions, and interjections.
POS tagging is the ability to identify and assign parts of speech to the words in a sentence. There are different tag sets to choose from, but we will be using the universal tag set.
```python
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

# Tag the tokens of the first sentence with the universal tag set
words = nltk.word_tokenize(sentences[0])
pos_tags = nltk.pos_tag(words, tagset="universal")
print(pos_tags)
```

OUTPUT:

```
[('Home', 'NOUN'), ('Farm', 'NOUN'), ('is', 'VERB'), ('one', 'NUM'), ('of', 'ADP'), ('the', 'DET'), ('biggest', 'ADJ'), ('junior', 'NOUN'), ('football', 'NOUN'), ('clubs', 'NOUN'), ('in', 'ADP'), ('Ireland', 'NOUN'), ('and', 'CONJ'), ('their', 'PRON'), ('senior', 'ADJ'), ('team', 'NOUN'), (',', '.'), ('from', 'ADP'), ('1970', 'NUM'), ('up', 'ADP'), ('to', 'PRT'), ('the', 'DET'), ('late', 'ADJ'), ('1990s', 'NUM'), (',', '.'), ('played', 'VERB'), ('in', 'ADP'), ('the', 'DET'), ('League', 'NOUN'), ('of', 'ADP'), ('Ireland', 'NOUN'), ('.', '.')]
```
One application of POS tagging is analyzing product qualities in feedback: by sorting out the adjectives in customers' reviews, we can evaluate the sentiment of the feedback. Say, for example: how was your shopping with us?
6. Chunking
Chunking is used to add more structure to the sentence on top of the part-of-speech (POS) tags; it is also named shallow parsing. The resulting word groups are named "chunks." There are no predefined rules to perform chunking — we write our own grammar.
Phrase structure conventions:
- S (Sentence) → NP VP
- NP → {Determiner, Noun, Pronoun, Proper name}
- VP → V (NP) (PP) (Adverb)
- PP → Preposition (NP)
- AP → Adjective (PP)
I never had a good time with complex regular expressions; I used to stay as far away as I could. Of late, though, I realized how important it is to have a grip on regular expressions in data science. Let's start by understanding a simple instance.
Say we need to tag the nouns, past-tense verbs, adjectives, and coordinating conjunctions in a sentence. We can use a rule like the one below:
```
chunk: {<NN.?>*<VBD.?>*<JJ.?>*<CC>?}
```
```python
import nltk
from nltk import pos_tag, RegexpParser
from nltk.tokenize import word_tokenize

content = "Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland. However, the link between Home Farm and the senior team was severed in the late 1990s. The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area."

tokenized_text = word_tokenize(content)
print("After Split:", tokenized_text)

tokens_tag = pos_tag(tokenized_text)
print("After Token:", tokens_tag)

patterns = """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
chunker = RegexpParser(patterns)
print("After Regex:", chunker)

output = chunker.parse(tokens_tag)
print("After Chunking:", output)
```

OUTPUT:

```
After Regex: chunk.RegexpParser with 1 stages:
RegexpChunkParser with 1 rules:
       <ChunkRule: '<NN.?>*<VBD.?>*<JJ.?>*<CC>?'>
After Chunking: (S
  (mychunk Home/NN Farm/NN)
  is/VBZ one/CD of/IN the/DT
  (mychunk biggest/JJS)
  (mychunk junior/NN football/NN clubs/NNS)
  in/IN
  (mychunk Ireland/NNP and/CC)
  their/PRP$
  (mychunk senior/JJ)
  (mychunk team/NN)
  ,/, from/IN 1970/CD up/IN to/TO the/DT
  (mychunk late/JJ)
  1990s/CD ,/, played/VBN in/IN the/DT
  (mychunk League/NNP)
  of/IN
  (mychunk Ireland/NNP)
  ./.)
```
7. Wordnet
WordNet is an NLTK corpus reader for a lexical database of English. It can be used to find synonyms or antonyms of a word.
```python
from nltk.corpus import wordnet

synonyms = []
antonyms = []

for syn in wordnet.synsets("active"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())

for syn in wordnet.synsets("active"):
    for lemma in syn.lemmas():
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

print("Synonyms are:", synonyms)
print("Antonyms are:", antonyms)
```

OUTPUT:

```
Synonyms are: ['active_agent', 'active', 'active_voice', 'active', 'active', 'active', 'active', 'combat-ready', 'fighting', 'active', 'active', 'participating', 'active', 'active', 'active', 'active', 'alive', 'active', 'active', 'active', 'dynamic', 'active', 'active', 'active']
Antonyms are: ['passive_voice', 'inactive', 'passive', 'inactive', 'inactive', 'inactive', 'quiet', 'passive', 'stative', 'extinct', 'dormant', 'inactive']
```
8. Bag of Words
A bag-of-words model turns raw text into its constituent words and also counts the frequency of each word in the text.
```python
import nltk
import re  # to match regular expressions

text = "Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland. However, the link between Home Farm and the senior team was severed in the late 1990s. The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area."

sentences = nltk.sent_tokenize(text)
for i in range(len(sentences)):
    sentences[i] = sentences[i].lower()
    sentences[i] = re.sub(r'\W', ' ', sentences[i])
    sentences[i] = re.sub(r'\s+', ' ', sentences[i])

bag_of_words = {}
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    for word in words:
        if word not in bag_of_words:
            bag_of_words[word] = 1
        else:
            bag_of_words[word] += 1

print(bag_of_words)
```

OUTPUT:

```
{'home': 3, 'farm': 3, 'is': 1, 'one': 1, 'of': 2, 'the': 8, 'biggest': 1, 'junior': 1, 'football': 1, 'clubs': 1, 'in': 4, 'ireland': 2, 'and': 2, 'their': 1, 'senior': 3, 'team': 2, 'from': 1, '1970': 1, 'up': 1, 'to': 2, 'late': 2, '1990s': 2, 'played': 1, 'league': 1, 'however': 1, 'link': 1, 'between': 1, 'was': 2, 'severed': 1, 'side': 1, 'briefly': 1, 'known': 1, 'as': 1, 'fingal': 1, 'an': 1, 'effort': 1, 'identify': 1, 'it': 1, 'with': 1, 'north': 1, 'dublin': 1, 'area': 1}
```
9. TF-IDF
TF-IDF stands for Term Frequency - Inverse Document Frequency.
Text data needs to be converted to a numerical format in which each word is represented in matrix form. In the simplest encoding, a given word's vector has the corresponding element set to one and all other elements set to zero; TF-IDF replaces those ones with informative weights, which is why the technique is sometimes loosely grouped with word-embedding methods.
TF-IDF works on two concepts:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
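Before handing the work to a library, it helps to apply the two formulas by hand once. The mini-corpus here is made up purely for illustration:

```python
import math

docs = [
    "home farm is a junior club",
    "home farm played in the league",
    "the senior side was known as home farm fingal",
]
tokenized = [d.split() for d in docs]

def tf(term, doc_tokens):
    # Term frequency: occurrences of the term / total terms in the document.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # Inverse document frequency: log_e(total docs / docs containing the term).
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

print(tf("home", tokenized[0]))   # 1 occurrence out of 6 terms
print(idf("home", tokenized))     # in all 3 docs -> log(3/3) = 0.0
print(idf("fingal", tokenized))   # in 1 of 3 docs -> log(3)
```

Multiplying the two gives the TF-IDF score, so a word present in every document ends up with a score of zero.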
```python
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = ["Home Farm is one of the biggest junior football clubs in Ireland and their senior team, from 1970 up to the late 1990s, played in the League of Ireland",
        "However, the link between Home Farm and the senior team was severed in the late 1990s",
        "The senior side was briefly known as Home Farm Fingal in an effort to identify it with the north Dublin area"]

# Instantiate CountVectorizer()
cv = CountVectorizer()

# This step generates word counts for the words in your docs
word_count_vector = cv.fit_transform(docs)

tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)

# Print idf values (use get_feature_names() on scikit-learn < 1.0)
feature_names = cv.get_feature_names_out()
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=feature_names, columns=["idf_weights"])

# Sort ascending
df_idf.sort_values(by=['idf_weights'])

# Count matrix
count_vector = cv.transform(docs)

# tf-idf scores
tf_idf_vector = tfidf_transformer.transform(count_vector)

# Get the tf-idf vector for the first document
first_document_vector = tf_idf_vector[0]

# Print the scores
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"])
print(df.sort_values(by=["tfidf"], ascending=False))
```

OUTPUT:

```
             tfidf
of        0.374810
ireland   0.374810
the       0.332054
in        0.221369
1970      0.187405
football  0.187405
up        0.187405
as        0.000000
an        0.000000
```

and so on..
What are these scores telling us? The more common a word is across documents, the lower its score; the more unique the word, the higher its score.
So far, we have learned the steps of cleaning and pre-processing text. What can we do with the sorted data after all this? We could use it for sentiment analysis, chatbots, or market intelligence, or maybe build a recommender system based on user purchases or item reviews, or do customer segmentation with clustering.
Computers are still not as accurate with human language as they are with numbers. With the massive volume of text data generated every day, NLP is becoming ever more significant for making sense of that data and is being used in many other applications. Hence there are endless ways to explore NLP.
翻譯自: https://medium.com/analytics-vidhya/natural-language-processing-bedb2e1c8ceb
自然語言處理綜述
總結(jié)
以上是生活随笔為你收集整理的自然语言处理综述_自然语言处理的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 火傀儡怎么做(火到底是什么)
- 下一篇: 来自天秤座的梦想_天秤座:单线全自动机器