

Natural Language 12: Tokenizing Words and Sentences with NLTK


sklearn in practice: breast cancer cell data mining (video course recorded by the blogger)

https://study.163.com/course/introduction.htm?courseId=1005269003&utm_campaign=commission&utm_source=cp-400000000398149&utm_medium=share

https://www.pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/

# -*- coding: utf-8 -*-
"""
Created on Sun Nov 13 09:14:13 2016

@author: daxiong
"""

from nltk.tokenize import sent_tokenize, word_tokenize

example_text = ("Five score years ago, a great American, in whose symbolic "
                "shadow we stand today, signed the Emancipation Proclamation. "
                "This momentous decree came as a great beacon light of hope "
                "to millions of Negro slaves who had been seared in the "
                "flames of withering injustice. It came as a joyous daybreak "
                "to end the long night of bad captivity.")

# Split the text into a list of sentences and a list of word tokens.
list_sentences = sent_tokenize(example_text)
list_words = word_tokenize(example_text)


Code test
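A minimal way to inspect the two results from the snippet above:

# Quick check of the tokenizer output (variable names as in the snippet above).
print(len(list_sentences))   # expected: 3, one per sentence in the speech
print(list_sentences[0])     # the first sentence, kept intact
print(list_words[:8])        # the first few word tokens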


Tokenizing Words and Sentences with NLTK




Welcome to a Natural Language Processing tutorial series, using the Natural Language Toolkit, or NLTK, module with Python.

The NLTK module is a massive toolkit, aimed at helping you with the entire Natural Language Processing (NLP) methodology. NLTK can help with everything from splitting paragraphs into sentences and splitting up words, to recognizing the part of speech of those words, highlighting the main subjects, and even helping your machine understand what the text is all about. In this series, we're going to tackle the field of opinion mining, or sentiment analysis.

In our path to learning how to do sentiment analysis with NLTK, we're going to learn the following:

  • Tokenizing - Splitting sentences and words from the body of text.
  • Part of Speech tagging
  • Machine Learning with the Naive Bayes classifier
  • How to tie in Scikit-learn (sklearn) with NLTK
  • Training classifiers with datasets
  • Performing live, streaming sentiment analysis with Twitter.
  • ...and much more.

In order to get started, you are going to need the NLTK module, as well as Python.

If you do not have Python yet and are on Windows, go to Python.org and download the latest version. If you are on Mac or Linux, you should be able to install Python 3 through your package manager, for example with apt-get install python3.

Next, you're going to need NLTK 3. The easiest way to install the NLTK module is with pip.

For all users, that is done by opening up cmd.exe, bash, or whatever shell you use and typing:
pip install nltk
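Once pip finishes, a quick way to confirm the module is importable and see which version you got:

import nltk
print(nltk.__version__)   # prints the installed NLTK version if the install succeeded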

Next, we need to install some of the components for NLTK. Open python via whatever means you normally do, and type:

import nltk
nltk.download()

Unless you are operating headless, a GUI downloader window will pop up listing the available packages (packages you have not yet downloaded are shown in red rather than green).

Choose to download "all" for all packages, and then click 'download.' This will give you all of the tokenizers, chunkers, other algorithms, and all of the corpora. If space is an issue, you can instead selectively download only the packages you need. The NLTK module itself will take up about 7MB, and the entire nltk_data directory will take up about 1.8GB, which includes your chunkers, parsers, and the corpora.

If you are operating headless, like on a VPS, you can install everything by running Python and doing:

import nltk

nltk.download()

d (for download)

all (to download everything)

That will download everything for you headlessly.
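Alternatively, if disk space matters or you only need specific resources, you can download packages by name instead of everything; a minimal sketch covering what this tutorial series uses:

import nltk

# Download only the resources needed here, with no GUI or interactive prompt.
nltk.download('punkt')       # Punkt models backing sent_tokenize/word_tokenize
nltk.download('stopwords')   # English stop word lists, used in the next tutorial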

Now that you have all the things that you need, let's knock out some quick vocabulary:

  • Corpus - Body of text, singular. Corpora is the plural of this. Example: A collection of medical journals.
  • Lexicon - Words and their meanings. Example: English dictionary. Consider, however, that various fields will have different lexicons. For example: To a financial investor, the first meaning for the word "Bull" is someone who is confident about the market, as compared to the common English lexicon, where the first meaning for the word "Bull" is an animal. As such, there is a special lexicon for financial investors, doctors, children, mechanics, and so on.
  • Token - Each "entity" that is a part of whatever was split up based on rules. For example, each word is a token when a sentence is "tokenized" into words. Each sentence can also be a token, if you tokenized the sentences out of a paragraph.

These are the words you will most commonly hear upon entering the Natural Language Processing (NLP) space, but there are many more that we will be covering in time. With that, let's show an example of how one might actually tokenize something into tokens with the NLTK module.

from nltk.tokenize import sent_tokenize, word_tokenize

EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."

print(sent_tokenize(EXAMPLE_TEXT))

At first, you may think tokenizing by things like words or sentences is a rather trivial enterprise, and for many sentences it can be. The first step would likely be a simple .split('. '), splitting by a period followed by a space. Then maybe you would bring in some regular expressions to split by a period, a space, and then a capital letter. The problem is that things like "Mr. Smith" would cause you trouble, as would many other abbreviations. Splitting by word is also a challenge, especially when considering contractions, like we and are collapsing into we're. NLTK is going to save you a ton of time with this seemingly simple, yet very complex, operation.
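To see the "Mr. Smith" problem concretely, compare a naive period split against sent_tokenize on a small made-up string:

from nltk.tokenize import sent_tokenize

text = "Hello Mr. Smith, how are you doing today? The weather is great."

# Naive split treats the period in "Mr." as a sentence boundary.
print(text.split('. '))
# ['Hello Mr', 'Smith, how are you doing today? The weather is great.']

# sent_tokenize knows "Mr." is an abbreviation and keeps the sentence whole.
print(sent_tokenize(text))
# ['Hello Mr. Smith, how are you doing today?', 'The weather is great.']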

The above code will output the sentences, split up into a list of sentences, which you can do things like iterate through with a for loop.
['Hello Mr. Smith, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard."]

So there, we have created tokens, which are sentences. Let's tokenize by word instead this time:

print(word_tokenize(EXAMPLE_TEXT))

Now our output is: ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard', '.']

There are a few things to note here. First, notice that punctuation is treated as a separate token. Also, notice the separation of the word "shouldn't" into "should" and "n't." Finally, notice that "pinkish-blue" is indeed treated like the "one word" it was meant to be turned into. Pretty cool!

Now, looking at these tokenized words, we have to begin thinking about what our next step might be. We start to ponder how we might derive meaning by looking at these words. We can clearly think of ways to assign value to many words, but we also see a few words that are basically worthless. These are a form of "stop words," which we can also handle. That is what we're going to be talking about in the next tutorial.
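As a small preview of that next step, filtering with NLTK's built-in English stop word list looks roughly like this (assuming the stopwords corpus has been downloaded):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sentence = "This is an example showing off stop word filtration."

stop_words = set(stopwords.words('english'))
tokens = word_tokenize(sentence)

# Keep only tokens that are not in the English stop word list.
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)   # e.g. ['example', 'showing', 'stop', 'word', 'filtration', '.']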

Python risk-control scorecard modeling and risk-control fundamentals (video course)

https://study.163.com/course/introduction.htm?courseId=1005214003&utm_campaign=commission&utm_source=cp-400000000398149&utm_medium=share

Reposted from: https://www.cnblogs.com/webRobot/p/6079892.html

