

SVD Decomposition, Latent Semantic Analysis, Python Code

Published: 2023/12/31

Chinese translation: http://www.cnblogs.com/appler/archive/2012/02/02/2335886.html

Original English article: http://www.puffinwarellc.com/index.php/news-and-articles/articles/33.html

Introduction to Latent Semantic Analysis (LSA)

Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI) literally means analyzing documents to find the underlying meaning or concepts of those documents. If each word only meant one concept, and each concept was only described by one word, then LSA would be easy since there is a simple mapping from words to concepts.


Unfortunately, this problem is difficult because English has different words that mean the same thing (synonyms), words with multiple meanings, and all sorts of ambiguities that obscure the concepts to the point where even people can have a hard time understanding.


For example, the word bank when used together with mortgage, loans, and rates probably means a financial institution. However, the word bank when used together with lures, casting, and fish probably means a stream or river bank.


How Latent Semantic Analysis Works


Latent Semantic Analysis arose from the problem of how to find relevant documents from search words. The fundamental difficulty arises when we compare?words?to find relevant documents, because what we really want to do is compare the?meanings or concepts behind the words. LSA attempts to solve this problem by mapping both words and documents into a "concept" space and doing the comparison in this space.
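To make "doing the comparison in this space" concrete: once words and documents are vectors in a shared concept space, similarity is usually measured as the cosine of the angle between them. The sketch below uses made-up 2-D concept coordinates purely for illustration; real coordinates come from the SVD step described later in the article.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two concept-space vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 2-D concept coordinates (axis 1: finance, axis 2: rivers).
doc_mortgages = np.array([0.9, 0.1])
doc_loans     = np.array([0.8, 0.2])
doc_fishing   = np.array([0.1, 0.9])

print(cosine_similarity(doc_mortgages, doc_loans))    # near 1: same concept
print(cosine_similarity(doc_mortgages, doc_fishing))  # much lower: different concept
```

Comparing in concept space rather than word space is what lets two documents match even when they share few literal words.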

(Translator's note: mapping words and documents into a concept space is essentially a dimensionality-reduction technique.)

Since authors have a wide choice of words available when they write, the concepts can be obscured due to different word choices from different authors. This essentially random choice of words introduces noise into the word-concept relationship. Latent Semantic Analysis filters out some of this noise and also attempts to find the smallest set of concepts that spans all the documents.


In order to make this difficult problem solvable, LSA introduces some dramatic simplifications.

1. Documents are represented as "bags of words", where the order of the words in a document is not important, only how many times each word appears in a document.

2. Concepts are represented as patterns of words that usually appear together in documents. For example "leash", "treat", and "obey" might usually appear in documents about dog training.

3. Words are assumed to have only one meaning. This is clearly not the case (banks could be river banks or financial banks) but it makes the problem tractable.
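Simplification 1 can be sketched in a few lines of Python; the sentence below is invented for illustration:

```python
from collections import Counter

# A bag of words keeps only per-word counts; word order is thrown away.
doc = "the dog ate the treat and the dog obeyed"
bag = Counter(doc.split())

print(bag["the"])  # 3
print(bag["dog"])  # 2
print(bag["ate"])  # 1
```

Any reordering of the words in doc produces exactly the same bag, which is why position information plays no role in LSA.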

To see a small example of LSA, take a look at the next section.


A Small Example


As a small example, I searched for books using the word “investing” at Amazon.com and took the top 10 book titles that appeared. One of these titles was dropped because it had only one index word in common with the other titles. An index word is any word that:

  • appears in 2 or more titles, and
  • is not a very common word such as “and”, “the”, and so on (known as stop words). These words are not included because they do not contribute much (if any) meaning.

In this example we have removed the following stop words: “and”, “edition”, “for”, “in”, “l(fā)ittle”, “of”, “the”, “to”.


Here are the 9 remaining titles. The index words (words that appear in 2 or more titles and are not stop words) are underlined.


1. The Neatest Little Guide to Stock Market Investing

2. Investing For Dummies, 4th Edition

3. The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns

4. The Little Book of Value Investing

5. Value Investing: From Graham to Buffett and Beyond

6. Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!

7. Investing in Real Estate, 5th Edition

8. Stock Investing For Dummies

9. Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss

Once Latent Semantic Analysis has been run on this example, we can plot the index words and titles on an XY graph and identify clusters of titles. The 9 titles are plotted with blue circles and the 11 index words are plotted with red squares. Not only can we spot clusters of titles, but since index words can be plotted along with titles, we can label the clusters. For example, the blue cluster, containing titles T7 and T9, is about real estate. The green cluster, with titles T2, T4, T5, and T8, is about value investing, and finally the red cluster, with titles T1 and T3, is about the stock market. The T6 title is an outlier, off on its own.


In the next few sections, we'll go through all steps needed to run Latent Semantic Analysis on this example.


Part 1 - Creating the Count Matrix


The first step in Latent Semantic Analysis is to create the word by title (or document) matrix. In this matrix, each index word is a row and each title is a column. Each cell contains the number of times that word occurs in that title. For example, the word "book" appears one time in title T3 and one time in title T4, whereas "investing" appears one time in every title. In general, the matrices built during LSA tend to be very large, but also very sparse (most cells contain 0). That is because each title or document usually contains only a small number of all the possible words. This sparseness can be taken advantage of in both memory and time by more sophisticated LSA implementations.


In the following matrix, we have left out the 0's to reduce clutter.

Index Words  T1 T2 T3 T4 T5 T6 T7 T8 T9
book                1  1
dads                         1        1
dummies         1                  1
estate                            1   1
guide        1               1
investing    1  1  1  1  1  1  1  1  1
market       1     1
real                              1   1
rich                         2        1
stock        1     1               1
value                  1  1
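The sparseness mentioned above is worth a brief aside. As a sketch (not from the original article) of how a sparse representation stores only the nonzero cells, using scipy.sparse with an invented toy matrix:

```python
from scipy.sparse import dok_matrix

# A toy 3-word x 4-document count matrix; only nonzero cells consume memory.
counts = dok_matrix((3, 4))
counts[0, 1] = 1   # word 0 occurs once in document 1
counts[2, 3] = 2   # word 2 occurs twice in document 3

print(counts.nnz)        # 2 stored cells out of 12
print(counts.toarray())  # dense view for inspection
```

For a real corpus with tens of thousands of words and documents, this is the difference between a matrix that fits in memory and one that does not.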

Python - Getting Started

Throughout this article, we'll give Python code that implements all the steps necessary for doing Latent Semantic Analysis. We'll go through the code section by section and explain everything. The Python code used in this article can be downloaded here and then run in Python. You need to have already installed the Python NumPy and SciPy libraries.

Python - Import Functions

First we need to import a few functions from Python libraries to handle some of the math we need to do. NumPy is the Python numerical library, and we'll import zeros, a function that creates a matrix of zeros that we use when building our words by titles matrix. From the linear algebra part of the scientific package (scipy.linalg) we import the svd function that actually does the singular value decomposition, which is the heart of LSA.

from numpy import zeros
from scipy.linalg import svd

    Python - Define Data

Next, we define the data that we are using. Titles holds the 9 book titles that we have gathered, stopwords holds the 8 common words that we are going to ignore when we count the words in each title, and ignorechars has all the punctuation characters that we will remove from words. We use Python's triple quoted strings, so there are actually only 4 punctuation symbols we are removing: comma (,), colon (:), apostrophe ('), and exclamation point (!).

titles = [
    "The Neatest Little Guide to Stock Market Investing",
    "Investing For Dummies, 4th Edition",
    "The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns",
    "The Little Book of Value Investing",
    "Value Investing: From Graham to Buffett and Beyond",
    "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!",
    "Investing in Real Estate, 5th Edition",
    "Stock Investing For Dummies",
    "Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss"
]

stopwords = ['and', 'edition', 'for', 'in', 'little', 'of', 'the', 'to']
ignorechars = ''',:'!'''

    Python - Define LSA Class

The LSA class has methods for initialization, parsing documents, building the matrix of word counts, and calculating. The first method is the __init__ method, which is called whenever an instance of the LSA class is created. It stores the stopwords and ignorechars so they can be used later, and then initializes the word dictionary (wdict) and the document count (dcount) variables.

class LSA(object):

    def __init__(self, stopwords, ignorechars):
        self.stopwords = stopwords
        self.ignorechars = ignorechars
        self.wdict = {}    # word -> list of document numbers it appears in
        self.dcount = 0    # current document number

    Python - Parse Documents

The parse method takes a document, splits it into words, removes the ignored characters and turns everything into lowercase so the words can be compared to the stop words. If the word is a stop word, it is ignored and we move on to the next word. If it is not a stop word, we put the word in the dictionary, and also append the current document number to keep track of which documents the word appears in.

The documents that each word appears in are kept in a list associated with that word in the dictionary. For example, since the word book appears in titles 3 and 4, we would have self.wdict['book'] = [3, 4] after all titles are parsed.

After processing all words from the current document, we increase the document count in preparation for the next document to be parsed. (Translator's note: this amounts to building an inverted index.)

    def parse(self, doc):
        words = doc.split()
        for w in words:
            # Lowercase and strip punctuation so words can be compared to stop words.
            w = w.lower().translate(str.maketrans('', '', self.ignorechars))
            if w in self.stopwords:
                continue
            elif w in self.wdict:
                self.wdict[w].append(self.dcount)
            else:
                self.wdict[w] = [self.dcount]
        self.dcount += 1

    Python - Build the Count Matrix

Once all documents are parsed, all the words (dictionary keys) that are in more than 1 document are extracted and sorted, and a matrix is built with the number of rows equal to the number of words (keys), and the number of columns equal to the document count. Finally, for each word (key) and document pair the corresponding matrix cell is incremented.

    def build(self):
        # Keep only the words that appear in more than one document.
        self.keys = [k for k in self.wdict.keys() if len(self.wdict[k]) > 1]
        self.keys.sort()
        self.A = zeros([len(self.keys), self.dcount])
        for i, k in enumerate(self.keys):
            for d in self.wdict[k]:
                self.A[i, d] += 1

    Python - Print the Count Matrix

The printA() method is very simple; it just prints out the matrix that we have built so it can be checked.

    def printA(self):
        print(self.A)

    Python - Test the LSA Class

After defining the LSA class, it's time to try it out on our 9 book titles. First we create an instance of LSA, called mylsa, and pass it the stopwords and ignorechars that we defined. During creation, the __init__ method is called which stores the stopwords and ignorechars and initializes the word dictionary and document count.

Next, we call the parse method on each title. This method extracts the words in each title, strips out punctuation characters, converts each word to lower case, throws out stop words, and stores remaining words in a dictionary along with what title number they came from.

Finally we call the build() method to create the matrix of word by title counts. This extracts all the words we have seen so far, throws out words that occur in less than 2 titles, sorts them, builds a zero matrix of the right size, and then increments the proper cell whenever a word appears in a title.

mylsa = LSA(stopwords, ignorechars)
for t in titles:
    mylsa.parse(t)
mylsa.build()
mylsa.printA()


Here is the raw output produced by printA(). As you can see, it's the same as the matrix that we showed earlier.

[[ 0.  0.  1.  1.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  1.  0.  0.  1.]
 [ 0.  1.  0.  0.  0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  0.  0.  1.  0.  1.]
 [ 1.  0.  0.  0.  0.  1.  0.  0.  0.]
 [ 1.  1.  1.  1.  1.  1.  1.  1.  1.]
 [ 1.  0.  1.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  1.  0.  1.]
 [ 0.  0.  0.  0.  0.  2.  0.  0.  1.]
 [ 1.  0.  1.  0.  0.  0.  0.  1.  0.]
 [ 0.  0.  0.  1.  1.  0.  0.  0.  0.]]
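This excerpt stops after building the count matrix, but the svd function imported at the start is applied to exactly this matrix in the next part of the article. As a hedged preview of that step (a sketch of the standard decomposition, not the article's own code): scipy.linalg.svd factors A into U, S, and Vt, and truncating to the top k singular values yields the k-dimensional concept space plotted earlier.

```python
from numpy import array, diag
from scipy.linalg import svd

# The 11x9 word-by-title count matrix printed above.
A = array([[0.,0.,1.,1.,0.,0.,0.,0.,0.],
           [0.,0.,0.,0.,0.,1.,0.,0.,1.],
           [0.,1.,0.,0.,0.,0.,0.,1.,0.],
           [0.,0.,0.,0.,0.,0.,1.,0.,1.],
           [1.,0.,0.,0.,0.,1.,0.,0.,0.],
           [1.,1.,1.,1.,1.,1.,1.,1.,1.],
           [1.,0.,1.,0.,0.,0.,0.,0.,0.],
           [0.,0.,0.,0.,0.,0.,1.,0.,1.],
           [0.,0.,0.,0.,0.,2.,0.,0.,1.],
           [1.,0.,1.,0.,0.,0.,0.,1.,0.],
           [0.,0.,0.,1.,1.,0.,0.,0.,0.]])

U, s, Vt = svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)  # (11, 9) (9,) (9, 9)

# Keep the k largest singular values: rows of U[:, :k] give word coordinates,
# columns of Vt[:k, :] give title coordinates in the k-dimensional concept space.
k = 3
A_k = U[:, :k] @ diag(s[:k]) @ Vt[:k, :]
```

A_k is the best rank-3 approximation of A, and the 2-D word/title plot shown earlier corresponds to taking k = 2 (or plotting two of these dimensions).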

