
Trial setup of a machine learning project: where2go

Published: 2023/12/8


https://github.com/da248/where2go

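This project feels quite good overall. It does not provide download links for its datasets and it has a few baffling bugs, but the error messages are thorough enough that I could keep making progress.

(I also looked at https://github.com/alex-engelmann/Where2Go, which does not seem very useful: it is plain HTML calling an API.)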

1) Gathering Data

1.wikivoyage_xml_to_json.py

2.wikivoyage_geotags_sql.py

3.scrap_wikivoyage_banners.py

New York Times

2) EDA

Wiki voyage

Changing the default save location

Weather

Nyt

3) Model

Webapp

Final Remarks

Starting the project

Common Anaconda commands

Common algorithms

Natural language processing

Common text-processing methods in recommender systems

How Word2vec works

Website details

Html

Flask backend API

Model

Model analysis

Eda

Model

Gradient computation for the H-softmax model

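where2go is built to recommend places to you based on the places or characteristics you like or dislike, rather than on which destinations have cheap flights.

Plenty of websites tell you the cheapest way to reach a destination and the cheapest hotels to stay in. But they forget to ask you a very fundamental question... do you know where to go? Think of articles such as "Top 25 travel destinations of XX" or "100 places in YY you must visit!"

One of the motivations for this app was to build an unbiased recommender system that considers the characteristics of a destination instead of looking at which destinations other people liked. To do that, I decided to use travel guides to collect destination information. I found that Wikivoyage offers great travel guides that tell you about the history and culture of a place, as well as what to see, how to get around, and so on.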

Try it out on www.where2go.help

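The next question was which model to use. Traditional natural-language-processing recommenders include models such as TF-IDF + cosine similarity and TF-IDF + SVD + k-means clustering. These models would probably do a great job of finding similar destinations, but I wanted a model that lets me add place characteristics such as 'beach' or 'wine' to my search. So I decided to go with word2vec, a model recently created by Google. Word2vec converts words into vectors that capture the "meaning" of the words. A cool feature of this model is that you can add and subtract words, because they are vectors. For example, 'king' - 'man' + 'woman' yields a vector that is approximately 'queen'. My word2vec model learned the travel-specific context of the words and places described in the Wikivoyage articles, which allows vector operations to recommend similar locations.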
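The vector arithmetic described above can be illustrated with a toy example. This is only a sketch: the three-dimensional vectors below are invented for illustration (a trained word2vec model learns vectors with hundreds of dimensions), and the `analogy` helper is a hypothetical name, not part of the where2go code or of gensim.

```python
import math

# Toy embeddings, invented for illustration only.
# Dimensions loosely mean (royalty, maleness, beachiness).
VECTORS = {
    'king':  [1.0, 1.0, 0.0],
    'queen': [1.0, -1.0, 0.0],
    'man':   [0.0, 1.0, 0.0],
    'woman': [0.0, -1.0, 0.0],
    'beach': [0.0, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def analogy(positive, negative, vectors=VECTORS):
    """Add the positive words, subtract the negative ones, return the nearest other word."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + v for q, v in zip(query, vectors[word])]
    for word in negative:
        query = [q - v for q, v in zip(query, vectors[word])]
    # Query words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


result = analogy(positive=['king', 'woman'], negative=['man'])
print(result)
```

In the same spirit, a query such as a liked destination plus 'beach' minus a disliked destination becomes a single query vector whose nearest neighbours are the recommendations.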

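With word2vec I was able to get suggestions for the words and destinations whose semantic meaning is closest to the search query. However, I had to find a way to tell which suggestions were actual geographic locations and which were merely nearby words. I was able to check this using Wikivoyage's geolocation data.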
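A minimal sketch of that filtering step, assuming the suggestions arrive as (name, score) pairs and the geolocation data is a dictionary keyed by place name. The names, scores and coordinates below are placeholders, and `keep_places` is a hypothetical helper rather than a function from the project.

```python
def keep_places(suggestions, geotags):
    """Keep only the suggestions whose name has an entry in the geotag dictionary."""
    return [(name, score) for name, score in suggestions if name in geotags]


# Placeholder geotag dictionary: place name -> coordinates (values are illustrative).
geotags = {
    'eastern_cuba': {'gt_lat': 0.0, 'gt_lon': 0.0},
    'prague': {'gt_lat': 0.0, 'gt_lon': 0.0},
}

# Placeholder model output: nearest neighbours mix places with ordinary words.
suggestions = [('eastern_cuba', 0.81), ('beaches', 0.79), ('snorkeling', 0.75), ('prague', 0.40)]

places = keep_places(suggestions, geotags)
print(places)
```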

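Once I had trained the model on the travel context, I built a web app to deliver my data science project. I used JavaScript to make AJAX calls that update the MapBox map with the results of the user's query, and Bootstrap to format the pages.

*I also collected New York Times articles from the Travel, World, and Science sections (Science has many environmental articles) to enrich my data source, but decided to leave them out because the results were too "newsy".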

Methodology

The code folder is divided into three sections 1) data collection, 2) EDA, 3) model.

1) Gathering Data

Wikivoyage

There are three files for Wikivoyage data.

wikivoyage_xml_to_json.py

The purpose of this file is to convert Wikivoyage travel guide articles to JSON format. Wikivoyage provided a data dump of its articles in XML format and I converted it to JSON format to go through exploratory data analysis with pandas.


Running it:

ImportError: No module named xmltodict

Installed it through the graphical interface.

ImportError: No module named pandas

Installed it through the graphical interface.

Traceback (most recent call last):
  File "wikivoyage_xml_to_json.py", line 25, in <module>
    jdata = convert_xml_to_json('data/wikivoyage/enwikivoyage-latest-pages-articles.xml')
  File "wikivoyage_xml_to_json.py", line 12, in convert_xml_to_json
    xml_str = open(filename).read()
IOError: [Errno 2] No such file or directory: 'data/wikivoyage/enwikivoyage-latest-pages-articles.xml'

The dataset can be found at https://dumps.wikimedia.org/enwikivoyage/latest/

(py2_flask) D:\anacondaProject\where2go\code\data_collection>python wikivoyage_xml_to_json.py

After the script runs successfully, a wikivoyage.json file appears in where2go\code\data_collection\data\wikivoyage. Yay.
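For readers who want to see the shape of the XML-to-JSON step, here is a minimal sketch. It is not the project's code: the original script relies on xmltodict, while this version swaps in the standard library's xml.etree.ElementTree. The sample XML is a simplified stand-in for a real dump (real MediaWiki dumps declare an XML namespace and carry many more fields), and `pages_to_records` is a hypothetical helper name.

```python
import json
import xml.etree.ElementTree as ET

# Simplified stand-in for a dump; real dumps use an XML namespace and more fields.
SAMPLE_XML = """<mediawiki>
  <page>
    <title>Eastern Cuba</title>
    <revision><text>Eastern Cuba is a region.</text></revision>
  </page>
  <page>
    <title>Prague</title>
    <revision><text>Prague is a capital city.</text></revision>
  </page>
</mediawiki>"""


def pages_to_records(xml_str):
    """Turn every <page> element into a {'title': ..., 'text': ...} dictionary."""
    root = ET.fromstring(xml_str)
    records = []
    for page in root.iter('page'):
        records.append({
            'title': page.findtext('title'),
            'text': page.findtext('revision/text'),
        })
    return records


records = pages_to_records(SAMPLE_XML)
json_str = json.dumps(records, indent=2)
print(json_str)
```

A list of flat records like this loads directly into pandas for exploratory analysis, which is the stated reason for converting to JSON.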

2.wikivoyage_geotags_sql.py

The purpose of this file is to gather the geolocations of articles (places). Wikivoyage provided the geolocations of articles as a sql file. I created my own MySQL database to load in and query the data. I also did a bit of data cleaning in this file to remove the accents.


(py2_flask) D:\anacondaProject\where2go\code\data_collection>python wikivoyage_geotags_sql.py

  File "wikivoyage_geotags_sql.py", line 72
    geotag_dict = create_geotag_dict():
                                      ^
SyntaxError: invalid syntax

The trailing colon is the syntax error, so I deleted it.

No module named pymysql.cursors

pip install pymysql

Traceback (most recent call last):
  File "wikivoyage_geotags_sql.py", line 9, in <module>
    cursorclass=pymysql.cursors.DictCursor)
pymysql.err.OperationalError: (1045, u"Access denied for user 'admin'@'localhost' (using password: NO)")

Looking at the source:

# Connect to the database
connection = pymysql.connect(user='admin',
                             db='wiki',
                             cursorclass=pymysql.cursors.DictCursor)

How to connect: https://www.cnblogs.com/woider/p/5926744.html

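pymysql.Connect() parameters

host (str):     MySQL server address
port (int):     MySQL server port
user (str):     user name
passwd (str):   password
db (str):       database name
charset (str):  connection encoding

Methods supported by the connection object

cursor()        create and return a cursor on this connection
commit()        commit the current transaction
rollback()      roll back the current transaction
close()         close the connection

Methods and attributes supported by the cursor object

execute(op)     execute a database query or command
fetchone()      fetch the next row of the result set
fetchmany(size) fetch the next several rows of the result set
fetchall()      fetch all remaining rows of the result set
rowcount        number of rows returned or affected (an attribute, not a method)
close()         close the cursor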
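The cursor methods listed above follow the Python DB-API, so they can be tried without a MySQL server. The sketch below uses the standard library's sqlite3 module as a stand-in for pymysql; the table layout is an assumption loosely modelled on the geo_tags table, and note that sqlite3 uses `?` placeholders where pymysql uses `%s`.

```python
import sqlite3

# In-memory database as a stand-in for the MySQL 'wiki' database.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

cur.execute('CREATE TABLE geo_tags (page_id INTEGER, gt_lat REAL, gt_lon REAL)')
rows = [
    (18192, 56.8333, 60.5833),
    (13920, 49.85944444, 20.27472222),
    (14305, -53.32, -70.91),
]
cur.executemany('INSERT INTO geo_tags VALUES (?, ?, ?)', rows)
conn.commit()              # make the inserts permanent
inserted = cur.rowcount    # attribute: rows affected by the last statement

cur.execute('SELECT page_id FROM geo_tags ORDER BY page_id')
first = cur.fetchone()       # next single row
next_two = cur.fetchmany(2)  # next two rows
rest = cur.fetchall()        # whatever is left (nothing here)

cur.close()
conn.close()
print(first, next_two, rest)
```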

I changed the user name and password in the connection call and created the database.
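A hedged sketch of what the adjusted connection could look like. The credentials are placeholders, `connection_settings` and `open_connection` are hypothetical helpers, and pymysql is imported lazily so that the settings can be inspected even where the package or a MySQL server is unavailable.

```python
def connection_settings(user, password, database, host='localhost', port=3306):
    """Collect the keyword arguments for pymysql.connect in one place."""
    return {
        'host': host,
        'port': port,
        'user': user,
        'password': password,
        'database': database,
    }


def open_connection(settings):
    """Open the connection; requires the third-party pymysql package and a running server."""
    import pymysql.cursors
    return pymysql.connect(cursorclass=pymysql.cursors.DictCursor, **settings)


# Placeholder credentials: replace with your own MySQL user and password.
settings = connection_settings(user='admin', password='change-me', database='wiki')
print(sorted(settings))
```

Passing a password is what resolves the "Access denied ... (using password: NO)" error shown earlier, since the original call supplied only a user name.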

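pymysql.err.ProgrammingError: (1146, u"Table 'wiki.geo_tags' doesn't exist")

The project description says Wikivoyage provides the article geolocations as a SQL file, so I kept looking for the dataset: https://github.com/baturin/wikivoyage-listings

In the end I found it here (sort of): https://dumps.wikimedia.org/hewikivoyage/latest/

pymysql.err.ProgrammingError: (1146, u"Table 'wiki.page' doesn't exist")

The same page also has pages.sql.

One of the SQL files I downloaded does not appear to be in English.

This looks like the official resource page: https://www.wikidata.org/wiki/Wikidata:Wikivoyage/Resources

The all-languages portal looks like this: https://www.wikivoyage.org/

The English edition looks like this: https://en.wikivoyage.org/

By analogy: https://dumps.wikimedia.org/enwikivoyage/latest/

That is where I found the English SQL dumps.

IOError: [Errno 2] No such file or directory: '../data/geotag_dict.pkl'

The source shows that this path is the output file, so I created it by hand.

The script then ran successfully.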

3.scrap_wikivoyage_banners.py

This file contains the code I used to scrape the banner images of articles from Wikivoyage. I also used it to collect the canonical URL of each Wikivoyage page. I had to search for destinations through a special search page on Wikivoyage to overcome minor syntax differences in place names.


    self.locations = pkl.load(open('../../data/pickles/geotag_dict.pkl', 'rb'))
IOError: [Errno 2] No such file or directory: '../../data/pickles/geotag_dict.pkl'

I copied the pkl file produced in the previous step to this path.

CONNECTION ERROR!!! RECONNECT TO  page

Traceback (most recent call last):
  File "scrap_wikivoyage_banners.py", line 109, in <module>
    swb.scrap_banners()
  File "scrap_wikivoyage_banners.py", line 95, in scrap_banners
    img_path, wiki_url = self.get_image_and_link(key)
  File "scrap_wikivoyage_banners.py", line 57, in get_image_and_link
    return make_default_img_url(place)
NameError: global name 'make_default_img_url' is not defined

INDEX ERROR!!!  page did not exist

Looking at the source:

def get_image_and_link(self, place):
    '''
    For a given place, get the canonical wikivoyage url and save the banner.
    If the banner is just a default banner, save the img path as the default
    banner to minimize duplicates.
    input: place as string
    output: img_path and wiki_url + (save image in the process)
    '''
    base_url = "https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search="
    full_url = base_url + place.title()
    try:
        response = requests.get(full_url).text
        soup = BeautifulSoup(response, 'html.parser')
        wiki_url = soup.find(rel='canonical')['href']
        img_src = 'https:' + soup.select('div.topbanner a.image')[0].select('img')[0]['src']
    except IndexError:
        print 'INDEX ERROR!!! %s page did not exist' % place
        return make_default_img_url(place)
    except ConnectionError:
        print 'CONNECTION ERROR!!! RECONNECT TO %s page' % place
        return make_default_img_url(place)

    if 'Pagebanner_default' in img_src or 'default_banner' in img_src:
        print '%s has default banner!' % place
        img_path = 'static/banners/default.png'
    else:
        place = place.replace('/', '_')  # REPLACE SLASH BECAUSE IT CREATES A DIRECTORY
        try:
            img_response = requests.get(img_src, stream=True)
            img_path = 'static/banners/%s.png' % place
        except IndexError:
            print 'INDEX ERROR!!! %s page did not exist' % place
            return make_default_img_url(place)
        except ConnectionError:
            print 'CONNECTION ERROR!!! RECONNECT TO %s page' % place
            return make_default_img_url(place)

        # save the img file if it doesn't already exist. if it already exists, dont overwrite.
        if not os.path.exists('../../webapp/static/banners/%s.png' % place):
            with open('../../webapp/static/banners/%s.png' % place, 'wb') as out_file:
                shutil.copyfileobj(img_response.raw, out_file)
            del img_response
            print '%s.png successfully created' % place
        else:
            print '%s.png already exists!' % place
    return img_path, wiki_url

def scrap_banners(self):
    '''
    Go through every key in the locations dictionary and scrape the wiki url and img_path.
    '''
    for key in self.locations.iterkeys():
        # print 'key %s,' % key
        img_path, wiki_url = self.get_image_and_link(key)
        self.locations[key]['wiki_url'] = wiki_url
        self.locations[key]['img_path'] = img_path

def load_location(self):
    '''
    load the geolocation data.
    '''
    self.locations = pkl.load(open('../../data/pickles/geotag_dict.pkl', 'rb'))

So the locations stored in the pkl still look like the problem. Inspecting the pkl:

import cPickle as pickle
f = open('path', 'rb')  # pickles should be opened in binary mode
info = pickle.load(f)
print info  # show file

{'': {u'gt_lat': Decimal('56.83330000'), u'page_id': 18192, u'gt_lon': Decimal('60.58330000'), u'page_len': 27110}, '__': {u'gt_lat': Decimal('49.85944444'), u'page_id': 13920, u'gt_lon': Decimal('20.27472222'), u'page_len': 3453}, "_(')": {u'gt_lat': Decimal('-53.32000000'), u'page_id': 14305, u'gt_lon': Decimal('-70.91000000'), u'page_len': 3408}, "__'/": {u'gt_lat': Decimal('-22.92000000'), u'page_id': 13410, u'gt_lon': Decimal('-43.22000000'), u'page_len': 56927}, "/-'_": {u'gt_lat': Decimal('41.94610000'), u'page_id': 14123, u'gt_lon': Decimal('-87.66940000'), u'page_len': 28496},……

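These keys are really bizarre.

Tried printing the full URL:

CONNECTION ERROR!!! RECONNECT TO https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search= page

Tried printing the key.

Saved a backup copy.

After switching to the SQL dump of the right language edition (English), the keys came out correctly.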

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Eastern_Cuba -- did not exist

Traceback (most recent call last):
  File "scrap_wikivoyage_banners.py", line 115, in <module>
    swb.scrap_banners()
  File "scrap_wikivoyage_banners.py", line 101, in scrap_banners
    img_path, wiki_url = self.get_image_and_link(key)
  File "scrap_wikivoyage_banners.py", line 57, in get_image_and_link
    return make_default_img_url(place)
NameError: global name 'make_default_img_url' is not defined

Tried opening the URL:

https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Eastern_Cuba

The GET request returns status code 302.

The wiki is reachable, just not at this address; the request is redirected to

https://en.wikivoyage.org/wiki/Eastern_Cuba

which has the same base address as the one used in make_default_img_url:

https://en.wikivoyage.org/wiki/

But the search page https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default

still exists.

Searching through the search box sends this request:

https://en.wikivoyage.org/w/index.php?=Eastern_Cuba&sort=relevance&search=Eastern_Cuba&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1

With &fulltext=1 added, or when no page has that exact name, there is no redirect:

https://en.wikivoyage.org/w/index.php?search=Eastern_Cuba&title=Special%3ASearch&profile=advanced&fulltext=1

So the URL is probably not the problem.

make_default_img_url is not a global name, so was some mistake causing Python to treat it as one? (It is defined as a method of the class, so it has to be called through self.)

Tried modifying the calls to make_default_img_url:

def make_default_img_url(self, place):
    '''
    input = place
    output = return the default values for img_path and wiki_url
    '''
    img_path = 'static/banners/default.png'
    wiki_url = 'https://en.wikivoyage.org/wiki/%s' % place
    return img_path, wiki_url

    except IndexError:
        # print 'INDEX ERROR!!! %s page did not exist' % place
        print 'INDEX ERROR!!! %s -- did not exist' % full_url
        # return make_default_img_url(place)
        return self.make_default_img_url(place)
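The NameError comes from how Python resolves bare names inside a method: it searches the local, enclosing, global and builtin scopes, but never the class body. The following self-contained sketch reproduces the failure and the fix; `BannerScraper`, `broken_lookup` and `fixed_lookup` are hypothetical names used only for this illustration.

```python
class BannerScraper(object):
    def make_default_img_url(self, place):
        """Return the default banner path and the plain wiki URL for a place."""
        img_path = 'static/banners/default.png'
        wiki_url = 'https://en.wikivoyage.org/wiki/%s' % place
        return img_path, wiki_url

    def broken_lookup(self, place):
        # Bare name: not found in any scope Python searches, so this raises NameError.
        return make_default_img_url(place)

    def fixed_lookup(self, place):
        # Going through self finds the method on the instance's class.
        return self.make_default_img_url(place)


scraper = BannerScraper()

try:
    scraper.broken_lookup('Eastern_Cuba')
    error_name = None
except NameError as exc:
    error_name = type(exc).__name__

fixed = scraper.fixed_lookup('Eastern_Cuba')
print(error_name, fixed)
```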

After the change some pages still could not be fetched, but the script now keeps running. The final error was:

  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\models.py", line 753, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: error(10053, '')", error(10053, ''))

The special search page links to a Developers page:

https://www.mediawiki.org/wiki/How_to_contribute

Web API hub: https://www.mediawiki.org/wiki/API:Web_APIs_hub

API:Geosearch: https://www.mediawiki.org/wiki/API:Geosearch

A GET request that searches for wiki pages near a geographic coordinate or near a given page title.

This module is supported through the Extension:GeoData currently not installed on MediaWiki but Wikipedia. So, in this document, we will use the URL en.wikipedia.org in all API endpoints.

GET Request

Search for pages near Wikimedia Foundation headquarters by specifying the geographic coordinates of its location:

api.php?action=query&list=geosearch&gscoord=37.7891838|-122.4033522&gsradius=10000&gslimit=10

API documentation: https://en.wikipedia.org/w/api.php?action=help&modules=query+geosearch

https://en.wikivoyage.org/w/api.php?action=help&modules=query

How to look things up in the API: https://www.mediawiki.org/wiki/API:Main_page
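As a sketch of how the Geosearch request quoted above could be assembled in code, the snippet below only builds the URL and does not send it. The parameter names come from the quoted example; the `format=json` parameter and the `build_geosearch_url` helper are additions made for this illustration.

```python
from urllib.parse import urlencode


def build_geosearch_url(lat, lon, radius_m=10000, limit=10,
                        endpoint='https://en.wikipedia.org/w/api.php'):
    """Build a list=geosearch query URL for pages near a coordinate."""
    params = {
        'action': 'query',
        'list': 'geosearch',
        'gscoord': '%s|%s' % (lat, lon),  # latitude|longitude
        'gsradius': radius_m,             # search radius in metres
        'gslimit': limit,                 # maximum number of pages
        'format': 'json',
    }
    # urlencode percent-encodes the '|' separator as %7C.
    return endpoint + '?' + urlencode(params)


url = build_geosearch_url(37.7891838, -122.4033522)
print(url)
```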

Examples:

Fetch site info and revisions of Main Page.

api.php?action=query&prop=revisions&meta=siteinfo&titles=Main%20Page&rvprop=user|comment&continue=

I had used urllib's request.urlopen before, while the source uses requests.get, so I looked up the difference: https://blog.csdn.net/dead_cicle/article/details/86747593

With urllib you construct a Request object, and what urlopen returns is again an object that has to be read.

requests is an easy-to-use HTTP library implemented in Python. requests.get returns a Response object with attributes such as text, content and status_code.

Printing the status code directly gives 200, yet the error still occurs, so the request step itself is fine.

(py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py
INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Eastern_Cuba -- did not exist

Looking into BeautifulSoup:

https://blog.csdn.net/weixin_42231070/article/details/82225529

import urllib.request
from bs4 import BeautifulSoup

douban_path = "https://movie.douban.com"
response = urllib.request.urlopen(douban_path)
# The three calls below are alternatives, not a sequence.
soup = BeautifulSoup(response, 'html.parser')  # accepts a response object
soup = BeautifulSoup(response.read().decode('utf-8'), 'html.parser')  # accepts a string
soup = BeautifulSoup(open('test.html'), 'html.parser')  # accepts a local file

Earlier, printing text failed with an encoding error, but printing soup outputs a pile of HTML source.

Extracting wiki_url works:

soup.find(rel='canonical')['href']

(py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py
https://en.wikivoyage.org/wiki/Eastern_Cuba

So the problem is probably in extracting img_src:

'https:'+soup.select('div.topbanner a.image')[0].select('img')[0]['src']

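soup.select: https://blog.csdn.net/geerniya/article/details/77842421

The soup.select() method retrieves the required content.
The key is to locate that content precisely, which is done through the selector expression passed in the parentheses.

https://blog.csdn.net/weixin_40425640/article/details/79470617

select serves the same purpose as find and find_all, namely selecting specific tags, but its selection rules are based on CSS, so it is called a CSS selector. If you have used jQuery, you will notice that select's rules resemble jQuery's.

A tag name is written without any prefix, and select returns a list (so div here is a tag name).

A class name is prefixed with a dot, an id with #.

Combined lookups come in two kinds: matching two conditions on a single tag, and tree-style lookups that descend level by level.

print soup.select('a#link2')

This selects the tag named a whose id is link2.

My guess was that the final 'src' subscript is invalid.

Looking up select('img'): https://www.jianshu.com/p/ed2f044bd1fa

The .select() method of a Tag or BeautifulSoup object:

res = soup.select('#wrapperto')    -> the tag with this id
res = soup.select('img[src]')      -> 'img' tags that have a 'src' attribute
res = soup.select('img[src=...]')  -> 'img' tags whose 'src' attribute is ...

Finding an img src with soup.select:

https://www.cnblogs.com/calmzone/p/11139980.html

# soup.a.attrs  # all attributes and values of the a tag, returned as a dictionary

# soup.a.attrs['href']  # get the href attribute

# soup.a['href']  # shorthand for the same thing

# both forms above return the href attribute of the a tag

https://blog.csdn.net/weixin_42231070/article/details/82225529

When an attribute does not exist, get returns None, whereas dictionary-style indexing raises an error.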

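print soup.select('div.topbanner a.image')

(py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py
[]

So this returns an empty list: is there really no a with class image inside a div with class topbanner?

Looking at the page source of https://en.wikivoyage.org/wiki/Eastern_Cuba:

There are divs whose class contains "topbanner", two of them in fact, but only as a substring: each div has several classes, and one of them is wpb-topbanner.

So that one div can be matched by several style rules (style reuse), its class attribute often consists of several space-separated names. For example, <div class="user login"> is matched by both the .user and the .login selectors. If both rules set the same property and have equal specificity, the rule that appears later in the stylesheet wins. (The banner div has several classes in this way.)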
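The point that a class selector matches whole class names, not substrings, can be checked without BeautifulSoup. The sketch below swaps in the standard library's html.parser; the sample HTML and its class names other than wpb-topbanner are invented, and `BannerFinder` / `find_banner_sources` are hypothetical helpers that only approximate the selector 'div.<class> a.image img'.

```python
from html.parser import HTMLParser


class BannerFinder(HTMLParser):
    """Collect the src of <img> tags inside <a class="image"> inside a div with a given class."""

    def __init__(self, div_class):
        super().__init__()
        self.div_class = div_class
        self.div_depth = 0       # greater than 0 while inside a matching div
        self.in_anchor = False
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()  # whole class tokens
        if tag == 'div':
            if self.div_depth or self.div_class in classes:
                self.div_depth += 1
        elif tag == 'a' and self.div_depth and 'image' in classes:
            self.in_anchor = True
        elif tag == 'img' and self.in_anchor and 'src' in attrs:
            self.sources.append(attrs['src'])

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_depth:
            self.div_depth -= 1
        elif tag == 'a':
            self.in_anchor = False


def find_banner_sources(html, div_class):
    finder = BannerFinder(div_class)
    finder.feed(html)
    return finder.sources


# Invented sample markup: the div carries two classes, one of them wpb-topbanner.
SAMPLE_HTML = (
    '<div class="other"><a class="image"><img src="https://example.org/skip.jpg"/></a></div>'
    '<div class="example-extra wpb-topbanner">'
    '<a class="image" href="/wiki/File:banner.jpg">'
    '<img class="wpb-banner-image" src="https://example.org/banner.jpg"/></a></div>'
)

no_match = find_banner_sources(SAMPLE_HTML, 'topbanner')
match = find_banner_sources(SAMPLE_HTML, 'wpb-topbanner')
print(no_match, match)
```

Looking for the class topbanner finds nothing because no class token equals it exactly, which mirrors the empty list printed by soup.select above.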

Tried changing the class name in the selector:

        a_img_tag = soup.select('div.wpb-topbanner a.image')
        print a_img_tag
        # print soup.select('div.topbanner a.image')[0].select('img')[0]
        # img_src = 'https:' + soup.select('div.topbanner a.image')[0].select('img')[0]['src']
        img_src = 'https:' + soup.select('div.wpb-topbanner a.image')[0].select('img')[0]['src']

The printed result is no longer an empty list:

(py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py

[<a class="image" dir="ltr" href="/wiki/File:WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg" title="Eastern Cuba"><img class="wpb-banner-image" src="https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg" srcset="https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg/640px-WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg 640w,https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg/1280px-WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg 1280w,https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg 2560w"/></a>]

Traceback (most recent call last):
  File "scrap_wikivoyage_banners.py", line 123, in <module>
    swb.scrap_banners()
  File "scrap_wikivoyage_banners.py", line 109, in scrap_banners
    img_path, wiki_url = self.get_image_and_link(key)
  File "scrap_wikivoyage_banners.py", line 80, in get_image_and_link
    img_response = requests.get(img_src, stream=True)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\sessions.py", line 519, in request
    prep = self.prepare_request(req)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\sessions.py", line 462, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\models.py", line 313, in prepare
    self.prepare_url(url, params)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\requests\models.py", line 390, in prepare_url
    raise InvalidURL("Invalid URL %r: No host supplied" % url)
requests.exceptions.InvalidURL: Invalid URL u'https:https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg': No host supplied

Only one a tag was found, and it contains only one img tag.

The src attribute already holds a complete URL on upload.wikimedia.org that can be opened directly, so I removed the extra "https:" prefix. The next run failed differently:
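The "https:https://..." URL arises because the original code assumed a protocol-relative src (starting with "//"), while the page now serves an absolute one. A small sketch that tolerates both forms; `normalize_img_src` is a hypothetical helper, not part of the project.

```python
def normalize_img_src(src, scheme='https'):
    """Add a scheme only to protocol-relative URLs ('//host/path'); leave absolute URLs alone."""
    if src.startswith('//'):
        return scheme + ':' + src
    return src


absolute = normalize_img_src(
    'https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg')
relative = normalize_img_src(
    '//upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg')
print(absolute == relative)
```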

(py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py

[<a class="image" dir="ltr" href="/wiki/File:WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg" title="Eastern Cuba"><img class="wpb-banner-image" src="https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg" srcset="https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg/640px-WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg 640w,https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg/1280px-WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg 1280w,https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg 2560w"/></a>]

CONNECTION ERROR!!! RECONNECT TO eastern_cuba page

Traceback (most recent call last):
  File "scrap_wikivoyage_banners.py", line 123, in <module>
    swb.scrap_banners()
  File "scrap_wikivoyage_banners.py", line 109, in scrap_banners
    img_path, wiki_url = self.get_image_and_link(key)
  File "scrap_wikivoyage_banners.py", line 89, in get_image_and_link
    return make_default_img_url(place)
NameError: global name 'make_default_img_url' is not defined

Tried printing the response of the request:

(py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py

https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg

Traceback (most recent call last):
  File "scrap_wikivoyage_banners.py", line 124, in <module>
    swb.scrap_banners()
  File "scrap_wikivoyage_banners.py", line 110, in scrap_banners
    img_path, wiki_url = self.get_image_and_link(key)
  File "scrap_wikivoyage_banners.py", line 94, in get_image_and_link
    with open('../../webapp/static/banners/%s.png' % place, 'wb') as out_file:
IOError: [Errno 2] No such file or directory: '../../webapp/static/banners/eastern_cuba.png'

Created the banners directory and tried again:
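open() in write mode creates the file but not its parent directories, which is why the banners folder had to exist first. A minimal sketch of creating it from code, run here inside a temporary directory so it touches nothing real. It assumes Python 3, where os.makedirs accepts exist_ok; the post's Python 2 environment would need an os.path.isdir check instead. `ensure_dir` is a hypothetical helper.

```python
import os
import tempfile


def ensure_dir(path):
    """Create the directory and any missing parents; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


# Work under a temporary directory instead of the real project tree.
base = tempfile.mkdtemp()
banner_dir = ensure_dir(os.path.join(base, 'webapp', 'static', 'banners'))
ensure_dir(banner_dir)  # calling it again is harmless

out_path = os.path.join(banner_dir, 'eastern_cuba.png')
with open(out_path, 'wb') as out_file:
    out_file.write(b'')  # placeholder for the downloaded image bytes

print(os.path.exists(out_path))
```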

(py2_flask) D:\anacondaProject\where2go\code\data_collection>python scrap_wikivoyage_banners.py

https://upload.wikimedia.org/wikipedia/commons/7/7d/WV_banner_Eastern_Cuba_Road_to_Guardalavaca.jpg

eastern_cuba.png successfully created

https://upload.wikimedia.org/wikipedia/commons/thumb/5/53/Ardrossan_-_SA_WV_Banner.jpg/2560px-Ardrossan_-_SA_WV_Banner.jpg

ardrossan_(south_australia).png successfully created

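I did not want the scraper to keep scrolling through entries while the network was down, since it skipped a lot of them that way, so I tried adding a sleep to the page-fetching step:

import time

    except ConnectionError:
        # print 'CONNECTION ERROR!!! RECONNECT TO %s page' % place
        print 'CONNECTION ERROR!!! RECONNECT TO -- %s ' % full_url
        time.sleep(20)
        # return make_default_img_url(place)
        return self.make_default_img_url(place)

This way the errors no longer flood the screen while the network is down, which gives me a little time to reconnect. But why did Anaconda also freeze halfway through the download...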
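The sleep above pauses once and then falls back to the default banner. An alternative, shown here only as a sketch and not as what the post did, is to retry the same request a few times before giving up. `call_with_retry` is a hypothetical helper; the sleep function is injectable so the example runs instantly, and IOError is used as the retry trigger because requests' ConnectionError derives from it.

```python
import time


def call_with_retry(func, attempts=3, delay=20, sleep=time.sleep, retry_on=(IOError,)):
    """Call func(); on a retryable error wait `delay` seconds and try again, up to `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay)
    raise last_error


# Simulated flaky download: fails twice, then succeeds.
calls = {'n': 0}


def flaky_download():
    calls['n'] += 1
    if calls['n'] < 3:
        raise IOError('network down')
    return 'ok'


waits = []  # records the requested delays instead of really sleeping
result = call_with_retry(flaky_download, attempts=5, delay=20, sleep=waits.append)
print(result, waits)
```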

This run went a bit more smoothly, with only a few index errors:

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Fjrland&ns0=1 -- img src did not exist

Searching for Fjrland directly on Wikivoyage finds nothing either: "There were no results matching the query."

The search dropdown suggests a name containing the letter æ (which I cannot type).

The real link turns out to be https://en.wikivoyage.org/wiki/Fj%C3%A6rland

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Heisy_Bordel/Prague/East_Bank_Of_Vltava&ns0=1 -- img src did not exist

Heisy_Bordel/Prague/East_Bank_Of_Vltava cannot be found either.

Heisy Bordel is a user who contributed to Prague/East_Bank_Of_Vltava.

Real link: https://en.wikivoyage.org/wiki/Prague/East_bank_of_Vltava

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Berlinichthyosaur_State_Park&ns0=1

The dropdown suggests Berlin–Ichthyosaur State Park. Real link: https://en.wikivoyage.org/wiki/Berlin%E2%80%93Ichthyosaur_State_Park

Image: https://en.wikivoyage.org/wiki/File:Berlin%E2%80%93Ichthyosaur_State_Park_banner.JPG

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Hafnarfjorur&ns0=1 -- img src did not exist

Real link: https://en.wikivoyage.org/wiki/Hafnarfj%C3%B6r%C3%B0ur

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Drivingukbanner1.Jpg&ns0=1 -- img src did not exist

Drivingukbanner1.Jpg is very odd: how did a place name turn into a jpg? I also have no idea how to look up the "Driving uk" part.

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Owl_Ad_Wouters.Jpg&ns0=1

Possibly https://en.wikivoyage.org/wiki/Ad%27s_Path

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Snogebk&ns0=1 -- img src did not exist

Real link: https://en.wikivoyage.org/wiki/Snogeb%C3%A6k

INDEX ERROR!!! https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search=Nstved&ns0=1 -- img src did not exist

Real link:

https://en.wikivoyage.org/wiki/N%C3%A6stved
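One plausible explanation for names such as "Fjrland" and "Hafnarfjorur" is the accent-removal step in wikivoyage_geotags_sql.py. The sketch below assumes, without having confirmed it against the project source, that accents were removed by Unicode decomposition followed by dropping every non-ASCII character: letters like ö lose only their accent, while letters with no decomposition (æ, ð) vanish entirely. It also shows how the real names map to the percent-encoded links listed above. Both helper names are hypothetical.

```python
import unicodedata
from urllib.parse import quote


def strip_to_ascii(name):
    """Decompose accented letters, then drop every remaining non-ASCII character."""
    decomposed = unicodedata.normalize('NFKD', name)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


def wiki_path(name):
    """Percent-encode a page title (UTF-8) the way it appears in a wiki URL path."""
    return quote(name.replace(' ', '_'))


for title in ['Fjærland', 'Hafnarfjörður', 'Snogebæk', 'Næstved']:
    print(title, strip_to_ascii(title), wiki_path(title))
```

If this assumption holds, keeping the original title alongside the ASCII key would let the scraper request the correct page.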

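Because the network keeps dropping, the script has to be rerun many times. Requesting every URL again and only then checking whether the image already exists is too slow, so I check for the image first: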

def get_image_and_link(self, place):
    '''
    For a given place, get the canonical wikivoyage url and save the banner.
    If the banner is just a default banner, save the img path as the default
    banner to minimize duplicates.
    input: place as string
    output: img_path and wiki_url + (save image in the process)
    '''
    if not os.path.exists('../../webapp/static/banners/%s.png' % place):
        # look over before request
        base_url = "https://en.wikivoyage.org/w/index.php?title=Special%3ASearch&profile=default&search="
        full_url = base_url + place.title()
        # print 'place %s,' % place
        # print 'place_title %s' % place.title()
        try:
            response = requests.get(full_url).text
            soup = BeautifulSoup(response, 'html.parser')
            wiki_url = soup.find(rel='canonical')['href']
            a_img_tag = soup.select('div.wpb-topbanner a.image')
            # img_src = 'https:' + soup.select('div.topbanner a.image')[0].select('img')[0]['src']
            img_src = soup.select('div.wpb-topbanner a.image')[0].select('img')[0]['src']
            # mark
        except IndexError:
            # print 'INDEX ERROR!!! %s page did not exist' % place
            print 'INDEX ERROR!!! %s -- img src did not exist' % wiki_url
            # return make_default_img_url(place)
            return self.make_default_img_url(place)
        except ConnectionError:
            # print 'CONNECTION ERROR!!! RECONNECT TO %s page' % place
            print 'CONNECTION ERROR!!! RECONNECT TO -- %s ' % full_url
            time.sleep(20)
            # return make_default_img_url(place)
            return self.make_default_img_url(place)

        if 'Pagebanner_default' in img_src or 'default_banner' in img_src:
            print '%s has default banner!' % place
            img_path = 'static/banners/default.png'
        else:
            place = place.replace('/', '_')  # REPLACE '/' with '_' BECAUSE IT CREATES A DIRECTORY
            try:
                img_response = requests.get(img_src, stream=True)
                # print img_response
                img_path = 'static/banners/%s.png' % place
            except IndexError:
                print 'INDEX ERROR!!! %s img did not exist' % place
                return self.make_default_img_url(place)
            except ConnectionError:
                # print 'CONNECTION ERROR!!! RECONNECT TO %s img' % place
                print 'CONNECTION ERROR!!! RECONNECT TO %s img' % img_src
                return self.make_default_img_url(place)

            # save the img file if it doesn't already exist. if it already exists, dont overwrite.
            if not os.path.exists('../../webapp/static/banners/%s.png' % place):
                with open('../../webapp/static/banners/%s.png' % place, 'wb') as out_file:
                    shutil.copyfileobj(img_response.raw, out_file)
                del img_response
                print '%s.png successfully created' % place
            else:
                print '%s.png already exists!' % place
    # look over before request
    else:
        print '%s.png already exists!' % place
        return self.make_default_img_url(place)
    return img_path, wiki_url

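(This was not easy. My computer keeps dropping its network connection and I have not been able to fix it. I went to the lab to use its network for downloading the images, but the teacher said officials were visiting and I could not stay.)

With a bad connection, scraping is just too hard. I went with the default URLs for the model and will leave the rest of the scraping to someone else.

After the full run, a geotag_imglink_wikibanner.pkl file was generated in D:\anacondaProject\where2go\data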

New York Times

4.nyt_articles_api.py

This file was used to gather the most recent NYT articles in the World, Science, and Travel sections. MongoDB was used to save the articles retrieved through the official NYT API. The data was collected but not incorporated into the model because the articles carried too many news-like semantics.


ImportError: No module named pymongo

Off to borrow a network connection to download it.

pip install pymongo

After running for a while it failed with:

pymongo.errors.ServerSelectionTimeoutError: localhost:27017: [Errno 10061]

Then I remembered that this needs a running MongoDB server.

https://www.jianshu.com/p/c9777b063593

https://blog.csdn.net/huasonl88/article/details/51755621

Unlike the three-level structure of relational databases (database --> table --> record), MongoDB's hierarchy is database --> collection --> document.

https://blog.csdn.net/zwq912318834/article/details/77689568

import pymongo

# address and port of the mongodb service
mongo_url = "127.0.0.1:27017"

# connect to mongodb; with no argument it defaults to "localhost:27017"
client = pymongo.MongoClient(mongo_url)

# select the database myDatabase
DATABASE = "myDatabase"
db = client[DATABASE]

# select the collection (table) myDatabase.myCollection
COLLECTION = "myCollection"
db_coll = db[COLLECTION]

# find the records in myCollection whose date field equals 2017-08-29,
# sorted by age in descending order
queryArgs = {'date': '2017-08-29'}
search_res = db_coll.find(queryArgs).sort('age', -1)
for record in search_res:
    print(f"_id = {record['_id']}, name = {record['name']}, age = {record['age']}")

The project source:

# Define the MongoDB database and table
db_cilent = MongoClient()
db = db_cilent['nyt_dump']
table = db['articles']

    '''
    Get all the links, visit the page and scrape the content
    '''
    if not section:
        links = table.find({'content_txt': {'$exists': False}}, {'web_url': 1})
    else:
        links = table.find({'$and': [{'content_txt': {'$exists': False}},
                                     {'section_name': section}]}, {'web_url': 1})
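The query in the excerpt selects articles that have not been scraped yet, that is, documents without a content_txt field, optionally restricted to one section. Since the filter is an ordinary dictionary, its construction can be sketched without a MongoDB server. `build_unscraped_filter` is a hypothetical helper that mirrors the two branches of the excerpt.

```python
def build_unscraped_filter(section=None):
    """Filter for documents lacking 'content_txt', optionally limited to one section."""
    missing_content = {'content_txt': {'$exists': False}}
    if not section:
        return missing_content
    # $and requires every condition in the list to hold.
    return {'$and': [missing_content, {'section_name': section}]}


# Projection from the excerpt: return web_url (plus the _id that MongoDB includes by default).
projection = {'web_url': 1}

all_sections = build_unscraped_filter()
travel_only = build_unscraped_filter('Travel')
print(all_sections, travel_only)
```

With pymongo these would be passed as table.find(filter, projection), as in the excerpt above.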

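Starting MongoDB:

D:\Program Files\Mongo\bin>mongod.exe --dbpath "D:\MongoDB\DBData"

MongoDB would not work at all any more, so I uninstalled and reinstalled it: https://www.cnblogs.com/6luv-ml/p/9174818.html

On inspection, probably because of my last OS reinstall, MongoDB was not listed under Programs and Features, so I simply deleted the files and installed it again.

https://www.baidu.com/link?url=aA78IHXRSyxzObA9ArXLH43I1blC1eDEdnj9io1WJtH5LeR-cHl-gJgEwVfOkuJzsJiWNx_78t_CHZFXGHGNwzY9Vtz5wBluVD2AobNJiaW&wd=&eqid=b831449c000210e4000000035d5e441d

No more errors. (I did not see anything written to a file, and I do not care whether there is any data; it will probably not be needed later anyway.)

Service Name: MongoDB

Data Directory: D:\Program Files\Mongo\data\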

2) EDA

Exploratory data analysis and data cleaning have been performed with ipython notebook. Wikivoyage and NYT data were loaded, cleaned, pickled out as input format for word2vec, which is a list of sentences where each sentence is represented as a list of words. Also, global NOAA weather data was downloaded but I later determined that it leaves out major parts of the world. Thus, more data has to be collected to incorporate weather to the project.
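The word2vec input format mentioned above, a list of sentences where each sentence is a list of words, can be sketched as follows. The splitting and tokenising rules here are deliberately naive assumptions and are not the cleaning steps used in the project's notebooks; `to_sentences` is a hypothetical helper.

```python
import re


def to_sentences(text):
    """Split text into sentences, then each sentence into lowercase word tokens."""
    sentences = []
    for raw in re.split(r'[.!?]+', text):
        words = re.findall(r"[a-z0-9']+", raw.lower())
        if words:  # skip empty fragments
            sentences.append(words)
    return sentences


corpus = to_sentences("Eastern Cuba has beaches. Visit Santiago de Cuba!")
print(corpus)
```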


Wiki voyage

(py2_flask) D:\anacondaProject\where2go\code\data_collection>ipython

Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)]

Type 'copyright', 'credits' or 'license' for more information

IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.

Anaconda中預置了ipython

可能是用jupyter做的，直接打開看起來像一堆json

Jupyter介紹：http://baijiahao.baidu.com/s?id=1601883438842526311&wfr=spider&for=pc

當你還處于原型開發階段時，Jupyter Notebooks 的作用更是引人注目。這是因為你的代碼是按獨立單元的形式編寫的，而且這些單元是獨立執行的。這讓用戶可以測試一個項目中的特定代碼塊，而無需從項目開始處執行代碼。

要運行你的 Jupyter Notebooks，只需在命令行輸入以下命令即可！

jupyter notebook

完成之后，Jupyter Notebooks 就會在你的默認網絡瀏覽器打開，地址是：

http://localhost:8888/tree

在某些情況下，它可能不會自動打開。而是會在終端/命令行生成一個 URL，并帶有令牌密鑰提示。你需要將包含這個令牌密鑰在內的整個 URL 都復制并粘貼到你的瀏覽器，然后才能打開一個筆記本。

打開筆記本后，你會看到頂部有三個選項卡：Files、Running 和 Clusters。其中，Files 基本上就是列出所有文件，Running 是展示你當前打開的終端和筆記本，Clusters 是由 IPython 并行提供的。

要打開一個新的 Jupyter 筆記本，點擊頁面右側的「New」選項。你在這里會看到 4 個需要選擇的選項：

Python 3Text FileFolderTerminal

選擇 Text File，你會得到一個空面板。你可以添加任何字母、單詞和數字。其基本上可以看作是一個文本編輯器（類似于 Ubuntu 的文本編輯器）。你可以在其中選擇語言（有很多語言選項），所以你可以在這里編寫腳本。你也可以查找和替換該文件中的詞。

選擇 Folder 選項時，你會創建一個新的文件夾，你可以在其中放入文件，重命名或刪除它。各種操作都可以。

Terminal 完全類似于在 Mac 或 Linux 機器上的終端（或 Windows 上的 cmd）。其能在你的網絡瀏覽器內執行一些支持終端會話的工作。在這個終端輸入 python，你就可以開始寫你的 Python 腳本了！

在代碼上面的菜單中，你有一些操作各個單元的選項：添加、編輯、剪切、向上和向下移動單元、運行單元內的代碼、停止代碼、保存工作以及重啟 kernel。

上圖所示的下拉菜單中，你還有 4 個選項：

Code——不言而喻，就是寫代碼的地方。Markdown——這是寫文本的地方。你可以在運行一段代碼后添加你的結論、添加注釋等。Raw NBConvert——這是一個可將你的筆記本轉換成另一種格式（比如 HTML）的命令行工具。Heading——這是你添加標題的地方，這樣你可以將不同的章節分開，讓你的筆記本看起來更整齊更清晰。這個現在已經被轉換成 Markdown 選項本身了。輸入一個「##」之后，后面輸入的內容就會被視為一個標題。

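You have probably seen magic commands such as %clear, %autosave, %debug and %mkdir before. Magic commands can be run in two ways:

Line-wise and cell-wise

As the names suggest, a line magic applies to a single line, while a cell magic applies to the whole block of code in a cell.

Line magics must start with a % character; cell magics must start with %%.

Keyboard shortcuts are one of Jupyter Notebook's biggest advantages. To run any code block, just press Ctrl+Enter.

Jupyter Notebook has two keyboard input modes, command and edit. Command mode binds the keyboard to notebook-level commands and is indicated by a grey cell border with a blue left margin. Edit mode lets you type text (or code) into the active cell and is indicated by a green cell border.

Use Esc and Enter to switch to command mode and edit mode respectively.

As mentioned earlier, Ctrl + Enter runs the whole cell.

Alt + Enter runs the cell and also adds a new cell below it.

Ctrl + Shift + F opens the command palette.

To see the full list of keyboard shortcuts, press H in command mode or go to Help > Keyboard Shortcuts.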

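Saving and sharing your notebook

When I have to write a blog post, my code and comments are all in one Jupyter file, which I first need to convert into another format. Remember that notebooks are stored as JSON, which is not very helpful for sharing; I cannot exactly paste individual cells into an email or a blog post.

Open the File menu and you will see the Download As option:

You can save your notebook in seven formats. The most common are .ipynb and .html. An .ipynb file lets other people copy your code to their machine, while an .html file opens as a web page (handy when you need to keep the images embedded in the notebook).

You can also use nbconvert to convert your notebook manually to formats such as HTML or PDF.

You can also use JupyterHub (https://github.com/jupyterhub/jupyterhub), which lets you host notebooks on its server and share them among multiple users. Many top research projects collaborate this way.

Sometimes a notebook contains a very large amount of code. Consider hiding the code you think is less important and referring to it later; this keeps the notebook tidy and clear. See this matplotlib notebook for an example of concise presentation: http://nbviewer.jupyter.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-4-Matplotlib.ipynb

One more tip: when you want to create a presentation, the first tools that come to mind are probably PowerPoint and Google Slides, but Jupyter Notebooks can produce slides too.

Changing the default save location

Open the Windows cmd and run jupyter notebook --generate-config: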

The output shows a path under D:\Users……; go to that path and edit the jupyter_notebook_config.py file.

Open the file and find

## The directory to use for notebooks and kernels.
#c.NotebookApp.notebook_dir = ''
Change it to
## The directory to use for notebooks and kernels.
c.NotebookApp.notebook_dir = 'E:\Jupyter'
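where E:\Jupyter is my workspace; change it to your own.
Note:

1. The # in #c.NotebookApp.notebook_dir = '' must be removed, and no leading space may be left at the start of the line.
2. The E:\Jupyter folder must be created beforehand; otherwise Jupyter Notebook cannot find the directory and exits immediately.

The plain cmd has no Jupyter on its path and cannot run jupyter notebook --generate-config, so I changed the configuration from the Anaconda prompt and also start Jupyter from there.

(base) C:\Users\Lenovo>jupyter notebook

The backslash may be interpreted as an escape character, so double it: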

c.NotebookApp.notebook_dir = 'D:\\anacondaProject'
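The escape problem can be seen directly in Python. The following is a minimal sketch, using the D:\anacondaProject path from these notes: in an ordinary string literal, "\a" is the BEL control character, so the path is silently corrupted unless the backslash is doubled or a raw string is used. ('E:\Jupyter' happens to work only because "\J" is not a recognised escape sequence.)

```python
# Three ways of writing the same Windows path as a Python string literal.
plain = 'D:\anacondaProject'      # "\a" is parsed as the BEL control character
escaped = 'D:\\anacondaProject'   # doubled backslash, as in the config line above
raw = r'D:\anacondaProject'       # raw string: backslashes are kept literally

print(repr(plain))      # shows the corrupted path
print(escaped == raw)   # the two safe spellings are the same string
```

Either the doubled backslash or the raw string can be used for c.NotebookApp.notebook_dir.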

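Tried running directly in the base environment, with a new copy of the Wiki voyage EDA notebook.

Error:

ModuleNotFoundError: No module named 'gensim'

Leave it for now; decide later which environment to install it in.

Found that the py2 environment can run Jupyter even though I never installed it there; all settings are the same as in base (and the top-right corner still shows the py3 kernel badge).

Solution: http://www.360doc.com/content/17/0413/22/1489589_645405947.shtml

The environment a Jupyter notebook runs in is determined by its kernel. Searching for kernel.json with Everything finds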

/jupyter/kernels/python3/kernel.json

The (py27) environment is also missing ipykernel:

conda install ipykernel

Switching kernels:

https://blog.csdn.net/castle_cc/article/details/77476081

python -m pip install ipykernel

python -m ipykernel install --user

Switched successfully.
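To confirm which environment a notebook kernel is really running in, the interpreter can be asked directly from a cell. This is a minimal sketch; describe_interpreter is a hypothetical helper name, not part of the project.

```python
import sys

def describe_interpreter():
    """Return the interpreter path and major.minor version of the running kernel."""
    version = "%d.%d" % (sys.version_info[0], sys.version_info[1])
    return sys.executable, version

path, version = describe_interpreter()
print(path)     # should point inside the intended conda environment
print(version)  # should report 2.7 for the py2 kernel
```

If the printed path still points at the base environment, the notebook is using the wrong kernel.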

Running gives an error:

FileNotFoundError: [Errno 2] No such file or directory: '../data/wikivoyage.json'

Copied the file generated by the data collection script into D:\anacondaProject\where2go\code\data

Error:

ImportError: matplotlib is required for plotting.

https://www.cnblogs.com/star-zhao/p/9726212.html

Tried restarting the IDE and re-running everything; error:

LookupError                               Traceback (most recent call last)
<ipython-input-36-3b51d0f0aedc> in <module>()
      5 # final_articles_words[key] = convert_article_into_list_of_words(value)
      6 #print article
----> 7 final_articles_words[key] = convert_article_into_list_of_words(value)

<ipython-input-33-e13c4daff3a0> in convert_article_into_list_of_words(article)
     14 text = clean_paragraph(text)
     15 #tokenize paragraph to sentences
---> 16 sentences = sent_tokenize(text)
     17
     18 for sentence in sentences:

LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/english.pickle

https://blog.csdn.net/qq_31747765/article/details/80307450

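From the command line:

python

import nltk

nltk.download()

Switch to the Models tab and find punkt.

Pick one of the paths listed in the error message and set it as the Download Directory:

D:\\ProgramData\\Anaconda3\\envs\\py2_flask\\nltk_data

This finally generates ../data/wikivoyage_list_of_words.pkl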
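The same download can be scripted instead of clicking through the downloader window. This is a sketch under the assumption that NLTK is installed in the active environment; ensure_punkt is a hypothetical helper name, and the directory in the usage comment is the one chosen above.

```python
def ensure_punkt(download_dir):
    """Make sure the NLTK 'punkt' tokenizer data can be found under download_dir."""
    import nltk  # imported here so the helper can be defined without NLTK installed

    # Let NLTK search the chosen directory as well as its default locations.
    if download_dir not in nltk.data.path:
        nltk.data.path.append(download_dir)
    try:
        nltk.data.find('tokenizers/punkt')
    except LookupError:
        # Not found anywhere on the search path: fetch it into download_dir.
        nltk.download('punkt', download_dir=download_dir)

# Hypothetical usage:
# ensure_punkt(r'D:\ProgramData\Anaconda3\envs\py2_flask\nltk_data')
```

Calling it once before sent_tokenize is used should be enough; later runs find the data and skip the download.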

Weather

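Error:

No module named haversine

pip install haversine

Error:

No such file or directory: '../../data/pickles/geotag_imglink_wikiurl.pkl'

I could not find any code that writes this file; both weather_normals_eda-checkpoint and this notebook only read it.

Tried renaming the pickle that was just generated.

Error: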

IOError: [Errno 2] No such file or directory: '../data/weather/ghcnm.tavg.v3.3.0.20150624.qca.dat'

https://www.jianshu.com/p/3d4b606ec359

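The Global Historical Climatology Network monthly (GHCNm) dataset is a set of monthly climate summaries from thousands of weather stations around the world. The monthly data have periods of record that vary by station, with the earliest observations dating to the 18th century. Some station records are purely historical and are no longer updated, while many others are still operating and provide short time-delay updates that are useful for climate monitoring.

The dataset's address can be found on this page: https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-monthly-version-4

Address: https://www.ncei.noaa.gov/data/global-historical-climatology-network-monthly/

GHCN: https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-ghcn

The v4 release only has qcf, qcu and qfe, with no qca, so I decided to download v3 after all.

The Pairwise Homogenization Algorithm (PHA) has been extensively evaluated (e.g., Williams et al., 2012), and GHCNm v4 data are provided both homogenized (adjusted) and non-homogenized (unadjusted). Homogenized data are identified by the string "qcf", and non-homogenized data by the string "qcu". As described in Menne et al. (2018), the PHA is periodically run as an ensemble to quantify the uncertainty of homogenization. Other sources of uncertainty are also assessed.

Put the file in the expected folder and change the date in the file name being read: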

globaldata = pd.read_fwf('../data/weather/ghcnm.tavg.v3.3.0.20190821.qca.dat',header = None, widths=widths)

Error:

IOError: [Errno 2] No such file or directory: '../data/weather/ghcnm.tavg.v3.3.0.20150624.qca.inv'

Unzip it and change the file name being read.

Warning:

D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\ipykernel_launcher.py:6: FutureWarning: The current behaviour of 'Series.argmin' is deprecated, use 'idxmin' instead.
The behavior of 'argmin' will be corrected to return the positional minimum in the future. For now, use 'series.values.argmin' or 'np.argmin(np.array(values))' to get the position of the minimum row.

D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\ipykernel_launcher.py:7: DeprecationWarning: .ix is deprecated. Please use .loc for label based indexing or .iloc for positional indexing
See the documentation here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  import sys

# ndarray compat

argmin = idxmin

argmax = idxmax

So in this pandas version argmin is just an alias of idxmin.

ix, loc and iloc are all used to fetch a particular row or column.

      col1  col2  col3
row1     1     2     3
row2     4     5     6
row3     7     8     9

.loc[] is primarily label based, but may also be used with a boolean array.

It is an indexer based entirely on labels (not positions); the labels are the 'row1', 'row2' defined above.

Usage (row1 is the row label): print df.loc['row1']

.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. It is an indexer based entirely on row positions, i.e. rows 0, 1, 2.

print df.iloc[0]

.ix[] supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type. It accepts labels, positions, or a mix of both (this indexer is deprecated; use the two above instead).

Original code:

closest_station = distance.argmin()

temps = master.ix[closest_station][months]

Changed to:

closest_station = distance.idxmin()

temps = master.loc[closest_station][months]
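The difference between the label returned by idxmin and a positional index can be checked on a toy example. This is a sketch assuming pandas is installed; the station IDs, distances and temperatures are made up for illustration, and only the names distance, master, months, closest_station and temps come from the notebook code above.

```python
import pandas as pd

# Distances (km) from a destination to three stations, indexed by station ID.
distance = pd.Series([820.5, 12.3, 240.0], index=['ST_A', 'ST_B', 'ST_C'])

# Monthly mean temperatures per station (illustrative values only).
master = pd.DataFrame(
    {'JAN': [1.0, 5.5, 3.2], 'JUL': [20.1, 27.4, 24.0]},
    index=['ST_A', 'ST_B', 'ST_C'],
)
months = ['JAN', 'JUL']

closest_station = distance.idxmin()          # label of the smallest distance
temps = master.loc[closest_station, months]  # label-based row and column selection
position = distance.values.argmin()          # integer position, if that is needed

print(closest_station)
print(temps.tolist())
```

Because idxmin returns a label, it pairs with .loc; a positional result such as values.argmin() would pair with .iloc instead.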

Everything ran to completion.

Nyt

Error:

ImportError: No module named nyt_articles_api

This is a module from the earlier data collection step.

How to import your own Python modules in Jupyter:

https://blog.csdn.net/w371500241/article/details/55809362

https://www.cnblogs.com/master-pokemon/p/6136483.html

Put the module in the same directory, or

import sys

sys.path.append('e:/workspace/Modules')

import Hello

Hello.hello()

In Jupyter an absolute path written directly was not recognised, but a relative path worked; add

import sys

sys.path.append('..\data_collection')
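A slightly more robust variant builds the path with os.path, which avoids backslash escapes in the string literal and makes the entry absolute. This is a sketch; add_module_dir is a hypothetical helper, and the '../data_collection' layout is the one used in these notes.

```python
import os
import sys

def add_module_dir(relative_dir, base_dir=None):
    """Append an absolute version of relative_dir to sys.path and return it."""
    if base_dir is None:
        base_dir = os.getcwd()  # the notebook's working directory
    module_dir = os.path.abspath(os.path.join(base_dir, relative_dir))
    if module_dir not in sys.path:
        sys.path.append(module_dir)
    return module_dir

module_dir = add_module_dir(os.path.join('..', 'data_collection'))
print(module_dir)
```

After this, import nyt_articles_api should resolve as long as the file sits in that folder.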

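Error:

ServerSelectionTimeoutError: localhost:27017: [Errno 10061]

Probably MongoDB is not running.

This time I seem to have installed a build that bundles the server and starts it automatically; the error went away and the notebook generated the file

../data/nyt_articles_word_list.pkl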
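Errno 10061 is a refused connection, so a quick way to tell whether a MongoDB server is listening before running the notebook is a plain socket check. This is a sketch using only the standard library; port_is_open is a hypothetical helper and 27017 is the port from the error message.

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections and timeouts alike.
        return False
    sock.close()
    return True

print(port_is_open('localhost', 27017))
```

If this prints False, start the MongoDB service first; the ServerSelectionTimeoutError should then disappear.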

Start Jupyter:

activate py2_flask

jupyter notebook

3) Model

Where2go is based on a model created at Google called word2vec. Word2vec is a neural network with one hidden layer that comes in two variants, continuous bag of words (CBOW) and skip-gram. Where2go uses the skip-gram variant with hierarchical softmax for optimization.

At a high level, word2vec trains the neural network to parameterize a model that can predict the surrounding words for every word in the corpus. The prediction errors are then backpropagated to optimize the parameters, so that words with similar contexts move closer together and words with different contexts move further apart. The input-to-hidden weight matrix, which is also the vector representation of the words, is then used to gain insight into the meaning and similarity of words.

In my where2go_model.py file, I implemented gensim's word2vec model and wrote functions to vectorize user search queries, filter the recommendations down to actual geolocations, and output destinations in geojson format.
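To make "predict the surrounding words for every word" concrete, the training examples of a skip-gram model can be listed for a single sentence. This is a minimal illustration in plain Python, not gensim's implementation; skipgram_pairs and the sample sentence are made up for the example.

```python
def skipgram_pairs(sentence, window=2):
    """Return (center, context) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs

sentence = ['rome', 'has', 'ancient', 'ruins']
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, '->', context)
```

Each pair is one prediction task: given the center word's vector, the network is trained to assign high probability to the context word.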

activate py2_flask

cd /d D:\anacondaProject\where2go\code\model

python where2go_model.py

Error:

IOError: [Errno 2] No such file or directory: '../../data/pickles/geo_imglink_wikiurl.pkl'

The existing file was named geotag_imglink_wikiurl; I backed it up and renamed it.

Error:

IOError: [Errno 2] No such file or directory: '../../data/pickles/wikivoyage_list_of_words.pkl'

Copied over the file produced by the EDA step above.

Warning:

D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\gensim\models\base_any2vec.py:743: UserWarning: C extension not loaded, training will be slow. Install a C compiler and reinstall gensim for fast training.

  "C extension not loaded, training will be slow. "

https://blog.csdn.net/menghuanguaishou/article/details/90546838

pip uninstall gensim

pip install gensim==3.6

Warning:

D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\gensim\utils.py:1212: UserWarning: detected Windows; aliasing chunkize to chunkize_serial

  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")

https://blog.csdn.net/qq_41185868/article/details/88344862

Reportedly harmless.

Warning:

D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\gensim\models\phrases.py:598: UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class

  warnings.warn("For a faster implementation, use the gensim.models.phrases.Phraser class")

The source uses:

bigram = gensim.models.Phrases(self.wikivoyage_list, min_count=10)

trigram = gensim.models.Phrases(bigram[self.wikivoyage_list], min_count=10)

Strange: I did not find anyone resolving this warning.

https://blog.csdn.net/lwhsyit/article/details/82750218
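The warning itself names the likely remedy: wrap each trained Phrases model in a Phraser before applying it to the corpus. This is a sketch I have not run against the project, assuming the gensim 3.x API pinned above (gensim==3.6); build_phrasers is a hypothetical helper name.

```python
def build_phrasers(sentences, min_count=10):
    """Train bigram and trigram phrase models and return their fast Phraser wrappers."""
    # Imported lazily so the helper can be defined without gensim installed.
    from gensim.models.phrases import Phrases, Phraser

    # Phraser is a lighter, read-only view of a trained Phrases model.
    bigram = Phraser(Phrases(sentences, min_count=min_count))
    # Applying the Phraser (not the Phrases object) avoids the slow code path
    # that triggers the UserWarning.
    trigram = Phraser(Phrases(bigram[sentences], min_count=min_count))
    return bigram, trigram

# Hypothetical usage inside the model class:
# bigram, trigram = build_phrasers(self.wikivoyage_list, min_count=10)
```

The returned objects are applied the same way as before, for example trigram[bigram[sentences]].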

This should generate ../../data/pickles/where2go_model.pkl

Webapp

I launched my own website using Python Flask. I used JavaScript to perform AJAX calls for the search engine, so that a user's search query is run on my model to predict the most similar places and the recommendations are shown on the map. The Flask file is named 'app.py' and can be found in the folder 'webapp'; the 'index.html' file contains the HTML and JavaScript and can be found in the folder 'templates'. I used Bootstrap to design the website.

Final Remarks

This project has been very fun and intellectually challenging. I started this application as a capstone project, but there are many things I would like to add to it. I want to add more travel guide data to make the results more robust, add historical weather data to help users decide when to go to a destination, and add average flight and hotel costs to help users choose plausible places. If you have any comments or recommendations for this project, please feel free to contact me.

Starting the project

The project is written for Python 2. I needed py2 to be usable in my setup, where Anaconda, py3 and py2 coexist.

It turned out that quite a few modules are needed, so I decided to create a py2 virtual environment in Anaconda.

Graphical interface (the fetching step takes quite a while):

https://www.cnblogs.com/zimo-jing/p/7834808.html?utm_source=debugrun&utm_medium=referral

Command line: https://jingyan.baidu.com/article/455a9950500494a166277808.html

Installing modules:

Traceback (most recent call last):
  File "app.py", line 10, in <module>
    from where2go_model import Where2Go_Model
  File "../code/model\where2go_model.py", line 1, in <module>
    import gensim
ImportError: No module named gensim

pip install genism (installing through the graphical interface kept failing with errors)

Error: no matching version found.

pip install --upgrade gensim

The installation failed (possibly a network problem):

  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\pip\_vendor\urllib3\response.py", line 374, in _error_catcher
    raise ReadTimeoutError(self._pool, None, 'Read timed out.')
ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.

Re-ran the install command.

Missing module:

Traceback (most recent call last):
  File "app.py", line 1, in <module>
    from flask import Flask
ImportError: No module named flask

Installed it through the graphical interface.

Traceback (most recent call last):
  File "app.py", line 10, in <module>
    from where2go_model import Where2Go_Model
  File "../code/model\where2go_model.py", line 7, in <module>
    from bs4 import BeautifulSoup
ImportError: No module named bs4

Install beautifulsoup4.

Installing through the graphical interface failed.

pip install bs4

Installed successfully.

When I run the web app in a python2.7 environment with all the dependencies, I get the following error:

Traceback (most recent call last):
  File "app.py", line 44, in <module>
    app.where2go = load_pickle()
  File "app.py", line 19, in load_pickle
    return pkl.load(open('../data/pickles/where2go_model.pkl', 'rb'))
IOError: [Errno 2] No such file or directory: '../data/pickles/where2go_model.pkl'

Looking at the source:

webapp/app.py

 4  from collections import defaultdict
 5  import cPickle as pkl
 6  import json
 7  import random
 8  import sys
 9  sys.path.insert(0,'../code/model')
 …
18  def load_pickle():
19      return pkl.load(open('../data/pickles/where2go_model.pkl', 'rb'))
20
21  @app.route('/')
22  def welcome():

code/model/where2go_model.py

  7  from bs4 import BeautifulSoup
  8
  9
 10  class Where2Go_Model(object):
 11
 12      def __init__(self):
 13          self.geotag_imglink_wikiurl = None
  …
212      with open('../../data/pickles/where2go_model.pkl', 'wb') as f:
213          cPickle.dump(where2go, f)

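The pickle is written by where2go_model.py (line 212) and only read by app.py, so I tried using the code in the code folder to collect the datasets and build it.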
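A loader that checks for the file first gives a clearer message than the bare IOError. This is a sketch, not the project's load_pickle: load_model is a hypothetical name, and it uses the pickle module, which exists on both Python 2 and 3, where the project imports cPickle.

```python
import os
import pickle

def load_model(path):
    """Load a pickled model, failing with a hint if it has not been generated yet."""
    if not os.path.exists(path):
        raise IOError(
            "%s not found; run code/model/where2go_model.py first to generate it" % path
        )
    with open(path, 'rb') as f:
        return pickle.load(f)

# Hypothetical usage in app.py:
# app.where2go = load_model('../data/pickles/where2go_model.pkl')
```

Note that unpickling a Where2Go_Model instance still requires the where2go_model module to be importable, which is why app.py adds '../code/model' to sys.path.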

Startup output:

(py2_flask) D:\anacondaProject\where2go\webapp>python app.py
D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\gensim\utils.py:1212: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
 * Serving Flask app "app" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: on
 * Restarting with stat
D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\gensim\utils.py:1212: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
 * Debugger is active!
 * Debugger PIN: 232-882-558
 * Running on http://0.0.0.0:80/ (Press CTRL+C to quit)

The web page opened successfully.

Usage:

To search travel destinations, you can
1. Write destinations and/or characteristics
2. Put a + (add) or - (subtract) sign in front of the words to denote preferences
3. Multiply a word by a number to strengthen (greater than 1.0) or weaken (less than 1.0) its influence

Where2go is likely to recommend a place at the same description level as the inputs. This means that when a city is searched, it is more likely to return a city name than a country name. It will work best when you input...

Individual country/cities

Spain
Beijing
Maldives

Adding places of the same description level

hong kong + singapore
paris + 1.2*milan
0.8*dubai+cairo

Adding places + characteristics

french polynesia + guam + scuba diving
California + wine
rome + beach

Search Tips:

Try to put at least one place

word2vec searches similar words so it is likely to return places with names related to the search

Play around with the place multipliers

san francisco + 1.5*malaga will yield results more like malaga than san francisco + malaga

When you want cities like A but in country B

City A - Country of Place A + Country of Place B.
Again, play around with the weights to see different results.
Example: Barcelona - Spain + 1.5*Italy
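The search syntax above can be turned into (multiplier, word) pairs with a small parser. The following is a hypothetical reimplementation for illustration, not the project's parse_search_query: it assumes words are lower-cased and multi-word names are joined with underscores, and it does not handle place names that themselves contain a hyphen.

```python
import re

# optional sign, optional "number *", then the word(s) up to the next operator
TERM_RE = re.compile(r'\s*([+-])?\s*(?:(\d+(?:\.\d+)?)\s*\*)?\s*([^+\-*]+)')

def parse_search_query(query):
    """Parse 'a + 1.2*b - c' into [(1.0, 'a'), (1.2, 'b'), (-1.0, 'c')]."""
    terms = []
    for sign, mult, word in TERM_RE.findall(query):
        word = '_'.join(word.lower().split())  # 'Hong Kong' -> 'hong_kong' (assumption)
        if not word:
            continue
        multiplier = float(mult) if mult else 1.0
        if sign == '-':
            multiplier = -multiplier
        terms.append((multiplier, word))
    return terms

print(parse_search_query('paris + 1.2*milan'))
print(parse_search_query('Barcelona - Spain + 1.5*Italy'))
```

A negative multiplier then simply subtracts that word's vector when the query vector is summed.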

Searching fails with an error:

127.0.0.1 - - [02/Sep/2019 13:09:16] "POST /map HTTP/1.1" 500 -
Traceback (most recent call last):
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 2463, in __call__
    return self.wsgi_app(environ, start_response)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 2449, in wsgi_app
    response = self.handle_exception(e)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1866, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1820, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "D:\anacondaProject\where2go\webapp\app.py", line 41, in userinput
    return json.dumps(app.result)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\json\__init__.py", line 244, in dumps
    return _default_encoder.encode(obj)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\json\encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\json\encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\json\encoder.py", line 184, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: Decimal('113.26700000') is not JSON serializable

Looking at the source:

def userinput():
    data = request.data
    ms = app.where2go.most_similar(data)
    top_places_json = app.where2go.get_top_places_json(ms)

    app.result['top_places'] = top_places_json
    print top_places_json
    return json.dumps(app.result)

Error:

[(1.0, 'beijing')]

[(u'guangzhou', 0.7928651571273804), (u'seoul', 0.7863544225692749), (u'nanjing', 0.7803971767425537), (u'tianjin', 0.776152491569519), (u'shanghai', 0.7680126428604126), (u'hangzhou', 0.747714638710022), (u'wuhan', 0.7452333569526672), (u'kunming', 0.7269240021705627), (u'fuzhou', 0.720137357711792), (u'xiamen', 0.7125382423400879), (u'beijing_shanghai', 0.7124168872833252), (u'busan', 0.7067762613296509), (u'harbin', 0.7055091857910156), (u'xian', 0.703764796257019), (u'taipei', 0.7032514810562134), (u'moscow', 0.7001821994781494), (u'urumqi', 0.6986857652664185), (u'shenyang', 0.6914734244346619), (u'chengdu', 0.6909835338592529), (u'munich', 0.6862865686416626), (u'vienna', 0.6839408874511719), (u'ulaanbaatar', 0.6831813454627991), (u'budapest', 0.6821123957633972), (u'vladivostok', 0.6806952953338623), (u'zhengzhou', 0.6783512830734253), (u'brussels', 0.6768432259559631), (u'copenhagen', 0.6743952035903931), (u'pyongyang', 0.6742997169494629), (u'bratislava', 0.667781412601471), (u'astana', 0.6674197912216187), (u'ningbo', 0.6667413711547852), (u'chongqing', 0.6655149459838867), (u'shenzhen', 0.6651620864868164), (u'qingdao', 0.6618784070014954), (u'sofia', 0.6600873470306396), (u'frankfurt', 0.6579354405403137), (u'nanning', 0.6576802730560303), (u'berlin', 0.6552646160125732), (u'wuchang', 0.6497694253921509)]

127.0.0.1 - - [02/Sep/2019 14:35:10] "POST /map HTTP/1.1" 500 -

Traceback (most recent call last):
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 2463, in __call__
    return self.wsgi_app(environ, start_response)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 2449, in wsgi_app
    response = self.handle_exception(e)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1866, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 2446, in wsgi_app
    response = self.full_dispatch_request()
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1951, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1820, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "D:\ProgramData\Anaconda3\envs\py2_flask\lib\site-packages\flask\app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "D:\anacondaProject\where2go\webapp\app.py", line 39, in userinput
    app.result['top_places'] = top_places_json
TypeError: 'NoneType' object does not support item assignment

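top_places_json is a list; according to the traceback, app.result is None at this point, which is why the item assignment fails.

Looking up how json.dumps is used:

Django ships its own encoder; when an object cannot be serialized, one option is to pass a custom encoder class through the cls parameter, for example cls=NpEncoder.

It could also be that the data contains numpy or similar data types that dumps does not recognise.

Here Decimal('113.26700000') is a decimal.Decimal value, which the standard json encoder cannot serialize.

Custom encoder class: https://blog.csdn.net/rt5476238/article/details/91398332

https://stackoverflow.com/questions/1960516/python-json-serialize-a-decimal-object/8274307#8274307

simplejson 2.1 and higher has native support for the Decimal type: json.dumps(Decimal('3.9'), use_decimal=True)

Note that use_decimal is True by default

simplejson is a simple, fast, complete, correct and extensible JSON [http://json.org] encoder and decoder for Python 2.5+ and Python 3.3+. It is pure Python code with no dependencies, but includes an optional C extension for a serious speed boost.

The latest simplejson documentation can be read online here: https://simplejson.readthedocs.io/

simplejson is the externally maintained development version of the json library included with Python 2.6 and Python 3.0, but maintains backwards compatibility with Python 2.5.

Documentation: https://simplejson.readthedocs.io/en/latest/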

Tried switching to it:

Change import json

to import simplejson as json

pip install simplejson
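If adding a dependency is not desired, the standard json module can also be taught to handle Decimal through its default hook. This is a sketch of that alternative, not what the project does; decimal_default is a hypothetical helper and the result dictionary is made up, with only the Decimal value taken from the traceback above.

```python
import json
from decimal import Decimal

def decimal_default(obj):
    """Fallback for json.dumps: convert Decimal to float, reject anything else."""
    if isinstance(obj, Decimal):
        return float(obj)
    raise TypeError(repr(obj) + " is not JSON serializable")

# Illustrative payload shaped loosely like a geojson feature list.
result = {'top_places': [{'title': 'example', 'lng': Decimal('113.26700000')}]}
encoded = json.dumps(result, default=decimal_default)
print(encoded)
```

Converting to float loses the trailing zeros of the Decimal, which does not matter for map coordinates.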

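Everything is done (apart from the saved banner images, which have not yet been placed in the folder), but routes and the information customers care about are not included.

Common Anaconda commands

Create a virtual environment

(base) C:\Users\zdp>conda create -n django

Activate a virtual environment

(base) C:\Users\zdp>activate py2_flask

Go to the project folder

(django) C:\Users\laugo>cd /d D:\anacondaProject\where2go\webapp

Run the py file:

python app.py

When scraping, run:

activate py2_flask

cd /d D:\anacondaProject\where2go\code\data_collection

python scrap_wikivoyage_banners.py

When starting the app:

activate py2_flask

cd /d D:\anacondaProject\where2go\webapp

python app.py

Check the Flask version: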

python

import flask

flask.__version__

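Common algorithms

The next question was which model to use. Traditional natural language processing recommender systems include models such as TF-IDF + cosine similarity and TF-IDF + SVD + k-means clustering.

Natural language processing:

Applications of natural language processing (NLP) in recommender systems: https://blog.csdn.net/heyc861221/article/details/80130263

Compared with structured information (such as product attributes), text has some inherent drawbacks in practice: structure carries information, and recommendation strategies, whether algorithmic or rule-based, can be built on structured information, whereas text has uncertain information content and more ambiguity problems.

Advantages: large volume of data; rich diversity; timely information.

How word2vec works

Word2vec overview (unsupervised learning): http://www.mamicode.com/info-detail-2150217.html

Summary (fairly technical and detailed): https://www.jianshu.com/p/bca4e7bfb86d

Applications: sequence data + strong local correlation

Clustering, finding synonyms, part-of-speech analysis

Text sequences: neighbouring words are strongly correlated, so the target word can be predicted from its context (fill in the blank)

Social networks: generate sequences by random walks, then train a vector for each node with word2vec.

Recommender systems, advertising (app download sequences: word2vec + similarity = aggr to )

word2vec from theory to implementation: https://zhuanlan.zhihu.com/p/43736169

The Huffman tree in word2vec: https://www.jianshu.com/p/f9351532f281

gensim documentation on using word2vec: https://radimrehurek.com/gensim/models/word2vec.html

Introduction to word2vec: https://www.zhihu.com/topic/19886836/hot (several of the references there are also worth reading)

Hierarchical softmax and negative sampling optimizations: https://www.cnblogs.com/Determined22/p/5807362.html

Website details

The website runs on Python Flask. JavaScript performs AJAX calls for the search engine, so that the user's search query is run on the model, which predicts the most similar places and shows the recommendations on the map.

The Flask file is named "app.py" and is in the "webapp" folder;

the "index.html" file contains the HTML and JavaScript and is in the "templates" folder. Bootstrap is used for the site design.

Html

Clicking search in the HTML triggers onClick="sendToFlask()"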

function sendToFlask() {
    data = $('#user_input').val();
    $.ajax({
        'url': '/map',
        'data': data,
        'type': 'POST',
        'contentType': 'application/json',
        'success': function (data) {
            model_output = JSON.parse(data)
            var center_location = model_output['center_location'];
            var geojson = model_output['top_places'];

            //Initialize
            $('#error_msg').remove()
            $('#portfolio').empty()
            //if geojson list is empty, display error message.
            if (geojson.length==0) {
                error_message()
            };
            var portfolio_header = '<br/><div class="col-lg-12 text-center"><h3 class="section-heading">Places 2 Go</h3></div>'
            $('#portfolio').append(portfolio_header);
            addtoPortfolio(geojson);
            // Clear the map before adding new markers
            mapSimple.removeLayer(myLayer);
            // Create new layer
            myLayer = L.mapbox.featureLayer();
            // Add custom popups to each using our custom feature properties
            myLayer.on('layeradd', function(e) {
                var marker = e.layer,
                    feature = marker.feature;
                // Create custom popup content
                var popupContent = '<a target="_blank" class="popup" href="' + feature.properties.url + '">' + '<div class=crop><img src="' + feature.properties.image + '" height/></div><div class=text-center style="padding:15px 0 0 0"><font size="5">' + feature.properties.title + '</font></div></a>';
                // http://leafletjs.com/reference.html#popup
                marker.bindPopup(popupContent,{
                    closeButton: true,
                    minWidth: 320
                });
            });
            myLayer.setGeoJSON(geojson).addTo(mapSimple);

            mapSimple.fitBounds(myLayer.getBounds());
            // mapSimple.clearLayers();
        },
        'error': function (request, status, error) {
            $('#error_msg').remove()
            error_message();
            console.log('Oh no!! Something went wrong.');
        }
    });
};

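The page uses Mapbox.

The map no longer displays, possibly because the token has expired or there is no network connection (the console reports 'L' is not defined).

Font Awesome icons are used (no CSDN, so I do not plan to use them).

It does not affect use, but it looks as if some resources were not downloaded? (A 304 status in the log below means Not Modified: the browser reused its cached copy, so these requests did not fail.)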

GET /static/font-awesome/css/font-awesome.min.css HTTP/1.1" 304 -

127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/css/bootstrap.min.css HTTP/1.1" 304 -

127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/css/agency.css HTTP/1.1" 304 -

127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/js/jquery.js HTTP/1.1" 304 -

127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/js/bootstrap.min.js HTTP/1.1" 304 -

127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/js/cbpAnimatedHeader.js HTTP/1.1" 304 -

127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/js/agency.js HTTP/1.1" 304 -

127.0.0.1 - - [29/Dec/2019 21:39:05] "GET /static/js/classie.js HTTP/1.1" 304 -

Flask backend endpoints

Example:

from flask import Flask, request

app = Flask(__name__)

@app.route('/1', methods=['POST'])
def aa():  # echo back whatever is sent
    with open('1.txt', 'a') as f:
        print(str(request.data, encoding='utf-8'), file=f)
    return request.data

if __name__ == '__main__':
    app.run(port=3002)  # defaults to port 5000 if omitted

The project's file is app.py:

app.run(host= '0.0.0.0', port=80, debug=True)

@app.route('/map', methods=['POST'])
def userinput():
    data = request.data
    ms = app.where2go.most_similar(data)
    top_places_json = app.where2go.get_top_places_json(ms)
    # print top_places_json
    app.result['top_places'] = top_places_json
    return json.dumps(app.result)

Model

use the trained word2vec model to give most similar recommendations to the input
    input = search string in the format of place/char + place/char -...
    output = top recommendations in json format

terms = self.parse_search_query(input)

This parses the user query into multipliers and destinations.

# Set to make sure the output doesn't include one of the input destinations.
check = set()

# For (multiplier, destination), get the multiplier * vector of that destination.
# Then sum up to the master vector.
for i, term in enumerate(terms):
    multiplier, word = term
    check.add(word)
    if i == 0:
        master_vector = multiplier * self.model[word]
    else:
        master_vector += multiplier * self.model[word]

# Find the most similar vectors to the master vector
ms = self.model.most_similar(positive=[master_vector], topn=topn)
ms_wo_search_terms = [dest for dest in ms if dest[0] not in check]
print
print ms_wo_search_terms
return ms_wo_search_terms

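Question: how exactly is ms computed?

ms = self.model.most_similar(positive=[master_vector], topn=topn)

Is the method calling itself, or is this word2vec's own method?

self.model is the gensim word2vec model, and the deprecation warning below comes from gensim, so this is gensim's own method rather than a recursive call. Its use is discouraged, though:

DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead). """Entry point for launching an IPython kernel.

The method will be removed in 4.0.0; use self.wv.most_similar() instead.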
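Conceptually, most_similar ranks every word in the vocabulary by cosine similarity to the query vector. This is a small pure-Python sketch of that idea, not gensim's implementation; the two-dimensional vectors are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(query, vectors, topn=10):
    """Return the topn (word, similarity) pairs closest to the query vector."""
    scored = [(word, cosine(query, vec)) for word, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy vocabulary with made-up vectors.
vectors = {'beach': [1.0, 0.0], 'island': [0.9, 0.1], 'museum': [0.0, 1.0]}
# Weighted sum of input terms, like master_vector above: 1.0*beach + 0.5*island.
query = [1.0 * a + 0.5 * b for a, b in zip(vectors['beach'], vectors['island'])]
ms = most_similar(query, vectors, topn=2)
print(ms)
```

In the project code the input words themselves are then removed from the ranking via the check set.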

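Class analysis

import cPickle as pkl  # serialization

from where2go_model import Where2Go_Model  # the model class

def load_pickle():
    return pkl.load(open('../data/pickles/where2go_model.pkl', 'rb'))

ms = app.where2go.most_similar(data)

At run time only the Where2Go_Model class from the imported where2go_model module is involved, and the unpickled model is an instance of it.

However, Where2Go_Model also loads other pkl files. I spent a while looking for where they were generated, but my memory is hazy because it has been a long time since those files were produced. Since where2go_model does not import any other code files, I assume no other classes are involved; I will look at the rest when I have time.

Pkl

Java has serialization and deserialization, and Python can do the same. When serializing (dump) and deserializing (load) objects in Python we do not need to worry about the details, because Python already wraps them in the cPickle module.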

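Model analysis

Eda

Uses the file /data/wikivoyage.json (converted from enwikivoyage-latest-pages-articles.xml in step 1, Gathering Data), 369 MB. When I have time I will read up on this file on the official site.

The data is processed into the model's input format.

Model

Where2go is based on a model created at Google called word2vec. Word2vec is a neural network with one hidden layer that comes in two variants, continuous bag of words (CBOW) and skip-gram. Where2go uses skip-gram with hierarchical softmax for optimization.

The where2go_model.py file uses gensim's word2vec model and contains functions that vectorize user search queries, filter the recommendations down to actual geolocations, and output destinations in geojson format.

Building the model and taking the most similar entries (word2vec):

bigram = gensim.models.Phrases(wikivoyage_list, min_count = 10)

model_bigrams = gensim.models.Word2Vec(bigram[wikivoyage_list], min_count=10, size = 200)

ms = model_bigrams.most_similar(positive=['paris','london','sevilla'], negative = [], topn=20)

top_places = []

for entry in ms:
    place, sim = entry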

Using the model

        terms = self.parse_search_query(input)
        # Set to make sure the output doesn't include one of the input destinations.
        check = set()
        # For (multiplier, destination), get the multiplier * vector of that destination.
        # Then sum up to the master vector.
        for i, term in enumerate(terms):
            multiplier, word = term
            check.add(word)
            if i == 0:
                master_vector = multiplier * self.model[word]
            else:
                master_vector += multiplier * self.model[word]
        # Find the most similar vectors to the master vector
        ms = self.model.most_similar(positive=[master_vector], topn=topn)
        ms_wo_search_terms = [dest for dest in ms if dest[0] not in check]

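Word2vec: how the model works, with source code

The word2vec algorithm comprises the skip-gram and CBOW models, trained with either hierarchical softmax or negative sampling.

Here we use skip-gram + hierarchical softmax.

Many people think word2vec is a model or a method, but it is really just a tool. The models behind it are CBOW and Skip-gram, trained with the Hierarchical Softmax or Negative Sampling optimizations. Strictly speaking, then, word2vec is not a model or an algorithm; Mikolov simply happened to name his open-source toolkit word2vec.

Softmax (the normalized exponential function) is an output-layer function. It can be used to compute at least two common kinds of word embeddings: word2vec and FastText. Along with sigmoid and tanh, it also serves as an activation step in many types of neural network architecture.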
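As a small illustration of "normalized exponential", a plain-Python softmax is sketched here. The input scores are arbitrary, and subtracting the maximum is a common numerical-stability choice rather than something the text above prescribes.

```python
import math

def softmax(scores):
    """Exponentiate each score and normalize so the outputs sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

In a word2vec output layer there is one score per vocabulary word, so the normalizing sum runs over the whole vocabulary.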

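The cost of this computation is directly proportional to the vocabulary size, O(V). It turns out that a binary tree structure can reduce this cost to roughly O(log V) per word; this is hierarchical softmax.

Parameters the model has to learn: the word vector x_w of each word, plus θ for each internal node of the Huffman tree.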
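To make the tree idea concrete, the sketch below builds Huffman codes from word counts with a heap. The function name `huffman_codes` and the counts are invented for the example, and real implementations store node indices and bit arrays rather than strings. The point is that frequent words end up with short codes, so a hierarchical-softmax step for them touches only a few internal nodes.

```python
import heapq
import itertools

def huffman_codes(frequencies):
    """Return {word: bit string} for a Huffman tree built from word counts."""
    counter = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(freq, next(counter), {word: ""}) for word, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]

# Made-up word counts for illustration.
counts = {"the": 50, "beach": 20, "paris": 15, "wine": 10, "fjord": 5}
codes = huffman_codes(counts)
```

Each bit in a word's code corresponds to one internal node on the path from the root, and each of those nodes carries its own θ vector.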

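Gradient computation for the H-softmax model

There are too many formulas involved, so I am simply pasting the gradient computation from Liu Jianping's blog here: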
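The formulas themselves are not reproduced in this copy of the notes. As a stand-in, the per-node updates can be written as follows. This is a reconstruction from the comments in the Spark code further down, not a transcription of the blog's derivation. The notation is an assumption: x is the input word vector, θ_j the parameter of the j-th internal node on the target word's path (written w in the Spark comments), d_j ∈ {0, 1} that node's code, η the learning rate (alpha in the code), and σ the sigmoid.

```latex
\begin{aligned}
\mathcal{L} &= \sum_{j} \Big[ (1 - d_j)\,\log \sigma(x^{\top}\theta_j) + d_j \,\log\big(1 - \sigma(x^{\top}\theta_j)\big) \Big] \\
\frac{\partial \mathcal{L}}{\partial \theta_j} &= \big(1 - d_j - \sigma(x^{\top}\theta_j)\big)\, x,
\qquad
\frac{\partial \mathcal{L}}{\partial x} = \sum_{j} \big(1 - d_j - \sigma(x^{\top}\theta_j)\big)\, \theta_j \\
g_j &= \eta\,\big(1 - d_j - \sigma(x^{\top}\theta_j)\big) \\
e &\leftarrow e + g_j\,\theta_j, \qquad \theta_j \leftarrow \theta_j + g_j\,x \quad \text{(for each node } j \text{ on the path)} \\
x &\leftarrow x + e \quad \text{(once, after all nodes on the path)}
\end{aligned}
```

The accumulator e starts at zero for each training pair and is added to x only after every node on the path has been visited.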

The word2vec implementation in Spark MLlib takes exactly this approach; once you know the gradient formulas above, the Spark word2vec source code becomes readable.

// Tree construction is omitted; while the tree is built, every internal node is assigned a code
while (pos < sentence.length) {
  val word = sentence(pos)
  val b = random.nextInt(window)
  // Train Skip-gram
  // syn0 is the parameter array of word vectors x, of length vocab_size * emb_size
  // syn1 is the parameter array of Huffman-tree internal nodes w, of the same length
  var a = b
  while (a < window * 2 + 1 - b) {
    if (a != window) {
      val c = pos - window + a
      if (c >= 0 && c < sentence.length) {
        val lastWord = sentence(c)
        val l1 = lastWord * vectorSize
        val neu1e = new Array[Float](vectorSize)
        // Hierarchical softmax
        var d = 0
        while (d < bcVocab.value(word).codeLen) {
          val inner = bcVocab.value(word).point(d)
          val l2 = inner * vectorSize
          // Propagate hidden -> output
          var f = blas.sdot(vectorSize, syn0, l1, 1, syn1, l2, 1) // compute x^T w
          if (f > -MAX_EXP && f < MAX_EXP) {
            val ind = ((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2.0)).toInt
            f = expTable.value(ind) // f = sigmoid(x^T w)
            // gradient g = (1 - d - f) * alpha; d is this node's code (0/1), alpha is the learning rate
            val g = ((1 - bcVocab.value(word).code(d) - f) * alpha).toFloat
            blas.saxpy(vectorSize, g, syn1, l2, 1, neu1e, 0, 1) // accumulate e = e + g * w, with e initialized to 0
            blas.saxpy(vectorSize, g, syn0, l1, 1, syn1, l2, 1) // update w = w + g * x
            syn1Modify(inner) += 1
          }
          d += 1
        }
        blas.saxpy(vectorSize, 1.0f, neu1e, 0, 1, syn0, l1, 1) // update x = x + e
        syn0Modify(lastWord) += 1
      }
    }
    a += 1
  }
  pos += 1
}
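The innermost loop of the Spark code can be mirrored in a few lines of plain Python, which may be easier to step through. This is a simplified sketch: the names `hs_update` and `path_probability` are made up, vectors are plain lists instead of slices of flat arrays, the sigmoid is computed directly rather than through a lookup table, and the toy numbers are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def path_probability(x, thetas, path, codes):
    """Probability the tree assigns to the target word's path (code 0 -> sigmoid)."""
    prob = 1.0
    for node, d in zip(path, codes):
        f = sigmoid(sum(a * b for a, b in zip(x, thetas[node])))
        prob *= f if d == 0 else 1.0 - f
    return prob

def hs_update(x, thetas, path, codes, alpha):
    """One hierarchical-softmax step, updating x and the node vectors in place."""
    e = [0.0] * len(x)
    for node, d in zip(path, codes):
        theta = thetas[node]
        f = sigmoid(sum(a * b for a, b in zip(x, theta)))
        g = (1 - d - f) * alpha
        for k in range(len(x)):
            e[k] += g * theta[k]   # accumulate e with theta before it changes
            theta[k] += g * x[k]   # update the internal node
    for k in range(len(x)):
        x[k] += e[k]               # update the word vector once, at the end

# Toy parameters, invented for illustration.
x = [0.5, -0.2]
thetas = {0: [0.1, 0.3], 1: [-0.4, 0.2]}
path, codes = [0, 1], [1, 0]

before = path_probability(x, thetas, path, codes)
hs_update(x, thetas, path, codes, alpha=0.1)
after = path_probability(x, thetas, path, codes)
```

With a small learning rate, `after` comes out larger than `before`, which is what a gradient-ascent step on the log-likelihood of the path should do.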
