A Worked Example of Parsing URLs and Extracting Search Keywords in Python

Published: 2023/12/10, python, 豆豆

Libraries used to parse URLs:

Python 2:

from urlparse import urlparse

import urllib

Python 3:

from urllib.parse import urlparse, unquote

(In Python 3, unquote lives in urllib.parse rather than in the top-level urllib module.)

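Comparing the URL schemes of different sites shows a simple pattern: when the search keyword is attached with =, it ends up in the query component of the parsed URL.

When it is not attached with =, it ends up in the path component.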
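As a minimal sketch of that observation (Python 3, using two of the sample URLs from url.txt, shortened for tieba), urlparse puts an =-style keyword in query and a path-style keyword in path. The variable names are only illustrative.

```python
from urllib.parse import urlparse

# Keyword attached with "=" (kw=...): it lands in the query component.
tieba = urlparse(
    "http://tieba.baidu.com/f?ie=utf-8&kw=%E7%BA%A2%E6%B5%B7%E8%A1%8C%E5%8A%A8&fr=search"
)

# Keyword embedded after "list_" with no "=": it lands in the path component.
lagou = urlparse(
    "https://www.lagou.com/jobs/list_python%E5%A4%A7%E6%95%B0%E6%8D%AEmr"
    "?labelWords=&fromSearch=true&suginput="
)

print(tieba.netloc, "->", tieba.query)
print(lagou.netloc, "->", lagou.path)
```

Note that urlparse only splits the URL; the percent-escapes are still encoded at this point and have to be decoded separately.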

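The extraction rule is the same in both cases. The regular expression (the regular variable in the code) combines 6 different cases of plain word characters and %XX escapes.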
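To see how the pattern behaves, here is a small sketch in Python 3 that applies it to three fragments taken from the sample URLs. The helper find_keyword is hypothetical (it is not part of the original script); it simply searches for the marker followed by the pattern and decodes the captured token.

```python
import re
from urllib.parse import unquote

# Same pattern as the `regular` variable in the article's script.
regular = r'(\w+(%\w\w)+\w+|(%\w\w)+\w+(%\w\w)+|\w+(%\w\w)+|(%\w\w)+\w+|(%\w\w)+|\w+)'


def find_keyword(flag, text):
    """Hypothetical helper: return the decoded token that follows `flag`, or None."""
    match = re.search(re.escape(flag) + regular, text, re.I)
    # group(1) is the outer group of `regular`, i.e. the token without the marker.
    return unquote(match.group(1)) if match else None


# Plain word characters only (the last alternative, \w+).
plain = find_keyword("wd=", "tn=baidu&wd=python&rn=")
# Word characters, escapes, word characters (the first alternative).
mixed = find_keyword("wd=", "wd=python%E9%87%8C%E7%9A%84%E5%AD%97%E5%85%B8dict&oq=python")
# Escapes only (the fifth alternative).
encoded = find_keyword("kw=", "ie=utf-8&kw=%E7%BA%A2%E6%B5%B7%E8%A1%8C%E5%8A%A8&fr=search")

print(plain, mixed, encoded)
```

Because the alternatives are tried in order, a keyword that alternates between words and escapes more often than the longest alternative allows will be cut off at the point where that alternative ends.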

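In addition, URLs whose host is s.weibo.com are encoded differently from the others (the keyword is percent-encoded twice, so each % appears as %25) and need separate handling.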
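The double encoding can be illustrated with one of the s.weibo.com sample URLs. This is a sketch in Python 3: the script itself undoes the outer layer with a plain string replace of %25, and calling unquote twice is shown here only as an equivalent way to look at it.

```python
from urllib.parse import unquote

# Path component of one of the s.weibo.com sample URLs.
path = ("/weibo/%25E5%2594%2590%25E4%25BA%25BA%25E8%25A1%2597"
        "%25E6%258E%25A2%25E6%25A1%25882&Refer=index")

once = unquote(path)    # "%25" becomes "%": the result is still percent-encoded
twice = unquote(once)   # second pass yields the readable text

print(once)
print(twice)
```

For this input the first pass gives the same string as path.replace("%25", "%"), which is why the script can then reuse the ordinary pattern and a single unquote call.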

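The code follows. The rules for some sites are still unclear, and working them out takes a lot of time; the clearer the rule, the cleaner the extracted keyword. The rules below already fit most sites; adapt them as needed.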

# -*- coding: utf-8 -*-
import re

try:  # Python 3
    from urllib.parse import urlparse, unquote
except ImportError:  # Python 2
    from urlparse import urlparse
    from urllib import unquote

# File with one URL per line
source_txt = "E:\\python_Anaconda_code\\url.txt"

# Keyword pattern: six combinations of word characters and %XX escapes
regular = r'(\w+(%\w\w)+\w+|(%\w\w)+\w+(%\w\w)+|\w+(%\w\w)+|(%\w\w)+\w+|(%\w\w)+|\w+)'

# Collected keywords
kw_list = list()

# Key: host of the site under study; value: marker that precedes the keyword
host_flags = {
    "www.baidu.com": "wd=",
    "news.baidu.com": "word=",
    "www.sogou.com": "query=",
    "tieba.baidu.com": "kw=",
    "wenku.baidu.com": "word=",
    "music.sina.com.cn": "k=",
    "www.haosou.com": "q=",
    "www.lagou.com": "list_",
    "www.chunyuyisheng.com": "query=",
    "s.weibo.com": "weibo/"
}


def main():
    with open(source_txt, 'r') as f_source_txt:
        for url in f_source_txt:
            url = url.strip()
            if not url:
                continue  # skip blank lines
            parsed = urlparse(url)
            flag = host_flags.get(parsed.netloc)
            if flag is None:
                continue  # host is not in the table
            if "=" in flag:
                # Marker contains "=": the keyword is in the query string
                text = parsed.query
            else:
                # Otherwise the keyword is in the path; s.weibo.com encodes
                # twice, so turn "%25" back into "%" first
                text = parsed.path.replace("%25", "%")
            # "+" stands for a space; drop it so the pattern sees one token
            text = text.replace('+', '')
            kw = re.search(re.escape(flag) + regular, text, re.I)
            if kw:
                # group(1) is the keyword without the marker
                keyword = unquote(kw.group(1))
                kw_list.append(keyword)
                print(keyword)


if __name__ == '__main__':
    main()

The contents of url.txt are as follows:

https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=0&rsv_idx=1&ch=&tn=baidu&bar=&wd=python&rn=&oq=&rsv_pq=ece0867c0002c793&rsv_t=edeaQq7DDvZnxq%2FZVra5K%2BEUanlTIUXhGIhvuTaqdfOECLuXR25XKDp%2Bi0I&rqlang=cn&rsv_enter=1&inputT=218

https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&ch=&tn=baidu&bar=&wd=python%E9%87%8C%E7%9A%84%E5%AD%97%E5%85%B8dict&oq=python&rsv_pq=96c160e70003f332&rsv_t=0880NkOvMIr3TvOdDP1t8EbloD8qwr4yeP6CfPjQihQNNhdExfuwyOFMrx0&rqlang=cn&rsv_enter=0&inputT=10411

https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&ch=&tn=baidu&bar=&wd=python%E9%87%8C%E7%9A%84urlprese&oq=python%25E9%2587%258C%25E7%259A%2584re%25E9%2587%258C%25E7%259A%2584%257C%25E6%2580%258E%25E4%25B9%2588%25E7%2594%25A8&rsv_pq=d1d4e7b90003d391&rsv_t=5ff4Vok4EELK1PgJ4oSk8L0VvKAn51%2BL8ns%2FjSubexg7Lb7znKcTvnVtn8M&rqlang=cn&rsv_enter=1&inputT=2797

https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&ch=&tn=baidu&bar=&wd=python++wo+%E7%88%B1urlprese&oq=python%25E9%2587%258C%25E7%259A%2584urlprese&rsv_pq=eecf45e900033e87&rsv_t=1c70xAYhrvw5JOZA7lpVgt4pw%2BW1TO8hqTejTh67JgEQfqAGyDydd25HAmU&rqlang=cn&rsv_enter=0&inputT=10884

http://news.baidu.com/ns?word=%E8%B6%B3%E7%90%83&tn=news&from=news&cl=2&rn=20&ct=1

http://news.baidu.com/ns?ct=1&rn=20&ie=utf-8&bs=%E8%B6%B3%E7%90%83&rsv_bp=1&sr=0&cl=2&f=8&prevct=no&tn=news&word=++++++%E8%B6%B3++%E7%90%83+++++%E4%BD%A0%E5%A5%BD+%E5%98%9B%EF%BC%9F&rsv_sug3=14&rsv_sug4=912&rsv_sug1=4&inputT=8526

http://tieba.baidu.com/f?ie=utf-8&kw=%E7%BA%A2%E6%B5%B7%E8%A1%8C%E5%8A%A8&fr=search&red_tag=q0224393377

https://www.sogou.com/web?query=ni+zai+%E6%88%91+%E5%BF%83li&_asf=www.sogou.com&_ast=1520388441&w=01019900&p=40040100&ie=utf8&from=index-nologin&s_from=index&sut=9493&sst0=1520388440692&lkt=8%2C1520388431200%2C1520388436842&sugsuv=1498714959961744&sugtime=1520388440692

https://www.lagou.com/jobs/list_python%E5%A4%A7%E6%95%B0%E6%8D%AEmr?labelWords=&fromSearch=true&suginput=

https://www.chunyuyisheng.com/pc/search/?query=%E6%85%A2%E6%80%A7%E4%B9%99%E8%82%9D%

http://s.weibo.com/weibo/%25E5%2594%2590%25E4%25BA%25BA%25E8%25A1%2597%25E6%258E%25A2%25E6%25A1%25882&Refer=index

http://s.weibo.com/weibo/%25E4%25BD%25A0%25E5%25A5%25BD123mm%2520%25E5%2597%25AF%2520mm11&Refer=STopic_box

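Running the script prints the decoded keyword for each URL whose host is in the table.

To study another host, add it to the dictionary in the code together with its marker.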
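For sites that attach the keyword with =, one possible alternative to the regular expression is the standard library's parse_qs, which splits the query string into parameters and decodes them in one step. The sketch below is not the method used in the script above; the table param_by_host, the function keyword_from_query, and the example.com entry are all made up for illustration.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical host -> parameter-name table (names without the trailing "=").
param_by_host = {
    "www.baidu.com": "wd",
    "www.sogou.com": "query",
    "example.com": "q",  # made-up host, shown only to illustrate adding an entry
}


def keyword_from_query(url):
    """Return the decoded search keyword for a known host, or None."""
    parts = urlparse(url)
    name = param_by_host.get(parts.netloc)
    if name is None:
        return None  # host is not in the table
    # parse_qs percent-decodes the values and turns "+" into a space.
    values = parse_qs(parts.query).get(name)
    return values[0] if values else None


kw = keyword_from_query(
    "https://www.sogou.com/web?query=ni+zai+%E6%88%91+%E5%BF%83li&ie=utf8"
)
print(kw)
```

Unlike the script, which removes every + before matching, this variant keeps the spaces between words. It does not cover path-style sites such as www.lagou.com or the double-encoded s.weibo.com URLs, which would still need their own handling.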

Note: the code and approach above are for reference only. If you know a better method, please leave a comment.


Original article: http://www.cppcns.com/jiaoben/python/247463.html
