Getting Started with Asynchronous Programming in Python
Lately, while reading code, I keep running into asynchronous programming. Until now I only cared about making features work and never thought about how fast the code ran, so I decided to study the topic.
From 0 to 1: how Python's asynchronous programming evolved
1. Crawling with urllib and requests
requests optimizes how requests are made, so it is a bit faster than urllib.
Requests is an HTTP client library for Python that makes network requests more intuitive and convenient. Its biggest difference from urllib when scraping is how connections are handled: urllib closes the connection as soon as the data has been fetched, while requests keeps the socket open (HTTP keep-alive) so it can be reused for subsequent requests.
In Python 2.7 the urllib functionality was split across two modules, urllib and urllib2. Python 3 merged them into a single urllib package.
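As a quick reference, the merge shows up directly in the import paths. This sketch maps the common Python 3 names back to their Python 2 homes in comments; the urlencode call at the end is just a network-free demonstration:

```python
# Python 2 split this API across urllib and urllib2;
# Python 3 gathers everything under the urllib package:
from urllib.request import urlopen, Request   # was urllib2.urlopen / urllib2.Request
from urllib.parse import urlencode, urlparse  # was urllib.urlencode / urlparse.urlparse
from urllib.error import URLError, HTTPError  # was urllib2.URLError / urllib2.HTTPError

# urlencode builds a query string without touching the network:
print(urlencode({'start': 25}))  # start=25
```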
urllib:
#-*- coding:utf-8 -*-
import urllib.request
import ssl
from lxml import etree

url = 'https://movie.douban.com/top250'
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1_1)

def fetch_page(url):
    response = urllib.request.urlopen(url, context=context)
    return response

def parse(url):
    response = fetch_page(url)
    page = response.read()
    html = etree.HTML(page)
    xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
    xpath_title = './/span[@class="title"]'
    xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'
    pages = html.xpath(xpath_pages)
    fetch_list = []
    result = []
    for element_movie in html.xpath(xpath_movie):
        result.append(element_movie)
    for p in pages:
        fetch_list.append(url + p.get('href'))
    for url in fetch_list:
        response = fetch_page(url)
        page = response.read()
        html = etree.HTML(page)
        for element_movie in html.xpath(xpath_movie):
            result.append(element_movie)
    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text
        print(i, title)

def main():
    parse(url)

if __name__ == '__main__':
    main()
Replacing the standard library urllib with requests:
import requests
from lxml import etree
from time import time

url = 'https://movie.douban.com/top250'

def fetch_page(url):
    response = requests.get(url)
    return response

def parse(url):
    response = fetch_page(url)
    page = response.content
    html = etree.HTML(page)
    xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
    xpath_title = './/span[@class="title"]'
    xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'
    pages = html.xpath(xpath_pages)
    fetch_list = []
    result = []
    for element_movie in html.xpath(xpath_movie):
        result.append(element_movie)
    for p in pages:
        fetch_list.append(url + p.get('href'))
    for url in fetch_list:
        response = fetch_page(url)
        page = response.content
        html = etree.HTML(page)
        for element_movie in html.xpath(xpath_movie):
            result.append(element_movie)
    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text
        # print(i, title)
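The connection reuse mentioned earlier can also be made explicit with requests.Session, which maintains a pool of keep-alive connections per host. This is a minimal sketch; fetch_all is a hypothetical helper, not part of the article's code:

```python
import requests

def fetch_all(urls):
    # One Session shares a pool of keep-alive connections, so repeated
    # requests to the same host can reuse the underlying socket instead
    # of performing a fresh TCP (and TLS) handshake each time.
    with requests.Session() as session:
        return [session.get(u) for u in urls]

# A Session exposes the same verbs as the module-level helpers:
session = requests.Session()
print(callable(session.get), callable(session.post))  # True True
```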
2、lxml庫(kù)與正則表達(dá)式進(jìn)行解析
lxml庫(kù)進(jìn)行解析需要一定時(shí)間,但依賴(lài)正則表達(dá)式的程序會(huì)更加難以維護(hù),擴(kuò)展性不高。
常見(jiàn)的組合是Requests+BeautifulSoup(解析網(wǎng)絡(luò)文本的工具庫(kù)),解析工具常見(jiàn)的還有正則,xpath。
將lxml庫(kù)換成標(biāo)準(zhǔn)的re庫(kù):
#-*- coding:utf-8 -*-
import requests
from time import time
import re

url = 'https://movie.douban.com/top250'

def fetch_page(url):
    response = requests.get(url)
    return response

def parse(url):
    response = fetch_page(url)
    page = response.content
    fetch_list = set()
    result = []
    # The original regex patterns were mangled in publication; these are
    # plausible reconstructions for the Douban Top 250 markup.
    for title in re.findall(rb'<span class="title">(.*?)</span>', page):
        result.append(title)
    for postfix in re.findall(rb'<a href="(\?start=.*?)"', page):
        fetch_list.add(url + postfix.decode())
    for url in fetch_list:
        response = fetch_page(url)
        page = response.content
        for title in re.findall(rb'<span class="title">(.*?)</span>', page):
            result.append(title)
    for i, title in enumerate(result, 1):
        title = title.decode()
        # print(i, title)
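The maintainability trade-off can be seen offline with a hard-coded snippet: a regex works but is tied to the exact markup, while a real parser (the stdlib html.parser stands in for lxml here) tolerates variations in the HTML. TitleParser and the snippet are illustrative only:

```python
import re
from html.parser import HTMLParser

snippet = '<ol><li><span class="title">The Shawshank Redemption</span></li></ol>'

# Regex: short, but it breaks if the markup changes at all.
titles_re = re.findall(r'<span class="title">(.*?)</span>', snippet)

# Parser: survives attribute reordering, extra whitespace, etc.
class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.titles, self._in_title = [], False
    def handle_starttag(self, tag, attrs):
        self._in_title = tag == 'span' and ('class', 'title') in attrs
    def handle_data(self, data):
        if self._in_title:
            self.titles.append(data)
    def handle_endtag(self, tag):
        self._in_title = False

p = TitleParser()
p.feed(snippet)
print(titles_re[0], '|', p.titles[0])  # The Shawshank Redemption | The Shawshank Redemption
```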
3、進(jìn)階:多進(jìn)程和多線程
網(wǎng)絡(luò)應(yīng)用方面的編程(如上例中的爬蟲(chóng)),通常瓶頸都在IO層面,解決等待讀寫(xiě)的問(wèn)題比提高文本解析速度來(lái)的更有性?xún)r(jià)比。
程序切換—CPU時(shí)間的分配:操作系統(tǒng)自動(dòng)為每個(gè)程序分配一些 CPU/內(nèi)存/磁盤(pán)/鍵盤(pán)/顯示器 等資源的使用時(shí)間,過(guò)期后自動(dòng)切換到下一個(gè)程序。當(dāng)然,被切換的程序,如果沒(méi)有執(zhí)行完,它的狀態(tài)會(huì)被保存起來(lái),方便下次輪詢(xún)到的時(shí)候繼續(xù)執(zhí)行。
1)進(jìn)程:進(jìn)程就是“程序切換”的第一種方式。進(jìn)程,是執(zhí)行中的計(jì)算機(jī)程序。也就是說(shuō),每個(gè)代碼在執(zhí)行的時(shí)候,首先本身即是一個(gè)進(jìn)程。一個(gè)進(jìn)程具有:就緒,運(yùn)行,中斷,僵死,結(jié)束等狀態(tài)(不同操作系統(tǒng)不一樣)。每個(gè)程序,本身首先是一個(gè)進(jìn)程。
2)線程:線程,也是“程序切換”的一種方式。線程,是在進(jìn)程中執(zhí)行的代碼。一個(gè)進(jìn)程下可以運(yùn)行多個(gè)線程,這些線程之間共享主進(jìn)程內(nèi)申請(qǐng)的操作系統(tǒng)資源。在一個(gè)進(jìn)程中啟動(dòng)多個(gè)線程的時(shí)候,每個(gè)線程按照順序執(zhí)行。現(xiàn)在的操作系統(tǒng)中,也支持線程搶占,也就是說(shuō)其它等待運(yùn)行的線程,可以通過(guò)優(yōu)先級(jí),信號(hào)等方式,將運(yùn)行的線程掛起,自己先運(yùn)行。線程,必須在一個(gè)存在的進(jìn)程中啟動(dòng)運(yùn)行。線程使用進(jìn)程獲得的系統(tǒng)資源,不會(huì)像進(jìn)程那樣需要申請(qǐng)CPU等資源。
3)線程與進(jìn)程的區(qū)別:線程一般以并發(fā)執(zhí)行,正是由于這種并發(fā)和數(shù)據(jù)共享機(jī)制,使多任務(wù)間的協(xié)作成為可能。進(jìn)程一般以并行執(zhí)行,這種并行能使得程序能同時(shí)在多個(gè)CPU上運(yùn)行。
4)協(xié)程:協(xié)程,也是”程序切換“的一種。簡(jiǎn)單說(shuō),協(xié)程也是線程,只是協(xié)程的調(diào)度并不是由操作系統(tǒng)調(diào)度,而是自己”協(xié)同調(diào)度“。也就是”協(xié)程是不通過(guò)操作系統(tǒng)調(diào)度的線程“。協(xié)程,又稱(chēng)微線程。協(xié)程間是協(xié)同調(diào)度的,這使得并發(fā)量數(shù)萬(wàn)以上的時(shí)候,協(xié)程的性能是遠(yuǎn)遠(yuǎn)高于線程。注意這里也是“并發(fā)”,不是“并行”。
Multithreading effectively solves the problem of blocking waits.
#-*- coding:utf-8 -*-
import requests
from lxml import etree
from time import time
from threading import Thread

url = 'https://movie.douban.com/top250'

def fetch_page(url):
    response = requests.get(url)
    return response

def parse(url):
    response = fetch_page(url)
    page = response.content
    html = etree.HTML(page)
    xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
    xpath_title = './/span[@class="title"]'
    xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'
    pages = html.xpath(xpath_pages)
    fetch_list = []
    result = []
    for element_movie in html.xpath(xpath_movie):
        result.append(element_movie)
    for p in pages:
        fetch_list.append(url + p.get('href'))

    def fetch_content(url):
        response = fetch_page(url)
        page = response.content
        html = etree.HTML(page)
        for element_movie in html.xpath(xpath_movie):
            result.append(element_movie)

    threads = []
    for url in fetch_list:
        t = Thread(target=fetch_content, args=[url])
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text
        # print(i, title)
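The hand-rolled Thread list above can also be expressed with a thread pool from concurrent.futures, which bounds concurrency and collects results in input order. square here is just a network-free stand-in for fetch_content:

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):  # network-free stand-in for fetch_content
    return x * x

# map() farms the calls out to at most 4 worker threads and yields
# the results in input order; the with-block joins all workers.
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(square, range(5)))
print(results)  # [0, 1, 4, 9, 16]
```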
多進(jìn)程,用4個(gè)進(jìn)程的進(jìn)程池來(lái)并行處理網(wǎng)絡(luò)數(shù)據(jù)。
#-*- coding:utf-8 -*-
import requests
from lxml import etree
from time import time
from concurrent.futures import ProcessPoolExecutor

url = 'https://movie.douban.com/top250'

def fetch_page(url):
    response = requests.get(url)
    return response

def fetch_content(url):
    response = fetch_page(url)
    page = response.content
    return page

def parse(url):
    page = fetch_content(url)
    html = etree.HTML(page)
    xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
    xpath_title = './/span[@class="title"]'
    xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'
    pages = html.xpath(xpath_pages)
    fetch_list = []
    result = []
    for element_movie in html.xpath(xpath_movie):
        result.append(element_movie)
    for p in pages:
        fetch_list.append(url + p.get('href'))
    with ProcessPoolExecutor(max_workers=4) as executor:
        for page in executor.map(fetch_content, fetch_list):
            html = etree.HTML(page)
            for element_movie in html.xpath(xpath_movie):
                result.append(element_movie)
    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text
        # print(i, title)
這里多進(jìn)程帶來(lái)的優(yōu)點(diǎn)(cpu處理)并沒(méi)有得到體現(xiàn),反而創(chuàng)建和調(diào)度進(jìn)程帶來(lái)的開(kāi)銷(xiāo)要遠(yuǎn)超出它的正面效應(yīng),拖了一把后腿。即便如此,多進(jìn)程帶來(lái)的效益相比于之前單進(jìn)程單線程的模型要好得多。
多進(jìn)程和多線程除了創(chuàng)建的開(kāi)銷(xiāo)大之外還有一個(gè)難以根治的缺陷,就是處理進(jìn)程之間或線程之間的協(xié)作問(wèn)題,因?yàn)槭且蕾?lài)多進(jìn)程和多線程的程序在不加鎖的情況下通常是不可控的,而協(xié)程則可以完美地解決協(xié)作問(wèn)題,由用戶(hù)來(lái)決定協(xié)程之間的調(diào)度。
基于gevent的異步程序:
#-*- coding:utf-8 -*-
import requests
from lxml import etree
from time import time
import gevent
from gevent import monkey
monkey.patch_all()

url = 'https://movie.douban.com/top250'

def fetch_page(url):
    response = requests.get(url)
    return response

def fetch_content(url):
    response = fetch_page(url)
    page = response.content
    return page

def parse(url):
    page = fetch_content(url)
    html = etree.HTML(page)
    xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
    xpath_title = './/span[@class="title"]'
    xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'
    pages = html.xpath(xpath_pages)
    fetch_list = []
    result = []
    for element_movie in html.xpath(xpath_movie):
        result.append(element_movie)
    for p in pages:
        fetch_list.append(url + p.get('href'))
    jobs = [gevent.spawn(fetch_content, url) for url in fetch_list]
    gevent.joinall(jobs)
    for page in [job.value for job in jobs]:
        html = etree.HTML(page)
        for element_movie in html.xpath(xpath_movie):
            result.append(element_movie)
    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text
        # print(i, title)
gevent gives us the ability to write asynchronous programs with synchronous-looking logic. The monkey.patch_all() line is the trick that makes the whole program asynchronous: once the monkey patch is applied, Python dynamically replaces some blocking standard-library modules (such as socket and thread) with asynchronous versions at runtime, so every network operation works asynchronously and efficiency naturally improves.
4. Python Async/Await
Python needed a dedicated standard library to support coroutines, which eventually became asyncio.
The example below swaps the synchronous requests library for aiohttp, which supports asyncio, and uses the async/await syntax introduced in Python 3.5 to write the coroutine version:
#-*- coding:utf-8 -*-
from lxml import etree
from time import time
import asyncio
import aiohttp

url = 'https://movie.douban.com/top250'

async def fetch_content(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

async def parse(url):
    page = await fetch_content(url)
    html = etree.HTML(page)
    xpath_movie = '//*[@id="content"]/div/div[1]/ol/li'
    xpath_title = './/span[@class="title"]'
    xpath_pages = '//*[@id="content"]/div/div[1]/div[2]/a'
    pages = html.xpath(xpath_pages)
    fetch_list = []
    result = []
    for element_movie in html.xpath(xpath_movie):
        result.append(element_movie)
    for p in pages:
        fetch_list.append(url + p.get('href'))
    tasks = [fetch_content(url) for url in fetch_list]
    pages = await asyncio.gather(*tasks)
    for page in pages:
        html = etree.HTML(page)
        for element_movie in html.xpath(xpath_movie):
            result.append(element_movie)
    for i, movie in enumerate(result, 1):
        title = movie.find(xpath_title).text
        # print(i, title)

def main():
    loop = asyncio.get_event_loop()
    start = time()
    for i in range(5):
        loop.run_until_complete(parse(url))
    end = time()
    print('Cost {} seconds'.format((end - start) / 5))
    loop.close()

if __name__ == '__main__':
    main()
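The same gather pattern can be demonstrated without the network: asyncio.gather runs its awaitables concurrently and returns their results in call order, not completion order (work, demo and the delays are illustrative):

```python
import asyncio

async def work(name, delay):
    await asyncio.sleep(delay)  # stands in for a network wait
    return name

async def demo():
    # gather schedules both coroutines concurrently; results come back
    # in the order the awaitables were passed, not completion order.
    return await asyncio.gather(work('slow', 0.02), work('fast', 0.01))

print(asyncio.run(demo()))  # ['slow', 'fast']
```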
It is fast, and it also improves the program's readability.
A beginner's guide to Python Async/Await: to be continued...