Python Web Scraping Study Notes

Published: 2025/3/15 | python | 豆豆


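Environment

Python 3 + pip setup

MongoDB, MySQL, and Redis setup

Installing common scraping libraries

Basics

Basic principles

What a crawler is: an automated program that requests websites and extracts data

Basic crawler workflow: send a request, get the response, parse the content, save the data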

    >>> import requests
    >>> response = requests.get('https://www.baidu.com')
    >>> print(response.text)
    <!DOCTYPE html>
    ...............

What kind of data to scrape: HTML documents, JSON text, images, videos, and others

Parsing methods: direct processing, JSON parsing, regular expressions, BeautifulSoup, PyQuery, XPath
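As a rough illustration of two of the parsing methods listed above, the sketch below pulls the same titles out of a JSON body and out of an HTML fragment. Both response bodies and both helper names (`titles_from_json`, `titles_from_html`) are made up for this example; real responses will be shaped differently.

```python
import json
import re

# Hypothetical response bodies, made up for illustration only.
json_body = '{"movies": [{"title": "A", "score": 9.5}, {"title": "B", "score": 9.1}]}'
html_body = '<ul><li class="movie">A</li><li class="movie">B</li></ul>'


def titles_from_json(text):
    """JSON parsing: decode the text, then walk the resulting dicts and lists."""
    return [movie["title"] for movie in json.loads(text)["movies"]]


def titles_from_html(text):
    """Regular expression: capture the text inside each <li class="movie"> element."""
    return re.findall(r'<li class="movie">(.*?)</li>', text)


print(titles_from_json(json_body))
print(titles_from_html(html_body))
```

When a site offers a JSON endpoint, JSON parsing is usually the less fragile of the two, since it does not depend on the exact markup.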

Handling JavaScript-rendered pages: analyze the AJAX requests, Splash, PyV8, Ghost.py

How to save data: text (plain text, JSON, XML, etc.), relational databases (MySQL, Oracle, SQL Server), non-relational databases (MongoDB, Redis)
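A minimal sketch of two of the storage options above: a text file with one JSON object per line, and a relational table. SQLite from the standard library stands in for MySQL here only so the example runs without a database server; that swap, the sample records, the table layout, and the helper names are all assumptions made for illustration.

```python
import json
import os
import sqlite3
import tempfile

# Hypothetical scraped records, made up for illustration only.
records = [
    {"title": "A", "score": "9.5"},
    {"title": "B", "score": "9.1"},
]


def save_as_json_lines(items, path):
    """Text storage: write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


def save_to_sqlite(items, path):
    """Relational storage: insert each record as a row and return the row count."""
    conn = sqlite3.connect(path)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS movies (title TEXT, score TEXT)")
        # Named placeholders are filled from the keys of each dict.
        conn.executemany(
            "INSERT INTO movies (title, score) VALUES (:title, :score)", items
        )
        conn.commit()
        return conn.execute("SELECT COUNT(*) FROM movies").fetchone()[0]
    finally:
        conn.close()


workdir = tempfile.mkdtemp()
json_path = os.path.join(workdir, "movies.jsonl")
db_path = os.path.join(workdir, "movies.db")

save_as_json_lines(records, json_path)
row_count = save_to_sqlite(records, db_path)
print(row_count)
```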

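Basic usage of the urllib library

What urllib is: Python's built-in HTTP request library. urllib.request is the request module, urllib.error is the exception-handling module, and urllib.parse is the URL-parsing module.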
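Of the three modules, urllib.parse is the only one that can be tried without a network connection. The sketch below builds, splits, and joins a URL; the parameter names and values are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urljoin, urlparse

# Build a query string from a dict (hypothetical parameters).
query = urlencode({"name": "m0bu", "page": 2})
url = "http://httpbin.org/get?" + query

# Split the URL into its components.
parts = urlparse(url)

# Decode the query string back into a dict of lists.
params = parse_qs(parts.query)

# Resolve a path against the base URL.
next_url = urljoin(url, "/post")

print(query)
print(parts.scheme, parts.netloc, parts.path)
print(params)
print(next_url)
```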

Changes from Python 2

    # Python 2
    import urllib2
    response = urllib2.urlopen('http://www.baidu.com')

    # Python 3
    import urllib.request
    response = urllib.request.urlopen('http://www.baidu.com')

Usage in detail

    """ GET request """
    import urllib.request

    response = urllib.request.urlopen('http://www.baidu.com')
    print(response.read().decode('utf-8'))

    """ POST request """
    from urllib import request, parse

    url = 'http://httpbin.org/post'
    headers = {'User-Agent': 'tttt', 'Host': 'httpbin.org'}
    form = {'name': 'm0bu'}  # renamed from `dict` to avoid shadowing the builtin
    data = bytes(parse.urlencode(form), encoding='utf-8')
    req = request.Request(url=url, data=data, headers=headers, method='POST')
    # req.add_header('', '')
    response = request.urlopen(req)
    print(response.read().decode('utf-8'))

    """ Exception handling """
    import socket
    import urllib.request
    import urllib.error

    try:
        response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
    except urllib.error.URLError as e:
        if isinstance(e.reason, socket.timeout):
            print('time out')

    """ Proxy """
    from urllib import request

    proxy_handler = request.ProxyHandler({
        'http': 'http://127.0.0.1:1080',
        'https': 'https://127.0.0.1:1080',
    })
    opener = request.build_opener(proxy_handler)
    response = opener.open('http://httpbin.org/get')
    print(response.read())

    """ Cookie """
    import http.cookiejar
    import urllib.request

    cookie = http.cookiejar.CookieJar()
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    for item in cookie:
        print(item.name + "=" + item.value)

Basic usage of the Requests library

What Requests is: a simple, easy-to-use HTTP library implemented in Python

    """ GET request with parameters """
    import requests

    data = {'name': 'm0bu'}
    response = requests.get("http://httpbin.org/get", params=data)
    print(response.text)

    """ Parsing JSON """
    import requests
    import json

    response = requests.get("http://httpbin.org/get")
    print(response.json())
    print(json.loads(response.text))  # equivalent to response.json()

    """ Binary data """
    import requests

    response = requests.get('https://github.com/favicon.ico')
    print(response.content)
    with open('favicon.ico', 'wb') as f:
        f.write(response.content)  # the with block closes the file

    """ Adding headers, POST request """
    import requests

    data = {"name": "m0bu"}
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36"
    }
    response = requests.post('http://httpbin.org/post', headers=headers, data=data)
    print(response.json())

    """ Checking the status code """
    import requests

    response = requests.get('http://httpbin.org/')
    if response.status_code != 200:
        exit()
    print('200')

    """ File upload """
    import requests

    with open('favicon.ico', 'rb') as f:
        response = requests.post("http://httpbin.org/post", files={'file': f})
    print(response.text)

    """ Getting cookies """
    import requests

    r = requests.get("https://www.baidu.com")
    print(r.cookies)
    for key, value in r.cookies.items():
        print(key + '=' + value)

    """ Session persistence """
    import requests

    s = requests.Session()
    s.get("http://httpbin.org/cookies/set/number/12345678")
    r = s.get('http://httpbin.org/cookies')
    print(r.text)

    """ Certificate verification """
    import requests
    import urllib3

    urllib3.disable_warnings()  # silence the warning caused by verify=False
    r = requests.get('https://www.12306.cn', verify=False)
    print(r.status_code)

    """ Proxy settings """
    import requests

    proxies = {
        "http": "http://127.0.0.1:1080",
        "https": "https://127.0.0.1:1080",
    }
    r = requests.get('http://httpbin.org/ip', proxies=proxies)
    print(r.text)

    """ Exception handling """
    import requests
    from requests.exceptions import ReadTimeout, HTTPError, RequestException

    try:
        r = requests.get("http://httpbin.org/get", timeout=0.1)
        print(r.status_code)
    except ReadTimeout:
        print("timeout")
    except HTTPError:  # raised only if r.raise_for_status() is called
        print('http error')
    except RequestException:
        print('error')

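Regular expression basics

What a regular expression is: a logical formula for operating on strings. It uses predefined special characters, and combinations of them, to form a "rule string" that expresses a filtering logic over strings.

Not unique to Python; in Python it is implemented by the re module

Prefer general (wildcard) matching, use parentheses to capture the target, prefer non-greedy mode, and use re.S when the text contains newlines

To make matching easier, use search instead of match whenever possible (match only matches at the start of the string), and use group() to print the result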
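The tips above can be checked with a small sketch. The sample text is made up for illustration; the variable names are only labels for this example.

```python
import re

# Hypothetical text containing newlines, made up for illustration only.
content = "Intro line\nHello 1234567 World_This\nis a Regex Demo"

# match() only matches at the start of the string; search() scans all of it.
at_start = re.match(r"Hello", content)
anywhere = re.search(r"Hello", content)

# Greedy ".*" swallows as much as it can and leaves (\d+) a single digit;
# non-greedy ".*?" stops as early as possible, so (\d+) gets the whole number.
greedy = re.search(r"Hello.*(\d+)", content).group(1)
lazy = re.search(r"Hello.*?(\d+)", content).group(1)

# By default "." does not match a newline; re.S makes it match one.
without_s = re.search(r"Hello.*?Demo", content)
with_s = re.search(r"Hello.*?Demo", content, re.S)

print(at_start, anywhere.group())
print(greedy, lazy)
print(without_s, with_s.group())
```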

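re.findall searches the string and returns all matching substrings as a list

re.compile compiles a regular expression string into a pattern object so the pattern can be reused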
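A short sketch of the two functions working together, assuming a made-up HTML fragment; the attribute names and song titles are invented for this example.

```python
import re

# Hypothetical HTML fragment, made up for illustration only.
html = '''
<li data-view="2"><a href="/2.mp3" singer="Singer A">Song One</a></li>
<li data-view="4"><a href="/3.mp3" singer="Singer B">Song Two</a></li>
'''

# Compile once, reuse many times; re.S lets "." match across newlines.
song_pattern = re.compile(r'<a href="(.*?)" singer="(.*?)">(.*?)</a>', re.S)

# With several groups, findall returns a list of tuples, one per match.
songs = song_pattern.findall(html)
titles = [title for _, _, title in songs]

# The same compiled pattern also offers search() for the first match only.
first_title = song_pattern.search(html).group(3)

print(songs)
print(titles, first_title)
```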

""" Small exercise """

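BeautifulSoup in detail

A flexible and convenient web page parsing library that is efficient and supports multiple parsers. With it you can extract information from web pages without writing regular expressions.

Parsers: Python standard library (html.parser), lxml HTML parser, lxml XML parser, html5lib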

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')

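The lxml parser is recommended; use html.parser when necessary

Selecting by tag name has weak filtering ability but is fast

Use find() to match a single result and find_all() to match multiple results

If you are familiar with CSS selectors, use select()

Remember the common ways to get attributes and text values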
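The points above can be sketched as follows. The HTML is a made-up page, and html.parser is used instead of lxml only so the example needs nothing beyond bs4 itself.

```python
from bs4 import BeautifulSoup

# Hypothetical page, made up for illustration only.
html = """
<html><head><title>Demo page</title></head>
<body>
<ul id="movies">
  <li class="movie"><a href="/films/1">First</a></li>
  <li class="movie"><a href="/films/2">Second</a></li>
</ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag selection: fast, but it only reaches the first matching tag.
title_text = soup.title.string

# find() returns the first match, find_all() returns every match.
first_item = soup.find("li", class_="movie")
all_items = soup.find_all("li", class_="movie")

# select() takes a CSS selector and returns a list of tags.
links = soup.select("#movies li.movie a")

# Attributes are read like dict keys; text comes from get_text().
hrefs = [a["href"] for a in links]
texts = [a.get_text() for a in links]

print(title_text, first_item.a.get_text(), len(all_items))
print(hrefs, texts)
```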

PyQuery in detail

A powerful and flexible web page parsing library. If you are familiar with jQuery syntax, PyQuery is a good choice.

    from pyquery import PyQuery as pq
    doc = pq(url='http://www.baidu.com')
    print(doc('head'))

Selenium in detail

An automated testing tool that supports multiple browsers. In scraping it is mainly used to handle JavaScript-rendered pages.

    from selenium import webdriver
    browser = webdriver.Chrome()
    browser.get('https://www.taobao.com')
    print(browser.page_source)
    browser.close()

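Official documentation

Hands-on

Scraping Maoyan movies with Requests + regular expressions

Target site analysis and workflow:

1. Fetch the content of a single page

2. Parse it with regular expressions

3. Save the results to a file

4. Loop over the pages and run them in parallel (the code uses multiprocessing.Pool, i.e. processes rather than threads)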

    import json
    import re
    import time
    from multiprocessing import Pool

    import requests
    from requests.exceptions import RequestException

    def get_one_page(url):
        """Fetch one page; return its HTML, or None on failure."""
        try:
            response = requests.get(url)
            if response.status_code == 200:
                return response.text
            return None
        except RequestException:
            return None

    def parse_one_page(html):
        """Yield one dict per movie found in the page."""
        # Both score captures are closed with </i>; without it the first
        # lazy group would always match an empty string.
        pattern = re.compile(
            r'<dd>.*?board-index.*?(\d+)</i>.*?data-src="(.*?)".*?name"><a'
            r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
            r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
        for item in re.findall(pattern, html):
            yield {
                'index': item[0],
                'image': item[1],
                'title': item[2],
                'actor': item[3].strip()[3:],  # drop the 3-character label prefix
                'time': item[4].strip()[5:],   # drop the 5-character label prefix
                'score': item[5] + item[6],
            }

    def write_to_file(content):
        with open('result.txt', 'a', encoding='utf-8') as f:
            f.write(json.dumps(content, ensure_ascii=False) + '\n')

    def main(offset):
        url = 'https://maoyan.com/board/4?offset=' + str(offset)
        html = get_one_page(url)
        if html is None:  # request failed, nothing to parse
            return
        for item in parse_one_page(html):
            print(item)
            write_to_file(item)

    if __name__ == "__main__":
        start = time.time()
        with Pool() as pool:
            pool.map(main, [i * 10 for i in range(10)])
        end = time.time()
        print(end - start)  # elapsed seconds

Reposted from: https://www.cnblogs.com/skrr/p/11055821.html
