Python Web Scraping Basics
Target URL: https://www.23hh.com/book/0/189/
Goal: fetch the novel's chapter list and the text of each chapter.
Libraries needed: requests, BeautifulSoup, and re. requests sends the HTTP requests; BeautifulSoup and re parse the responses and extract the data we want.
Install the third-party packages with pip install requests and pip install beautifulsoup4 (re ships with the standard library).
After analyzing the page source, the project is built in four steps:
1. Create a testcraw package.
2. Create craw_site.py to fetch the chapter list and the link for each chapter.
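The post does not reproduce craw_site.py itself, but subpage.py below imports a `result` function from it that yields (chapter title, chapter URL) pairs. A minimal sketch of what it might look like, assuming the table of contents lists chapters as `<a href="NNN.html">title</a>` links relative to the book URL (the regex and the `parse_chapters` helper are my own assumptions, not from the original):

```python
import re


def parse_chapters(html, base_url):
    # Pull (title, absolute URL) pairs out of the table-of-contents HTML.
    # Assumes chapter links look like <a href="189665.html">Chapter title</a>.
    pairs = re.findall(r'<a href="(\d+\.html)">([^<]+)</a>', html)
    return [(title, base_url + href) for href, title in pairs]


def result(website):
    # Fetch the table-of-contents page, then delegate to the pure parser.
    import requests  # deferred import so parse_chapters has no network dependency
    res = requests.get(website, timeout=60)
    res.raise_for_status()
    res.encoding = res.apparent_encoding
    return parse_chapters(res.text, website)
```

Splitting the fetch from the parse keeps `parse_chapters` easy to test against a static HTML snippet.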
3. Create mysql_helper.py to save the data.

```python
import pymysql


class MysqlTool(object):
    def getConn(self):
        # Open a connection to the local fictions database.
        conn = None
        try:
            conn = pymysql.connect(host='localhost', user='root',
                                   password='5180', port=3306, db='fictions')
        except Exception as e:
            print('\033[31m{}\033[0m'.format(e))
        return conn

    def closeConn(self, conn):
        # Commit any pending writes, then close the connection.
        try:
            if conn is not None:
                conn.commit()
                conn.close()
        except Exception as e:
            print('\033[31m{}\033[0m'.format(e))

    def getCursor(self, conn):
        cur = None
        try:
            if conn is not None:
                cur = conn.cursor()
        except Exception as e:
            print('\033[31m{}\033[0m'.format(e))
        return cur

    def closeCursor(self, cur):
        try:
            if cur is not None:
                cur.close()
        except Exception as e:
            print('\033[31m{}\033[0m'.format(e))

    def insert(self, cur, chapter='', content=''):
        # Parameterized insert; pymysql handles the escaping.
        sql = 'insert into perfect_world(chapter, content) values(%s, %s);'
        count = cur.execute(sql, (chapter, content))
        if count > 0:
            print('{} scraped successfully'.format(chapter))
```

Then create the fictions database and the table it uses:
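The original post shows the table structure only as a screenshot, which is not reproduced here. Working backwards from the insert statement in mysql_helper.py (a perfect_world table with chapter and content columns), the DDL might look roughly like this; the column types and the id surrogate key are my assumptions:

```sql
CREATE DATABASE IF NOT EXISTS fictions DEFAULT CHARSET utf8mb4;
USE fictions;

CREATE TABLE IF NOT EXISTS perfect_world (
    id      INT AUTO_INCREMENT PRIMARY KEY,  -- assumed surrogate key
    chapter VARCHAR(255) NOT NULL,           -- chapter title
    content TEXT                             -- chapter body text
);
```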
4. Create subpage.py to fetch the body text of each chapter page.

```python
import re

import requests
from bs4 import BeautifulSoup

from testcraw.craw_site import result
from testcraw.mysql_helper import MysqlTool


def test(website):
    for i in result(website):
        chapter, content = i[0], ''
        site = i[1]
        res = requests.get(url=site, timeout=60)
        res.raise_for_status()
        # Guess the real encoding from the response body.
        res.encoding = res.apparent_encoding
        demo = res.text
        soup = BeautifulSoup(demo, 'html.parser')
        # Collect the text inside the <div id="content"> element.
        for tag in soup.find_all(attrs={'id': 'content'}):
            for j in tag.stripped_strings:
                content += (j + '\n')
        # Strip the site's watermark line from the chapter text.
        content = re.sub(pattern='純文字在線閱讀本站域名手機同步閱讀請訪問',
                         repl='', string=content, count=1)
        mt = MysqlTool()
        conn = mt.getConn()
        cur = mt.getCursor(conn)
        mt.insert(cur, chapter, content)
        mt.closeCursor(cur)
        mt.closeConn(conn)


test('https://www.23hh.com/book/0/189/')
```

A sample of the scraped content:
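To see what the find_all / stripped_strings extraction in subpage.py does without hitting the site, here is a small self-contained demo on a fabricated HTML snippet (the snippet is made up for illustration; the parsing calls mirror the ones above):

```python
from bs4 import BeautifulSoup

# Fabricated stand-in for a chapter page; real pages come from 23hh.com.
demo = '''
<html><body>
  <div id="content">
    &nbsp;&nbsp;First paragraph of the chapter.<br/>
    &nbsp;&nbsp;Second paragraph of the chapter.
  </div>
</body></html>
'''

soup = BeautifulSoup(demo, 'html.parser')
content = ''
for tag in soup.find_all(attrs={'id': 'content'}):
    # stripped_strings yields each text node with surrounding
    # whitespace (including the &nbsp; indents) removed.
    for line in tag.stripped_strings:
        content += line + '\n'
print(content)
```

Each paragraph comes out as its own line, which is why subpage.py joins them with '\n'.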
Summary

That is the whole of Python Web Scraping Basics. I hope this article helps you solve the problems you run into.