Python Web Scraping Basics
Target URL: https://www.23hh.com/book/0/189/
Goal: fetch the novel's chapter list and the text of each chapter.
Libraries needed: requests, BeautifulSoup, and re. requests sends the HTTP requests; BeautifulSoup and re parse the responses and extract the data we want.
Install the third-party packages with pip install requests and pip install beautifulsoup4 (re ships with the standard library).
After analyzing the page source, the project is built in four steps:
1. Create a testcraw package.
2. Create craw_site.py to fetch the chapter list and the link for each chapter.
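The post does not reproduce craw_site.py itself, but subpage.py below imports a `result` function from it that yields (chapter title, chapter URL) pairs. A minimal sketch of what it might look like, assuming the table of contents lists chapters as `<a href="NNN.html">title</a>` links relative to the book URL (the regex and the `parse_chapters` helper are my own assumptions, not from the original):

```python
import re


def parse_chapters(html, base_url):
    # Pull (title, absolute URL) pairs out of the table-of-contents HTML.
    # Assumes chapter links look like <a href="189665.html">Chapter title</a>.
    pairs = re.findall(r'<a href="(\d+\.html)">([^<]+)</a>', html)
    return [(title, base_url + href) for href, title in pairs]


def result(website):
    # Fetch the table-of-contents page, then delegate to the pure parser.
    import requests  # deferred import so parse_chapters has no network dependency
    res = requests.get(website, timeout=60)
    res.raise_for_status()
    res.encoding = res.apparent_encoding
    return parse_chapters(res.text, website)
```

Splitting the fetch from the parse keeps `parse_chapters` easy to test against a static HTML snippet.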
3. Create mysql_helper.py to save the data.

```python
import pymysql


class MysqlTool(object):
    def getConn(self):
        # Open a connection to the local fictions database.
        conn = None
        try:
            conn = pymysql.connect(host='localhost', user='root',
                                   password='5180', port=3306, db='fictions')
        except Exception as e:
            print('\033[31m{}\033[0m'.format(e))
        return conn

    def closeConn(self, conn):
        # Commit any pending writes, then close the connection.
        try:
            if conn is not None:
                conn.commit()
                conn.close()
        except Exception as e:
            print('\033[31m{}\033[0m'.format(e))

    def getCursor(self, conn):
        cur = None
        try:
            if conn is not None:
                cur = conn.cursor()
        except Exception as e:
            print('\033[31m{}\033[0m'.format(e))
        return cur

    def closeCursor(self, cur):
        try:
            if cur is not None:
                cur.close()
        except Exception as e:
            print('\033[31m{}\033[0m'.format(e))

    def insert(self, cur, chapter='', content=''):
        # Parameterized insert; pymysql handles the escaping.
        sql = 'insert into perfect_world(chapter, content) values(%s, %s);'
        count = cur.execute(sql, (chapter, content))
        if count > 0:
            print('{} scraped successfully'.format(chapter))
```

Then create the fictions database and the table it uses:
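The original post shows the table structure only as a screenshot, which is not reproduced here. Working backwards from the insert statement in mysql_helper.py (a perfect_world table with chapter and content columns), the DDL might look roughly like this; the column types and the id surrogate key are my assumptions:

```sql
CREATE DATABASE IF NOT EXISTS fictions DEFAULT CHARSET utf8mb4;
USE fictions;

CREATE TABLE IF NOT EXISTS perfect_world (
    id      INT AUTO_INCREMENT PRIMARY KEY,  -- assumed surrogate key
    chapter VARCHAR(255) NOT NULL,           -- chapter title
    content TEXT                             -- chapter body text
);
```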
4. Create subpage.py to fetch the body text of each chapter page.

```python
import re

import requests
from bs4 import BeautifulSoup

from testcraw.craw_site import result
from testcraw.mysql_helper import MysqlTool


def test(website):
    for i in result(website):
        chapter, content = i[0], ''
        site = i[1]
        res = requests.get(url=site, timeout=60)
        res.raise_for_status()
        # Guess the real encoding from the response body.
        res.encoding = res.apparent_encoding
        demo = res.text
        soup = BeautifulSoup(demo, 'html.parser')
        # Collect the text inside the <div id="content"> element.
        for tag in soup.find_all(attrs={'id': 'content'}):
            for j in tag.stripped_strings:
                content += (j + '\n')
        # Strip the site's watermark line from the chapter text.
        content = re.sub(pattern='純文字在線閱讀本站域名手機同步閱讀請訪問',
                         repl='', string=content, count=1)
        mt = MysqlTool()
        conn = mt.getConn()
        cur = mt.getCursor(conn)
        mt.insert(cur, chapter, content)
        mt.closeCursor(cur)
        mt.closeConn(conn)


test('https://www.23hh.com/book/0/189/')
```

A sample of the scraped content:
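To see what the find_all / stripped_strings extraction in subpage.py does without hitting the site, here is a small self-contained demo on a fabricated HTML snippet (the snippet is made up for illustration; the parsing calls mirror the ones above):

```python
from bs4 import BeautifulSoup

# Fabricated stand-in for a chapter page; real pages come from 23hh.com.
demo = '''
<html><body>
  <div id="content">
    &nbsp;&nbsp;First paragraph of the chapter.<br/>
    &nbsp;&nbsp;Second paragraph of the chapter.
  </div>
</body></html>
'''

soup = BeautifulSoup(demo, 'html.parser')
content = ''
for tag in soup.find_all(attrs={'id': 'content'}):
    # stripped_strings yields each text node with surrounding
    # whitespace (including the &nbsp; indents) removed.
    for line in tag.stripped_strings:
        content += line + '\n'
print(content)
```

Each paragraph comes out as its own line, which is why subpage.py joins them with '\n'.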
Summary

That is the whole of Python Web Scraping Basics. I hope this article helps you solve the problems you run into.