Taking advantage of some free time over the summer break, I put the bit of Python data-collection knowledge I picked up last semester to the test and wrote a crawler for Douban Books. Here is a summary.
Here is what I set out to do:
1. Log in
2. Fetch the Douban Books category index
3. Enter each category, scrape the first page of books for the title, author, translator, publication date and other details, store them in MySQL, and download the cover images.
Step 1
First, even crawlers should play by the rules, so let's look at Douban's robots.txt:
User-agent: *
Disallow: /subject_search
Disallow: /amazon_search
Disallow: /search
Disallow: /group/search
Disallow: /event/search
Disallow: /celebrities/search
Disallow: /location/drama/search
Disallow: /forum/
Disallow: /new_subject
Disallow: /service/iframe
Disallow: /j/
Disallow: /link2/
Disallow: /recommend/
Disallow: /trailer/
Disallow: /doubanapp/card
Sitemap: https://www.douban.com/sitemap_index.xml
Sitemap: https://www.douban.com/sitemap_updated_index.xml
# Crawl-delay: 5

User-agent: Wandoujia Spider
Disallow: /
And here are the pages I want to crawl:
https://book.douban.com/tag/?view=type&icn=index-sorttags-all
https://book.douban.com/tag/?icn=index-nav
https://book.douban.com/tag/[tag name]
https://book.douban.com/subject/[book ID]/
Good, none of these violate the robots.txt rules, so I can write the code with a clear conscience.
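If you prefer to check this in code, here is a minimal sketch using the standard library's urllib.robotparser (an assumption on my part: the original post checks robots.txt by eye, and I am also assuming the book subdomain serves its rules at https://book.douban.com/robots.txt):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://book.douban.com/robots.txt")   # assumed location of the rules
rp.read()

# The URL patterns this crawler intends to visit (tag names and book IDs vary).
targets = [
    "https://book.douban.com/tag/?view=type&icn=index-sorttags-all",
    "https://book.douban.com/tag/?icn=index-nav",
    "https://book.douban.com/tag/",
]
for t in targets:
    print(t, rp.can_fetch("*", t))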
Step 2
Since I'm writing this anyway, I might as well do it properly, so let's log in to Douban first.
I log in with cookies here: first I used Firefox's handy HttpFox add-on to capture the headers and cookies of a normal browser login.
def login(url):
    cookies = {}
    with open("C:/Users/lenovo/OneDrive/projects/Scraping/doubancookies.txt") as file:
        raw_cookies = file.read()
    for line in raw_cookies.split(';'):
        key, value = line.split('=', 1)
        cookies[key] = value
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'
    }
    s = requests.get(url, cookies=cookies, headers=headers)
    return s
Here I paste the headers straight into the program and read the cookies from a file. Note that the raw cookie string has to be turned into a dict before it is passed to the requests library's get() function, which then returns the page response.
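A quick way to confirm the cookies work (a hypothetical check, assuming the cookie file is present and still valid):

resp = login("https://book.douban.com/")
print(resp.status_code)   # 200 means the request went through
print(len(resp.text))     # a non-trivial length suggests we got a real page back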
Step 3
First go to the Douban Books category index:
https://book.douban.com/tag/?view=type&icn=index-sorttags-all
and scrape all the category links on that page:
import requests
from bs4 import BeautifulSoup

url = "https://book.douban.com/tag/?icn=index-nav"
web = requests.get(url)
soup = BeautifulSoup(web.text, "lxml")
tags = soup.select("#content > div > div.article > div > div > table > tbody > tr > td > a")
urls = []
for tag in tags:
    tag = tag.get_text()
    helf = "https://book.douban.com/tag/"
    url = helf + str(tag)
    urls.append(url)
with open("channel.txt", "w") as file:
    for link in urls:
        file.write(link + '\n')
The code above uses a CSS selector. You don't need to know CSS: open the page in a browser, open the developer tools, right-click the element you want in the Elements panel and choose Copy -> Copy selector (I use Chrome, where right-clicking an element on the rendered page and choosing Inspect jumps straight to its position in Elements), then paste the copied selector into your code. Note that any :nth-child(*) parts have to be removed, otherwise the selector will fail.
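If you copy selectors often, stripping the :nth-child(...) parts can be automated. Below is a small hypothetical helper (clean_selector is my own name, not part of the original script) that does it with a regular expression:

import re

def clean_selector(selector):
    # Drop every ":nth-child(...)" so the selector matches all siblings,
    # not just the single element that was right-clicked.
    return re.sub(r':nth-child\(\d+\)', '', selector)

copied = "#content > div > div.article > div:nth-child(2) > div > table > tbody > tr > td > a"
print(clean_selector(copied))
# #content > div > div.article > div > div > table > tbody > tr > td > a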
That gives us the file of category links:
Step 4
Now let's work out how to extract the data.
Using the CSS-selector approach described above, we can get the title, author, translator, number of ratings, rating, and the book's cover URL and description:
title = bookSoup.select('#wrapper > h1 > span')[0].contents[0]
title = deal_title(title)
author = get_author(bookSoup.select("#info > a")[0].contents[0])
translator = bookSoup.select("#info > span > a")[0].contents[0]
person = bookSoup.select("#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span")[0].contents[0]
scor = bookSoup.select("#interest_sectl > div > div.rating_self.clearfix > strong")[0].contents[0]
coverUrl = bookSoup.select("#mainpic > a > img")[0].attrs['src']
brief = get_brief(bookSoup.select('#link-report > div > div > p'))
A few things to watch out for:
- File names may not contain :?<>"|\/* , so clean the title with a regular expression:
def deal_title(raw_title):
    r = re.compile('[/\*?"<>|:]')
    return r.sub('~', raw_title)
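For instance, with a made-up title:

print(deal_title('C++ Primer: 5th Edition'))   # -> C++ Primer~ 5th Edition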
Then download the cover:
path = "C:/Users/lenovo/OneDrive/projects/Scraping/covers/" + title + ".png"
urlretrieve(coverUrl, path)
- The author name as scraped contains extra newlines and indentation and has to be reformatted, otherwise it looks terrible:
def get_author(raw_author):
    parts = raw_author.split('\n')
    return ''.join(map(str.strip, parts))
The description paragraphs are joined in a similar way:
def get_brief(line_tags):
    brief = line_tags[0].contents
    for tag in line_tags[1:]:
        brief += tag.contents
    brief = '\n'.join(brief)
    return brief
The publisher, publication date, ISBN and list price can be obtained with the more concise approach below:
info = bookSoup.select('#info')
infos = list(info[0].strings)
publish = infos[infos.index('出版社:') + 1]
ISBN = infos[infos.index('ISBN:') + 1]
Ptime = infos[infos.index('出版年:') + 1]
price = infos[infos.index('定價(jià):') + 1]
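Note that infos.index() raises a ValueError when a page lacks one of these labels. A small helper (get_field is my own hypothetical addition, not part of the original script) makes the lookup tolerant of missing fields:

def get_field(infos, label, default=""):
    # Return the string following `label` in the #info block,
    # or `default` when the label is missing from this book page.
    try:
        return infos[infos.index(label) + 1]
    except (ValueError, IndexError):
        return default

publish = get_field(infos, '出版社:')
price = get_field(infos, '定價(jià):')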
Step 5
First create the database and the table:
CREATE TABLE `allbooks` (
  `title` char(255) NOT NULL,
  `scor` char(255) DEFAULT NULL,
  `author` char(255) DEFAULT NULL,
  `price` char(255) DEFAULT NULL,
  `time` char(255) DEFAULT NULL,
  `publish` char(255) DEFAULT NULL,
  `person` char(255) DEFAULT NULL,
  `yizhe` char(255) DEFAULT NULL,
  `tag` char(255) DEFAULT NULL,
  `brief` mediumtext,
  `ISBN` char(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Then use the executemany() method to store the data conveniently:
connection = pymysql.connect(host='your host', user='your username', password='your password', charset='utf8')
with connection.cursor() as cursor:
    sql = "USE DOUBAN_DB;"
    cursor.execute(sql)
    sql = '''INSERT INTO allbooks (title, scor, author, price, time, publish, person, yizhe, tag, brief, ISBN)
             VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'''
    cursor.executemany(sql, data)
    connection.commit()
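executemany() expects data to be a sequence of rows whose order matches the column list in the INSERT statement; in the full script below each row is appended like this:

data.append([title, scor, author, price, Ptime, publish, person, translator, tag, brief, ISBN])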
Step 6
At this point all the pieces are in place; all that's left is to put them together into a complete program.
One more thing to note: add a random delay between requests, or your IP may get banned.
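In the script this is just a randomized sleep after each category page (0-9 seconds is what the code below uses; adjust as you see fit):

time.sleep(random.randint(0, 9))   # pause 0-9 seconds between category pages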
The full code is below; it is also updated on GitHub, stars welcome: my GitHub link.
"""
Created on Sat Aug 12 13:29:17 2017@author: Throne
"""
import requests
from bs4
import BeautifulSoup
import time
import pymysql
import random
from urllib.request
import urlretrieve
import re connection = pymysql.connect(host=
'localhost',user=
'root',password=
'',charset=
'utf8')
with connection.cursor()
as cursor:sql =
"USE DOUBAN_DB;"cursor.execute(sql)
connection.commit()
def deal_title(raw_title):r = re.compile(
'[/\*?"<>|:]')
return r.sub(
'~',raw_title)
def get_brief(line_tags):brief = line_tags[
0].contents
for tag
in line_tags[
1:]:brief += tag.contentsbrief =
'\n'.join(brief)
return brief
def get_author(raw_author):parts = raw_author.split(
'\n')
return ''.join(map(str.strip,parts))
def login(url):cookies = {}
with open(
"C:/Users/lenovo/OneDrive/projects/Scraping/doubancookies.txt")
as file:raw_cookies = file.read();
for line
in raw_cookies.split(
';'):key,value = line.split(
'=',
1)cookies[key] = valueheaders = {
'User-Agent':
'''Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'''}s = requests.get(url, cookies=cookies, headers=headers)
return s
def crawl():channel = []
with open(
'C:/Users/lenovo/OneDrive/projects/Scraping/channel.txt')
as file:channel = file.readlines()
for url
in channel:data = [] web_data = login(url.strip())soup = BeautifulSoup(web_data.text.encode(
'utf-8'),
'lxml')tag = url.split(
"?")[
0].split(
"/")[-
1]books = soup.select(
'''#subject_list > ul > li > div.info > h2 > a''')
for book
in books:bookurl = book.attrs[
'href']book_data = login(bookurl)bookSoup = BeautifulSoup(book_data.text.encode(
'utf-8'),
'lxml')info = bookSoup.select(
'#info')infos = list(info[
0].strings)
try:title = bookSoup.select(
'#wrapper > h1 > span')[
0].contents[
0]title = deal_title(title)publish = infos[infos.index(
'出版社:') +
1]translator = bookSoup.select(
"#info > span > a")[
0].contents[
0]author = get_author(bookSoup.select(
"#info > a")[
0].contents[
0])ISBN = infos[infos.index(
'ISBN:') +
1]Ptime = infos[infos.index(
'出版年:') +
1]price = infos[infos.index(
'定價(jià):') +
1]person = bookSoup.select(
"#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span")[
0].contents[
0]scor = bookSoup.select(
"#interest_sectl > div > div.rating_self.clearfix > strong")[
0].contents[
0]coverUrl = bookSoup.select(
"#mainpic > a > img")[
0].attrs[
'src'];brief = get_brief(bookSoup.select(
'#link-report > div > div > p'))
except :
try:title = bookSoup.select(
'#wrapper > h1 > span')[
0].contents[
0]title = deal_title(title)publish = infos[infos.index(
'出版社:') +
1]translator =
""author = get_author(bookSoup.select(
"#info > a")[
0].contents[
0])ISBN = infos[infos.index(
'ISBN:') +
1]Ptime = infos[infos.index(
'出版年:') +
1]price = infos[infos.index(
'定價(jià):') +
1]person = bookSoup.select(
"#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span")[
0].contents[
0]scor = bookSoup.select(
"#interest_sectl > div > div.rating_self.clearfix > strong")[
0].contents[
0]coverUrl = bookSoup.select(
"#mainpic > a > img")[
0].attrs[
'src'];brief = get_brief(bookSoup.select(
'#link-report > div > div > p'))
except:
continuefinally:path =
"C:/Users/lenovo/OneDrive/projects/Scraping/covers/"+title+
".png"urlretrieve(coverUrl,path);data.append([title,scor,author,price,Ptime,publish,person,translator,tag,brief,ISBN])
with connection.cursor()
as cursor:sql =
'''INSERT INTO allbooks (
title, scor, author, price, time, publish, person, yizhe, tag, brief, ISBN)
VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'''cursor.executemany(sql, data)connection.commit()
del datatime.sleep(random.randint(
0,
9)) start = time.clock()
crawl()
end = time.clock()
with connection.cursor()
as cursor:print(
"Time Usage:", end -start)count = cursor.execute(
'SELECT * FROM allbooks')print(
"Total of books:", count)
if connection.open:connection.close()
Results
This is an original article; please contact the author before reposting.
Reference: http://www.jianshu.com/p/6c060433facf?appinstall=0