Taking advantage of some free time over the summer break, I put the bit of Python data-collection knowledge I picked up last semester to the test and wrote a crawler for Douban Books. Here is a summary.
Here is what I set out to do:
1. Log in
2. Fetch the Douban Books category index
3. Enter each category, scrape the first page of books for the title, author, translator, publication date and other details, store them in MySQL, and download the cover images.
Step 1
First, even crawlers should play by the rules, so let's look at Douban's robots.txt:
User-agent: *
Disallow: /subject_search
Disallow: /amazon_search
Disallow: /search
Disallow: /group/search
Disallow: /event/search
Disallow: /celebrities/search
Disallow: /location/drama/search
Disallow: /forum/
Disallow: /new_subject
Disallow: /service/iframe
Disallow: /j/
Disallow: /link2/
Disallow: /recommend/
Disallow: /trailer/
Disallow: /doubanapp/card
Sitemap: https://www.douban.com/sitemap_index.xml
Sitemap: https://www.douban.com/sitemap_updated_index.xml
# Crawl-delay: 5

User-agent: Wandoujia Spider
Disallow: /
And here are the pages I want to crawl:
https://book.douban.com/tag/?view=type&icn=index-sorttags-all
https://book.douban.com/tag/?icn=index-nav
https://book.douban.com/tag/[tag name]
https://book.douban.com/subject/[book ID]/
Good, none of these violate the robots.txt rules, so I can write the code with a clear conscience.
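If you prefer to check this in code, here is a minimal sketch using the standard library's urllib.robotparser (an assumption on my part: the original post checks robots.txt by eye, and I am also assuming the book subdomain serves its rules at https://book.douban.com/robots.txt):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://book.douban.com/robots.txt")   # assumed location of the rules
rp.read()

# The URL patterns this crawler intends to visit (tag names and book IDs vary).
targets = [
    "https://book.douban.com/tag/?view=type&icn=index-sorttags-all",
    "https://book.douban.com/tag/?icn=index-nav",
    "https://book.douban.com/tag/",
]
for t in targets:
    print(t, rp.can_fetch("*", t))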
Step 2
Since I'm writing this anyway, I might as well do it properly, so let's log in to Douban first.
I log in with cookies here: first I used Firefox's handy HttpFox add-on to capture the headers and cookies of a normal browser login.
def login(url):
    cookies = {}
    with open("C:/Users/lenovo/OneDrive/projects/Scraping/doubancookies.txt") as file:
        raw_cookies = file.read()
    for line in raw_cookies.split(';'):
        key, value = line.split('=', 1)
        cookies[key] = value
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'
    }
    s = requests.get(url, cookies=cookies, headers=headers)
    return s
Here I paste the headers straight into the program and read the cookies from a file. Note that the raw cookie string has to be turned into a dict before it is passed to the requests library's get() function, which then returns the page response.
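A quick way to confirm the cookies work (a hypothetical check, assuming the cookie file is present and still valid):

resp = login("https://book.douban.com/")
print(resp.status_code)   # 200 means the request went through
print(len(resp.text))     # a non-trivial length suggests we got a real page back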
Step 3
First go to the Douban Books category index:
https://book.douban.com/tag/?view=type&icn=index-sorttags-all
and scrape all the category links on that page:
import requests
from bs4 import BeautifulSoup

url = "https://book.douban.com/tag/?icn=index-nav"
web = requests.get(url)
soup = BeautifulSoup(web.text, "lxml")
tags = soup.select("#content > div > div.article > div > div > table > tbody > tr > td > a")
urls = []
for tag in tags:
    tag = tag.get_text()
    helf = "https://book.douban.com/tag/"
    url = helf + str(tag)
    urls.append(url)
with open("channel.txt", "w") as file:
    for link in urls:
        file.write(link + '\n')
The code above uses a CSS selector. You don't need to know CSS: open the page in a browser, open the developer tools, right-click the element you want in the Elements panel and choose Copy -> Copy selector (I use Chrome, where right-clicking an element on the rendered page and choosing Inspect jumps straight to its position in Elements), then paste the copied selector into your code. Note that any :nth-child(*) parts have to be removed, otherwise the selector will fail.
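If you copy selectors often, stripping the :nth-child(...) parts can be automated. Below is a small hypothetical helper (clean_selector is my own name, not part of the original script) that does it with a regular expression:

import re

def clean_selector(selector):
    # Drop every ":nth-child(...)" so the selector matches all siblings,
    # not just the single element that was right-clicked.
    return re.sub(r':nth-child\(\d+\)', '', selector)

copied = "#content > div > div.article > div:nth-child(2) > div > table > tbody > tr > td > a"
print(clean_selector(copied))
# #content > div > div.article > div > div > table > tbody > tr > td > a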
That gives us the file of category links:
Step 4
Now let's work out how to extract the data.
Using the CSS-selector approach described above, we can get the title, author, translator, number of ratings, rating, and the book's cover URL and description:
title = bookSoup.select('#wrapper > h1 > span')[0].contents[0]
title = deal_title(title)
author = get_author(bookSoup.select("#info > a")[0].contents[0])
translator = bookSoup.select("#info > span > a")[0].contents[0]
person = bookSoup.select("#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span")[0].contents[0]
scor = bookSoup.select("#interest_sectl > div > div.rating_self.clearfix > strong")[0].contents[0]
coverUrl = bookSoup.select("#mainpic > a > img")[0].attrs['src']
brief = get_brief(bookSoup.select('#link-report > div > div > p'))
A few things to watch out for:
- File names may not contain :?<>"|\/* , so clean the title with a regular expression:
def deal_title(raw_title):
    r = re.compile('[/\*?"<>|:]')
    return r.sub('~', raw_title)
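For instance, with a made-up title:

print(deal_title('C++ Primer: 5th Edition'))   # -> C++ Primer~ 5th Edition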
Then download the cover:
path = "C:/Users/lenovo/OneDrive/projects/Scraping/covers/" + title + ".png"
urlretrieve(coverUrl, path)
- The author name as scraped contains extra newlines and indentation and has to be reformatted, otherwise it looks terrible:
def get_author(raw_author):
    parts = raw_author.split('\n')
    return ''.join(map(str.strip, parts))
The description paragraphs are joined in a similar way:
def get_brief(line_tags):
    brief = line_tags[0].contents
    for tag in line_tags[1:]:
        brief += tag.contents
    brief = '\n'.join(brief)
    return brief
The publisher, publication date, ISBN and list price can be obtained with the more concise approach below:
info = bookSoup.select('#info')
infos = list(info[0].strings)
publish = infos[infos.index('出版社:') + 1]
ISBN = infos[infos.index('ISBN:') + 1]
Ptime = infos[infos.index('出版年:') + 1]
price = infos[infos.index('定價(jià):') + 1]
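Note that infos.index() raises a ValueError when a page lacks one of these labels. A small helper (get_field is my own hypothetical addition, not part of the original script) makes the lookup tolerant of missing fields:

def get_field(infos, label, default=""):
    # Return the string following `label` in the #info block,
    # or `default` when the label is missing from this book page.
    try:
        return infos[infos.index(label) + 1]
    except (ValueError, IndexError):
        return default

publish = get_field(infos, '出版社:')
price = get_field(infos, '定價(jià):')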
Step 5
First create the database and the table:
CREATE TABLE `allbooks` (
  `title` char(255) NOT NULL,
  `scor` char(255) DEFAULT NULL,
  `author` char(255) DEFAULT NULL,
  `price` char(255) DEFAULT NULL,
  `time` char(255) DEFAULT NULL,
  `publish` char(255) DEFAULT NULL,
  `person` char(255) DEFAULT NULL,
  `yizhe` char(255) DEFAULT NULL,
  `tag` char(255) DEFAULT NULL,
  `brief` mediumtext,
  `ISBN` char(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Then use the executemany() method to store the data conveniently:
connection = pymysql.connect(host='your host', user='your username', password='your password', charset='utf8')
with connection.cursor() as cursor:
    sql = "USE DOUBAN_DB;"
    cursor.execute(sql)
    sql = '''INSERT INTO allbooks (title, scor, author, price, time, publish, person, yizhe, tag, brief, ISBN)
             VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'''
    cursor.executemany(sql, data)
    connection.commit()
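executemany() expects data to be a sequence of rows whose order matches the column list in the INSERT statement; in the full script below each row is appended like this:

data.append([title, scor, author, price, Ptime, publish, person, translator, tag, brief, ISBN])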
Step 6
At this point all the pieces are in place; all that's left is to put them together into a complete program.
One more thing to note: add a random delay between requests, or your IP may get banned.
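In the script this is just a randomized sleep after each category page (0-9 seconds is what the code below uses; adjust as you see fit):

time.sleep(random.randint(0, 9))   # pause 0-9 seconds between category pages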
The full code is below; it is also updated on GitHub, stars welcome: my GitHub link.
"""
Created on Sat Aug 12 13:29:17 2017@author: Throne
"""
import requests
from bs4
import BeautifulSoup
import time
import pymysql
import random
from urllib.request
import urlretrieve
import re connection = pymysql.connect(host=
'localhost',user=
'root',password=
'',charset=
'utf8')
with connection.cursor()
as cursor:sql =
"USE DOUBAN_DB;"cursor.execute(sql)
connection.commit()
def deal_title(raw_title):r = re.compile(
'[/\*?"<>|:]')
return r.sub(
'~',raw_title)
def get_brief(line_tags):brief = line_tags[
0].contents
for tag
in line_tags[
1:]:brief += tag.contentsbrief =
'\n'.join(brief)
return brief
def get_author(raw_author):parts = raw_author.split(
'\n')
return ''.join(map(str.strip,parts))
def login(url):cookies = {}
with open(
"C:/Users/lenovo/OneDrive/projects/Scraping/doubancookies.txt")
as file:raw_cookies = file.read();
for line
in raw_cookies.split(
';'):key,value = line.split(
'=',
1)cookies[key] = valueheaders = {
'User-Agent':
'''Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'''}s = requests.get(url, cookies=cookies, headers=headers)
return s
def crawl():channel = []
with open(
'C:/Users/lenovo/OneDrive/projects/Scraping/channel.txt')
as file:channel = file.readlines()
for url
in channel:data = [] web_data = login(url.strip())soup = BeautifulSoup(web_data.text.encode(
'utf-8'),
'lxml')tag = url.split(
"?")[
0].split(
"/")[-
1]books = soup.select(
'''#subject_list > ul > li > div.info > h2 > a''')
for book
in books:bookurl = book.attrs[
'href']book_data = login(bookurl)bookSoup = BeautifulSoup(book_data.text.encode(
'utf-8'),
'lxml')info = bookSoup.select(
'#info')infos = list(info[
0].strings)
try:title = bookSoup.select(
'#wrapper > h1 > span')[
0].contents[
0]title = deal_title(title)publish = infos[infos.index(
'出版社:') +
1]translator = bookSoup.select(
"#info > span > a")[
0].contents[
0]author = get_author(bookSoup.select(
"#info > a")[
0].contents[
0])ISBN = infos[infos.index(
'ISBN:') +
1]Ptime = infos[infos.index(
'出版年:') +
1]price = infos[infos.index(
'定價(jià):') +
1]person = bookSoup.select(
"#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span")[
0].contents[
0]scor = bookSoup.select(
"#interest_sectl > div > div.rating_self.clearfix > strong")[
0].contents[
0]coverUrl = bookSoup.select(
"#mainpic > a > img")[
0].attrs[
'src'];brief = get_brief(bookSoup.select(
'#link-report > div > div > p'))
except :
try:title = bookSoup.select(
'#wrapper > h1 > span')[
0].contents[
0]title = deal_title(title)publish = infos[infos.index(
'出版社:') +
1]translator =
""author = get_author(bookSoup.select(
"#info > a")[
0].contents[
0])ISBN = infos[infos.index(
'ISBN:') +
1]Ptime = infos[infos.index(
'出版年:') +
1]price = infos[infos.index(
'定價(jià):') +
1]person = bookSoup.select(
"#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span")[
0].contents[
0]scor = bookSoup.select(
"#interest_sectl > div > div.rating_self.clearfix > strong")[
0].contents[
0]coverUrl = bookSoup.select(
"#mainpic > a > img")[
0].attrs[
'src'];brief = get_brief(bookSoup.select(
'#link-report > div > div > p'))
except:
continuefinally:path =
"C:/Users/lenovo/OneDrive/projects/Scraping/covers/"+title+
".png"urlretrieve(coverUrl,path);data.append([title,scor,author,price,Ptime,publish,person,translator,tag,brief,ISBN])
with connection.cursor()
as cursor:sql =
'''INSERT INTO allbooks (
title, scor, author, price, time, publish, person, yizhe, tag, brief, ISBN)
VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'''cursor.executemany(sql, data)connection.commit()
del datatime.sleep(random.randint(
0,
9)) start = time.clock()
crawl()
end = time.clock()
with connection.cursor()
as cursor:print(
"Time Usage:", end -start)count = cursor.execute(
'SELECT * FROM allbooks')print(
"Total of books:", count)
if connection.open:connection.close()
Results
This is an original article; please contact the author before reposting.
Reference: http://www.jianshu.com/p/6c060433facf?appinstall=0