
A Homemade Crawler Example: Scraping a Site's Images and Descriptions

Published: 2025/1/21, by 豆豆

# The source file encoding must be UTF-8

#coding=UTF-8
# urllib2 fetches the page; BeautifulSoup (3.x) is a very handy HTML parser
import urllib2
import re
from BeautifulSoup import *

# Open the page and read its content
c = urllib2.urlopen('http://xxxxxx.html')
soup = BeautifulSoup(c.read())

# Read the author: it is embedded in the share link's onclick handler
user = soup.find('a', onclick=re.compile('shareRec'))['onclick']
regex = ur"發現:.*的#"
match = re.search(regex, user)
user = match.group()
user = user[3:]              # strip the leading "發現:"
user = user[:len(user) - 2]  # strip the trailing "的#"
print "author: " + user

# Read the date and carve out its fields by position
date = soup.find('span', attrs={'class': 'date m_l_5'}).text
year = date[:4]
month = date[5:7]
day = date[8:10]
hour = date[11:13]
minute = date[14:16]
second = date[17:19]
print "date: " + date
print "year:" + year
print "month:" + month
print "day:" + day
print "hour:" + hour
print "minute:" + minute
print "second:" + second

# Read the title and the area
title = soup.find('div', attrs={'class': 'Mztit'}).text
print "title:" + title
areaid = soup.find('a', href=re.compile('mddid')).attrs[0][1]
areaid = areaid[20:]
area = soup.findAll('a', href=re.compile('mddid=' + areaid))[1].text
partid = soup.find('a', href=re.compile('travel-scenic-spot')).text
part = partid[:len(partid) - 4]
print "area:" + area
print "part:" + part

# Read the description: the text content that precedes the first image
description = soup.find('div', attrs={'id': 'pnl_contentinfo'})
des = description.contents
length = len(des)
descrip = " "
for d in des:
    try:
        # d.find("img") is -1 (for a string) or None (for a tag) when d holds no <img>
        if not (d.find("img") == -1 or d.find("img") == None):
            if length < 4:
                des_i = d.contents
                for i in des_i:
                    try:
                        if not (i.find("img") == -1 or i.find("img") == None):
                            break
                        else:
                            descrip = descrip + i.text
                    except:
                        i = i.strip()
                        if not (i.find("img") == -1 or i.find("img") == None):
                            break
                        else:
                            descrip = descrip + i
            leng = len(d.contents)
            if leng > 15:
                descrip = descrip + d.text
            break
        else:
            descrip = descrip + d.text
    except:
        pass
print "description:" + descrip

# The key part: read each image together with the text before it
data = soup.findAll('div', attrs={'vaname': user})
txt = [""]
image = [""]
for d in data:
    have_jpg = d.find('img', attrs={'src': re.compile('jpeg')})
    start = "false"
    temp_txt = ""
    if have_jpg != None:
        content = d.contents
        for x in content:
            try:
                if (not (x.find("img") == -1 or x.find("img") == None)) and start == "false":
                    start = "true"
                if start == "true":
                    t = x.contents
                    for tt in t:
                        try:
                            if tt.find('img', src=re.compile('http.*jpeg')) == None:
                                temp_txt = temp_txt + tt.text
                            else:
                                txt.append(temp_txt)
                                image.append(tt.find('img', src=re.compile('http.*jpeg'))['src'])
                                temp_txt = ""
                        except:
                            ttt = tt.strip()
                            temp_txt = temp_txt + ttt
            except:
                pass
print len(txt)
print len(image)
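The author extraction above slices fixed character counts off a regex match, and the date is carved out with six hand-counted slices. A stdlib-only Python 3 sketch of the same two steps, using a capture group and `datetime.strptime` instead; the `onclick` value and date string here are illustrative stand-ins, not data from the target site:

```python
import re
from datetime import datetime

# Illustrative onclick value in the shape the script's regex expects
onclick = "shareRec('發現:SomeAuthor的#')"

# A capture group pulls the author out directly, with no slicing arithmetic
match = re.search("發現:(.*?)的#", onclick)
author = match.group(1)

# Parse "YYYY-MM-DD HH:MM:SS" in one step instead of six fixed slices
date_text = "2012-03-15 08:30:45"
dt = datetime.strptime(date_text, "%Y-%m-%d %H:%M:%S")

print("author:", author)
print("year:", dt.year, "month:", dt.month, "second:", dt.second)
```

`strptime` also raises a clear `ValueError` on malformed dates, whereas fixed slices silently return garbage.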

Reposted from: https://blog.51cto.com/tiandinanyu/810169
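The final loop of the script accumulates text until it meets a JPEG `<img>`, then emits the buffered text and the image URL as a pair. The same pattern can be sketched with only the standard library's `html.parser`, avoiding the nested try/except probing; the sample HTML below is made up for illustration:

```python
from html.parser import HTMLParser

class ImgTextPairs(HTMLParser):
    """Collect (preceding_text, img_src) pairs, mirroring the txt/image lists."""
    def __init__(self):
        super().__init__()
        self.buffer = ""
        self.pairs = []

    def handle_data(self, data):
        # Accumulate text until the next image is seen
        self.buffer += data.strip()

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src", "")
            if src.endswith(".jpeg"):  # the original also filters on 'jpeg' URLs
                self.pairs.append((self.buffer, src))
                self.buffer = ""

html = "<div>first caption<img src='a.jpeg'>second caption<img src='b.jpeg'></div>"
parser = ImgTextPairs()
parser.feed(html)
for text, src in parser.pairs:
    print(text, "->", src)
```

Each emitted pair holds the text that appeared between the previous image and this one, which is exactly what the `txt`/image lists in the script accumulate.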

Summary

The above is the full content of this homemade crawler example for scraping a site's images and descriptions, as collected by 生活随笔; hopefully it helps you solve the problem you ran into.
