

Scraping Jianfang Apartment (建方公寓) Listing Data

發(fā)布時間:2024/3/24 编程问答 59 豆豆
生活随笔 收集整理的這篇文章主要介紹了 建方公寓挂牌房源信息爬取 小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

Scraping Jianfang Apartment's listed rental housing

  • Background
  • Full Code
  • Afterword

Background

After scraping Qingke Apartment listings city by city and then Uoko (优客逸家) listings, I found that crawling is actually quite fun, so today I practiced on Jianfang Apartment, and I almost stumbled; let me tell the story slowly. From the first two attempts I learned that the semi-interactive design I used for Qingke works well, so this time the program takes three inputs before it runs: the city name, the city code, and the total number of pages. I rather like this interactive style; it gives a sense of participation. But when I printed the parsed page, it kept telling me that the elements I was looking for were not there. After some digging, I found the problem was the request headers: at first I had constructed only a User-Agent, which the server very likely flagged as a crawler, so I copied the headers verbatim from the Network tab of the browser's developer tools.

```python
header = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Cookie": "_site_id_cookie=1; clientlanguage=zh_CN; SESSION=62a74a27387f4f4a9ca7cf4e45768631; _cookie_city_name=%E5%B9%BF%E5%B7%9E",
    "Host": "www.giantfind.com.cn",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36",
}  # full request headers copied from the browser
```
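Note that these headers hard-code Guangzhou in the `_cookie_city_name` cookie (`%E5%B9%BF%E5%B7%9E` is the URL-encoding of 广州). If you later want to point the crawler at a different city, a small helper can swap that one value while leaving the base dict intact. This helper is my own addition for illustration, not part of the original script, and the abridged header set here stands in for the full dict above:

```python
# Abridged base headers; a real run would carry the full set captured from DevTools.
BASE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36",
    "Host": "www.giantfind.com.cn",
    "Cookie": "_site_id_cookie=1; clientlanguage=zh_CN; _cookie_city_name=%E5%B9%BF%E5%B7%9E",
}

def headers_for_city(encoded_name):
    """Return a copy of the base headers with the city cookie swapped in."""
    headers = dict(BASE_HEADERS)  # shallow copy, so the base dict stays untouched
    headers["Cookie"] = "_site_id_cookie=1; clientlanguage=zh_CN; _cookie_city_name=" + encoded_name
    return headers
```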

After this change, printing the parsed page showed everything I was after, with no more missing-element errors; my mood instantly lifted. The full code follows.

Full Code

```python
# -*- coding: utf-8 -*-
"""
project_name: giantfind
@author: 帅帅de三叔
Created on Tue Aug 6 09:21:11 2019
"""
import requests                 # HTTP requests
from bs4 import BeautifulSoup   # HTML parsing
import urllib.parse             # URL-encode the Chinese city name
import re                       # regex for the price field
import pymysql                  # MySQL access
import time                     # polite delay between requests

host = "http://www.giantfind.com.cn"  # main domain

header = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Cookie": "_site_id_cookie=1; clientlanguage=zh_CN; SESSION=62a74a27387f4f4a9ca7cf4e45768631; _cookie_city_name=%E5%B9%BF%E5%B7%9E",
    "Host": "www.giantfind.com.cn",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36",
}  # full request headers copied from the browser

print("connecting mysql……\n")
db = pymysql.connect(host="localhost", user="root", password="123456",
                     database="giantfind", charset="utf8")  # connect to the database
print("connect successfully\n")
cursor = db.cursor()  # get a cursor
cursor.execute("drop table if exists giantfind_gz")  # recreate the table from scratch

print("start creating table giantfind_gz")
c_sql = """CREATE TABLE giantfind_gz(
    district varchar(8),
    title varchar(20),
    area varchar(6),
    price varchar(6),
    house_type varchar(6),
    floor varchar(6),
    towards_or_style varchar(4),
    address varchar(30)
) Engine=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=UTF8"""
cursor.execute(c_sql)
print("table giantfind_gz has been created, please insert into data\n")

def generate_page(page_num, city, cityCode):  # yield the URL of every listing page
    url = "http://www.giantfind.com.cn/findRoomPc/index_{}.jhtml?city={}&cityCode={}&reservationChannel=21"
    for next_page in range(1, int(page_num) + 1):
        yield url.format(next_page, city, cityCode)

def get_detail_item(page_url):  # scrape every detail page listed on one index page
    response = requests.get(page_url, headers=header)  # request the index page
    time.sleep(1)  # pause 1 second between requests
    soup = BeautifulSoup(response.text, "lxml")  # parse the page
    detail_list = soup.find("div", "content").find("div", class_="list-life list-lifen").findAll(
        "a", class_="list-la list-lb stat")  # all listings on this page
    for content in detail_list:
        detail_url = host + content["href"]  # build the detail-page URL
        answer = requests.get(detail_url, headers=header)  # fetch the detail page
        detail_soup = BeautifulSoup(answer.text, "lxml")  # parse the detail page
        info = detail_soup.find("div", class_="hos-csho")
        district = info.find("p").get_text().replace("建方·家", "").replace("建方·寓", "").strip()  # district
        title = info.find("h2").find("span").get_text()  # listing title
        first_item = info.find("ul", class_="hos-clist").findAll("li")[0].find("i").find("span").get_text()
        area = first_item.split(" ")[1].replace("㎡", "")  # floor area
        house_type = first_item.split(" ")[0]              # layout
        pattern_price = re.compile(r"\d+")                 # digits of the price
        price = re.search(pattern_price, info.find("div").find("strong").get_text()).group(0)  # price
        floor = info.find("ul", class_="hos-clist").findAll("li")[1].find("i").get_text().replace("层", "")  # floor
        towards_or_style = info.find("ul", class_="hos-clist").findAll("li")[2].find("i").get_text().strip()  # orientation
        address = info.find("ul", class_="hos-clist").findAll("li")[4].find("i").get_text().replace(">", "").strip()  # full address
        print(district, title, area, price, house_type, floor, towards_or_style, address)  # quick field check
        insert_data = ("INSERT INTO giantfind_gz"
                       "(district,title,area,price,house_type,floor,towards_or_style,address) "
                       "VALUES(%s,%s,%s,%s,%s,%s,%s,%s)")  # parameterized insert
        giantfind_data = [district, title, area, price, house_type, floor, towards_or_style, address]
        cursor.execute(insert_data, giantfind_data)  # run the insert
        db.commit()  # commit explicitly

def main():  # tie all the functions together
    city = urllib.parse.quote(input("please input city name:"))  # URL-encode the city name
    cityCode = input("please input city code:")
    page_num = input("please input total pages num:")
    for page_link in generate_page(page_num, city, cityCode):
        get_detail_item(page_link)

if __name__ == "__main__":
    main()
```
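Two steps in the script are easy to get subtly wrong and can be exercised offline. The original `generate_page` passed a fourth argument to a three-placeholder format string (`str.format` silently ignores the extra one), and the price is extracted with a bare `\d+` regex. A standalone sketch of both, with made-up sample inputs (the city code `440100` is just an illustrative value):

```python
import re
from urllib.parse import quote

# Listing-index URL template from the site, as used in the crawler above.
URL_TEMPLATE = ("http://www.giantfind.com.cn/findRoomPc/index_{}.jhtml"
                "?city={}&cityCode={}&reservationChannel=21")

def page_urls(page_num, city, city_code):
    """Yield one index URL per page, URL-encoding the Chinese city name."""
    for page in range(1, int(page_num) + 1):
        yield URL_TEMPLATE.format(page, quote(city), city_code)

def parse_price(text):
    """Return the first run of digits in a price string, or None if absent."""
    match = re.search(r"\d+", text)
    return match.group(0) if match else None
```

For Guangzhou, `quote("广州")` gives `%E5%B9%BF%E5%B7%9E`, which is exactly the value the site stores in its `_cookie_city_name` cookie.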

后話

This post is just a record of the request-header problem I ran into, not a line-by-line code walkthrough. The crawler is for learning and exchange only; if it offends anyone, let me know and I will take it down.

Further reading
Scraping Qingke Apartment listings by city
Scraping Uoko (优客逸家) listings
