
Python3: scraping couple screen names with more than 70 favorites from a website

Published: 2023/12/15 python 38 豆豆

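This article, collected and organized by 生活随笔, covers scraping couple screen names with more than 70 favorites from a website using Python3. The editor found it worthwhile and is sharing it here for reference.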

# coding=utf-8
# python 3.7
import urllib.request
import urllib.error
import re
import threading
import multiprocessing
from time import sleep

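# Filter the page content and save the matching entries
def dataFilter():
    # Globals are used here for convenience; passing these in as parameters would be better
    global htmlQueue  # task queue
    global some       # log file
    global lock       # file lock
    while True:
        html = htmlQueue.get()  # take one task (page HTML) from the task queue
        if html is None:
            return  # sentinel received: end this thread
        # Extract the data with regular expressions
        favorites = re.compile(r'title="收藏">(\d+)').findall(html)
        # NOTE: the tail of this pattern was lost in the published listing;
        # stopping at the next '<' is an assumption
        content = re.compile(r'id="txt-\d+-\d+">(.*?)<').findall(html)
        # Save the entries with more than 70 favorites
        tmp = []
        with lock:  # write only while no other thread is writing to the file
            for n in range(len(favorites)):
                if int(favorites[n]) > 70:
                    tmp.append(favorites[n])
                    index = n * 2  # each favorites count belongs to two text entries
                    tmp.append(content[index])
                    tmp.append(content[index + 1])
                    some.write(','.join(tmp) + '\n')
                    some.flush()
                    tmp.clear()
        # the lock is released here, once writing is complete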

# Fetch the HTML of a single page
def GetHtml(url):
    # Browser-like request headers
    head = {
        'Accept-Language': 'zh-CN,zh;q=0.9',
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.30 Safari/537.36"
    }
    req = urllib.request.Request(url=url, headers=head)
    # A try block could be added here to handle pages that fail to open
    with urllib.request.urlopen(req) as response:
        return response.read().decode('utf-8')
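The comment inside GetHtml notes that a try block could handle failed requests. One possible shape for that is sketched below; it is not the author's code. The names get_html_with_retry, retries, delay and opener are hypothetical, and the 10-second timeout is an arbitrary choice. The opener parameter exists only so the function can be exercised without network access.

```python
import time
import urllib.request


def get_html_with_retry(url, headers=None, retries=3, delay=2.0,
                        opener=urllib.request.urlopen):
    """Fetch url and return the decoded text, or None if every attempt fails.

    All parameter names besides url are hypothetical additions for this sketch.
    """
    req = urllib.request.Request(url=url, headers=headers or {})
    for attempt in range(1, retries + 1):
        try:
            with opener(req, timeout=10) as response:
                return response.read().decode('utf-8')
        except OSError as err:
            # URLError, HTTPError and socket timeouts are all OSError subclasses
            print(f'attempt {attempt}/{retries} failed for {url}: {err}')
            if attempt < retries:
                time.sleep(delay)  # back off before trying again
    return None
```

A caller would then skip a page when the result is None instead of letting one failed request stop the whole run.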

# URL of the site to scrape
url = 'https://www.woyaogexing.com/name/ql/'

# First page
print(f'Fetching the first page {url}')
parent = GetHtml(url)

# Collect the URLs of all the following pages linked from the current page
page = re.compile(r'/name/ql/(index_\d+\.html)').findall(parent)
page = sorted(set(page))  # deduplicate with a set, then sort (plain string order)

# All page URLs found on the first page
print(page)
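Plain string sorting places index_10.html before index_2.html. If the pages should be visited in numeric order, a key function can pull out the page number. This is only a sketch: page_number and sort_pages are hypothetical helper names, and the index_N.html naming is taken from the pattern used in the script.

```python
import re


def page_number(name):
    """Return the integer in a name like 'index_12.html' (0 if none is found)."""
    match = re.search(r'index_(\d+)\.html', name)
    return int(match.group(1)) if match else 0


def sort_pages(names):
    """Deduplicate page names and sort them by page number."""
    return sorted(set(names), key=page_number)
```

The order only changes which page is fetched first; the set of pages stays the same.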

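# Output (log) file
some = open('some.txt', 'w', encoding='utf-8')
some.write('favorites, signature 1, signature 2\n')

# File-write lock: only one thread may write at a time, so records do not get mixed up
# The default Semaphore value is 1 (a single thread writes to the file)
lock = threading.Semaphore()

# Task queue for the worker threads
# (multiprocessing.Queue also works between threads; queue.Queue is the usual choice for a thread-only program)
htmlQueue = multiprocessing.Queue()

# Container that keeps track of all threads
threadList = []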

# Start 4 threads
# They are daemon threads: when the main thread exits all child threads end, so none is left hanging
maxThreadN = 4
for n in range(maxThreadN):
    tmp = threading.Thread(target=dataFilter, daemon=True)
    threadList.append(tmp)
    tmp.start()

# Submit tasks to the worker task queue
for n in page:
    print(f'Fetching {n}')
    sleep(2)  # wait 2 seconds between requests; we should be gentle with the site
    htmlQueue.put(GetHtml(url + n))

# All tasks are submitted: put one None per thread to end every worker
for n in range(maxThreadN):
    htmlQueue.put(None)

# Wait for all threads to finish
for th in threadList:
    th.join()

# Finally, close the log file
some.close()
input('All tasks finished!')
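The comment at the top of dataFilter says that passing parameters would be better than relying on globals. A minimal sketch of that variant follows, assuming queue.Queue and threading.Lock in place of multiprocessing.Queue and Semaphore. The names data_filter, run_workers, threshold, FAV_RE and TXT_RE are hypothetical. The '<' terminator in TXT_RE is an assumption, since the tail of the original pattern is not visible in the listing.

```python
import queue
import re
import threading

FAV_RE = re.compile(r'title="收藏">(\d+)')
# Assumption: each text entry ends at the next '<'
TXT_RE = re.compile(r'id="txt-\d+-\d+">(.*?)<')


def data_filter(html_queue, out_file, lock, threshold=70):
    """Worker: process pages from html_queue until a None sentinel arrives."""
    while True:
        html = html_queue.get()
        if html is None:
            return
        favorites = FAV_RE.findall(html)
        texts = TXT_RE.findall(html)
        rows = []
        # Pair each favorites count with its two text entries;
        # zip stops early if a page has fewer text entries than expected
        for fav, first, second in zip(favorites, texts[0::2], texts[1::2]):
            if int(fav) > threshold:
                rows.append(','.join((fav, first, second)))
        if rows:
            with lock:  # hold the lock only while writing
                out_file.write('\n'.join(rows) + '\n')
                out_file.flush()


def run_workers(pages, out_file, n_threads=4):
    """Feed already-downloaded pages to n_threads workers and wait for them."""
    html_queue = queue.Queue()
    lock = threading.Lock()
    threads = [
        threading.Thread(target=data_filter,
                         args=(html_queue, out_file, lock), daemon=True)
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for html in pages:
        html_queue.put(html)
    for _ in threads:
        html_queue.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
```

Because the queue, file and lock are passed in explicitly, the worker can be tried against an in-memory io.StringIO buffer and a hand-written HTML snippet before it is pointed at real pages.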
