[Python Crawler Notes] CAPTCHAs

Published: 2024/1/8


Tesseract installers: Index of /tesseract (uni-mannheim.de)

Tesseract language data: https://github.com/tesseract-ocr/tessdata

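Python CAPTCHAs

Digit-and-letter CAPTCHAs

Installing pytesseract

The pillow library

Filter

Format

Recognizing CAPTCHAs with noise

Getting an Access Token

Recognizing CAPTCHAs through the Baidu module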

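Python CAPTCHAs

If you are a data-mining enthusiast, CAPTCHAs are a pit you cannot avoid, and fighting every kind of CAPTCHA is part of how you grow.

This article walks through a few introductory CAPTCHA examples. Come on, let's get coding.

Digit-and-letter CAPTCHAs

I picked a random CAPTCHA from a Baidu image search; it reads 7364.

Today we use the simplest approach to CAPTCHA recognition: pytesseract, one of the simpler OCR libraries available in Python.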

Installing pytesseract

See also: A very detailed fix for pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in yo...

See also: Tesseract OCR V5.0 installation tutorial (Windows

Before using pytesseract, install two modules with pip: the pytesseract library and the image-processing library pillow.

    pip install pytesseract
    pip install pillow

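With only these two libraries installed, a recognition script will usually fail with the following error:

    pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your path

This is because one piece is still missing: the Tesseract-OCR program itself. pytesseract is only a wrapper that calls it. Tesseract is an open-source OCR engine whose development Google sponsored for many years.

Download: https://github.com/tesseract-ocr/tesseract/wiki

Chinese language data: https://github.com/tesseract-ocr/tessdata

Pick the version you need and download it.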

The pillow library

Command and meaning

open(): opens an image.

    from PIL import Image
    im = Image.open("1.png")
    im.show()

save(): saves the image to a file.

convert(): a method of an image instance. It takes a mode argument that specifies the color mode. mode can be one of:

· 1 (1-bit pixels, black and white, stored with one pixel per byte)
· L (8-bit pixels, grayscale)
· P (8-bit pixels, mapped to any other mode using a colour palette)
· RGB (3x8-bit pixels, true colour)
· RGBA (4x8-bit pixels, true colour with transparency mask)
· CMYK (4x8-bit pixels, colour separation)
· YCbCr (3x8-bit pixels, colour video format)
· I (32-bit signed integer pixels)
· F (32-bit floating point pixels)

Filter

    from PIL import Image, ImageFilter

    im = Image.open('1.png')
    # Each filter() call returns a new image; im itself is not modified.
    # Gaussian blur
    im.filter(ImageFilter.GaussianBlur)
    # Simple blur
    im.filter(ImageFilter.BLUR)
    # Edge enhancement
    im.filter(ImageFilter.EDGE_ENHANCE)
    # Edge detection
    im.filter(ImageFilter.FIND_EDGES)
    # Emboss
    im.filter(ImageFilter.EMBOSS)
    # Contour
    im.filter(ImageFilter.CONTOUR)
    # Sharpen
    im.filter(ImageFilter.SHARPEN)
    # Smooth
    im.filter(ImageFilter.SMOOTH)
    # Detail
    im.filter(ImageFilter.DETAIL)

Format

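The format attribute gives the format of the source file; if the image was not opened from a file, it is None.
The size attribute is a tuple holding the image's width and height in pixels.
The mode attribute gives the image's mode. Common modes are L for grayscale, RGB for true color, and CMYK for pre-press images. If a file cannot be opened, Image.open() raises an IOError.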
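The three attributes can be checked on an image built in memory. This is a minimal sketch, assuming Pillow is installed; the 120x40 size and the white fill are arbitrary illustration values, not taken from the article's CAPTCHA.

```python
import io

from PIL import Image

# Build a small RGB image in memory so the sketch needs no file on disk.
generated = Image.new("RGB", (120, 40), color=(255, 255, 255))
print(generated.format, generated.size, generated.mode)  # format is None here

# Round-trip through an in-memory PNG so that `format` gets populated.
buffer = io.BytesIO()
generated.save(buffer, format="PNG")
buffer.seek(0)
reopened = Image.open(buffer)
print(reopened.format, reopened.size, reopened.mode)

# convert() returns a new image in the requested mode.
gray = reopened.convert("L")
print(gray.mode)
```

The same attributes are a quick way to confirm what a downloaded CAPTCHA actually is before any preprocessing.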

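CAPTCHA recognition

If the error persists after installation, locate the module file pytesseract.py and edit it.

It is usually at C:\Program Files\Python36\Lib\site-packages\pytesseract\pytesseract.py.

In that file, change tesseract_cmd = 'tesseract' to your own install path, for example tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'. Use a raw string (the r prefix); otherwise the \t in \tesseract.exe is read as a tab character.

If you hit the following error, read on:

    Error opening data file \Program Files (x86)\Tesseract-OCR\tessdata/chi_sim.traineddata Please make sure the TESSDATA_PREFIX environment variable

The fix is straightforward. As the message says, the TESSDATA_PREFIX environment variable is missing, so add it to the system environment variables:

    TESSDATA_PREFIX=C:\Program Files (x86)\Tesseract-OCR

Restart your IDE or open a new CMD window, then run the code again. Note that you may need to run your .py script as administrator.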
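An alternative to editing the installed pytesseract.py is to set pytesseract.pytesseract.tesseract_cmd from your own script. The sketch below is one possible way to do that; find_tesseract and configure_pytesseract are hypothetical helper names, and the candidate paths are whatever install locations you expect on your machine.

```python
import shutil
from pathlib import Path


def find_tesseract(candidates):
    """Return the first usable tesseract executable, or None if none is found."""
    # Prefer whatever is already reachable through PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise fall back to the explicit install locations passed in.
    for candidate in candidates:
        if Path(candidate).is_file():
            return str(candidate)
    return None


def configure_pytesseract(candidates):
    """Point pytesseract at the located executable (requires pytesseract)."""
    import pytesseract  # imported lazily so find_tesseract() works without it

    command = find_tesseract(candidates)
    if command is None:
        raise FileNotFoundError("tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = command
    return command
```

Calling configure_pytesseract([r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"]) once at startup keeps the fix inside the project, so it survives a reinstall of the pytesseract package.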

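The steps are:

Open the image with Image.open()

Recognize the image with pytesseract

    import pytesseract
    from PIL import Image


    def main():
        image = Image.open("1.jpg")
        # lang selects the trained language data; chi_sim is Simplified Chinese
        text = pytesseract.image_to_string(image, lang="chi_sim")
        print(text)


    if __name__ == '__main__':
        main()

English and digits are recognized with basically no problems, but Chinese is dreadful; only text with wide spacing gets recognized. Not great.
Of course, the 7364 from earlier was recognized with ease.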
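The string returned by image_to_string may carry spaces, line breaks, or stray punctuation around the characters you want. A small cleanup step helps before comparing or submitting the result. This is a sketch with a hypothetical helper name, clean_captcha_text, and it assumes the CAPTCHA contains only digits and ASCII letters.

```python
import re


def clean_captcha_text(raw, allowed=r"0-9A-Za-z"):
    """Keep only the characters a digit-and-letter CAPTCHA can contain."""
    return re.sub(f"[^{allowed}]", "", raw)


# Simulated OCR output with spaces and a trailing newline.
print(clean_captcha_text(" 73 64\n"))
```

For a CAPTCHA with a different alphabet, pass a different character class through the allowed parameter.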

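Recognizing CAPTCHAs with noise

Next we try a CAPTCHA with interference. As before, we first run the same code on it directly, and it prints nothing. The image therefore needs preprocessing.

The basic principle is always the same:

Color to grayscale

Grayscale to binary

Recognize the binary image

Color to grayscale:

    im = im.convert('L')

Grayscale to binary. The solution is fairly routine: threshold segmentation, where threshold is the cut-off point.

    def initTable(threshold=140):
        # Lookup table: gray levels below the threshold become 0, the rest 1
        table = []
        for i in range(256):
            if i < threshold:
                table.append(0)
            else:
                table.append(1)
        return table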

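Calling it:

    # point() maps every pixel through the lookup table
    binaryImage = im.point(initTable(), '1')
    binaryImage.show()

After the adjustment, the image is pure black and white.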
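To see what the two steps do to individual pixels, here is a pure-Python sketch that works on a handful of RGB tuples instead of a real image. The gray formula approximates the ITU-R 601-2 luma transform that Pillow documents for convert('L'); the patch values are hypothetical, and integer division is used here, so results can differ by one gray level from Pillow's own rounding.

```python
def to_gray(rgb):
    """Approximate luma: L = R*299/1000 + G*587/1000 + B*114/1000."""
    r, g, b = rgb
    return (r * 299 + g * 587 + b * 114) // 1000


def init_table(threshold=140):
    """Lookup table: gray levels below the threshold map to 0, the rest to 1."""
    return [0 if level < threshold else 1 for level in range(256)]


def binarize(pixels, threshold=140):
    """Turn rows of RGB tuples into rows of 0 (dark) and 1 (light)."""
    table = init_table(threshold)
    return [[table[to_gray(rgb)] for rgb in row] for row in pixels]


# A hypothetical 2x3 patch: dark glyph pixels, light background, gray noise.
patch = [
    [(20, 20, 20), (250, 250, 250), (150, 160, 170)],
    [(255, 255, 255), (90, 80, 100), (0, 0, 0)],
]
for row in binarize(patch):
    print(row)
```

The noise pixel (150, 160, 170) has a gray level of 158, so with the default threshold of 140 it is pushed to the background. Lowering or raising the threshold is the main knob for deciding which faint marks survive.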

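Recognizing CAPTCHAs through Baidu's API

This part uses an OCR API opened up by a third-party AI platform. OCR text recognition is fairly mature by now and there are many providers; here we use Baidu's.

Getting an Access Token

A program must generate an access_token before it can call the API.

An Access Token is used in token-based authentication and allows an application to access a resource API.

First create an application:

Baidu AI Cloud management console (baidu.com)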

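Request URL format

Send a request (POST recommended) to the authorization endpoint https://aip.baidubce.com/oauth/2.0/token with the following parameters in the URL:

grant_type: required; fixed to client_credentials.
client_id: required; the application's API Key.
client_secret: required; the application's Secret Key.

Notes:

The API Key and Secret Key can both be found in the application list on the overview page of each technology area in the Baidu AI Cloud console; if you have no application yet, create one first.
The API Key and Secret Key authenticate your API calls. Keep them secret and never show them in plain text in public documents or code, or your account may be hijacked.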
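Since the keys must not appear in code, one option is to read them from environment variables and build the token URL from the three parameters listed above. This is a minimal sketch: build_token_url and load_credentials are hypothetical helpers, and the variable names BAIDU_API_KEY and BAIDU_SECRET_KEY are assumptions, not something Baidu prescribes.

```python
import os
from urllib.parse import urlencode

TOKEN_ENDPOINT = "https://aip.baidubce.com/oauth/2.0/token"


def build_token_url(api_key, secret_key):
    """Attach the three required parameters to the authorization endpoint."""
    params = {
        "grant_type": "client_credentials",
        "client_id": api_key,
        "client_secret": secret_key,
    }
    return f"{TOKEN_ENDPOINT}?{urlencode(params)}"


def load_credentials(environ=os.environ):
    """Read the keys from environment variables (the names are hypothetical)."""
    return environ["BAIDU_API_KEY"], environ["BAIDU_SECRET_KEY"]


# Placeholder values, only to show the shape of the resulting URL.
print(build_token_url("demo-ak", "demo-sk"))
```

The resulting URL can then be passed to requests.post(), and the access_token field read from the JSON reply.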

Baidu's official sample code for getting an access_token:

    # encoding:utf-8
    import requests

    # client_id is the API Key from the console, client_secret is the Secret Key
    host = 'https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id=[API Key from the console]&client_secret=[Secret Key from the console]'
    response = requests.get(host)
    if response:
        print(response.json())

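Calling Baidu's general OCR API

With an access_token you can start sending recognition requests.
HTTP method: POST
Request URL: https://aip.baidubce.com/rest/2.0/ocr/v1/general_basic
Request header: Content-Type: application/x-www-form-urlencoded
Request parameters:

image: the image data, base64-encoded. The base64-encoded size must not exceed 4M, the shortest side must be at least 15px, and the longest side at most 4096px. jpg/png/bmp formats are supported. When the image field is present, the url field is ignored.
url: the full URL of the image, at most 1024 bytes long. The image behind the URL must not exceed 4M after base64 encoding, with the shortest side at least 15px and the longest side at most 4096px. jpg/png/bmp formats are supported. When the image field is present, the url field is ignored.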
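The limits on the image parameter can be checked locally before spending an API call. The sketch below is an illustration under assumptions: it reads "4M" as 4 * 1024 * 1024 bytes, takes width and height as plain numbers (for example from Pillow's size attribute), and uses hypothetical helper names.

```python
import base64

MAX_BASE64_BYTES = 4 * 1024 * 1024  # assumption: "4M" read as 4 MiB
MIN_SIDE = 15
MAX_SIDE = 4096


def encode_image(raw_bytes):
    """Base64-encode raw image bytes into an ASCII string for the image field."""
    return base64.b64encode(raw_bytes).decode("ascii")


def check_image(raw_bytes, width, height):
    """Return the list of violated limits; an empty list means the image fits."""
    problems = []
    if len(encode_image(raw_bytes)) > MAX_BASE64_BYTES:
        problems.append("base64 payload larger than 4M")
    if min(width, height) < MIN_SIDE:
        problems.append("shortest side under 15px")
    if max(width, height) > MAX_SIDE:
        problems.append("longest side over 4096px")
    return problems


# Dummy bytes stand in for real image data.
print(check_image(b"\x00" * 100, 120, 40))
print(check_image(b"\x00" * 100, 10, 5000))
```

Base64 grows the data by roughly a third, which is why the check measures the encoded length rather than the file size on disk.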

    # encoding:utf-8
    import base64

    import requests

    # General text recognition
    request_url = "https://aip.baidubce.com/rest/2.0/ocr/v1/general_basic"

    # Open the image file in binary mode and base64-encode it
    with open('[local file]', 'rb') as f:
        img = base64.b64encode(f.read())

    params = {"image": img}
    access_token = '[token returned by the authorization endpoint]'
    request_url = request_url + "?access_token=" + access_token
    headers = {'content-type': 'application/x-www-form-urlencoded'}
    response = requests.post(request_url, data=params, headers=headers)
    if response:
        print(response.json())

Output:

    {'words_result': [{'words': '已只成此'}, {'words': '是生月蝶柱錦'}, {'words': '當煙明望思瑟'}, {'words': '時此珠帝無'}, {'words': '惘是追情'}, {'words': '已情有春年端'}, {'words': '惘可心莊五'}, {'words': '然當憶可'}, {'words': '然待藍托生十'}, {'words': '成杜曉絃'}, {'words': '追日夢'}, {'words': '時待'}, {'words': '憶暖迷絃'}, {'words': '只玉海蝴'}], 'words_result_num': 14, 'direction': 0, 'log_id': 1592394631485448641}
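The recognized text sits in the words field of each entry in words_result. A small helper can pull it out of the response dictionary. This is a sketch: extract_words is a hypothetical name, the sample is a trimmed copy of the response format shown above, and the error_code / error_msg check is an assumption about how failed calls are reported.

```python
def extract_words(result):
    """Join the recognized lines of a general_basic response into one string."""
    # Assumption: failed calls carry "error_code" / "error_msg" instead of results.
    if "error_code" in result:
        raise RuntimeError(
            f"OCR failed: {result['error_code']} {result.get('error_msg', '')}"
        )
    return "\n".join(item["words"] for item in result.get("words_result", []))


# Trimmed sample in the same shape as the response above.
sample = {
    "words_result": [{"words": "已只成此"}, {"words": "時待"}],
    "words_result_num": 2,
    "direction": 0,
    "log_id": 1592394631485448641,
}
print(extract_words(sample))
```

For a single-line CAPTCHA the joined string is usually just the first entry, so the same helper works unchanged.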

Writing it yourself, without Baidu's sample code:

Getting the access_token

    import base64
    import json
    import urllib.parse

    import requests
    from PIL import Image  # used by opt_image() further down


    class GetCode(object):
        def __init__(self):
            self.url = "https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id={}&client_secret={}"
            self.api = "https://aip.baidubce.com/rest/2.0/ocr/v1/general_basic?access_token={}"
            self.header = {"Content-Type": 'application/json; charset=UTF-8'}
            self.key = "YOUR_KEY"
            self.secret = "YOUR_SECRET"

        def get_accesstoken(self):
            # Exchange the API Key and Secret Key for an access_token
            res = requests.post(self.url.format(self.key, self.secret), headers=self.header)
            content = res.text
            if content:
                return json.loads(content)["access_token"]

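The recognition stage

A plain CAPTCHA without interference can be recognized directly. Some CAPTCHAs do carry interference, though, and need basic preprocessing before recognition. We use the same approach as before: grayscale conversion followed by binarization.

        # Methods of the GetCode class, continued
        def init_table(self, threshold=155):
            table = []
            for i in range(256):
                if i < threshold:
                    table.append(0)
                else:
                    table.append(1)
            return table

        def opt_image(self):
            im = Image.open("OIP-C.jpg")
            im = im.convert('L')
            # point() maps every pixel through the lookup table
            im = im.point(self.init_table(), '1')
            im.save('OIP-C.jpg_s.png')
            return "OIP-C.jpg_s.png"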

Calling the recognition API

Here we call Baidu's API directly, without Baidu's module, following the corresponding documentation.

        # Methods of the GetCode class, continued
        def get_file_content(self, file_path):
            # Read the file, base64-encode it, and URL-encode it as form data
            with open(file_path, 'rb') as fp:
                base64_data = base64.b64encode(fp.read())
                s = base64_data.decode()
                data = {}
                data['image'] = s
                encoded_data = urllib.parse.urlencode(data)
                return encoded_data

        def show_code(self):
            image = self.get_file_content(self.opt_image())
            headers = {"Content-Type": "application/x-www-form-urlencoded"}
            res = requests.post(self.api.format(self.get_accesstoken()), headers=headers, data=image)
            print(res.text)

Recognizing CAPTCHAs through the Baidu module

Install the Baidu AI package:

    pip install baidu-aip

    from aip import AipOcr  # AipOcr in the baidu-aip package does image text recognition

    # Constants
    APP_ID = '15736693'
    API_KEY = 'YOUR_KEY'
    SECRET_KEY = 'YOUR_SECRET'

    # Initialize the OCR client
    aipOcr = AipOcr(APP_ID, API_KEY, SECRET_KEY)

    # Image to read
    filePath = "OIP-C.jpg"


    def get_file_content(filePath):
        with open(filePath, 'rb') as fp:
            return fp.read()


    # Options
    options = {
        'detect_direction': 'true',
        'language_type': 'CHN_ENG',
    }

    # Web-image text recognition endpoint
    result = aipOcr.webImage(get_file_content(filePath), options)
    print(result)

Output:

    {'words_result': [{'words': '已只成此'}, {'words': '是生月蝶柱錦'}, {'words': '當煙明望思瑟'}, {'words': '時此珠帝無'}, {'words': '惘是追情'}, {'words': '已情有春年端'}, {'words': '惘可心莊五'}, {'words': '然當憶可'}, {'words': '然待藍托生十'}, {'words': '成杜曉絃'}, {'words': '追日夢'}, {'words': '時待'}, {'words': '憶暖迷絃'}, {'words': '只玉海蝴'}], 'words_result_num': 14, 'direction': 0, 'log_id': 1592394631485448641}

OCR text recognition (baidu.com)
