Python Web Scraping: Automatically Collect Your Wrong Answers from the Huatu and Fenbi Exam Sites (Recommended)
This post is aimed at civil-service exam candidates, and anyone else who practices on Huatu or Fenbi: paste in a practice-session URL and the scripts below automatically collect the questions you got wrong on either site.
Fenbi Website
We collect the wrong answers from the question sets you have already completed.
Open one of your finished practice sessions and capture the network traffic to see where the data lives.
You will find that the data is not visible in the page itself; it actually sits in these two packets:
https://tiku.fenbi.com/api/xingce/questions
https://tiku.fenbi.com/api/xingce/solutions
One returns the questions, the other the explanations. Both URLs only return the current session's data when given a question-set parameter, and that parameter comes from yet another packet, keyed by the second-to-last numeric segment of the page URL.
That packet's URL follows the rule 'https://tiku.fenbi.com/api/xingce/exercises/'+str(id_)+'?app=web&kav=12&version=3.0.0.0', where id_ is that numeric segment.
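For example, the exercise id can be pulled straight out of a pasted practice URL. The URL below is made up, but it follows the path pattern just described:

```python
import re

# A made-up practice URL following the pattern described above
url = 'https://www.fenbi.com/spa/tiku/guide/xingce/xingce/123456789/practice'

# id_ is the second-to-last numeric segment of the path
id_ = re.findall(r'/xingce/xingce/(\d+)/', url)[0]
mid_url = ('https://tiku.fenbi.com/api/xingce/exercises/' + id_
           + '?app=web&kav=12&version=3.0.0.0')
print(mid_url)
```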
Request that packet to get the parameter, then pass it to the two endpoints above (https://tiku.fenbi.com/api/xingce/questions and https://tiku.fenbi.com/api/xingce/solutions).
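The question IDs travel as a single comma-separated `ids` query parameter. You can sanity-check how `requests` encodes that offline, without hitting the server; the IDs here are invented:

```python
import requests

# Hypothetical question IDs; in the real script they come from the exercises packet
ids = '111,222,333'
req = requests.Request(
    'GET',
    'https://tiku.fenbi.com/api/xingce/questions',
    params={'ids': ids},
).prepare()
print(req.url)  # commas in the value get percent-encoded as %2C
```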
One complication: some Fenbi question data is images, embedded both in the stems and in the options, which makes saving to a Word document with the docx library awkward. So instead I build HTML directly and convert it to PDF with pdfkit (search online for setup details; you need to download wkhtmltopdf.exe), which yields a wrong-answer workbook you can read on a tablet or any other device.
(Always send complete request headers, or you will most likely get no data back.)
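For illustration, the headers dict might look like the sketch below; every value is a placeholder and must be replaced with what your own logged-in browser actually sends (visible in DevTools on the captured request):

```python
# Placeholder headers; copy the real values from your own browser session.
# The Cookie is what carries your login, so without it the API is unlikely
# to return your answer data.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Referer': 'https://www.fenbi.com/',
    'Cookie': 'sess=PLACEHOLDER; userid=PLACEHOLDER',
}
```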
The details are in the annotated code below.
###clean up question stems and explanations: strip the flag=\"tex\" markers and leftover backslashes
def jiexi(liebiao):
    new = []
    timu_last = []
    for each in liebiao:
        new.append(re.sub(r'flag=\\"tex\\" ', '', each))
    for each in new:
        timu_last.append(re.sub(r'\\', '', each))
    return timu_last
###clean up the options: strip surrounding tags and append a line break
###(the exact substitution patterns were garbled in the original; <p> and </p> are a reconstruction)
def xuanxiang(liebiao):
    xuanxiang_v2 = []
    xuanxiang_v3 = []
    for each in liebiao:
        a = re.sub('<p>', '', each)
        a = re.sub('</p>', '', a)
        xuanxiang_v2.append(a)
    for each in xuanxiang_v2:
        xuanxiang_v3.append(each + '<br/>')
    return xuanxiang_v3
import requests
import re
import pdfkit
import os
url = str(input("Enter the practice-session URL: "))
###extract this session's exercise id
id_ = re.findall(r'https://www.fenbi.com/spa/tiku.*?/xingce/xingce/(.*?)/',url,re.S)[0]
mid_url = 'https://tiku.fenbi.com/api/xingce/exercises/'+str(id_)+'?app=web&kav=12&version=3.0.0.0'
headers = {
#####fill in your own complete headers here
}
response = requests.get(url=mid_url,headers=headers)
response.encoding = 'utf-8'
page_text = response.text
###extract the question-set parameter (the question IDs)
id_list = re.findall(r'"questionIds":\[(.*?)\],', page_text, re.S)
###extract your own answers
your_answer = re.findall(r'"answer":{"choice":"(.*?)",', page_text, re.S)
###name of this practice session
name = re.findall(r'"name":"(.*?)",', page_text, re.S)[0]
###the packet that actually holds the question data
timu_url = 'https://tiku.fenbi.com/api/xingce/questions'
params = {
'ids': id_list
}
response = requests.get(url=timu_url,headers=headers,params=params)
response.encoding = 'utf-8'
page_text = response.text
###extract the correct answers
true_answer = re.findall('"correctAnswer":{"choice":"(.*?)"',page_text,re.S)
###the packet that holds the explanations
solution_url = 'https://tiku.fenbi.com/api/xingce/solutions'
response = requests.get(url=solution_url,headers=headers,params=params)
response.encoding = 'utf-8'
page_text = response.text
###extract the explanations
solution_list = re.findall(r'"solution":"(.*?)","userAnswer"',page_text,re.S)
solution_last = jiexi(solution_list)
cailiao = []
timu = []
###extract the stems of single questions and of composite (material-based) questions
for each in response.json():
    timu.append(each['content'])
    try:
        cailiao.append(each['material']['content'])
    except (KeyError, TypeError):
        cailiao.append('none')
###extract the option texts
A_option = re.findall(r'"options":\["(.*?)",".*?",".*?",".*?"\]', page_text, re.S)
B_option = re.findall(r'"options":\[".*?","(.*?)",".*?",".*?"\]', page_text, re.S)
C_option = re.findall(r'"options":\[".*?",".*?","(.*?)",".*?"\]', page_text, re.S)
D_option = re.findall(r'"options":\[".*?",".*?",".*?","(.*?)"\]', page_text, re.S)
A_option = xuanxiang(A_option)
B_option = xuanxiang(B_option)
C_option = xuanxiang(C_option)
D_option = xuanxiang(D_option)
A_option = jiexi(A_option)
B_option = jiexi(B_option)
C_option = jiexi(C_option)
D_option = jiexi(D_option)
###build the HTML document
count = 0
###opening scaffold (reconstructed); the charset declaration keeps Chinese text readable
all_content = '<html><head><meta charset="utf-8"></head><body>'
for each in true_answer:
    if each != your_answer[count]:
        ###composite questions: insert the shared material only once
        if cailiao[count] != 'none' and cailiao[count] not in all_content:
            all_content += cailiao[count]
        all_content += str(count+1)
        all_content += '、'
        all_content += timu[count][3:]
        all_content += 'A、'
        all_content += A_option[count]
        all_content += 'B、'
        all_content += B_option[count]
        all_content += 'C、'
        all_content += C_option[count]
        all_content += 'D、'
        all_content += D_option[count]
        all_content += '<br/>'
    count += 1
count = 0
all_content += '<br/>'
for each in true_answer:
    if each != your_answer[count]:
        temp = 'The correct answer to question ' + str(count+1) + ' is '
        all_content += temp
        if true_answer[count] == '0':
            all_content += 'A'
        elif true_answer[count] == '1':
            all_content += 'B'
        elif true_answer[count] == '2':
            all_content += 'C'
        elif true_answer[count] == '3':
            all_content += 'D'
        all_content += solution_last[count]
        all_content += '<br/>'
    count += 1
all_content += '</body></html>'
path_name = name + '.html'
###save as an HTML file
with open(path_name, 'w', encoding='utf-8') as fp:
    fp.write(all_content)
confg = pdfkit.configuration(wkhtmltopdf=r'path to your wkhtmltopdf.exe')
pdfkit.from_file(path_name, name + '.pdf', configuration=confg)  ###convert the HTML file to PDF
print('Wrong-answer PDF saved successfully')
###delete the intermediate HTML file
os.remove(path_name)
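A small aside: Fenbi encodes the chosen option as a string index '0' through '3', so the if/elif chain that maps it to a letter can be collapsed into one expression:

```python
def choice_letter(choice):
    """Map Fenbi's '0'..'3' answer encoding to 'A'..'D'."""
    return 'ABCD'[int(choice)]

print(choice_letter('2'))  # C
```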
Huatu Website
Again, we start from the sessions you have completed, in your answer-record page.
Huatu is a bit different: its data shows up directly when you capture the request.
Requesting that packet gives you the data, and the rest is parsing. This time I store the results in a Word document, but if you find that inconvenient you can build HTML exactly as above.
##imports
import requests
import lxml.etree
import re
import time
import os
from docx import Document
from docx.shared import Inches
from docx.shared import Pt
from docx.oxml.ns import qn
from docx.enum.text import WD_ALIGN_PARAGRAPH
url = str(input("Enter the practice-session URL: "))
headers={
###complete headers required, otherwise you get no data
}
response = requests.get(url = url,headers = headers)
response.encoding='utf-8'
reptext = response.text
tree = lxml.etree.HTML(reptext)  #parse the page source into an element tree
dirName = "考公圖片"
if not os.path.exists(dirName):
    os.mkdir(dirName)  #folder where the scraped images are saved
###extract the explanation summary for each question
###(the pattern's boundary markers were garbled in the original; matching "解析 ... 。" spans is a reconstruction)
jiexi = re.findall(r'解析.*?。', reptext, re.S)
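The explanation-extraction regex can be tried offline on a synthetic snippet. The markup below is invented for illustration, not Huatu's actual page structure:

```python
import re

# Invented snippet imitating an explanation block on the answer-record page
reptext = '<div>解析:B項(xiàng)正確。</div><div>解析:本題選C。</div>'
jiexi = re.findall(r'解析.*?。', reptext, re.S)
print(jiexi)  # one "解析 ... 。" span per question
```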