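Comparing the Content of Two PDFs
Scenario: preventing plagiarism! Comparing Word documents is easy, so this post introduces a technique for comparing the content of PDF files.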
1. Environment
- Windows 10
- Python 3.9.6
2. Approach
Convert the PDF to JPG images, convert the images to text, then compare the text.
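The last step of this approach, comparing the two extracted texts, can be sketched with Python's standard difflib module. This is only an illustration under assumptions: the function name diff_texts and the sample strings are hypothetical, and the original program may compare the text in a different way.

```python
import difflib


def diff_texts(text_a, text_b, name_a="test1.pdf", name_b="test2.pdf"):
    """Return a unified diff of two extracted texts as a list of lines."""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    # lineterm="" keeps the header lines free of trailing newlines,
    # matching the newline-free lines produced by splitlines().
    return list(
        difflib.unified_diff(
            lines_a, lines_b, fromfile=name_a, tofile=name_b, lineterm=""
        )
    )


if __name__ == "__main__":
    # Hypothetical sample texts standing in for text extracted from two PDFs.
    old = "Introduction\nThis report covers PDF comparison.\nEnd"
    new = "Introduction\nThis report covers PDF diffing.\nEnd"
    for line in diff_texts(old, new):
        print(line)
```

Lines starting with "-" appear only in the first text and lines starting with "+" appear only in the second; identical texts give an empty list.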
3. Result
Usage: python diff-pdf.py test1.pdf test2.pdf
The result is as follows:
4. Selected code
Main modules used: pdfminer, and StringIO from io.
4.1 Reading the PDF
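# Imports needed by this method (module paths as in pdfminer.six)
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

def read_pdf(self, file_name):
    # Collect the text boxes and text lines of every page, keyed by page id.
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    pages = {}
    # Pages are parsed lazily, so the file must stay open during the loop.
    with open(file_name, 'rb') as fp:
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(document):
            dictionary = {'textbox': [], 'textline': []}
            interpreter.process_page(page)
            layout = device.get_result()
            for item in layout:
                if isinstance(item, LTTextBox):
                    dictionary['textbox'].append(item)
                    for child in item:
                        if isinstance(child, LTTextLine):
                            dictionary['textline'].append(child)
            pages[layout.pageid] = dictionary
    return pages
4.2 Converting a PDF to text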
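# Imports needed by this method (module paths as in pdfminer.six)
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(self, path, page_no=-1):
    # Extract the text of the single page whose 1-based number equals page_no.
    # With the default page_no=-1 no page matches, so an empty string is returned.
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0      # 0 means no page limit
    caching = True
    pagenos = set()   # an empty set selects all pages
    with open(path, 'rb') as fp:
        pages = PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True)
        for i, page in enumerate(pages, start=1):
            if i != page_no:
                continue
            interpreter.process_page(page)
    device.close()
    text = retstr.getvalue()  # renamed from str to avoid shadowing the built-in
    retstr.close()
    return text
4.3 Comparing PDFs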
# Imports needed by this method (module paths as in pdfminer.six)
from io import StringIO
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

def compare_pdf(self, file1, file2, header_text, x_margin=10, compare_margin=0.2):
    # Render file1 as HTML while comparing each of its text lines with file2.
    # HTMLPrivateConverter and self.compare_textline are not part of pdfminer
    # and are not shown in this excerpt.
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    out = StringIO()
    layoutmode = 'normal'
    scale = 1.3
    fontscale = 1
    html_converter = HTMLPrivateConverter(rsrcmgr, out, scale=scale, layoutmode=layoutmode,
                                          laparams=laparams, fontscale=fontscale,
                                          imagewriter=None, header_text=header_text,
                                          x_margin=x_margin)
    file_dict = self.read_pdf(file2)
    with open(file1, 'rb') as fp:
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            layout = device.get_result()
            html_converter.page_begin(layout)
            # Stop once file2 has no page with the same page id.
            if file_dict.get(layout.pageid) is None:
                break
            compare_page = file_dict[layout.pageid]
            for item in layout:
                if isinstance(item, LTTextBox):
                    html_converter.begin_div('textbox', 1, item.x0 + html_converter.x_margin,
                                             item.y1, item.width, item.height,
                                             item.get_writing_mode())
                    for child in item:
                        if isinstance(child, LTTextLine):
                            self.compare_textline(child, compare_page, html_converter,
                                                  compare_margin)
                            html_converter.put_newline()
                    html_converter.end_div()
            html_converter.page_end()
        device.close()
    return out.getvalue()
4.4 Complete program
Click: full source code
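The compare_textline helper called in 4.3 is not shown in the excerpts above. As a rough idea of what a line-level comparison could look like, here is a minimal sketch that labels one line of text against the lines of the other page using difflib.SequenceMatcher. Everything in it is an assumption made for illustration: the names best_match and classify_line, the tolerance parameter, and the similarity rule are hypothetical and are not taken from the original program.

```python
from difflib import SequenceMatcher


def best_match(line, candidates):
    """Return (ratio, candidate) for the candidate most similar to line."""
    best_ratio, best_text = 0.0, None
    for text in candidates:
        ratio = SequenceMatcher(None, line, text).ratio()
        if ratio > best_ratio:
            best_ratio, best_text = ratio, text
    return best_ratio, best_text


def classify_line(line, candidates, tolerance=0.2):
    """Label a line as 'same', 'changed', or 'missing' relative to candidates.

    tolerance is a hypothetical threshold: a best match whose similarity is
    at least 1 - tolerance counts as an edited version of the same line.
    """
    ratio, _ = best_match(line, candidates)
    if ratio == 1.0:
        return "same"
    if ratio >= 1.0 - tolerance:
        return "changed"
    return "missing"


if __name__ == "__main__":
    # Hypothetical lines standing in for the text lines of the other PDF page.
    other_page = ["hello world", "second line of text"]
    for line in ["hello world", "hello word", "unrelated content"]:
        print(line, "->", classify_line(line, other_page))
```

A real implementation would probably also use the position of each text line on the page, which this sketch ignores.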
Summary