Converting docx, ppt, and pdf to txt

Published: 2024/1/1 29 豆豆

Document format conversion

My graduation project needed document format conversion, so I collected some code for it:

Doc and Docx to txt

from win32com import client as wc

word = wc.Dispatch('Word.Application')
doc = word.Documents.Open('H:\\a.docx')
doc.SaveAs('H:\\a.pdf', 17)  # 17 is the WdSaveFormat value for PDF (wdFormatPDF)
doc.SaveAs('H:\\a.txt', 2)   # 2 is the WdSaveFormat value for plain text (wdFormatText)
doc.Close()
word.Quit()
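The numeric codes passed to SaveAs are easy to mix up, so a small lookup keyed on the target extension may help. The following is only a sketch: `SAVE_FORMATS`, `save_format_for`, and `convert_with_word` are hypothetical helper names introduced here, and only the two codes used above (2 and 17) are covered. The conversion itself still needs Windows, Word, and pywin32.

```python
import os

# WdSaveFormat codes used in the snippet above
WD_FORMAT_TEXT = 2   # wdFormatText
WD_FORMAT_PDF = 17   # wdFormatPDF

# Hypothetical lookup table: target extension -> save format code
SAVE_FORMATS = {'.txt': WD_FORMAT_TEXT, '.pdf': WD_FORMAT_PDF}


def save_format_for(path):
    """Return the save format code matching the target file's extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SAVE_FORMATS:
        raise ValueError('unsupported target format: %r' % ext)
    return SAVE_FORMATS[ext]


def convert_with_word(src, dst):
    """Open src in Word and save it as dst (Windows with Word and pywin32 only)."""
    fmt = save_format_for(dst)  # fail early, before Word is launched
    from win32com import client as wc  # imported lazily: Windows-only dependency
    word = wc.Dispatch('Word.Application')
    try:
        doc = word.Documents.Open(src)
        try:
            doc.SaveAs(dst, fmt)
        finally:
            doc.Close()
    finally:
        # Always shut Word down, even if opening or saving failed
        word.Quit()
```

With this in place, `convert_with_word('H:\\a.docx', 'H:\\a.txt')` would pick format 2 from the file name, and the try/finally blocks keep a stray Word process from being left behind after an error.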

PDF to TXT

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.converter import PDFPageAggregator

# Paths containing Chinese characters are still an unsolved problem
with open('E:\\Final_design\\a.pdf', 'rb') as fp:
    # Create a PDF parser for the file
    parser = PDFParser(fp)
    # Create a PDF document object that stores the document structure
    document = PDFDocument(parser)
    # Check whether the file allows text extraction
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    # Create a PDF resource manager to hold shared resources
    rsrcmgr = PDFResourceManager()
    # Set the layout analysis parameters
    laparams = LAParams()
    # Create a PDF device object that aggregates each page's layout
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    # Create a PDF interpreter object
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Append the extracted text to a.txt as UTF-8
    with open('a.txt', 'a', encoding='utf-8') as f:
        # Process every page
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            # Fetch the LTPage object for this page
            layout = device.get_result()
            for x in layout:
                if isinstance(x, LTTextBoxHorizontal):
                    f.write(x.get_text() + '\n')

PPT to TXT

import win32com.client

ppt = win32com.client.Dispatch('PowerPoint.Application')
ppt.Visible = 1
pptSel = ppt.Presentations.Open("H:\\b.pptx", ReadOnly=1, Untitled=0, WithWindow=0)
# win32com.client.gencache.EnsureDispatch('PowerPoint.Application')

# Walk every slide and every shape on it; gbk handles Chinese text well
with open("H:\\b.txt", "w", encoding="gbk") as f:
    slide_count = pptSel.Slides.Count
    for i in range(1, slide_count + 1):
        shape_count = pptSel.Slides(i).Shapes.Count
        print(shape_count)
        for j in range(1, shape_count + 1):
            if pptSel.Slides(i).Shapes(j).HasTextFrame:
                s = pptSel.Slides(i).Shapes(j).TextFrame.TextRange.Text
                f.write(s + "\n")
ppt.Quit()

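The Word and PowerPoint conversions above rely on Python's win32com library, which launches the Office application at run time. Word stays in the background, while the PPT conversion visibly opens PowerPoint and then closes it again. This is a security concern: malicious code hidden in a file through a file-format vulnerability may run when the file is opened.

A safer approach is therefore to parse the file format directly, extract the text, and save it as txt.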
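To illustrate what parsing the format directly can look like: a .docx file is a ZIP archive whose main text is stored in `word/document.xml`, so the paragraph text can be read with the standard library alone, without starting Word. This is a minimal sketch under that assumption; `docx_to_text` is a hypothetical helper name, and tabs, explicit line breaks, headers, footers, and text boxes are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the text of a .docx file, one paragraph per line.

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    # Each <w:p> is a paragraph; its visible text sits in <w:t> elements
    for paragraph in root.iter(W_NS + "p"):
        lines.append("".join(t.text or "" for t in paragraph.iter(W_NS + "t")))
    return "\n".join(lines)
```

The result can then be written out with `open('a.txt', 'w', encoding='utf-8')`. The file is only unzipped and parsed as XML, so no Office application is started and no macros are executed.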

Parsing docx documents to extract text

This uses Python's docx library. When installing it, avoid:

pip install docx

because from docx import Document then fails with: cannot import name Document
Install this instead:

pip install python-docx

from docx import Document

# Open the document
document = Document('H:\\a.docx')
# Read the text of every paragraph
l = [paragraph.text for paragraph in document.paragraphs]
# Print the result to inspect it; the text can also be processed in other ways
for i in l:
    print(i)
# Read the tables and print their contents, one row per line
for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text, end='\t')
        print()
    print('\n')
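The same format-parsing idea can also cover the PowerPoint case without launching PowerPoint, at least for .pptx files (the older binary .ppt format is different). A .pptx is a ZIP archive with one XML file per slide under `ppt/slides/`, and the visible text sits in `<a:t>` elements. The sketch below uses only the standard library; `pptx_to_text` is a hypothetical helper name, and it assumes the numbering of the slide files matches the presentation order, which the format does not guarantee.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace; slide text is stored in <a:t> elements
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
# Matches ppt/slides/slide1.xml, ppt/slides/slide2.xml, ...
SLIDE_NAME = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def pptx_to_text(source):
    """Return the text of a .pptx file, one text paragraph per line.

    `source` is a file path or a binary file-like object.
    """
    lines = []
    with zipfile.ZipFile(source) as archive:
        slides = []
        for name in archive.namelist():
            match = SLIDE_NAME.match(name)
            if match:
                slides.append((int(match.group(1)), name))
        # Sort numerically so that slide10 comes after slide2
        for _, name in sorted(slides):
            root = ET.fromstring(archive.read(name))
            for paragraph in root.iter(A_NS + "p"):
                text = "".join(t.text or "" for t in paragraph.iter(A_NS + "t"))
                if text:
                    lines.append(text)
    return "\n".join(lines)
```

Speaker notes and embedded objects are ignored here; a dedicated parsing library would be the more robust choice when those matter.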
