Working with Common Data File Formats in Python

Published: 2024/9/30


This article summarizes how to read and write common data file formats with Python.

Common data file types:
txt
csv
excel (xls/xlsx)
online web page data
pdf/word
other statistical software formats

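1 txt files

More reference: https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files

Reading a file

import os

# Assumption: workdir is the folder that contains the Data directory.
# The original snippet does not define it, so set it for your environment.
workdir = os.getcwd()

# Input file
file_txt = os.path.join(workdir, 'Data/demo_text.txt')

# Open the file, read every line into a list, and close the file on exit.
# .rstrip() removes trailing whitespace and the newline character.
with open(file_txt, encoding='utf-8') as f:
    lines_raw = [x.rstrip() for x in f]
    # Alternatively:
    # lines_raw = [l.rstrip() for l in f.readlines()]

print(lines_raw)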

Output:

['010杜甫：佳人', '', '絕代有佳人，幽居在空谷。', '自云良家子，零落依草木。', '關中昔喪亂，兄弟遭殺戮。', '官高何足論，不得收骨肉。', '世情惡衰歇，萬事隨轉燭。', '夫婿輕薄兒，新人美如玉。', '合昏尚知時，鴛鴦不獨宿。', '但見新人笑，那聞舊人哭！', '在山泉水清，出山泉水濁。', '侍婢賣珠回，牽蘿補茅屋。', '摘花不插發，采柏動盈掬。', '天寒翠袖薄，日暮倚修竹。']

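You can also read the file with pandas:

import pandas as pd

# Each line becomes one row in a single column named 'txt'.
# This works here because the lines contain no ASCII commas (the default separator);
# blank lines are skipped by default.
df_txt = pd.read_csv(file_txt, names=['txt'], encoding='utf-8')
df_txt.head()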

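Writing a file

# Output file
file_out = os.path.join(workdir, 'Data/out_text.txt')

# lines_raw had its newlines stripped by .rstrip(), and writelines() adds none,
# so append '\n' to each line before writing.
with open(file_out, mode='w', encoding='utf-8') as f_out:
    f_out.writelines(line + '\n' for line in lines_raw)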
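The read and write steps above can be checked with a small round trip. This is a minimal sketch that uses a temporary directory so it runs without the demo data; the file name and the sample lines are made up for illustration.

```python
import os
import tempfile

# Sample lines standing in for demo_text.txt (illustrative data only)
sample_lines = ["title", "", "first line", "second line"]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "roundtrip.txt")  # hypothetical file name

    # Write: writelines() adds no line separators, so append '\n' ourselves
    with open(path, mode="w", encoding="utf-8") as f_out:
        f_out.writelines(line + "\n" for line in sample_lines)

    # Read: iterate over the file object and strip the trailing newline
    with open(path, encoding="utf-8") as f_in:
        lines_back = [line.rstrip() for line in f_in]

# The lines read back should match the lines written
print(lines_back == sample_lines)
```

Without the added '\n', all lines would be written as one long line and the round trip would not reproduce the original list.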

2 csv files

More reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv
Reading and writing csv files is relatively simple: call the pandas functions directly.

# Define the file path
file_csv = os.path.join(workdir, 'Data/demo_csv.csv')

# Read the file with pandas.read_csv()
df_csv = pd.read_csv(file_csv, sep=',', encoding='utf-8')

# Save the data with DataFrame.to_csv()
df_csv.to_csv('out_csv.csv', index=False, encoding='utf-8')

# Preview the first 3 rows
df_csv.head(3)

You can also read a csv file as plain text, but the processing is a little more involved, especially when a field value contains the delimiter (for example a comma), as the name field in the demo data does.
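To illustrate the delimiter problem, here is a small sketch that compares a naive str.split(',') with the standard-library csv module on an in-memory sample. The column names and rows are invented for illustration and are not the contents of demo_csv.csv.

```python
import csv
import io

# Invented sample: the name field contains a comma, so it is quoted
raw = 'id,name,score\n1,"Smith, John",90\n2,Alice,85\n'

# Naive approach: splitting on ',' breaks the quoted name into two pieces
naive_row = raw.splitlines()[1].split(",")

# csv.reader understands the quoting and keeps the field intact
rows = list(csv.reader(io.StringIO(raw)))
header, records = rows[0], rows[1:]

print(len(naive_row), len(records[0]))
print(records[0])
```

The naive split yields one field more than the header has columns, while csv.reader returns exactly three fields per row.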

3 excel (xls/xlsx) files

pandas also provides functions for reading and writing excel files (pandas.read_excel() and DataFrame.to_excel()).
More reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html#pandas.read_excel
Unlike a csv file, an xlsx file can contain multiple sheets; pandas.read_excel() reads the first sheet by default.

# Define the file path
file_excel = os.path.join(workdir, 'Data/demo_xlsx.xlsx')

# Read the file with pandas.read_excel()
# sheet_name=0 reads the first sheet; a sheet name (string) also works
# header=0 uses the first row as the header (column names)
# If the data has no header row, set header=None and pass a list of column names via names
# The encoding argument used with older pandas versions is omitted; pandas 2.x no longer accepts it
df_excel = pd.read_excel(file_excel, sheet_name=0, header=0)

# Save the data with DataFrame.to_excel()
df_excel.to_excel('out_excel.xlsx', index=False)

# Preview the first 3 rows
df_excel.head(3)

For cell-level operations, consider these two packages:
xlwings, https://www.xlwings.org/
openpyxl, https://openpyxl.readthedocs.io/en/stable/

Here xlwings is used to demonstrate automated form filling: suppose 3 items need values in 3 corresponding cells.

import xlwings as xw

file_excel = os.path.join(workdir, 'Data/demo_填表.xlsx')

# Start Excel without showing its window
app = xw.App(visible=False)

# Open the workbook in that hidden Excel instance
wb = app.books.open(file_excel)

# Select the worksheet, either by index or by sheet name
ws = wb.sheets[0]

# Read the value of a cell
print(ws.range('A1').value)

ws.range('B1').value = 'Ahong'
ws.range('B2').value = '男'
ws.range('B3').value = 'Python'

# Save the workbook
# It can also be saved under a new name, e.g. wb.save('new.xlsx')
wb.save()

# Close the workbook and quit the hidden Excel instance
wb.close()
app.quit()

If you need to read several cells from, or write data to, many excel files that share the same layout, this approach is worth considering.
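The same kind of form filling can be sketched with openpyxl, the other package listed above, which reads and writes xlsx files directly and does not need Excel to be installed. The workbook below is created on the fly in a temporary directory; the file name, labels, and values are invented for illustration.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "form.xlsx")  # hypothetical file name

    # Build a small "form": labels in column A, empty cells in column B
    wb = Workbook()
    ws = wb.active
    for row, label in enumerate(["Name", "Gender", "Language"], start=1):
        ws.cell(row=row, column=1, value=label)
    wb.save(path)

    # Reopen the file and fill in the three cells
    wb = load_workbook(path)
    ws = wb.worksheets[0]
    ws["B1"] = "Ahong"
    ws["B2"] = "M"
    ws["B3"] = "Python"
    wb.save(path)

    # Read the values back to confirm they were stored
    wb = load_workbook(path)
    filled = [cell.value for cell in wb.worksheets[0]["B"]]
    wb.close()

print(filled)
```

For a batch job, the fill-and-save part would sit inside a loop over the file paths.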

4 Online web page data

Online web page data usually has to be collected with a web crawler. Web pages are semi-structured data, so the content must then be organized into structured data.
Note: for web crawling, see the O'REILLY book Web Scraping with Python: Collecting More Data from the Modern Web.
Packages commonly used for fetching and parsing web page data:
requests, https://2.python-requests.org//zh_CN/latest/user/quickstart.html
BeautifulSoup, https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html
lxml, https://lxml.de/, one of the author's favorite packages
re, https://docs.python.org/3/library/re.html, regular expressions are a must-learn skill for data cleaning; more reference: https://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
json, https://docs.python.org/3/library/json.html, for handling JSON data
pandas, https://pandas.pydata.org/pandas-docs/stable/index.html, for saving data as a DataFrame

A web crawler typically follows these steps:

Analyze the request: whether it is GET or POST, what the request URL is, what format the returned data is in (JSON? static HTML?), which header parameters are needed, which variables appear in the URL or the POST body, and so on;
Fetch the page data with the requests package;
Parse the page data (turn the semi-structured page into structured data) with BeautifulSoup, lxml, re, and json as needed;
Consolidate and archive the data, using pandas to combine it and do an initial cleaning.
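The parsing step can be sketched with only re and json from the standard library. The HTML snippet, the embedded JSON, and the field names below are invented for illustration; in a real crawler the text would come from a requests response, and BeautifulSoup or lxml would usually be a sturdier choice than regular expressions for HTML.

```python
import json
import re

# Invented page content standing in for a fetched response body
html = """
<ul id="books">
  <li><a href="/book/1">Python Basics</a> <span class="price">39.5</span></li>
  <li><a href="/book/2">Data Cleaning</a> <span class="price">52.0</span></li>
</ul>
<script>var pageInfo = {"page": 1, "total": 2};</script>
"""

# Semi-structured HTML -> list of dicts, one per <li> item
item_pattern = re.compile(
    r'<a href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>\s*'
    r'<span class="price">(?P<price>[\d.]+)</span>'
)
records = [
    {"url": m["url"], "title": m["title"], "price": float(m["price"])}
    for m in item_pattern.finditer(html)
]

# JSON embedded in a script tag -> dict
info_match = re.search(r"var pageInfo = (\{.*?\});", html)
page_info = json.loads(info_match.group(1))

print(records)
print(page_info)
```

A list of dicts like records can be passed to pandas.DataFrame(records) for the final consolidation step.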

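5 PDF/Word
5.1 Reading PDF files
For document-level operations on PDF files (merging, selecting, or deleting pages, and so on), the recommended packages are:
PyPDF2, http://mstamy2.github.io/PyPDF2/
pdfrw, https://github.com/pmaupin/pdfrw
More reference: https://www.binpress.com/manipulate-pdf-python/

When processing PDF files, make sure the file is decrypted or has no password; processing an encrypted file raises an error.
Recommended PDF unlocking tools:
http://freemypdf.com/
https://smallpdf.com/unlock-pdf
The following example shows how to use PyPDF2 to select the odd-numbered pages and save them as a new document.

import PyPDF2

# Input file path
file_in = os.path.join(workdir, 'Data/demo_pdf.pdf')

# The names below are those of PyPDF2 2.0 and later; older releases call them
# PdfFileReader, getNumPages, getPage, PdfFileWriter and addPage
with open(file_in, 'rb') as f_in:
    # Read the PDF document
    pdf_reader = PyPDF2.PdfReader(f_in)
    # Number of pages in the PDF
    page_cnt = len(pdf_reader.pages)

    pdf_writer = PyPDF2.PdfWriter()
    # Select the odd-numbered pages (indices 0, 2, 4, ... are pages 1, 3, 5, ...)
    for page_idx in range(0, page_cnt, 2):
        pdf_writer.add_page(pdf_reader.pages[page_idx])

    # Write the output document while the input file is still open
    with open('pdf_out.pdf', 'wb') as f_out:
        pdf_writer.write(f_out)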

To extract the page content of a PDF file (what is actually written in it), the recommended packages are:

textract, https://textract.readthedocs.io/en/stable/, which supports text extraction from many file formats
pdfminer.six, https://github.com/pdfminer/pdfminer.six, which is used the same way as pdfminer. For pdfminer usage, see http://www.unixuser.org/~euske/python/pdfminer/

After installing pdfminer.six, run the following command directly on the command line:

pdf2txt.py demo_pdf.pdf -o demo_pdf.txt
Alternatively, following https://stackoverflow.com/questions/26494211/extracting-text-from-a-pdf-file-using-pdfminer-in-python, you can define a function to convert PDF files in batch (the function is included at the end of this article).
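A batch conversion could be organized along the following lines. This is only a sketch: batch_convert and its parameters are hypothetical names, and the converter is passed in as any callable that maps a PDF path to a string, for example the convert_pdf_to_txt function given at the end of this article. The demonstration uses a stand-in converter and placeholder files so it runs without pdfminer installed.

```python
import tempfile
from pathlib import Path


def batch_convert(src_dir, out_dir, convert):
    """Run convert(path) on every .pdf in src_dir and save the text as .txt in out_dir.

    batch_convert is a hypothetical helper; convert is any callable taking
    a file path (str) and returning the extracted text (str).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(src_dir).glob("*.pdf")):
        text = convert(str(pdf_path))
        txt_path = out_dir / (pdf_path.stem + ".txt")
        txt_path.write_text(text, encoding="utf-8")
        written.append(txt_path.name)
    return written


# Demonstration with placeholder files and a stand-in converter
with tempfile.TemporaryDirectory() as tmpdir:
    src = Path(tmpdir) / "pdf"
    src.mkdir()
    for name in ("a.pdf", "b.pdf", "notes.docx"):
        (src / name).write_bytes(b"placeholder")

    def fake_convert(path):
        # Stand-in for a real PDF-to-text function
        return "text of " + Path(path).name

    result = batch_convert(src, Path(tmpdir) / "txt", fake_convert)
    first_text = (Path(tmpdir) / "txt" / "a.txt").read_text(encoding="utf-8")

print(result)
```

Passing the converter as an argument keeps the file handling separate from the PDF library, so the same loop works with pdfminer.six, textract, or any other extractor.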
A textract usage example:

import textract

# File path
file_pdf = os.path.join(workdir, 'Data/demo_pdf.pdf')

# Extract the text (returned as bytes)
text_raw = textract.process(file_pdf)
# Decode the bytes to a string
text = text_raw.decode('utf-8')

5.2 Reading Word files

You can use the python-docx package, https://python-docx.readthedocs.io/en/latest/
Working with Word files is comparatively rare; the examples on the package website are sufficient.

6 Other statistical software files

For example, the data formats exported by analysis software such as SAS, SPSS, and Stata.
You can use the pyreadstat package, https://github.com/Roche/pyreadstat

# Read a .sav file with Python
# https://github.com/Roche/pyreadstat
import pyreadstat

# File path
file_data = os.path.join(workdir, 'Data/demo_sav.sav')

# Read the file; df is the resulting DataFrame
df, meta = pyreadstat.read_sav(file_data)

# Check the file encoding
print(meta.file_encoding)

df.head()

Sample data download: https://pan.baidu.com/s/1iGU5vjDrwGzBswbxsC714Q (extraction code: sjgz)

Further reading
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
Automate the Boring Stuff with Python: Practical Programming for Total Beginners
Appendix: a function that converts a PDF file to a string

# ref: https://stackoverflow.com/questions/26494211/extracting-text-from-a-pdf-file-using-pdfminer-in-python
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    # Set up the resource manager, the in-memory output buffer and the text converter
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0      # 0 means no page limit
    caching = True
    pagenos = set()   # an empty set means all pages

    # Process the pages one by one; the extracted text accumulates in retstr
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text
