Approaches to Processing Large Text Files with Python

Published: 2025/4/5 python 29 豆豆

Recently I needed to batch-process several text files larger than 4 GB with Python. These are my notes on the approach.

1 Viewing the text

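Whenever I get new data, I first want to open it and look at the fields and the values. However, my computer has only 16 GB of RAM, and opening a text file over 4 GB directly in Notepad, Notepad++, or a similar text editor loads the whole file into memory at once, so it opens very slowly or not at all.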
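When the goal is only to see the fields and a few rows, the first lines can also be read from Python without loading the rest of the file. The following is a minimal sketch; the helper name `peek_lines` is hypothetical, and it assumes UTF-8 (undecodable bytes are replaced rather than raising, since the encoding may not be known yet).

```python
from itertools import islice


def peek_lines(path, n=5, encoding="utf-8"):
    """Return the first n lines of a text file without reading the rest.

    peek_lines is an illustrative helper, not part of any library.
    errors="replace" keeps the preview working even if the assumed
    encoding is wrong; garbled characters then hint at a mismatch.
    """
    with open(path, "r", encoding=encoding, errors="replace") as f:
        # islice stops after n lines, so only the start of the file is read
        return [line.rstrip("\n") for line in islice(f, n)]
```

For example, `for line in peek_lines(r"G:\data.txt", 10): print(line)` prints the first ten rows, which is usually enough to spot the delimiter and whether there is a header.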

EmEditor handles large files well. It is not free and requires registration: EmEditor (Text Editor) – Text Editor for Windows supporting large files and Unicode!

2 Reading the text

2.1 Reading the text in chunks

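import pandas as pd

# With chunksize set, read_csv returns an iterator of DataFrames instead of one big DataFrame
table = pd.read_csv(
    r"G:\data.txt",
    sep='\t',             # tab-separated
    header=None,          # this data has no header row
    encoding='utf-8',
    on_bad_lines='warn',  # skip malformed rows and warn (pandas >= 1.3; replaces error_bad_lines / warn_bad_lines)
    iterator=True,        # enable the iterator
    chunksize=10000,      # read 10,000 rows per chunk
)

path = r"G:\test"
for i, item in enumerate(table, start=1):
    print("Processing file {}".format(i))
    # Write each chunk to its own CSV file
    item.to_csv(path + "_test_" + str(i) + ".csv", index=False, encoding='utf-8')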
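The read_csv call above is told to skip malformed rows, so it can help to know beforehand how many rows are irregular. The sketch below is one way to check, using only the standard library rather than pandas: it streams the file once and tallies how many rows have each field count. The function name `field_count_report` is hypothetical, and the defaults (tab delimiter, UTF-8) are assumptions taken from the example above.

```python
import csv
from collections import Counter


def field_count_report(path, sep="\t", encoding="utf-8"):
    """Stream the file once and count rows by number of fields.

    field_count_report is an illustrative helper. Only one row is held
    in memory at a time, so file size is not a constraint.
    """
    counts = Counter()
    # newline="" is what the csv module expects for correct line handling
    with open(path, "r", encoding=encoding, newline="") as f:
        for row in csv.reader(f, delimiter=sep):
            counts[len(row)] += 1
    return counts
```

If almost every row has, say, 12 fields and a handful have 13, those few rows are worth inspecting before deciding to drop them silently.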

2.2 Detecting the encoding of Chinese text

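When reading Chinese text with pandas' read_csv, you first need to know the file's encoding and pass the correct value through the encoding parameter; otherwise the data comes out garbled. EmEditor can show a file's encoding and delimiter type directly.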

You can also use Python's chardet package to detect the encoding.

# Method 1
import chardet

def get_encoding(filename):
    """Return the file's encoding. The whole file is read into memory, so this suits small files."""
    with open(filename, 'rb') as f:
        return chardet.detect(f.read())['encoding']

original_file = r"G:\data.txt"
print(get_encoding(original_file))

# Method 2
from chardet.universaldetector import UniversalDetector

original_file = r"G:\data.txt"
detector = UniversalDetector()
with open(original_file, 'rb') as usock:
    for line in usock:  # iterate lazily; readlines() would load every line into memory
        detector.feed(line)
        if detector.done:  # stop as soon as the detector is confident
            break
detector.close()
print(detector.result)

# chardet cannot always guess correctly. If you need to handle the data correctly, you really need to know its encoding.

2.3 Converting the encoding of Chinese text

EmEditor can convert encodings, and so can the code below, which converts a file to UTF-8.

def handleEncoding(original_file, newfile):
    # newfile = original_file[0:original_file.rfind('.')] + '_copy.csv'
    # Read the raw content (bytes, not str). Note: this loads the whole file into memory.
    with open(original_file, 'rb') as f:
        content = f.read()

    # Determine the source encoding by trying each candidate in order
    source_encoding = None
    for candidate in ('utf-8', 'gbk', 'gb2312', 'gb18030', 'big5', 'cp936'):
        try:
            content.decode(candidate)
            source_encoding = candidate
            break
        except UnicodeDecodeError:
            continue
    if source_encoding is None:
        raise ValueError("none of the candidate encodings can decode " + original_file)
    del content  # release the raw bytes before converting

    # Re-read the file with the detected encoding and save it as UTF-8, block by block
    block_size = 10000
    with open(original_file, 'r', encoding=source_encoding, newline='') as f:
        with open(newfile, 'w', encoding='utf-8', newline='') as f2:
            while True:
                content = f.read(block_size)
                if not content:
                    break
                f2.write(content)

original_file = r"G:\data.txt"
newfile = r"G:\data_new.txt"
handleEncoding(original_file, newfile)
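The function above reads the entire file into memory to test each encoding, which is a problem for the multi-gigabyte files this article is about. A possible alternative, sketched below, tests each candidate by decoding the file block by block with an incremental decoder, so memory use stays bounded. The names `can_decode`, `first_matching_encoding`, and `CANDIDATES` are hypothetical, and the shorter candidate list is an assumption: gb18030 is a superset of gbk and gb2312, and cp936 is Python's alias for gbk, so one entry covers all four.

```python
import codecs

# Assumed candidate order; gb18030 covers gbk, gb2312 and cp936
CANDIDATES = ("utf-8", "gb18030", "big5")


def can_decode(path, encoding, block_size=1 << 20):
    """Return True if the whole file decodes as `encoding` without errors.

    The incremental decoder carries multi-byte characters that are split
    across block boundaries, so blocks can be read at any size.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    try:
        with open(path, "rb") as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                decoder.decode(block)
        # final=True raises if the file ends in the middle of a character
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


def first_matching_encoding(path, candidates=CANDIDATES):
    """Return the first candidate that decodes the file, or None."""
    for encoding in candidates:
        if can_decode(path, encoding):
            return encoding
    return None
```

Trial decoding only shows that a file can be decoded with an encoding, not that the result is the intended text; many Big5 byte sequences, for instance, also decode without error as gb18030. Treat the result as a hint and check a few decoded lines by eye.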

2.4 Processing the text in parallel

I want to run the function Fuction_test(x) on all of the chunks at the same time, so I am considering parallel processing.

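# GPU parallelism: the dask package (dask on its own schedules work across CPU cores or a cluster; running on a GPU needs additional libraries)
Still looking into this…

# CPU parallelism: the Parallel function from the joblib package
Still looking into this…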

from joblib import Parallel, delayed

def Fuction_test(x):
    y = x + 10
    return y

# table is the chunk iterator from section 2.1; n_jobs=-1 uses all CPU cores
results = Parallel(n_jobs=-1)(delayed(Fuction_test)(item) for item in table)
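If joblib is not available, roughly the same pattern can be sketched with the standard library's concurrent.futures. This is not the joblib approach above but a stand-in under stated assumptions: `map_bounded` and `add_ten` are hypothetical names, and the `window` parameter is an assumption added so that only a limited number of chunks are in flight at once, which matters when the chunks come from a file too large to hold in memory.

```python
from collections import deque
from concurrent.futures import ProcessPoolExecutor


def add_ten(x):
    """Stand-in for Fuction_test; must be a top-level function so it can be pickled."""
    return x + 10


def map_bounded(func, items, workers=None, window=8,
                executor_cls=ProcessPoolExecutor):
    """Apply func to each item in parallel, yielding results in input order.

    At most `window` tasks are submitted ahead of the result being
    consumed, so a lazy iterator of chunks is never fully materialised.
    executor_cls can be swapped for ThreadPoolExecutor for I/O-bound work.
    """
    pending = deque()
    with executor_cls(max_workers=workers) as pool:
        for item in items:
            pending.append(pool.submit(func, item))
            if len(pending) >= window:
                # Block on the oldest task before submitting more
                yield pending.popleft().result()
        # Drain whatever is still running
        while pending:
            yield pending.popleft().result()
```

A call such as `for result in map_bounded(add_ten, table): ...` would process the chunks as they are read. With the default process pool, the call should sit under `if __name__ == "__main__":` on Windows, because worker processes re-import the main module there.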
