Memory usage when reading a large file line by line into Python 2.7

(Question originally asked on Stack Overflow.)
I am working on a genomics project involving some large files (10-50 GB) that I want to read into Python 2.7 for processing. I do not need to read an entire file into memory; I simply want to read each file line by line, do a small task, and move on.

I found similar questions here and tried to implement a few of the solutions.

Here is what happens when I run each of the following scripts on a 17 GB file:
Script 1 (itertools):

```python
#!/usr/bin/env python2
import sys
import string
import os
import itertools

if __name__ == "__main__":
    # Read in PosList
    posList = []
    with open("BigFile") as f:
        for line in iter(f):
            posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))
```
Script 2 (fileinput):

```python
#!/usr/bin/env python2
import sys
import string
import os
import fileinput

if __name__ == "__main__":
    # Read in PosList
    posList = []
    for line in fileinput.input(['BigFile']):
        posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))
```
Script 3 (for line in f):

```python
#!/usr/bin/env python2
import sys
import string
import os

if __name__ == "__main__":
    # Read in PosList
    posList = []
    with open("BigFile") as f:
        for line in f:
            posList.append(line.strip())
    sys.stdout.write(str(sys.getsizeof(posList)))
```
Script 4 (yield):

```python
#!/usr/bin/env python2
import sys
import string
import os

def readInChunks(fileObj, chunkSize=30):
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        yield data

if __name__ == "__main__":
    # Read in PosList
    posList = []
    f = open('BigFile')
    for chunk in readInChunks(f):
        posList.append(chunk.strip())
    f.close()
    sys.stdout.write(str(sys.getsizeof(posList)))
```
For the 17 GB file, the final list in Python comes to roughly 5 GB (per sys.getsizeof()), but according to top, each script uses upwards of 43 GB of memory.

My question is: why does memory usage climb so much higher than either the input file or the final list? If the final list is only 5 GB, and the 17 GB file is being read line by line, why does each script's memory use reach about 43 GB? Is there a better way to read large files without the memory leak (if that is what this is)?

Many thanks.
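As an aside (my sketch, not from the original post): when the per-line results do not all have to be held at once, memory stays flat regardless of input size if each result is written out immediately instead of appended to a list. Here `process_stream` and its upper-casing transform are placeholders for whatever the real per-line task is:

```python
def process_stream(in_path, out_path):
    """Read in_path line by line, transform each line, and write the
    result immediately, so no list of lines is ever accumulated."""
    count = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            # Placeholder transform: strip the newline and upper-case.
            fout.write(line.strip().upper() + "\n")
            count += 1
    return count
```

This only helps, of course, when the task genuinely does not need all lines resident at once.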
Edit:

Output from '/usr/bin/time -v python script3.py':

```
Command being timed: "python script3.py"
User time (seconds): 159.65
System time (seconds): 21.74
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:01.96
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 181246448
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 10182731
Voluntary context switches: 315
Involuntary context switches: 16722
Swaps: 0
File system inputs: 33831512
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
Output from top (abridged; the resident set grows steadily to about 43 GB, then falls as the process finishes):

```
  PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
15816 user  20  0  727m 609m 2032 R 76.8  0.5  0:02.31 python
15816 user  20  0 1541m 1.4g 2032 R 99.6  1.1  0:05.31 python
15816 user  20  0 2362m 2.2g 2032 R 99.6  1.7  0:08.31 python
[... RES climbs steadily, roughly 0.8 GB every 3 seconds ...]
15816 user  20  0 43.4g  43g 2032 R 99.6 34.3  2:47.40 python
15816 user  20  0 43.4g  43g 2032 R  100 34.3  2:50.41 python
15816 user  20  0 38.6g  38g 2032 R  100 30.5  2:53.43 python
15816 user  20  0 24.9g  24g 2032 R 99.7 19.6  2:56.43 python
15816 user  20  0 12.0g  11g 2032 R  100  9.4  2:59.44 python
```
Edit 2:

To clarify further, here is an expansion of the problem. What I am doing is reading a list of positions from a FASTA file (Contig1/1, Contig1/2, etc.), then turning it into a dictionary filled with N's via:
```python
keys = posList
values = ['N'] * len(posList)
speciesDict = dict(zip(keys, values))
```
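As a side note (my addition, not part of the original post), `dict.fromkeys` builds the same N-filled dictionary without materializing the throwaway `values` list:

```python
posList = ["Contig1/1", "Contig1/2", "Contig1/3"]  # example positions

# Every site starts out as 'N'; no intermediate ['N'] * len(posList) list.
speciesDict = dict.fromkeys(posList, 'N')
```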
Then I read the pileup file for each species, again line by line (which is where the same problem will exist), and obtain the final base calls via:
```python
with open(path + '/' + os.path.basename(path) + '.pileups', "r") as filein:
    for line in iter(filein):
        splitline = line.split()
        if len(splitline) > 4:
            node, pos, ref, num, bases, qual = line.split()
            loc = node + '/' + pos
            cleanBases = getCleanList(ref, bases)
            finalBase = getFinalBase_Pruned(cleanBases, minread, thresh)
            speciesDict[loc] = finalBase
```
由于特定于物種的堆積文件的長度或順序不同,因此,我正在創(chuàng)建列表以創(chuàng)建一種“公共花園”方式來存儲(chǔ)單個(gè)物種數(shù)據(jù)。如果某個(gè)物種的給定站點(diǎn)沒有可用的數(shù)據(jù),則會(huì)調(diào)用“ N”。否則,將在詞典中為該站點(diǎn)分配一個(gè)堿基。
最終結(jié)果是每個(gè)物種的文件,這些文件是有序的和完整的,我可以從中進(jìn)行下游分析。
因?yàn)橹鹦凶x取正在消耗大量內(nèi)存,所以即使最終數(shù)據(jù)結(jié)構(gòu)比我預(yù)期的所需內(nèi)存小得多,讀取兩個(gè)大文件也會(huì)使我的資源超載(增長列表的大小+單個(gè)內(nèi)存)一次要添加數(shù)據(jù)的行)。
Solution
sys.getsizeof(posList) is not giving you what you think it is: it reports the size of the list object containing the lines, which does not include the size of the lines themselves. Below is some output from reading a roughly 3.5 GB file into a list on my system:
```
In [2]: lines = []
In [3]: with open('bigfile') as inf:
   ...:     for line in inf:
   ...:         lines.append(line)
   ...:
In [4]: len(lines)
Out[4]: 68318734
In [5]: sys.getsizeof(lines)
Out[5]: 603811872
In [6]: sum(len(l) for l in lines)
Out[6]: 3473926127
In [7]: sum(sys.getsizeof(l) for l in lines)
Out[7]: 6001719285
```
That's a bit over six billion bytes there; in top, my interpreter was using about 7.5 GB at that point.
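The shallow-versus-deep distinction is easy to reproduce at small scale (my illustration, not from the answer); exact byte counts vary across Python versions and platforms, so only the comparison matters:

```python
import sys

lines = ["ACGT" * 10 for _ in range(1000)]  # 1000 fake 40-character reads

shallow = sys.getsizeof(lines)                  # list header + pointers only
strings = sum(sys.getsizeof(s) for s in lines)  # the string objects themselves
# The list object is a small fraction of the true footprint;
# almost all of the memory lives in the strings it points to.
```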
Strings have considerable overhead: 37 bytes each, it looks like:
```
In [2]: sys.getsizeof('0'*10)
Out[2]: 47
In [3]: sys.getsizeof('0'*100)
Out[3]: 137
In [4]: sys.getsizeof('0'*1000)
Out[4]: 1037
```
So if your lines are relatively short, a large part of the memory use will be overhead.
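The per-string overhead can be estimated the same way on any interpreter (my addition; the 37-byte figure above is specific to CPython 2.7 on a 64-bit build, and CPython 3.x ASCII strings carry roughly 49 bytes of overhead instead):

```python
import sys

def string_overhead():
    """Estimate the fixed per-string overhead: the size of a
    100-character ASCII string minus its 100 bytes of character data."""
    return sys.getsizeof('x' * 100) - 100
```

A quick sanity check that the remainder really is fixed overhead: growing an ASCII string by N characters grows its reported size by exactly N bytes.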