[Deep Learning] Reading and Classifying Scanned Documents with Deep Learning
Collecting Data
The first thing we need to do is create a simple dataset so we can test every part of our workflow. Ideally, this dataset would contain scanned documents of varying legibility and time periods, along with the high-level topic each document belongs to. I could not find a dataset with these exact specifications, so I set about building my own. The high-level topics I settled on were government, letters, smoking, and patents, chosen largely at random because each area offers a wide variety of scanned documents.
I picked roughly 20 reasonably sized documents from each of these sources and placed them into separate folders named by topic.
After nearly a full day of searching for and cataloging all the images, we resize them all to 600x800 and convert them to PNG format.
A simple resize-and-convert script follows:
from PIL import Image
import os

img_folder = r'F:\Data\Imagery\OCR'  # Folder containing topic folders (e.g. "News", "Letters", etc.)

for subfol in os.listdir(img_folder):        # For each of the topic folders
    sfpath = os.path.join(img_folder, subfol)
    for imgfile in os.listdir(sfpath):       # Get all images in the topic
        imgpath = os.path.join(sfpath, imgfile)
        img = Image.open(imgpath)            # Read in the image with Pillow
        img = img.resize((600, 800))         # Resize the image
        newip = imgpath[0:-4] + ".png"       # Convert to PNG
        img.save(newip)                      # Save

Building the OCR Pipeline
Optical character recognition (OCR) is the process of extracting text from images. It is usually performed with machine learning models, most commonly a pipeline built around convolutional neural networks. While we could train a custom OCR model for our application, it would require far more training data and compute. Instead, we will use the excellent Microsoft Computer Vision API, which includes a module dedicated to OCR. The API call takes an image (as a PIL image) and returns several pieces of information, including the location/orientation of the text on the image as well as the text itself. The following function takes a list of PIL images and outputs an equally sized list of extracted text:
def image_to_text(imglist, ndocs=10):
    '''Take in a list of PIL images and return a list of extracted text using OCR'''
    headers = {
        # Request headers
        'Content-Type': 'application/octet-stream',
        'Ocp-Apim-Subscription-Key': 'YOUR_KEY_HERE',
    }
    params = urllib.parse.urlencode({
        # Request parameters
        'language': 'en',
        'detectOrientation': 'true',
    })

    outtext = []
    docnum = 0
    for cropped_image in imglist:
        print("Processing document -- ", str(docnum))
        # Cropped image must have both height and width > 50 px to run Computer Vision API
        #if (cropped_image.height or cropped_image.width) < 50:
        #    cropped_images_ocr.append("N/A")
        #    continue
        ocr_image = cropped_image
        imgByteArr = io.BytesIO()
        ocr_image.save(imgByteArr, format='PNG')
        imgByteArr = imgByteArr.getvalue()

        curr_text = []
        try:
            conn = http.client.HTTPSConnection('westus.api.cognitive.microsoft.com')
            conn.request("POST", "/vision/v1.0/ocr?%s" % params, imgByteArr, headers)
            response = conn.getresponse()
            data = json.loads(response.read().decode("utf-8"))
            for r in data['regions']:
                for l in r['lines']:
                    for w in l['words']:
                        curr_text.append(str(w['text']))
            conn.close()
        except Exception as e:
            print("Could not process image -- ", e)
        outtext.append(' '.join(curr_text))
        docnum += 1
    return(outtext)

Post-Processing
由于在某些情況下我們可能希望在這里結(jié)束我們的工作流程,而不是僅僅將提取的文本作為一個(gè)巨大的列表保存在內(nèi)存中,我們還可以將提取的文本寫入與原始輸入文件同名的單個(gè) txt 文件中。微軟的OCR技術(shù)雖然不錯(cuò),但偶爾也會(huì)出錯(cuò)。????我們可以使用 SpellChecker 模塊減少其中的一些錯(cuò)誤,以下腳本接受輸入和輸出文件夾,讀取輸入文件夾中的所有掃描文檔,使用我們的 OCR 腳本讀取它們,運(yùn)行拼寫檢查并糾正拼寫錯(cuò)誤的單詞,最后將原始txt文件導(dǎo)出目錄。
''' Read in a list of scanned images (as .png files > 50x50px) and output
a set of .txt files containing the text content of these scans '''

from functions import preprocess, image_to_text
from PIL import Image
import os
from spellchecker import SpellChecker

INPUT_FOLDER = r'F:\Data\Imagery\OCR2\Images'
OUTPUT_FOLDER = r'F:\Research\OCR\Outputs\AllDocuments'

## First, read in all the scanned document images into PIL images
scanned_docs_path = os.listdir(INPUT_FOLDER)
scanned_docs_path = [x for x in scanned_docs_path if x.endswith('.png')]
scanned_docs = [Image.open(os.path.join(INPUT_FOLDER, path)) for path in scanned_docs_path]

## Second, utilize Microsoft CV API to extract text from these images using OCR
scanned_docs_text = image_to_text(scanned_docs)

## Third, remove mis-spellings that might have occurred from bad OCR readings
spell = SpellChecker()
for i in range(len(scanned_docs_text)):
    clean = scanned_docs_text[i].split(" ")
    misspelled = spell.unknown(clean)
    for word in range(len(clean)):
        if clean[word] in misspelled:
            clean[word] = spell.correction(clean[word])  # Get the one `most likely` answer
    scanned_docs_text[i] = ' '.join(clean)

## Fourth, write the extracted text to individual .txt files with the same name as input files
for k in range(len(scanned_docs_text)):  # For each scanned document
    text = scanned_docs_text[k]
    path = scanned_docs_path[k]          # Get the corresponding input filename
    text_file_path = os.path.join(OUTPUT_FOLDER, path[:-4] + ".txt")  # Create the output text file
    text_file = open(text_file_path, "wt")
    n = text_file.write(text)            # Write the text to the output text file
    text_file.close()

print("Done")

Preparing the Text for Modeling
If our collection of scanned documents is large enough, writing them all into one big folder makes them hard to sort through, and the documents likely already have some implicit grouping. If we have a rough idea of how many different "types" or topics of documents we have, we can use topic modeling to help identify them automatically. This gives us the infrastructure to split the text recognized by OCR into separate folders based on document content; the topic model we will use is called LDA (Latent Dirichlet Allocation). Running this model requires some more preprocessing and organization of our data, so to keep our scripts from becoming long and crowded, we will assume the scanned documents have already been read and converted to .txt files using the workflow above. The topic model will then read in these .txt files, classify them into however many topics we specify, and place them into the appropriate folders.
We'll start with a simple function that reads all the output .txt files in a folder and places them into a list of tuples of the form (filename, text).
def read_and_return(foldername, fileext='.txt'):
    '''Read all text files with fileext from foldername, and place them
    into a list of tuples as [(filename, text), ... , (filename, text)]'''
    allfiles = os.listdir(foldername)
    allfiles = [os.path.join(foldername, f) for f in allfiles if f.endswith(fileext)]
    alltext = []
    for filename in allfiles:
        with open(filename, 'r') as f:
            alltext.append((filename, f.read()))
    return(alltext)  # Returns list of tuples [(filename, text), ... (filename, text)]

Next, we need to make sure all of the unhelpful words (those that do not help us distinguish the topic of a particular document) are removed. We will do this in three different ways:
Removing stopwords
Stripping tags, punctuation, numbers, and repeated whitespace
TF-IDF filtering
To implement all of these (along with our topic model), we will use the Gensim package. The script below runs the necessary preprocessing steps on a list of texts (the output of the function above) and trains an LDA model.
from gensim import corpora, models, similarities
from gensim.parsing.preprocessing import remove_stopwords, preprocess_string

def preprocess(document):
    clean = remove_stopwords(document)
    clean = preprocess_string(clean)
    return(clean)

def run_lda(textlist, num_topics=10, preprocess_docs=True):
    '''Train and return an LDA model against a list of documents'''
    if preprocess_docs:
        doc_text = [preprocess(d) for d in textlist]
    dictionary = corpora.Dictionary(doc_text)
    corpus = [dictionary.doc2bow(text) for text in doc_text]
    tfidf = models.tfidfmodel.TfidfModel(corpus)
    transformed_tfidf = tfidf[corpus]
    lda = models.ldamulticore.LdaMulticore(transformed_tfidf, num_topics=num_topics, id2word=dictionary)
    return(lda, dictionary)

Using the Model to Classify Documents
Once we have trained our LDA model, we can use it to classify our set of training documents (and any future documents that come in) into topics and then place them into the appropriate folders.
Using a trained LDA model on a new string of text requires jumping through a few hoops; all of that complexity is wrapped in the function below:
def find_topic(textlist, dictionary, lda):
    '''
    https://stackoverflow.com/questions/16262016/how-to-predict-the-topic-of-a-new-query-using-a-trained-lda-model-using-gensim
    For each query (document in the test file), tokenize the query, create a
    feature vector just like how it was done while training, and create text_corpus
    '''
    text_corpus = []
    for query in textlist:
        current_doc = list(tokenize(query.strip()))
        text_corpus.append(current_doc)
    # For each feature vector text, lda[doc_bow] gives the topic distribution,
    # which can be sorted in descending order to get the top topic
    tops = []
    for text in text_corpus:
        doc_bow = dictionary.doc2bow(text)
        topics = sorted(lda[doc_bow], key=lambda x: x[1], reverse=True)[0]
        tops.append(topics)
    return(tops)

Finally, we need one more function to get the actual name of a topic from its topic index:
def topic_label(ldamodel, topicnum):
    alltopics = ldamodel.show_topics(formatted=False)
    topic = alltopics[topicnum]
    topic = dict(topic[1])
    return(max(topic, key=lambda key: topic[key]))

Now we can glue all the functions written above into a single script that accepts an input folder, an output folder, and a topic count. The script reads every scanned document image in the input folder, writes the text out as .txt files, builds an LDA model to find the high-level topics in the documents, and sorts the output .txt files into folders according to document topic.
#################################################################
# This script takes in an input folder of scanned documents     #
# and reads these documents, separates them into topics         #
# and outputs raw .txt files into the output folder, separated  #
# by topic                                                      #
#################################################################

import os
import io
import json
import shutil
import http.client, urllib.request, urllib.parse, urllib.error
import requests
import tqdm
from PIL import Image
from spellchecker import SpellChecker
from gensim import corpora, models, similarities
from gensim.utils import tokenize
from gensim.parsing.preprocessing import remove_stopwords, preprocess_string


def filter_for_english(text):
    dict_url = 'https://raw.githubusercontent.com/first20hours/' \
               'google-10000-english/master/20k.txt'
    dict_words = set(requests.get(dict_url).text.splitlines())
    english_words = tokenize(text)
    english_words = [w for w in english_words if w in dict_words]
    english_words = [w for w in english_words if (len(w) > 1 or w.lower() == 'i')]
    return(' '.join(english_words))


def preprocess(document):
    clean = filter_for_english(document)  # Remove non-English words
    clean = remove_stopwords(clean)
    clean = preprocess_string(clean)
    return(clean)


def read_and_return(foldername, fileext='.txt', delete_after_read=False):
    allfiles = os.listdir(foldername)
    allfiles = [os.path.join(foldername, f) for f in allfiles if f.endswith(fileext)]
    alltext = []
    for filename in allfiles:
        with open(filename, 'r') as f:
            alltext.append((filename, f.read()))
        if delete_after_read:
            os.remove(filename)
    return(alltext)  # Returns list of tuples [(filename, text), ... (filename, text)]


def image_to_text(imglist, ndocs=10):
    '''Take in a list of PIL images and return a list of extracted text'''
    headers = {
        # Request headers
        'Content-Type': 'application/octet-stream',
        'Ocp-Apim-Subscription-Key': 'YOUR_KEY_HERE',
    }
    params = urllib.parse.urlencode({
        # Request parameters
        'language': 'en',
        'detectOrientation': 'true',
    })

    outtext = []
    for cropped_image in tqdm.tqdm(imglist, total=len(imglist)):
        # Cropped image must have both height and width > 50 px to run Computer Vision API
        #if (cropped_image.height or cropped_image.width) < 50:
        #    cropped_images_ocr.append("N/A")
        #    continue
        ocr_image = cropped_image
        imgByteArr = io.BytesIO()
        ocr_image.save(imgByteArr, format='PNG')
        imgByteArr = imgByteArr.getvalue()

        curr_text = []
        try:
            conn = http.client.HTTPSConnection('westus.api.cognitive.microsoft.com')
            conn.request("POST", "/vision/v1.0/ocr?%s" % params, imgByteArr, headers)
            response = conn.getresponse()
            data = json.loads(response.read().decode("utf-8"))
            for r in data['regions']:
                for l in r['lines']:
                    for w in l['words']:
                        curr_text.append(str(w['text']))
            conn.close()
        except Exception as e:
            print("Could not process image -- ", e)
        outtext.append(' '.join(curr_text))
    return(outtext)


def run_lda(textlist, num_topics=10, return_model=False, preprocess_docs=True):
    '''Train and return an LDA model against a list of documents'''
    if preprocess_docs:
        doc_text = [preprocess(d) for d in textlist]
    dictionary = corpora.Dictionary(doc_text)
    corpus = [dictionary.doc2bow(text) for text in doc_text]
    tfidf = models.tfidfmodel.TfidfModel(corpus)
    transformed_tfidf = tfidf[corpus]
    lda = models.ldamulticore.LdaMulticore(transformed_tfidf, num_topics=num_topics, id2word=dictionary)
    return(lda, dictionary)


def find_topic(textlist, dictionary, lda):
    '''
    https://stackoverflow.com/questions/16262016/how-to-predict-the-topic-of-a-new-query-using-a-trained-lda-model-using-gensim
    For each query (document in the test file), tokenize the query, create a
    feature vector just like how it was done while training, and create text_corpus
    '''
    text_corpus = []
    for query in textlist:
        current_doc = list(tokenize(query.strip()))
        text_corpus.append(current_doc)
    # For each feature vector text, lda[doc_bow] gives the topic distribution,
    # which can be sorted in descending order to get the top topic
    tops = []
    for text in text_corpus:
        doc_bow = dictionary.doc2bow(text)
        topics = sorted(lda[doc_bow], key=lambda x: x[1], reverse=True)[0]
        tops.append(topics)
    return(tops)


def topic_label(ldamodel, topicnum):
    alltopics = ldamodel.show_topics(formatted=False)
    topic = alltopics[topicnum]
    topic = dict(topic[1])
    return(max(topic, key=lambda key: topic[key]))


INPUT_FOLDER = r'F:/Research/OCR/Outputs/AllDocuments'
OUTPUT_FOLDER = r'F:/Research/OCR/Outputs/AllDocumentsByTopic'
TOPICS = 4

if __name__ == '__main__':

    print("Reading scanned documents")
    ## First, read in all the scanned document images into PIL images
    scanned_docs_path = [os.path.join(INPUT_FOLDER, p) for p in os.listdir(INPUT_FOLDER)]
    scanned_docs_path = [x for x in scanned_docs_path if x.endswith('.png')]
    scanned_docs = [Image.open(x) for x in scanned_docs_path]

    ## Second, utilize Microsoft CV API to extract text from these images using OCR
    scanned_docs_text = image_to_text(scanned_docs)

    print("Post-processing extracted text")
    ## Third, remove mis-spellings that might have occurred from bad OCR readings
    spell = SpellChecker()
    for i in range(len(scanned_docs_text)):
        clean = scanned_docs_text[i].split(" ")
        misspelled = spell.unknown(clean)
        for word in range(len(clean)):
            if clean[word] in misspelled:
                clean[word] = spell.correction(clean[word])  # Get the one `most likely` answer
        scanned_docs_text[i] = ' '.join(clean)

    print("Writing read text into files")
    ## Fourth, write the extracted text to individual .txt files with the same name as input files
    for k in range(len(scanned_docs_text)):  # For each scanned document
        text = filter_for_english(scanned_docs_text[k])
        path = os.path.basename(scanned_docs_path[k])  # Get the corresponding input filename
        text_file_path = os.path.join(OUTPUT_FOLDER, path[0:-4] + ".txt")  # Create the output text file
        with open(text_file_path, "wt") as text_file:
            text_file.write(text)  # Write the text to the output text file

    # First, read all the output .txt files
    print("Reading files")
    texts = read_and_return(OUTPUT_FOLDER)

    print("Building LDA topic model")
    # Second, train the LDA model (pre-processing is done internally)
    textlist = [t[1] for t in texts]
    ldamodel, dictionary = run_lda(textlist, num_topics=TOPICS)

    # Third, extract the top topic for each document
    print("Extracting Topics")
    topics = []
    for t in texts:
        topics.append((t[0], find_topic([t[1]], dictionary, ldamodel)))

    # Convert topic indices to topic names
    for i in range(len(topics)):
        topnum = topics[i][1][0][0]
        topics[i][1][0] = topic_label(ldamodel, topnum)
    # topics is now [(filename, [topic]), ..., (filename, [topic])]

    # Create folders for the topics
    print("Copying Documents into Topic Folders")
    foundtopics = []
    for t in topics:
        foundtopics += t[1]
    topicfolders = set(os.path.join(OUTPUT_FOLDER, f) for f in set(foundtopics))
    for m in topicfolders:
        os.makedirs(m, exist_ok=True)

    # Move files into the appropriate topic folders
    for t in topics:
        filename, topic = t
        src = filename
        dest = os.path.join(OUTPUT_FOLDER, topic[0], os.path.basename(filename))
        shutil.copyfile(src, dest)
        os.remove(src)

    print("Done")

The full code for this article is on GitHub:
https://github.com/ShairozS/Scan2Topic