Converting PDF and Gutenberg Document Formats into Text: Natural Language Processing in Production

發(fā)布時(shí)間:2023/11/29 编程问答 52 豆豆
生活随笔 收集整理的這篇文章主要介紹了 将PDF和Gutenberg文档格式转换为文本:生产中的自然语言处理 小編覺(jué)得挺不錯(cuò)的,現(xiàn)在分享給大家,幫大家做個(gè)參考.

Estimates state that 70%–85% of the world's data is text (unstructured data). Most English and EU business data is formatted as byte text, MS Word, or Adobe PDF. [1]

Organizations display documents on the web in Adobe's Portable Document Format (PDF). [2]

In this blog, I detail the following:

  • Create a file path from a web file name or a local file name;
  • Change a byte-encoded Gutenberg Project file into a text corpus;
  • Change a PDF document into a text corpus;
  • Segment continuous text into a corpus of word text.

Converting Popular Document Formats into Text

1. Create a local filepath from the web filename or local filename

The following function takes either a local file name or a remote file URL and returns a binary file-like object.

    #in file_to_text.py
    --------------------------------------------
    from io import BytesIO
    from typing import BinaryIO
    import urllib.request

    def file_or_url(pathfilename: str) -> BinaryIO:
        """
        Return a file-like object, given a local file path or a URL.
        Args:
            pathfilename: local file path or remote URL.
        Returns:
            binary file-like object instance.
        """
        try:
            fp = open(pathfilename, mode="rb")
        except OSError:
            # not a local file; fetch the bytes from the URL instead
            url_text = urllib.request.urlopen(pathfilename).read()
            fp = BytesIO(url_text)
        return fp
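
A quick usage sketch (the local path is the one used later in this post; either form should return a binary file-like object):

    fp = file_or_url('tmp/inf_finite_NN.pdf')                       # local file
    fp = file_or_url('http://www.gutenberg.org/files/74/74-0.txt')  # remote URL
    print(fp.read(64))  # first 64 bytes of the stream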

2. Change a Unicode byte-encoded file into a Python Unicode string

You will often encounter text blob downloads in 8-bit Unicode (UTF-8) format, at least for the Romance languages. You need to convert 8-bit Unicode into Python Unicode strings.

    #in file_to_text.py
    --------------------------------------------
    def unicode_8_to_text(text: bytes) -> str:
        return text.decode("utf-8", "replace")

    --------------------------------------------
    import urllib.request
    from file_to_text import unicode_8_to_text

    text_l = 250
    text_url = r'http://www.gutenberg.org/files/74/74-0.txt'
    gutenberg_text = urllib.request.urlopen(text_url).read()
    %time gutenberg_text = unicode_8_to_text(gutenberg_text)
    print('{}: size: {:g} \n {} \n'.format(0, len(gutenberg_text), gutenberg_text[:text_l]))

    output =>


    CPU times: user 502 μs, sys: 0 ns, total: 502 μs
    Wall time: 510 μs
    0: size: 421927
    
    The Project Gutenberg EBook of The Adventures of Tom Sawyer, Complete by
    Mark Twain (Samuel Clemens)
    This eBook is for the use of anyone anywhere at no cost and with almost
    no restrictions whatsoever. You may copy it, give it away or re-use
    it under the terms of the Project Gutenberg License included with this
    eBook or online at www.guten

The result is that text.decode('utf-8') can format a million characters into a Python string in about 1/1000th of a second, a rate that far exceeds our production requirements.
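
If you want to check that rate on your own hardware, here is a minimal, self-contained timing sketch; the synthetic one-million-byte payload is an assumption for illustration:

    import timeit

    payload = ('word ' * 200_000).encode('utf-8')  # ~1,000,000 bytes of ASCII text
    per_call = timeit.timeit(lambda: payload.decode('utf-8', 'replace'), number=100) / 100
    print('decode time per ~1M characters: {:.6f} s'.format(per_call))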

3. Change a PDF document into a text corpus

"Changing a PDF document into a text corpus" is one of the most troublesome and common tasks I do for NLP text pre-processing.

    #in file_to_text.py
    --------------------------------------------
    from io import StringIO
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    def PDF_to_text(pathfilename: str) -> str:
        """
        Change PDF format to text.
        Args:
            pathfilename: local file path or remote URL of a PDF.
        Returns:
            the extracted text.
        """
        fp = file_or_url(pathfilename)
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0
        caching = True
        pagenos = set()
        for page in PDFPage.get_pages(
            fp,
            pagenos,
            maxpages=maxpages,
            password=password,
            caching=caching,
            check_extractable=True,
        ):
            interpreter.process_page(page)
        text = retstr.getvalue()
        fp.close()
        device.close()
        retstr.close()
        return text
    -------------------------------------------------------
    arvix_list = ['https://arxiv.org/pdf/2008.05828v1.pdf',
                  'https://arxiv.org/pdf/2008.05981v1.pdf',
                  'https://arxiv.org/pdf/2008.06043v1.pdf',
                  'tmp/inf_finite_NN.pdf']
    for n, f in enumerate(arvix_list):
        %time pdf_text = PDF_to_text(f).replace('\n', ' ')
        print('{}: size: {:g} \n {} \n'.format(n, len(pdf_text), pdf_text[:text_l]))

    output =>


    CPU times: user 1.89 s, sys: 8.88 ms, total: 1.9 s
    Wall time: 2.53 s
    0: size: 42522
    On the Importance of Local Information in Transformer Based Models Madhura Pande, Aakriti Budhraja, Preksha Nema Pratyush Kumar, Mitesh M. Khapra Department of Computer Science and Engineering Robert Bosch Centre for Data Science and AI (RBC-DSAI) Indian Institute of Technology Madras, Chennai, India {mpande,abudhra,preksha,pratyush,miteshk}@
    CPU times: user 1.65 s, sys: 8.04 ms, total: 1.66 s
    Wall time: 2.33 s
    1: size: 30586
    ANAND,WANG,LOOG,VANGEMERT:BLACKMAGICINDEEPLEARNING1BlackMagicinDeepLearning:HowHumanSkillImpactsNetworkTrainingKanavAnand1anandkanav92@gmail.comZiqiWang1z.wang-8@tudelft.nlMarcoLoog12M.Loog@tudelft.nlJanvanGemert1j.c.vangemert@tudelft.nl1DelftUniversityofTechnology,Delft,TheNetherlands2UniversityofCopenhagenCopenhagen,DenmarkAbstractHowdoesauser’sp
    CPU times: user 4.82 s, sys: 46.3 ms, total: 4.87 s
    Wall time: 6.53 s
    2: size: 57204
    0 2 0 2 g u A 3 1 ] G L . s c [ 1 v 3 4 0 6 0 . 8 0 0 2 : v i X r a Of?ine Meta-Reinforcement Learning with Advantage Weighting Eric Mitchell1, Rafael Rafailov1, Xue Bin Peng2, Sergey Levine2, Chelsea Finn1 1 Stanford University, 2 UC Berkeley em7@stanford.edu Abstract Massive datasets have proven critical to successfully
    CPU times: user 12.2 s, sys: 36.1 ms, total: 12.3 s
    Wall time: 12.3 s
    3: size: 89633
    0 2 0 2 l u J 1 3 ] G L . s c [ 1 v 1 0 8 5 1 . 7 0 0 2 : v i X r a Finite Versus In?nite Neural Networks: an Empirical Study Jaehoon Lee Samuel S. Schoenholz? Jeffrey Pennington? Ben Adlam?? Lechao Xiao? Roman Novak? Jascha Sohl-Dickstein {jaehlee, schsam, jpennin, adlam, xlc, romann, jaschasd}@google.com Google Brain

On this hardware configuration, "Converting a PDF file into a Python string" requires 150 seconds per million characters. That is not fast enough for an interactive web production application.

You may want to stage the formatting in a background job.
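
One way to stage it, sketched with the standard-library concurrent.futures; the worker count is an arbitrary assumption, and PDF_to_text and arvix_list are the names defined above:

    from concurrent.futures import ProcessPoolExecutor, as_completed

    # convert PDFs in background worker processes; the caller stays responsive
    # (PDF_to_text must live in an importable module, e.g. file_to_text.py)
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(PDF_to_text, f): f for f in arvix_list}
        for done in as_completed(futures):
            print(futures[done], len(done.result()))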

4. Segment continuous text into a corpus of word text

When we read https://arxiv.org/pdf/2008.05981v1.pdf, it came back as continuous text with no separating characters. Using the wordsegment package, we separate the continuous string into words.

    from wordsegment import load, clean, segment
    load()  # load the word-frequency dictionaries before segmenting
    %time words = segment(pdf_text)
    print('size: {:g} \n'.format(len(words)))
    ' '.join(words)[:text_l*4]

    output =>


    CPU times: user 1min 43s, sys: 1.31 s, total: 1min 44s
    Wall time: 1min 44s
    size: 5005
    'an and wang loog van gemert blackmagic in deep learning 1 blackmagic in deep learning how human skill impacts network training kanavanand1anandkanav92g mailcom ziqiwang1zwang8tudelftnl marco loog12mloogtudelftnl jan van gemert 1jcvangemerttudelftnl1 delft university of technology delft the netherlands 2 university of copenhagen copenhagen denmark abstract how does a users prior experience with deep learning impact accuracy we present an initial study based on 31 participants with different levels of experience their task is to perform hyper parameter optimization for a given deep learning architecture the results show a strong positive correlation between the participants experience and then al performance they additionally indicate that an experienced participant nds better solutions using fewer resources on average the data suggests furthermore that participants with no prior experience follow random strategies in their pursuit of optimal hyperparameters our study investigates the subjective human factor in comparisons of state of the art results and scientic reproducibility in deep learning 1 introduction the popularity of deep learning in various elds such as image recognition 919speech1130 bioinformatics 2124questionanswering3 etc stems from the seemingly favorable tradeoff between the recognition accuracy and their optimization burden lecunetal20 attribute their success t'

You will notice that wordsegment accomplishes a fairly accurate separation into words. There are some errors, and some words we don't want, which NLP text pre-processing clears away.

The Apache-licensed wordsegment package is slow. It is barely adequate in production for small documents of fewer than a thousand words. Can we find some faster way to segment?

4b. Segment continuous text into a corpus of word text

There seems to be a faster method to "segment continuous text into a corpus of word text."

As discussed in the following blog post:

SymSpell is 100x–1000x faster. Wow!

Note (ed. 8/24/2020): Wolf Garbe deserves credit for pointing out:

The benchmark results (100x–1000x faster) given in the SymSpell blog post refer solely to spelling correction, not to word segmentation. In that post SymSpell was compared to other spelling correction algorithms, not to word segmentation algorithms. — Wolf Garbe, 8/23/2020

    and

Also, there is an easier way to call a C# library from Python: https://stackoverflow.com/questions/7367976/calling-a-c-sharp-library-from-python — Wolf Garbe, 8/23/2020
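
For reference, that Stack Overflow route typically means the pythonnet package; a hedged sketch, where the assembly, namespace, and class names are hypothetical placeholders:

    import clr  # pythonnet

    clr.AddReference('SymSpellCompiled')        # hypothetical compiled C# assembly
    from SymSpellNamespace import SymSpellCS    # hypothetical namespace and class
    segmenter = SymSpellCS()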

Note (ed. 8/24/2020): I am going to try Garbe's C# implementation. If I do not get the same results (and probably even if I do), I will try a Cython port and see if I can fit it into spaCy as a pipeline element. I will let you know my results.

However, it is implemented in C#, and I am not going down the infinite ratholes of:

  • Converting all my NLP into C#. Not a viable option.

  • Calling C# from Python. I talked to two engineering managers of Python groups. They have Python-C# capability, but it involves:

Note:

  • Translating to VB-vanilla;
  • Manual intervention and translation must pass reproducibility tests;
  • Translating from VB-vanilla to C;
  • Manual intervention and translation must pass reproducibility tests.

Instead, we work with a port to Python. Here is a version:

    相反,我們使用Python的端口。 這是一個(gè)版本:

    import pkg_resources
    from symspellpy import SymSpell

    def segment_into_words(input_term):
        # maximum edit distance per dictionary precalculation
        max_edit_distance_dictionary = 0
        prefix_length = 7
        # create object
        sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)
        # load the unigram and bigram dictionaries shipped with symspellpy
        dictionary_path = pkg_resources.resource_filename(
            "symspellpy", "frequency_dictionary_en_82_765.txt")
        bigram_path = pkg_resources.resource_filename(
            "symspellpy", "frequency_bigramdictionary_en_243_342.txt")
        # term_index is the column of the term and count_index is the
        # column of the term frequency
        if not sym_spell.load_dictionary(dictionary_path, term_index=0,
                                         count_index=1):
            print("Dictionary file not found")
            return
        if not sym_spell.load_bigram_dictionary(bigram_path, term_index=0,
                                                count_index=2):
            print("Bigram dictionary file not found")
            return
        result = sym_spell.word_segmentation(input_term)
        return result.corrected_string

    %time long_s = segment_into_words(pdf_text)
    print('size: {:g} {}'.format(len(long_s), long_s[:text_l*4]))

    output =>


    CPU times: user 20.4 s, sys: 59.9 ms, total: 20.4 s
    Wall time: 20.4 s
    size: 36585 ANAND,WANG,LOOG,VANGEMER T:BLACKMAGICINDEEPLEARNING1B lack MagicinDeepL earning :HowHu man S kill Imp acts Net work T raining Ka nav An and 1 an and kana v92@g mail . com ZiqiWang1z. wang -8@tu delft .nlM arc oLoog12M.Loog@tu delft .nlJ an van Gemert1j.c. vang emert@tu delft .nl1D elf tUniversityofTechn ology ,D elf t,TheN ether lands 2UniversityofC open hagen C open hagen ,Den mark Abs tract How does a user ’s prior experience with deep learning impact accuracy ?We present an initial study based on 31 participants with different levels of experience .T heir task is to perform hyper parameter optimization for a given deep learning architecture .T here -s ult s show a strong positive correlation between the participant ’s experience and the ?n al performance .T hey additionally indicate that an experienced participant ?nds better sol u-t ions using fewer resources on average .T he data suggests furthermore that participants with no prior experience follow random strategies in their pursuit of optimal hyper pa-ra meters .Our study investigates the subjective human factor in comparisons of state of the art results and sci enti?c reproducibility in deep learning .1Intro duct ion T he popularity of deep learning in various ? eld s such as image recognition [9,19], speech [11,30], bio informatics [21,24], question answering [3] etc . stems from the seemingly fav or able trade - off b

SymSpellpy, implemented in Python, is about 5x faster. We are not seeing 100x–1000x faster.

I guess that the SymSpell-C# benchmark compares against different segmentation algorithms implemented in Python.

Perhaps the speedup is due to C#, a compiled, statically typed language. Since C# and C are about the same computing speed, we should expect a C# implementation to be 100x–1000x faster than a Python implementation.

Note: There is a spaCy pipeline implementation, spacy_symspell, that directly calls SymSpellpy. I recommend you do not use spacy_symspell. spaCy generates tokens as the first step of its pipeline, and those tokens are immutable. spacy_symspell generates new text by segmenting the continuous text; it cannot generate new tokens, because spaCy has already generated them. A spaCy pipeline works on a token sequence, not a stream of text. One would have to spin off a changed version of spaCy. Why bother? Instead, segment continuous text into a corpus of word text, then correct embedded whitespace inside words and hyphenated words in the text. Do any other raw cleaning you want. Then feed the raw text to spaCy.
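
A minimal sketch of that recommended order, reusing segment_into_words from above (the hyphen-rejoin regex is an illustrative assumption, not a complete cleaner):

    import re
    import spacy

    raw_text = segment_into_words(pdf_text)              # 1. segment the continuous text
    raw_text = re.sub(r'(\w)- (\w)', r'\1\2', raw_text)  # 2. rejoin words split by hyphenation
    nlp = spacy.load('en_core_web_lg')
    doc = nlp(raw_text)                                  # 3. only then hand raw text to spaCy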

For completeness, I show spacy_symspell below. Again, my advice is not to use it.

    import spacy
    from spacy_symspell import SpellingCorrector

    def segment_into_words(input_term):
        nlp = spacy.load("en_core_web_lg", disable=["tagger", "parser"])
        corrector = SpellingCorrector()
        nlp.add_pipe(corrector)

Conclusion

In future blogs, I will detail many common and uncommon fast text pre-processing methods. Also, I will show the expected speedup from moving SymSpellpy to Cython.

There will be many more formats and APIs you need to support in the world of "changing X format into a text corpus."

I detailed two of the more common document formats: PDF and the Gutenberg Project format. Also, I gave two NLP utility functions, segment_into_words and file_or_url.

I hope you learned something and can use some of the code in this blog.

If you have some format conversions, or better yet a package of them, let me know.

Source: https://towardsdatascience.com/natural-language-processing-in-production-converting-pdf-and-gutenberg-document-formats-into-text-9e7cd3046b33
