
pdfparser java: How to extract content from a PDF using Java

Published: 2025/4/5

In Java, how do you extract the content of a PDF file?

The project's directory structure is as follows - [image: project directory structure]

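The Tika packages can be downloaded from http://tika.apache.org/download.html ; only tika-app-1.16.jar and tika-server-1.16.jar are needed. Both jars bundle an SLF4J binding, so putting both on the classpath triggers the "multiple SLF4J bindings" warning at the top of the output further down. The warning is harmless here, and tika-app-1.16.jar on its own should be enough for this example.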

The following program extracts content from a PDF using Java -

import java.io.File;
import java.io.FileInputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;

public class ExtractContentFromPDF {

    public static void main(String[] args) throws Exception {
        // Collects the extracted body text. The default handler stops after
        // 100,000 characters; new BodyContentHandler(-1) removes that limit.
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();

        // Parse the document with the PDF parser; try-with-resources
        // closes the input stream even if parsing throws.
        try (FileInputStream inputstream = new FileInputStream(new File("pdfExample.pdf"))) {
            PDFParser pdfparser = new PDFParser();
            pdfparser.parse(inputstream, handler, metadata, pcontext);
        }

        // Print the extracted text of the document
        System.out.println("Contents of the PDF :" + handler.toString());

        // Print every metadata field the parser found
        System.out.println("Metadata of the PDF:");
        String[] metadataNames = metadata.names();
        for (String name : metadataNames) {
            System.out.println(name + " : " + metadata.get(name));
        }
    }
}
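The program above hands whatever is in pdfExample.pdf straight to the parser. As a minimal sketch of a pre-check that could run first, the class below tests whether a file starts with the "%PDF-" marker that PDF files begin with. It uses only the Java standard library; the class name PdfHeaderCheck and its methods are hypothetical helpers for illustration, not part of Tika or of the original example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper (not part of Tika): a cheap sanity check before parsing.
class PdfHeaderCheck {

    // ASCII marker expected at the very start of a PDF file.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given bytes start with the PDF marker. */
    static boolean startsWithPdfMagic(byte[] head) {
        if (head.length < PDF_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, PDF_MAGIC.length), PDF_MAGIC);
    }

    /** Reads the first few bytes of the file and checks them for the marker. */
    static boolean looksLikePdf(Path path) throws IOException {
        // try-with-resources closes the stream whether or not reading succeeds
        try (InputStream in = Files.newInputStream(path)) {
            byte[] head = new byte[PDF_MAGIC.length];
            int total = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
            return total == head.length && startsWithPdfMagic(head);
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get(args.length > 0 ? args[0] : "pdfExample.pdf");
        if (!Files.isRegularFile(path)) {
            System.out.println(path + " does not exist");
            return;
        }
        System.out.println(path + (looksLikePdf(path)
                ? " starts with the PDF marker"
                : " does not start with the PDF marker"));
    }
}
```

A check like this only looks at the first five bytes, so it can reject an obviously wrong file early but says nothing about whether the rest of the document is well formed; the parser remains the real judge.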

The original PDF file, pdfExample.pdf, contains the following - [image: contents of pdfExample.pdf]

Running the example code above produces the following output -

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/F:/worksp/javaexamples/libs/tika_libs/tika-app-1.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/F:/worksp/javaexamples/libs/tika_libs/tika-server-1.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Sep 27, 2017 4:29:50 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See http://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
TIFFImageWriter not loaded. tiff files will not be processed
See http://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See http://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
Sep 27, 2017 4:29:50 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.

Contents of the PDF :
Apache Tika is a library that is used for document type detection and
content extraction from various file formats.
Internally, Tika uses various existing document parsers and
document type detection techniques to detect and extract data.
Using Tika, one can develop a universal type detector and content
extractor to extract both structured text as well as metadata from
different types of documents such as spreadsheets, text documents,
images, PDFs and even multimedia input formats to a certain extent.

Metadata of the PDF:
date : 2017-09-26T20:00:44Z
pdf:PDFVersion : 1.7
pdf:docinfo:title :
xmp:CreatorTool : WPS Office
Company :
Keywords :
access_permission:modify_annotations : true
access_permission:can_print_degraded : true
subject :
dc:creator : Administrator
dcterms:created : 2017-09-26T20:00:44Z
Last-Modified : 2017-09-26T20:00:44Z
dcterms:modified : 2017-09-26T20:00:44Z
dc:format : application/pdf; version=1.7
Last-Save-Date : 2017-09-26T20:00:44Z
pdf:docinfo:creator_tool : WPS Office
access_permission:fill_in_form : true
pdf:docinfo:keywords :
pdf:docinfo:modified : 2017-09-26T20:00:44Z
meta:save-date : 2017-09-26T20:00:44Z
pdf:encrypted : false
modified : 2017-09-26T20:00:44Z
pdf:docinfo:custom:SourceModified : D:20170927041644+08'16'
cp:subject :
pdf:docinfo:subject :
Content-Type : application/pdf
pdf:docinfo:creator : Administrator
creator : Administrator
meta:author : Administrator
dc:subject :
meta:creation-date : 2017-09-26T20:00:44Z
created : Tue Sep 26 16:00:44 BOT 2017
Comments :
access_permission:extract_for_accessibility : true
access_permission:assemble_document : true
xmpTPg:NPages : 1
Creation-Date : 2017-09-26T20:00:44Z
access_permission:extract_content : true
pdf:docinfo:custom:Company :
access_permission:can_print : true
SourceModified : D:20170927041644+08'16'
pdf:docinfo:custom:Comments :
meta:keyword :
Author : Administrator
producer :
access_permission:can_modify : true
pdf:docinfo:producer :
pdf:docinfo:created : 2017-09-26T20:00:44Z
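The metadata listing above comes out in no particular order, many fields are empty, and the dates appear in more than one format. The sketch below shows one way such name/value pairs might be tidied up after extraction: sorted by name, blank values dropped, and ISO-8601 timestamps such as 2017-09-26T20:00:44Z parsed into java.time.Instant. It uses only the standard library and works on a plain Map; the class MetadataReport, its methods, and the idea of first copying the Tika metadata into a Map are assumptions made for illustration. The sample values are copied from the output above.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical helper (not part of Tika): tidies metadata held in a plain Map.
class MetadataReport {

    /** Returns the entries sorted by name, skipping null or blank values. */
    static Map<String, String> sortedNonEmpty(Map<String, String> raw) {
        Map<String, String> sorted = new TreeMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String value = entry.getValue();
            if (value != null && !value.trim().isEmpty()) {
                sorted.put(entry.getKey(), value.trim());
            }
        }
        return sorted;
    }

    /** Parses an ISO-8601 UTC timestamp; returns empty for any other format. */
    static Optional<Instant> parseTimestamp(String value) {
        if (value == null) {
            return Optional.empty();
        }
        try {
            return Optional.of(Instant.parse(value));
        } catch (DateTimeParseException ex) {
            // e.g. "Tue Sep 26 16:00:44 BOT 2017" or "D:20170927041644+08'16'"
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // A few name/value pairs copied from the output above
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("pdf:PDFVersion", "1.7");
        raw.put("pdf:docinfo:title", "");
        raw.put("dcterms:created", "2017-09-26T20:00:44Z");
        raw.put("dc:creator", "Administrator");
        raw.put("xmpTPg:NPages", "1");

        for (Map.Entry<String, String> entry : sortedNonEmpty(raw).entrySet()) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }

        parseTimestamp(raw.get("dcterms:created"))
                .ifPresent(t -> System.out.println("created (parsed): " + t));
    }
}
```

Returning Optional from parseTimestamp keeps the caller from having to catch exceptions for the fields that use other date formats, such as the created and SourceModified values in the listing.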
