
Extracting Images from PDF Files

Published: 2024/3/26 · 豆豆


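Images are stored in a PDF in one of two ways. In the first, the data of the original image file is copied into the document unchanged. In the second, the image is stored compressed, and extracting it takes more than simply decompressing the data.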

1. Images stored in their original format

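For example, when the JFIF header is directly visible in the stream data, save the content between stream and endstream, strip the 0x0a bytes at the head and tail, and the result opens as an image.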
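As a rough sketch of that procedure, the Python snippet below scans the raw bytes of a PDF for stream/endstream pairs, trims the line breaks at both ends, and keeps the bodies that begin with the JPEG start-of-image marker. The function name `extract_embedded_jpegs` is made up for this illustration, and the byte scan is a simplification: a robust tool would parse the objects and honor each stream's /Length entry instead.

```python
import re

# "stream" not preceded by "end", then a line break, then the body up to "endstream".
_STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def extract_embedded_jpegs(pdf_bytes):
    """Return the stream bodies that look like complete JPEG files.

    Assumption: the stream bodies do not themselves contain the literal
    bytes "endstream"; a real parser would use the /Length entry.
    """
    images = []
    for match in _STREAM_RE.finditer(pdf_bytes):
        # Drop the trailing line break (0x0a, possibly preceded by 0x0d).
        body = match.group(1).rstrip(b"\r\n")
        # A JPEG file starts with the SOI marker FF D8.
        if body.startswith(b"\xff\xd8"):
            images.append(body)
    return images
```

Each returned item can be written to a .jpg file in binary mode as it is; no re-encoding is involved.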

2. Compressed images

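If you merely decompress an image stored this way, what comes out is a block of raw data. Saving that data to a file does not give you something an image viewer can open.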
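A small experiment makes the problem concrete. It assumes a made-up 2x2 RGB image and uses zlib, since FlateDecode is zlib/deflate compression. The variable names are chosen for this illustration only.

```python
import zlib

# Hypothetical 2x2 RGB image: 4 pixels x 3 bytes, stored the way a PDF stores samples.
raw_pixels = bytes([255, 0, 0,   0, 255, 0,
                    0, 0, 255,   255, 255, 255])

# This is what would sit between "stream" and "endstream" for a FlateDecode image.
stream_body = zlib.compress(raw_pixels)

# "Just decompressing" gives back the bare samples, nothing more.
decoded = zlib.decompress(stream_body)

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8"
looks_like_image_file = decoded.startswith(PNG_MAGIC) or decoded.startswith(JPEG_MAGIC)

print(len(decoded), looks_like_image_file)  # prints: 12 False
```

The decoded bytes carry no header, width, height or color information. Those values live in the image's PDF dictionary (/Width, /Height, /ColorSpace, /BitsPerComponent), which is why an image viewer cannot open the decompressed data on its own.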

The reason, quoted from http://itext.2136553.n4.nabble.com/Extract-FlateDecode-image-td2158485.html:

A FlateDecoded image isn't in any standard image format - it's just a "raw 2D array of colorant values".

You will need to write the (complex) code to convert that array into a standard image format - taking into consideration that images in PDF can be in 9 different colorspaces, with varying bits per components, etc.

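To give a feel for what "converting that array into a standard image format" involves, here is a minimal sketch that wraps the decompressed samples in a PNG container using only the standard library. It is not the approach of the script referenced in this article, which relies on Pillow. It assumes the simplest case only: 8 bits per component, DeviceGray or DeviceRGB, and no predictor in /DecodeParms. The function names are hypothetical.

```python
import struct
import zlib


def _png_chunk(kind, data):
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def flate_stream_to_png(stream_body, width, height, components=3):
    """Turn a FlateDecode image stream into PNG bytes.

    Assumptions: 8 bits per component, 1 (gray) or 3 (RGB) components,
    no PNG/TIFF predictor applied to the PDF stream.
    """
    if components not in (1, 3):
        raise ValueError("only gray (1) or RGB (3) samples are handled here")
    raw = zlib.decompress(stream_body)
    row_len = width * components
    if len(raw) != row_len * height:
        raise ValueError("sample count does not match width x height")

    # PNG wants every scanline prefixed with a filter-type byte (0 = none).
    scanlines = b"".join(
        b"\x00" + raw[y * row_len:(y + 1) * row_len] for y in range(height)
    )
    color_type = 0 if components == 1 else 2  # 0 = grayscale, 2 = truecolor
    ihdr = struct.pack(">IIBBBBB", width, height, 8, color_type, 0, 0, 0)
    return (b"\x89PNG\r\n\x1a\n"
            + _png_chunk(b"IHDR", ihdr)
            + _png_chunk(b"IDAT", zlib.compress(scanlines))
            + _png_chunk(b"IEND", b""))
```

The width, height and component count have to be read from the image dictionary in the PDF. Indexed, CMYK, ICC-based and other color spaces, as well as other bit depths, each need extra handling, which is what the quote means by "complex".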

Fortunately, someone has already written such code: https://gist.github.com/gstorer/f6a9f1dfe41e8e64dcf58d07afa9ab2a

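Using this code requires PyPDF2 to be installed; the script also imports PIL (Pillow). It calls the older PyPDF2 method names (PdfFileReader, getPage, getObject), which were removed in PyPDF2 3.0, so an earlier PyPDF2 release is needed to run it unchanged.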

Usage:
    PDF_extract_images file.pdf page1 page2 page3

A copy of the code is kept here:

# coding=utf-8
"""
The MIT License (MIT)
Copyright (c) 2018 Louis Abraham <louis.abraham@yahoo.fr>
Copyright © 2016 Ronan Paixão
Copyright (c) 2018 Gerald Storer

\x1B[34m\033[F\033[F
Extract the images from a pdf
\x1B[0m\x1B[36m\033[F\033[F
Supports most formats, but has some bugs (even pdfimages has).
For example, with encoding /CCITTFaxDecode, the image is sometimes flipped.
If you have a bug, see
https://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
for other solutions or drop me an email with your pdf file attached
\x1B[0m\x1B[35m\033[F\033[F
TODO:
    - add support for range queries
\x1B[0m\033[1m\033[F\033[F
Links:
    PDF format: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
    CCITT Group 4: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.6-198811-I!!PDF-E&type=items
    Extract images from pdf: http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
    Extract images coded with CCITTFaxDecode in .net: http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter
    TIFF format and tags: http://www.awaresystems.be/imaging/tiff/faq.html
    /Index support: https://github.com/ronanpaixao/PyPDFTK/blob/master/pdf_images.py

Usage:
    PDF_extract_images file.pdf page1 page2 page3 …
\033[0m\033[F\033[F
"""
# The docstring has to precede the __future__ import; otherwise __doc__ is
# None and the usage message printed on bad arguments is lost.
from __future__ import print_function

# https://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
import io
import struct
import sys
import warnings
from collections import namedtuple
from os import path

import PyPDF2
from PIL import Image, ImageOps

warnings.filterwarnings("ignore")

# PDF color space name -> Pillow image mode
img_modes = {'/DeviceRGB': 'RGB', '/DefaultRGB': 'RGB',
             '/DeviceCMYK': 'CMYK', '/DefaultCMYK': 'CMYK',
             '/DeviceGray': 'L', '/DefaultGray': 'L',
             '/Indexed': 'P'}

PdfImage = namedtuple('PdfImage', ['data', 'format', 'image_name'])


def tiff_header_for_CCITT(width, height, img_size, CCITT_group=4):
    # http://www.fileformat.info/format/tiff/corion.htm
    fields = 8
    tiff_header_struct = '<' + '2s' + 'H' + 'L' + 'H' + 'HHLL' * fields + 'L'
    return struct.pack(tiff_header_struct,
                       b'II',  # Byte order indication: little endian
                       42,  # Version number (always 42)
                       8,  # Offset to first IFD
                       fields,  # Number of tags in IFD
                       256, 4, 1, width,  # ImageWidth, LONG, 1, width
                       257, 4, 1, height,  # ImageLength, LONG, 1, length
                       258, 3, 1, 1,  # BitsPerSample, SHORT, 1, 1
                       259, 3, 1, CCITT_group,  # Compression, SHORT, 1, 4 = CCITT Group 4 fax encoding
                       262, 3, 1, 0,  # PhotometricInterpretation, SHORT, 1, 0 = WhiteIsZero
                       # StripOffsets, LONG, 1, len of header
                       273, 4, 1, struct.calcsize(tiff_header_struct),
                       278, 4, 1, height,  # RowsPerStrip, LONG, 1, length
                       279, 4, 1, img_size,  # StripByteCounts, LONG, 1, size of image
                       0  # last IFD
                       )


def extract_images_from_pdf_page(xObject):
    image_list = []

    xObject = xObject['/Resources']['/XObject'].getObject()

    for obj in xObject:
        o = xObject[obj]

        if o['/Subtype'] == '/Image':
            size = (o['/Width'], o['/Height'])
            # getData() does not work for CCITTFaxDecode or DCTDecode
            # as of 1 Aug 2018. Not sure about JPXDecode.
            data = o._data

            color_space = o['/ColorSpace']
            if '/FlateDecode' in o['/Filter']:
                if isinstance(color_space, PyPDF2.generic.ArrayObject) and color_space[0] == '/Indexed':
                    color_space, base, hival, lookup = [v.getObject() for v in color_space]  # pg 262
                mode = img_modes[color_space]

                data = o.getData()  # need to use getData() here
                img = Image.frombytes(mode, size, data)
                if color_space == '/Indexed':
                    img.putpalette(lookup.getData())
                    img = img.convert('RGB')
                imgByteArr = io.BytesIO()
                img.save(imgByteArr, format='PNG')
                image_list.append(PdfImage(data=imgByteArr,
                                           format='PNG',
                                           image_name=obj[1:]))

            elif '/DCTDecode' in o['/Filter']:
                # The stream already is a complete JPEG file
                image_list.append(PdfImage(data=io.BytesIO(data),
                                           format='JPG',
                                           image_name=obj[1:]))

            elif '/JPXDecode' in o['/Filter']:
                # The stream already is a complete JPEG 2000 file
                image_list.append(PdfImage(data=io.BytesIO(data),
                                           format='JP2',
                                           image_name=obj[1:]))

            elif '/CCITTFaxDecode' in o['/Filter']:
                if o['/DecodeParms']['/K'] == -1:
                    CCITT_group = 4
                else:
                    CCITT_group = 3

                data = o._data
                img_size = len(data)
                tiff_header = tiff_header_for_CCITT(
                    size[0], size[1], img_size, CCITT_group)
                im = Image.open(io.BytesIO(tiff_header + data))

                if o.get('/BitsPerComponent') == 1:
                    # experimental condition
                    # http://users.fred.net/tds/leftdna/sciencetiff.html
                    im = ImageOps.flip(im)

                imgByteArr = io.BytesIO()
                # Fixed: the copied code called img.save() here, but the image
                # opened in this branch is bound to "im", not "img".
                im.save(imgByteArr, format='PNG')
                image_list.append(PdfImage(data=imgByteArr,
                                           format='PNG',
                                           image_name=obj[1:]))
            else:
                print('Unhandled image type: {}'.format(o['/Filter']))
        else:
            # Not an image (e.g. a form XObject): look for images nested inside it
            image_list += extract_images_from_pdf_page(o)

    return image_list


if __name__ == '__main__':
    try:
        filename = sys.argv[1]
        pages = sys.argv[2:]
        pages = list(map(int, pages))
        abspath = path.abspath(filename)
    except BaseException:
        print(__doc__, file=sys.stderr)
        sys.exit()

    file = PyPDF2.PdfFileReader(open(filename, "rb"))

    number = 0

    for p in pages:
        page0 = file.getPage(p - 1)  # page numbers on the command line are 1-based
        image_list = extract_images_from_pdf_page(page0)
        number += len(image_list)

        for pdf_image in image_list:
            img = Image.open(pdf_image.data)
            image_path = "{} - p. {} - {}.{}".format(
                abspath[:-4], p, pdf_image.image_name, pdf_image.format)
            img.save(image_path)

    print('-' * 20)
    print('{} extracted images'.format(number))
    print('-' * 20)
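The /Indexed branch of the script hands the palette to Pillow through putpalette and then converts to RGB. What that step amounts to can be sketched without any library. An /Indexed color space is an array [/Indexed base hival lookup], and each sample is an index into the lookup table of base-color components. The helper name `expand_indexed` is hypothetical, and the sketch assumes 8-bit indices with an RGB base color space.

```python
def expand_indexed(indices, lookup, components=3):
    """Replace every palette index by its color from the lookup table.

    Assumptions: one byte per index (8 bits per component) and a base
    color space with `components` bytes per entry (3 for DeviceRGB).
    """
    out = bytearray()
    for i in indices:
        out += lookup[i * components:(i + 1) * components]
    return bytes(out)


# Hypothetical 3-entry palette: black, red, green.
palette = bytes([0, 0, 0,   255, 0, 0,   0, 255, 0])
rgb = expand_indexed(bytes([1, 2, 0]), palette)
print(list(rgb))  # prints: [255, 0, 0, 0, 255, 0, 0, 0, 0]
```

The expanded bytes are plain RGB samples again, so they can be saved the same way as any other FlateDecode image.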
