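Extracting Images from PDF Files
A PDF can store an image in one of two ways. In the first, the data of the original image file is copied straight into the document. In the second, the image is stored compressed, and extracting it takes more than simply decompressing the stream.
1. Images stored directly in their original format
In the example in the figure above, the JFIF header is plainly visible. In this case, save the content between stream and endstream, strip the leading and trailing 0x0a (newline) bytes, and the result can be opened as an image.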
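The step described above can be sketched in a few lines of Python. This is a minimal illustration, not a real PDF parser: the function name find_embedded_jpegs is made up for this example, and it assumes the stream body never contains the literal text "endstream" (a proper reader would use the /Length entry instead). It checks for the two-byte JPEG start marker 0xFF 0xD8 rather than for the JFIF string, since not every JPEG carries a JFIF segment.

```python
import re

# Matches the body between the `stream` and `endstream` keywords. An
# end-of-line (CRLF or LF) follows `stream`, and usually one precedes
# `endstream`; neither belongs in the extracted image file.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)

JPEG_SOI = b"\xff\xd8"  # every JPEG file starts with these two bytes


def find_embedded_jpegs(pdf_bytes):
    """Return the bodies of all streams that already look like JPEG files."""
    jpegs = []
    for match in STREAM_RE.finditer(pdf_bytes):
        body = match.group(1)
        if body.startswith(JPEG_SOI):
            jpegs.append(body)
    return jpegs
```

Each returned bytes object can be written to a .jpg file unchanged, for example by opening the output file in "wb" mode and calling write() on it.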
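2. Compressed images
For an image like this one, if you only decompress the stream:
the decompressed output is data like the following,
and saving that data to a file does not give you a viewable image.
The reason: http://itext.2136553.n4.nabble.com/Extract-FlateDecode-image-td2158485.html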
A FlateDecoded image isn't in any standard image format - it's just a
"raw 2D array of colorant values".
You will need to write the (complex) code to convert that array into
a standard image format - taking into consideration that images in
PDF can be in 9 different colorspaces, with varying bits per
components, etc.
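To make the quoted point concrete, here is a minimal sketch that wraps a FlateDecode image in a PNG container using only the standard library. It covers just the simplest case and rests on assumptions that are not in the original post: 8 bits per component, a /DeviceRGB or /DeviceGray color space, and no /DecodeParms predictor. The function name flate_image_to_png is hypothetical. The width and height have to come from the image dictionary, because the decompressed stream carries no header at all.

```python
import struct
import zlib


def _png_chunk(kind, payload):
    """Build one PNG chunk: length, type, payload, CRC over type + payload."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)


def flate_image_to_png(stream_body, width, height, components=3):
    """Wrap a FlateDecode image stream in a PNG container.

    Assumes 8 bits per component and no predictor;
    components=3 means /DeviceRGB, components=1 means /DeviceGray.
    """
    raw = zlib.decompress(stream_body)  # the "raw 2D array of colorant values"
    row_len = width * components
    if len(raw) != row_len * height:
        raise ValueError("sample data does not match the declared dimensions")

    # PNG wants a filter-type byte in front of every scanline; 0 means "none".
    scanlines = b"".join(
        b"\x00" + raw[y * row_len:(y + 1) * row_len] for y in range(height)
    )
    color_type = {1: 0, 3: 2}[components]  # PNG: 0 = grayscale, 2 = truecolor
    ihdr = struct.pack(">IIBBBBB", width, height, 8, color_type, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _png_chunk(b"IHDR", ihdr)
        + _png_chunk(b"IDAT", zlib.compress(scanlines))
        + _png_chunk(b"IEND", b"")
    )
```

Even this narrow case needs the dimensions, the component count, and a per-row rewrite of the data. Indexed palettes, CMYK, and other bit depths each need their own handling, which is why a general-purpose converter is considerably longer.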
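Fortunately, someone has already written such code: https://gist.github.com/gstorer/f6a9f1dfe41e8e64dcf58d07afa9ab2a
Using this code requires PyPDF2 to be installed. The script also imports PIL (Pillow), and it calls the older PyPDF2 interface (PdfFileReader, getPage, getObject), which PyPDF2 3.0 removed, so it needs a PyPDF2 release earlier than 3.0.
Usage:
    PDF_extract_images file.pdf page1 page2 page3
A copy of the code is kept here: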
# coding=utf-8
"""
The MIT License (MIT)
Copyright (c) 2018 Louis Abraham <louis.abraham@yahoo.fr>
Copyright ©2016 Ronan Paixão
Copyright (c) 2018 Gerald Storer

\x1B[34m\033[F\033[F
Extract the images from a pdf
\x1B[0m\x1B[36m\033[F\033[F
Supports most formats, but has some bugs (even pdfimages has).
For example, with encoding /CCITTFaxDecode, the image is sometimes flipped.
If you have a bug, see
https://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
for other solutions or drop me an email with your pdf file attached
\x1B[0m\x1B[35m\033[F\033[F
TODO:
    - add support for range queries
\x1B[0m\033[1m\033[F\033[F
Links:
    PDF format: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
    CCITT Group 4: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.6-198811-I!!PDF-E&type=items
    Extract images from pdf: http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
    Extract images coded with CCITTFaxDecode in .net: http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter
    TIFF format and tags: http://www.awaresystems.be/imaging/tiff/faq.html
    /Index support: https://github.com/ronanpaixao/PyPDFTK/blob/master/pdf_images.py
Usage:
    PDF_extract_images file.pdf page1 page2 page3 …
\033[0m\033[F\033[F
"""
# The __future__ import has to follow the docstring; placed before it, the
# string above would not become __doc__ and the usage message would be None.
from __future__ import print_function

# https://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
import PyPDF2
from PIL import Image, ImageOps

import sys
import struct
from os import path
import warnings
import io
from collections import namedtuple

warnings.filterwarnings("ignore")

# Map PDF color space names to Pillow image modes.
img_modes = {'/DeviceRGB': 'RGB', '/DefaultRGB': 'RGB',
             '/DeviceCMYK': 'CMYK', '/DefaultCMYK': 'CMYK',
             '/DeviceGray': 'L', '/DefaultGray': 'L',
             '/Indexed': 'P'}

PdfImage = namedtuple('PdfImage', ['data', 'format', 'image_name'])


def tiff_header_for_CCITT(width, height, img_size, CCITT_group=4):
    # http://www.fileformat.info/format/tiff/corion.htm
    fields = 8
    tiff_header_struct = '<' + '2s' + 'H' + 'L' + 'H' + 'HHLL' * fields + 'L'
    return struct.pack(tiff_header_struct,
                       b'II',  # Byte order indication: little endian
                       42,  # Version number (always 42)
                       8,  # Offset to first IFD
                       fields,  # Number of tags in IFD
                       256, 4, 1, width,  # ImageWidth, LONG, 1, width
                       257, 4, 1, height,  # ImageLength, LONG, 1, length
                       258, 3, 1, 1,  # BitsPerSample, SHORT, 1, 1
                       259, 3, 1, CCITT_group,  # Compression, SHORT, 1, 4 = CCITT Group 4 fax encoding
                       262, 3, 1, 0,  # PhotometricInterpretation, SHORT, 1, 0 = WhiteIsZero
                       # StripOffsets, LONG, 1, len of header
                       273, 4, 1, struct.calcsize(tiff_header_struct),
                       278, 4, 1, height,  # RowsPerStrip, LONG, 1, length
                       279, 4, 1, img_size,  # StripByteCounts, LONG, 1, size of image
                       0  # last IFD
                       )


def extract_images_from_pdf_page(xObject):
    image_list = []

    xObject = xObject['/Resources']['/XObject'].getObject()

    for obj in xObject:
        o = xObject[obj]
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            # getData() does not work for CCITTFaxDecode or DCTDecode
            # as of 1 Aug 2018. Not sure about JPXDecode.
            data = xObject[obj]._data

            color_space = xObject[obj]['/ColorSpace']
            if '/FlateDecode' in xObject[obj]['/Filter']:
                if isinstance(color_space, PyPDF2.generic.ArrayObject) and color_space[0] == '/Indexed':
                    color_space, base, hival, lookup = [v.getObject() for v in color_space]  # pg 262
                mode = img_modes[color_space]

                data = xObject[obj].getData()  # need to use getData() here
                img = Image.frombytes(mode, size, data)
                if color_space == '/Indexed':
                    img.putpalette(lookup.getData())
                    img = img.convert('RGB')
                imgByteArr = io.BytesIO()
                img.save(imgByteArr, format='PNG')
                image_list.append(PdfImage(data=imgByteArr,
                                           format='PNG',
                                           image_name=obj[1:]))

            elif '/DCTDecode' in xObject[obj]['/Filter']:
                # The stream is already a complete JPEG file.
                image_list.append(PdfImage(data=io.BytesIO(data),
                                           format='JPG',
                                           image_name=obj[1:]))
            elif '/JPXDecode' in xObject[obj]['/Filter']:
                # The stream is already a complete JPEG 2000 file.
                image_list.append(PdfImage(data=io.BytesIO(data),
                                           format='JP2',
                                           image_name=obj[1:]))
            elif '/CCITTFaxDecode' in xObject[obj]['/Filter']:
                if xObject[obj]['/DecodeParms']['/K'] == -1:
                    CCITT_group = 4
                else:
                    CCITT_group = 3
                data = xObject[obj]._data
                img_size = len(data)
                tiff_header = tiff_header_for_CCITT(
                    size[0], size[1], img_size, CCITT_group)
                im = Image.open(io.BytesIO(tiff_header + data))
                if xObject[obj].get('/BitsPerComponent') == 1:
                    # experimental condition
                    # http://users.fred.net/tds/leftdna/sciencetiff.html
                    im = ImageOps.flip(im)
                imgByteArr = io.BytesIO()
                # Save `im`, the image decoded in this branch. The original
                # called img.save(), which refers to a variable from the
                # FlateDecode branch.
                im.save(imgByteArr, format='PNG')
                image_list.append(PdfImage(data=imgByteArr,
                                           format='PNG',
                                           image_name=obj[1:]))
            else:
                print('Unhandled image type: {}'.format(xObject[obj]['/Filter']))
        else:
            # Not an image (for example a form XObject): recurse into it.
            image_list += extract_images_from_pdf_page(xObject[obj])

    return image_list


if __name__ == '__main__':
    try:
        filename = sys.argv[1]
        pages = sys.argv[2:]
        pages = list(map(int, pages))
        abspath = path.abspath(filename)
    except BaseException:
        print(__doc__, file=sys.stderr)
        sys.exit()

    file = PyPDF2.PdfFileReader(open(filename, "rb"))

    number = 0

    for p in pages:
        page0 = file.getPage(p - 1)  # page numbers on the command line are 1-based
        image_list = extract_images_from_pdf_page(page0)
        number += len(image_list)

        for pdf_image in image_list:
            img = Image.open(pdf_image.data)
            image_path = "{} - p. {} - {}.{}".format(
                abspath[:-4], p, pdf_image.image_name, pdf_image.format)
            img.save(image_path)

    print('-' * 20)
    print('{} extracted images'.format(number))
    print('-' * 20)