API

The library consists of four base modules:

parser: contains low-level classes for extracting fundamental data structures from the documents
interpreter: an interpreter of the PDF-standard commands
converters: contains high-level functions for converting the fundamental PDF structures to other formats
imagewriter: contains a simple implementation for converting PDF image streams to png/bmp/img formats

parser

WIP

interpreter

An interpreter of the PDF-standard commands.

Example

from pdfmajor.interpreter import PDFInterpreter

# Iterate over the pages of the document
for page in PDFInterpreter("/path/to/pdf.pdf"):
    print("page start", page.page_num)
    # Iterating over a page yields its layout items
    for item in page:
        print(" >", item)
    print("page end", page.page_num)

interpreter.PDFInterpreter

This generator function yields individual pages, each of which contains its respective items.

Arguments

input_file_path: str
preload: bool defaults to False
maxpages: int defaults to 0
password: str defaults to None
caching: bool defaults to True
check_extractable: bool defaults to True
ignore_bad_chars: bool defaults to False
pagenos: List[int] defaults to None
debug_level: logging.levels defaults to logging.WARNING
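
As a hedged illustration of the arguments above, the following sketch collects a few of them into a dictionary and forwards them to PDFInterpreter. It assumes the listed names are accepted as keyword arguments. The helpers interpreter_options and iter_page_numbers are hypothetical and not part of pdfmajor.

```python
import logging


def interpreter_options(maxpages=0, pagenos=None, verbose=False):
    """Build keyword arguments using names from the Arguments list.

    Hypothetical helper, not part of pdfmajor.
    """
    return {
        "maxpages": maxpages,
        "pagenos": pagenos,
        "ignore_bad_chars": True,
        "debug_level": logging.DEBUG if verbose else logging.WARNING,
    }


def iter_page_numbers(pdf_path, **options):
    """Yield page.page_num for each page (assumes keyword arguments are accepted)."""
    # Imported lazily so interpreter_options works without pdfmajor installed.
    from pdfmajor.interpreter import PDFInterpreter

    for page in PDFInterpreter(pdf_path, **options):
        yield page.page_num


options = interpreter_options(maxpages=2, verbose=True)
```

With pdfmajor installed, list(iter_page_numbers("/path/to/pdf.pdf", **options)) would collect page_num for every page the interpreter yields.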

Yield Value

This function returns a generator that yields PageInterpreter instances, one per page.

interpreter.PageInterpreter

This class behaves like a generator: iterating over an instance yields the individual layout items of the page.

Layout Items

All layout items extend the LTItem class. There are two kinds of layout items:

LTComponent: extends the base LTItem class and adds values such as the bounding box, height and width.
LTContainer: extends the LTComponent class and holds PDF elements that have child elements. Iterating over a container yields its child elements.
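
The hierarchy described above can be sketched with stand-in classes. This is a minimal model, not pdfmajor's implementation: the constructor signatures and the attribute names bbox, width and height are assumptions, and walk is a hypothetical helper that shows how nested containers could be traversed.

```python
class LTItem:
    """Stand-in base class (not pdfmajor's implementation)."""


class LTComponent(LTItem):
    """Stand-in component carrying a bounding box plus width and height."""

    def __init__(self, x0, y0, x1, y1):
        self.bbox = (x0, y0, x1, y1)  # attribute names are assumptions
        self.width = x1 - x0
        self.height = y1 - y0


class LTContainer(LTComponent):
    """Stand-in container: iterating over it yields its child items."""

    def __init__(self, x0, y0, x1, y1, children=()):
        super().__init__(x0, y0, x1, y1)
        self._children = list(children)

    def __iter__(self):
        return iter(self._children)


def walk(item, depth=0):
    """Yield (depth, item) for an item and, recursively, its descendants."""
    yield depth, item
    if isinstance(item, LTContainer):
        for child in item:
            yield from walk(child, depth + 1)


leaf_a = LTComponent(0, 0, 10, 5)
leaf_b = LTComponent(10, 0, 25, 5)
inner = LTContainer(0, 0, 25, 5, [leaf_a, leaf_b])
root = LTContainer(0, 0, 100, 50, [inner])

# Depth of each visited item, in depth-first order.
depths = [depth for depth, _ in walk(root)]
```

The same depth-first pattern should carry over to real layout items, provided containers can be told apart from plain components with an isinstance check.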

Layout Containers

All of these classes extend the LTContainer class.

LTXObject: a layout item containing other layout items
LTCharBlock: a layout item containing LTChars. One is produced whenever a TJ or Tj operator is issued within a text object.
LTTextBlock: a layout item containing LTCharBlocks. It corresponds directly to the BT and ET operator pair in the PDF standard.

Layout Components

All of these classes extend the LTComponent class.

LTChar: an individual character
LTCurves: represents a collection of SVG paths (available under self.paths)
LTImage: a component containing information regarding an image
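
To get a quick overview of what a page contains, the items can be tallied by class name. The sketch below is only illustrative: count_by_type is a hypothetical helper, and the four classes are empty stand-ins that borrow the names above so the example runs without pdfmajor.

```python
from collections import Counter


def count_by_type(items):
    """Count items by class name, descending into anything iterable."""
    counts = Counter()
    stack = list(items)
    while stack:
        item = stack.pop()
        counts[type(item).__name__] += 1
        if isinstance(item, (str, bytes)):
            continue  # never descend into strings
        try:
            children = iter(item)
        except TypeError:
            continue  # not a container
        stack.extend(children)
    return counts


# Stand-ins only: list subclasses play the role of containers.
class LTChar: pass
class LTImage: pass
class LTCharBlock(list): pass
class LTTextBlock(list): pass


page = [
    LTTextBlock([
        LTCharBlock([LTChar(), LTChar(), LTChar()]),
        LTCharBlock([LTChar(), LTChar()]),
    ]),
    LTImage(),
]

counts = count_by_type(page)
```

Because the helper relies only on iteration and class names, it makes no assumption about the attributes of the real layout classes.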

converters

Contains high-level functions for converting the fundamental PDF structures to other formats. The library includes four high-level conversion targets:

HTML
JSON
XML
Text

These formats are all generated using the PageInterpreter. To use them, call convert_file and select the format with out_type.

Example

from pdfmajor.converters import convert_file

convert_file(
    "path/to/input/file.pdf",
    "path/to/output/file.html",
    out_type="html"
)

converters.convert_file

A high-level abstraction for the conversion classes.

Arguments

input_file: TextIOWrapper
output_file: TextIOWrapper
image_folder_path: str defaults to None
codec: str defaults to 'utf-8'
maxpages: int defaults to 0
password: str defaults to None
caching: bool defaults to True
check_extractable: bool defaults to True
pagenos: List[int] defaults to None
out_type: str defaults to 'html'
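
A minimal sketch of converting one document to every listed format follows. Only out_type="html" appears in the example above, so the strings "json", "xml" and "text" are assumed spellings, and the file extensions are a convention chosen here. Paths are passed as strings, following that example. The helpers output_path_for and convert_to_all are hypothetical.

```python
from pathlib import Path

# Only "html" is confirmed by the example above; the other keys are assumed
# spellings of the JSON, XML and Text formats.
EXTENSIONS = {"html": ".html", "json": ".json", "xml": ".xml", "text": ".txt"}


def output_path_for(input_path, out_type):
    """Return the input path with the extension chosen for out_type."""
    return Path(input_path).with_suffix(EXTENSIONS[out_type])


def convert_to_all(input_path):
    """Call convert_file once per format (requires pdfmajor)."""
    # Imported lazily so output_path_for works without pdfmajor installed.
    from pdfmajor.converters import convert_file

    for out_type in EXTENSIONS:
        convert_file(
            str(input_path),
            str(output_path_for(input_path, out_type)),
            out_type=out_type,
        )


targets = {t: output_path_for("docs/report.pdf", t) for t in EXTENSIONS}
```

Keeping the format-to-extension mapping in one dictionary means a change to an assumed out_type string only needs to be made in one place.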

imagewriter

WIP