API
The library is constructed from 4 base modules:
- parser: contains low-level classes for extracting fundamental data-structures from the documents
- interpreter: an interpreter of the PDF-standard commands
- converters: contains high-level functions for conversion of the fundamental pdf structures to other formats
- imagewriter: contains a simple implementation for converting PDF Image Streams to png/bmp/img formats
parser
WIP
interpreter
An interpreter of the PDF-standard commands.
Example
from pdfmajor.interpreter import PDFInterpreter

for page in PDFInterpreter("/path/to/pdf.pdf"):
    print("page start", page.page_num)
    for item in page:
        print(" >", item)
    print("page end", page.page_num)
interpreter.PDFInterpreter
This generator function yields individual pages, each of which contains its respective items.
Arguments
- input_file_path: str
- preload: bool defaults to False
- maxpages: int defaults to 0
- password: str defaults to None
- caching: bool defaults to True
- check_extractable: bool defaults to True
- ignore_bad_chars: bool defaults to False
- pagenos: List[int] defaults to None
- debug_level: logging.levels defaults to logging.WARNING
Yield Value
This function returns a generator that yields PageInterpreter objects, one per page.
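Because each yielded page exposes `page_num` and can be iterated for its items, a per-page item count can be written against any iterable of pages. The following is a minimal sketch; `count_items_per_page` is a hypothetical helper and not part of pdfmajor.

```python
def count_items_per_page(pages):
    """Map each page number to the number of top-level items on that page.

    `pages` is any iterable of page objects that expose `page_num` and are
    themselves iterable, which is what the interpreter example relies on.
    """
    counts = {}
    for page in pages:
        # Iterating a page walks its items, so count them in a single pass.
        counts[page.page_num] = sum(1 for _ in page)
    return counts
```

With the library installed, this could be called as `count_items_per_page(PDFInterpreter("/path/to/pdf.pdf", maxpages=2))`, assuming the documented arguments are accepted as keyword arguments.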
interpreter.PageInterpreter
This generator-function-class yields individual layout items.
Layout Items
All layout items extend the LTItem class. There are two kinds of layout items:
- LTComponent: extends the base LTItem class. This class has additional values such as bounding boxes, height, and width.
- LTContainer: extends the LTComponent class. This class is used for elements of the PDF that have child elements. Iterating over this element yields its child elements.
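The relationship between these classes can be illustrated with stand-in definitions. This is only a sketch of the hierarchy as described here: the class names mirror the documented ones, but the constructor signatures, the `(x0, y0, x1, y1)` bounding-box layout, and the attribute names are assumptions rather than pdfmajor's actual implementation.

```python
class LTItem:
    """Base class for every layout item."""


class LTComponent(LTItem):
    """An item with a bounding box, from which width and height follow."""

    def __init__(self, bbox):
        # Assumed layout for this sketch: (x0, y0, x1, y1).
        self.x0, self.y0, self.x1, self.y1 = bbox
        self.width = self.x1 - self.x0
        self.height = self.y1 - self.y0


class LTContainer(LTComponent):
    """A component that also holds child items."""

    def __init__(self, bbox, children=()):
        super().__init__(bbox)
        self._children = list(children)

    def __iter__(self):
        # Iterating over a container yields its child elements.
        return iter(self._children)


box = LTContainer((0, 0, 100, 50), [LTComponent((0, 0, 10, 10))])
child_widths = [child.width for child in box]
```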
Layout Containers
All of these classes extend the LTContainer class.
- LTXObject: a layout item containing other layout items
- LTCharBlock: a layout item containing LTChars. This corresponds to each TJ or Tj operator issued within a text object.
- LTTextBlock: a layout item containing LTCharBlocks. This corresponds directly to the BT and ET operator pair in the PDF standard.
Layout Components
All of these classes extend the LTComponent class.
- LTChar: an individual character
- LTCurves: represents a collection of SVG paths (available under self.paths)
- LTImage: a component containing information regarding an image
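Since containers can nest (an LTTextBlock holds LTCharBlocks, which hold LTChars), visiting every item means descending recursively. The sketch below does this by duck typing: anything iterable is treated as a container and everything else as a leaf. `walk` is a hypothetical helper, and treating iterability as the container test is an assumption; checking against the library's LTContainer class would be the stricter alternative.

```python
def walk(item, depth=0):
    """Yield (depth, item) for `item` and, recursively, everything it contains."""
    yield depth, item
    # Strings are iterable too; stop here so they are not split into characters.
    if isinstance(item, (str, bytes)):
        return
    try:
        children = iter(item)
    except TypeError:
        # Not iterable: a leaf such as a character, a set of curves, or an image.
        return
    for child in children:
        yield from walk(child, depth + 1)
```

Inside the page loop, `for depth, node in walk(item): print("  " * depth, node)` would print an indented outline of each top-level item.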
converters
Contains high-level functions for conversion of the fundamental pdf structures to other formats. This library includes 4 high-level conversion cases:
- HTML
- JSON
- XML
- Text
These formats are all generated using the PageInterpreter. To use them, call the convert_file function with the desired out_type.
Example
from pdfmajor.converters import convert_file

convert_file(
    "path/to/input/file.pdf",
    "path/to/output/file.html",
    out_type="html"
)
converters.convert_file
A high-level abstraction for the conversion classes.
- input_file: TextIOWrapper
- output_file: TextIOWrapper
- image_folder_path: str defaults to None
- codec: str defaults to 'utf-8'
- maxpages: int defaults to 0
- password: str defaults to None
- caching: bool defaults to True
- check_extractable: bool defaults to True
- pagenos: List[int] defaults to None
- out_type: str defaults to 'html'
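To produce all four formats from one input, the call can be repeated with a different out_type each time. The sketch below takes the conversion function as a parameter so it can be exercised without the library. `OUT_TYPES`, `output_path_for`, and `convert_all` are hypothetical names; only "html" is a documented out_type value, and the strings "json", "xml", and "text" are assumptions inferred from the format list.

```python
from pathlib import Path

# Assumed out_type strings; only "html" appears in the documentation.
OUT_TYPES = ("html", "json", "xml", "text")


def output_path_for(input_path, out_type):
    """Derive an output path by swapping the input file's extension."""
    ext = "txt" if out_type == "text" else out_type
    return str(Path(input_path).with_suffix("." + ext))


def convert_all(input_path, convert):
    """Call `convert(input, output, out_type=...)` once per assumed format."""
    outputs = []
    for out_type in OUT_TYPES:
        out = output_path_for(input_path, out_type)
        convert(input_path, out, out_type=out_type)
        outputs.append(out)
    return outputs
```

With the library installed, `convert_all("path/to/input/file.pdf", convert_file)` would write the outputs next to the input file, provided the assumed out_type strings are accepted.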
imagewriter
WIP