I’m trying to extract text from a PDF file that has been cropped, i.e. it has a defined crop box that displays only a portion of the page.
The problem is that the cropped-out content still exists in the PDF file; it’s just not visible.
I’ve tried PyPDF2, pdfquery and pdfminer. They all read the entire content, including the cropped-out portion.
PyPDF2 lets me access the dimensions of the cropbox using:
    pdfFileObj = open(path, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pdfReader.getPage(0).cropBox
But I’m not sure what I can do with it. The files are being cropped in Java using Apache PDFBox. I’d prefer to read only the visible (uncropped) part of the files in Python, but I can also change the Java code that crops the files if that’s the only solution.
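To illustrate what I have in mind, here is a rough sketch of filtering text by position: keep only the text blocks whose centre falls inside the crop box. It uses pdfminer.six rather than PyPDF2, because pdfminer reports a bounding box for each text block. The helper names (normalize, inside, visible_text) are made up for this example. It also assumes the pages are not rotated and that pdfminer's layout coordinates start at the media box origin, which I have not verified for every kind of file.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def normalize(box: Sequence[float]) -> Box:
    """Return the box with x0 <= x1 and y0 <= y1."""
    x0, y0, x1, y1 = (float(v) for v in box)
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


def inside(bbox: Sequence[float], crop: Sequence[float]) -> bool:
    """True if the centre of bbox lies within crop (edges included)."""
    bx0, by0, bx1, by1 = normalize(bbox)
    cx0, cy0, cx1, cy1 = normalize(crop)
    mid_x, mid_y = (bx0 + bx1) / 2, (by0 + by1) / 2
    return cx0 <= mid_x <= cx1 and cy0 <= mid_y <= cy1


def visible_text(path: str) -> List[str]:
    """Per-page text, keeping only blocks centred inside the crop box."""
    # Imported here so the two helpers above work without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer
    from pdfminer.pdfpage import PDFPage

    # Read the media box and crop box of every page.
    with open(path, "rb") as fp:
        boxes = [(p.mediabox, p.cropbox) for p in PDFPage.get_pages(fp)]

    pages = []
    for (media, crop), layout in zip(boxes, extract_pages(path)):
        # Assumption: layout coordinates are relative to the media box
        # origin and the page is not rotated, so shift the crop box.
        mx0, my0, _, _ = normalize(media)
        cx0, cy0, cx1, cy1 = normalize(crop)
        local_crop = (cx0 - mx0, cy0 - my0, cx1 - mx0, cy1 - my0)
        kept = [
            element.get_text()
            for element in layout
            if isinstance(element, LTTextContainer)
            and inside(element.bbox, local_crop)
        ]
        pages.append("".join(kept))
    return pages
```

Calling visible_text(path) should return one string per page. A block that straddles the crop edge is kept or dropped as a whole, so filtering per line or per character would probably be needed for tighter crops.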
Any help is appreciated.