smalot/pdfparser
Standalone PHP PDF parsing library to extract text, pages, and metadata from PDFs. Supports compressed PDFs and various encodings, with configurable parsing options. Note: secured PDFs and form data extraction are not supported.
Refining a change in the latest release:
Ignore Form as well as Image XObjects when assembling the text array for a PDFObject. by @rupertj in https://github.com/smalot/pdfparser/pull/783
When assembling the text array for an object, skip Forms that don't contain any text, instead of all Forms. by @rupertj in https://github.com/smalot/pdfparser/pull/789
Full Changelog: https://github.com/smalot/pdfparser/compare/v2.12.3...v2.12.4
Summary: The fix prevents RawDataParser.php from entering an endless loop under certain circumstances, which would lead to memory exhaustion.
Details: When parsing a specifically crafted, malformed PDF file, the low-level RawDataParser enters a state that leads to uncontrolled memory allocation. This continues until the PHP script exhausts its memory_limit and crashes with a fatal error. An attacker can leverage this vulnerability by submitting a small, malicious PDF file to any service using this library, causing the server process to crash and become unavailable.
Thank you to Yang LUO (https://github.com/N0zoM1z0) for reporting this and providing details on the matter. https://github.com/smalot/pdfparser/pull/787 contains further information.
Ignore Form as well as Image XObjects when assembling the text array for a PDFObject. by @rupertj in https://github.com/smalot/pdfparser/pull/783
Full Changelog: https://github.com/smalot/pdfparser/compare/v2.12.2...v2.12.3
Full Changelog: https://github.com/smalot/pdfparser/compare/v2.12.1...v2.12.2
Full Changelog: https://github.com/smalot/pdfparser/compare/v2.12.0...v2.12.1
Full Changelog: https://github.com/smalot/pdfparser/compare/v2.11.0...v2.12.0
Full Changelog: https://github.com/smalot/pdfparser/compare/v.2.10.0...v2.11.0
Full Changelog: https://github.com/smalot/pdfparser/compare/v2.9.0...v.2.10.0
Replaced by v2.10.0
Full Changelog: https://github.com/smalot/pdfparser/compare/v2.8.0...v2.9.0
❗ This release contains a lot of changes in comparison to v2.7.0. We decided to have at least one release candidate before the next production-ready release.
Pull request #634 (Major Update to PDFObject.php + Ancillary) by @GreyWyvern fixes almost 20 issues and brings better parsing and more understandable code. If you want to find out exactly what changed, have a look at the pull request.
Full Changelog: https://github.com/smalot/pdfparser/compare/v2.7.0...v2.8.0
Full Changelog: https://github.com/smalot/pdfparser/compare/v2.8.0-RC1...v2.8.0-RC2
❗ This release contains a lot of changes in comparison to v2.7.0. We decided to have at least one release candidate before the next production-ready release.
Pull request #634 (Major Update to PDFObject.php + Ancillary) by @GreyWyvern fixes almost 20 issues and brings better parsing and more understandable code. If you want to find out exactly what changed, have a look at the pull request.
If you find any bugs, please let us know in https://github.com/smalot/pdfparser/issues/650 or open a new issue.
Full Changelog: https://github.com/smalot/pdfparser/compare/v2.7.0...v2.8.0-RC1
Full Changelog: https://github.com/smalot/pdfparser/compare/v.2.6.0...v2.7.0
Full Changelog: https://github.com/smalot/pdfparser/compare/v2.5.0...v.2.6.0
Full Changelog: https://github.com/smalot/pdfparser/compare/v2.4.0...v2.5.0
Full Changelog: https://github.com/smalot/pdfparser/compare/v2.3.0...v2.4.0
Full Changelog: https://github.com/smalot/pdfparser/compare/v2.2.2...v2.3.0
Full Changelog: https://github.com/smalot/pdfparser/compare/v2.2.1...v2.2.2
Full Changelog: https://github.com/smalot/pdfparser/compare/v2.2.0...v2.2.1
Full Changelog: https://github.com/smalot/pdfparser/compare/v2.1.0...v2.2.0
Full Changelog: https://github.com/smalot/pdfparser/compare/v2.0.1...v2.1.0
For PHP 7 users: In 2.0.0 we used a function which is PHP 8 only. It was fixed in #486.
Full Changelog: https://github.com/smalot/pdfparser/compare/v2.0.0...v2.0.1
❗ All function parameters and return types are now typed. That means that if you pass values which do not fit, you may receive type errors. Most of the typing was done internally, so you should not be affected. If you use internal functions, please check your code before going into production.
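A minimal sketch of what such a type error looks like in practice. The function pageLabel() below is hypothetical and not part of smalot/pdfparser; it only mimics a fully typed signature. The sketch assumes the calling file enables strict_types, which is a choice made by the caller and not something the release notes state.

```php
<?php

declare(strict_types=1);

// Hypothetical helper, NOT part of smalot/pdfparser: it only mimics a
// fully typed signature like the ones introduced in 2.0.0.
function pageLabel(int $pageNumber): string
{
    return 'Page ' . $pageNumber;
}

// Calls pageLabel() with an arbitrary value and reports the outcome
// as a string instead of letting a TypeError bubble up.
function tryPageLabel($value): string
{
    try {
        return pageLabel($value);
    } catch (TypeError $e) {
        return 'TypeError';
    }
}

echo tryPageLabel(3), PHP_EOL;   // the int matches the declared parameter type
echo tryPageLabel('3'), PHP_EOL; // a numeric string is rejected under strict_types
```

Without strict_types in the calling file, PHP would coerce the numeric string '3' to an int, while a non-numeric string would still raise a TypeError. Checking the values you pass to typed functions before upgrading avoids both cases.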
We initially decided to release 1.2.0 but finally jumped to 2.0.0 so that the backward-compatibility (BC) breaks ship in a major release instead (see https://github.com/smalot/pdfparser/issues/480).
Page->getText() in some cases (thanks to @Nickmanbear, #457)
Document::getFirstFont when no fonts are available (thanks to @PrinsFrank, #461)
❗Not production ready - We reworked our code base and added typed parameters as well as return values. If you find anything, please drop us a comment. Further information can be found at https://github.com/smalot/pdfparser/issues/468. Thank you in advance!❗
Further information about changes and fixes in 1.2.0 can be found here: https://github.com/smalot/pdfparser/releases/tag/v1.2.0-RC1
❗Not production ready - We reworked our code base and added typed parameters as well as return values. If you find anything, please drop us a comment. Further information can be found at https://github.com/smalot/pdfparser/issues/468. Thank you in advance!❗
Highlights:
Page->getText() in some cases (thanks to @Nickmanbear, #457)
Document::getFirstFont when no fonts are available (thanks to @PrinsFrank, #461)
@j0k3r improved our test backend.
PDFs with images can be parsed with less resource consumption (like memory) from now on. @Connum added a feature with #441 to ignore image data. It must be enabled manually though. You can do it easily:
use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;
$config = new Config();
$config->setRetainImageContent(false);
$parser = new Parser([], $config);
// $parser->parseFile (...)
Besides that, we fixed a problem with Scrutinizer (part of our test infrastructure).
Features:
Config.php with white space characters: it allows developers to override regex for white space recognition (#411, thanks @LucianoHanna)
Fixes:
Call to a member function getFontSpaceLimit() on null (#406, thanks @xfolder)
Uncaught Error: Call to undefined method Smalot\PdfParser\Header::__toString() in /var/www/vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php (thanks @fsmoak)