* v0.3.3
 - added timeout parameter for requests to the Apache Tika server
 - fallback to local java app if the Apache Tika server request fails or is incomplete
 - catch parser failure and return undef instead of breaking
 - removed Blacklist language identifier and use CLD instead

* v0.3.2
 - reset vocabulary and language model for each call
 - set locale to utf8 before calling external programs

* v0.3.1
 - fixed lowercase option

* v0.3.0 Sat Jan 5 14:02:06 EET 2019
 - moved options and functions to library
 - enable different output options
 - enabled calls to Apache Tika server if available

* v0.2.8 Fri Aug 17 10:04:58 EEST 2018
 - update to Apache Tika 1.18
 - hide warnings and error messages from external tools

* v0.2.7 Sat Jul 1 21:25:09 EEST 2017
 - added an ugly workaround to find java 1.6 for pdfxtk

* v0.2.6 Thu Jan 9 16:10:29 CET 2014
 - integrated language detection (-d)
 - language filter using language detection (-D lang)

* v0.2.5
 - merge paragraph heuristics for putting unfinished sentences together
 - better approach for finding word boundaries based on a unigram LM and dynamic programming
 - better de-hyphenation in pdfxtk-mode

* v0.2.4 Fri Mar 15 10:34:42 CET 2013
 - pdfxtk as default
 - heuristics to handle ligatures
 - dehyphenation and other heuristics in pdfxtk-mode (-X)
 - now also splits strings into characters to find known words (solves a problem with pdfxtk conversions)

* v0.2.3 Wed Mar 6 23:12:22 CET 2013
 - fixed test suite

* v0.2.2 Wed Feb 27 20:33:09 CET 2013
 - fixed problem with wrong shared-dir settings
 - make word-merging a bit more efficient

* v0.2.1 Fri Feb 15 16:29:41 CET 2013
 - add pdfXtk as another option for converting pdf files (see http://sourceforge.net/projects/pdfxtk/)

* v0.2 - Thu Feb 7 10:51:16 CET 2013
 - running without pdftotext is now possible
 - added lowercasing (can be switched off)

* v0.1 - Tue Jan 29 20:31:42 CET 2013
 - initial release