0.35    26 April 2002
        Updated some omissions to the NexTrieve.pm documentation.

        Added scripts "targz_collect" and "targz_count".

        Fixed errors caused by differently operating "pdftotext" program on some systems in the test-suite of PDF.pm.

        Fixed problem with new default case of "add_file" of Targz.pm.

0.34    25 April 2002
        Added default case to "add_file" of Targz.pm to more easily handle incoming mail messages.

        5 April 2002
        Changed some documentation after discussion with Mark Overmeer at the Amsterdam.pm meeting.

0.33    4 April 2002
        Fixed some annoying errors when manifying Text sequence by changing that to 1234 in HTML.pm, RFC822.pm and Message.pm.

        Added mime-handler "_pdf" to MIME.pm for handling "application/pdf" MIME-types of RFC822.pm, Message.pm and Mbox.pm indirectly. This means that emails with PDF-files attached will now also index the PDF-files.

        First releasable version of PDF.pm completed including (limited) test-suite (t/18pdf.t). Also added "pdf2ntvml" script plus test-suite (t/75pdf.t).

        Changed "add_news" and "_resync_news" methods in Targz.pm to allow for automatic recovery from a Net::NNTP object that has gone stale.

        2 April 2002
        Added "_fetch_file" method to NexTrieve.pm for fetching data as an external file.

        Added "DESTROY" method to NexTrieve.pm for automatically removing temporary files added by _fetch_file and possibly others in the future.

        Commenced work on PDF.pm, based on "pdfinfo" and "pdftotext" programs of the xpdf package, located at http://www.foolabs.com/xpdf/ . Added all the hooks and documentation in associated packages.

0.32    1 April 2002
        Fixed problem in method "ResourceFromIndex" in Index.pm. Some versions of NexTrieve give error message that would trigger the "ok" check. This is now fixed.

        Changed method "_create_tarfile" in Targz.pm to first create the tarfile and then gzip it. This approach allows incremental updates of the tarfile, allowing unlimited number of files to be added to the tarfile (it would bomb on huge numbers of messages in a single day before).

        Adapted documentation to indicate a "gzip" program with the "--best" parameter is also needed. This should probably lead to better compression of the gzipped tarfiles.

        31 March 2002
        Some more tuning in "_resync_news" of Targz.pm. Now correctly handles the case with a lot of missing messages: if the date of a message is two days or more before the last date of a message, then a collect is started from the message after that message.

        25 March 2002
        Fixed a small problem in internal "_resync_news" method of Targz.pm that would loop on missing messages in the target zone.

0.31    25 March 2002
        Refined the internal "_resync_news" method to quickly handle "holes" in the message stream. Now also uses a binary chop approach to find the last message that's on the news server that is already in the targz. This all applies to Targz.pm of course.

        24 March 2002
        Added and documented method "add_news" to Targz.pm. Takes a Net::NNTP object and reads messages from there, adding them to the targz. Handles re-syncing with newsgroups by a mix of date and message-id checks.

        Added and documented method "name" to Targz.pm.

        Added and documented method "count_storable", which is the same as "count" but uses the Storable module for persistency to prevent having to unpack tarfiles that haven't changed. Added checks to test-suite.

        Modified internal method "auto_clean" to "no_auto_clean" and documented it. Modified internal method "clean" to only work as an object method and documented it. Both in Targz.pm.

        Simplified some internals in Targz.pm. The tar program must now also be able to handle the "--directory" directive.

        23 March 2002
        Added and documented a "tarfile" method to Targz.pm.

        Made the datestamp checking routine in Targz.pm a little smarter so that it now also recognizes and handles NNTP-Posting-Date: and X-Trace: headers.

        Added support for an external hash to "count" method of Targz.pm: using an external hash can make things a lot faster because it does not need to read tar-files that haven't changed.

        Made directory parameter to Targz method of NexTrieve.pm default to the current directory.

        22 March 2002
        Adapted the undocumented "files" method of Docseq.pm so that it can accept a processor routine parameter. Also documented the method now. It is now useful as a basic conversion feature for any type of conversion by other modules.

0.30    18 March 2002
        First version of Targz.pm completed including documentation. You can now quickly store both messages as well as unix mailboxes in the NexTrieve::Targz archive format.

        Added return value for success to method "splat" in NexTrieve.pm.

        Added "filename:id" feature to _fetch_content_from_filename in NexTrieve.pm, allowing filenames to be specified with an ":id" suffix, which would then fill the "id" key in the content hash. So you can now specify an absolute (temporary) filename with an ID specification in one go. This applies to RFC822.pm and HTML.pm. Feature created to fix the re-XMLing process of Targz.pm.

        17 March 2002
        First version of Targz.pm almost ready. Only a few cleanup issues to be fixed.

        Created a specific method "write_file" in Document.pm so that the encoding information is saved when a single document is written out. All other methods to get at the XML of a Document object still return the XML _without_ the processor instruction for easy inclusion in document sequences.

        Added dependency on Cwd and File::Copy to Makefile.PL. Needed for Targz.pm.

        Added additional key-value pairs specification to the Document method of the RFC822.pm. Needed for Targz.pm.

        16 March 2002
        Started work on Targz.pm based on the scripts developed the past year.

        Bolted dependency for IO::File, IO::Socket and Date::Parse into NexTrieve.pm. They seem to have been around forever: no need for cleverness there.

0.29    11 March 2002
        Some documentation fixes to Message.pm and NexTrieve.pm.

        Renamed Overview.pod back to Overview.pm, as that _will_ show up for reading on the various CPAN related websites.

0.28    11 March 2002
        Finished initial version of Message.pm after some more discussions with Mark Overmeer. There doesn't seem to be a need for a Mail::Box interface yet, so that source will be dumped now.

        Changed Overview.pm to Overview.pod.

        10 March 2002
        Created MIME.pm module as a stash for MIME-conversion routines. Adapted RFC822.pm so that it uses the new MIME.pm module, removed its own versions of _plain and _html.

        Started work on Message.pm for converting Perl Mail::Message objects to document sequences. Added test-suite for it as well. Initially developed as NexTrieve::Mail::Box.pm, but this turned out to be too much double work. After discussions with Mark Overmeer, the author of Mail::Box and Mail::Message, it seemed to make much more sense to interface at the message level rather than at the mailbox level.

        Oops. Lost the NAME and SYNOPSIS section in Overview.pm while copying the text that was made off-line. Restored again now. This caused the Overview.pm to become "invisible" on CPAN, which is a pity for a module that consists of documentation only.

0.27    9 March 2002
        Added documentation for methods "texttype" and "texttypes" to the Query.pm module: they were missing.

        Added Overview.pm documentation module. Moved some of the documentation from NexTrieve.pm to it.

0.26    6 March 2002
        Finished first complete documentation of Resource.pm.

        Removed the "basedir" method from Resource.pm. The NexTrieve "basedir" feature is on the way out and shouldn't have existed in the Perl modules in the first place. Needed to adapt quite some tests in the test-suite as they used "basedir" as an example method.

0.25    5 March 2002
        Finished first complete documentation of Query.pm, Querylog.pm, Replay.pm and Search.pm.

        Added Query method to Replay.pm.

        Added documentation for "ampersandize" and "normalize" to NexTrieve.pm.

0.24    4 March 2002
        Finished first complete documentation of Docseq.pm, Document.pm, Hitlist.pm, Hitlist::Hit.pm, Index.pm, Mbox.pm.

        Changed method "ResourceFromIndex" in Index.pm to use "ntvcheck" rather than "ntvopt": the --xml functionality should be there. Adapted test-suite so it now correctly handles the absence of --xml functionality in ntvcheck.

        3 March 2002
        Finished first complete documentation of Daemon.pm.

        Adapted method "executable" in NexTrieve.pm to return the license expiration info as a datestamp: YYYYMMDD.

        Changed method "PrintError" in NexTrieve.pm to accept the "cluck" keyword. If specified, the $SIG{__WARN__} handler is set to Carp::cluck.

        Changed method "RaiseError" in NexTrieve.pm to accept the "confess" keyword. If specified, the $SIG{__DIE__} handler is set to Carp::confess.

        Changed method "ResourceFromIndex" in Index.pm to use "ntvopt" rather than "ntvcheck": the --xml functionality seems to have moved.

        Finished first complete documentation of DBI.pm.

        Finished first complete documentation of RFC822.pm.

        Removed "use NexTrieve::Resource" from HTML.pm and RFC822.pm. They are only needed when the "Resource" method would be called, which is not too often. The NexTrieve::Resource module must now be explicitly specified in the "use NexTrieve qw()" list when needed. Adapted the test-suite accordingly.

        Added "mailsimple" method to RFC822.pm. Same as default settings of the "mailbox2ntvml" script.

        Finished first complete documentation of HTML.pm.

        Added "embed" to _default_removecontainers in NexTrieve.pm.

        Minor fix to _intext_recode of NexTrieve.pm to handle the case when no input is given. This was causing a lot of warnings in the test-suite if MIME::xxx were not installed.

        Minor fix to _plain and _html in RFC822.pm to allow handling of empty text and html (which could be caused by MIME::Base64 and MIME::QuotedPrint not being installed).

        Added support for handling the case when MIME::Base64 and MIME::QuotedPrint are not installed. They were handled by the modules already, but not in the test-suite, causing errors when they shouldn't.

        28 February 2002
        First half of more complete documentation of HTML.pm.

0.23    28 February 2002
        Added flag to internal method "_recoding_error" so that a different error message is displayed when some data was actually returned. Adapted method "_iconv" to use this new feature.

        Changed handling of calling external "iconv" from a piped open to a system with temporary input and output files. Apparently, that is the only way to reliably obtain exit codes from iconv in older versions of Perl.

        Changed the handling of recoding =?encoding?Q?string?= strings inside strings to _process_container. This should make the handling much more general, and possibly less CPU-intensive as it is only done on elements from the content-hash that are actually converted to attributes or texttypes. Added "t/headerenc.mbox" and "t/asia.mbox" test-cases.

0.22    27 February 2002
        Added "archive" method to Mbox.pm. When an archive is specified, it is assumed to be either a handle or a filename to be opened for appending. Just before a message is processed, it will be written to the archive, allowing developers to use this for a simple mail archiving system. Added t/74mbox.t test for this functionality.

        Fixed bug in Mbox that would occur if the same $docseq would be used in multiple runs together with a conceptualmailbox and a baseoffset. In the second run, the baseoffset of the first run would be used. Now the baseoffset is updated in the object after a run when a conceptual mailbox is used.

        Changed Mbox.pm also so that a conceptualmailbox is just that and that you need to specify an offset in that case (if it's different from 0 that is). Adapted t/14mbox.t accordingly.

        Made the use of -o obligatory when using -c. No longer looks up offset assuming conceptualmailbox is a real file somewhere. Adapted test-suite t/72mbox.t accordingly. This was in "mailbox2ntvml" of course.

        Fixed minor nit in "mailbox2ntvml": if defined($baseoffset) was not needed at all.

0.21    26 February 2002
        Fixed problem in the "mailbox2ntvml" script that would ignore the -o (baseoffset) parameter. Added two test-suites for checking the functionality of the -c and -o parameters of that script.

        Added script "dbi2ntvml" for executing a query in a database and having a document sequence created for the result.

        Fixed problems with broken attachments that don't finish with a newline in RFC822.pm by fixing the "next" and "nextnonewline" of the hidden NexTrieve::handle object in NexTrieve.pm. Added a test-file "badmime.mbox" to test for this eventuality.

        Fixed problem in scripts "mailbox2ntvml" and "html2ntvml": the -E flag for specifying the default input encoding did not work. The default input encoding was always set to 'iso-8859-1'.

        Further refined the ucs-4 and ucs-2 encoding issues: made the "utf3216check" method a lot smarter. It is now able to detect big and little endian and sets the encoding information appropriately. Added support for "ucs-2le" and "ucs-4le" to UTF8.pm. Added heuristics to _normalize_encoding to convert "utf-32" and "utf-16" to the appropriate "ucs*" version. Added HTML-files with little-endian 2 and 4 byte encodings to the test-suite.

        Removed "header2attribute" and "header2texttype" methods from RFC822.pm. Instead, the inheritable "field2attribute" and "field2texttype" should now be used. Changed the documentation, the test-suite and scripts accordingly.

        Changed name of "ShowErrorsAsWarnings" method in NexTrieve.pm to "PrintError" to conform with the generally accepted way that the "Perl" DBI.pm works. Changed all occurrences in the modules, scripts and test-suite to reflect this change.

        Changed name of "DieOnError" method in NexTrieve.pm to "RaiseError" to conform with the generally accepted way that the "Perl" DBI.pm works. Changed all occurrences in the modules, scripts and test-suite to reflect this change.

        Added NexTrieve::DBI.pm module for creating document sequences out of DBI statement handles (actually, any object that has a method that can be called repeatedly and which returns a reference to a hash). It is now easy to create document sequences out of databases! Added small test-suite for it: t/15dbi.t.

        Moved "field2attribute" and "field2texttype" methods from HTML.pm to NexTrieve.pm, so they can be inherited by DBI.pm and other modules. Removed the methods from HTML.pm as they are now inherited.

        Removed now obsolete "titlemax" method from RFC822.pm.

        Found that documents encoded in utf-32 or utf-16 were not being handled correctly by html2ntvml. Fixed this by adding a method "utf3216check" to NexTrieve.pm that will check its input for utf-32 or utf-16 encoding (by checking the first 8, respectively 4 bytes of the text) and convert that to utf-8 when deemed to be utf-32/utf-16. Added call to this method to HTML.pm and added two test-cases, right out of the standard Apache distribution, for these encodings.

        Added the conversion from utf-32 and utf-16 (actually: ucs-2be and ucs-4be) to UTF8.pm, so that these conversions are done internally.

0.20    25 February 2002
        Generalize the handling of pairs in HTML.pm. Added "author" and "generator" to the content hash as extra keys if available. Other keys should now be trivial to add and should possibly be customizable externally.

        Sometimes the _iconv method of NexTrieve.pm seems to not be able to create the file. It now silently exits without invoking _iconv. Should probably be handled differently.

        Added "x-mac-roman" and "windows-874" as standard encodings that can be handled by UTF8.pm. This should allow processing of most Mac documents and some documents with Thai characters.

        Added feature to _fetch_content in NexTrieve.pm that checks for protocol-type specifications in the id specified and, if found, forces a "URL" type fetch. This change allows URL's to be specified on input anywhere, but most specifically in the "html2ntvml" script.

        Fixed problem in _fetch_from_url in NexTrieve.pm that would cause URL's of the form "http://www.nextrieve.com" (note the missing slash at the end) to fail.

        Removed some superfluous tables from NexTrieve.pm that weren't necessary anymore.

        Fixed baseoffset problem in script "mailbox2ntvml" if the referenced mailbox file didn't exist. Also killed warning in that case in HTML.pm.

        Found one case of badly formatted HTML that exposed various problems in the Document method of HTML.pm. Fixed the problems and added a test-case for it in the test-suite. Fixed the same problems in the HTML-attachment handling of RFC822.pm.

        Changed method "tempfilename" in NexTrieve.pm to use the complete hex address in the filename rather than just the numeric part.

        Added iso-885\d-* as misspellings for iso-8859-* to _normalize_encoding in NexTrieve.pm. Also added "html" as a misspelling for "iso-8859-1". Added checks in the test-suite to test for these misspellings.

        Added source specification to several error messages in HTML.pm.

        Changed the "create_module" script so that the UTF-8 values are generated at module creation time rather than when substituting the values in strings. Updated UTF8.pm accordingly. Should make things significantly faster.

0.19    24 February 2002
        Added -a and -p flag to "html2ntvml" script to activate the ASP-style and PHP-style tag removal.

        Most of the test-suite scripts will now show the XML if there was an unexpected XML found in any conversion.

        Made the general conversion of containers somewhat stricter in HTML.pm so that there is less chance of throwing away valuable stuff.

        Added methods "asp" and "php" to add a pre-processor subroutine to the HTML-object for removing ASP-style tags in the form <%...%> and PHP-style tags in the form <?...?> from the HTML. Added checks to make sure that it works.

        Generalized checking of t/70html.t and t/71mbox.t so that regular expressions can be placed in the stderr file, allowing for natural language independent checking of error messages. This change was inspired by Arnaud ASSAD's report of a problem with a French "speaking" iconv.

        Completed first phase of more or less complete documentation of the NexTrieve.pm module, including small descriptions of the input and output parameters of methods, rather than just an example call.

        Fixed problem with the "encoding" method of NexTrieve.pm: setting an encoding on an object that already has an encoding, now properly saves the XML in the object of which the encoding was changed.

        Added file VERSION so that stuff is easier to keep in CVS.

        Added check for right version of modules to all of the scripts. Now, a warning will be output if the script notices it is using a version of the modules for which it was not designed.

        Removed -c flag from call to "iconv": there are too many iconv's out there that don't support it.

0.18    23 February 2002
        Added "-c" flag to call to "iconv" so that it will not bomb on invalid characters. Hopefully -c is valid for all versions of iconv out there.

        Swiped iso-8859-* and windows-125* to UTF-8 conversion lists from the Internet and created a conversion program that creates the source code to the new NexTrieve::UTF8.pm module. From now on, all conversions from iso-8859-* and windows-125* to UTF-8 are done natively, i.e. without any external programs.

        Removed all the stuff related to recoding that wasn't necessary anymore from NexTrieve.pm.

0.17    22 February 2002
        Completely rewritten recoding in NexTrieve.pm. Lost the recoding hash as well as the methods "_text_icon", "_default_recoding_handler", "recode_handler" and "find_recoding". Instead of being recoding method centric, a "from->to" centric approach has been taken. For each pair of "from->to" recoding, a handler written in Perl is by default available (e.g. for "iso-8859-1" to "utf-8"). If an encoding pair is not found, first it is checked whether Text::Iconv can handle that recoding. If so, a closure to the object doing that conversion is created and saved. If that fails, a closure to an external "iconv" program is created, using the generic "_iconv" method. This should make recoding faster in many cases, and also handle dependencies on external ways of doing recoding much better.

        Added some smart alecky way for RFC822.pm to allow the first attachment to set the encoding of the document, rather than assuming iso-8859-1 and causing recodings to be done for windows-1252 attachments.

        21 February 2002
        Added stuff to NexTrieve.pm, HTML.pm, RFC822.pm and Mbox.pm so that if there is a conversion error, the filename and line number (in case of a mailbox) is shown in the error line.

        Added conversion from "windows-1252" to "iso-8859-1" encoding to the default recode handler in NexTrieve.pm.

        Fixed problem with "Text::Iconv" recode handler if specified directly rather than "found", in NexTrieve.pm.

        Added some more checks to _normalize_encoding in NexTrieve.pm so that "iso8859-1" and "iso_8859_1" are converted to "iso-8859-1". Added some checks for this to t/01basic.t.

        Added ^K as an extra null byte to be removed, in HTML.pm.

        20 February 2002
        Removed character range 0x80-0x9f from illegal character range, as these are valid windows-1252 characters and are no problem in iso-8859-1 even if they are supposed to be undefined.

        Added _default_recoding_handler to NexTrieve.pm. This should be able to convert from iso-8859-1 and windows-1252 to utf-8 by itself. Allow this recoding method to be selected by the key "default". Added a test file "win1252.html" to the test-suite.

        Added ^L as an extra null byte to be removed, in HTML.pm.

        Fixed "find_recoding" to use the keys in the known recoding methods hash.

0.16    20 February 2002
        Adapted the check for an external "iconv" in NexTrieve.pm to do an actual conversion, rather than checking for the -V flag. Should really fix problem spotted by Nyk Cowham on a Mac OSX.

        19 February 2002
        Fixed problem in "xmllint" of NexTrieve.pm: value was being set even if xmllint would not be available on a platform, causing the test-suite to break. Spotted by Arnaud ASSAD.

        Added method "shorten" to NexTrieve.pm for shortening strings and making sure there are no broken entities at the end. Thought it would be nice for processing routines, such as in "html2ntvml" script. Since strings passed to processor routines are not normalized yet, this is not a problem and for that reason this method is not needed. Left in the source as it seems to be a handy routine to have anyway.

        Fixed additional problem with