How to Deal with a Corrupt Message Blocking Indexing

  • 7019106
  • 02-Dec-2015
  • 07-Aug-2017


Retain 3.x


Searching does not return items past a certain date, but browsing the archive works fine.


If you look in the Indexer log, you will find an error along these lines:

04:43:57,283 TextExtractionService - IOException while converting Reader - Input document may be malformed. String will be truncated or incomplete
04:43:57,287 LuceneDocumentUtil - indexing: TEXT.htm, have sizelimit: -1, filesize: 545, handlerclass: com.gwava.extractor.TextExtractor hash: 5E439513536BAD9B56F071C7ED02B5104DE883A7E89821F140F4D78D33D34396
04:43:57,368 LuceneDocumentUtil - indexing: mail.txt, have sizelimit: -1, filesize: 243, handlerclass: com.gwava.extractor.HTMLExtractor hash: 579857260D64406F83ED5DB8309C8B225CE6D26FBAD1437D8955FECAFC013043
04:46:37,218 AbstractBackgroundIndexer - [BGINDEXER] close indexing writer this run...
04:46:37,218 LuceneIndexingLocker - [IMANAGER] indexWriter released from BGINDEXER1448185292111.1448185292112 -- [] -- removed=true
04:46:37,218 LuceneIndexingLocker - [IMANAGER] IndexModificationLock released from BGINDEXER1448185292111
04:46:37,219 AbstractBackgroundIndexer - reportError: IndexThreadProtection :: :: EXCEPTION : java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
    at org.apache.lucene.index.IndexWriter.prepareCommit(
    at org.apache.lucene.index.IndexWriter.commitInternal(
    at org.apache.lucene.index.IndexWriter.commit(
    at org.apache.lucene.index.IndexWriter.commit(
    at com.gwava.indexing.lucene.impl.LuceneBackgroundIndexer.flushIndexWriter(
    at com.gwava.indexing.lucene.impl.LuceneBackgroundIndexer.doThreadWork(
    at com.gwava.indexing.common.IndexingThread$

04:46:37,230 AbstractBackgroundIndexer - [BGINDEXER] Serious error! This catch should never be reached!
04:46:37,230 AbstractBackgroundIndexer - [BGINDEXER] Thread ended.
04:46:37,230 IndexingThread - Index slave has stopped
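
The hashes used in the lookup below come from the LuceneDocumentUtil lines logged just before the failure. As a convenience, here is a minimal Python sketch for pulling them out of a log excerpt. It assumes the line layout matches the sample above (file name after "indexing:", a 64-character hex value after "hash:"); the function name extract_hashes and the variable sample are illustrative only and are not part of Retain.

```python
import re

# Assumed line layout, taken from the log excerpt in this article:
#   ... LuceneDocumentUtil - indexing: <file name>, ... hash: <64 hex characters>
HASH_LINE = re.compile(
    r"LuceneDocumentUtil - indexing: (?P<name>[^,]+),.*\bhash: (?P<hash>[0-9A-Fa-f]{64})"
)


def extract_hashes(log_text):
    """Return (file name, hash) pairs in the order they appear in log_text."""
    return [(m.group("name"), m.group("hash")) for m in HASH_LINE.finditer(log_text)]


# Two lines copied from the log excerpt above.
sample = (
    "04:43:57,287 LuceneDocumentUtil - indexing: TEXT.htm, have sizelimit: -1, "
    "filesize: 545, handlerclass: com.gwava.extractor.TextExtractor "
    "hash: 5E439513536BAD9B56F071C7ED02B5104DE883A7E89821F140F4D78D33D34396\n"
    "04:43:57,368 LuceneDocumentUtil - indexing: mail.txt, have sizelimit: -1, "
    "filesize: 243, handlerclass: com.gwava.extractor.HTMLExtractor "
    "hash: 579857260D64406F83ED5DB8309C8B225CE6D26FBAD1437D8955FECAFC013043\n"
)

for name, doc_hash in extract_hashes(sample):
    print(name, doc_hash)
```

If the log format in your version differs, adjust the pattern accordingly.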

We went into the database and used a hash logged just before the failure to find the document_id:
 select document_id from t_document where hash = '5E439513536BAD9B56F071C7ED02B5104DE883A7E89821F140F4D78D33D34396';
Then we found the message_id from the document_id:
 select message_id from t_message_attachments where document_id = '11887702';
Then we checked if the message_id was connected to other messages:
 select count(*) from t_message where message_id = '8777964';
Both hashes from the error were connected to this one message.
Then we checked whether the item had already been indexed:
 select f_indexed from t_message where message_id = '8777964';
It had not, so we set f_indexed for that message to an invalid value (-3) so that index operations could continue:
 update t_message set f_indexed = '-3' where message_id = '8777964';
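
The separate lookups above can also be chained into a single join. The following Python sketch shows the idea against an in-memory SQLite database. It is an illustration under stated assumptions, not the Retain schema: only the tables and columns named in this article are modeled, the integer column types and the starting f_indexed value of 0 are guesses, and the helper names find_messages and skip_message are hypothetical. Back up the database before changing anything on a production system.

```python
import sqlite3

DOC_HASH = "5E439513536BAD9B56F071C7ED02B5104DE883A7E89821F140F4D78D33D34396"

# Hypothetical stand-in schema: only the columns mentioned in the article.
# Column types and the initial f_indexed value (0) are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE t_document (document_id INTEGER, hash TEXT);
    CREATE TABLE t_message_attachments (message_id INTEGER, document_id INTEGER);
    CREATE TABLE t_message (message_id INTEGER, f_indexed INTEGER);
    """
)
conn.execute("INSERT INTO t_document VALUES (?, ?)", (11887702, DOC_HASH))
conn.execute("INSERT INTO t_message_attachments VALUES (?, ?)", (8777964, 11887702))
conn.execute("INSERT INTO t_message VALUES (?, ?)", (8777964, 0))
conn.commit()

# hash -> document_id -> message_id -> f_indexed, in one query.
LOOKUP = """
    SELECT m.message_id, m.f_indexed
    FROM t_document d
    JOIN t_message_attachments a ON a.document_id = d.document_id
    JOIN t_message m ON m.message_id = a.message_id
    WHERE d.hash = ?
"""


def find_messages(conn, doc_hash):
    """Return (message_id, f_indexed) rows for messages that reference the hash."""
    return conn.execute(LOOKUP, (doc_hash,)).fetchall()


def skip_message(conn, message_id, marker=-3):
    """Set f_indexed to the marker value for one message; return rows changed."""
    cur = conn.execute(
        "UPDATE t_message SET f_indexed = ? WHERE message_id = ?",
        (marker, message_id),
    )
    conn.commit()
    return cur.rowcount


print(find_messages(conn, DOC_HASH))
print(skip_message(conn, 8777964))
print(find_messages(conn, DOC_HASH))
```

Running the lookup before and after the update makes it easy to confirm that only the intended message was changed.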

We started the indexer and it began indexing normally.

Additional Information

This article was originally published in the GWAVA knowledgebase as article ID 2668.