Archive Jobs Running Very Slowly

  • 7020610
  • 13-Sep-2013
  • 07-Aug-2017

Environment


Retain 3.x/4.x

Situation

Archive jobs that normally take 45 minutes are now taking anywhere from 15 hours to 2 days.

Resolution


There can be multiple reasons for jobs taking so long to complete.  Here is a compilation of suggestions from multiple support engineers:


1.  First question:  Is the job running slowly due to throughput issues (messages per second ("m/s") on the Worker window)? 

If so, proceed to step #2. If throughput looks normal, focus on the retention flag date for the mailboxes (GroupWise) or the item store flag date (all other systems).  Errors on old items can keep the flags of multiple mailboxes from advancing; because Retain starts at the flag date, months' worth of items then get re-processed every time those mailboxes are processed.  See "How Retention Services and Item Store Flags Work".  Fixing the job errors so that the flags advance should take care of the issue; if it does not, proceed to step #2.

2.  The issue could be on the mail server.

Possible problems could include disk I/O, heavy load on the mail server, or a saturated connection from the mail server to the Worker. 

You could try the diagnostic test described in this KB:  "Diagnostic Option for Testing Source of Job Performance Issues".

For Exchange, see the KB articles under the Performance category.

3.  Performance issues between the Worker and the Server (not the physical server per se, but the RetainServer software).

This part depends on where the Worker is installed (this also applies to issue #2). If it's on the mail server, then you could also run into network bandwidth issues.

Another factor to consider:  Retain 3.0 and earlier versions defaulted to having the Worker and the Server communicate through the web server (port 80), which we have found can slow down a job.  Bypassing the web server (Apache, IIS, etc.) can speed things up because it eliminates the middleman:

a.  Modify the Worker's configuration, changing the communication port from the 2.x default of 80 to 48080.

b.  Download a new Worker bootstrap.

Go to the bootstrap tab of the Worker configuration and download the new bootstrap file.  Follow the instructions in this KB article for replacing the bootstrap:  https://support.microfocus.com/kb/doc.php?id=7019768

c.  Set Worker logging to diagnostic

Data Collection | Worker | Logging

Support can then review the resulting Worker log for obvious signs that a particular piece (the mail server or the Retain server) is slow.

4.  Retain server performance.

This part is where Retain takes apart the message and stores it in the archive. This process depends on disk I/O, available RAM, and database responsiveness (if the database is on the same server as Retain, then the same questions of I/O apply).

Disk I/O
One way to test disk responsiveness can be found here: https://support.microfocus.com/kb/doc.php?id=7020251.

See our Retain Planning and Design Best Practices article.

RAM
See our Retain Planning and Design Best Practices article.

5.  Database performance.

You can also see how long each message is taking to archive in the logs.  You'd need to have it in diagnostic mode to see it properly.  See "Location of Log Files".

You can run some simple queries to get an idea of how quickly the database is responding. For example, with MySQL: SELECT COUNT(f_indexed) FROM t_message;  This has the additional perk of giving you an idea of how many emails you've collected with Retain.
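
While a job is running, it can also help to look at what the database is busy with.  The following is a minimal sketch for MySQL, run from the mysql command-line client.  It uses only standard MySQL statements, and the table and column names are the ones from the example above:

```sql
-- The mysql client prints the elapsed time after each statement,
-- which gives a rough idea of how quickly the database responds.
SELECT COUNT(f_indexed) FROM t_message;

-- List the statements that are currently executing. During a slow job,
-- look for entries with a large Time value (seconds in the current state).
SHOW FULL PROCESSLIST;
```

A statement that stays at the top of the process list for many seconds while a job runs is a good candidate for closer inspection.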

You can also enable MySQL slow query logging.  MS SQL and Oracle databases likely have similar options.  Enabling slow query logging will show you any query that takes over 1 second to process, thus pointing you to the table that is having issues.  It could be a missing index.  See "Enabling MySQL Slow Query Logging for Troubleshooting Job Performance Issues".
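
As a rough sketch of what enabling the slow query log looks like at runtime (the referenced KB article remains the authoritative procedure), the standard MySQL system variables can be set from a privileged account.  The 1-second threshold below matches the figure mentioned above and is an assumption you may want to adjust:

```sql
-- Check whether the slow query log is already on and where it is written.
SHOW VARIABLES LIKE 'slow_query_log%';
SHOW VARIABLES LIKE 'long_query_time';

-- Enable the log and set the threshold to 1 second.
-- These runtime settings are lost when MySQL restarts, and the new
-- long_query_time value only applies to connections opened afterwards.
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;
```

To keep the settings across restarts, the same variables (slow_query_log and long_query_time) can be placed in the [mysqld] section of the MySQL configuration file.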

If this started happening right after migrating from Retain 2.x to Retain 3.x, one of your tables could be missing an index that should have been created during the migration.  See "Archive Job Performance Decreases Dramatically After Retain 2 to Retain 3 Migration".

As a troubleshooting step, SQL logging with log4jdbc can be enabled, specifically configuring it to use the sqltiming log.  In one customer's case, it showed a 3 to 4 second delay on a particular query.  This led to a query to determine database index fragmentation (see below).

Retain is pretty much database agnostic, so it does not attempt to clean up a database's indexes.  It is the DBA's responsibility to watch for fragmentation and other database performance issues.  The same concept can be applied to any other database, like MySQL.  Here is a sample query that was used to see the fragmentation with a customer's MS SQL 2008 database:

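-- Run this in the context of the Retain database so that
-- OBJECT_NAME() resolves the object IDs to table names.
SELECT
  OBJECT_NAME(object_id) AS table_name,
  index_id,
  avg_fragmentation_in_percent,
  fragment_count,
  avg_fragment_size_in_pages
FROM sys.dm_db_index_physical_stats(DB_ID('retain'), NULL, NULL, NULL, 'LIMITED')
ORDER BY avg_fragmentation_in_percent DESC;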

If a larger table has a significant percentage of fragmentation, rebuilding the indexes can resolve the issue, as it did for one larger customer.  See the Microsoft TechNet article, "Reorganize and Rebuild Indexes".
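
For illustration only, this is roughly what reorganizing or rebuilding the indexes of a single table looks like in MS SQL.  The database name retain comes from the query above; the dbo schema and the use of t_message as the affected table are assumptions, so substitute whatever table the fragmentation query points to.  Microsoft's commonly cited guideline is to reorganize at moderate fragmentation (roughly 5% to 30%) and rebuild above that:

```sql
USE retain;

-- Lighter-weight option for moderate fragmentation; runs online.
ALTER INDEX ALL ON dbo.t_message REORGANIZE;

-- Full rebuild for heavy fragmentation. By default this is an offline
-- operation that locks the table, so schedule it outside of job windows.
ALTER INDEX ALL ON dbo.t_message REBUILD;
```

Run one statement or the other, not both, and have the DBA confirm the maintenance window and backup state first.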

For pointers on MySQL maintenance for performance optimization, click here.
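
MySQL has no direct equivalent of the MS SQL fragmentation view, but information_schema gives a rough indicator.  The sketch below assumes the Retain schema is named retain (an assumption; adjust to your installation) and treats the data_free column only as a hint: for InnoDB without file-per-table tablespaces it reports free space in the shared tablespace rather than per table.

```sql
-- Show table sizes and unused allocated space, largest free space first.
SELECT
  table_name,
  engine,
  ROUND(data_length / 1024 / 1024)  AS data_mb,
  ROUND(index_length / 1024 / 1024) AS index_mb,
  ROUND(data_free / 1024 / 1024)    AS free_mb
FROM information_schema.tables
WHERE table_schema = 'retain'
ORDER BY data_free DESC;

-- Rebuild a heavily affected table and its indexes. For InnoDB, MySQL
-- performs this as a table recreate plus analyze.
OPTIMIZE TABLE t_message;
```

OPTIMIZE TABLE can take a long time on large tables and may block writes while it runs, so it is best done when no jobs are scheduled.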

If either your disk I/O or database is not very responsive, that can account for slow archiving.

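The "top" command in SLES can also be helpful in these circumstances. High CPU utilization on one CPU thread for MySQL can indicate that the CPU is waiting for the HDD to respond or finish the tasks that it has been given.  The "wa" (I/O wait) percentage in the CPU summary line of top shows how much time is being spent waiting on disk.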

See also

Slow Exchange Jobs because of long enterMailbox waits

Additional Information

This article was originally published in the GWAVA knowledgebase as article ID 2203.