Remove snapshot task fails to complete

  • 7010316
  • 16-Mar-2012
  • 26-Apr-2012

Environment

PlateSpin Forge
PlateSpin Protect
ESX 3.x or ESX 4.x Containers

Situation

In some cases where a workload is retaining recovery points and at least one recovery point snapshot is large, the remove snapshot task sent to VMware ESX can hang, causing the replication to enter a Recoverable Error state.

Resolution

If the remove snapshot task is still running when the ESX kernel completes the remove snapshot process, the task will eventually complete.
 
If the task fails (which may cause the replication in Forge or Protect to fail), give ESX 24 hours to complete the kernel process before attempting the replication again.
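As a rough illustration of the 24-hour wait, the following sketch computes the earliest time at which the replication could be attempted again. The function names and the idea of recording the failure time yourself are assumptions made for this example; they are not part of PlateSpin Forge, PlateSpin Protect, or any VMware API.

```python
from datetime import datetime, timedelta

# Wait period from this article: give ESX 24 hours after the task failure.
ESX_WAIT = timedelta(hours=24)


def earliest_retry(task_failed_at: datetime) -> datetime:
    """Return the earliest time at which the replication should be retried.

    task_failed_at is a hypothetical value: the time you noted the
    remove snapshot task failing in the VMware Client.
    """
    return task_failed_at + ESX_WAIT


def can_retry(task_failed_at: datetime, now: datetime) -> bool:
    """Return True once the full wait period has elapsed since the failure."""
    return now >= earliest_retry(task_failed_at)


if __name__ == "__main__":
    failed = datetime(2012, 3, 16, 9, 0)
    print("Retry no earlier than:", earliest_retry(failed))
    print("OK to retry now?", can_retry(failed, datetime.now()))
```

The wait is measured from the task failure rather than from the start of the replication, because the kernel process keeps running after the tracking task has failed.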

Cause

This is a known issue with VMware ESX 3.x and 4.x.
 
To provide the recovery point functionality, PlateSpin Forge and PlateSpin Protect leverage the VM snapshot technology in ESX. During replication, those snapshots have to be removed and possibly re-created. To do this, an API call to ESX submits a task to remove the snapshot from the VM, and a corresponding task is created in the VMware Client.
 
This task progresses in the VMware Client up to 95%, at which point it is committed to the kernel. The kernel then compares the delta information in every snapshot and the base VM, and combines or destroys information as needed.
 
This kernel process can take a very long time, and the time increases with the number of snapshots (recovery points) and the amount of delta information in each. It can take so long that even the VMware Client task that tracks the progress times out and fails.
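The distinction described above, between the tracking task timing out and the kernel work itself, can be sketched as a small decision helper. This is only an illustration under stated assumptions: the function, its inputs, and the returned labels are hypothetical and do not correspond to any VMware or PlateSpin API. The 95% figure is the hand-off point described in this article.

```python
# Point at which, per this article, the task is committed to the ESX kernel.
KERNEL_HANDOFF_PERCENT = 95


def interpret_task(progress_percent: int, tracker_timed_out: bool) -> str:
    """Classify a remove snapshot task from what the client last reported.

    Both arguments are hypothetical inputs that an administrator would
    read from the VMware Client: the last progress value shown and
    whether the tracking task has timed out.
    """
    if progress_percent >= 100:
        return "complete"
    if tracker_timed_out and progress_percent >= KERNEL_HANDOFF_PERCENT:
        # The tracker gave up, but the kernel may still be consolidating.
        return "kernel-may-still-be-working"
    if tracker_timed_out:
        return "timed-out-before-handoff"
    return "in-progress"


if __name__ == "__main__":
    print(interpret_task(95, True))
    print(interpret_task(40, False))
```

A result of "kernel-may-still-be-working" corresponds to the case in the Resolution section where ESX should be given time to finish before the replication is attempted again.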