aauth service stops after upgrade to 6.3.4.1

  • 7025121
  • 27-May-2021
  • 27-May-2021

Environment

Advanced Authentication 6.3.4.1

Situation

After the upgrade, the aauth service starts for a few minutes and then stops.
In an SSH session, the command "docker ps" shows that the aaf-aucore container is not running.
Attempts to start the aaf-aucore container succeed at first, but the container stops again a minute or two after startup.
Navigating to the AAF server displays the message "Appliance is under maintenance / starting up".

Resolution

Note: This solution requires some downtime. During this process, users will not be able to authenticate to AAF.
Before starting this procedure, identify the period of lowest user activity, as recommended in the AAF documentation.
To do so, click the settings icon on the Dashboard of the Administrative Portal, set the Relative interval to Last 24 hours, and then use the Authentications widget to estimate when authentications are lowest.

Before any upgrade, it is recommended to create a snapshot of the image first. If a catastrophic failure occurs, the snapshot can be restored to return the server to the working state it was in before the upgrade attempt.

Steps for upgrade:
1. Stop aauth on all web servers. This can be done in one of two ways.
        a. From the appliance console, navigate to System Services, select the Advanced Authentication service, then click on Action/Stop.
        b. From an SSH session, issue the command "systemctl stop aauth".
2. In the Administrative Portal, disable any events whose integrations point to the GMS or DBS servers.
3. Upgrade the global master server (GMS) first.
4. If the environment contains site clusters, upgrade the GMS for each site. Upgrade these servers one at a time, not simultaneously.
5. After upgrading the GMS servers, upgrade each DBS in sequence, one at a time. After one server has been upgraded successfully, move on to the next.
      For environments with one or more site clusters, upgrade each DBS in each cluster one at a time until all database servers have been upgraded. Do not upgrade them simultaneously.
6. If any events were disabled before the upgrade, re-enable them now so that user authentications can proceed.
7. After all GMS and DBS servers in every site cluster have been upgraded, upgrade the web servers. These upgrades can be performed simultaneously to minimize downtime.
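When there are several web servers, step 1 can be scripted. The following is a minimal sketch, assuming root SSH access to each appliance; the script name and host names in the usage example are hypothetical placeholders. It only stops aauth and confirms that it is stopped; the upgrades themselves are still performed in the order described above.

```shell
#!/bin/sh
# Sketch: stop aauth on every web server over SSH (step 1 above) and
# confirm that it is stopped. ASSUMPTION: root SSH access to each appliance.
# Host names are passed as arguments; none are hard-coded here.

stop_aauth_all() {
    failed=0
    for host in "$@"; do
        echo "Stopping aauth on $host"
        ssh "root@$host" 'systemctl stop aauth' || failed=1
        # 'systemctl is-active' prints "inactive" once the unit has stopped;
        # '|| true' keeps its non-zero exit code from aborting under 'set -e'.
        state=$(ssh "root@$host" 'systemctl is-active aauth' || true)
        echo "$host: aauth is $state"
        [ "$state" = "inactive" ] || failed=1
    done
    return "$failed"
}

if [ "$#" -eq 0 ]; then
    echo "Usage: $0 web-server-host..." >&2
else
    stop_aauth_all "$@"
fi
```

Example (hypothetical names): "sh stop_aauth.sh web1.example.com web2.example.com". A non-zero exit status means at least one server did not report aauth as inactive and should be checked before the upgrade begins.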

Cause

This issue is caused by locked records in the database used by AAF.
If a user attempts to authenticate while a GMS or DBS is being upgraded, a record lock is set in the database. When the upgrade then tries to modify the database, it encounters a deadlock and fails. When the server restarts, the database is not in the expected format, so loading is aborted and the aauth service shuts down.

Status

Top Issue

Additional Information

The steps below can be used to validate that this issue is being encountered.

Check whether the aucore container is in the "exited" state:
 # docker ps -a | grep aucore
 
If it has exited, try to start it:
 # docker start aaf_aucore_1
 
Wait for it to exit again (expect this within about two minutes); check the state by running "docker ps -a | grep aucore" again.

Save the log of the audb (database) container to a file:

 # docker logs aaf_audb_1 &> /tmp/audb.log
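The check, start, wait and collect steps above can be wrapped in a small shell function. This is a minimal sketch, assuming it runs as root on the affected GMS or DBS appliance. It uses "docker inspect --format '{{.State.Status}}'" in place of grepping "docker ps -a", and the 180-second polling limit is an assumption based on the roughly two-minute exit time noted above. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: automate the validation steps above on one GMS or DBS appliance.
# Container names are the ones used in this article; the 180-second wait
# is an ASSUMPTION based on the "about two minutes" noted above.
AUCORE=aaf_aucore_1
AUDB=aaf_audb_1
LOG=/tmp/audb.log

# Print a container's state, e.g. "running" or "exited".
container_state() {
    docker inspect --format '{{.State.Status}}' "$1"
}

collect_audb_log() {
    if [ "$(container_state "$AUCORE")" != "exited" ]; then
        echo "$AUCORE is not in the exited state; this issue may not apply."
        return 1
    fi
    docker start "$AUCORE" > /dev/null
    # Poll every 10 seconds, for up to 180 seconds, until aucore exits again.
    waited=0
    while [ "$waited" -lt 180 ] && [ "$(container_state "$AUCORE")" != "exited" ]; do
        sleep 10
        waited=$((waited + 10))
    done
    # POSIX equivalent of '&>': send both stdout and stderr to the file.
    docker logs "$AUDB" > "$LOG" 2>&1
    echo "Saved $AUDB log to $LOG"
}
```

Usage (hypothetical file name): ". ./collect_audb.sh && collect_audb_log".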


If this is the same issue, /tmp/audb.log will contain one or more entries like the following:

2021-05-12 13:42:51.505 UTC [124] ERROR:  deadlock detected

2021-05-12 13:42:51.505 UTC [124] DETAIL:  Process 124 waits for AccessExclusiveLock on relation 20554 of database 16385; blocked by process 154.
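To check the saved log for this signature without reading it manually, a function like the one below can be used. It is a minimal sketch: the name check_deadlock is hypothetical, and it relies on "grep -A", which GNU grep supports but POSIX does not require.

```shell
#!/bin/sh
# Sketch: report whether a saved audb log contains the deadlock error shown
# above. Defaults to /tmp/audb.log, the file written by the docker logs step.
check_deadlock() {
    log="${1:-/tmp/audb.log}"
    if [ ! -r "$log" ]; then
        echo "Cannot read $log" >&2
        return 2
    fi
    # grep -c prints the number of matching lines (and exits 1 when it is 0).
    count=$(grep -c 'ERROR:  *deadlock detected' "$log" || true)
    if [ "$count" -gt 0 ]; then
        echo "Found $count deadlock error(s) in $log:"
        # Print each match plus the two lines after it, where the DETAIL
        # line naming the blocking process normally follows.
        grep -A 2 'ERROR:  *deadlock detected' "$log"
        return 0
    fi
    echo "No deadlock errors in $log; this may be a different issue."
    return 1
}
```

Usage: "check_deadlock /tmp/audb.log". A return status of 0 indicates that the log matches the symptom described in this article.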