Introduction
This document describes the usage of the hidden CLI command repairqueue and the actions that occur when the command is issued from the CLI of a Cisco Email Security Appliance (ESA).
Prerequisites
Requirements
Cisco recommends that you have knowledge of these topics:
- System capacity, system monitoring, system health, and overall processing of messages through the ESA workqueue.
- Overall ESA administration.
Note: Please consult the ESA User Guide or the Online Help from the ESA GUI for further details.
Components Used
The information in this document is based on these software and hardware versions:
- ESA, all hardware and virtual appliances running AsyncOS 11.0.0-264 or newer
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, make sure that you understand the potential impact of any command.
Problem
Reasons to run the repairqueue command:
- An error states that the workqueue is not mounted. This is usually the result of queue corruption after an improper power cycle or reboot of the appliance.
- A known defect requires this as a workaround (such as CSCuw22284 - Email queue corrupts after hermes crash or improper shutdown).
- Application faults, such as those referencing "gcq.py", or the queue management subsystem.
- Status Detail or workqueue > rate are reporting negative numbers.
- Status or Status Detail reports an "Oldest Message" older than your bounce profile allows. The default maximum is 3 days. To verify the value, run bounceconfig > edit and choose the Default profile, then look for the "Please enter the maximum number of seconds a message may stay in the queue before being hard bounced" line, which by default is 259200 seconds, or 3 days. This check excludes the virtual delivery domains, the.<destination>.queue, such as the.cpq.queue, the.euq.queue, and the.cpq.release.host.
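The arithmetic behind the "Oldest Message" check can be sketched in Python. This is a minimal illustration and not an ESA tool: the function name and the sample ages are hypothetical, and the only value taken from this document is the default limit of 259200 seconds.

```python
from datetime import timedelta

# Default hard-bounce limit from the Default bounce profile (bounceconfig).
DEFAULT_MAX_QUEUE_SECONDS = 259200


def exceeds_bounce_limit(oldest_message_age: timedelta,
                         max_queue_seconds: int = DEFAULT_MAX_QUEUE_SECONDS) -> bool:
    """Return True when the oldest message has outlived the bounce profile limit.

    Hypothetical helper: the age must be read manually from the "Oldest Message"
    field of status detail; virtual delivery domains should be ignored.
    """
    return oldest_message_age.total_seconds() > max_queue_seconds


# 259200 seconds is exactly 3 days.
print(timedelta(seconds=DEFAULT_MAX_QUEUE_SECONDS).days)
# A 5-day-old message exceeds the default limit; a 12-hour-old message does not.
print(exceeds_bounce_limit(timedelta(days=5)))
print(exceeds_bounce_limit(timedelta(hours=12)))
```

If your bounce profile uses a non-default value, pass it as max_queue_seconds instead of relying on the default.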
Reasons to NOT run the repairqueue command:
- Slow workqueue processing is not a valid reason to run a queue repair. Administrators often mistake slow workqueue processing for queue corruption. A slow workqueue is usually caused by repeated processing of the same message(s) when a service over-utilizes system resources. Running repairqueue does not resolve these repeated-processing scenarios. Instead, troubleshoot the service(s) on which a message is "hung" during processing.
Usage of the command repairqueue
Running the CLI command repairqueue may not repair all workqueue issues or corruptions. This utility makes a best effort to repair the workqueue.
Warning: ESA administrators should note that active messages may be lost from the workqueue.
When you run repairqueue, the command prompts once for permission before it proceeds with the repair:
myesa.local> repairqueue
Do you want to repair or clean the queue?
1. Repair.
2. Clean.
[1]> 1
The mail flow will be stopped through out the repair/cleanup process
WARNING:
This utility does a best effort to repair the queue.
Not all queues corruptions can be repaired.
Are you sure you want to proceed? [N]> y
Checking generation checksum files
...
<<<SNIP FOR BREVITY>>>
...
done
Repair succeeded
Starting Hermes
Hermes Started
Log into the system and verify the status of the system.
Note: On a virtual ESA, ignore the following output, which is caused by a known defect (CSCuz28415): "Waiting for the queue to mount: Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory"
Once the repair process completes, the workqueue is repaired, but the appliance still retains an old checkpoint of the previous workqueue. In order to resume writing a new checkpoint for workqueue processing, run repairqueue again and choose the Clean option:
myesa.local> repairqueue
Do you want to repair or clean the queue?
1. Repair.
2. Clean.
[1]> 2
The mail flow will be stopped through out the repair/cleanup process
WARNING:
There is a backup found this may be the only backup.
This will to remove the old queue.
Are you sure you want to proceed? [N]> y
Double confirmation. Are you sure you want to proceed? [N]> y
Removing old queue
Cleanup finished
Verify
Once repairqueue completes, perform each of the following steps in order to validate that the workqueue is back online and the appliance is processing mail:
- Verify system status by running the status detail command from the CLI, or from Monitor > System Status in the GUI. The appliance should report a system status of Online.
- Review the mail logs on the appliance to confirm that mail is processing as expected. You can do this from the CLI by running the tail mail_logs command.
- Run the workqueue command from the CLI and choose the Rate option with the default rate of 10 seconds. As long as the appliance is processing mail in and/or mail out, the "In/Out" rates for each 10-second interval should be fairly equal. Appliances with a large pending workqueue may take some time to empty it and resume normal processing.
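The "fairly equal" In/Out comparison from the last step can be sketched in Python. This is an illustration under stated assumptions: the function name, the sample values, and the 25% tolerance are hypothetical, and the samples are assumed to be copied by hand from the In and Out columns of the workqueue rate output.

```python
def in_out_balanced(samples, tolerance=0.25):
    """Check whether messages out roughly keep pace with messages in.

    samples: list of (msgs_in, msgs_out) pairs, one per 10-second interval.
    tolerance: assumed allowed relative difference (not a Cisco-defined value).
    """
    if any(i < 0 or o < 0 for i, o in samples):
        # Negative rates are listed in this document as a sign of corruption.
        raise ValueError("negative rate reported; the queue may be corrupt")
    total_in = sum(i for i, _ in samples)
    total_out = sum(o for _, o in samples)
    if total_in == 0 and total_out == 0:
        return True  # idle appliance, nothing to compare
    return abs(total_in - total_out) <= tolerance * max(total_in, total_out)


# Hypothetical readings: three intervals where In and Out track each other.
print(in_out_balanced([(10, 9), (12, 12), (8, 10)]))
# Hypothetical readings: mail arrives but almost nothing leaves the workqueue.
print(in_out_balanced([(50, 2), (60, 1)]))
```

Totals are compared across several intervals rather than per interval because a single 10-second sample can be skewed by one large message or a short burst.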
FAQ
What if my ESA is not running 11.0.0-264 or newer?
Customers whose appliances run older versions of AsyncOS that do not have the hidden repairqueue CLI command should open a support case so that a Cisco Support engineer can assist. A support tunnel must be opened and available so that Cisco Support can access the appliance and run the queue repair process.
Does workqueue "corruption" mean mail loss?
In most cases, corruption does not equal mail loss. The queue is corrupt because it holds meta-data for messages that are no longer on the appliance. This is a bookkeeping mismatch between the queue and services such as reporting and message tracking. Running repairqueue rebuilds the ESA meta-data and cleans up any misreporting between the services and processing.
Are there any repercussions to workqueue corruption?
The ESA may be able to run for a long time on a corrupted queue and most messages may process fine, but the appliance may appear sluggish, or certain messages may never clear out, as indicated by an "Oldest Message" in the status command that is significantly older than the bounceconfig should allow. When AsyncOS is actually restarted with a corrupted queue, the queue may or may not be able to mount. The corruption may have occurred some time ago and gone unnoticed until the appliance is restarted, at which point it is unable to mount the queue.
What causes queue corruption?
The two most common causes of 'queue corruption' are:
- Unexpected reboots of the appliance. Power interruptions or holding down the power button result in an improper shutdown and may corrupt the queue, depending on what the backend processes were doing at the time. The appliance may recover with a corrupted queue, or the queue may not be mountable upon reboot. In the latter case, ESA administrators see "queue not mounted" alerts and/or "daemon not responding" when running status from the CLI.
myesa.local> status
Enter "status detail" for more information.
Couldn't obtain mail stats - my.esa: The daemon is not responding.
myesa.local> status
Enter "status detail" for more information.
Couldn't obtain mail stats - the queue is not mounted
- Out-of-bounds RAM usage by the appliance. This is most likely caused by a misconfiguration of the listener and/or mail flow policies, usually too many allowed inbound connections/injections. Cisco recommends that you review your listenerconfig for max inbound connections. Cisco recommends this be set at 300.
How long should the repair script take to complete?
Repairing the workqueue can take anywhere from 10 seconds to several hours, depending on the state of the ESA and how many messages are currently processing through an active workqueue. A workqueue repair on a lower-end appliance with full queues at the time of corruption could take several hours.
What happens if the repairqueue cannot run or does not complete?
In certain situations (e.g., an over-full queue on an appliance), repairqueue will not be able to complete. If repairqueue does not complete after 4 hours, the queue is most likely unrepairable and the only recourse is to build a new queue by running the hidden CLI command resetqueue. For advanced issues, please contact Cisco Support to open a support case and have a Cisco Support engineer assist.
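The 4-hour rule of thumb above can be sketched as a small decision helper in Python. This is only an illustration: the function name and return strings are hypothetical, the elapsed time is assumed to be tracked by the administrator, and the only value taken from this document is the 4-hour threshold.

```python
from datetime import timedelta

# Threshold stated in this document; past it the queue is most likely unrepairable.
REPAIR_GIVE_UP_AFTER = timedelta(hours=4)


def next_step(elapsed: timedelta, completed: bool) -> str:
    """Suggest what to do next for a repairqueue run (hypothetical helper)."""
    if completed:
        return "verify"
    if elapsed > REPAIR_GIVE_UP_AFTER:
        return "consider resetqueue or contact Cisco Support"
    return "keep waiting"


print(next_step(timedelta(minutes=30), completed=False))
print(next_step(timedelta(hours=5), completed=False))
print(next_step(timedelta(hours=1), completed=True))
```

Because resetqueue builds a new queue, treat the second outcome as a prompt for a deliberate decision and not as something to automate.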
Related Information