Introduction
This document describes the steps required to stop-start a faulty Compute server in an Ultra-M setup that hosts Cisco Policy Suite (CPS) Virtual Network Functions (VNFs).
Note: The procedures in this document are based on the Ultra M 5.1.x release. This document is intended for Cisco personnel who are familiar with the Cisco Ultra-M platform, and it details the steps that must be carried out at the OpenStack and CPS VNF levels when a Compute server is stop-started.
Prerequisites
Backup
Before you stop-start a Compute node, check the current state of your Red Hat OpenStack Platform environment in order to avoid complications.
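For example, from the OSPD you can review the overall state with checks such as these (a minimal sketch; the exact set of checks depends on your deployment):
[stack@director ~]$ source stackrc
[stack@director ~]$ nova list
[stack@director ~]$ openstack stack list
All overcloud nodes are expected to report ACTIVE/Running, and the overcloud stack is expected to be in a *_COMPLETE state.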
To aid recovery, Cisco recommends that you take a backup of the OSPD database with the use of these steps:
[root@director ~]# mysqldump --opt --all-databases > /root/undercloud-all-databases.sql
[root@director ~]# tar --xattrs -czf undercloud-backup-`date +%F`.tar.gz /root/undercloud-all-databases.sql
/etc/my.cnf.d/server.cnf /var/lib/glance/images /srv/node /home/stack
tar: Removing leading `/' from member names
This process ensures that a node can be replaced without affecting the availability of any instances. It is also recommended that you back up the CPS configuration.
Use this command in order to back up the CPS VMs from the Cluster Manager Virtual Machine (VM):
[root@CM ~]# config_br.py -a export --all /mnt/backup/CPS_backup_28092016.tar.gz
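As a quick sanity check, confirm that the archive was created and lists content (the file name reuses the backup name from the command above):
[root@CM ~]# ls -lh /mnt/backup/CPS_backup_28092016.tar.gz
[root@CM ~]# tar -tzf /mnt/backup/CPS_backup_28092016.tar.gz | head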
Identify the VMs Hosted in the Compute Node
Identify the VMs that are hosted on the Compute server.
[stack@director ~]$ nova list --field name,host,networks | grep compute-10
| 49ac5f22-469e-4b84-badc-031083db0533 | VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d | pod1-compute-10.localdomain | Replication=10.160.137.161; Internal=192.168.1.131; Management=10.225.247.229; tb1-orch=172.16.180.129
Note: In the output shown here, the first column corresponds to the Universally Unique IDentifier (UUID), the second column is the VM name and the third column is the hostname where the VM is present. The parameters from this output will be used in subsequent sections.
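When several VMs are hosted on the node, it can be convenient to save their names for the ESC steps that follow; this is a minimal sketch and the output file name is arbitrary:
[stack@director ~]$ nova list --field name,host | grep compute-10 | awk -F'|' '{gsub(/ /,"",$3); print $3}' > /tmp/compute-10-vms.txt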
Disable the PCRF Services Residing on the VM to be Shut Down
1. Log in to the management IP of the VM.
[stack@XX-ospd ~]$ ssh root@<Management IP>
[root@XXXSM03 ~]# monit stop all
2. If the VM is an SM, OAM, or Arbiter, in addition stop the sessionmgr services.
[root@XXXSM03 ~]# cd /etc/init.d
[root@XXXSM03 init.d]# ls -l sessionmgr*
-rwxr-xr-x 1 root root 4544 Nov 29 23:47 sessionmgr-27717
-rwxr-xr-x 1 root root 4399 Nov 28 22:45 sessionmgr-27721
-rwxr-xr-x 1 root root 4544 Nov 29 23:47 sessionmgr-27727
3. For every file titled sessionmgr-xxxxx, run service sessionmgr-xxxxx stop. A loop that stops all of the instances is sketched after this example.
[root@XXXSM03 init.d]# service sessionmgr-27717 stop
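To avoid running the command by hand for each instance, this is a minimal loop sketch (it assumes the standard sessionmgr-* naming shown in the listing above and that you are still in /etc/init.d):
[root@XXXSM03 init.d]# for svc in sessionmgr-*; do service $svc stop; done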
Graceful Power Off
Shut Down the VM from ESC
1. Log in to the ESC node that corresponds to the VNF and check the status of the VM.
[admin@VNF2-esc-esc-0 ~]$ cd /opt/cisco/esc/esc-confd/esc-cli
[admin@VNF2-esc-esc-0 esc-cli]$ ./esc_nc_cli get esc_datamodel | egrep --color "<state>|<vm_name>|<vm_id>|<deployment_name>"
<snip>
<state>SERVICE_ACTIVE_STATE</state>
<vm_name>VNF2-DEPLOYM_c1_0_df4be88d-b4bf-4456-945a-3812653ee229</vm_name>
<state>VM_ALIVE_STATE</state>
<vm_name> VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d</vm_name>
<state>VM_ALIVE_STATE</state>
<snip>
2. Stop the VM with the use of its VM name (noted in the section "Identify the VMs Hosted in the Compute Node").
[admin@VNF2-esc-esc-0 esc-cli]$ ./esc_nc_cli vm-action STOP VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d
3. Once it is stopped, the VM must enter the SHUTOFF state.
[admin@VNF2-esc-esc-0 ~]$ cd /opt/cisco/esc/esc-confd/esc-cli
[admin@VNF2-esc-esc-0 esc-cli]$ ./esc_nc_cli get esc_datamodel | egrep --color "<state>|<vm_name>|<vm_id>|<deployment_name>"
<snip>
<state>SERVICE_ACTIVE_STATE</state>
<vm_name>VNF2-DEPLOYM_c1_0_df4be88d-b4bf-4456-945a-3812653ee229</vm_name>
<state>VM_ALIVE_STATE</state>
<vm_name>VNF2-DEPLOYM_c3_0_3e0db133-c13b-4e3d-ac14-
<state>VM_ALIVE_STATE</state>
<vm_name>VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d</vm_name>
<state>VM_SHUTOFF_STATE</state>
<snip>
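Rather than re-running the query manually, you can poll until the state line that follows the VM name shows SHUTOFF; this is a minimal sketch with an arbitrary 10-second interval:
[admin@VNF2-esc-esc-0 esc-cli]$ until ./esc_nc_cli get esc_datamodel | grep -A1 "VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d" | grep -q VM_SHUTOFF_STATE; do sleep 10; done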
Compute Node Stop-Start
The steps in this section are common irrespective of the VMs hosted on the Compute node.
Stop-Start Compute Node from the OSPD
1. Check the status of the Compute node and then stop it.
[stack@director ~]$ nova list | grep compute-10
| 03f15071-21aa-4bcf-8fdd-acdbde305168 | pod1-stack-compute-10 | ACTIVE | - | Running | ctlplane=192.200.0.106 |
[stack@director ~]$ nova stop pod1-stack-compute-10
2. Wait for the Compute node to reach the SHUTOFF state and then start it again. A scripted wait is sketched after the start command.
[stack@director ~]$ nova start pod1-stack-compute-10
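This minimal sketch (the 10-second interval is an arbitrary choice) polls nova until the node reports SHUTOFF, after which the nova start shown above can be issued:
[stack@director ~]$ until nova list | grep compute-10 | grep -q SHUTOFF; do sleep 10; done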
3. Check that the Compute node is back in the ACTIVE state.
[stack@director ~]$ source stackrc
[stack@director ~]$ nova list |grep compute-10
| 03f15071-21aa-4bcf-8fdd-acdbde305168 | pod1-stack-compute-10 | ACTIVE | - | Running | ctlplane=192.200.0.106 |
[stack@director ~]$ source pod1-stackrc-Core
[stack@director ~]$ openstack hypervisor list |grep compute-10
| 6 | pod1-compute-10.localdomain |
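Optionally, with the overcloud credentials still sourced, you can also confirm that the nova-compute service on the node reports as up:
[stack@director ~]$ nova service-list | grep compute-10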
Restore the VMs
VM Recovery from ESC
1. Ideally, when you check the nova list from the OSPD, the VMs are in the SHUTOFF state. In this case, start the VMs from ESC.
[admin@VNF2-esc-esc-0 ~]$ sudo /opt/cisco/esc/esc-confd/esc-cli/esc_nc_cli vm-action START VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d
[sudo] password for admin:
2. Alternatively, if the VM is in the ERROR state in the nova list, verify it as shown here.
[stack@director ~]$ nova list |grep VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d
| 49ac5f22-469e-4b84-badc-031083db0533 | VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d | ERROR | - | NOSTATE |
3. In that case, recover the VM from ESC.
[admin@VNF2-esc-esc-0 ~]$ sudo /opt/cisco/esc/esc-confd/esc-cli/esc_nc_cli recovery-vm-action DO VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d
[sudo] password for admin:
Recovery VM Action
/opt/cisco/esc/confd/bin/netconf-console --port=830 --host=127.0.0.1 --user=admin --privKeyFile=/root/.ssh/confd_id_dsa --privKeyType=dsa --rpc=/tmp/esc_nc_cli.ZpRCGiieuW
<?xml version="1.0" encoding="UTF-8"?>
<rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="1">
<ok/>
</rpc-reply>
4. Monitor the yangesc.log.
[admin@VNF2-esc-esc-0 ~]$ tail -f /var/log/esc/yangesc.log
…
14:59:50,112 07-Nov-2017 WARN Type: VM_RECOVERY_COMPLETE
14:59:50,112 07-Nov-2017 WARN Status: SUCCESS
14:59:50,112 07-Nov-2017 WARN Status Code: 200
14:59:50,112 07-Nov-2017 WARN Status Msg: Recovery: Successfully recovered VM [VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d].
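Once the recovery reports SUCCESS, you can confirm from the OSPD (with the overcloud credentials sourced) that the VM is back in the ACTIVE/Running state:
[stack@director ~]$ nova list | grep VNF2-DEPLOYM_s9_0_8bc6cc60-15d6-4ead-8b6a-10e75d0e134d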
Check the PCRF Services Residing on the VM
Note: If the VM is in the SHUTOFF state, power it on with the use of esc_nc_cli from ESC. Run diagnostics.sh from the Cluster Manager VM and, if any errors are found for the recovered VMs, perform these steps.
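A minimal example of that check from the Cluster Manager (this assumes diagnostics.sh is in the path, as in a standard CPS installation):
[root@CM ~]# diagnostics.sh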
1. Log in to the respective VM.
[stack@XX-ospd ~]$ ssh root@<Management IP>
[root@XXXSM03 ~]# monit start all
2. If the VM is an SM, OAM, or Arbiter, in addition start the sessionmgr services that were stopped earlier. For every file titled sessionmgr-xxxxx, run service sessionmgr-xxxxx start. A loop that starts all of the instances is sketched after this example.
[root@XXXSM03 init.d]# service sessionmgr-27717 start
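As with the stop procedure, this is a minimal loop sketch (it assumes the standard sessionmgr-* naming in /etc/init.d) that starts all of the instances:
[root@XXXSM03 init.d]# for svc in sessionmgr-*; do service $svc start; done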
3. If the diagnostics are still not clear, run build_all.sh from the Cluster Manager VM and then perform VM-init on the respective VM.
[root@CM ~]# /var/qps/install/current/scripts/build_all.sh
[root@CM ~]# ssh <VM>          (for example, ssh pcrfclient01)
[root@<VM> ~]# /etc/init.d/vm-init