Motherboard Replacement in Ultra-M UCS 240M4 Server - CPAR

Available Languages

Download Options

PDF (3.5 MB)
View with Adobe Reader on a variety of devices
ePub (3.3 MB)
View in various apps on iPhone, iPad, Android, Sony Reader, or Windows Phone
Mobi (Kindle) (2.7 MB)
View on Kindle device or Kindle app on multiple devices

Updated:September 10, 2018

Document ID:213669

Bias-Free Language

The documentation set for this product strives to use bias-free language. For the purposes of this documentation set, bias-free is defined as language that does not imply discrimination based on age, disability, gender, racial identity, ethnic identity, sexual orientation, socioeconomic status, and intersectionality. Exceptions may be present in the documentation due to language that is hardcoded in the user interfaces of the product software, language used based on RFP documentation, or language that is used by a referenced third-party product. Learn more about how Cisco is using Inclusive Language.

Introduction

Background Information

Abbreviations

Workflow of the MoP

Motherboard Replacement in Ultra-M Setup

Prerequisites

Motherboard Replacement in Compute Node

Identify the VMs Hosted in the Compute Node

Backup: Snapshot Process

Step 1. CPAR Application Shutdown.

Recover an Instance through Snapshot

Recovery Process

Create and Assign a Floating IP Address

Enabling SSH

Establish an SSH Session

CPAR Instance Start

Post-activity Health Check

Motherboard Replacement in OSD Compute Node

Identify the VMs Hosted in the Osd-Compute Node

Backup: Snapshot Process

CPAR Application Shutdown

VM Snapshot task

VM Snapshot

Put CEPH in Maintenance Mode

Graceful Power Off

Replace Motherboard

Move CEPH out of Maintenance Mode

Restore the VMs

Recover an Instance through Snapshot

Create and Assign a Floating IP Address

Enabling SSH

Establish an SSH session

CPAR Instance Start

Post-activity Health Check

Motherboard Replacement in Controller Node

Verify Controller Status and put Cluster in Maintenance Mode

Replace Motherboard

Restore Cluster Status

Introduction

This document describes the steps required to replace faulty Motherboard of a server in an Ultra-M setup.

This procedure applies for an Openstack environment using NEWTON version where ESC is not managing CPAR and CPAR is installed directly on the VM deployed on Openstack.

Background Information

Ultra-M is a pre-packaged and validated virtualized mobile packet core solution that is designed in order to simplify the deployment of VNFs. OpenStack is the Virtualized Infrastructure Manager (VIM) for Ultra-M and consists of these node types:

Compute
Object Storage Disk - Compute (OSD - Compute)
Controller
OpenStack Platform - Director (OSPD)

The high-level architecture of Ultra-M and the components involved are depicted in this image:

This document is intended for Cisco personnel who are familiar with Cisco Ultra-M platform and it details the steps that are required to be carried out at OpenStack and Redhat OS.

Note: Ultra M 5.1.x release is considered in order to define the procedures in this document.

Abbreviations

MOP	Method of Procedure
OSD	Object Storage Disks
OSPD	OpenStack Platform Director
HDD	Hard Disk Drive
SSD	Solid State Drive
VIM	Virtual Infrastructure Manager
VM	Virtual Machine
EM	Element Manager
UAS	Ultra Automation Services
UUID	Universally Unique IDentifier

Workflow of the MoP

Motherboard Replacement in Ultra-M Setup

In an Ultra-M setup, there can be scenarios where a motherboard replacement is required in the following server types: Compute, OSD-Compute and Controller.

Note: The boot disks with the openstack installation are replaced after the replacement of the motherboard. Hence there is no requirement to add the node back to overcloud. Once the server is powered ON after the replacement activity, it would enrol itself back to the overcloud stack.

Prerequisites

Before you replace a Compute node, it is important to check the current state of your Red Hat OpenStack Platform environment. It is recommended you check the current state in order to avoid complications when the Compute replacement process is on. It can be achieved by this flow of replacement.

In case of recovery, Cisco recommends to take a backup of the OSPD database with the use of these steps:

[root@director ~]# mysqldump --opt --all-databases > /root/undercloud-all-databases.sql
[root@director ~]# tar --xattrs -czf undercloud-backup-`date +%F`.tar.gz /root/undercloud-all-databases.sql 
/etc/my.cnf.d/server.cnf /var/lib/glance/images /srv/node /home/stack
tar: Removing leading `/' from member names

This process ensures that a node can be replaced without affecting the availability of any instances.

Note: Make sure you have the snapshot of the instance so that you can restore the VM when needed. Follow this procedure on how to take snapshot of the VM.

Motherboard Replacement in Compute Node

Before the activity, the VMs hosted in the Compute node are gracefully shutoff. Once the Motherboard has been replaced, the VMs are restored back.

Identify the VMs Hosted in the Compute Node

[stack@al03-pod2-ospd ~]$ nova list --field name,host

+--------------------------------------+---------------------------+----------------------------------+

| ID                                   | Name                      | Host                             |

+--------------------------------------+---------------------------+----------------------------------+

| 46b4b9eb-a1a6-425d-b886-a0ba760e6114 | AAA-CPAR-testing-instance | pod2-stack-compute-4.localdomain |

| 3bc14173-876b-4d56-88e7-b890d67a4122 | aaa2-21                   | pod2-stack-compute-3.localdomain |

| f404f6ad-34c8-4a5f-a757-14c8ed7fa30e | aaa21june                 | pod2-stack-compute-3.localdomain |

+--------------------------------------+---------------------------+----------------------------------+

Note: In the output shown here, the first column corresponds to the Universally Unique IDentifier (UUID), the second column is the VM name and the third column is the hostname where the VM is present. The parameters from this output are used in subsequent sections.

Backup: Snapshot Process

Step 1. CPAR Application Shutdown.

Step 1.Open any ssh client connected to the network and connect to the CPAR instance.

It is important not to shutdown all 4 AAA instances within one site at the same time, do it in a one by one fashion.

Step 2.Shut Down CPAR application with this command:

/opt/CSCOar/bin/arserver stop

A Message stating “Cisco Prime Access Registrar Server Agent shutdown complete.” Should show up

If a user left a CLI session open, the arserver stop command won’t work and this message is displayed:

ERROR:    You can not shut down Cisco Prime Access Registrar while the

          CLI is being used.   Current list of running

          CLI with process id is:

 2903 /opt/CSCOar/bin/aregcmd –s

In this example, the highlighted process id 2903 needs to be terminated before CPAR can be stopped. If this is the case please terminate this process with this command:

kill -9 *process_id*

Then repeat the step 1.

Step 3.Verify that CPAR application was indeed shutdown by issuing the command:

/opt/CSCOar/bin/arstatus

This messages should appear:

Cisco Prime Access Registrar Server Agent not running
Cisco Prime Access Registrar GUI not running

VM Snapshot Task

Step 1.Enter the Horizon GUI website that corresponds to the Site (City) currently being worked on.

When accessing Horizon, this screen is observed:

Step 2.Navigate to Project > Instances, as shown in the image.

If the user used was CPAR, then only the 4 AAA instances appear in this menu.

Step 3.Shut down only one instance at a time, please repeat the whole process in this document.

In order to shutdown the VM, navigate to Actions > Shut Off Instance and confirm your selection.

Step 4.Validate that the instance was indeed shut down by checking the Status = Shutoff and Power State = Shut Down.

This step ends the CPAR shutdown process.

VM Snapshot

Once the CPAR VMs are down, the snapshots can be taken in parallel, as they belong to independent computes.

The four QCOW2 files will be created in parallel.

Taking a snapshot of each AAA instance (25 minutes -1 hour) (25 minutes for instances that used a qcow image as a source and 1 hour for instances that user a raw image as a source)

Step 1. Login to POD’s Openstack’s HorizonGUI.

Step 2. Once logged in, proceed to the Project > Compute > Instances section on the top menu and look for the AAA instances.

Step 3. Click on the Create Snapshot button to proceed with snapshot creation (this needs to be executed on the corresponding AAA instance).

Step 4. Once the snapshot runs, navigate to the IMAGES menu and verify that all finish and report no problems.

Step 5. The next step is to download the snapshot on a QCOW2 format and transfer it to a remote entity in case the OSPD is lost during this process. In order to achieve this, identify the snapshot with this command glance image-list at OSPD level.

[root@elospd01 stack]# glance image-list

+--------------------------------------+---------------------------+

| ID                                   | Name                      |             +--------------------------------------+---------------------------+

| 80f083cb-66f9-4fcf-8b8a-7d8965e47b1d | AAA-Temporary             |             | 22f8536b-3f3c-4bcc-ae1a-8f2ab0d8b950 | ELP1 cluman 10_09_2017    |

| 70ef5911-208e-4cac-93e2-6fe9033db560 | ELP2 cluman 10_09_2017    |

| e0b57fc9-e5c3-4b51-8b94-56cbccdf5401 | ESC-image                 |

| 92dfe18c-df35-4aa9-8c52-9c663d3f839b | lgnaaa01-sept102017       |

| 1461226b-4362-428b-bc90-0a98cbf33500 | tmobile-pcrf-13.1.1.iso   |

| 98275e15-37cf-4681-9bcc-d6ba18947d7b | tmobile-pcrf-13.1.1.qcow2 |

+--------------------------------------+---------------------------+

Step 6. Once identified the snapshot to be downloaded (in this case is going to be the one marked above in green), download it on a QCOW2 format using the command glance image-download as shown here.

[root@elospd01 stack]# glance image-download 92dfe18c-df35-4aa9-8c52-9c663d3f839b --file /tmp/AAA-CPAR-LGNoct192017.qcow2 &

The “&” sends the process to background. It will take some time to complete this action, once it is done, the image can be located at /tmp directory.
On sending the process to background, if connectivity is lost, then the process is also stopped.
Execute the command “disown -h” so that in case of SSH connection is lost, the process still runs and finishes on the OSPD.

Step 7. Once the download process finishes, a compression process needs to be executed as that snapshot may be filled with ZEROES because of processes, tasks and temporary files handled by the Operating System. The command to be used for file compression is virt-sparsify.

[root@elospd01 stack]# virt-sparsify AAA-CPAR-LGNoct192017.qcow2 AAA-CPAR-LGNoct192017_compressed.qcow2

This process takes some time (around 10-15 minutes). Once finished, the resulting file is the one that needs to be transferred to an external entity as specified on next step.

Verification of the file integrity is required, in order to achieve this, execute the next command and look for the “corrupt” attribute at the end of its output.

[root@wsospd01 tmp]# qemu-img info AAA-CPAR-LGNoct192017_compressed.qcow2
image: AAA-CPAR-LGNoct192017_compressed.qcow2
file format: qcow2
virtual size: 150G (161061273600 bytes)
disk size: 18G
cluster_size: 65536
Format specific information:

    compat: 1.1

    lazy refcounts: false

    refcount bits: 16

    corrupt: false

In order to avoid a problem where the OSPD is lost, the recently created snapshot on QCOW2 format needs to be transferred to an external entity. Before to start the file transfer we have to check if the destination has enough available disk space, use the command “df –kh”in order to verify the memory space. Our advice is to transfer it to another site’s OSPD temporarily by using SFTP “sftproot@x.x.x.x” where x.x.x.x is the IP of a remote OSPD. In order to speed up the transfer, the destination can be sent to multiple OSPDs. In the same way, we can use the following command scp *name_of_the_file*.qcow2 root@ x.x.x.x:/tmp (where x.x.x.x is the IP of a remote OSPD) to transfer the file to another OSPD.

Graceful Power Off

Power off Node

To power off the instance : nova stop <INSTANCE_NAME>
Now you will see the instance name with the status shutoff.

[stack@director ~]$ nova stop  aaa2-21

Request to stop server aaa2-21 has been accepted.

[stack@director ~]$ nova list

+--------------------------------------+---------------------------+---------+------------+-------------+------------------------------------------------------------------------------------------------------------+

| ID                                   | Name                      | Status  | Task State | Power State | Networks                                                                                                   |

+--------------------------------------+---------------------------+---------+------------+-------------+------------------------------------------------------------------------------------------------------------+

| 46b4b9eb-a1a6-425d-b886-a0ba760e6114 | AAA-CPAR-testing-instance | ACTIVE  | -          | Running     | tb1-mgmt=172.16.181.14, 10.225.247.233; radius-routable1=10.160.132.245; diameter-routable1=10.160.132.231 |

| 3bc14173-876b-4d56-88e7-b890d67a4122 | aaa2-21                   | SHUTOFF | -          | Shutdown    | diameter-routable1=10.160.132.230; radius-routable1=10.160.132.248; tb1-mgmt=172.16.181.7, 10.225.247.234  |

| f404f6ad-34c8-4a5f-a757-14c8ed7fa30e | aaa21june                 | ACTIVE  | -          | Running     | diameter-routable1=10.160.132.233; radius-routable1=10.160.132.244; tb1-mgmt=172.16.181.10                 |

+--------------------------------------+---------------------------+---------+------------+-------------+------------------------------------------------------------------------------------------------------------+

Replace Motherboard

The steps in order to replace the motherboard in a UCS C240 M4 server can be referred from Cisco UCS C240 M4 Server Installation and Service Guide

Login to the server with the use of the CIMC IP.
Perform BIOS upgrade if the firmware is not as per the recommended version used previously. Steps for BIOS upgrade are given here: Cisco UCS C-Series Rack-Mount Server BIOS Upgrade Guide

Restore the VMs

Recover an Instance through Snapshot

Recovery Process

It is possible to redeploy the previous instance with the snapshot taken in previous steps.

Step 1 [OPTIONAL].If there is no previous VMsnapshot available then connect to the OSPD node where the backup was sent and sftp the backup back to its original OSPD node. Using “sftproot@x.x.x.x” where x.x.x.x is the IP of the original OSPD. Save the snapshot file in /tmp directory.

Step 2.Connect to the OSPD node where the instance is re-deploy.

Source the environment variables with this command:

# source /home/stack/pod1-stackrc-Core-CPAR

Step 3.To use the snapshot as an image is necessary to upload it to the horizon as such. Use the next command to do so.

#glance image-create -- AAA-CPAR-Date-snapshot.qcow2 --container-format bare --disk-format qcow2 --name AAA-CPAR-Date-snapshot

The process can be seen on the horizon.

Step 4.In Horizon, navigate to Project > Instances and click on Launch Instance.

Step 5.Fill in the instance name and choose the availability zone.

Step 6.In the Source tab, choose the image to create the instance. In the Select Boot Source menu select image, a list of images is shown here, choose the one that was previously uploaded as you click on + sign.

Step 7.In the Flavor tab, choose the AAA flavor as you click on + sign.

Step 8.Finally, navigate to the network tab and choose the networks that the instance needs as you click on + sign. For this case select diameter-soutable1, radius-routable1 and tb1-mgmt.

Step 9. Finally, click on Launch instance to create it. The progress can be monitored in Horizon:

After a few minutes, the instance is completely deployed and ready for use.

Create and Assign a Floating IP Address

A floating IP address is a routable address, which means that it’s reachable from the outside of Ultra M/Openstack architecture, and it’s able to communicate with other nodes from the network.

Step 1.In the Horizon top menu, navigate toAdmin > Floating IPs.

Step 2. Click on the buttonAllocateIP to Project.

Step 3. In theAllocate Floating IPwindow select thePoolfrom which the new floating IP belongs, theProjectwhere it is going to be assigned, and the newFloating IP Addressitself.

For example:

Step 4.Click onAllocateFloating IPbutton.

Step 5. In the Horizon top menu, navigate toProject > Instances.

Step 6.In theActioncolumn click on the arrow that points down in theCreate Snapshotbutton, a menu should be displayed. SelectAssociate Floating IPoption.

Step 7. Select the corresponding floating IP address intended to be used in theIP Addressfield, and choose the corresponding management interface (eth0) from the new instance where this floating IP is going to be assigned in thePort to be associated. Please refer to the next image as an example of this procedure.

Step 8.Finally, click onAssociatebutton.

Enabling SSH

Step 1.In the Horizon top menu, navigate toProject > Instances.

Step 2.Click on the name of the instance/VM that was created in sectionLunch a new instance.

Step 3. Click onConsoletab. This will display the command line interface of the VM.

Step 4.Once the CLI is displayed, enter the proper login credentials:

Username:root

Password:cisco123

Step 5.In the CLI enter the commandvi /etc/ssh/sshd_configto edit ssh configuration.

Step 6. Once the ssh configuration file is open, pressIto edit the file. Then look for the section showed below and change the first line fromPasswordAuthentication notoPasswordAuthentication yes.

Step 7.PressESCand enter:wq!to save sshd_config file changes.

Step 8. Execute the commandservice sshd restart.

Step 9.In order to test SSH configuration changes have been correctly applied, open any SSH client and try to establish a remote secure connectionusing the floating IPassigned to the instance (i.e. 10.145.0.249) and the userroot.

Establish an SSH Session

Open an SSH session using the IP address of the corresponding VM/server where the application is installed.

CPAR Instance Start

Please follow the below steps, once the activity has been completed and CPAR services can be re-established in the Site that was shut down.

In order to login back to Horizon, navigate to Project > Instance > Start Instance.
Verify that the status of the instance is active and the power state is running:

Post-activity Health Check

Step 1.Execute the command /opt/CSCOar/bin/arstatus at OS level.

[root@aaa04 ~]# /opt/CSCOar/bin/arstatus
Cisco Prime AR RADIUS server running       (pid: 24834)
Cisco Prime AR Server Agent running        (pid: 24821)
Cisco Prime AR MCD lock manager running    (pid: 24824)
Cisco Prime AR MCD server running          (pid: 24833)
Cisco Prime AR GUI running                 (pid: 24836)
SNMP Master Agent running                (pid: 24835)
[root@wscaaa04 ~]#

Step 2.Execute the command /opt/CSCOar/bin/aregcmd at OS level and enter the admin credentials. Verify that CPAR Health is 10 out of 10 and the exit CPAR CLI.

[root@aaa02 logs]# /opt/CSCOar/bin/aregcmd
Cisco Prime Access Registrar 7.3.0.1 Configuration Utility
Copyright (C) 1995-2017 by Cisco Systems, Inc.  All rights reserved.
Cluster:
User: admin
Passphrase:
Logging in to localhost
[ //localhost ]

    LicenseInfo = PAR-NG-TPS 7.2(100TPS:)

                  PAR-ADD-TPS 7.2(2000TPS:)

                  PAR-RDDR-TRX 7.2()

                  PAR-HSS 7.2()

    Radius/

    Administrators/
Server 'Radius' is Running, its health is 10 out of 10
--> exit

Step 3.Run the command netstat | grep diameter and verify that all DRA connections are established.

The output mentioned below is for an environment where Diameter links are expected. If fewer links are displayed, this represents a disconnection from the DRA that needs to be analyzed.

[root@aa02 logs]# netstat | grep diameter
tcp         0            0 aaa02.aaa.epc.:77 mp1.dra01.d:diameter ESTABLISHED
tcp         0            0 aaa02.aaa.epc.:36 tsa6.dra01:diameter ESTABLISHED
tcp         0            0 aaa02.aaa.epc.:47 mp2.dra01.d:diameter ESTABLISHED
tcp         0            0 aaa02.aaa.epc.:07 tsa5.dra01:diameter ESTABLISHED
tcp         0            0 aaa02.aaa.epc.:08 np2.dra01.d:diameter ESTABLISHED

Step 4.Check that the TPS log shows requests being processed by CPAR. The values highlighted represent the TPS and those are the ones we need to pay attention to.

The value of TPS should not exceed 1500.

[root@wscaaa04 ~]# tail -f /opt/CSCOar/logs/tps-11-21-2017.csv
11-21-2017,23:57:35,263,0
11-21-2017,23:57:50,237,0
11-21-2017,23:58:05,237,0
11-21-2017,23:58:20,257,0
11-21-2017,23:58:35,254,0
11-21-2017,23:58:50,248,0
11-21-2017,23:59:05,272,0
11-21-2017,23:59:20,243,0
11-21-2017,23:59:35,244,0
11-21-2017,23:59:50,233,0

Step 5.Look for any “error” or “alarm” messages in name_radius_1_log

[root@aaa02 logs]# grep -E "error|alarm" name_radius_1_log

Step 6.Verify the amount of memory that the CPAR process is using by issuing the following command:

top | grep radius

[root@sfraaa02 ~]# top | grep radius
27008 root      20   0 20.228g 2.413g  11408 S 128.3  7.7   1165:41 radius

This highlighted value should be lower than: 7Gb, which is the maximum allowed at an application level.

Motherboard Replacement in OSD Compute Node

Before the activity, the VMs hosted in the Compute node are gracefully shutoff and the CEPH is put into maintenance mode. Once the Motherboard has been replaced, the VMs are restored back and CEPH is moved out of maintenance mode.

Identify the VMs Hosted in the Osd-Compute Node

Identify the VMs that are hosted on the OSD compute server.

[stack@director ~]$ nova list --field name,host | grep osd-compute-0
| 46b4b9eb-a1a6-425d-b886-a0ba760e6114 | AAA-CPAR-testing-instance | pod2-stack-compute-4.localdomain |

Backup: Snapshot Process

CPAR Application Shutdown

Step 1.Open any ssh client connected to the network and connect to the CPAR instance.

It is important not to shutdown all 4 AAA instances within one site at the same time, do it in a one by one fashion.

Step 2.Shut Down CPAR application with this command:

/opt/CSCOar/bin/arserver stop

A Message stating “Cisco Prime Access Registrar Server Agent shutdown complete.” Should show up

Note: If a user left a CLI session open, the arserver stop command won’t work and the following message will be displayed:

ERROR:    You can not shut down Cisco Prime Access Registrar while the

          CLI is being used.   Current list of running

          CLI with process id is:

 2903 /opt/CSCOar/bin/aregcmd –s

In this example, the highlighted process id 2903 needs to be terminated before CPAR can be stopped. If this is the case please terminate this process with this command:

kill -9 *process_id*

Then repeat the step 1.

Step 3.Verify that CPAR application was indeed shutdown with this command:

/opt/CSCOar/bin/arstatus

These messages appear:

Cisco Prime Access Registrar Server Agent not running
Cisco Prime Access Registrar GUI not running

VM Snapshot task

Step 1.Enter the Horizon GUI website that corresponds to the Site (City) currently being worked on.

When accessing Horizon, the image shown is observed:

Step 2. Navigate to Project > Instances, as shown in the image.

If the user used was CPAR, then only the 4 AAA instances appear in this menu.

Step 3.Shut down only one instance at a time, please repeat the whole process in this document.

In order to shutdown the VM, navigate to Actions > Shut Off Instance and confirm your selection.

Step 4.Validate that the instance was indeed shut down by checking the Status = Shutoff and Power State = Shut Down.

This step ends the CPAR shutdown process.

VM Snapshot

Once the CPAR VMs are down, the snapshots can be taken in parallel, as they belong to independent computes.

The four QCOW2 files are created in parallel.

Take a snapshot of each AAA instance (25 minutes -1 hour) (25 minutes for instances that used a qcow image as a source and 1 hour for instances that user a raw image as a source)

Step 1. Login to POD’s Openstack’s HorizonGUI.

Step 2. Once logged in, proceed to the Project > Compute > Instances section on the top menu and look for the AAA instances.

Step 3. Click on the Create Snapshot button to proceed with snapshot creation (this needs to be executed on the corresponding AAA instance).

Step 4. Once the snapshot runs, navigate to the IMAGES menu and verify that all finish and report no problems.

[root@elospd01 stack]# glance image-list

+--------------------------------------+---------------------------+

| ID                                   | Name                      |             +--------------------------------------+---------------------------+

| 80f083cb-66f9-4fcf-8b8a-7d8965e47b1d | AAA-Temporary             |             | 22f8536b-3f3c-4bcc-ae1a-8f2ab0d8b950 | ELP1 cluman 10_09_2017    |

| 70ef5911-208e-4cac-93e2-6fe9033db560 | ELP2 cluman 10_09_2017    |

| e0b57fc9-e5c3-4b51-8b94-56cbccdf5401 | ESC-image                 |

| 92dfe18c-df35-4aa9-8c52-9c663d3f839b | lgnaaa01-sept102017       |

| 1461226b-4362-428b-bc90-0a98cbf33500 | tmobile-pcrf-13.1.1.iso   |

| 98275e15-37cf-4681-9bcc-d6ba18947d7b | tmobile-pcrf-13.1.1.qcow2 |

+--------------------------------------+---------------------------+

Step 6. Once identified the snapshot is to be downloaded (in this case is going to be the one marked above in green), now download it on a QCOW2 format with this command glance image-download as shown in here.

[root@elospd01 stack]# glance image-download 92dfe18c-df35-4aa9-8c52-9c663d3f839b --file /tmp/AAA-CPAR-LGNoct192017.qcow2 &

The “&” sends the process to background. It will take some time to complete this action, once it is done, the image can be located at /tmp directory.
On sending the process to background, if connectivity is lost, then the process is also stopped.
Execute the command “disown -h” so that in case of SSH connection is lost, the process still runs and finishes on the OSPD.

7. Once the download process finishes, a compression process needs to be executed as that snapshot may be filled with ZEROES because of processes, tasks and temporary files handled by the Operating System. The command to be used for file compression is virt-sparsify.

[root@elospd01 stack]# virt-sparsify AAA-CPAR-LGNoct192017.qcow2 AAA-CPAR-LGNoct192017_compressed.qcow2

This process takes some time (around 10-15 minutes). Once finished, the resulting file is the one that needs to be transferred to an external entity as specified on next step.

Verification of the file integrity is required, in order to achieve this, run the next command and look for the “corrupt” attribute at the end of its output.

[root@wsospd01 tmp]# qemu-img info AAA-CPAR-LGNoct192017_compressed.qcow2
image: AAA-CPAR-LGNoct192017_compressed.qcow2
file format: qcow2
virtual size: 150G (161061273600 bytes)
disk size: 18G
cluster_size: 65536
Format specific information:

    compat: 1.1

    lazy refcounts: false

    refcount bits: 16

    corrupt: false

Put CEPH in Maintenance Mode

Step 1. Verify ceph osd tree status are up in the server

[heat-admin@pod2-stack-osd-compute-0 ~]$ sudo ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 13.07996 root default 
-2 4.35999 host pod2-stack-osd-compute-0 
 0 1.09000 osd.0 up 1.00000 1.00000 
 3 1.09000 osd.3 up 1.00000 1.00000 
 6 1.09000 osd.6 up 1.00000 1.00000 
 9 1.09000 osd.9 up 1.00000 1.00000 
-3 4.35999 host pod2-stack-osd-compute-1 
 1 1.09000 osd.1 up 1.00000 1.00000 
 4 1.09000 osd.4 up 1.00000 1.00000 
 7 1.09000 osd.7 up 1.00000 1.00000 
10 1.09000 osd.10 up 1.00000 1.00000 
-4 4.35999 host pod2-stack-osd-compute-2 
 2 1.09000 osd.2 up 1.00000 1.00000 
 5 1.09000 osd.5 up 1.00000 1.00000 
 8 1.09000 osd.8 up 1.00000 1.00000 
11 1.09000 osd.11 up 1.00000 1.00000

Step 2. Login to the OSD Compute node and put CEPH in the maintenance mode.

[root@pod2-stack-osd-compute-0 ~]# sudo ceph osd set norebalance
[root@pod2-stack-osd-compute-0 ~]# sudo ceph osd set noout

[root@pod2-stack-osd-compute-0 ~]# sudo ceph status

 cluster eb2bb192-b1c9-11e6-9205-525400330666
 health HEALTH_WARN
 noout,norebalance,sortbitwise,require_jewel_osds flag(s) set
 monmap e1: 3 mons at {pod2-stack-controller-0=11.118.0.10:6789/0,pod2-stack-controller-1=11.118.0.11:6789/0,pod2-stack-controller-2=11.118.0.12:6789/0}
 election epoch 10, quorum 0,1,2 pod2-stack-controller-0,pod2-stack-controller-1,pod2-stack-controller-2
 osdmap e79: 12 osds: 12 up, 12 in
 flags noout,norebalance,sortbitwise,require_jewel_osds
 pgmap v22844323: 704 pgs, 6 pools, 804 GB data, 423 kobjects
 2404 GB used, 10989 GB / 13393 GB avail
 704 active+clean
 client io 3858 kB/s wr, 0 op/s rd, 546 op/s wr

Note: When CEPH is removed, VNF HD RAID goes into the Degraded state but hd-disk must still be accessible

Graceful Power Off

Power off Node

To power off the instance : nova stop <INSTANCE_NAME>
You see the instance name with the status shutoff.

[stack@director ~]$ nova stop  aaa2-21

Request to stop server aaa2-21 has been accepted.

[stack@director ~]$ nova list

+--------------------------------------+---------------------------+---------+------------+-------------+------------------------------------------------------------------------------------------------------------+

| ID                                   | Name                      | Status  | Task State | Power State | Networks                                                                                                   |

+--------------------------------------+---------------------------+---------+------------+-------------+------------------------------------------------------------------------------------------------------------+

| 46b4b9eb-a1a6-425d-b886-a0ba760e6114 | AAA-CPAR-testing-instance | ACTIVE  | -          | Running     | tb1-mgmt=172.16.181.14, 10.225.247.233; radius-routable1=10.160.132.245; diameter-routable1=10.160.132.231 |

| 3bc14173-876b-4d56-88e7-b890d67a4122 | aaa2-21                   | SHUTOFF | -          | Shutdown    | diameter-routable1=10.160.132.230; radius-routable1=10.160.132.248; tb1-mgmt=172.16.181.7, 10.225.247.234  |

| f404f6ad-34c8-4a5f-a757-14c8ed7fa30e | aaa21june                 | ACTIVE  | -          | Running     | diameter-routable1=10.160.132.233; radius-routable1=10.160.132.244; tb1-mgmt=172.16.181.10                 |

+--------------------------------------+---------------------------+---------+------------+-------------+------------------------------------------------------------------------------------------------------------+

Replace Motherboard

The steps in order to replace the motherboard in a UCS C240 M4 server can be referred from Cisco UCS C240 M4 Server Installation and Service Guide

Login to the server with the use of the CIMC IP.
Perform BIOS upgrade if the firmware is not as per the recommended version used previously. Steps for BIOS upgrade are given here: Cisco UCS C-Series Rack-Mount Server BIOS Upgrade Guide

Move CEPH out of Maintenance Mode

[root@pod2-stack-osd-compute-0 ~]# sudo ceph osd unset norebalance
[root@pod2-stack-osd-compute-0 ~]# sudo ceph osd unset noout

[root@pod2-stack-osd-compute-0 ~]# sudo ceph status

 cluster eb2bb192-b1c9-11e6-9205-525400330666
 health HEALTH_OK
 monmap e1: 3 mons at {pod2-stack-controller-0=11.118.0.10:6789/0,pod2-stack-controller-1=11.118.0.11:6789/0,pod2-stack-controller-2=11.118.0.12:6789/0}
 election epoch 10, quorum 0,1,2 pod2-stack-controller-0,pod2-stack-controller-1,pod2-stack-controller-2
 osdmap e81: 12 osds: 12 up, 12 in
 flags sortbitwise,require_jewel_osds
 pgmap v22844355: 704 pgs, 6 pools, 804 GB data, 423 kobjects
 2404 GB used, 10989 GB / 13393 GB avail
 704 active+clean
 client io 3658 kB/s wr, 0 op/s rd, 502 op/s wr

Restore the VMs

Recover an Instance through Snapshot

Recovery Process:

It is possible to redeploy the previous instance with the snapshot taken in previous steps.

Step 2.Connect to the OSPD node where the instance is re-deployed.

Source the environment variables with this command:

# source /home/stack/pod1-stackrc-Core-CPAR

Step 3.To use the snapshot as an image is necessary to upload it to the horizon as such. Use the next command to do so.

#glance image-create -- AAA-CPAR-Date-snapshot.qcow2 --container-format bare --disk-format qcow2 --name AAA-CPAR-Date-snapshot

The process can be seen on the horizon.

Step 4.In Horizon, navigate to Project > Instances and click on Launch Instance.

Step 5.Fill in the instance name and choose the availability zone.

Step 6.In the Source tab, choose the image to create the instance. In the Select Boot Source menu select image, a list of images are shown here, choose the one that was previously uploaded as you click on + sign.

Step 7.In the Flavor tab, choose the AAA flavor as you click on the + sign.

Step 8.Finally, navigate to the network tab and choose the networks that the instance needs as you click on the + sign. For this case select diameter-soutable1, radius-routable1 and tb1-mgmt.

Step 9. Finally, click on Launch instance to create it. The progress can be monitored in Horizon:

After a few minutes, the instance is completely deployed and ready for use.

Create and Assign a Floating IP Address

A floating IP address is a routable address, which means that it’s reachable from the outside of Ultra M/Openstack architecture, and it’s able to communicate with other nodes from the network.

Step 1.In the Horizon top menu, navigate toAdmin > Floating IPs.

Step 2.Click on the buttonAllocateIP to Project.

Step 3. In theAllocate Floating IPwindow select thePoolfrom which the new floating IP belongs, theProjectwhere it is going to be assigned, and the newFloating IP Addressitself.

For example:

Step 4.Click onAllocateFloating IPbutton.

Step 5. In the Horizon top menu, navigate toProject > Instances.

Step 6. In theActioncolumn click on the arrow that points down in theCreate Snapshotbutton, a menu should be displayed. SelectAssociate Floating IPoption.

Step 8.Finally, click onthe Associatebutton.

Enabling SSH

Step 1.In the Horizon top menu, navigate toProject > Instances.

Step 2.Click on the name of the instance/VM that was created in sectionLunch a new instance.

Step 3.Click onConsoletab. This displays the CLI of the VM.

Step 4. Once the CLI is displayed, enter the proper login credentials:

Username:root

Password:cisco123

Step 5.In the CLI enter the commandvi /etc/ssh/sshd_configto edit ssh configuration.

Step 6. Once the ssh configuration file is open, pressIto edit the file. Then look for the section shown here and change the first line fromPasswordAuthentication notoPasswordAuthentication yes.

Step 7.PressESCand enter:wq!to save sshd_config file changes.

Step 8. Run the commandservice sshd restart.

Establish an SSH session

Open an SSH session using the IP address of the corresponding VM/server where the application is installed.

CPAR Instance Start

Please follow these steps, once the activity has been completed and CPAR services can be re-established in the Site that was shut down.

Login back to Horizon, navigate to Project > Instance > Start Instance.
Verify that the status of the instance is active and the power state is running:

Post-activity Health Check

Step 1. Run the command /opt/CSCOar/bin/arstatus at OS level.

[root@aaa04 ~]# /opt/CSCOar/bin/arstatus
Cisco Prime AR RADIUS server running       (pid: 24834)
Cisco Prime AR Server Agent running        (pid: 24821)
Cisco Prime AR MCD lock manager running    (pid: 24824)
Cisco Prime AR MCD server running          (pid: 24833)
Cisco Prime AR GUI running                 (pid: 24836)
SNMP Master Agent running                (pid: 24835)
[root@wscaaa04 ~]#

Step 2. Run the command /opt/CSCOar/bin/aregcmd at OS level and enter the admin credentials. Verify that CPAR Health is 10 out of 10 and the exit CPAR CLI.

[root@aaa02 logs]# /opt/CSCOar/bin/aregcmd
Cisco Prime Access Registrar 7.3.0.1 Configuration Utility
Copyright (C) 1995-2017 by Cisco Systems, Inc.  All rights reserved.
Cluster:
User: admin
Passphrase:
Logging in to localhost
[ //localhost ]

    LicenseInfo = PAR-NG-TPS 7.2(100TPS:)

                  PAR-ADD-TPS 7.2(2000TPS:)

                  PAR-RDDR-TRX 7.2()

                  PAR-HSS 7.2()

    Radius/

    Administrators/
Server 'Radius' is Running, its health is 10 out of 10
--> exit

Step 3.Run the command netstat | grep diameter and verify that all DRA connections are established.

The output mentioned here is for an environment where Diameter links are expected. If fewer links are displayed, this represents a disconnection from the DRA that needs to be analyzed.

[root@aa02 logs]# netstat | grep diameter
tcp         0            0 aaa02.aaa.epc.:77 mp1.dra01.d:diameter ESTABLISHED
tcp         0            0 aaa02.aaa.epc.:36 tsa6.dra01:diameter ESTABLISHED
tcp         0            0 aaa02.aaa.epc.:47 mp2.dra01.d:diameter ESTABLISHED
tcp         0            0 aaa02.aaa.epc.:07 tsa5.dra01:diameter ESTABLISHED
tcp         0            0 aaa02.aaa.epc.:08 np2.dra01.d:diameter ESTABLISHED

Step 4.Check that the TPS log shows requests being processed by CPAR. The values highlighted represent the TPS and those are the ones we need to pay attention to.

The value of TPS should not exceed 1500.

[root@wscaaa04 ~]# tail -f /opt/CSCOar/logs/tps-11-21-2017.csv
11-21-2017,23:57:35,263,0
11-21-2017,23:57:50,237,0
11-21-2017,23:58:05,237,0
11-21-2017,23:58:20,257,0
11-21-2017,23:58:35,254,0
11-21-2017,23:58:50,248,0
11-21-2017,23:59:05,272,0
11-21-2017,23:59:20,243,0
11-21-2017,23:59:35,244,0
11-21-2017,23:59:50,233,0

Step 5.Look for any “error” or “alarm” messages in name_radius_1_log

[root@aaa02 logs]# grep -E "error|alarm" name_radius_1_log

Step 6.Verify the amount of memory that the CPAR process uses with this command:

top | grep radius

[root@sfraaa02 ~]# top | grep radius
27008 root      20   0 20.228g 2.413g  11408 S 128.3  7.7   1165:41 radius

This highlighted value should be lower than: 7Gb, which is the maximum allowed at an application level.

Motherboard Replacement in Controller Node

Verify Controller Status and put Cluster in Maintenance Mode

From OSPD, login to the controller and verify pcs is in good state – all three controllers Online and galera showing all three controllers as Master.

[heat-admin@pod2-stack-controller-0 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: pod2-stack-controller-2 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Fri Jul 6 09:02:52 2018Last change: Mon Jul 2 12:49:52 2018 by root via crm_attribute on pod2-stack-controller-0

3 nodes and 19 resources configured

Online: [ pod2-stack-controller-0 pod2-stack-controller-1 pod2-stack-controller-2 ]

Full list of resources:

 ip-11.120.0.49(ocf::heartbeat:IPaddr2):Started pod2-stack-controller-1
 Clone Set: haproxy-clone [haproxy]
 Started: [ pod2-stack-controller-0 pod2-stack-controller-1 pod2-stack-controller-2 ]
 Master/Slave Set: galera-master [galera]
 Masters: [ pod2-stack-controller-0 pod2-stack-controller-1 pod2-stack-controller-2 ]
 ip-192.200.0.110(ocf::heartbeat:IPaddr2):Started pod2-stack-controller-1
 ip-11.120.0.44(ocf::heartbeat:IPaddr2):Started pod2-stack-controller-2
 ip-11.118.0.49(ocf::heartbeat:IPaddr2):Started pod2-stack-controller-2
 Clone Set: rabbitmq-clone [rabbitmq]
 Started: [ pod2-stack-controller-0 pod2-stack-controller-1 pod2-stack-controller-2 ]
 ip-10.225.247.214(ocf::heartbeat:IPaddr2):Started pod2-stack-controller-1
 Master/Slave Set: redis-master [redis]
 Masters: [ pod2-stack-controller-2 ]
 Slaves: [ pod2-stack-controller-0 pod2-stack-controller-1 ]
 ip-11.119.0.49(ocf::heartbeat:IPaddr2):Started pod2-stack-controller-2
 openstack-cinder-volume(systemd:openstack-cinder-volume):Started pod2-stack-controller-1

Daemon Status:
 corosync: active/enabled
 pacemaker: active/enabled
 pcsd: active/enabled

Put the cluster in maintenance mode

[heat-admin@pod2-stack-controller-0 ~]$ sudo pcs cluster standby

[heat-admin@pod2-stack-controller-0 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: pod2-stack-controller-2 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Fri Jul 6 09:03:10 2018Last change: Fri Jul 6 09:03:06 2018 by root via crm_attribute on pod2-stack-controller-0

3 nodes and 19 resources configured

Node pod2-stack-controller-0: standby
Online: [ pod2-stack-controller-1 pod2-stack-controller-2 ]

Full list of resources:

 ip-11.120.0.49(ocf::heartbeat:IPaddr2):Started pod2-stack-controller-1
 Clone Set: haproxy-clone [haproxy]
 Started: [ pod2-stack-controller-1 pod2-stack-controller-2 ]
 Stopped: [ pod2-stack-controller-0 ]
 Master/Slave Set: galera-master [galera]
 Masters: [ pod2-stack-controller-0 pod2-stack-controller-1 pod2-stack-controller-2 ]
 ip-192.200.0.110(ocf::heartbeat:IPaddr2):Started pod2-stack-controller-1
 ip-11.120.0.44(ocf::heartbeat:IPaddr2):Started pod2-stack-controller-2
 ip-11.118.0.49(ocf::heartbeat:IPaddr2):Started pod2-stack-controller-2
 Clone Set: rabbitmq-clone [rabbitmq]
 Started: [ pod2-stack-controller-0 pod2-stack-controller-1 pod2-stack-controller-2 ]
 ip-10.225.247.214(ocf::heartbeat:IPaddr2):Started pod2-stack-controller-1
 Master/Slave Set: redis-master [redis]
 Masters: [ pod2-stack-controller-2 ]
 Slaves: [ pod2-stack-controller-1 ]
 Stopped: [ pod2-stack-controller-0 ]
 ip-11.119.0.49(ocf::heartbeat:IPaddr2):Started pod2-stack-controller-2
 openstack-cinder-volume(systemd:openstack-cinder-volume):Started pod2-stack-controller-1

Daemon Status:
 corosync: active/enabled
 pacemaker: active/enabled
 pcsd: active/enabled

Replace Motherboard

Procedure to replace the motherboard in a UCS C240 M4 server can be referred from Cisco UCS C240 M4 Server Installation and Service Guide

Login to the server with the use of the CIMC IP.
Perform BIOS upgrade if the firmware is not as per the recommended version used previously. Steps for BIOS upgrade are given here:

Cisco UCS C-Series Rack-Mount Server BIOS Upgrade Guide

Restore Cluster Status

Login to the impacted controller, remove standby mode by setting unstandby. Verify controller comes Online with cluster and galera shows all three controllers as Master. This may take a few minutes.

[heat-admin@pod2-stack-controller-0 ~]$ sudo pcs cluster unstandby

[heat-admin@pod2-stack-controller-0 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: pod2-stack-controller-2 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Fri Jul 6 09:03:37 2018Last change: Fri Jul 6 09:03:35 2018 by root via crm_attribute on pod2-stack-controller-0

3 nodes and 19 resources configured

Online: [ pod2-stack-controller-0 pod2-stack-controller-1 pod2-stack-controller-2 ]

Full list of resources:

 ip-11.120.0.49(ocf::heartbeat:IPaddr2):Started pod2-stack-controller-1
 Clone Set: haproxy-clone [haproxy]
 Started: [ pod2-stack-controller-0 pod2-stack-controller-1 pod2-stack-controller-2 ]
 Master/Slave Set: galera-master [galera]
 Masters: [ pod2-stack-controller-1 pod2-stack-controller-2 ]
 Slaves: [ pod2-stack-controller-0 ]
 ip-192.200.0.110(ocf::heartbeat:IPaddr2):Started pod2-stack-controller-1
 ip-11.120.0.44(ocf::heartbeat:IPaddr2):Started pod2-stack-controller-2
 ip-11.118.0.49(ocf::heartbeat:IPaddr2):Started pod2-stack-controller-2
 Clone Set: rabbitmq-clone [rabbitmq]
 Started: [ pod2-stack-controller-1 pod2-stack-controller-2 ]
 Stopped: [ pod2-stack-controller-0 ]
 ip-10.225.247.214(ocf::heartbeat:IPaddr2):Started pod2-stack-controller-1
 Master/Slave Set: redis-master [redis]
 Masters: [ pod2-stack-controller-2 ]
 Slaves: [ pod2-stack-controller-0 pod2-stack-controller-1 ]
 ip-11.119.0.49(ocf::heartbeat:IPaddr2):Started pod2-stack-controller-2
 openstack-cinder-volume(systemd:openstack-cinder-volume):Started pod2-stack-controller-1

Daemon Status:
 corosync: active/enabled
 pacemaker: active/enabled
 pcsd: active/enabled

Contributed by Cisco Engineers

Karthikeyan Dachanamoorthy
Cisco Advance Services
Harshita Bhardwaj
Cisco Advance Services

Was this Document Helpful?

Feedback

Contact Cisco

Open a Support Case
(Requires a Cisco Service Contract)

This Document Applies to These Products

Prime Access Registrar

Motherboard Replacement in Ultra-M UCS 240M4 Server - CPAR

Available Languages

Download Options

Bias-Free Language

Contents

Introduction

Background Information

Abbreviations

Workflow of the MoP

Motherboard Replacement in Ultra-M Setup

Prerequisites

Motherboard Replacement in Compute Node

Identify the VMs Hosted in the Compute Node

Backup: Snapshot Process

Step 1. CPAR Application Shutdown.

VM Snapshot Task

VM Snapshot

Graceful Power Off

Replace Motherboard

Restore the VMs

Recover an Instance through Snapshot

Recovery Process

Create and Assign a Floating IP Address

Enabling SSH

Establish an SSH Session

CPAR Instance Start

Post-activity Health Check

Motherboard Replacement in OSD Compute Node

Identify the VMs Hosted in the Osd-Compute Node

Backup: Snapshot Process

CPAR Application Shutdown

VM Snapshot task

VM Snapshot

Put CEPH in Maintenance Mode

Graceful Power Off

Replace Motherboard

Move CEPH out of Maintenance Mode

Restore the VMs

Recover an Instance through Snapshot

Create and Assign a Floating IP Address

Enabling SSH

Establish an SSH session

CPAR Instance Start

Post-activity Health Check

Motherboard Replacement in Controller Node

Verify Controller Status and put Cluster in Maintenance Mode

Replace Motherboard

Restore Cluster Status

Contributed by Cisco Engineers

Was this Document Helpful?

Contact Cisco

This Document Applies to These Products