Recently, during an Exadata patch, one database node reported an issue while patchmgr was running and the patch apply stopped. The error was related to a missing LVM volume (LVDoNotRemoveOrUse). In this post you can check the error and the fix, but be aware that the procedure changes the contents of an LVM config file. So, double-check each step before executing it and, if possible, open a proactive SR to be sure about what you will be doing.
The error
While running patchmgr for the dbnodes from node01 against node02, I got the error below:
[root@exavm01s4 dbnodeupdate]# ./dbserver_patch_20.210314/patchmgr --dbnodes /u01/patches/exadatapt/dbnode_exavm_exavm02s4 --upgrade --iso_repo /u01/patches/exadatapt/domU/p32459080_201000_Linux-x86-64.zip --target_version 20.1.8.0.0.210317 --skip_gi_db_validation
************************************************************************************************************
NOTE    patchmgr release: 21.210314 (always check MOS 1553103.1 for the latest release of dbserver.patch.zip)
NOTE
NOTE    Database nodes will reboot during the update process.
NOTE
WARNING Do not interrupt the patchmgr session.
WARNING Do not resize the screen. It may disturb the screen layout.
WARNING Do not reboot database nodes during update or rollback.
WARNING Do not open logfiles in write mode and do not try to alter them.
************************************************************************************************************
2021-04-22 14:42:48 +0200        :INFO   : Checking hosts connectivity via ICMP/ping
2021-04-22 14:42:49 +0200        :INFO   : Hosts Reachable: [exavm02s4]
2021-04-22 14:42:49 +0200        :INFO   : All hosts are reachable via ping/ICMP
2021-04-22 14:42:49 +0200        :Working: Verify SSH equivalence for the root user to exavm02s4
2021-04-22 14:42:50 +0200        :INFO   : SSH equivalency verified to host exavm02s4
2021-04-22 14:42:50 +0200        :SUCCESS: Verify SSH equivalence for the root user to exavm02s4
2021-04-22 14:42:52 +0200        :Working: Initiate prepare steps on node(s).
2021-04-22 14:42:53 +0200        :Working: Check free space on exavm02s4
2021-04-22 14:42:57 +0200        :SUCCESS: Check free space on exavm02s4
2021-04-22 14:43:23 +0200        :SUCCESS: Initiate prepare steps on node(s).
2021-04-22 14:43:23 +0200        :Working: Initiate update on 1 node(s).
2021-04-22 14:43:24 +0200        :Working: dbnodeupdate.sh running a backup on 1 node(s).
2021-04-22 16:45:26 +0200        :ERROR  : dbnodeupdate.sh backup failed on one or more nodes

SUMMARY OF ERRORS FOR exavm02s4:

exavm02s4: ERROR: Backup failed investigate logfiles /var/log/cellos/dbnodeupdate.log and /var/log/cellos/dbserver_backup.sh.log
2021-04-22 16:45:33 +0200        :FAILED : dbnodeupdate.sh running a backup on 1 node(s).

[INFO     ] Collected dbnodeupdate diag in file: Diag_patchmgr_dbnode_upgrade_220421144247.tbz
-rw-r--r-- 1 root root 1701698 Apr 22 16:45 Diag_patchmgr_dbnode_upgrade_220421144247.tbz
2021-04-22 16:45:35 +0200        :ERROR  : FAILED run of command:./dbserver_patch_20.210314/patchmgr --dbnodes /u01/patches/exadatapt/dbnode_exavm_exavm02s4 --upgrade --iso_repo /u01/patches/exadatapt/domU/p32459080_201000_Linux-x86-64.zip --target_version 20.1.8.0.0.210317 --skip_gi_db_validation
2021-04-22 16:45:35 +0200        :INFO   : Upgrade attempted on nodes in file /u01/patches/exadatapt/dbnode_exavm_exavm02s4: [exavm02s4]
2021-04-22 16:45:35 +0200        :INFO   : Current image version on dbnode(s) is:
2021-04-22 16:45:35 +0200        :INFO   : exavm02s4: 19.2.19.0.0.201013
2021-04-22 16:45:35 +0200        :INFO   : For details, check the following files in /u01/patches/exadatapt/dbnodeupdate/dbserver_patch_20.210314:
2021-04-22 16:45:35 +0200        :INFO   :  - <dbnode_name>_dbnodeupdate.log
2021-04-22 16:45:35 +0200        :INFO   :  - patchmgr.log
2021-04-22 16:45:35 +0200        :INFO   :  - patchmgr.trc
2021-04-22 16:45:35 +0200        :INFO   : Exit status:1
2021-04-22 16:45:35 +0200        :INFO   : Exiting.
If you check patchmgr.log you will see the same error message, but looking at /var/log/cellos/dbnodeupdate.log (on the target node being patched) the true error appears:
[root@exavm02s4 ~]# vi /var/log/cellos/dbnodeupdate.log
...
...
Setting interval between checks to 0 seconds
[INFO] Mount spare root partition /dev/VGExaDb/LVDbSys2 to /mnt_spare
Failed to find logical volume "VGExaDb/LVDoNotRemoveOrUse"
[INFO] Preserve and then reset label for the root partition /dev/VGExaDb/LVDbSys1
[INFO] Total amount of space available for snapshot: 1 GB
[INFO] Will be using snapshot of size: 1 GB
[INFO] Create LVM snapshot with 1 GB size of the root partition /dev/VGExaDb/LVDbSys1
WARNING: Missing device /dev/xvda2 reappeared, updating metadata for VG VGExaDb to version 44.
WARNING: Device /dev/xvda2 still marked missing because of allocated data on it, remove volumes and consider vgreduce --removemissing.
WARNING: Missing device /dev/xvdd1 reappeared, updating metadata for VG VGExaDb to version 44.
WARNING: Device /dev/xvdd1 still marked missing because of allocated data on it, remove volumes and consider vgreduce --removemissing.
Cannot change VG VGExaDb while PVs are missing.
Consider vgreduce --removemissing.
Cannot process volume group VGExaDb
Unable to create LVM snapshot with 1Gb size of the root partition /dev/VGExaDb/LVDbSys1
[1619095405][2021-04-22 16:45:18 +0200][INFO][./dbnodeupdate.sh][DiaryEntry][] Entering PrintGenError
Backup failed investigate logfiles /var/log/cellos/dbnodeupdate.log and /var/log/cellos/dbserver_backup.sh.log
...
The error is clear: “Missing device” and “Cannot change VG VGExaDb while PVs are missing”. So, basically, LVM is reporting missing physical volumes, and we need to repair the volume group metadata and recreate what is missing.
Recovering
The starting point is to get a baseline from a healthy node. If you don't have one at hand, this is an example of a correct LVM layout for an Exadata VM:
[root@exavm01s4 dbserver_patch_20.210314]# pvs
  PV         VG      Fmt  Attr PSize    PFree
  /dev/xvda2 VGExaDb lvm2 a--   <24.50g    0
  /dev/xvdd1 VGExaDb lvm2 a--   <62.00g    0
  /dev/xvdf  VGExaDb lvm2 a--   <50.00g    0
  /dev/xvdg  VGExaDb lvm2 a--  <150.00g    0
[root@exavm01s4 dbserver_patch_20.210314]# vgs
  VG      #PV #LV #SN Attr   VSize   VFree
  VGExaDb   4   5   0 wz--n- 286.48g    0
[root@exavm01s4 dbserver_patch_20.210314]#
[root@exavm01s4 dbserver_patch_20.210314]# lvs
  LV                 VG      Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  LVDbOra1           VGExaDb -wi-ao---- 221.48g
  LVDbSwap1          VGExaDb -wi-ao----  16.00g
  LVDbSys1           VGExaDb -wi-ao----  24.00g
  LVDbSys2           VGExaDb -wi-a-----  24.00g
  LVDoNotRemoveOrUse VGExaDb -wi-a-----   1.00g
[root@exavm01s4 dbserver_patch_20.210314]#
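If you want to keep that reference around for later comparison, a quick way is to dump it to a file on the failed node (a minimal sketch; the hostname exavm01s4 and the output path are just the ones used in this post):

# Capture the LVM layout of the healthy node so we can diff against it
# at the end of the recovery (path and hostname are illustrative).
ssh root@exavm01s4 'pvs; vgs; lvs' > /root/lvm_baseline_exavm01s4.txt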
But when I check the failed node I have:
[root@exavm02s4 ~]# pvs
  PV         VG      Fmt  Attr PSize    PFree
  /dev/xvda2 VGExaDb lvm2 a-m   <24.50g    0
  /dev/xvdd1 VGExaDb lvm2 a-m   <62.00g 1.00g
  /dev/xvdf  VGExaDb lvm2 a--   <50.00g    0
  /dev/xvdg  VGExaDb lvm2 a--  <150.00g    0
[root@exavm02s4 ~]#
[root@exavm02s4 ~]# lvs
  LV        VG      Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  LVDbOra1  VGExaDb -wi-ao--p- 221.48g
  LVDbSwap1 VGExaDb -wi-ao--p-  16.00g
  LVDbSys1  VGExaDb -wi-ao--p-  24.00g
  LVDbSys2  VGExaDb -wi-a---p-  24.00g
[root@exavm02s4 ~]#
[root@exavm02s4 ~]# vgs
  VG      #PV #LV #SN Attr   VSize   VFree
  VGExaDb   4   4   0 wz-pn- 286.48g 1.00g
[root@exavm02s4 ~]#
As you can see, the failed node has physical volumes flagged as missing and is lacking one logical volume, the 1 GB LVDoNotRemoveOrUse.
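If you want to confirm this without comparing the full listings by eye, the attribute columns already tell the story: the third character of the PV Attr ('m' in a-m) marks a device as missing, the fourth character of the VG Attr ('p' in wz-pn-) marks a partial VG, and the ninth character of the LV Attr ('p' in -wi-ao--p-) marks a partial LV. A minimal check, using only standard lvm2 report fields:

# Show just the attribute flags that matter here:
# PV attr 'm' = device marked missing, VG attr 'p' = partial VG,
# LV attr 'p' (9th char) = partial LV sitting on a missing PV.
pvs -o pv_name,vg_name,pv_attr
vgs -o vg_name,vg_attr
lvs -o lv_name,vg_name,lv_attr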
The first step of the fix is to try to remove the missing volumes with vgreduce --removemissing. This step will fail, but it is crucial to execute it because it generates a backup of the current LVM configuration, and it is this backup file that we will edit and reload.
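Before running it, it does not hurt to keep an untouched copy of whatever LVM metadata backups and archives already exist (a sketch; the destination directory is just an example):

# Keep an untouched copy of the existing LVM metadata backups/archives
# before making any change (destination path is illustrative).
mkdir -p /root/lvm_meta_save
cp -a /etc/lvm/backup /etc/lvm/archive /root/lvm_meta_save/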
[root@exavm02s4 ~]# vgreduce --removemissing --verbose VGExaDb
  There are 2 physical volumes missing.
  There are 2 physical volumes missing.
  Archiving volume group "VGExaDb" metadata (seqno 46).
  WARNING: Partial LV LVDbSys1 needs to be repaired or removed.
  WARNING: Partial LV LVDbSys2 needs to be repaired or removed.
  WARNING: Partial LV LVDbOra1 needs to be repaired or removed.
  WARNING: Partial LV LVDbSwap1 needs to be repaired or removed.
  There are still partial LVs in VG VGExaDb.
  To remove them unconditionally use: vgreduce --removemissing --force.
  WARNING: Proceeding to remove empty missing PVs.
  There are 2 physical volumes missing.
  Creating volume group backup "/etc/lvm/backup/VGExaDb" (seqno 47).
[root@exavm02s4 ~]#
Note that the file /etc/lvm/backup/VGExaDb was generated. And, as an example, you can see that even trying to re-create the missing logical volume or to restore the missing devices generates errors as well:
[root@exavm02s4 ~]# lvcreate -n LVDoNotRemoveOrUse -L1G VGExaDb
  WARNING: Missing device /dev/xvda2 reappeared, updating metadata for VG VGExaDb to version 48.
  WARNING: Device /dev/xvda2 still marked missing because of allocated data on it, remove volumes and consider vgreduce --removemissing.
  WARNING: Missing device /dev/xvdd1 reappeared, updating metadata for VG VGExaDb to version 48.
  WARNING: Device /dev/xvdd1 still marked missing because of allocated data on it, remove volumes and consider vgreduce --removemissing.
  Cannot change VG VGExaDb while PVs are missing.
  Consider vgreduce --removemissing.
  Cannot process volume group VGExaDb
[root@exavm02s4 ~]#
[root@exavm02s4 ~]# vgextend --restoremissing LVDbSys1 VGExaDb
  Volume group "LVDbSys1" not found
  Cannot process volume group LVDbSys1
[root@exavm02s4 ~]#
Modifying the LVM
After /etc/lvm/backup/VGExaDb is created, we can check its content and verify that there are physical volumes marked with the MISSING flag:
[root@exavm02s4 ~]# cd /etc/lvm/
[root@exavm02s4 lvm]#
[root@exavm02s4 lvm]# ls -l backup/
total 4
-rw------- 1 root root 3653 Apr 22 17:19 VGExaDb
[root@exavm02s4 lvm]#
[root@exavm02s4 lvm]#
[root@exavm02s4 lvm]# cat backup/VGExaDb
# Generated by LVM2 version 2.02.186(2)-RHEL7 (2019-08-27): Thu Apr 22 17:19:30 2021

contents = "Text Format Volume Group"
version = 1

description = "Created *after* executing 'vgreduce --removemissing --verbose VGExaDb'"

creation_host = "exavm02s4.mynt.simon.net"      # Linux exavm02s4.mynt.simon.net 4.1.12-124.42.4.el7uek.x86_64 #2 SMP Thu Sep 3 16:14:48 PDT 2020 x86_64
creation_time = 1619104770      # Thu Apr 22 17:19:30 2021

VGExaDb {
        id = "ynfwGi-HZPF-0fe9-38lq-MbKE-DMhp-40QUXh"
        seqno = 48
        format = "lvm2"                 # informational
        status = ["RESIZEABLE", "READ", "WRITE"]
        flags = []
        extent_size = 8192              # 4 Megabytes
        max_lv = 0
        max_pv = 0
        metadata_copies = 0

        physical_volumes {

                pv0 {
                        id = "o6OAXC-J3xd-Z8YT-mE2j-MZyN-UM4H-7cP3Q0"
                        device = "/dev/xvda2"   # Hint only

                        status = ["ALLOCATABLE"]
                        flags = ["MISSING"]
                        dev_size = 51380126     # 24.5 Gigabytes
                        pe_start = 384
                        pe_count = 6271         # 24.4961 Gigabytes
                }

                pv1 {
                        id = "eQDYEs-cwbA-OI58-R9hP-Sure-bsbI-rE5w0x"
                        device = "/dev/xvdd1"   # Hint only

                        status = ["ALLOCATABLE"]
                        flags = ["MISSING"]
                        dev_size = 130023326    # 62 Gigabytes
                        pe_start = 384
                        pe_count = 15871        # 61.9961 Gigabytes
                }

                pv2 {
                        id = "fE09JF-ajaK-7057-oiZL-6DRO-UGfL-l1vIFy"
                        device = "/dev/xvdf"    # Hint only

                        status = ["ALLOCATABLE"]
                        flags = []
                        dev_size = 104857600    # 50 Gigabytes
                        pe_start = 2048
                        pe_count = 12799        # 49.9961 Gigabytes
                }

                pv3 {
                        id = "SNJiFM-PEjR-xuyU-8kBK-vqGf-b5AT-kDfW2S"
                        device = "/dev/xvdg"    # Hint only

                        status = ["ALLOCATABLE"]
                        flags = []
                        dev_size = 314572800    # 150 Gigabytes
                        pe_start = 2048
                        pe_count = 38399        # 149.996 Gigabytes
                }
        }

        logical_volumes {
...
...
So, we edit the file and remove the "MISSING" value from the flags entries (not the entire flags line, just the value inside the brackets):
[root@exavm02s4 lvm]# cat /etc/lvm/backup/VGExaDb |grep MISSING
                        flags = ["MISSING"]
                        flags = ["MISSING"]
[root@exavm02s4 lvm]#
[root@exavm02s4 lvm]#
[root@exavm02s4 lvm]# cd /etc/lvm/backup/
[root@exavm02s4 backup]#
[root@exavm02s4 backup]#
[root@exavm02s4 backup]# vi /etc/lvm/backup/VGExaDb
[root@exavm02s4 backup]#
[root@exavm02s4 backup]#
[root@exavm02s4 backup]# cat VGExaDb |grep MISSING
[root@exavm02s4 backup]#
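If you prefer to script that edit instead of doing it by hand in vi, a sed one-liner such as the following would do the same thing (a sketch; the .orig suffix just keeps an extra copy of the file before the change):

# Empty the flags list on the lines that carry the MISSING value,
# keeping a copy of the original file as VGExaDb.orig.
sed -i.orig 's/flags = \["MISSING"\]/flags = []/' /etc/lvm/backup/VGExaDb
grep -c MISSING /etc/lvm/backup/VGExaDb   # should now print 0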
After removing the flag values we can restore the metadata from the file we just edited (the backup file). Please BE AWARE that this can damage your LVM if you restore the wrong file. Never use a file generated on another node; the volume UUIDs will be different. In the command below, one parameter is the backup file (VGExaDb) and the other is the name of the volume group (VG); both happen to have the same name, since on Exadata the VG is VGExaDb.
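Before running the real restore, it can be reassuring to list the backups LVM knows about and to do a dry run first; --test is the standard LVM option that prevents any metadata from being written (a sketch of that extra check):

# List the metadata backup/archive files known for this VG.
vgcfgrestore --list VGExaDb
# Dry run of the restore: --test parses and validates but writes nothing.
vgcfgrestore --test -f /etc/lvm/backup/VGExaDb VGExaDb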
[root@exavm02s4 backup]# vgcfgrestore -f VGExaDb VGExaDb
  Volume group VGExaDb has active volume: LVDbSys2.
  Volume group VGExaDb has active volume: LVDbSwap1.
  Volume group VGExaDb has active volume: LVDbSys1.
  Volume group VGExaDb has active volume: LVDbOra1.
  WARNING: Found 4 active volume(s) in volume group "VGExaDb".
  Restoring VG with active LVs, may cause mismatch with its metadata.
Do you really want to proceed with restore of volume group "VGExaDb", while 4 volume(s) are active? [y/n]: y
  Restored volume group VGExaDb
  Scan of VG VGExaDb from /dev/xvda2 found metadata seqno 49 vs previous 48.
  Scan of VG VGExaDb from /dev/xvdd1 found metadata seqno 49 vs previous 48.
  Scan of VG VGExaDb from /dev/xvdf found metadata seqno 49 vs previous 48.
  Scan of VG VGExaDb from /dev/xvdg found metadata seqno 49 vs previous 48.
[root@exavm02s4 backup]#
As you can see above the volume group was restored and a new sequence number was generated to identify it.
After that we can scan the volumes again to check if everything was added correctly and reboot the node:
[root@exavm02s4 backup]# vgscan
  Reading volume groups from cache.
  Found volume group "VGExaDb" using metadata type lvm2
[root@exavm02s4 backup]#
[root@exavm02s4 backup]#
[root@exavm02s4 backup]# vgs
  VG      #PV #LV #SN Attr   VSize   VFree
  VGExaDb   4   4   0 wz--n- 286.48g 1.00g
[root@exavm02s4 backup]#
[root@exavm02s4 backup]# reboot
...
...
After the reboot we can scan again and recreate the missing volume LVDoNotRemoveOrUse:
[root@exavm02s4 ~]# vgscan
  Reading volume groups from cache.
  Found volume group "VGExaDb" using metadata type lvm2
[root@exavm02s4 ~]#
[root@exavm02s4 ~]# lvscan
  ACTIVE            '/dev/VGExaDb/LVDbSys1' [24.00 GiB] inherit
  ACTIVE            '/dev/VGExaDb/LVDbSys2' [24.00 GiB] inherit
  ACTIVE            '/dev/VGExaDb/LVDbOra1' [221.48 GiB] inherit
  ACTIVE            '/dev/VGExaDb/LVDbSwap1' [16.00 GiB] inherit
[root@exavm02s4 ~]#
[root@exavm02s4 ~]# lvcreate -n LVDoNotRemoveOrUse -L1G VGExaDb
  Logical volume "LVDoNotRemoveOrUse" created.
[root@exavm02s4 ~]#
[root@exavm02s4 ~]# lvscan
  ACTIVE            '/dev/VGExaDb/LVDbSys1' [24.00 GiB] inherit
  ACTIVE            '/dev/VGExaDb/LVDbSys2' [24.00 GiB] inherit
  ACTIVE            '/dev/VGExaDb/LVDbOra1' [221.48 GiB] inherit
  ACTIVE            '/dev/VGExaDb/LVDbSwap1' [16.00 GiB] inherit
  ACTIVE            '/dev/VGExaDb/LVDoNotRemoveOrUse' [1.00 GiB] inherit
[root@exavm02s4 ~]#
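As a final sanity check, you can compare the repaired node against the baseline captured from the healthy node at the beginning of the recovery; the diff should come back empty, or show only cosmetic spacing differences (a sketch, reusing the illustrative file names from earlier):

# Dump the repaired node's layout and diff it against the healthy-node
# baseline saved earlier (file names are illustrative).
{ pvs; vgs; lvs; } > /root/lvm_after_fix_exavm02s4.txt
diff /root/lvm_baseline_exavm01s4.txt /root/lvm_after_fix_exavm02s4.txt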
It is not clear (and I could not investigate) why this error appeared; it was not the first time that I hit it. Maybe it is related to the dracut issue that I described in my previous post.
Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community. Post protected by copyright.”