+++
title = "Fixing a degraded zpool on proxmox"
date = 2025-02-03T10:00:00
slug = "degraded-zpool-proxmox"
description = "replacing a failed drive in a proxmox zpool"

[taxonomies]
tags = ["homelab", "tutorial"]
+++

I decided to finally fix the network issues with my proxmox server (it was still on an old static IP and used VLANs I hadn't set up on the new switch and router) since I had some time today. After sorting that out fairly easily, I discovered that my main `2.23 TB` zpool had a drive failure. Thankfully I had managed to stuff 3 disks into the case before, so losing one meant no data loss (😬 all my projects from the last 5 years as well as my entire video archive are on this pool). I still have 3 more disks of the same type, so I can swap in a new one 2 more times after this.

<!-- more -->

{{ img(id="https://hc-cdn.hel1.your-objectstorage.com/s/v3/e54fd32f9a72ef35d310cb3cdc299b297c87baea_2image.png" alt="the zpool reporting a downed disk" caption="That really scared the pants off me when I first saw it 😂") }}

## Actually fixing it

First I had to physically find the affected disk in my case. Because I was stupid I didn't bother to label the drives, but thankfully their serial numbers are on stickers, so that wasn't terrible.

{{ img(id="https://hc-cdn.hel1.your-objectstorage.com/s/v3/a6512def9bbeedbc1315a8ee58c92fbfb9e4d169_0image_from_ios.jpg" alt="chick-fil-a macaroni and cheese with 2 nuggets and some ketchup" caption="(By this point I had spent 30 minutes moaning so I went to lunch)") }}

Now we can run `lsblk -o +MODEL,SERIAL` to find the serial number of our new drive.

> root@thespia:~# lsblk -o +MODEL,SERIAL
```bash
NAME                             MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS MODEL                   SERIAL
sda                                8:0    0 698.6G  0 disk             ST3750640NS             3QD0BG0J
├─sda1                             8:1    0 698.6G  0 part
└─sda9                             8:9    0     8M  0 part
sdb                                8:16   0 698.6G  0 disk             ST3750640NS             3QD0BN6V
sdc                                8:32   0 698.6G  0 disk             ST3750640NS             3QD0BQ5G
├─sdc1                             8:33   0 698.6G  0 part
└─sdc9                             8:41   0     8M  0 part
sdd                                8:48   1 111.8G  0 disk             Hitachi HTS543212L9SA02 090130FBEB00LGGJ35RF
├─sdd1                             8:49   1  1007K  0 part
├─sdd2                             8:50   1   512M  0 part /boot/efi
└─sdd3                             8:51   1 111.3G  0 part
  ├─pve-swap                     253:0    0     8G  0 lvm  [SWAP]
  ├─pve-root                     253:1    0  37.8G  0 lvm  /
  ├─pve-data_tmeta               253:2    0     1G  0 lvm
  │ └─pve-data-tpool             253:4    0  49.6G  0 lvm
  │   ├─pve-data                 253:5    0  49.6G  1 lvm
  │   ├─pve-vm--100--cloudinit   253:6    0     4M  0 lvm
  │   ├─pve-vm--101--cloudinit   253:7    0     4M  0 lvm
  │   ├─pve-vm--103--disk--0     253:8    0     4M  0 lvm
  │   └─pve-vm--103--disk--1     253:9    0    32G  0 lvm
  └─pve-data_tdata               253:3    0  49.6G  0 lvm
    └─pve-data-tpool             253:4    0  49.6G  0 lvm
      ├─pve-data                 253:5    0  49.6G  1 lvm
      ├─pve-vm--100--cloudinit   253:6    0     4M  0 lvm
      ├─pve-vm--101--cloudinit   253:7    0     4M  0 lvm
      ├─pve-vm--103--disk--0     253:8    0     4M  0 lvm
      └─pve-vm--103--disk--1     253:9    0    32G  0 lvm
sde                                8:64   0 465.8G  0 disk             WDC WD5000AAKS-65YGA0   WD-WCAS83511331
├─sde1                             8:65   0 465.8G  0 part
└─sde9                             8:73   0     8M  0 part
sdf                                8:80   1     0B  0 disk             Multi-Card              20120926571200000
zd0                              230:0    0    32G  0 disk
├─zd0p1                          230:1    0   100M  0 part
├─zd0p2                          230:2    0    16M  0 part
├─zd0p3                          230:3    0  31.4G  0 part
└─zd0p4                          230:4    0   522M  0 part
zd16                             230:16   0    80G  0 disk
├─zd16p1                         230:17   0     1M  0 part
└─zd16p2                         230:18   0    80G  0 part
zd32                             230:32   0     4M  0 disk
zd48                             230:48   0    80G  0 disk
├─zd48p1                         230:49   0     1M  0 part
└─zd48p2                         230:50   0    80G  0 part
zd64                             230:64   0    32G  0 disk
├─zd64p1                         230:65   0   512K  0 part
└─zd64p2                         230:66   0    32G  0 part
zd80                             230:80   0     1M  0 disk
```
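As a side note, `smartctl` is handy for double-checking that a serial you see in `lsblk` really belongs to the drive you think it does (and that the spare isn't itself on its way out). A minimal sketch, assuming smartmontools is available (`apt install smartmontools` if it isn't) and using `sdb` purely as an example device:

```bash
# Print drive identity (model + serial, to match against the sticker)
# and the overall SMART health self-assessment.
smartctl -i -H /dev/sdb

# The stable by-id path works too once you know the serial.
smartctl -i -H /dev/disk/by-id/ata-ST3750640NS_3QD0BN6V
```

The serial in that output is the same one printed on the drive's sticker, which makes matching up the physical disks a lot less painful.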
Our two current drives are `3QD0BG0J` and `3QD0BQ5G`, as we can see in proxmox, but we can also see that those two have partitions while `sdb` / `3QD0BN6V` does not, so that's our target drive. Now we can find the disk by id with `ls /dev/disk/by-id | grep 3QD0BN6V`, which gives us:

> ls /dev/disk/by-id | grep 3QD0BN6V
```bash
ata-ST3750640NS_3QD0BN6V
```

{{ img(id="https://hc-cdn.hel1.your-objectstorage.com/s/v3/f539cc5cb4e40b768f4b7bc6dc719467e438c6ed_0image_from_ios.jpg" alt="chick-fil-a macaroni and cheese with 2 nuggets and some ketchup" caption="My case situation is a bit of a mess and I'm using old 7200rpm server drives for pretty much everything; the dream is a 3 drive 2 TB each m.2 nvme ssd setup, maybe someday 🤷") }}

We're going to go with that id, so now we can move on to the zfs part. Running `zpool status vault-of-the-eldunari` gives us the status of the pool:

> zpool status vault-of-the-eldunari
```bash
  pool: vault-of-the-eldunari
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: resilvered 8.33G in 00:48:26 with 0 errors on Thu Nov 14 18:38:03 2024
config:

        NAME                          STATE     READ WRITE CKSUM
        vault-of-the-eldunari         DEGRADED     0     0     0
          raidz1-0                    DEGRADED     0     0     0
            9201394420428878514       UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST3750640NS_3QD0BM29-part1
            ata-ST3750640NS_3QD0BQ5G  ONLINE       0     0     0
            ata-ST3750640NS_3QD0BG0J  ONLINE       0     0     0

errors: No known data errors
```

We can swap in our new disk with `zpool replace vault-of-the-eldunari 9201394420428878514 ata-ST3750640NS_3QD0BN6V`, but first we wipe the disk under the Disks tab on our proxmox node to make sure it's completely clean, and then initialize a new GPT partition table on it. Now we're ready to replace the disk. Running the replace command can take quite a while and it doesn't print anything, so sit tight. After waiting a few minutes, proxmox reported that the resilver would take about 1:49 and was already 5% done! I hope this helped at least one other person, but I'm mainly writing this to remind myself how to do this when it inevitably happens again :)

{{ img(id="https://hc-cdn.hel1.your-objectstorage.com/s/v3/8cc1c0d1717abacbc29d634004b14ec7475de0f2_0image.png" alt="the zpool reporting a downed disk" caption="It's slow but faster than I expected for HDDs") }}
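For future me (since that's mostly who this post is for): the whole dance can also be done from a shell. This is an untested sketch: the wipe/GPT commands are only my guess at a CLI equivalent of the wipe and "Initialize Disk with GPT" buttons in the proxmox UI, and the pool name, old device GUID, and new disk id are of course specific to my setup.

```bash
# DESTRUCTIVE: triple-check the device name before wiping anything!
wipefs -a /dev/sdb           # clear leftover filesystem/RAID signatures
sgdisk --zap-all /dev/sdb    # destroy any old partition tables (sgdisk comes from the gdisk package)
sgdisk -o /dev/sdb           # write a fresh, empty GPT

# Swap the dead member (referenced by its old GUID) for the new disk by id
zpool replace vault-of-the-eldunari 9201394420428878514 ata-ST3750640NS_3QD0BN6V

# Watch the resilver from the terminal instead of refreshing the web UI
watch -n 30 zpool status vault-of-the-eldunari
```

Once the resilver finishes, running `zpool scrub vault-of-the-eldunari` and checking `zpool status` afterwards is a cheap way to confirm everything reads back cleanly.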