+++
title = "Fixing a degraded zpool on proxmox"
date = 2025-02-03T10:00:00
slug = "degraded-zpool-proxmox"
description = "replacing a failed drive in a proxmox zpool"

[taxonomies]
tags = ["homelab", "tutorial"]
+++

I decided to finally fix the network issues with my proxmox server (it still had an old static IP and used VLANs I hadn't set up on the new switch and router) since I had some time today, but after fixing that fairly easily I discovered that my main `2.23 TB` zpool had a drive failure. Thankfully I had managed to stuff 3 disks into the case before, so losing one meant no data loss (thankfully 😬; all my projects from the last 5 years as well as my entire video archive are on this pool). I still have 3 more disks of the same type, so I can swap in a new one 2 more times after this.

<!-- more -->

{{ img(id="https://hc-cdn.hel1.your-objectstorage.com/s/v3/e54fd32f9a72ef35d310cb3cdc299b297c87baea_2image.png", alt="the zpool reporting a downed disk", caption="That really scared the pants off me when I first saw it 😂") }}
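(If you want a quick way to check your own box, `zpool status -x` only prints pools that actually have a problem, so it's handy for a fast "is anything broken?" look.)

```bash
# prints "all pools are healthy" when everything is fine,
# otherwise shows the status of any degraded or faulted pool
zpool status -x
```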

## Actually fixing it

First I had to find the affected disk physically in my case. Because I was stupid I didn't bother to label them, but thankfully the drives' serial numbers are on stickers stuck to them, so it wasn't terrible.
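If your drives don't have readable serial stickers, one low-tech trick (not something I needed here, just a sketch) is to find the healthy pool members by elimination: read from a known-good disk so its activity light blinks (or at least so you can hear it seek), and whatever never lights up is the dead one.

```bash
# read from one of the still-ONLINE members; this is read-only, so it's safe
# ctrl-c once you've spotted which drive is busy
dd if=/dev/disk/by-id/ata-ST3750640NS_3QD0BG0J of=/dev/null bs=1M
```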

{{ img(id="https://hc-cdn.hel1.your-objectstorage.com/s/v3/a6512def9bbeedbc1315a8ee58c92fbfb9e4d169_0image_from_ios.jpg", alt="chick-fil-a macaroni and cheese with 2 nuggets and some ketchup", caption="(By this point I had spent 30 minutes moaning so I went to lunch)") }}

Now we can run `lsblk -o +MODEL,SERIAL` to find the serial number of our new drive.

> root@thespia:~# lsblk -o +MODEL,SERIAL
```bash
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS MODEL SERIAL
sda 8:0 0 698.6G 0 disk ST3750640NS 3QD0BG0J
├─sda1 8:1 0 698.6G 0 part
└─sda9 8:9 0 8M 0 part
sdb 8:16 0 698.6G 0 disk ST3750640NS 3QD0BN6V
sdc 8:32 0 698.6G 0 disk ST3750640NS 3QD0BQ5G
├─sdc1 8:33 0 698.6G 0 part
└─sdc9 8:41 0 8M 0 part
sdd 8:48 1 111.8G 0 disk Hitachi HTS543212L9SA02 090130FBEB00LGGJ35RF
├─sdd1 8:49 1 1007K 0 part
├─sdd2 8:50 1 512M 0 part /boot/efi
└─sdd3 8:51 1 111.3G 0 part
 ├─pve-swap 253:0 0 8G 0 lvm [SWAP]
 ├─pve-root 253:1 0 37.8G 0 lvm /
 ├─pve-data_tmeta 253:2 0 1G 0 lvm
 │ └─pve-data-tpool 253:4 0 49.6G 0 lvm
 │ ├─pve-data 253:5 0 49.6G 1 lvm
 │ ├─pve-vm--100--cloudinit
 │ │ 253:6 0 4M 0 lvm
 │ ├─pve-vm--101--cloudinit
 │ │ 253:7 0 4M 0 lvm
 │ ├─pve-vm--103--disk--0
 │ │ 253:8 0 4M 0 lvm
 │ └─pve-vm--103--disk--1
 │ 253:9 0 32G 0 lvm
 └─pve-data_tdata 253:3 0 49.6G 0 lvm
 └─pve-data-tpool 253:4 0 49.6G 0 lvm
 ├─pve-data 253:5 0 49.6G 1 lvm
 ├─pve-vm--100--cloudinit
 │ 253:6 0 4M 0 lvm
 ├─pve-vm--101--cloudinit
 │ 253:7 0 4M 0 lvm
 ├─pve-vm--103--disk--0
 │ 253:8 0 4M 0 lvm
 └─pve-vm--103--disk--1
 253:9 0 32G 0 lvm
sde 8:64 0 465.8G 0 disk WDC WD5000AAKS-65YGA0 WD-WCAS83511331
├─sde1 8:65 0 465.8G 0 part
└─sde9 8:73 0 8M 0 part
sdf 8:80 1 0B 0 disk Multi-Card 20120926571200000
zd0 230:0 0 32G 0 disk
├─zd0p1 230:1 0 100M 0 part
├─zd0p2 230:2 0 16M 0 part
├─zd0p3 230:3 0 31.4G 0 part
└─zd0p4 230:4 0 522M 0 part
zd16 230:16 0 80G 0 disk
├─zd16p1 230:17 0 1M 0 part
└─zd16p2 230:18 0 80G 0 part
zd32 230:32 0 4M 0 disk
zd48 230:48 0 80G 0 disk
├─zd48p1 230:49 0 1M 0 part
└─zd48p2 230:50 0 80G 0 part
zd64 230:64 0 32G 0 disk
├─zd64p1 230:65 0 512K 0 part
└─zd64p2 230:66 0 32G 0 part
zd80 230:80 0 1M 0 disk
```

Our two current drives are `3QD0BG0J` and `3QD0BQ5G`, as we can see in proxmox, and they both have partitions while `sdb`/`3QD0BN6V` does not, so that's our target drive. Now we can find the disk by id with `ls /dev/disk/by-id | grep 3QD0BN6V`, which gives us:

> ls /dev/disk/by-id | grep 3QD0BN6V
```bash
ata-ST3750640NS_3QD0BN6V
```
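Since a spare that has been sitting on a shelf isn't guaranteed to be healthy either, it's probably worth a quick SMART check before resilvering onto it. A minimal sketch, assuming `smartmontools` is installed:

```bash
# overall health verdict plus SMART attributes and the error log
# (the by-id path is the one we just found with grep)
smartctl -a /dev/disk/by-id/ata-ST3750640NS_3QD0BN6V
```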

{{ img(id="https://hc-cdn.hel1.your-objectstorage.com/s/v3/f539cc5cb4e40b768f4b7bc6dc719467e438c6ed_0image_from_ios.jpg", alt="the inside of my rather messy case with the old server drives", caption="My case situation is a bit of a mess and I'm using old 7200rpm server drives for pretty much everything; the dream is a 3 drive 2 TB each m.2 nvme ssd setup, maybe someday 🤷") }}

We are going to go with the first id, so now we move on to the zfs part. Running `zpool status vault-of-the-eldunari` gives us the status of the pool:

> zpool status vault-of-the-eldunari
```bash
  pool: vault-of-the-eldunari
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: resilvered 8.33G in 00:48:26 with 0 errors on Thu Nov 14 18:38:03 2024
config:

        NAME                          STATE     READ WRITE CKSUM
        vault-of-the-eldunari         DEGRADED     0     0     0
          raidz1-0                    DEGRADED     0     0     0
            9201394420428878514      UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST3750640NS_3QD0BM29-part1
            ata-ST3750640NS_3QD0BQ5G  ONLINE       0     0     0
            ata-ST3750640NS_3QD0BG0J  ONLINE       0     0     0

errors: No known data errors
```
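The long number in the `UNAVAIL` row is the guid zfs assigned to the now-missing member (its old by-id path is shown after `was`), and that's the identifier we hand to `zpool replace` below. If you'd rather not copy it out by eye, something like this pulls it for you:

```bash
# the first field of the UNAVAIL line is the guid of the missing disk
zpool status vault-of-the-eldunari | awk '/UNAVAIL/ {print $1}'
```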

We can add our new disk with `zpool replace vault-of-the-eldunari 9201394420428878514 ata-ST3750640NS_3QD0BN6V`, but first we wipe the disk under the disks tab on our proxmox node to make sure it's all clean before we rebuild the pool onto it, and after that we also initialize a new GPT table on it. Now we are ready to replace the disk. Running the replace command can take quite a while and it doesn't output anything, so sit tight. After waiting a few minutes proxmox reported an estimated 1:49 left on the resilver and that it was already 5% done! I hope this helped at least one other person, but I'm mainly writing this to remind myself how to do this when it inevitably happens again :)
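For future me: the prep can also be done from the shell instead of the proxmox UI. This is a rough sketch rather than exactly what I ran (I used the web interface for the wipe and the GPT init), and it assumes `sgdisk` from the gdisk package is available:

```bash
# clear old filesystem/raid signatures and lay down a fresh GPT
# (roughly what the wipe + "initialize with GPT" steps in proxmox do)
wipefs --all /dev/disk/by-id/ata-ST3750640NS_3QD0BN6V
sgdisk --zap-all /dev/disk/by-id/ata-ST3750640NS_3QD0BN6V

# swap the missing member (referenced by its guid) for the new disk
zpool replace vault-of-the-eldunari 9201394420428878514 ata-ST3750640NS_3QD0BN6V

# keep an eye on the resilver
watch -n 60 zpool status vault-of-the-eldunari
```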

{{ img(id="https://hc-cdn.hel1.your-objectstorage.com/s/v3/8cc1c0d1717abacbc29d634004b14ec7475de0f2_0image.png", alt="the zpool status output showing the resilver in progress", caption="It's slow but faster than I expected for HDDs") }}