btrfs scrub status /dev/md0 csum=2942 uncorrectable errors: 2942

  • UPDATE 5: Solution:

    The PSU was overloaded after one HDD was replaced with a newer model that draws more power. A tricky one to find.

    I copied the RAID data to an external HDD, removed 2 HDDs, created a new RAID 6 array with a BTRFS filesystem, and copied the data back.


    UPDATE 4: btrfs check --repair --init-csum-tree --init-extent-tree /dev/md0

    Output after some hours:

    enabling repair mode
    Creating a new CRC tree
    Opening filesystem to check...
    Checking filesystem on /dev/md0
    UUID: 05bbc4f1-d4d8-4af2-863f-13eb864736b1
    Creating a new extent tree
    parent transid verify failed on 30490624 wanted 4 found 6049
    Ignoring transid failure
    Reinitialize checksum tree
    ctree.c:2245: split_leaf: BUG_ON `1` triggered, value 1
    btrfs(+0x141e9)[0x55709ed041e9]
    btrfs(+0x14284)[0x55709ed04284]
    btrfs(+0x169ad)[0x55709ed069ad]
    btrfs(btrfs_search_slot+0xf24)[0x55709ed07f9f]
    btrfs(btrfs_csum_file_block+0x25f)[0x55709ed15888]
    btrfs(+0x4aa30)[0x55709ed3aa30]
    btrfs(cmd_check+0xf0b)[0x55709ed46af8]
    btrfs(main+0x1f3)[0x55709ed03e63]
    /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb)[0x7f46fb63109b]
    btrfs(_start+0x2a)[0x55709ed03eaa]
    Aborted (SIGABRT)


    And scrub still reports uncorrectable csum errors.



    UPDATE 3: The PSU 5 V rail measures 4.78 V, dropping to 4.74 V during a scrub. I plan to replace the PSU.


    UPDATE 2:

    The RAID 6 array is 18 TB, ~4 TB used.


    The uncorrectable scrub csum errors are always within the first 1 TB of data.


    I emptied the SMB/CIFS recycle bin ("Empty now") and overwrote, from the backup, every file the log showed with a checksum error.

    The uncorrectable scrub csum error count is now 1523 instead of 2942, still concentrated in the first ~1 TB of data, and it is now other files that have the uncorrectable csum errors.


    I would like to find the root cause.


    1) For now my plan is to overwrite the newly affected files from the backup and see what happens (this will not find the root cause).

    2) Replace the RAID 6 disk that shows an increasing SMART attribute 5 (Reallocated Sectors Count).
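To keep an eye on attribute 5, smartctl can be polled and the raw count extracted. A minimal sketch; the sample line below is hypothetical output, not from the system above, and /dev/sdX is a placeholder:

```shell
# On the real system: smartctl -A /dev/sdX | grep Reallocated_Sector_Ct
# Here we parse a hypothetical sample line to show where the raw count sits.
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       3000'

# The raw value is the last whitespace-separated field.
raw=$(echo "$sample" | awk '{print $NF}')
echo "Reallocated sectors: $raw"
```

Logging that value periodically (e.g. from cron) makes it easy to see whether the count is still climbing, which is the signal to replace the disk.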


    My main questions are:

    Why are the scrub errors only in the first 1 TB of data?

    Why is it sometimes possible to copy a file, and then some days later impossible to read it?

    Why do files that have sat untouched in the filesystem now have csum errors?

    Is it a hardware or a software error?

    Why does copying from the BTRFS RAID 6 to Windows 10 have 3 possible outcomes:

    a) Sometimes a file with a checksum error copies without any error and the copy looks normal.

    b) Other times, in the middle of the copy, Windows 10 just skips to the end without any error, but the file is not copied to the Windows folder.

    c) The copy process stops and reports "network error".


    It looks as if marking the file in Windows 10 and pressing <ctrl>+<c>, then <ctrl>+<v> in the target folder, has a higher chance of success than using the mouse (marking the file and dragging it to the Windows folder). If the file has a checksum error, drag and drop nearly always fails.

    After overwriting the file on the RAID array, the file can be copied without problems using either mouse or keyboard.

    But if <ctrl>+<c>/<ctrl>+<v> does not work, deleting the SMB/CIFS share and re-adding it with the same settings makes <ctrl>+<c>/<ctrl>+<v> work the first time; the next time it does not work. But you can then overwrite the file with the copy you just made, and afterwards that file can be copied with either mouse or keyboard…


    Any help is appreciated.



    UPDATE 1: Found a broken SATA cable, which I replaced, but scrub still reports uncorrectable errors.


    ***********************************************************************************************************************************************************


    I have a situation that is confusing to me:


    All copying is done over the network, from OMV to Windows 10.


    I run btrfs scrub start /dev/md0 and then

    btrfs scrub status /dev/md0

    which returns:

    root@RAID-6-2-TB:~# btrfs scrub status /dev/md0
    scrub status for 05bbc4f1-d4d8-4af2-863f-13eb864736b1
    scrub started at Sun Feb 6 16:40:09 2022, running for 00:43:15
    total bytes scrubbed: 1.11TiB with 2942 errors
    error details: csum=2942
    corrected errors: 0, uncorrectable errors: 2942, unverified errors: 0

    root@RAID-6-2-TB:~# btrfs scrub status /dev/md0
    scrub status for 05bbc4f1-d4d8-4af2-863f-13eb864736b1
    scrub started at Sun Feb 6 16:40:09 2022 and finished after 02:35:46
    total bytes scrubbed: 4.00TiB with 2942 errors
    error details: csum=2942
    corrected errors: 0, uncorrectable errors: 2942, unverified errors: 0



    md0: 18 TB, 11 × 2 TB disks, RAID 6.


    Q1: Is it the csum that is wrong, or the data file?


    The OMV RAID 6 system was running very well, but one disk in the array showed an increasing SMART attribute 5, so I did an mdadm --add and --replace on that disk.

    Before and after the disk replacement I ran a btrfs scrub on /dev/md0 and it reported 0 errors.


    The same day, another disk reported ~3000 bad sectors in SMART attribute 5.


    I performed the same procedure:

    Scrub: OK

    mdadm --add and --replace
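For reference, the add-and-replace step looks roughly like this; /dev/sdnew and /dev/sdold are placeholders, not the actual device names from this system:

```shell
# Add the new disk as a spare, then ask md to rebuild onto it
# while the failing disk is still readable (adjust device names):
mdadm /dev/md0 --add /dev/sdnew
mdadm /dev/md0 --replace /dev/sdold --with /dev/sdnew

# Watch the rebuild progress:
cat /proc/mdstat
```

The advantage of --replace over fail-and-re-add is that the array keeps full redundancy while the copy runs, since the old disk stays in the array until the new one is synced.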


    And now scrub always returns csum=2942, uncorrectable errors: 2942 when it finishes. Actually the uncorrectable error and csum counters increase while the scrub runs through the first ~1 TB of the 4 TB of data; there they reach 2942 and stop.


    Memtest86: passed 1½ times

    ECC memory: no

    System: OMV 5, stable, the system doesn't hang

    mdadm RAID 6

    OMV 5 is up to date.

    File system: BTRFS

    No UDMA_CRC_Error

    Some of the 11 disks have a few bad sectors, but the counts are not increasing

    Scrub has run without errors many times in the past.


    Then scrub showed:

    error details: csum=2942

    corrected errors: 0, uncorrectable errors: 2942, unverified errors: 0


    The log shows errors like:

    Feb 1 10:35:33 RAID-6-2-TB kernel: [ 621.029918] BTRFS warning (device md0): checksum error at logical 178256846848 on dev /dev/md0, physical 179338977280, root 5, inode 13737, offset 532652032, length 4096, links 1 (path: one/TV/name.ts)
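The numbers in such a log line can be mapped back to a file with btrfs inspect-internal. A small sketch that pulls the logical address and inode out of the message; the mount point /srv/raid is an assumption, substitute the real one:

```shell
# The example log line from above:
line='BTRFS warning (device md0): checksum error at logical 178256846848 on dev /dev/md0, physical 179338977280, root 5, inode 13737, offset 532652032, length 4096, links 1 (path: one/TV/name.ts)'

# Extract the logical byte address and the inode number.
logical=$(echo "$line" | sed -n 's/.*logical \([0-9]*\) .*/\1/p')
inode=$(echo "$line" | sed -n 's/.*inode \([0-9]*\),.*/\1/p')
echo "logical=$logical inode=$inode"

# On the mounted filesystem these resolve to the affected path(s)
# (mount point is assumed, adjust to the real one):
#   btrfs inspect-internal inode-resolve   13737        /srv/raid
#   btrfs inspect-internal logical-resolve 178256846848 /srv/raid
```

Collecting all inodes from dmesg this way gives a complete list of affected files, which is handy when overwriting them from backup.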


    When I tried to copy this specific file from md0 on Feb 1st, it failed. I tried many times without success.


    But today, Feb 6th and 7th, the file can be copied and looks OK…



    Is the file being repaired, or did the error go away by itself?


    Another file could be copied last night, but today that file can't be copied.


    More detail: When I shut down OMV and disconnect one disk (remove its SATA cable),

    mdadm --assemble --scan can't assemble the RAID array:

    mdadm --assemble --scan returns no message… no warning, no error, nothing.

    The RAID is not shown – it looks as if it is gone.

    When I then shut down, reconnect the same SATA cable to the same disk and power up the system, the RAID 6 is back.
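When --assemble --scan stays silent, running it verbosely and examining each member's superblock usually shows why assembly stops; a degraded-at-boot array often needs an explicit --run. Device names below are placeholders:

```shell
# Show what mdadm actually finds and why it refuses to assemble:
mdadm --assemble --scan --verbose

# Inspect the RAID superblock and event counter on each member disk:
mdadm --examine /dev/sd[a-l]

# A clean-but-degraded array can be started explicitly:
mdadm --assemble --run /dev/md0 /dev/sd[a-l]
```

By default mdadm will not auto-start an array with a missing member unless it was already recorded as degraded, which may explain the silent non-assembly after a cold boot with a disk unplugged.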


    But when the system is running and I hot-disconnect the SATA cable from the same RAID 6 disk, the system detects the missing disk and reports:


    clean, degraded


    Is it a configuration problem?


    What if a disk dies while the system is turned off – will it then be possible to assemble the RAID array?


    Q2: Why is the RAID 6 array gone and not possible to assemble?


    A scrub of the clean, degraded RAID 6 with the missing disk returns the same csum=2942 errors.


    That is weird to me, because I get errors pointing in more than one direction.


    Q3: Why uncorrectable errors, and only csum errors?

    I expected RAID 6 + BTRFS to fix the errors.
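One thing worth checking here: on top of a single md device, BTRFS normally stores only one copy of the data (profile "single"), so its checksums can detect corruption but there is no second copy for scrub to repair from. The RAID 6 redundancy lives below BTRFS in md, which has no checksums of its own. The data profile can be confirmed with (mount point /srv/raid is an assumption):

```shell
# Show the block-group profiles; "Data, single" means one copy only,
# which would explain why scrub can only count errors as uncorrectable:
btrfs filesystem df /srv/raid
```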


    root@RAID-6-2-TB:~# btrfsck --force /dev/md0
    Opening filesystem to check...
    WARNING: filesystem mounted, continuing because of --force
    Checking filesystem on /dev/md0
    UUID: 05bbc4f1-d4d8-4af2-863f-13eb864736b1
    [1/7] checking root items
    [2/7] checking extents
    [3/7] checking free space cache
    [4/7] checking fs roots
    [5/7] checking only csums items (without verifying data)
    [6/7] checking root refs
    [7/7] checking quota groups skipped (not enabled on this FS)
    found 4392425177088 bytes used, no error found
    total csum bytes: 4283128720
    total tree bytes: 5427363840
    total fs tree bytes: 16924672
    total extent tree bytes: 31080448
    btree space waste bytes: 1041717880
    file data blocks allocated: 4386997813248
    referenced 4386997800960


    Q4: Is it a local issue or an mdadm/BTRFS bug?

    Q5: Why are csum=2942 and uncorrectable errors: 2942 equal?


    As a troubleshooting experiment, 2 of the 11 SATA cables were replaced, without any change.


    A third situation that is confusing:

    A file that was copied during the buggy situation looks OK. But when I try now, I can see in Windows 10 that the copy progress bar is displayed; when the bar reaches 30%, the rest jumps quickly to the end as if the copy is finished, but THE FILE IS NOT IN THE TARGET FOLDER! No error message…!



    The checksum errors always have root 5. What does root 5 mean? Is it good or bad that it is always root 5?

    The inode is shown many times, with different numbers from ~13385 to ~14943.

    The offset is shown many times, with different numbers.
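For what it's worth, tree/root ID 5 is simply the top-level subvolume (FS_TREE), i.e. ordinary files in the default subvolume – neither good nor bad, just expected when no extra subvolumes are in use. This can be checked with (mount point /srv/raid is an assumption):

```shell
# Print the subvolume ID of the mounted path; the top level is 5:
btrfs inspect-internal rootid /srv/raid

# List any other subvolumes (none expected here):
btrfs subvolume list /srv/raid
```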


    openmediavault Version 5.6.24-1 (Usul)

    Kernel Linux 5.10.0-0.bpo.9-amd64

    mdadm - v4.1 – 2018-10-01

    SMP Debian 5.10.70-1~bpo10+1 (2021-10-10) x86_64


    root@RAID-6-2-TB:~# cat /proc/mdstat
    Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
    md0 : active raid6 sdk[9] sdl[10] sda[14] sdd[2] sdj[6] sdb[13] sdh[4] sdi[5] sdg[11] sdf[8] sde[3]
    17580432384 blocks super 1.2 level 6, 512k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
    bitmap: 0/15 pages [0KB], 65536KB chunk

    unused devices: <none>
