[PROBLEM] Reproducible storage errors under high I/O load

June 06th, 2011 - 04:10 am ET by Lars Täuber
Hello!

This message was originally sent to linux-scsi.
I got no reply, so I assume that was the wrong mailing list.
Please tell me if I should send more specific information about anything.
I have been struggling with this problem since January. It prevents me from running a backup server in production.

Thank you.
Lars



Hi there,

I have a problem with a software RAID6 array. It is reproducible even after replacing all of the hardware.
I started with SUSE 11.2. The problem occurred while writing large amounts of data to the array (high I/O load).
I hope this is the right mailing list for my problem. If not, please excuse me and point me to the right one.


Then I replaced the PSU. Still errors under high load.
Then I replaced the SATA controller (Sil 3114, driver: sata_sil) with one using a different chipset (driver: sata_mv). Still errors under high load.
Then I replaced the disk enclosure and all cables. Still errors.
Then I replaced the mainboard (Tyan Opteron) with a Supermicro H8SCM-F with a 6-core Opteron. Still errors.
Then I switched to Ubuntu (10.04, then 10.10). Still errors.
Then I tried different I/O schedulers (noop, anticipatory, cfq, deadline). Still errors.
Then I tried the kernel options noapic and acpi=off, without luck.
Then I replaced the SATA controller with an Areca SAS controller (driver: mvsas). Still errors.
Then I tried different HDDs (originally Western Digital WDC WD2002FYPS and WDC WD2003FYYS; new: Seagate ST3320620NS). Still errors.
Then I tried different Ubuntu kernel versions, without luck:
2.6.32-22-server
2.6.35-25-server

Then I tried self-compiled kernels, without luck:
2.6.35.13
2.6.38.6
2.6.39: the same problem occurs, but later
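
For the scheduler test mentioned above, the scheduler can be switched per disk at run time through sysfs, without a reboot. A minimal sketch, assuming a POSIX shell; `SYSFS_ROOT`, `current_scheduler` and `set_scheduler` are hypothetical names introduced only for illustration, while the `queue/scheduler` attribute is the standard kernel interface:

```shell
#!/bin/sh
# Sketch: show and switch the I/O scheduler of one disk through sysfs.
# SYSFS_ROOT, current_scheduler and set_scheduler are hypothetical names;
# SYSFS_ROOT is overridable so the helpers can be tried on a fake tree.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# Print the active scheduler, which the kernel marks with [brackets],
# e.g. "noop anticipatory deadline [cfq]" -> "cfq".
current_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$SYSFS_ROOT/block/$1/queue/scheduler"
}

# Select a scheduler for one disk (needs root on a real system).
set_scheduler() {
    echo "$2" > "$SYSFS_ROOT/block/$1/queue/scheduler"
}

# Example: switch every member disk before a test run.
# for d in sdc sdd sde sdf sdg sdh; do set_scheduler "$d" deadline; done
```

Checking `current_scheduler` on each member disk before a run confirms that the setting actually took effect for that test.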

The current configuration:
- only 64-bit kernels tested
- Supermicro H8SCM-F (AMD SR5650+SP5100) with 6-core Opteron
- Areca (non-RAID) ARC-1300ix-16 SAS controller
- software RAID6 over 8 Western Digital HDDs (some WDC WD2002FYPS + some WDC WD2003FYYS)
- redundant PSU

How to reproduce my problem:
mdadm -C /dev/md3 -l6 -n8 /dev/sd[c-h] missing missing
(the two missing HDDs keep the array from starting an initial sync)

Everything is fine up to this point.
Now produce high I/O load:
mke2fs -j /dev/md3
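
The reproduction steps above can be wrapped so that only the kernel messages logged during the run are saved, which makes the errors easier to attach to a report. A minimal sketch, assuming a POSIX shell and a readable `dmesg`; `run_and_capture`, `KLOG_CMD` and the output path are hypothetical names, and the destructive commands are left commented out:

```shell
#!/bin/sh
# Sketch: run a command and save only the kernel messages logged while it ran.
# run_and_capture is a hypothetical helper, not part of mdadm or e2fsprogs.
# KLOG_CMD defaults to dmesg; it is overridable so the helper can be tried
# without touching real hardware.
KLOG_CMD=${KLOG_CMD:-dmesg}

run_and_capture() {
    out=$1; shift
    before=$($KLOG_CMD | wc -l)      # lines already in the kernel log
    "$@"                             # the load-generating command
    status=$?
    # Keep only lines appended during the run (assumes the kernel ring
    # buffer did not wrap in the meantime).
    $KLOG_CMD | tail -n +"$((before + 1))" > "$out"
    return "$status"
}

# Usage against the array from this post (destroys data on the member disks):
# mdadm -C /dev/md3 -l6 -n8 /dev/sd[c-h] missing missing
# run_and_capture /tmp/md3-mke2fs.dmesg mke2fs -j /dev/md3
```

On a long run the ring buffer may wrap and drop early messages; in that case following the persistent kernel log file of the distribution is the safer source.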

The detailed history (search for "Lars" to find my posts):
https://bugs.launchpad.net/ubuntu/+bug/550559

The error messages changed a bit between kernel versions.
The nearly complete dmesg output:
https://launchpadlibrarian.net/7232....dmesg.out

Am I doing something wrong? Could someone help me debug this?
Thanks
Lars
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Replies

#1 Gene Heskett
June 06th, 2011 - 06:00 am ET

Looking at your dmesg, I get the impression you have a bunch of disks that
are in need of a firmware update. Unfortunately, the dmesg snippet does not
include the drive discovery and identification data.

However, I would back that data up to another medium before doing that, as I
had the Seagate firmware update scramble the blkids and partition names of
one of the two 1 TB drives I have. Neither drive errors now, but the read/write
speeds of the second, identical drive are about one third of the first's.

Firmware updates come as a bootable CD .iso, and you can
download the CD image from the maker's site.
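
The drive identification data missing from the dmesg excerpt can also be read from sysfs on a running system. A minimal sketch, assuming a POSIX shell and disks that show up as `sd*`; `SYSFS_ROOT` and `list_disk_firmware` are hypothetical names, while `device/model` and `device/rev` are attributes exported by the kernel's SCSI layer:

```shell
#!/bin/sh
# Sketch: print name, model and firmware revision for every sd* disk.
# SYSFS_ROOT and list_disk_firmware are hypothetical names; SYSFS_ROOT is
# overridable so the helper can be tried on a fake directory tree.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

list_disk_firmware() {
    for dev in "$SYSFS_ROOT"/block/sd*; do
        # Skip entries without readable attributes (also covers "no match").
        [ -r "$dev/device/model" ] && [ -r "$dev/device/rev" ] || continue
        # read trims the padding the kernel adds around these strings.
        read -r model < "$dev/device/model"
        read -r rev < "$dev/device/rev"
        printf '%s\t%s\t%s\n' "${dev##*/}" "$model" "$rev"
    done
}

# Usage: list_disk_firmware
```

Running `list_disk_firmware` before and after an update shows whether the revision actually changed; `smartctl -i` from smartmontools reports the same fields per device.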

Cheers, gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Eisenhower!! Your mimeograph machine upsets my stomach!!