• Is it usual for IO Wait to be about 90%+ of CPU usage?


    My system is a twin quad-core 1.86 GHz Xeon server board with 16 GB of RAM. I boot from an old hard drive attached via USB. I then have five 1 TB drives of varying ages and makes attached to SATA, four of which form a RAID 5 array, the "big disk".


    On that disk I have two NFS shares and a couple of SMB/CIFS shares.


    The NFS shares are for PXE-booting six Ubuntu machines. Each of those machines runs six VirtualBox instances of XP (each VM is effectively a 10 GB file as far as the NFS server is concerned). All six machines share the same Ubuntu system, which they boot over the network.


    The 36 XP machines all mount the Samba shares; they read from one share, process the data, and write back to the other share.



    The bulk of the data is in the Samba shares, but the bulk of the access time appears to be in the NFS shares. That could easily be because the VirtualBox files sit on the NFS shares, though.


    If I reboot 4 or 5 of the XP machines at once, I can easily push the load average above 9 (with effectively 8 CPUs).


    Is this what I should be expecting because I'm pushing my hardware too hard, or are there improvements I can make to reduce the IO wait and hence drastically reduce my CPU load?





    • Official Post

    I would guess your RAID 5 is the bottleneck. I maintain a PowerEdge 2900 with an 8-drive RAID 5 (PERC 5) and two Xeon 5320s (BSEL mod) for a school. It has two Win2k8 Server VMs running on it, 10 thin clients attached to one of those VMs, and 35 systems using roaming profiles connected to the other VM. No IO wait problems on this system. You could try increasing your stripe cache size as well (read).
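    For an mdadm software RAID the stripe cache is exposed through sysfs. A minimal sketch, assuming the array is /dev/md0 (adjust the device name and size to taste):

        # check the current stripe cache size
        cat /sys/block/md0/md/stripe_cache_size

        # raise it, e.g. to 8192 entries; memory used is roughly
        # this value x 4 KiB x number of member disks
        echo 8192 > /sys/block/md0/md/stripe_cache_size

    The setting does not survive a reboot, so re-apply it from rc.local or similar if it turns out to help.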


  • Zitat von "ryecoaaron"

    I would guess your RAID 5 is the bottleneck. I maintain a PowerEdge 2900 with an 8-drive RAID 5 (PERC 5) and two Xeon 5320s (BSEL mod) for a school. It has two Win2k8 Server VMs running on it, 10 thin clients attached to one of those VMs, and 35 systems using roaming profiles connected to the other VM. No IO wait problems on this system.


    Are you suggesting that the extra drives increase the speed? You have 8 drives compared to my 4. I do have a couple of extra drives I could put in the system if it would increase performance; I didn't bother because I built the system to the size I wanted. Or do you think the hardware RAID is the key difference?


    Zitat von "ryecoaaron"

    You could try increasing your stripe cache size as well (read).


    Just tried increasing the stripe_cache_size (for the writes) and also increased blockdev --setra, max_sectors_kb, and read_ahead_kb for the reads, with little if any difference.
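    For reference, this is roughly what I mean (device names are just examples; adjust for your array and its member disks):

        # read-ahead for the array, in 512-byte sectors
        blockdev --setra 16384 /dev/md0

        # per-device request size and read-ahead (repeat for each member disk)
        echo 1024 > /sys/block/sda/queue/max_sectors_kb
        echo 4096 > /sys/block/sda/queue/read_ahead_kb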


    As you are also using VirtualBox, in your opinion would it be better to have one 10 GB master VM with 36 linked clones that are actually run, or to do 36 full clones of the master?
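    (For anyone following along, this is the kind of thing I mean; "MasterXP" and the clone names are just placeholders:

        # full clone: a completely independent copy of the master's disk
        VBoxManage clonevm "MasterXP" --name "XP-full-01" --register

        # linked clone: made from a snapshot, stores only the differences
        VBoxManage snapshot "MasterXP" take base
        VBoxManage clonevm "MasterXP" --snapshot base --options link --name "XP-linked-01" --register

    A linked clone saves disk space but every clone keeps reading the master's .vdi, while full clones spread the reads across separate files.)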

    • Official Post
    Quote from "drumltd"

    Are you suggesting that the extra drives increase the speed? You have 8 drives compared to my 4. I do have a couple of extra drives I could put in the system if it would increase performance; I didn't bother because I built the system to the size I wanted. Or do you think the hardware RAID is the key difference?


    Extra drives should help but may increase processor load. Hardware RAID definitely takes load off the processors. Maybe you could make two RAIDs and split the load? Not sure if that would help or hurt the processor load.


    Zitat von "drumltd"

    As you are also using VirtualBox, in your opinion would it be better to have one 10 GB master VM with 36 linked clones that are actually run, or to do 36 full clones of the master?


    I've never used linked VMs. I would guess the full clones would be faster, but it might depend on what the VMs are being used for.


  • I'll look into a RAID card for the box, but as that means a complete rebuild of the RAID, I'll probably try to hold off until I've got the time to build it with different disks and copy the data across.


    I've started re-cloning the VMs as full clones to see if that improves things. It'll take about 3 hours to get them all cloned, then I'll start them running and leave them overnight to settle down and see where they settle.


    In the meantime thanks for the help, much appreciated.


    I suspect the long-term answer is to have a large RAID like we have, then have an SSD behave like a cache. Not sure how easy that would be to do. I know several Linux projects have gone in that direction; I'm not sure how many have made it to a stable state, never mind inclusion in any distros.
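    One of the projects in that space is bcache, which layers an SSD cache in front of a slower block device. A rough sketch of the idea, assuming the RAID is /dev/md0 and the SSD is /dev/sdf (names are made up, and this wipes both devices):

        # format the RAID as the backing device and the SSD as the cache
        make-bcache -B /dev/md0 -C /dev/sdf

        # the combined device then appears as /dev/bcache0
        mkfs.ext4 /dev/bcache0
        mount /dev/bcache0 /srv/bigdisk

    Whether bcache (or dm-cache/flashcache) is stable enough for this workload is exactly the open question, so treat this as an outline rather than a recipe.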

    • Official Post

    You should look into the BSEL mod for your Xeons. I've done it to four different servers that have been running for a few years: no noticeable temperature increase while gaining more processor speed and a faster FSB. It's easy to do as well. I assume you have two Xeon 5320s; you would go from 1.86 GHz to 2.33 GHz and from a 1066 MHz FSB to 1333 MHz. It makes a difference.


  • I'll look into the mod, when I've next got the unit apart. Thanks for the pointer.


    I've had a little bit of success: by increasing dirty_ratio and dirty_background_ratio to 80 I'm still seeing mostly IO wait in the CPU chart, but the overall load is reduced a bit.
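    For the record, these are the knobs I mean (settable on the fly, or via /etc/sysctl.conf to persist across reboots):

        sysctl -w vm.dirty_ratio=80
        sysctl -w vm.dirty_background_ratio=80

    This just lets the kernel hold more dirty pages in RAM before forcing writeback, so it smooths out bursts rather than adding real throughput, and a crash loses more unwritten data.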

  • Also, depending on the disks you have, your total IOPS will be between 240 and 450. In other words, only about 10 IOPS per machine. This is very low and results in the wait cycles, as the system waits for free IOPS capacity on your storage backend. Even RAID controllers and other caches will not help if your requirements are constantly over the physical limits. There are only two ways to increase that limit (a rough sketch of the arithmetic follows the list):


    Put in more physical disks and a bit of logic (multiple RAID arrays etc.).
    Put in SSDs. They will deliver about 35k IOPS each. Of course somewhat expensive.
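    (Rough arithmetic behind those numbers, assuming ordinary 7200 rpm SATA disks at very roughly 60-110 random IOPS each: 4 spindles x 60 = 240 IOPS on the low end and 4 x 110 ≈ 440 on the high end, which is where the 240-450 range comes from. Spread over 36 XP VMs that is only about 7-12 IOPS per machine.)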


  • Zitat von "SerErris"

    Also, depending on the disks you have, your total IOPS will be between 240 and 450. In other words, only about 10 IOPS per machine. This is very low and results in the wait cycles, as the system waits for free IOPS capacity on your storage backend. Even RAID controllers and other caches will not help if your requirements are constantly over the physical limits. There are only two ways to increase that limit:


    Put in more physical disks and a bit of logic (multiple RAID arrays etc.).
    Put in SSDs. They will deliver about 35k IOPS each. Of course somewhat expensive.


    Actually that's sort of in line with a thought I had earlier. I think the problem I am seeing is related to random-access IOPS (due to the high number of clients). If I shut down the clients and run a straight copy over the network I get 280 Mbit/s inbound and 360 Mbit/s outbound, which I figure isn't too bad. However, the demand from the cumulative clients is nowhere near this, yet iowait is about 20 times what it is for a straight copy.
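    (A handy way to see this, if anyone wants to reproduce it: iostat from the sysstat package, run while the clients are busy, e.g.

        iostat -x 5 /dev/md0 /dev/sd?

    Under the random client load the average request size drops and r/s + w/s climb, while %util sits near 100% even though the MB/s throughput stays low.)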


    To hold the VMs, the OS, and the source and target data takes about 2 TB; like you say, putting this on SSD would not be cheap.


    However, suppose I got lots of smaller drives and striped them into 3 arrays (straight RAID 0 for speed): VMs and OS, source, and target.


    Then keep the existing RAID 5 for security and rsync/mirror the data onto the RAID 5 periodically.
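    (Something along these lines, with purely made-up device names and paths, is what I have in mind:

        # one of the three stripes, e.g. the "source" array from two spare disks
        mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdg /dev/sdh
        mkfs.ext4 /dev/md1

        # nightly safety copy back onto the existing RAID 5
        rsync -a --delete /mnt/source/ /mnt/raid5/source-backup/

    RAID 0 obviously gives no redundancy, hence the periodic rsync back to the RAID 5.)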

  • You could also consider:


    Buy some modern high-speed drives with normal capacity (2 TB each). But this will still limit your IOPS.


    Let's see what we have:


    Assume we have 10 drives arranged as two 4+1 RAID 5 arrays, i.e. 8 active data drives = 8 x 100 IOPS = 800 IOPS. This should make your RAID faster, but it will only be of limited help. On the other hand, you would have 16 TB of storage to play with.


    Hmm ... not sure if this is the best solution.


    Maybe you want to split the OS off and also split source and target. Source and target will read and write from the same disks at the same time, which results in great amounts of head movement. Maybe a mix is a good thing:


    1. 2x 500 GB SSD in RAID 1 for the OS (if that is enough space).
    2. 4x 2 TB data disks for the source data as RAID 5
    3. 4x 2 TB data disks for the target data as RAID 5


    Put all of those in LVM and cut volumes from it. Make sure that you separate the source and target disks and RAIDs, so that each disk either reads or writes and therefore operates much better. You will end up with more or less sequential IO again.
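    A minimal sketch of that layout, with hypothetical device names for the two data arrays:

        # one volume group per workload so source and target never share spindles
        pvcreate /dev/md2 /dev/md3
        vgcreate vg_source /dev/md2
        vgcreate vg_target /dev/md3
        lvcreate -L 3T -n lv_source vg_source
        lvcreate -L 3T -n lv_target vg_target

    Keeping source and target in separate volume groups is one way to enforce the "one array reads, the other writes" split; putting both arrays in a single VG would let LVM mix them.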


    For the OS use the RAID 0 or 1 SSDs, so that you have plenty of headroom for any random IO from the OS.


    Just one idea; others might work too. In the end, distributing the different workloads over physical disks will help you the most in your scenario, as you do not have lots and lots of spindles to run.


  • Well, I've not been on this for a while as I had some real work to do, but I'm back now.


    I left the system running with a 24-disk RAID (see my fibre channel thread) and I can confirm that, as a quick and dirty test, just throwing spindles at it hasn't greatly improved things. So next I think I'll split it down roughly as you suggest, with source and target as separate RAIDs and the OS/VMs on their own RAID. Should be a fun exercise. :o

  • A little bit of experimentation has shown that the problem files are the VirtualBox VMs. These are a series of 10 GB files that are now on the 24-drive RAID, the only thing on that RAID, and they are still causing large IO waits.


    If there were a way to make the VMs use native files rather than one large .vdi file, I bet it would solve the problem.
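    One direction that might get close to that is VirtualBox's raw-disk access, where a VM points at a real block device or partition instead of a .vdi. A sketch, with made-up names, for anyone curious:

        # create a small .vmdk descriptor that just points at a real device
        VBoxManage internalcommands createrawvmdk \
            -filename /srv/vms/xp01-raw.vmdk -rawdisk /dev/vg_vms/xp01

    Each VM would then get its own LVM logical volume rather than a 10 GB file sitting on NFS, which at least takes the host filesystem out of the picture. Whether that actually reduces the IO wait here is untested.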
