common/rc: fix test init performance regression on XFS

I've been having problems over the past month with tests randomly
running much slower than they should. The overall effect has been
that running an auto group iterate has blown out from about 3
hours to almost 5 hours.

The problem has been that the root disks have been thrashing.
Thousands of iops and the backing store getting IO bound so all test
vms with root drive images hosted on that backing store then also
go slow.

I finally got to the bottom of it - it looked for all the world like
a qemu image format write amplification problem, because I couldn't
capture anything unusual from inside the guest. I converted all
the root images to raw format on fast SSD storage, and the problem
didn't go away - while some tests were faster, other tests were
still randomly slow.

Finally I tracked it down with blktrace/blkparse. All the IO was
coming from processes with the command line like:

 /sbin/mkfs.xfs -d file name=/tmp/791409.img size=512m

A quick grep then pointed me at common/rc::init_rc() where there is
a test for XFS specific on-disk format behaviour that uses that
exact command line.

init_rc() is run from _begin_fstest() so it is executed on every
test start. We run mkfs to create a 512MB filesystem in $tmp on
every test start. That's ..... not smart. And we do it so we can
change the setup of $XFS_COPY_PROG, which is only used in a handful
of tests. That's also .... not smart.

And then the penny dropped: we recently increased the default log
size of small filesystems, so this is now zeroing a 64MB log instead
of a 5-10MB log. And, of course it's on the root drive because of
the $tmp image prefix, which in this case is ext3, and it's
fragmenting the crap out of the writes and so it increases runtime
of the mkfs.xfs execution from nothing to 5-10s.

Then when one of these mkfs invocations co-incides with journal
flushing driven writeback, everything just stops until the writeback
stops thrashing. And so all tests are 5-10s slows, and random tests
take anywhere between 20-100s longer to start up....

Fix it by moving the setup of XFS_COPY_PROG to _require_xfs_copy()
so only the tests that use xfs_copy run this code.

Numbers don't lie:

generic/001 11s ...  6s
generic/002 9s ...  1s
generic/003 17s ...  10s
generic/004 12s ...  1s
generic/005 9s ...  1s
generic/006 9s ...  2s
generic/007 11s ...  3s
generic/008 5s ...  2s
generic/009 10s ...  2s
generic/010 5s ...  0s
generic/011 9s ...  1s
generic/012 7s ...  2s
generic/013 10s ...  5s
generic/014 11s ...  1s
generic/015 7s ...  1s
generic/016 7s ...  2s
.....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Zorro Lang <zlang@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Zorro Lang <zlang@kernel.org>
2 files changed