[Lse-tech] Re: ext3 performance bottleneck as the number of spindles gets large

mgross wrote:
>
> We've been doing some throughput comparisons and benchmarks of block I/O
> throughput for 8KB writes as the number of SCSI addapters and drives per
> adapter is increased.
>
> The Linux platform is a dual processor 1.2GHz PIII, 2Gig or RAM, 2U box.
> Similar results have been seen with both 2.4.16 and 2.4.18 base kernel, as
> well as one of those patched up O(1) 2.4.18 kernels out there.
umm. Are you not using block-highmem? That is a must-have.
http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre9aa2/00_block-highmem-all-18b-12.gz
-

mgross wrote:
>
> ...
> Has anyone done any work looking into the I/O scaling of Linux / ext3 per
> spindle or per adapter? We would like to compare notes.
No. ext3 scalability is very poor, I'm afraid. The fs really wasn't
up and running until kernel 2.4.5 and we just didn't have time to
address that issue.
> I've only just started to look at the ext3 code but it seems to me that replacing the
> BKL with a per - ext3 file system lock could remove some of the contention thats
> getting measured. What data are the BKL protecting in these ext3 functions? Could a
> lock per FS approach work?
The vague plan there is to replace lock_kernel with lock_journal
where appropriate. But ext3 scalability work of this nature
will be targetted at the 2.5 kernel, most probably.
I'll take a look, see if there's any low-hanging fruit in there,
but I doubt that the results will be fantastic.
-

Andrew Morton wrote:
> mgross wrote:
>>Has anyone done any work looking into the I/O scaling of Linux / ext3 per
>>spindle or per adapter? We would like to compare notes.
>
> No. ext3 scalability is very poor, I'm afraid. The fs really wasn't
> up and running until kernel 2.4.5 and we just didn't have time to
> address that issue.
Ick. That takes the prize for the highest BKL contention I've ever
seen, except for some horribly contrived torture tests of mine. I've
had data like this sent to me a few times to analyze and the only
thing I've been able to suggest up to this point is not to use ext3.
>>I've only just started to look at the ext3 code but it seems to me that replacing the
>>BKL with a per - ext3 file system lock could remove some of the contention thats
>>getting measured. What data are the BKL protecting in these ext3 functions? Could a
>>lock per FS approach work?
>
> The vague plan there is to replace lock_kernel with lock_journal
> where appropriate. But ext3 scalability work of this nature
> will be targetted at the 2.5 kernel, most probably.
I really doubt that dropping in lock_journal will help this case very
much. Every single kernel_flag entry in the lockmeter output where
Util > 0.00% is caused by ext3. The schedule entry is probably caused
by something in ext3 grabbing BKL, getting scheduled out for some
reason, then having it implicitly released in schedule(). The
schedule() contention comes from the reacquire_kernel_lock().
We used to see plenty of ext2 BKL contention, but Al Viro did a good
job fixing that early in 2.5 using a per-inode rwlock. I think that
this is the required level of lock granularity, another global lock
just won't cut it.
http://lse.sourceforge.net/lockhier/bkl_rollup.html#getblock
--
Dave Hansen
haveblue@...

On Jun 19, 2002 21:09 -0700, Dave Hansen wrote:
> Andrew Morton wrote:
> >The vague plan there is to replace lock_kernel with lock_journal
> >where appropriate. But ext3 scalability work of this nature
> >will be targetted at the 2.5 kernel, most probably.
>
> I really doubt that dropping in lock_journal will help this case very
> much. Every single kernel_flag entry in the lockmeter output where
> Util > 0.00% is caused by ext3. The schedule entry is probably caused
> by something in ext3 grabbing BKL, getting scheduled out for some
> reason, then having it implicitly released in schedule(). The
> schedule() contention comes from the reacquire_kernel_lock().
>
> We used to see plenty of ext2 BKL contention, but Al Viro did a good
> job fixing that early in 2.5 using a per-inode rwlock. I think that
> this is the required level of lock granularity, another global lock
> just won't cut it.
> http://lse.sourceforge.net/lockhier/bkl_rollup.html#getblock
There are a variety of different efforts that could be made towards
removing the BKL from ext2 and ext3. The first, of course, would be
to have a per-filesystem lock instead of taking the BKL (I don't know
if Al has changed lock_super() in 2.5 to be a real semaphore or not).
As Andrew mentioned, there would also need to be be a per-journal lock to
ensure coherency of the journal data. Currently the per-filesystem and
per-journal lock would be equivalent, but when a single journal device
can be shared among multiple filesystems they would be different locks.
I will leave it up to Andrew and Stephen to discuss locking scalability
within the journal layer.
Within the filesystem there can be a large number of increasingly fine
locks added - a superblock-only lock with per-group locks, or even
per-bitmap and per-inode-table(-block) locks if needed. This would
allow multi- threaded inode and block allocations, but a sane lock
ranking strategy would have to be developed. The bitmap locks would
only need to be 2-state locks, because you only look at the bitmaps
when you want to modify them. The inode table locks would be read/write
locks.
If there is a try-writelock mechanism for the individual inode table
blocks you can avoid write lock contention for creations by simply
finding the first un-write-locked block in the target group's inode table
(usually in the hundreds of blocks per group for default parameters).
For inode allocation you don't really care which inode you get, as long
as you get one in the preferred group (even that isn't critical for
directory creation). For inode deletions you will get essentially random
block locking, which is actually improved by the find-first-unlocked
allocation policy (at the expense of dirtying more inode table blocks).
Contention for the superblock lock for updates to the superblock free
block and free inode counts could be mitigated by keeping "per-group
delta buckets" in memory, that are written into the superblock only
once every few seconds or at statfs time instead of needing multiple
locks for each block/inode alloc/free. The groups already keep their
own summary counts for free blocks and inodes. The coherency of these
fields with the superblock on recovery would be handled at journal
recovery time (either in the kernel or e2fsck*). Other than these two
fields there are few write updates to the superblock (on ext3 there
is also the orphan list, modified at truncate and when an open file is
unlinked and when such a file is closed).
I have even been thinking about multi-threaded directory-entry creation
in a single directory. One nice thing about ext2/ext3 directory blocks
is that each one is self-contained and can be modified independently.
For regular ext2/ext3 directories you would only be able to do
multi-threaded deletes by having a lock for each directory block.
For creations you would need to lock the entire directory to ensure
exclusive access for a create, which is the same single-threaded behaviour
for a single directory we have today with the directory i_sem.
However, if you are using the htree indexed directory layout (which you
will be, if you care about scalable filesystem performance) then there
is only a single[**] block into which a given filename can be added, so
you can have per-block locks even for file creation. As the number of
directory entries grows (and hence more directory blocks) the locking
becomes increasingly more fine-grained so you get better scalability
with larger directories, which is what you want.
Cheers, Andreas
[*] If we think that we will go to any kind of per-group locking in the
near future, the support for this could be added into e2fsck and
existing kernels today with read support for a COMPAT flag to
ensure maximal forwards compatibility. On e2fsck runs we already
validate the superblock on each boot, and the group descriptor table
is contiguous with the superblock, so the amount of extra checking
at boot time would be very minimal.
The kernel already has ext[23]_count_free_{blocks,inodes} functions
that just need a bit of tweaking to check only the descriptor
summaries unless mounted with debug and check options, and to update
the superblock counts at mount time if the COMPAT flag is set.
[**] In rare circumstances you may have a large number of hash collisions
for a single hash value which fill more than one block, so an entry
with that hash value could live in more than a single block. This
would need to be handled somehow (e.g. always getting the locks on
all such blocks in order at create time; you only need a single
block lock at delete time).
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/http://sourceforge.net/projects/ext2resize/

Andreas Dilger wrote:
>
> On Jun 19, 2002 21:09 -0700, Dave Hansen wrote:
> > Andrew Morton wrote:
> > >The vague plan there is to replace lock_kernel with lock_journal
> > >where appropriate. But ext3 scalability work of this nature
> > >will be targetted at the 2.5 kernel, most probably.
> >
> > I really doubt that dropping in lock_journal will help this case very
> > much. Every single kernel_flag entry in the lockmeter output where
> > Util > 0.00% is caused by ext3. The schedule entry is probably caused
> > by something in ext3 grabbing BKL, getting scheduled out for some
> > reason, then having it implicitly released in schedule(). The
> > schedule() contention comes from the reacquire_kernel_lock().
> >
> > We used to see plenty of ext2 BKL contention, but Al Viro did a good
> > job fixing that early in 2.5 using a per-inode rwlock. I think that
> > this is the required level of lock granularity, another global lock
> > just won't cut it.
> > http://lse.sourceforge.net/lockhier/bkl_rollup.html#getblock
>
> There are a variety of different efforts that could be made towards
> removing the BKL from ext2 and ext3. The first, of course, would be
> to have a per-filesystem lock instead of taking the BKL (I don't know
> if Al has changed lock_super() in 2.5 to be a real semaphore or not).
lock_super() has been `down()' for a long time. In 2.4, too.
> As Andrew mentioned, there would also need to be be a per-journal lock to
> ensure coherency of the journal data. Currently the per-filesystem and
> per-journal lock would be equivalent, but when a single journal device
> can be shared among multiple filesystems they would be different locks.
Well. First I want to know if block-highmem is in there. If not,
then yep, we'll spend ages spinning on the BKL. Because ext3 _is_
BKL-happy, and if a CPU takes a disk interrupt while holding the BKL
and then sits there in interrupt context copying tons of cache-cold
memory around, guess what the other CPUs will be doing?
> I will leave it up to Andrew and Stephen to discuss locking scalability
> within the journal layer.
ext3 is about 700x as complex as ext2. It will need to be done with
some care.
> Within the filesystem there can be a large number of increasingly fine
> locks added - a superblock-only lock with per-group locks, or even
> per-bitmap and per-inode-table(-block) locks if needed. This would
> allow multi- threaded inode and block allocations, but a sane lock
> ranking strategy would have to be developed. The bitmap locks would
> only need to be 2-state locks, because you only look at the bitmaps
> when you want to modify them. The inode table locks would be read/write
> locks.
The next steps for ext2 are: stare at Anton's next set of graphs and
then, I expect, removal of the fs-private bitmap LRUs, per-cpu buffer
LRUs to avoid blockdev mapping lock contention, per-blockgroup locks
and removal of lock_super from the block allocator.
But there's no point in doing that while zone->lock and pagemap_lru_lock
are top of the list. Fixes for both of those are in progress.
ext2 is bog-simple. It will scale up the wazoo in 2.6.
> If there is a try-writelock mechanism for the individual inode table
> blocks you can avoid write lock contention for creations by simply
> finding the first un-write-locked block in the target group's inode table
> (usually in the hundreds of blocks per group for default parameters).
Depends on what the profile say, Andreas. And I mean profiles - lockmeter
tends to tell you "what", not "why". Start at the top of the list. Fix
them by design if possible. If not, tweak it!
-

Hi,
On Wed, Jun 19, 2002 at 05:54:46PM -0700, Andrew Morton wrote:
> The vague plan there is to replace lock_kernel with lock_journal
> where appropriate. But ext3 scalability work of this nature
> will be targetted at the 2.5 kernel, most probably.
I think we can do better than that, with care. lock_journal could
easily become a read/write lock to protect the transaction state
machine, as there's really only one place --- the commit thread ---
where we end up changing the state of a transaction itself (eg. from
running to committing). For short-lived buffer transformations, we
already have the datalist spinlock.
There are a few intermediate types of operation, such as the
do_get_write_access. That's a buffer operation, but it relies on us
being able to allocate memory for the old version of the buffer if we
happen to be committing the bh to disk already. All of those cases
are already prepared to accept BKL being dropped during the memory
allocation, so there's no problem with doing the same for a short-term
buffer spinlock; and if the journal_lock is only taken shared in such
places, then there's no urgent need to drop that over the malloc.
Even the commit thread can probably avoid taking the journal lock in
many cases --- it would need it exclusively while changing a
transaction's global state, but while it's just manipulating blocks on
the committing transaction it can probably get away with much less
locking.
Cheers,
Stephen

mgross wrote:
>
> We've been doing some throughput comparisons and benchmarks of block I/O
> throughput for 8KB writes as the number of SCSI addapters and drives per
> adapter is increased.
>
> The Linux platform is a dual processor 1.2GHz PIII, 2Gig or RAM, 2U box.
> Similar results have been seen with both 2.4.16 and 2.4.18 base kernel, as
> well as one of those patched up O(1) 2.4.18 kernels out there.
umm. Are you not using block-highmem? That is a must-have.
http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre9aa2/00_block-highmem-all-18b-12.gz
-

Is this a property of ext3 only or the vfs? ... the real question being:
will ext2 or some of the other
"major" filesystems like reiser perform better?
Andrew Morton wrote:
>mgross wrote:
>
>>...
>>Has anyone done any work looking into the I/O scaling of Linux / ext3 per
>>spindle or per adapter? We would like to compare notes.
>>
>
>No. ext3 scalability is very poor, I'm afraid. The fs really wasn't
>up and running until kernel 2.4.5 and we just didn't have time to
>address that issue.
>
>
>>I've only just started to look at the ext3 code but it seems to me that replacing the
>>BKL with a per - ext3 file system lock could remove some of the contention thats
>>getting measured. What data are the BKL protecting in these ext3 functions? Could a
>>lock per FS approach work?
>>
>
>The vague plan there is to replace lock_kernel with lock_journal
>where appropriate. But ext3 scalability work of this nature
>will be targetted at the 2.5 kernel, most probably.
>
>I'll take a look, see if there's any low-hanging fruit in there,
>but I doubt that the results will be fantastic.
>
>-
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@...
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/
>