*Re: [PATCH v2 3/6] 9p: Replace the fidlist with an IDR
2018-07-12 11:23 ` Matthew Wilcox
@ 2018-07-12 11:30 ` Dominique Martinet
0 siblings, 0 replies; 53+ messages in thread
From: Dominique Martinet @ 2018-07-12 11:30 UTC (permalink / raw)
To: Matthew Wilcox
Cc: v9fs-developer, Latchesar Ionkov, Eric Van Hensbergen,
Ron Minnich, linux-kernel, linux-fsdevel
Matthew Wilcox wrote on Thu, Jul 12, 2018:
> > Ah, I had missed that you didn't update this memset as you said in reply
> > to comment on v1.
>
> Rather than update the memset, I ...
>
> > Could you resend just this patch and either initialize fid->fid or use
> > kzalloc for the fid allocation?
> >
> > > + fid->fid = 0;
>
> Did that instead ;-)
>
> If I were going to initialise the entire structure to 0, I'd replace
> the kmalloc with kzalloc and drop the memset altogether.
Oh, I'm blind, sorry! :)
> > If you do resend, alignment here was wrong.
>
> I think this warning from checkpatch is complete bullshit. It's
> really none of checkpatch's business how I choose to align function
> arguments.
>
> That said, if you want it to be aligned some particular way, feel free
> to adjust the whitespace. I don't care.
I would tend to agree that sometimes checkpatch is overzealous, but
having worked on projects where code style is all over the place it
really feels much more comfortable to have a consistent style
everywhere.
Thanks for the permission, I'll adjust it (assuming this does end up
getting pulled from my branch, but nobody has yelled so far).
--
Dominique Martinet

*Re: [PATCH v2 5/6] 9p: Use a slab for allocating requests
2018-07-23 11:52 ` Greg Kurz
@ 2018-07-23 12:25 ` Dominique Martinet
2018-07-23 14:24   ` Greg Kurz
2018-07-30  9:31   ` Dominique Martinet
0 siblings, 2 replies; 53+ messages in thread
From: Dominique Martinet @ 2018-07-23 12:25 UTC (permalink / raw)
To: Greg Kurz
Cc: Matthew Wilcox, v9fs-developer, Latchesar Ionkov,
Eric Van Hensbergen, Ron Minnich, linux-kernel, linux-fsdevel
Greg Kurz wrote on Mon, Jul 23, 2018:
> The patch is quite big and I'm not sure I can find time to review it
> carefully, but I'll try to help anyway.
No worries, and thanks for the help already.
> > Sorry for coming back to this patch now, I just noticed something that's
> > actually probably a fairly big hit on performance...
> >
> > While the slab is just as good as the array for the request itself, this
> > makes every single request allocate "fcalls" every time instead of
> > reusing a cached allocation.
> > The default msize is 8k and these allocs probably are fairly efficient,
> > but some transports like RDMA allow increasing this up to 1MB... And
>
> It can be even bigger with virtio:
>
> #define VIRTQUEUE_NUM 128
>
> .maxsize = PAGE_SIZE * (VIRTQUEUE_NUM - 3),
>
> On a typical ppc64 server class setup with 64KB pages, this is nearly 8MB.
I don't think I'll be able to test 64KB pages, and it's "just" 500k with
4K pages so I'll go with IB.
I just finished reinstalling my IB-enabled VMs, now to get some iops
test running (dbench maybe) and I'll get some figures to be able to play
with different models and evaluate the impact of these.
> > One thing is that the buffers are all going to be the same size for a
> > given client (.... except virtio zc buffers, I wonder what I'm missing
> > or why that didn't blow up before?)
>
> ZC allocates a 4KB buffer, which is more than enough to hold the 7-byte 9P
> header and the "dqd" part of all messages that may use ZC, ie, 16 bytes.
> So I'm not sure to catch what could blow up.
ZC requests won't blow up, but from what I can see with the current
(old) request cache array, if a ZC request has a not-yet-used tag it'll
allocate a new 4k buffer, and if a normal request later reuses that tag
it'll get the 4k buffer instead of an msize-sized one.
On the client side the request would be posted with req->rc->capacity,
which would correctly be 4k, but I'm not sure what would happen if qemu
tried to write more than the given size to that request?
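The mismatch described here can be modeled in plain userspace C. Everything below is a simplified stand-in for the kernel structures, not the 9p API: `tag_cache_get`, this `struct fcall` layout, and the sizes are all illustrative.

```c
#include <stdlib.h>
#include <stddef.h>

/* Simplified stand-in for the old per-tag request cache: whatever
 * buffer was first allocated for a tag is handed back as-is to the
 * next user of that tag, regardless of the size it asked for. */
struct fcall {
	size_t capacity;
	char *sdata;
};

struct fcall *tag_cache_get(struct fcall **slot, size_t want)
{
	struct fcall *fc;

	if (*slot)		/* reuse the cached buffer for this tag... */
		return *slot;	/* ...even if its capacity != want */

	fc = malloc(sizeof(*fc));
	if (!fc)
		return NULL;
	fc->sdata = malloc(want);
	if (!fc->sdata) {
		free(fc);
		return NULL;
	}
	fc->capacity = want;
	*slot = fc;
	return fc;
}
```

A ZC request that touches a fresh tag populates the slot with a 4k buffer, and a later msize-sized request on the same tag silently receives that undersized buffer back, which is exactly the hazard being discussed.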
> > It's a shame because I really like that patch, I'll try to find time to
> > run some light benchmark with varying msizes eventually but I'm not sure
> > when I'll find time for that... Hopefully before the 4.19 merge window!
> >
>
> Yeah, the open-coded cache we have now really obfuscates things.
>
> Maybe have a per-client kmem_cache object for non-ZC requests with
> size msize [*], and a global kmem_cache object for ZC requests with
> fixed size P9_ZC_HDR_SZ.
>
> [*] the server can require a smaller msize during version negotiation,
> so maybe we should change the kmem_cache object in this case.
Yeah, if we're going to want to accommodate non-power-of-two buffers, I
think we'll need a separate kmem_cache for them.
The ZC requests could be made exactly 4k and those could come from
regular kmalloc just fine; it looks like trying to create a cache of
that size would just return the same cache kmalloc already uses anyway,
so it's probably easier to fall back to kmalloc when the requested alloc
size doesn't match what we were hoping for.
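A minimal sketch of that fallback policy (the function name is made up; in the kernel the two branches would be kmem_cache_alloc() and kmalloc()):

```c
#include <stdbool.h>
#include <stddef.h>

/* Use the per-client fcall cache only for buffers of exactly the
 * negotiated msize; anything else (e.g. the 4k ZC header buffers)
 * falls back to the generic allocator. */
static bool use_fcall_cache(size_t client_msize, size_t alloc_size)
{
	return alloc_size == client_msize;
}
```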
I'll try to get figures for various approaches before the merge window
for 4.19 starts, it's getting closer though...
--
Dominique

*Re: [PATCH v2 5/6] 9p: Use a slab for allocating requests
2018-07-23 12:25 ` Dominique Martinet
@ 2018-07-23 14:24 ` Greg Kurz
2018-07-30  9:31   ` Dominique Martinet
1 sibling, 0 replies; 53+ messages in thread
From: Greg Kurz @ 2018-07-23 14:24 UTC (permalink / raw)
To: Dominique Martinet
Cc: Matthew Wilcox, v9fs-developer, Latchesar Ionkov,
Eric Van Hensbergen, Ron Minnich, linux-kernel, linux-fsdevel
On Mon, 23 Jul 2018 14:25:31 +0200
Dominique Martinet <asmadeus@codewreck.org> wrote:
> Greg Kurz wrote on Mon, Jul 23, 2018:
> > The patch is quite big and I'm not sure I can find time to review it
> > carefully, but I'll try to help anyway.
>
> No worry, thanks for this already.
>
> > > Sorry for coming back to this patch now, I just noticed something that's
> > > actually probably a fairly big hit on performance...
> > >
> > > While the slab is just as good as the array for the request itself, this
> > > makes every single request allocate "fcalls" every time instead of
> > > reusing a cached allocation.
> > > The default msize is 8k and these allocs probably are fairly efficient,
> > > but some transports like RDMA allow increasing this up to 1MB... And
> >
> > It can be even bigger with virtio:
> >
> > #define VIRTQUEUE_NUM 128
> >
> > .maxsize = PAGE_SIZE * (VIRTQUEUE_NUM - 3),
> >
> > On a typical ppc64 server class setup with 64KB pages, this is nearly 8MB.
>
> I don't think I'll be able to test 64KB pages, and it's "just" 500k with
> 4K pages so I'll go with IB.
> I just finished reinstalling my IB-enabled VMs, now to get some iops
> test running (dbench maybe) and I'll get some figures to be able to play
> with different models and evaluate the impact of these.
>
Sounds like a good plan.
> > > One thing is that the buffers are all going to be the same size for a
> > > given client (.... except virtio zc buffers, I wonder what I'm missing
> > > or why that didn't blow up before?)
> >
> > ZC allocates a 4KB buffer, which is more than enough to hold the 7-byte 9P
> > header and the "dqd" part of all messages that may use ZC, ie, 16 bytes.
> > So I'm not sure to catch what could blow up.
>
> ZC requests won't blow up, but from what I can see with the current
> (old) request cache array, if a ZC request has a not-yet-used tag it'll
> allocate a new 4k buffer, and if a normal request later reuses that tag
> it'll get the 4k buffer instead of an msize-sized one.
>
Indeed.
> On the client side the request would be posted with req->rc->capacity,
> which would correctly be 4k, but I'm not sure what would happen if qemu
> tried to write more than the given size to that request?
>
QEMU would detect that the sg list doesn't have enough capacity.
Old QEMUs used to return a RERROR or RLERROR message with ENOBUFS
in this case. But it didn't make sense to hijack the 9P protocol, which
is transport agnostic, to report misconfigured buffers. Especially since,
in the worst case, we might not even have enough space for the error
response... So, since QEMU 2.10, we put the virtio 9p device into a
broken state instead, ie, inoperative until it gets reset.
I guess this situation was never hit because server responses mostly
need less than 4KB...
> > > It's a shame because I really like that patch, I'll try to find time to
> > > run some light benchmark with varying msizes eventually but I'm not sure
> > > when I'll find time for that... Hopefully before the 4.19 merge window!
> > >
> >
> > Yeah, the open-coded cache we have now really obfuscates things.
> >
> > Maybe have a per-client kmem_cache object for non-ZC requests with
> > size msize [*], and a global kmem_cache object for ZC requests with
> > fixed size P9_ZC_HDR_SZ.
> >
> > [*] the server can require a smaller msize during version negotiation,
> > so maybe we should change the kmem_cache object in this case.
>
> Yeah, if we're going to want to accommodate non-power-of-two buffers, I
> think we'll need a separate kmem_cache for them.
> The ZC requests could be made exactly 4k and those could come from
> regular kmalloc just fine; it looks like trying to create a cache of
> that size would just return the same cache kmalloc already uses anyway,
> so it's probably easier to fall back to kmalloc when the requested alloc
> size doesn't match what we were hoping for.
>
You're right, ZC requests could rely on kmalloc() directly.
> I'll try to get figures for various approaches before the merge window
> for 4.19 starts, it's getting closer though...
>
Great, thanks for your effort, but we've been living with this code
since the beginning. If this misses the 4.19 merge window, we'll have
more time to validate the approach and polish the fix for 4.20 :)
Cheers,
--
Greg

*Re: [V9fs-developer] [PATCH 2/2] net/9p: add a per-client fcall kmem_cache
2018-07-31 1:18 ` [V9fs-developer] " piaojun
@ 2018-07-31 1:35 ` Dominique Martinet
2018-07-31 1:45   ` piaojun
0 siblings, 1 reply; 53+ messages in thread
From: Dominique Martinet @ 2018-07-31 1:35 UTC (permalink / raw)
To: piaojun
Cc: v9fs-developer, linux-fsdevel, Greg Kurz, Matthew Wilcox, linux-kernel
piaojun wrote on Tue, Jul 31, 2018:
> Could you help paste some test result before-and-after the patch applied?
The only performance tests I did were sent to the list a couple of mails
earlier, you can find it here:
http://lkml.kernel.org/r/20180730093101.GA7894@nautica
In particular, the results for benchmark on small writes just before and
after this patch, without KASAN (these are the same numbers as in the
link, hardware/setup is described there):
- no alloc (4.18-rc7 request cache): 65.4k req/s
- non-power of two alloc, no patch: 61.6k req/s
- power of two alloc, no patch: 62.2k req/s
- non-power of two alloc, with patch: 64.7k req/s
- power of two alloc, with patch: 65.1k req/s
I'm rather happy with the result, I didn't expect using a dedicated
cache would bring this much back but it's certainly worth it.
> > @@ -1011,6 +1034,7 @@ void p9_client_destroy(struct p9_client *clnt)
> >
> > p9_tag_cleanup(clnt);
> >
> > + kmem_cache_destroy(clnt->fcall_cache);
>
> We could set NULL for fcall_cache in case of use-after-free.
>
> > kfree(clnt);
Hmm, I understand where this comes from, but I'm not sure I agree.
If someone tries to access the client while/after it is freed things are
going to break anyway, I'd rather let things break as obviously as
possible than try to cover it up.
--
Dominique

*Re: [V9fs-developer] [PATCH 2/2] net/9p: add a per-client fcall kmem_cache
2018-07-31 1:35 ` Dominique Martinet
@ 2018-07-31 1:45 ` piaojun
0 siblings, 0 replies; 53+ messages in thread
From: piaojun @ 2018-07-31 1:45 UTC (permalink / raw)
To: Dominique Martinet
Cc: v9fs-developer, linux-fsdevel, Greg Kurz, Matthew Wilcox, linux-kernel
On 2018/7/31 9:35, Dominique Martinet wrote:
> piaojun wrote on Tue, Jul 31, 2018:
>> Could you help paste some test result before-and-after the patch applied?
>
> The only performance tests I did were sent to the list a couple of mails
> earlier, you can find it here:
> http://lkml.kernel.org/r/20180730093101.GA7894@nautica
>
> In particular, the results for benchmark on small writes just before and
> after this patch, without KASAN (these are the same numbers as in the
> link, hardware/setup is described there):
> - no alloc (4.18-rc7 request cache): 65.4k req/s
> - non-power of two alloc, no patch: 61.6k req/s
> - power of two alloc, no patch: 62.2k req/s
> - non-power of two alloc, with patch: 64.7k req/s
> - power of two alloc, with patch: 65.1k req/s
>
> I'm rather happy with the result, I didn't expect using a dedicated
> cache would bring this much back but it's certainly worth it.
>
That looks like an obvious improvement.
>>> @@ -1011,6 +1034,7 @@ void p9_client_destroy(struct p9_client *clnt)
>>>
>>> p9_tag_cleanup(clnt);
>>>
>>> + kmem_cache_destroy(clnt->fcall_cache);
>>
>> We could set NULL for fcall_cache in case of use-after-free.
>>
>>> kfree(clnt);
>
> Hmm, I understand where this comes from, but I'm not sure I agree.
> If someone tries to access the client while/after it is freed things are
> going to break anyway, I'd rather let things break as obviously as
> possible than try to cover it up.
>
Setting it to NULL is not a big deal; I'll wait to hear others' opinions.

*Re: [PATCH 2/2] net/9p: add a per-client fcall kmem_cache
2018-07-31 2:46 ` Matthew Wilcox
@ 2018-07-31 4:17 ` Dominique Martinet
0 siblings, 0 replies; 53+ messages in thread
From: Dominique Martinet @ 2018-07-31 4:17 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: v9fs-developer, Greg Kurz, linux-fsdevel, linux-kernel
Matthew Wilcox wrote on Mon, Jul 30, 2018:
> On Mon, Jul 30, 2018 at 11:34:23AM +0200, Dominique Martinet wrote:
> > -static int p9_fcall_alloc(struct p9_fcall *fc, int alloc_msize)
> > +static int p9_fcall_alloc(struct p9_client *c, struct p9_fcall *fc,
> > + int alloc_msize)
> > {
> > - fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
> > + if (c->fcall_cache && alloc_msize == c->msize)
> > + fc->sdata = kmem_cache_alloc(c->fcall_cache, GFP_NOFS);
> > + else
> > + fc->sdata = kmalloc(alloc_msize, GFP_NOFS);
>
> Could you simplify this by initialising c->msize to 0 and then this
> can simply be:
>
> > + if (alloc_msize == c->msize)
> ...
Hmm, this is rather tricky with the current flow of things;
p9_client_version() has multiple uses for that msize field.
Basically what happens is:
- init client struct, clip msize to the mount option/transport-specific
max
- p9_client_version() uses the current c->msize to send a suggested value
to the server
- p9_client_rpc() uses the current c->msize to allocate that first rpc;
this is pretty much hard-coded and would be quite intrusive to make an
exception for
- p9_client_version() looks at the msize the server suggested and clips
c->msize if the reply's value is smaller
I kind of agree it'd be nice to remove a check that only matters at
startup, but I don't see how to do this easily with the current code.
Making p9_client_version take an extra argument would be easy, but we'd
need to actually hardcode in p9_client_rpc that "if the message type is
TVERSION then use [page size or whatever] for allocation", and that kind
of kills the point... The alternative would be having p9_client_rpc take
the actual size as an argument itself, but this again is pretty
intrusive even if it could be done mechanically...
I'll think about this some more.
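The negotiation flow above boils down to one clip step once the RVERSION reply is in; a trivial sketch (clip_msize is a made-up name, not a kernel function):

```c
#include <stddef.h>

/* After RVERSION: keep the client's proposed msize unless the server
 * negotiated something smaller. */
static size_t clip_msize(size_t client_msize, size_t server_msize)
{
	return server_msize < client_msize ? server_msize : client_msize;
}
```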
> > +void p9_fcall_free(struct p9_client *c, struct p9_fcall *fc)
> > +{
> > + /* sdata can be NULL for interrupted requests in trans_rdma,
> > + * and kmem_cache_free does not do NULL-check for us
> > + */
> > + if (unlikely(!fc->sdata))
> > + return;
> > +
> > + if (c->fcall_cache && fc->capacity == c->msize)
> > + kmem_cache_free(c->fcall_cache, fc->sdata);
> > + else
> > + kfree(fc->sdata);
> > +}
>
> Is it possible for fcall_cache to be allocated before fcall_free is
> called? I'm concerned we might do this:
>
> allocate message A
> allocate message B
> receive response A
> allocate fcall_cache
> receive response B
>
> and then we'd call kmem_cache_free() for something allocated by kmalloc(),
> which works with slab and slub, but doesn't work with slob (alas).
Bleh, I checked that this would work for slab but didn't really check
the others...
This cannot happen right now because we only return the client struct
from p9_client_create after the first message is done (and, right now,
freed), but once we start adding refcounting to requests it'd be possible
to free the very first response after fcall_cache is allocated, with a
"bad" server like syzkaller's sending the version reply before the
request came in.
I can't see any way around this other than storing how the fcall
was allocated in the struct itself, though...
I guess I might as well do that now, unless you have a better idea.
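A userspace sketch of that workaround: the fcall remembers which allocator its buffer came from, so the free path never has to guess. Here "cache" is just a tag standing in for a struct kmem_cache pointer, both paths use malloc/free, and the function names are illustrative:

```c
#include <stdlib.h>
#include <stddef.h>

struct fcall {
	void *cache;	/* NULL => buffer came from the generic allocator */
	char *sdata;
};

static int fcall_init(struct fcall *fc, size_t size, void *cache)
{
	fc->cache = cache;	/* record the origin at alloc time */
	fc->sdata = malloc(size);	/* kmem_cache_alloc() or kmalloc() */
	return fc->sdata ? 0 : -1;
}

static void fcall_fini(struct fcall *fc)
{
	/* kernel version: fc->cache ? kmem_cache_free(fc->cache, fc->sdata)
	 *                           : kfree(fc->sdata); */
	free(fc->sdata);
	fc->sdata = NULL;
}
```

With the origin stored per-fcall, the race above (cache created between a request's allocation and its free) can no longer mismatch allocator and deallocator.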
> > @@ -980,6 +1000,9 @@ struct p9_client *p9_client_create(const char *dev_name, char *options)
> > if (err)
> > goto close_trans;
> >
> > + clnt->fcall_cache = kmem_cache_create("9p-fcall-cache", clnt->msize,
> > + 0, 0, NULL);
> > +
>
> If we have slab merging turned off, or we have two mounts from servers
> with different msizes, we'll end up with two slabs called 9p-fcall-cache.
> I'm OK with that, but are you?
Yeah, the reason I didn't make it global like p9_req_cache is precisely
to get two separate caches when the msizes are different.
I actually considered adding the msize to the name with snprintf or
something, but anyone looking at it through slabinfo or similar will
see the sizes anyway, so I don't think this would bring anything. Do you
know of, or expect, tools that will choke on multiple caches with the
same name?
I'm not sure about slab merging being disabled though; from the little I
understand I do not see why anyone would do that except for debugging,
and I'm fine with that.
Please let me know if I'm missing something though!
Thanks for the review,
--
Dominique Martinet

*Re: [V9fs-developer] [PATCH 1/2] net/9p: embed fcall in req to round down buffer allocs
2018-08-01 14:14 ` Greg Kurz
@ 2018-08-01 14:38 ` Dominique Martinet
2018-08-01 15:03   ` Greg Kurz
0 siblings, 1 reply; 53+ messages in thread
From: Dominique Martinet @ 2018-08-01 14:38 UTC (permalink / raw)
To: Greg Kurz; +Cc: v9fs-developer, linux-fsdevel, Matthew Wilcox, linux-kernel
Greg Kurz wrote on Wed, Aug 01, 2018:
> > @@ -263,13 +261,13 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size)
> > if (!req)
> > return NULL;
> >
> > - req->tc = p9_fcall_alloc(alloc_msize);
> > - req->rc = p9_fcall_alloc(alloc_msize);
> > - if (!req->tc || !req->rc)
> > + if (p9_fcall_alloc(&req->tc, alloc_msize))
> > + goto free;
> > + if (p9_fcall_alloc(&req->rc, alloc_msize))
> > goto free;
>
> Hmm... if the first allocation fails, we will kfree() req->rc.sdata.
>
> Are we sure we won't have a stale pointer or uninitialized data in
> there ?
Yeah, Jun pointed that out and I have a v2 that only frees as needed
with an extra goto (I sent an incremental diff in my reply to his
comment here[1])
[1] https://lkml.kernel.org/r/20180731011256.GA30388@nautica
> And even if we don't with the current code base, this is fragile and
> could be easily broken.
>
> I think you should drop this hunk and rather rename p9_fcall_alloc() to
> p9_fcall_alloc_sdata() instead, since this is what the function is
> actually doing with this patch applied.
Hmm. I agree the naming isn't accurate, but even if we rename it we'll
still need to pass a pointer to the fcall as an argument since it
initializes its capacity.
p9_fcall_init(fc, msize) might be simpler?
(I'm not sure I follow what you mean by 'drop this hunk', to be honest;
did you want a single function call to init both, maybe?)
--
Dominique

*Re: [V9fs-developer] [PATCH 1/2] net/9p: embed fcall in req to round down buffer allocs
2018-08-01 14:38 ` Dominique Martinet
@ 2018-08-01 15:03 ` Greg Kurz
0 siblings, 0 replies; 53+ messages in thread
From: Greg Kurz @ 2018-08-01 15:03 UTC (permalink / raw)
To: Dominique Martinet
Cc: v9fs-developer, linux-fsdevel, Matthew Wilcox, linux-kernel
On Wed, 1 Aug 2018 16:38:40 +0200
Dominique Martinet <asmadeus@codewreck.org> wrote:
> Greg Kurz wrote on Wed, Aug 01, 2018:
> > > @@ -263,13 +261,13 @@ p9_tag_alloc(struct p9_client *c, int8_t type, unsigned int max_size)
> > > if (!req)
> > > return NULL;
> > >
> > > + if (p9_fcall_alloc(&req->tc, alloc_msize))
> > > + goto free;
> > > + if (p9_fcall_alloc(&req->rc, alloc_msize))
> > > goto free;
> >
> > Hmm... if the first allocation fails, we will kfree() req->rc.sdata.
> >
> > Are we sure we won't have a stale pointer or uninitialized data in
> > there ?
>
> Yeah, Jun pointed that out and I have a v2 that only frees as needed
> with an extra goto (I sent an incremental diff in my reply to his
> comment here[1])
>
> [1] https://lkml.kernel.org/r/20180731011256.GA30388@nautica
>
> > And even if we don't with the current code base, this is fragile and
> > could be easily broken.
> >
> > I think you should drop this hunk and rather rename p9_fcall_alloc() to
> > p9_fcall_alloc_sdata() instead, since this is what the function is
> > actually doing with this patch applied.
>
> Hmm. I agree the naming isn't accurate, but even if we rename it we'll
> need to pass a pointer to fcall as argument as it inits its capacity.
> p9_fcall_init(fc, msize) might be simpler?
>
Ah yes you're right... alloc is a bit misleading then. I agree that
p9_fcall_init() is more appropriate in this case.
And maybe you should introduce p9_fcall_fini() or _release() for
completeness. It would only do kfree() for a start, but it would
then evolve to be like the p9_fcall_kfree() function from patch 2.
> (I'm not sure I follow what you mean by 'drop this hunk', to be honest,
> did you want a single function call to init both maybe?)
>
I was meaning "keep the same logic in p9_tag_alloc()", something like:
req->tc.sdata = p9_fcall_alloc_sdata(&req->tc, alloc_msize);
req->rc.sdata = p9_fcall_alloc_sdata(&req->rc, alloc_msize);
if (!req->tc.sdata || !req->rc.sdata)
But I agree the way you did it is cleaner.