Although the SRP protocol supports multichannel operation, although RDMA HCAs that support multiple completion vectors have been available for a considerable time, and although multichannel operation yields better performance than using a single channel, the Linux SRP initiator does not yet support multichannel operation. Hence this patch series, which adds multichannel support to the SRP initiator driver.

The changes compared to the previous version of this patch series are as follows:
* Added a function to the block layer that allows SCSI LLDs to query the blk-mq hardware context index chosen by the block layer. Removed the mq_queuecommand callback again.
* Added support for multiple hardware queues in the TCQ functions in the SCSI core.
* Split a few patches and elaborated the patch descriptions to make it easier to review this patch series.
* Added two new patches: one patch that makes the SRP initiator always use block layer tags and another patch that realizes a micro-optimization, namely elimination of the free requests list.

The patches in this series are:
0001-blk-mq-Use-all-available-hardware-queues.patch
0002-blk-mq-Add-blk_mq_unique_tag.patch
0003-scsi-mq-Add-support-for-multiple-hardware-queues.patch
0004-scsi_tcq.h-Add-support-for-multiple-hardware-queues.patch
0005-IB-srp-Move-ib_destroy_cm_id-call-into-srp_free_ch_i.patch
0006-IB-srp-Remove-stale-connection-retry-mechanism.patch
0007-IB-srp-Avoid-that-I-O-hangs-due-to-a-cable-pull-duri.patch
0008-IB-srp-Introduce-two-new-srp_target_port-member-vari.patch
0009-IB-srp-Separate-target-and-channel-variables.patch
0010-IB-srp-Use-block-layer-tags.patch
0011-IB-srp-Eliminate-free_reqs-list.patch
0012-IB-srp-Add-multichannel-support.patch

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-***@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Suppose that a system has two CPU sockets, three cores per socket, that it does not support hyperthreading and that four hardware queues are provided by a block driver. With the current algorithm this will lead to the following assignment of CPU cores to hardware queues:

HWQ 0: 0 1
HWQ 1: 2 3
HWQ 2: 4 5
HWQ 3: (none)

This patch changes the queue assignment into:

HWQ 0: 0 1
HWQ 1: 2
HWQ 2: 3 4
HWQ 3: 5

In other words, this patch has the following three effects:
- All four hardware queues are used instead of only three.
- CPU cores are spread more evenly over hardware queues. For the above example the range of the number of CPU cores associated with a single HWQ is reduced from [0..2] to [1..2].
- If the number of HWQs is a multiple of the number of CPU sockets it is now guaranteed that all CPU cores associated with a single HWQ reside on the same CPU socket.

Post by Bart Van Assche
Suppose that a system has two CPU sockets, three cores per socket, that it does not support hyperthreading and that four hardware queues are provided by a block driver. With the current algorithm this will lead to the following assignment of CPU cores to hardware queues:

HWQ 0: 0 1
HWQ 1: 2 3
HWQ 2: 4 5
HWQ 3: (none)

This patch changes the queue assignment into:

HWQ 0: 0 1
HWQ 1: 2
HWQ 2: 3 4
HWQ 3: 5

- All four hardware queues are used instead of only three.
- CPU cores are spread more evenly over hardware queues. For the above example the range of the number of CPU cores associated with a single HWQ is reduced from [0..2] to [1..2].
- If the number of HWQs is a multiple of the number of CPU sockets it is now guaranteed that all CPU cores associated with a single HWQ reside on the same CPU socket.
---
 block/blk-mq-cpumap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 1065d7c..8e56455 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -17,7 +17,7 @@
 static int cpu_to_queue_index(unsigned int nr_cpus, unsigned int nr_queues,
 			      const int cpu)
 {
-	return cpu / ((nr_cpus + nr_queues - 1) / nr_queues);
+	return cpu * nr_queues / nr_cpus;
 }

 static int get_first_sibling(unsigned int cpu)

Let's do this separately; as explained last time, it needs to be evaluated on its own and doesn't really belong in this series of patches.

--
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Post by Jens Axboe
Let's do this separately; as explained last time, it needs to be evaluated on its own and doesn't really belong in this series of patches.

Hello Jens,

A few minutes ago I resent this patch to you with the LKML in CC. I hope that Christoph agrees with leaving this patch out of this series without me having to resend the series. BTW, Sagi promised me off-list to review the other patches in this series.

Bart.


Post by Jens Axboe
Let's do this separately; as explained last time, it needs to be evaluated on its own and doesn't really belong in this series of patches.

Hello Jens,

A few minutes ago I resent this patch to you with the LKML in CC. I hope that Christoph agrees with leaving this patch out of this series without me having to resend the series. BTW, Sagi promised me off-list to review the other patches in this series.

Ignoring patches is one of the easier tasks, np. Do you want me to merge all the SRP updates? They normally go through the IB tree, but I can pick them up this time so that we don't need to synchronize the two trees.


Post by Jens Axboe
Let's do this separately; as explained last time, it needs to be evaluated on its own and doesn't really belong in this series of patches.

Hello Jens,

A few minutes ago I resent this patch to you with the LKML in CC. I hope that Christoph agrees with leaving this patch out of this series without me having to resend the series. BTW, Sagi promised me off-list to review the other patches in this series.

Ignoring patches is one of the easier tasks, np. Do you want me to merge all the SRP updates? They normally go through the IB tree, but I can pick them up this time so that we don't need to synchronize the two trees.

Hello Christoph,

Since patch 1/12 has already been sent separately to Jens, patches 2/12..12/12 remain. The SRP initiator changes in this series depend on the blk-mq and scsi-mq features added in patches 2/12..4/12. I think we should avoid splitting these patches over multiple kernel trees so that we do not have to deal with dependencies between kernel trees.

Bart.

Post by Bart Van Assche
Since patch 1/12 has already been sent separately to Jens, patches 2/12..12/12 remain. The SRP initiator changes in this series depend on the blk-mq and scsi-mq features added in patches 2/12..4/12. I think we should avoid splitting these patches over multiple kernel trees so that we do not have to deal with dependencies between kernel trees.

I'd like to pull it in, and Roland already did indicate that he's fine with it.

Sagi, can I get a review from you for the remaining SRP patches?


Post by Bart Van Assche
Since patch 1/12 has already been sent separately to Jens, patches 2/12..12/12 remain. The SRP initiator changes in this series depend on the blk-mq and scsi-mq features added in patches 2/12..4/12. I think we should avoid splitting these patches over multiple kernel trees so that we do not have to deal with dependencies between kernel trees.

I'd like to pull it in, and Roland already did indicate that he's fine with it.

Sagi, can I get a review from you for the remaining SRP patches?

As I promised Bart, it's on my todo list. Unfortunately this week I was violently pulled onto another project... It's a holiday here in Israel, so I'm planning to review the set by mid next week.

Sagi.

The queuecommand() callback functions in SCSI low-level drivers need to know which hardware context has been selected by the block layer. Since this information is not available in the request structure, and since passing the hctx pointer directly to the queuecommand callback function would require modification of all SCSI LLDs, add a function to the block layer that allows querying the hardware context index.

With the approach for block layer tag management proposed in this patch series SCSI LLDs no longer need to call this function. This means that the blk_mq_build_unique_tag() function can be eliminated by inlining it into blk_mq_unique_tag(). Would you like me to rework this patch accordingly?

Bart.


Post by Bart Van Assche
With the approach for block layer tag management proposed in this patch series SCSI LLDs no longer need to call this function. This means that the blk_mq_build_unique_tag() function can be eliminated by inlining it into blk_mq_unique_tag(). Would you like me to rework this patch accordingly?

Yes, please.


Post by Bart Van Assche
With the approach for block layer tag management proposed in this patch series SCSI LLDs no longer need to call this function. This means that the blk_mq_build_unique_tag() function can be eliminated by inlining it into blk_mq_unique_tag(). Would you like me to rework this patch accordingly?

Yes, please.

With this bit you can add:

Reviewed-by: Sagi Grimberg <***@mellanox.com>

Reviewed-by: Sagi Grimberg <sagig-VPRAkNaXOzVWk0Htik3J/***@public.gmane.org>

This is because the indentation has been modified from "8x<space><tab>" into "<tab><tab>". I can leave out that change if you want.

Please keep it.

I don't have a big objection to this, but the problem with leaving this stuff in is that it tends to screw up git blame...

Attempting to connect three times may be insufficient after an initiator system tries to relogin, especially if the relogin attempt occurs before the SRP target service ID has been registered. Since srp_daemon retries a failed login attempt anyway, remove the stale connection retry mechanism.

If a cable is pulled during LUN scanning it can happen that the SRP rport and the SCSI host have been created but no LUNs have been added to the SCSI host. Since multipathd only sends SCSI commands to a SCSI target if one or more SCSI devices are present, and since there is no keepalive mechanism for IB queue pairs, this means that after a LUN scan failed and after a reconnect has succeeded no data will be sent over the QP and hence that a subsequent cable pull will not be detected. Avoid this by not creating an rport or SCSI host if a cable is pulled during a SCSI LUN scan.

Note: so far the above behavior has only been observed with the kernel module parameter ch_count set to a value >= 2.

 	srp_disconnect_target(target);
+
+	if (target->state == SRP_TARGET_SCANNING)
+		return -ENODEV;
+
 	/*
 	 * Now get a new local CM ID so that we avoid confusing the target in
 	 * case things are really fouled up. Doing so also ensures that all CM
@@ -2607,11 +2611,23 @@ static struct scsi_host_template srp_template = {
 	.shost_attrs = srp_host_attrs
 };

Post by Bart Van Assche
If a cable is pulled during LUN scanning it can happen that the SRP rport and the SCSI host have been created but no LUNs have been added to the SCSI host. Since multipathd only sends SCSI commands to a SCSI target if one or more SCSI devices are present, and since there is no keepalive mechanism for IB queue pairs, this means that after a LUN scan failed and after a reconnect has succeeded no data will be sent over the QP and hence that a subsequent cable pull will not be detected. Avoid this by not creating an rport or SCSI host if a cable is pulled during a SCSI LUN scan.

Note: so far the above behavior has only been observed with the kernel module parameter ch_count set to a value >= 2.
---
 drivers/infiniband/ulp/srp/ib_srp.c | 60 +++++++++++++++++++++++++++++++------
 drivers/infiniband/ulp/srp/ib_srp.h |  1 +
 2 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 9608e7a..a662c29 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -1111,6 +1111,10 @@ static int srp_rport_reconnect(struct srp_rport *rport)
 	int i, ret;

 	srp_disconnect_target(target);
+
+	if (target->state == SRP_TARGET_SCANNING)
+		return -ENODEV;
+
 	/*
 	 * Now get a new local CM ID so that we avoid confusing the target in
 	 * case things are really fouled up. Doing so also ensures that all CM
@@ -2607,11 +2611,23 @@ static struct scsi_host_template srp_template = {
 	.shost_attrs = srp_host_attrs
 };

+static int srp_sdev_count(struct Scsi_Host *host)
+{
+	struct scsi_device *sdev;
+	int c = 0;
+
+	shost_for_each_device(sdev, host)
+		c++;
+
+	return c;
+}
+

Is this really an SRP-specific routine? Can you move it to a more natural location?

So my impression is that by conditioning on target->qp_in_error you are relying on the fact that the SRP EH was invoked here (RC error). What if the SCSI EH was invoked prior to that? Did you test this path?

This code path has been tested. It's not that hard to trigger - setting the channel count (ch_count) parameter to a high value and running a loop at the target side that disables IB ports after a random delay between 0s and (2*srp_daemon_scan_interval) is sufficient. After not too many iterations this code will be hit, because the higher the channel count the longer it takes to log in.

The SCSI EH is only activated after a SCSI command has failed. No SCSI commands are sent to a target system before the scsi_scan_target() call. This means that the SCSI EH can only get activated while scsi_scan_target() is in progress or after that function has finished.

Bart.

Introduce the srp_target_port member variables 'sgid' and 'pkey'. Change the type of 'orig_dgid' from __be16[8] into union ib_gid. This patch does not change any functionality but makes the "Separate target and channel variables" patch easier to verify.

Reviewed-by: Sagi Grimberg <***@mellanox.com>

Changes in this patch:
- Move channel variables into a new structure (struct srp_rdma_ch).
- Add an srp_target_port pointer and 'lock' and 'comp_vector' members in struct srp_rdma_ch.
- Add code to initialize these three new member variables.
- Many boring "target->" into "ch->" changes.
- The cm_id and completion handler context pointers are now of type srp_rdma_ch * instead of srp_target_port *.
- Three kzalloc(a * b, f) calls have been changed into kcalloc(a, b, f) to avoid that this patch triggers a checkpatch warning.
- Two casts from u64 into unsigned long long have been left out because they are superfluous: u64 has been defined as unsigned long long for a considerable time on all architectures supported by the Linux kernel.

Post by Bart Van Assche
- Move channel variables into a new structure (struct srp_rdma_ch).
- Add an srp_target_port pointer and 'lock' and 'comp_vector' members in struct srp_rdma_ch.
- Add code to initialize these three new member variables.
- Many boring "target->" into "ch->" changes.
- The cm_id and completion handler context pointers are now of type srp_rdma_ch * instead of srp_target_port *.
- Three kzalloc(a * b, f) calls have been changed into kcalloc(a, b, f) to avoid that this patch triggers a checkpatch warning.
- Two casts from u64 into unsigned long long have been left out because they are superfluous: u64 has been defined as unsigned long long for a considerable time on all architectures supported by the Linux kernel.

This patch is pretty exhausting... Didn't find anything wrong, but I didn't carefully review every bit of it.

Seems like a nice optimization for the future would be to preallocate the SRP requests with the block ones and the SCSI command.

Agreed. The reason why that optimization has not been included in this patch series is that it would require more work than the optimization in patch 10/12. The free_reqs list is namely not only used when a SCSI command is submitted by the SCSI core but also when submitting a task management command or when replying to a request submitted by the target system. In other words, the free_reqs list would have to be modified such that it is only used for the latter two purposes.

Bart.


Have you tested this with scsi_mod.use_blk_mq=n? Trying similar changes in hpsa, we still receive some INQUIRY commands submitted through queuecommand with tag -1. They are for devices for which slave_alloc has not yet been run, implying this work needs to be done even earlier. Maybe the midlayer is missing a slave_alloc call somewhere?

Hello Rob,

All my tests with use_blk_mq=n were run with a WARN_ON_ONCE(req->tag < 0) statement present in srp_queuecommand(). I haven't seen any kernel warning being triggered during the tests I ran.

I also had a look at scsi_alloc_sdev() in drivers/scsi/scsi_scan.c. The number of statements between queue allocation and the slave_alloc() call is limited. The only scenario I can think of which could cause queuecommand() to be invoked before slave_alloc() is a LUN scan initiated from user space via sysfs, due to scsi_sysfs_device_initialize() being invoked before slave_alloc(). Does that make sense to you?

Bart.

Post by Bart Van Assche
All my tests with use_blk_mq=n were run with a WARN_ON_ONCE(req->tag < 0) statement present in srp_queuecommand(). I haven't seen any kernel warning being triggered during the tests I ran.

Bart, what's the data type of "req->tag" here? (E.g., if it is "unsigned", it will never be less than zero, right?)

Thanks,

Webb

Post by Bart Van Assche
All my tests with use_blk_mq=n were run with a WARN_ON_ONCE(req->tag < 0) statement present in srp_queuecommand(). I haven't seen any kernel warning being triggered during the tests I ran.

Bart, what's the data type of "req->tag" here? (E.g., if it is "unsigned", it will never be less than zero, right?)

Hello Webb,

This is what I found in "struct request" in <linux/blkdev.h>:

struct request {
	[ ... ]
	int tag;
	[ ... ]
};

Bart.


Have you tested this with scsi_mod.use_blk_mq=n? Trying similar changes in hpsa, we still receive some INQUIRY commands submitted through queuecommand with tag -1. They are for devices for which slave_alloc has not yet been run, implying this work needs to be done even earlier. Maybe the midlayer is missing a slave_alloc call somewhere?

Did that version of hpsa really enable tagging in slave_alloc or just in slave_configure? The latter would cause INQUIRY to be sent untagged.

Have you tested this with scsi_mod.use_blk_mq=n? Trying similar changes in hpsa, we still receive some INQUIRY commands submitted through queuecommand with tag -1. They are for devices for which slave_alloc has not yet been run, implying this work needs to be done even earlier. Maybe the midlayer is missing a slave_alloc call somewhere?

Did that version of hpsa really enable tagging in slave_alloc or just in slave_configure? The latter would cause INQUIRY to be sent untagged.

Yes, it is slave_alloc, not slave_configure.

However, it was looking at scmd->tag, which is always 0xff (at least in those early discovery commands). scmd->request->tag looks like it is the field that has the correct values.

Also, I noticed that scmd->tag is just an 8-bit field, so it could never represent a large number of tags.

Just to confirm: after calling scsi_init_shared_tag_map() in non-mq mode, will scmd->request->tag be based on controller-wide tag allocation (never using the same value at the same time for the request queues of multiple devices in that controller)?


Post by Elliott, Robert (Server Storage)
However, it was looking at scmd->tag, which is always 0xff (at least in those early discovery commands). scmd->request->tag looks like it is the field that has the correct values. Also, I noticed that scmd->tag is just an 8-bit field, so it could never represent a large number of tags.

Yes, we need to get rid of scmd->tag. Hannes had a patchset to get started on it, and I hope either he or someone else will have time to get back to it ASAP.

Post by Elliott, Robert (Server Storage)
Just to confirm: after calling scsi_init_shared_tag_map() in non-mq mode, will scmd->request->tag be based on controller-wide tag allocation (never using the same value at the same time for the request queues of multiple devices in that controller)?

Yes.

Post by Bart Van Assche
The free_reqs list is no longer needed now that we are using tags assigned by the block layer. Hence remove it.

Is there any good reason not to fold this into the previous patch?

The only reason why patches 10/12 and 11/12 are separate patches is to reduce the size of the individual patches and hence to make them easier to review. If everyone agrees, I'm fine with folding patch 11/12 into patch 10/12.

Bart.


Post by Bart Van Assche
The only reason why patches 10/12 and 11/12 are separate patches is to reduce the size of the individual patches and hence to make them easier to review. If everyone agrees, I'm fine with folding patch 11/12 into patch 10/12.

I would prefer to merge the two patches.


Improve performance by using multiple RDMA/RC channels per SCSI host for communication with an SRP target. About the implementation:
- Introduce a loop over all channels in the code that uses target->ch.
- Set the SRP_MULTICHAN_MULTI flag during login for the creation of the second and subsequent channels.
- RDMA completion vectors are chosen such that RDMA completion interrupts are handled by the CPU socket that submitted the I/O request. As one can see in this patch it has been assumed that, if a system contains n CPU sockets and m RDMA completion vectors have been assigned to an RDMA HCA, IRQ affinity has been configured such that completion vectors [i*m/n..(i+1)*m/n) are bound to CPU socket i with 0 <= i < n.
- Modify srp_free_ch_ib() and srp_free_req_data() such that it becomes safe to invoke these functions after the corresponding allocation function failed.
- Add a ch_count sysfs attribute per target port.

diff --git a/Documentation/ABI/stable/sysfs-driver-ib_srp b/Documentation/ABI/stable/sysfs-driver-ib_srp
index b9688de..d5a459e 100644
--- a/Documentation/ABI/stable/sysfs-driver-ib_srp
+++ b/Documentation/ABI/stable/sysfs-driver-ib_srp
@@ -55,12 +55,12 @@ Description:	Interface for making ib_srp connect to a new target.
 		only safe with partial memory descriptor list support enabled
 		(allow_ext_sg=1).
 		* comp_vector, a number in the range 0..n-1 specifying the
-		  MSI-X completion vector. Some HCA's allocate multiple (n)
-		  MSI-X vectors per HCA port. If the IRQ affinity masks of
-		  these interrupts have been configured such that each MSI-X
-		  interrupt is handled by a different CPU then the comp_vector
-		  parameter can be used to spread the SRP completion workload
-		  over multiple CPU's.
+		  MSI-X completion vector of the first RDMA channel. Some
+		  HCA's allocate multiple (n) MSI-X vectors per HCA port. If
+		  the IRQ affinity masks of these interrupts have been
+		  configured such that each MSI-X interrupt is handled by a
+		  different CPU then the comp_vector parameter can be used to
+		  spread the SRP completion workload over multiple CPU's.
 		* tl_retry_count, a number in the range 2..7 specifying the
 		  IB RC retry count.
 		* queue_size, the maximum number of commands that the
@@ -88,6 +88,13 @@ Description:	Whether ib_srp is allowed to include a partial memory
 		descriptor list in an SRP_CMD when communicating with an SRP
 		target.

+What:		/sys/class/scsi_host/host<n>/ch_count
+Date:		November 1, 2014
+KernelVersion:	3.18
+Contact:	linux-***@vger.kernel.org
+Description:	Number of RDMA channels used for communication with the SRP
+		target.
+
 What:		/sys/class/scsi_host/host<n>/cmd_sg_entries
 Date:		May 19, 2011
 KernelVersion:	2.6.39
@@ -95,6 +102,12 @@ Contact:	linux-***@vger.kernel.org
 Description:	Maximum number of data buffer descriptors that may be sent to
 		the target in a single SRP_CMD request.

+static unsigned ch_count;
+module_param(ch_count, uint, 0444);
+MODULE_PARM_DESC(ch_count,
+		 "Number of RDMA channels to use for communication with an SRP target. Using more than one channel improves performance if the HCA supports multiple completion vectors. The default value is the minimum of four times the number of online CPU sockets and the number of completion vectors supported by the HCA.");
+
 static void srp_add_one(struct ib_device *device);
 static void srp_remove_one(struct ib_device *device);
 static void srp_recv_completion(struct ib_cq *cq, void *ch_ptr);
@@ -562,11 +567,26 @@ static void srp_free_ch_ib(struct srp_target_port *target,
 	struct srp_device *dev = target->srp_host->srp_dev;
 	int i;

Do you have a reproducer for that? I think we should fix the root cause.

Hello Christoph,

The above assignment statement has been reported to fix a kernel oops that could be triggered by cable pulling. Regarding fixing the root cause: some time ago I posted a patch series that makes scsi_remove_host() wait until all error handler callback functions have finished and that also prevents any new error handler function calls from being initiated after scsi_remove_host() has finished (http://thread.gmane.org/gmane.linux.scsi/82572/focus=87985). Should I repost that patch series?

Bart.


The above assignment statement has been reported to fix a kernel oops that could be triggered by cable pulling. Regarding fixing the root cause: some time ago I posted a patch series that makes scsi_remove_host() wait until all error handler callback functions have finished and that also prevents any new error handler function calls from being initiated after scsi_remove_host() has finished (http://thread.gmane.org/gmane.linux.scsi/82572/focus=87985). Should I repost that patch series?

Please keep the workaround in srp for now, and then resend the series, including a new patch to remove the workaround in srp.


Shouldn't we do this validity check inside scsi_host_find_tag, so that all callers get it? That means adding an argument to it, but there are very few callers at the moment.

Hello Christoph,

That pr_err() statement was convenient while debugging the multiqueue code in the SRP initiator driver but can be left out. Would you agree with leaving out the above three lines of debug code instead of adding an additional argument to scsi_host_find_tag()?

Bart.

That pr_err() statement was convenient while debugging the multiqueue code in the SRP initiator driver but can be left out. Would you agree with leaving out the above three lines of debug code instead of adding an additional argument to scsi_host_find_tag()?

Feel free to remove the check.


Post by Bart Van Assche
Improve performance by using multiple RDMA/RC channels per SCSI host for communication with an SRP target. About the implementation:
- Introduce a loop over all channels in the code that uses target->ch.
- Set the SRP_MULTICHAN_MULTI flag during login for the creation of the second and subsequent channels.
- RDMA completion vectors are chosen such that RDMA completion interrupts are handled by the CPU socket that submitted the I/O request. As one can see in this patch it has been assumed that, if a system contains n CPU sockets and m RDMA completion vectors have been assigned to an RDMA HCA, IRQ affinity has been configured such that completion vectors [i*m/n..(i+1)*m/n) are bound to CPU socket i with 0 <= i < n.
- Modify srp_free_ch_ib() and srp_free_req_data() such that it becomes safe to invoke these functions after the corresponding allocation function failed.
- Add a ch_count sysfs attribute per target port.
---
 Documentation/ABI/stable/sysfs-driver-ib_srp |  25 ++-
 drivers/infiniband/ulp/srp/ib_srp.c          | 291 ++++++++++++++++++++-------
 drivers/infiniband/ulp/srp/ib_srp.h          |   3 +-
 3 files changed, 238 insertions(+), 81 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-driver-ib_srp b/Documentation/ABI/stable/sysfs-driver-ib_srp
index b9688de..d5a459e 100644
--- a/Documentation/ABI/stable/sysfs-driver-ib_srp
+++ b/Documentation/ABI/stable/sysfs-driver-ib_srp
@@ -55,12 +55,12 @@ Description:	Interface for making ib_srp connect to a new target.
 		only safe with partial memory descriptor list support enabled
 		(allow_ext_sg=1).
 		* comp_vector, a number in the range 0..n-1 specifying the
-		  MSI-X completion vector. Some HCA's allocate multiple (n)
-		  MSI-X vectors per HCA port. If the IRQ affinity masks of
-		  these interrupts have been configured such that each MSI-X
-		  interrupt is handled by a different CPU then the comp_vector
-		  parameter can be used to spread the SRP completion workload
-		  over multiple CPU's.
+		  MSI-X completion vector of the first RDMA channel. Some
+		  HCA's allocate multiple (n) MSI-X vectors per HCA port. If
+		  the IRQ affinity masks of these interrupts have been
+		  configured such that each MSI-X interrupt is handled by a
+		  different CPU then the comp_vector parameter can be used to
+		  spread the SRP completion workload over multiple CPU's.

This is fairly non-trivial for the user...

Aren't we requesting a bit too much awareness here?
Can't we just "make it work"? The user hands out ch_count - why can't
you do some least-used logic here?

Maybe we can even go with per-CPU QPs and discard the comp_vector argument?
This would probably bring the best performance, wouldn't it?
(fall back to least-used logic in case the HW supports fewer vectors)

Post by Bart Van Assche
		* tl_retry_count, a number in the range 2..7 specifying the
		  IB RC retry count.
		* queue_size, the maximum number of commands that the
@@ -88,6 +88,13 @@ Description: Whether ib_srp is allowed to include a partial memory
 		descriptor list in an SRP_CMD when communicating with an SRP
 		target.
+What:		/sys/class/scsi_host/host<n>/ch_count
+Date:		November 1, 2014
+KernelVersion:	3.18
+Description:	Number of RDMA channels used for communication with the SRP
+		target.
+
 What:		/sys/class/scsi_host/host<n>/cmd_sg_entries
 Date:		May 19, 2011
 KernelVersion:	2.6.39
@@ -95,6 +102,12 @@ Contact:	linux-rdma-***@public.gmane.org
 Description:	Maximum number of data buffer descriptors that may be sent to
 		the target in a single SRP_CMD request.
+What:		/sys/class/scsi_host/host<n>/comp_vector
+Date:		September 2, 2013
+KernelVersion:	3.11
+Description:	Completion vector used for the first RDMA channel.
+
 What:		/sys/class/scsi_host/host<n>/dgid
 Date:		June 17, 2006
 KernelVersion:	2.6.17
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index eccaf65..80699a9 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -123,6 +123,11 @@ MODULE_PARM_DESC(dev_loss_tmo,
 		 " if fast_io_fail_tmo has not been set. \"off\" means that"
 		 " this functionality is disabled.");
+static unsigned ch_count;
+module_param(ch_count, uint, 0444);
+MODULE_PARM_DESC(ch_count,
+		 "Number of RDMA channels to use for communication with an SRP target. Using more than one channel improves performance if the HCA supports multiple completion vectors. The default value is the minimum of four times the number of online CPU sockets and the number of completion vectors supported by the HCA.");
+

Post by Bart Van Assche
		* comp_vector, a number in the range 0..n-1 specifying the
-		  MSI-X completion vector. Some HCA's allocate multiple (n)
-		  MSI-X vectors per HCA port. If the IRQ affinity masks of
-		  these interrupts have been configured such that each MSI-X
-		  interrupt is handled by a different CPU then the comp_vector
-		  parameter can be used to spread the SRP completion workload
-		  over multiple CPU's.
+		  MSI-X completion vector of the first RDMA channel. Some
+		  HCA's allocate multiple (n) MSI-X vectors per HCA port. If
+		  the IRQ affinity masks of these interrupts have been
+		  configured such that each MSI-X interrupt is handled by a
+		  different CPU then the comp_vector parameter can be used to
+		  spread the SRP completion workload over multiple CPU's.

This is fairly non-trivial for the user...
Aren't we requesting a bit too much awareness here?
Can't we just "make it work"? The user hands out ch_count - why can't
you do some least-used logic here?

Maybe we can even go with per-CPU QPs and discard the comp_vector argument?
This would probably bring the best performance, wouldn't it?
(fall back to least-used logic in case the HW supports fewer vectors)

Hello Sagi,

The only reason the comp_vector parameter is still supported is because
of backwards compatibility. What I expect is that users will set the
ch_count parameter but not the comp_vector parameter.

Using one QP per CPU thread does not necessarily result in the best
performance. In the tests I ran, performance was about 4% better when
using one QP for each pair of CPU threads (with hyperthreading enabled).

Post by Bart Van Assche
+static unsigned ch_count;
+module_param(ch_count, uint, 0444);
+MODULE_PARM_DESC(ch_count,
+		 "Number of RDMA channels to use for communication with an
SRP target. Using more than one channel improves performance if the
HCA supports multiple completion vectors. The default value is the
minimum of four times the number of online CPU sockets and the number
of completion vectors supported by the HCA.");

Why? How did you get to this magic equation?

On the systems I have access to, measurements have shown that this choice
for the ch_count parameter results in a significant performance
improvement without consuming too many system resources. The performance
difference when using more than four channels was small. This means that
the exact value of this parameter is not that important. What matters to
me is that users can benefit from improved performance even if the
ch_count kernel module parameter has been left to its default value.

Bart.

Post by Bart Van Assche
		* comp_vector, a number in the range 0..n-1 specifying the
-		  MSI-X completion vector. Some HCA's allocate multiple (n)
-		  MSI-X vectors per HCA port. If the IRQ affinity masks of
-		  these interrupts have been configured such that each MSI-X
-		  interrupt is handled by a different CPU then the comp_vector
-		  parameter can be used to spread the SRP completion workload
-		  over multiple CPU's.
+		  MSI-X completion vector of the first RDMA channel. Some
+		  HCA's allocate multiple (n) MSI-X vectors per HCA port. If
+		  the IRQ affinity masks of these interrupts have been
+		  configured such that each MSI-X interrupt is handled by a
+		  different CPU then the comp_vector parameter can be used to
+		  spread the SRP completion workload over multiple CPU's.

This is fairly non-trivial for the user...
Aren't we requesting a bit too much awareness here?
Can't we just "make it work"? The user hands out ch_count - why can't
you do some least-used logic here?

Maybe we can even go with per-CPU QPs and discard the comp_vector argument?
This would probably bring the best performance, wouldn't it?
(fall back to least-used logic in case the HW supports fewer vectors)

Hello Sagi,

The only reason the comp_vector parameter is still supported is because
of backwards compatibility. What I expect is that users will set the
ch_count parameter but not the comp_vector parameter.

Agreed...

Post by Bart Van Assche
Using one QP per CPU thread does not necessarily result in the best
performance. In the tests I ran, performance was about 4% better when
using one QP for each pair of CPU threads (with hyperthreading enabled).

I usually don't like using defaults based on empirical experiments on
specific workloads. IMO, either go full-blown MQ (per-CPU) or go SQ by
default.

Post by Bart Van Assche
+static unsigned ch_count;
+module_param(ch_count, uint, 0444);
+MODULE_PARM_DESC(ch_count,
+		 "Number of RDMA channels to use for communication with an
SRP target. Using more than one channel improves performance if the
HCA supports multiple completion vectors. The default value is the
minimum of four times the number of online CPU sockets and the number
of completion vectors supported by the HCA.");

Why? How did you get to this magic equation?

On the systems I have access to, measurements have shown that this choice
for the ch_count parameter results in a significant performance
improvement without consuming too many system resources. The performance
difference when using more than four channels was small. This means that
the exact value of this parameter is not that important. What matters to
me is that users can benefit from improved performance even if the
ch_count kernel module parameter has been left to its default value.

I do like the idea of giving users high performance out-of-the-box. But
as I wrote below, I like less the idea of basing your choice on
experiments.

Post by Bart Van Assche
Improve performance by using multiple RDMA/RC channels per SCSI
host for communication with an SRP target. About the implementation:
- Introduce a loop over all channels in the code that uses
  target->ch.
- Set the SRP_MULTICHAN_MULTI flag during login for the creation
  of the second and subsequent channels.
- RDMA completion vectors are chosen such that RDMA completion
  interrupts are handled by the CPU socket that submitted the I/O
  request. As one can see in this patch, it has been assumed that if
  a system contains n CPU sockets and m RDMA completion vectors
  have been assigned to an RDMA HCA, IRQ affinity has been
  configured such that completion vectors [i*m/n..(i+1)*m/n) are
  bound to CPU socket i with 0 <= i < n.
- Modify srp_free_ch_ib() and srp_free_req_data() such that it
  becomes safe to invoke these functions after the corresponding
  allocation function failed.
- Add a ch_count sysfs attribute per target port.