After some further fiddling, I should redefine the problem.
It's not that I'm having problems when connecting simultaneously to a large
number of hosts, but rather I'm having problems when doing so over a
forwarded ssh connection.
So I'm doing this:
ssh-agent (localmachine) -> jump server -> 1300 hosts
The 1300 hosts have to contact my ssh-agent by going back through the jump
server, then back to my local machine.
It just so happens that this site has 2 jump servers. So I ran this test:
ssh-agent (jump1) -> 1300 hosts
This works perfectly. I can contact all 1300 hosts without problems when
running the agent on this machine. I had theorized that this was because the
latency was so low that connections got in and out before any limits were
reached. However, I then tested this setup:
ssh-agent (jump1) -> jump2 (same DC/same rack) -> 1300 hosts
And this fails in the same manner as it does when running the agent on my
local machine.
I also noticed something interesting in netstat. Typical connections look
like this (on jump1):
unix 3 [ ] STREAM CONNECTED 8379279 18988/ssh
unix 3 [ ] STREAM CONNECTED 8379536 18961/ssh-agent /tmp/ssh-Ahvjg18960/agent.18960
unix 3 [ ] STREAM CONNECTED 8379277 18988/ssh
unix 3 [ ] STREAM CONNECTED 8379534 18961/ssh-agent /tmp/ssh-Ahvjg18960/agent.18960
So there is a pair of connections (I'm assuming one connection to the agent,
and one from ssh for the forwarded connection). Indeed, 18988 is the pid of
my ssh connection to jump2, and 18961 is the pid of the agent.
But, I see lots of this sort of thing as well:
unix 3 [ ] STREAM CONNECTED 8379337 18988/ssh
unix 4 [ ] STREAM CONNECTING 0 - /tmp/ssh-Ahvjg18960/agent.18960
unix 3 [ ] STREAM CONNECTED 8379335 18988/ssh
unix 4 [ ] STREAM CONNECTING 0 - /tmp/ssh-Ahvjg18960/agent.18960
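To put numbers on this rather than eyeballing it, here is a rough sketch that tallies CONNECTED versus CONNECTING unix stream sockets from netstat output. The function name count_socket_states is made up for illustration, and it assumes each socket is on a single line, the way netstat prints it. My working assumption (not verified) is that a CONNECTING entry is a client still waiting for the agent to accept it, so a growing count would point at the agent falling behind.

```shell
# Tally unix stream sockets by state from `netstat -xp` output on stdin.
# Hypothetical helper; assumes one socket per input line.
count_socket_states() {
    awk '
        /STREAM/ && /CONNECTED/  { connected++ }
        /STREAM/ && /CONNECTING/ { connecting++ }
        END { printf "connected=%d connecting=%d\n", connected, connecting }
    '
}

# Usage (on the machine running the agent):
#   netstat -xp 2>/dev/null | count_socket_states
```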
> Is the ssh-agent running as a user, or as root? Can you verify that the
> user's limits aren't getting in the way (ulimit -a)? You've confirmed
> with /proc/sys/fs/file-nr that you're not running into limits there?
I'm running as a user, but I have tried it as root. I've also set as many
ulimit parameters as possible to unlimited, and raised the open-file limit
to very large numbers. /proc/sys/fs/file-nr doesn't show any problems with
limits either.
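For anyone wanting to repeat the limit checks, a minimal sketch follows. The helper name file_handle_usage is hypothetical; it only reformats the three fields of /proc/sys/fs/file-nr (allocated, allocated but unused, maximum). One thing worth keeping in mind is that limits are inherited when a process starts, so raising ulimit in a shell does not change an agent that is already running; /proc/<pid>/limits shows what the running process actually has.

```shell
# Report file handles in use versus the system-wide maximum.
# Reads the contents of /proc/sys/fs/file-nr on stdin
# (fields: allocated, allocated-but-unused, maximum).
file_handle_usage() {
    awk '{ printf "in_use=%d max=%s\n", $1 - $2, $3 }'
}

# Usage on a Linux host:
#   file_handle_usage < /proc/sys/fs/file-nr
#   ulimit -n                                  # open-file limit of this shell
#   grep 'Max open files' /proc/18961/limits   # limit of the running agent (pid 18961 above)
```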
> > Another clue to the puzzle. I have 1300 or so machines in a DC in Hong
> > Kong, only available through a jump server in the same DC. If I'm running
> > my agent on my local machine, through the jump server, and connect to all
> > the machines, connections time out, agent locks up, etc. However, if I copy
> > my keys to the jump box, and run the agent from there, no connections fail,
> > and all connections complete very quickly. I assume that this is because
> > connections open and close quickly enough that whatever limit I'm hitting
> > isn't reached (netstat snapshots every second show around 200 max concurrent
> > connections).
> Aha. That does sound like it may be helpful information.
> When connecting through the jump server, does it create these hundreds
> of simultaneous connections from your host, or a single one to the jump
> server which then fans out the connections?
Maybe I can answer that with some data. Below should show the number of ssh
and number of ssh-agent connections in netstat on both machines. The ssh
connections to the 1300 hosts are fanned out from the jump server (jump2 in
this case), but all the authentication requests get forwarded back to the
machine with the agent.
[jump1 ~]# netstat -xp | grep -c ssh\
590
[jump1 ~]# netstat -xp | grep -c ssh.*ag
542
[jump2 ~]# netstat -xp | grep -c ssh\
676
[jump2 ~]# netstat -xp | grep -c ssh.*ag
825
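If it's useful, the same kind of count can be split by owning process instead of by grep pattern. This is only a sketch and not the command I ran above: count_by_program is a made-up name, and it assumes the PID/Program column looks like 18988/ssh or 18961/ssh-agent as in the netstat output earlier. Sockets that netstat shows with "-" in that column are not counted.

```shell
# Count unix sockets owned by ssh and by ssh-agent, based on the
# PID/Program column of `netstat -xp` output read from stdin.
# Hypothetical helper; lines without a PID/Program entry are ignored.
count_by_program() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^[0-9]+\/ssh$/)       ssh_n++
                if ($i ~ /^[0-9]+\/ssh-agent$/) agent_n++
            }
        }
        END { printf "ssh=%d ssh-agent=%d\n", ssh_n, agent_n }
    '
}

# Usage, one snapshot per second until interrupted:
#   while sleep 1; do netstat -xp 2>/dev/null | count_by_program; done
```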
> I would also verify that entropy is still available on the jump server
> and make sure that the jump server has appropriate settings in
> /etc/ssh/sshd_config for AllowAgentForwarding, MaxSessions, and
> MaxStartups (see the manpage for sshd_config).
AllowAgentForwarding should be fine (I'm using ssh -A in any case), but I'll
take a look at the other sshd settings; there may be something there. With
default limits of 10, though, it seems strange that I'd be getting over 600
successful connections in a run. I'll see what happens.
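A quick way to look at those three settings, as a sketch: the filter below (show_sshd_limits is a hypothetical name) picks the relevant keywords out of whatever it is fed, whether that is the effective configuration printed by sshd -T or the config file itself. The entropy check assumes a Linux jump server.

```shell
# Keep only the sshd settings mentioned above from config text on stdin.
# Matching is case-insensitive, so it works on `sshd -T` output (lowercase
# keywords) as well as on an sshd_config file.
show_sshd_limits() {
    grep -Ei '^(allowagentforwarding|maxsessions|maxstartups)[[:space:]]'
}

# Usage (as root on the jump server):
#   sshd -T | show_sshd_limits
#   show_sshd_limits < /etc/ssh/sshd_config      # only lines set explicitly
#   cat /proc/sys/kernel/random/entropy_avail    # available entropy (Linux)
```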
Thanks for the ideas!
--Bob