The private key is checked quite carefully -- even to a fault -- for
being sensibly sized, but the corresponding function for public keys
appears to have no checking at all. This is a shame since message-
representative construction assumes that the message representative will
fit in a fixed-size buffer.

While 15360-bit RSA keys are rather large, they're not completely beyond
the realms of possibility and it seems unreasonable to forbid
them. (Specifically, 15360 is the length recommended by NIST for
256-bit security levels.)

slip_unstuff previously responded to bad SLIP encoding or an overlong
SLIP packet by terminating secnet completely with a fatal error or
assertion failure. It turns out that SLIP encoding corruption can
occur through external action (for example, if you attach gdb to
secnet then things get a bit confused when gdb suspends it) and so
this should be a handled error condition rather than a terminal panic.

Therefore, introduce a system whereby SLIP decoding errors result in a
warning log message followed by ignoring everything in the SLIP stream
up to the next (unescaped) END byte, whereupon we resynchronise and
start decoding again.

If a netlink is configured with OPT_SOFTROUTE, we expect to be adding
and removing the kernel routing table entry as the remote site appears
and disappears, so tun_set_route should bring the route up or down
depending on route->up. However, if it isn't, then we won't be doing
that during PHASE_RUN, and therefore we should bring the route up once
and for all at startup.

Previously the state was as follows:

secnet's 'tun' netlink will add and remove kernel routing table
entries during PHASE_RUN if the OPT_SOFTROUTE option is set.
However, at startup during PHASE_GETRESOURCES it will set up routes
for a site only if that site lists an address. So if you have a site
in your sites file with no address but also haven't enabled
OPT_SOFTROUTE (e.g. because you run secnet so that it drops privs),
then no route for that site will _ever_ be set up.

These two policies don't match. We should bring a site's route(s) up
at startup in any situation where we will not be prepared to do so
dynamically during run time. In other words, routes should be added
at startup time not only if they have a fixed address parameter, but
also if they do not have OPT_SOFTROUTE set.

This commit implements this fix, and causes my home secnet
implementation to be able to route to my laptop successfully
(sgo/resolution/resolution). In order to do this I've had to move the
definition of OPT_SOFTROUTE out of netlink.c into netlink.h so that
tun_set_route can see it, which suggests a possible layering
violation, but on the other hand since the netlink_client structure is
visible outside netlink.c it seems only reasonable that the bit flags
used in its 'options' field should be visible too.

When we time out in state SENTMSG5, keep the key we negotiated.
SENTMSG5 gives the peer permission to start sending packets with it so
we need to be able to decrypt them. If we see such packets, we switch
to using the new key at that point and throw the old key away.

This is the final fix to the "connectivity loss during final key
setup can cause locked-up session" bug.

After we have switched to a new key, keep the old key around until we
see packets using the new key.

This is part of the fix to the "connectivity loss during final key
setup" bug. Fixing this requires that both ends be willing to keep
both old and new data keys available until the peer has sent data with
the new key (which might never happen).

If there were also a key setup, the site would then need three keys:
the old data key, the current data key which the peer hasn't started
using yet, and the fresh key it is trying to negotiate. So we have to
a third key, with its own lifetime expiry.

In the code we call the old key the "auxiliary" key because we're
going to use it for an additional fixup in the next patch.

site: Deal with losing our MSG6 - retransmit MSG6 when we see MSG5 in RUN

Fix the lost MSG6 bug in the case where we are the responder:

If we are in RUN and we see a SENTMSG5 encrypted with what we now
regard as the data transfer key, conclude that the peer must have lost
the MSG6. Regenerate and transmit a new MSG6.

This requires a generalised version of generate_msg6 which allows the
new call site to specify which transform and session id to use, and
which doesn't set st->retries. So break the meat of generate_msg6 out
into a new function which I have chosen to call create_msg6. (The
fact that functions called `generate_msg*' set st->retries is slightly
unfortunate.)

We also have to add a transform argument to process_msg5 so that we
can try decrypting MSG5s with the data transfer key.

It is still possible, even after this patch, to get the two ends out
of sync: if, just after the responder has switched to the new key and
entered RUN, the network fails entirely for the key setup timeout, the
initiator will abandon the key setup and switch to WAIT, and continue
to use the new key. However, the responder has already discarded the
old key.

To fix this is not trivial: it will require us to introduce a third
key, with its own lifetime expiry. This will be in a later patch.

The responder sends MSG6 on its transition from SENTMSG4 to RUN, on
receipt of the initiator's MSG5. Unlike all the other key protocol
messages, MSG6 is not retransmitted (for there is no retransmission in
RUN). If the responder's MSG6 is lost, the initiator has no way of
knowing and won't switch to the new key. The initator will keep
retransmitting MSG5, which the responder (in old versions of secnet)
ignores. The responder will be sending data with the new key, which
the initiator (in old versions of secnet) ignores. This broken
situation can persist until the initiator sends data, which it will do
with the old key, possibly causing a new key exchange attempt to be
initiated by the original responder (in old versions of secnet only),
which may or may not experience the same bug.

This bug needs to be fixed separately at both ends, ideally, so that
running a fixed version at either end avoids the bug.

Here we fix for the case where we are the initiator, as follows:

If are in SENTMSG5 and we see data encrypted with the new key,
conclude that we must have lost the MSG6 and simply switch to the new
key, and state RUN, right away.

Because the transform operates destructively (and changing that fact
about the transform API would be too intrusive), this means we need to
keep a copy of the original ciphertext in process_msg0.

For this purpose we introduce a new buffer `scratch' in the site state
structure, and a new buffer_copy utility function for making a copy of
a buffer (reallocating the destination buffer if necessary).

site, transform: Do not initiate rekey when packets too much out of order

If packets arrive sufficiently out of order, we may unnecessarily
initiate a rekey simply because of the skew. Solve this as follows:
* transform->reverse gives a distinct error code for `too much skew'.
* site does not initiate a rekey when it sees that error code.

The rekey-on-decrypt-failure feature is in principle unnecessary.
However, due to another bug it is possible for a key setup to be
regarded as successful at only one of the two ends, and the ability to
immediately do another key setup to try to fix things is useful.

In principle this new behaviour might prevent a necessary rekey: if
the sender has sent around 2^31 packets none of which were received,
the receiver would see the new packets as too old. But: firstly, this
is very unlikely (it would have involved transmitting several Gbytes
into the void, in a period less than the maximum key lifetime).
Secondly, we ought to expire keys before then anyway so the sender
should know that the key had `worn out' (although currently this is
not enforced).

The benefit of this change is that the old behaviour would be likely
to initiate unnecessary rekeys when on unreliable links (eg, mobile
internet). Unnecessary rekeys are a bad idea in these circumstances
not just because they clog up the link but also because they make the
association more fragile.

It is not necessary to check whether we have a current key before
attempting to unpick and decrypt a MSG0. If we don't,
transform->reverse will simply fail due to lacking a key.

This check needs to be removed, because in forthcoming patches we will
want to be able to try to decrypt the packet with other keys, at which
point the lack of a currently valid data transfer key will not be
relevant and this check will be harmful.

When there are multiple peer addresses, attempts to copy them from one
table of peers addresses to another with transport_peers_copy will go
wrong because the stride argument to transport_peers_debug is wrong -
we take sizeof() the pointer rather than of the array element.

This will typically cause a segfault if it happens, but the bug can
only be triggered if LOG_PEER_ADDRS debugging is enabled.

If a message doesn't survive the vsnprintf in vMessage untruncated,
the result may be that its newline is lost. After that, buff will
never manage to gain a newline and all further messages will be
discarded.

Fix this by always putting "\n\0" at the end of buff. That way a
truncated message is printed and the buffer will be cleared out for
the next call.

This does mean that sometimes a message which is just slightly shorter
than the buffer will produce a spurious empty message sent to the
syslog. However, the whole logging approach here will inevitably
occasionally insert spurious newlines, if only to split up long
messages; long messages may also be truncated. The bad effect of the
change in this patch is simply to make the length at which
infelicities appear very slightly shorter. The good effect is that it
turns a very nasty failure mode into a benign one.

So I think it is reasonable to err on the side of code which clearly
cannot go wrong in a bad way. Other approaches which make the
infelicity threshold a couple of characters longer, or which slightly
reduce the different kinds of infelicity, are much more complicated
and therefore more likely to have bugs.

vMessage looks for the system log instance. If there is one, it uses
it; otherwise it just uses stderr or stdout.

The system log instance consists, eventually, of syslog_vlog or
logfile_vlog. Both of these functions have a fallback: if they are
properly set up they do their thing; otherwise they call vMessage.
This is wrong because in this situation they have probably been called
by vMessage so it doesn't help.

At first sight this ought to produce unbounded recursion but for
complicated reasons another bug prevents this. Instead, messages can
just vanish.

Break out the fallback mode of [v]Message into [v]MessageFallback so
that syslog_vlog and logfile_vlog can use it directly.

transform_reverse checks to see if the incoming message is a multiple
of the block cipher block size. If the message fails this check, it
sets *errmsg but it fails to return. Instead, it proceeds to attempt
to decrypt it. This will involve a buffer read/write overrun.

It transform_reverse fails to check to see if the incoming message is
long enough to contain the IV, MAC and padding. This can easily be
exploited to cause secnet to segfault.

buffer_unprepend should check that the buffer has enough data in it to
unprepend. With the other patches, this should be irrelevant, but it
turns various other potential read overrun bugs into crashes.

site_incoming should check that the incoming message is long enough.
Again, without this, this is an buffer read overrun or possibly a
segfault.

Previously the userv-ipif netlink would not request (from userv) a
route for a site for which the link quality was DOWN. The link
quality is a dynamic quantity but userv-ipif lacks any machinery for
dynamically adding routes, so this is wrong.

Instead, in userv-ipif, unconditionally add routes for all sites,
regardless of link up status.

In practice this code is run during startup and the only reason a link
might be down at that point, ie LINK_QUALITY_DOWN, is that it does not
have an address configured. Mobile sites are often in this situation.

Restrict the "include" directive to the "header" of -u (groupfile
update) mode. Callers who are simply using make-secnet-sites to
transform a (possibly untrusted) sites file into a (to be trusted)
sites.conf file should not have to worry about includes.

Whenm make-secnet-sites is writing the "output" sites file in -u
(groupfile update) mode, it includes the effective contents of files
referenced in "include" directives, rather than the "include"
directive itself.

So the "output" sites file does not any longer depend on any files
included by the header.

make-secnet-sites generates a config file which defines the keys
"vpn-data", "vpn", and "all-sites".

To make it possible to usefully include the output of more than one
different piece of output from make-secnet-sites, support the option
-P <prefix>, which generates a config file which defines
"<prefix>vpn-data", "<prefix>vpn", and "<prefix>all-sites".

The commit 9b8369e07aeba5ed2c69fb4a7f74d07c8cebe015
make-secnet-sites: refactor to break out new function "pfilepath"
broke the userv service invocation, because it turned out that later
code depended on the "headerinput" variable whose assignment had been
removed and replaced by a call to pfilepath.

Make pfilepath return the lines read from the file and assign the
result to headerinput.

Previously the userv-ipif netlink would not request (from userv) a
route for a site for which the link quality was DOWN. The link
quality is a dynamic quantity but userv-ipif lacks any machinery for
dynamically adding routes, so this is wrong.

Instead, in userv-ipif, unconditionally add routes for all sites,
regardless of link up status.

In practice this code is run during startup and the only reason a link
might be down at that point, ie LINK_QUALITY_DOWN, is that it does not
have an address configured. Mobile sites are often in this situation.

* The version of secnet 0.1.16 tagged as such in revision control
contains a "fix" which was based on the authbind documentation but
not apparently tested against authbind. Ie, this part from NEWS:
4) Change the endianess of the arguments to authbind-helper.
sprintf("%04X") already translates from machine repesentation to most
significant octet first so htons reversed it again.

* The version of secnet 0.1.16 actually in service on chiark had an
out-of-version-control change to udp.c to make it work with
chiark's authbind 1.2.0. The actual code found has been recorded
on the dead branch "chiark-0.1.16" in the master git repo, but the
version of udp.c is exactly that from 0.1.15 so it looks like we
just reverted to the previous udp.c during deployment of 0.1.16.

* We (re)discovered all this after the release of secnet 0.2.0
because my attempt to deploy 0.2.0 on chiark was not actually
effective.

Therefore, undo the authbind endianness change introduced in secnet
0.1.16. This is most easily achieved by constructing the arguments to
the helper from the sockaddr rather than the contents of "st".

The effect is secnet is told that it is running as a daemon right from
the start, so it knows to follow the logging rules for daemons but not
to fork.

The conflation of daemonization with dropping privilege is also
unpicked by this patch. Most importantly this ensures that errors
from PHASE_GETRESOURCES operations such as 'route' commands is sent to
the logfile (or syslog).

The logic is that if Fink is in the path then we should use it. So
(unlike certain build systems I could mention) if you want to use a
local install of GMP, it should be sufficient to run ./configure
without /sw/bin in the path.

We introduce a new config option "local-mobile" which must match our
peers' idea of whether we are mobile. This is cross-checked with the
"site" entry for our own site, if there is one, to detect mistakes.

We do not transfer this data in the protocol because we don't want to
break compatibility with older secnets which do not understand mobile
peers at all.

mobile (bool): if True then peer is "mobile" ie we assume it may change
its apparent IP address and port number without either it or us
being aware of the change; so, we remember the last several port/addr
pairs we've seen and send packets to all of them (subject to a timeout).
We maintain one set of addresses for key setup exchanges, and another for
data traffic. [false]

In the code this involves replacing the comm_addrs for setup_peer and
peer with a new struct transport_peers which contains zero or more
comm_addrs. This subsumes peer_valid, too.

Additionally, we are slightly cleaner about the use of setup_peer: we
ensure that we clean it out appropriately when we go into states where
it won't (shouldn't) be used.

The transport_peers structure is opaque to most of site.c and is
manipulated by a new set of functions which implement the detailed
semantics described in site.c.

The main code in site.c is no longer supposed to call dump_packet and
st->comm->sendmsg directly; it should use transport_xmit (which will
do dump_packet as well as appropriate sendmsgs).

No intentional functional change if mobile=false (the default) and the
new debug log feature ("peer-addrs") is not enabled.

If data is exchanged after the "renegotiation time", we need to
refresh the data transfer key - ie, to initiate a key setup.
Previously this was done by the peer which wanted to transmit data
using an existing but in-need-of-refreshing key.

However, mobile peers may often be disconnected. There is no point
trying to do a key renegotiation while they are disconnected. Thus it
is better to have the renegotiation initiated by a peer which receives
a data packet. That means that if there is a network outage,
renegotiation will be deferred until the network is restored.

In particular, it means that a mobile node which has no underlying
network but which has applications trying to send data will not waste
effort attempting key renegotiation until it once more has
connectivity.

This minor functional change should be harmless or even beneficial for
non-mobile sites too. It simply means that the other peer will play
the role of initiator during renegotiation, but since which peer
played this role is arbitrary for non-mobile sites this should make no
difference.

Compatibility: In the case of an old version of secnet talking to a
new version, only data packets in one direction will cause
renegotiation. This should not be a problem since all real-world
IP protocols involve data in both directions.

So we make the new behaviour universal rather than making it depend on
the forthcoming "mobile-peer" site config option.

* Quote the correct default value for setup-timeout
(setup_retry_interval) in the README.
* Improve the documentation of the default value for renegotiate-time.

Three simple internal improvements:

* Change definitions of DEFAULT_* to forms which make the semantics
of the values clearer, thus obviating the need for the comments
giving human-readable values.
* Add comments to DEFAULT_* time values giving the units (currently,
they are all in millisecons).
* Rename (in the code) SETUP_TIMEOUT and setup_timeout to
SETUP_RETRY_INTERVAL, which is more accurate. We leave the config
dictionary entry with the misleading name (particularly, since we
have no facility for spotting obsolete or misspelled config keys).

Also two changes which are preparatory to the mobile peers support:

* New macro DEFAULT(D) which currently simply expands to DEFAULT_*.
* New convenience macro CFG_NUMBER for a common pattern of use for
dict_read_number, which uses DEFAULT().

This abstracts away the fact that the peer address is a sockaddr_in;
instead, most of the code in site.c now only handles the peer address
as an opaque structure.

We put the struct comm_if* in the address structure too, as this will
be useful later. Amongst other things, doing this arranges that the
comm client knows which comm is notifying about an incoming packet.
Previously the client was expected to "just know" because the only
actual client in secnet is site which currently only deals with one
comm.

Also make the relevant arguments const-correct.

Contains consequential changes the signatures of many functions but no
intentional functional change.

Insert a call to dump_packet in send_msg7. The packet is mostly
ciphertext and may not make a great deal of sense but turning on
debugging should not show all management packets; only data packets
should not be shown.

If a MSG1 (key setup initiation packet) is received containing
expected local and remote site names, the receiving secnet will start
a key setup attempt with details from that packet.

MSG1 packets are (almost necessarily) unauthenticated, so anyone on
the Internet can cause this to happen. secnet is only willing to have
one key exchange attempt ongoing at once, and will ignore subsequent
incoming MSG1s until it has dealt with the first key exchange attempt.

So this means that an attacker who can send packets to any secnet
instance can DoS secnet at session setup (or key renewal) time. All
the attacker needs to know is the secnet site names, and the IP
address and port number of one of the secnets. The attacker does not
need to spoof their IP address or know any secret keys.

If the attacker sends a contant stream of bogus packets they can
probably prevent the link coming up at all.

This is difficult to fix without changing the protocol.

However, there is worse: when the key setup with the bogus peer
eventually fails, as it must, secnet invalidates the current session
key and its note of where to send actual data packets. It will then
refuse to attempt a new key exchange for a timeout period. During
this period, data packets will not flow.

This means that sending one fairly easy to construct udp packet can
cause a 20s outage. Worse, after this one packet has had its effect,
the attacker can prevent the connection being reestablished, as
described above.

In this patch we fix the latter problem. It is simply a bug that the
session key and data transport peer address (resulting from a previous
successful key exchange) are discarded when a key setup fails.

We also provide a test program "test-example/bogus-setutp-request.c"
which can be used to reproduce the problem.

(1) If the subprocess exits nonzero then the exit status
is unpicked and logged.
(2) If the exec in the child fails, the command and
errno string are written to stderr (which should
end up in secnet's usual log output).
(3) _exit() is used instead of exit(), to avoid
any possibility of craziness with stdio/atexit/etc.

tun.c attempted to guess at runtime what kind of tun interface to use
by looking at uname. Specifically, for Linux it would do
TUN_FLAVOUR_BSD if u.release starts 2.2
TUN_FLAVOUR_LINUX if u.release starts 2.4
and require the user to specify, otherwise.

Linux 2.6 and all later versions use TUN_FLAVOUR_LINUX. The guessing
code is pretty nasty and Linux 2.2 is very very obsolete, so drop it.

Commit 29672515
portability: Work around Apple's bizarrely deficient poll() implementation
introduced a read of unitialised data in fds[], because it failed to
honour i->nfds which is 0 on the first iteration.

The symptom is that secnet may, on the first iteration, die due to
mistakenly seeing POLLNVAL.