split PRU_CONNECT2 & PRU_PURGEIF function out of pr_generic() usrreq
switches and put into separate functions
- always KASSERT(solocked(so)) even if not implemented
(for PRU_CONNECT2 only)
- replace calls to pr_generic() with req = PRU_CONNECT2 with calls to
pr_connect2()
- replace calls to pr_generic() with req = PRU_PURGEIF with calls to
pr_purgeif()
put common code from unp_connect2() (used by unp_connect() into
unp_connect1() and call out to it when needed
patch only briefly reviewed by rmind@

split PRU_RCVD function out of pr_generic() usrreq switches and put into
separate functions
- always KASSERT(solocked(so)) even if not implemented
- replace calls to pr_generic() with req = PRU_RCVD with calls to
pr_rcvd()

split PRU_SENDOOB and PRU_RCVOOB function out of pr_generic() usrreq
switches and put into separate functions
xxx_sendoob(struct socket *, struct mbuf *, struct mbuf *)
xxx_recvoob(struct socket *, struct mbuf *, int)
- always KASSERT(solocked(so)) even if request is not implemented
- replace calls to pr_generic() with req = PRU_{SEND,RCV}OOB with
calls to pr_{send,recv}oob() respectively.
there is still some tweaking of m_freem(m) and m_freem(control) to come
for consistency. not performed with this commit for clarity.
reviewed by rmind

* split PRU_ACCEPT function out of pr_generic() usrreq switches and put
into a separate function xxx_accept(struct socket *, struct mbuf *)
note: future cleanup will take place to remove struct mbuf parameter
type and replace it with a more appropriate type.
patch reviewed by rmind

sync with head.
for a reference, the tree before this commit was tagged
as yamt-pagecache-tag8.
this commit was splitted into small chunks to avoid
a limitation of cvs. ("Protocol error: too many arguments")

Add struct pr_usrreqs with a pr_generic function and prepare for the
dismantling of pr_usrreq in the protocols; no functional change intended.
PRU_ATTACH/PRU_DETACH changes will follow soon.
Bump for struct protosw. Welcome to 6.99.62!

- fsocreate: set SS_NBIO before the file descriptor is affixed as there is
a theoretical race condition (hard to trigger, though); remove the LWP
parameter and clean up the code a little.
- Sprinkle few comments.
- Remove M_SOOPTS while here.

Ensure that the top level sysctl nodes (kern, vfs, net, ...) exist before
the sysctl link sets are processed, and remove redundancy.
Shaves >13kB off of an amd64 GENERIC, not to mention >1k duplicate
lines of code.

Checkpoint work in progress:
- Initial split of the protocol user-request method into the following
methods: pr_attach, pr_detach and pr_generic for old the pr_usrreq.
- Adjust socreate(9) and sonewconn(9) to call pr_attach without the
socket lock held (as a preparation for the locking scheme adjustment).
- Adjust all pr_attach routines to assert that PCB is not set.
- Sprinkle various comments, document some routines and their locking.
- Remove M_PCB, replace with kmem(9).
- Fix few bugs spotted on the way.

Pull up revisions:
src/sys/kern/uipc_socket.c revision 1.213
src/sys/kern/uipc_syscalls.c revision 1.160
(requested by christos in ticket #822).
PR/47569: Valery Ushakov: SOCK_NONBLOCK does not work because it does not
set SS_NBIO.
XXX: there are too many flags that mean the same thing in too many places=
,
and too many flags that mean the same thing and are different.

PR/47569: Valery Ushakov: SOCK_NONBLOCK does not work because it does not
set SS_NBIO.
XXX: there are too many flags that mean the same thing in too many places,
and too many flags that mean the same thing and are different.

- Eliminate so_nbio and turn it into a bit SS_NBIO in so_state.
- Introduce MSG_NBIO so that we can turn non blocking i/o on a per call basis
- Use MSG_NBIO to fix the XXX: multi-threaded issues on the fifo sockets.
- Don't set SO_CANTRCVMORE, if we were interrupted (perhaps do it for all
errors?).

Pull up following revision(s) (requested by bouyer in ticket #1644):
sys/sys/socketvar.h: revision 1.126
sys/kern/init_main.c: revision 1.433
sys/kern/uipc_socket.c: revision 1.205
Fix kern/45093 as discussed on tech-kern@:
http://mail-index.netbsd.org/tech-kern/2011/06/17/msg010734.html
The cause of the problem is that the so_pendfree is processed with
the softnet_lock held at one point, and processing the list
calls sodoloanfree() which may kpause(). As the thread sleeps with
softnet_lock held, it ultimately cause a deadlock (see the PR or tech-kern
thread for details).
Although it should be possible to call sodopendfree() after releasing
the socket lock, it's not so easy to know where he socket lock is held and
where it's not, so we may hit the issue again later.
Add a kernel thread to handle the so_pendfree list, and wake up this
thread when adding mbufs to this list. Get rid of the various sodopendfree()
calls, hopefully fixing definitively the problem.

Pull up following revision(s) (requested by bouyer in ticket #1644):
sys/sys/socketvar.h: revision 1.126
sys/kern/init_main.c: revision 1.433
sys/kern/uipc_socket.c: revision 1.205
Fix kern/45093 as discussed on tech-kern@:
http://mail-index.netbsd.org/tech-kern/2011/06/17/msg010734.html
The cause of the problem is that the so_pendfree is processed with
the softnet_lock held at one point, and processing the list
calls sodoloanfree() which may kpause(). As the thread sleeps with
softnet_lock held, it ultimately cause a deadlock (see the PR or tech-kern
thread for details).
Although it should be possible to call sodopendfree() after releasing
the socket lock, it's not so easy to know where he socket lock is held and
where it's not, so we may hit the issue again later.
Add a kernel thread to handle the so_pendfree list, and wake up this
thread when adding mbufs to this list. Get rid of the various sodopendfree()
calls, hopefully fixing definitively the problem.

Fix kern/45093 as discussed on tech-kern@:
http://mail-index.netbsd.org/tech-kern/2011/06/17/msg010734.html
The cause of the problem is that the so_pendfree is processed with
the softnet_lock held at one point, and processing the list
calls sodoloanfree() which may kpause(). As the thread sleeps with
softnet_lock held, it ultimately cause a deadlock (see the PR or tech-kern
thread for details).
Although it should be possible to call sodopendfree() after releasing
the socket lock, it's not so easy to know where he socket lock is held and
where it's not, so we may hit the issue again later.
Add a kernel thread to handle the so_pendfree list, and wake up this
thread when adding mbufs to this list. Get rid of the various sodopendfree()
calls, hopefully fixing definitively the problem.

* Arrange for interfaces that create new file descriptors to be able to
set close-on-exec on creation (http://udrepper.livejournal.com/20407.html).
- Add F_DUPFD_CLOEXEC to fcntl(2).
- Add MSG_CMSG_CLOEXEC to recvmsg(2) for unix file descriptor passing.
- Add dup3(2) syscall with a flags argument for O_CLOEXEC, O_NONBLOCK.
- Add pipe2(2) syscall with a flags argument for O_CLOEXEC, O_NONBLOCK.
- Add flags SOCK_CLOEXEC, SOCK_NONBLOCK to the socket type parameter
for socket(2) and socketpair(2).
- Add new paccept(2) syscall that takes an additional sigset_t to alter
the sigmask temporarily and a flags argument to set SOCK_CLOEXEC,
SOCK_NONBLOCK.
- Add new mode character 'e' to fopen(3) and popen(3) to open pipes
and file descriptors for close on exec.
- Add new kqueue1(2) syscall with a new flags argument to open the
kqueue file descriptor with O_CLOEXEC, O_NONBLOCK.
* Fix the system calls that take socklen_t arguments to actually do so.
* Don't include userland header files (signal.h) from system header files
(rump_syscallargs.h).
* Bump libc version for the new syscalls.

Make uvm_map recognize UVM_FLAG_COLORMATCH which tells uvm_map that the
'align' argument specifies the starting color of the KVA range to be returned.
When calling uvm_km_alloc with UVM_KMF_VAONLY, also specify the starting
color of the kva range returned (UMV_KMF_COLORMATCH) and pass those to
uvm_map.
In uvm_pglistalloc, make sure the pages being returned have sequentially
advancing colors (so they can be mapped in a contiguous address range).
Add a few missing UVM_FLAG_COLORMATCH flags to uvm_pagealloc calls.
Make the socket and pipe loan color-safe.
Make the mips pmap enforce strict page color (color(VA) == color(PA)).

Add a new AF/PF_ROUTE which is 64-bit clean which makes the routing socket
interface (and its associated sysctls) act identically for both 32 and 64 bit
programs. The old unclean one remains for backward compatibility.

If a multithreaded app closes an fd while another thread is blocked in
read/write/accept, then the expectation is that the blocked thread will
exit and the close complete.
Since only one fd is affected, but many fd can refer to the same file,
the close code can only request the fs code unblock with ERESTART.
Fixed for pipes and sockets, ERESTART will only be generated after such
a close - so there should be no change for other programs.
Also rename fo_abort() to fo_restart() (this used to be fo_drain()).
Fixes PR/26567

Rename fo_drain() to fo_abort(), 'drain' is used to mean 'wait for output
do drain' in many places, whereas fo_drain() was called in order to force
blocking read()/write() etc calls to return to userspace so that a close()
call from a different thread can complete.
In the sockets code comment out the broken code in the inner function,
it was being called from compat code.

Make ifconfig(8) set and display preference numbers for IPv6
addresses. Make the kernel support SIOC[SG]IFADDRPREF for IPv6
interface addresses.
In in6ifa_ifpforlinklocal(), consult preference numbers before
making an otherwise arbitrary choice of in6_ifaddr. Otherwise,
preference numbers are *not* consulted by the kernel, but that will
be rather easy for somebody with a little bit of free time to fix.
Please note that setting the preference number for a link-local
IPv6 address does not work right, yet, but that ought to be fixed
soon.
In support of the changes above,
1 Add a method to struct domain for "externalizing" a sockaddr, and
provide an implementation for IPv6. Expect more work in this area: it
may be more proper to say that the IPv6 implementation "internalizes"
a sockaddr. Add sockaddr_externalize().
2 Add a subroutine, sofamily(), that returns a struct socket's address
family or AF_UNSPEC.
3 Make a lot of IPv4-specific code generic, and move it from
sys/netinet/ to sys/net/ for re-use by IPv6 parts of the kernel and
ifconfig(8).

Add fileops::fo_drain(), to be called from fd_close() when there is more
than one active reference to a file descriptor. It should dislodge threads
sleeping while holding a reference to the descriptor. Implemented only for
sockets but should be extended to pipes, fifos, etc.
Fixes the case of a multithreaded process doing something like the
following, which would have hung until the process got a signal.
thr0 accept(fd, ...)
thr1 close(fd)

fix problem pointed out by ad where sockopt may end up sleeping
inappropriately with the socket lock held.
sockopt_init() may sleep
sockopt_set() will not sleep
sockopt_getmbuf() for legacy code will not sleep

Address problems with accept filters noted by ad in his source-changes
mail: http://mail-index.netbsd.org/source-changes/2008/10/10/msg211109.html
* Scary-looking socket locking stubs (changed to KASSERT of locked)
* depends on INET inappropriately (though now you must add new
accept filter names to the uipc_accf.c line in conf/files if
you aren't using dataready or httpready)
* New code uses MALLOC/FREE -- changed to kmem_alloc/kmem_free;
could be pool_cache, these are all fixed-size allocations.
We need to verify that this works as expected with protocols with per-socket
locking, like PF_LOCAL. I'm a little concerned about the case where the
lock on the listen socket isn't the same lock as on the eventual connected
socket.

Add accept filters, ported from FreeBSD by Coyote Point Systems. Add inetd
support for specifying an accept filter for a service (mostly as a usage
example, but it can be handy for other things). Manual pages to follow
in a day or so.
OK core@.

Fix a use-after-free in soabort(). It would be better to kill SS_NOFDREF
and maintain a per-socket reference count, but SS_NOFDREF is slightly
more than a simple reference count and I don't want to break anything.

soreceive: dom_externalize/dom_dispose can block. If new messages are
appended while the receiver is blocked, the sockbuf will be corrupted.
Dequeue control messages from the sockbuf and sync its state in one
pass. Only then process the control messages. From FreeBSD.

- Extract the guts of soo_poll() into sopoll(), which takes a struct socket *.
This is for netsmb which wants to poll sockets directly.
- When polling a socket, first check for pending I/O without acquring any
locks. If no I/O seems to be pending, acquire locks/spl and check again
doing selrecord() if necessary.

Do not "return 1" from kqfilter for errors. That value is passed
directly to the userland caller and results in a mysterious EPERM.
Instead, return EINVAL or something else sensible depending on the
case.

1) Introduce a new socket option, (SOL_SOCKET, SO_NOHEADER), that
tells a socket that it should both add a protocol header to tx'd
datagrams and remove the header from rx'd datagrams:
int onoff = 1, s = socket(...);
setsockopt(s, SOL_SOCKET, SO_NOHEADER, &onoff);
2) Add an implementation of (SOL_SOCKET, SO_NOHEADER) for raw IPv4
sockets.
3) Reorganize the protocols' pr_ctloutput implementations a bit.
Consistently return ENOPROTOOPT when an option is unsupported,
and EINVAL if a supported option's arguments are incorrect.
Reorganize the flow of code so that it's more clear how/when
options are passed down the stack until they are handled.
Shorten some pr_ctloutput staircases for readability.
4) Extract common mbuf code into subroutines, add new sockaddr
methods, and introduce a new subroutine, fsocreate(), for reuse
later; use it first in sys_socket():
struct mbuf *m_getsombuf(struct socket *so)
Create an mbuf and make its owner the socket `so'.
struct mbuf *m_intopt(struct socket *so, int val)
Create an mbuf, make its owner the socket `so', put the
int `val' into it, and set its length to sizeof(int).
int fsocreate(..., int *fd)
Create a socket, a la socreate(9), put the socket into the
given LWP's descriptor table, return the descriptor at `fd'
on success.
void *sockaddr_addr(struct sockaddr *sa, socklen_t *slenp)
const void *sockaddr_const_addr(const struct sockaddr *sa, socklen_t *slenp)
Extract a pointer to the address part of a sockaddr. Write
the length of the address part at `slenp', if `slenp' is
not NULL.
socklen_t sockaddr_getlen(const struct sockaddr *sa)
Return the length of a sockaddr. This just evaluates to
sa->sa_len. I only add this for consistency with code that
appears in a portable userland library that I am going to
import.
const struct sockaddr *sockaddr_any(const struct sockaddr *sa)
Return the "don't care" sockaddr in the same family as
`sa'. This is the address a client should sobind(9) if it
does not care the source address and, if applicable, the
port et cetera that it uses.
const void *sockaddr_anyaddr(const struct sockaddr *sa, socklen_t *slenp)
Return the "don't care" sockaddr in the same family as
`sa'. This is the address a client should sobind(9) if it
does not care the source address and, if applicable, the
port et cetera that it uses.

Eliminate address family-specific route caches (struct route, struct
route_in6, struct route_iso), replacing all caches with a struct
route.
The principle benefit of this change is that all of the protocol
families can benefit from route cache-invalidation, which is
necessary for correct routing. Route-cache invalidation fixes an
ancient PR, kern/3508, at long last; it fixes various other PRs,
also.
Discussions with and ideas from Joerg Sonnenberger influenced this
work tremendously. Of course, all design oversights and bugs are
mine.
DETAILS
1 I added to each address family a pool of sockaddrs. I have
introduced routines for allocating, copying, and duplicating,
and freeing sockaddrs:
struct sockaddr *sockaddr_alloc(sa_family_t af, int flags);
struct sockaddr *sockaddr_copy(struct sockaddr *dst,
const struct sockaddr *src);
struct sockaddr *sockaddr_dup(const struct sockaddr *src, int flags);
void sockaddr_free(struct sockaddr *sa);
sockaddr_alloc() returns either a sockaddr from the pool belonging
to the specified family, or NULL if the pool is exhausted. The
returned sockaddr has the right size for that family; sa_family
and sa_len fields are initialized to the family and sockaddr
length---e.g., sa_family = AF_INET and sa_len = sizeof(struct
sockaddr_in). sockaddr_free() puts the given sockaddr back into
its family's pool.
sockaddr_dup() and sockaddr_copy() work analogously to strdup()
and strcpy(), respectively. sockaddr_copy() KASSERTs that the
family of the destination and source sockaddrs are alike.
The 'flags' argumet for sockaddr_alloc() and sockaddr_dup() is
passed directly to pool_get(9).
2 I added routines for initializing sockaddrs in each address
family, sockaddr_in_init(), sockaddr_in6_init(), sockaddr_iso_init(),
etc. They are fairly self-explanatory.
3 structs route_in6 and route_iso are no more. All protocol families
use struct route. I have changed the route cache, 'struct route',
so that it does not contain storage space for a sockaddr. Instead,
struct route points to a sockaddr coming from the pool the sockaddr
belongs to. I added a new method to struct route, rtcache_setdst(),
for setting the cache destination:
int rtcache_setdst(struct route *, const struct sockaddr *);
rtcache_setdst() returns 0 on success, or ENOMEM if no memory is
available to create the sockaddr storage.
It is now possible for rtcache_getdst() to return NULL if, say,
rtcache_setdst() failed. I check the return value for NULL
everywhere in the kernel.
4 Each routing domain (struct domain) has a list of live route
caches, dom_rtcache. rtflushall(sa_family_t af) looks up the
domain indicated by 'af', walks the domain's list of route caches
and invalidates each one.

Introduce KAUTH_REQ_NETWORK_SOCKET_OPEN, to check if opening a socket is
allowed. It takes three int * arguments indicating domain, type, and
protocol. Replace previous KAUTH_REQ_NETWORK_SOCKET_RAWSOCK with it (but
keep it still).
Places that used to explicitly check for privileged context now don't
need it anymore, so I replaced these with XXX comment indiacting it for
future reference.
Documented and updated examples as well.

Pull up following revision(s) (requested by ginsbach in ticket #1568):
sys/kern/uipc_socket.c: revision 1.120
lib/libc/sys/socket.2: revision 1.32
lib/libc/sys/socket.2: revision 1.33
Sort ERRORS. Bump date.
Add EAFNOSUPPORT as a possible error if the address family is not
supported. This adds further differentiation between which argument to
socket(2) caused the error. No longer are invalid domain (address family)
errors classified as ENOPROTOSUPPORT errors. This should make socket(2)
conform to current POSIX and X/Open standards. Fixes PR/33676.

Add credentials to sockets, 'so_cred'.
Brought up on tech-kern@ some ~2 months ago, didn't seem to be an
objection; brought up again recently and no objection either... this is
not too intrusive and I've been running with this for a while.

Add EAFNOSUPPORT as a possible error if the address family is not
supported. This adds further differentiation between which argument to
socket(2) caused the error. No longer are invalid domain (address family)
errors classified as ENOPROTOSUPPORT errors. This should make socket(2)
conform to current POSIX and X/Open standards. Fixes PR/33676.

- Move kauth_cred_t declaration to <sys/types.h>
- Cleanup struct ucred; forward declarations that are unused.
- Don't include <sys/kauth.h> in any header, but include it in the c files
that need it.
Approved by core.

merge yamt-uio_vmspace branch.
- use vmspace rather than proc or lwp where appropriate.
the latter is more natural to specify an address space.
(and less likely to be abused for random purposes.)
- fix a swdmover race.

Pull up following revision(s) (requested by christos in ticket #5598):
sys/kern/uipc_socket.c: revision 1.105 via patch
PR/26210: Matthew Mondor: Since revision 1.14 when net-2 was merged,
the code to do receive packet accounting has been disabled for no apparent
reason. Re-enable it.

bug reported by millert@openbsd:
> Call dom_dispose() for any SCM_RIGHTS message that went through the
> read path rather than recv. Previously, if an fd was passed via
> sendmsg() but was consumed by the receiver via read() the ref count
> was incremented and never decremented and so the ref count would
> never reach zero even when there was no long any processes holding
> the file open (this was especially bad for locked fds).

Eliminate several uses of `curproc' from the socket-layer code and from NFS.
Add a new explicit `struct proc *p' argument to socreate(), sosend().
Use that argument instead of curproc. Follow-on changes to pass that
argument to socreate(), sosend(), and (*so->so_send)() calls.
These changes reviewed and independently recoded by Matt Thomas.
Changes to soreceive() and (*dom->dom_exernalize() from Matt Thomas:
pass soreceive()'s struct uio* uio->uio_procp to unp_externalize().
Eliminate curproc from unp_externalize. Also, now soreceive() uses
its uio->uio_procp value, pass that same value downward to
((pr->pru_usrreq)() calls for consistency, instead of (struct proc * )0.
Similar changes in sys/nfs to eliminate (most) uses of curproc,
either via the req-> r_procp field of a struct nfsreq *req argument,
or by passing down new explicit struct proc * arguments.
Reviewed by: Matt Thomas, posted to tech-kern.
NB: The (*pr->pru_usrreq)() change should be tested on more (all!) protocols.

Initialise (most) pools from a link set instead of explicit calls
to pool_init. Untouched pools are ones that either in arch-specific
code, or aren't initialiased during initial system startup.
Convert struct session, ucred and lockf to pools.

Adjust struct sockbuf and sorflush() so that we don't zero out
every field; some need to stay around.
Fixes a bug where by calling shutdown() on a socket with knotes
will cause the kernel to panic when the kernel closes the socket.
Other access, such as calling kevent() may also trigger the panic.
Debugged with help from Jason and Allen. Patch reviewed by same plus
Itojun and Matt Thomas.
This problem seems to be the same one that FreeBSD saw in their PR
number 54331.
Kernel version _not_ bumped as we will piggyback the bump earlier today.

Apply the aborted ktrace-lwp changes to a specific branch. This is just for
others to review, I'm concerned that patch fuziness may have resulted in some
errant code being generated but I'll look at that later by comparing the diff
from the base to the branch with the file I attempt to apply to it. This will,
at the very least, put the changes in a better context for others to review
them and attempt to tinker with removing passing of 'struct lwp' through
the kernel.

Pass lwp pointers throughtout the kernel, as required, so that the lwpid can
be inserted into ktrace records. The general change has been to replace
"struct proc *" with "struct lwp *" in various function prototypes, pass
the lwp through and use l_proc to get the process pointer when needed.
Bump the kernel rev up to 1.6V

* Use a pool_cache constructor to record the physical address of mbufs
in the mbuf header.
* Use the new cached paddr feature of the pool_cache API to record
the physical address of mbuf clusters. (We cannot use a ctor for
clusters, since clusters have no constructed form; they are merely
buffers).
Bus_dma back-ends may use the cached physical addresses to save having to
extract the physical address from virtual.
* Provide space in m_ext recording the vm_page *'s for an SOSEND_LOAN_CHUNK-
sized non-cluster external buffer. Use this in the sosend_loan code to
save having to extract the physical address from virtual and then look
up the vm_page *'s.
* Provide an indication that an external buffer is mapped read-only at
the MMU. Set this flag for the external buffer in the sosend_loan
case, since loaned pages are always mapped read-only. Bus_dma back-ends
may use this information to save cache flushing, since a cache flush of
a read-only mapping is redundant on some architectures (the cache would
have already been flushed when making the mapping read-only).
Part 2 in a series of simple patches contributed by Wasabi Systems
to improve network performance.

Add MBUFTRACE kernel option.
Do a little mbuf rework while here. Change all uses of MGET*(*, M_WAIT, *)
to m_get*(M_WAIT, *). These are not performance critical and making them
call m_get saves considerable space. Add m_clget analogue of MCLGET and
make corresponding change for M_WAIT uses.
Modify netinet, gem, fxp, tulip, nfs to support MBUFTRACE.
Begin to change netstat to use sysctl.

Add extensible malloc types, adapted from FreeBSD. This turns
malloc types into a structure, a pointer to which is passed around,
instead of an int constant. Allow the limit to be adjusted when the
malloc type is defined, or with a function call, as suggested by
Jonathan Stone.

Pull up revision 1.68 (requested by thorpej in ticket #429):
Fix 2 bugs with MSG_WAITALL. The first is to not block forever if one is
trying to MSG_PEEK for more than the socket can hold. The second is that
before sleeping waiting for more data, upcall the protocol telling it you
have just received data so it can kick itself to re-fill the just drained
socket buffer.

Make insertion of data into socket buffers O(C):
* Keep pointers to the first and last mbufs of the last record in the
socket buffer.
* Use the sb_lastrecord pointer in the sbappend*() family of functions
to avoid traversing the packet chain to find the last record.
* Add a new sbappend_stream() function for stream protocols which
guarantee that there will never be more than one record in the
socket buffer. This function uses the sb_mbtail pointer to perform
the data insertion. Make TCP use sbappend_stream().
On a profiling run, this makes sbappend of a TCP transmission using
a 1M socket buffer go from 50% of the time to .02% of the time.
Thanks to Bill Sommerfeld and YAMAMOTO Takashi for their debugging
assistance!

Curproc->curlwp renaming.
Change uses of "curproc->l_proc" back to "curproc", which is more like the
original use. Bare uses of "curproc" are now "curlwp".
"curproc" is now #defined in proc.h as ((curlwp) ? (curlwp)->l_proc) : NULL)
so that it is always safe to reference curproc (*de*referencing curproc
is another story, but that's always been true).

Pull up revision 1.67 (via patch, requested by he):
Make sure to preserve atomicity of receive operation when instructed
to do so. Make sure to drop rest of partial record if an error
occurs. Fixes PR#16990.

Pull up revision 1.67 (requested by he in ticket #238):
In soreceive(), if any part of a received record has been freed,
and an error occurs, make sure the socket doesn't retain a partial
copy by dropping the rest of the record.
This would otherwise trigger a panic("receive 1a") under DIAGNOSTIC.
Fixes PR#16990, suggested fix adapted.
Reviewed by Matt Thomas.

Fix 2 bugs with MSG_WAITALL. The first is to not block forever if one is
trying to MSG_PEEK for more than the socket can hold. The second is that
before sleeping waiting for more data, upcall the protocol telling it you
have just received data so it can kick itself to re-fill the just drained
socket buffer.

In soreceive(), if any part of a received record has been freed,
and an error occurs, make sure the socket doesn't retain a partial
copy by dropping the rest of the record.
This would otherwise trigger a panic("receive 1a") under DIAGNOSTIC.
Fixes PR#16990, suggested fix adapted.
Reviewed by Matt Thomas.

Add some experimental page-loaning for writes on sockets. It is disabled
by default, and can be enabled by adding the SOSEND_LOAN option to your
kernel config. The SOSEND_COUNTERS option can be used to provide some
instrumentation.
Use of this option, combined with an application that does large enough
writes, gets us zero-copy on the TCP and UDP transmit path.

no need to protect the kqueue SLIST manipulation with splnet - the
structures are never changed nor accessed from interrupt context
filt_solisten(): g/c the comment about what FreeBSD does, leave only
comment what the code does (it was checked to be correct)

Pool deals fairly well with physical memory shortage, but it doesn't
deal with shortages of the VM maps where the backing pages are mapped
(usually kmem_map). Try to deal with this:
* Group all information about the backend allocator for a pool in a
separate structure. The pool references this structure, rather than
the individual fields.
* Change the pool_init() API accordingly, and adjust all callers.
* Link all pools using the same backend allocator on a list.
* The backend allocator is responsible for waiting for physical memory
to become available, but will still fail if it cannot callocate KVA
space for the pages. If this happens, carefully drain all pools using
the same backend allocator, so that some KVA space can be freed.
* Change pool_reclaim() to indicate if it actually succeeded in freeing
some pages, and use that information to make draining easier and more
efficient.
* Get rid of PR_URGENT. There was only one use of it, and it could be
dealt with by the caller.
From art@openbsd.org.

Use lmin() instead of min(), and long for mlen & clen, to avoid integer
overflow on LP64 architectures. This fixes kern/10070 by Juergen Weiss.
Fix tested on NetBSD/alpha by Bernd Ernesti, on NetBSD/sparc64
by David Brownlee and Eduardo Horvath.

when the peer is disconnected before accept(2) is issued,
do not return junk data in mbuf (= sockaddr on accept(2)'s 2nd arg).
set the length zero.
behavior checked with bsdi and freebsd.
partial solution to PR 12027 and 10698 (need more investigation).

Make sobind() take a struct proc *. It already took curproc and
passed it down to the appropriate usrreq function, and this
allows usage for contexts that need to be explicitly different
from curproc (like in the NFS code when binding to a reserved port).

bring in latest KAME (as of 19991130, KAME/NetBSD141) into kame branch
just for reference purposes.
This commit includes 1.4 -> 1.4.1 sync for kame branch.
The branch does not compile at all (due to the lack of ALTQ and some other
source code). Please do not try to modify the branch, this is just for
referenre purposes.
synchronization to latest KAME will take place on HEAD branch soon.

In sosend(), if so_error is set, clear it before returning the error to
the process (i.e. pre-Reno behavior). The 4.4BSD behavior (introduced
in Reno) caused transient errors to stick incorrectly.
This is from PR #7640 (Havard Eidnes), cross-checked w/ FreeBSD, where
Bill Fenner committed the same fix (as described in a comment in the
Vat sources, by Van Jacobsen).

Wow, that was much easier than I originally thought. Fix PR kern/7583:
serious race condition in sosend(). Upon closer inspection, the appropriate
flags are checked within splsoftnet() for soreceive(), so no change needed
there. Also a little KNFing in sosend().

Ensure that you can only bind a more specific address when it is done by the
same uid or by root.
This code is from FreeBSD. (Whilst it was originally obtained from OpenBSD,
FreeBSD fixed it to work with multicast. To quote the commit message:
- Don't bother checking for conflicting sockets if we're binding to a
multicast address.
- Don't return an error if we're binding to INADDR_ANY, the conflicting
socket is bound to INADDR_ANY, and the conflicting socket has
SO_REUSEPORT set.
)

In the sosend() loop, if the residual count is > 0 before calling PRU_SEND,
set SS_MORETOCOME as a hint to the lower layer that more data is coming
on the next iteration of the loop. Clear the flag after the PRU_SEND
call.
Suggested by Justin Walker <justin@apple.com> on the freebsd-net
mailing list.

From 4.4BSD-Lite2 (noted by Frank van der Linden):
so_linger is used as an argument to tsleep(), so was stuffed with
clockticks for the TCP linger time. However, so_linger is set directly from
l_linger if the linger time is specified, and l_linger is seconds (although
this is not currently documented anywhere). Fix this to set the TCP
linger time in seconds, and multiply so_linger by hz when tsleep() is
called to actually perform the linger.

In sosetopt():
- Disallow < 1 values for SO_SNDBUF, SO_RCVBUF, SO_SNDLOWAT, and
SO_RCVLOWAT; return EINVAL if the user attempts to set <= 0.
Inspired by PR #3770, from Havard Eidnes <he@vader.runit.sintef.no>.
- For SO_SNDLOWAT and SO_RCVLOWAT, don't let the low-water mark get
set above the high-water mark. Behavior is now consistent with
BSD/OS: If such an attempt is made, silently truncate to the high-water
value.

This fixes a nasty little bug where traceroute (and other raw-ip sending
programs which attach their own header) can crash the machine. The problem
in this case was:
a variable "space" was set to the total data to copy,
len was used to remember how much to copy in this chunk (mbuf),
in one case, len = min(MCLBYTES - max_hdr, resid) but
size -= MCLBYTES;
instead of
size -= len;
Note that userland programs can still crash the machine by providing
bogus data in the ip->ip_len field I suspect. I haven't verified this,
but will soon be doing so and applying a fix of some sort. Probably
clamping the ip->ip_len value to the true packet size will be ok.

Pass a proc pointer down to the usrreq and pcbbind functions for PRU_ATTACH, PRU_BIND and
PRU_CONTROL. The usrreq interface really needs to be split up, but this will have to wait.
Remove SS_PRIV completely.

fix from david greenman, davidg@freefall.cdrom.com:
sosend was attempting to reserve space in an mbuf cluster for a datagram
header and because of bugs in the sosend's mbuf allocation algorithm,
sosend was calling uiomove twice as many times as was necessary. It turns
out that PREPEND does the right thing when a cluster is associated with
an mbuf header, so the datagram header allocation can be defered. This
also ends up additionally consuming one less mbuf for the TCP protocol
because TCP always allocates another header mbuf regardless if space is
available to prepend the protocol header. The net result of this fix is
that unix domain and pipe throughput is increased by a measured 10%.

BSDI official patch #14:
SUMMARY:
Here is a patch for a kernel hang that can be provoked with a write
or send of a negative amount. The talk program is capable of exercising
this bug. This patch also includes a fix for a bug that caused data
to be delivered to TCP in smaller chunks than desired, and which caused
TCP to send a short packet when starting up. Finally, there is a bug
fix for MSG_PEEK with an oobmark pending.

Make all files using spl*() #include cpu.h. Changes from trunk.
init_main.c: New method of pseudo-device of initialization.
kern_clock.c: hardclock() and softclock() now take a pointer to a clockframe.
softclock() only does callouts.
kern_synch.c: Remove spurious declaration of endtsleep(). Adjust uses of
averunnable for new struct loadav.
subr_prf.c: Allow printf() formats in panic().
tty.c: averunnable changes.
vfs_subr.c: va_size and va_bytes are now quads.