Squid Proxy

5th Dec 2010, 15:38:59

By James Stocks

This is everything I know about Squid all in one place.

The last time I wrote about Squid -- over five years ago -- it was at version 2.5. Much has changed
since then and my setup looks very different these days. Now that bandwidth is not nearly so scarce
as it was in 2005, I don't use Squid to cache anything to disk.

Here's how I set up the perfect Squid install for my purposes:

My platform of choice is Debian Linux. For my proxy setup I am using 'Squeeze', since it includes
Squid 3.1. Squid version 3.1 has many enhancements, but most important for me is the inclusion of
IPv6 support. You could just as easily use 'Lenny', though its squid3 package is the older 3.0 branch.

Install the squid3 package rather than squid, unless you know you need
the older Squid version 2.7.

# apt-get install squid3

The default squid.conf is very well commented, but it is overkill for a simple and
efficient setup. It can serve as a useful resource for looking up what certain configuration
directives do though, so we'll move it sideways:

# cd /etc/squid3
# mv squid.conf dist-squid.conf

In my view, this is the absolute minimal working Squid configuration one can have:
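A minimal sketch of such a configuration (the local network range here is an assumption - substitute your own):

```
# Listen on the standard proxy port
http_port 3128
# Allow only the local network to use the proxy; deny everyone else
acl localnet src 172.16.0.0/16
http_access allow localnet
http_access deny all
```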

These lines can then be added to squid.conf so that Squid will block the domains in the yoyo.org
ad-server blacklist, kept as a file of regular expressions at /etc/squid3/yoyo:

# 3128 will be the ad-filtered service, 8080 the unfiltered one
http_port 8080
http_port 3128
acl ads dstdom_regex "/etc/squid3/yoyo"
acl ad-filtered myport 3128
# block ads for requests to dstdomains in 'ads' AND where user is on port 3128
# 'ads' acl must be last so that it is the acl picked up by deny_info later
http_access deny ad-filtered ads
# Where a request is blocked due to 'ads' acl, return an empty file not an error
deny_info http://adzapper.toastputer.net/zaps/empty ads
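The blacklist file itself, /etc/squid3/yoyo, has to be fetched and converted into regular
expressions first. A sketch of such a script - the pgl.yoyo.org URL parameters are an assumption,
so check the site's export options:

```shell
#!/bin/sh
# Sketch only: to_regex turns a list of plain hostnames (one per line) into
# dstdom_regex patterns - escape literal dots, then match the domain itself
# or any subdomain of it.
to_regex() {
  sed -e 's/\./\\./g' -e 's/^/(^|\\.)/' -e 's/$/$/'
}

# Fetch the yoyo.org ad-server list and convert it (URL parameters are an
# assumption - see pgl.yoyo.org for the current export formats):
# wget -q -O - 'http://pgl.yoyo.org/adservers/serverlist.php?hostformat=nohtml&mimetype=plaintext' \
#   | to_regex > /etc/squid3/yoyo
```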

Now your Squid offers a filtered service on port 3128 and an unfiltered service on port 8080. I
have set Squid to serve up an empty file in place of the adverts; whilst you're welcome to use mine,
you should really point deny_info at a web server you control. If deny_info
is not set, Squid will return an error page instead of the blocked file, which may be desirable for
troubleshooting when you need to confirm that an object is indeed being blocked.

Whilst this approach can be extended to block any content you wish simply by adding more ACLs, I
recommend that you look at the following two products if your needs are more complex:

SquidGuard - a powerful filtering plugin.
Useful if you need to block long lists of whole sites and present your users with pretty pages
explaining why.

DansGuardian - true content filtering,
more like WebSense(TM) and co. Will filter based on the content of a page, not just URL. Highly
configurable.

Both of these approaches will be slower and require more system resources than plain old Squid.

Logging

Don't do any logging unless you really need to or you are prepared to accept the performance
penalty. Leave pipeline_prefetch turned off, since it is incompatible with access logging.

# pipeline_prefetch is incompatible with access logging, so leave it off:
#pipeline_prefetch on
# To turn off access logging entirely, uncomment this instead:
#cache_access_log none
cache_log /var/log/squid3/cache.log
cache_access_log /var/log/squid3/access.log
cache_store_log none
# This can help troubleshooting, but leave it commented out for production use - it degrades performance:
#cache_store_log /var/log/squid3/store.log

Caching

Consider carefully whether you really want to have a disk cache. The hit rate is very low (only
about 3% of requests are served from the cache). Each object held in the cache requires a certain
amount of RAM so that Squid can keep track of it, so a disk cache results in either tying up a lot
of RAM, or a massive performance penalty if the server begins to hit swap space.

My Squid setup is configured to cache only in RAM. This means that the 'hottest' objects will be
served quickly, but Squid doesn't eat through huge amounts of RAM trying to keep track of a large
disk cache.

That said, if you have a large number of users who frequently request the same content, or you are
so bandwidth limited that 3% is a big deal to you, of course you can cache. We must start with some
tedious but important planning.

Firstly, we need to establish how much RAM Squid will require. On 64-bit architectures, Squid will
use 14MB per 1GB of disk cache. In this example, I'm using a 120GB partition, so I know that Squid
will need about 1.6GB of RAM purely to keep track of its own cache. My server will have 4GB of RAM,
so I know that I can spare this amount. Otherwise, I would need to reduce the size of my cache to
match the available memory.
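That rule of thumb is easy to sanity-check:

```python
# 14MB of index RAM per 1GB of disk cache on 64-bit (figure from above)
def index_ram_mb(cache_gb, mb_per_gb=14):
    """RAM in MB that Squid needs just to index a cache of cache_gb GB."""
    return cache_gb * mb_per_gb

print(index_ram_mb(120))  # 1680MB - roughly the 1.6GB quoted above
```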

A Squid cache is divided up into first level and second level directories. This is necessary
because it would take Squid far too long to locate the files it needed if they were all in the same
directory. So, the second consideration is to calculate how many level 1 and level 2 directories
are needed for our 120GB partition using this formula:

(((x / y) / 256) / 256) * 2 = z

Let x be the size of the cache in kB and let y be the average size of objects in the cache in kB
(if you don't know this value, 13kB is considered to be a reasonable choice). z will then be the
number of level 1 directories required.

Squid gets extremely upset if it runs out of space in its cache_dir, so I am going to
leave plenty of headroom here! For starters, my '120GB' disk is actually more like 111GB
when measured in base-2 rather than the base-10 manufacturers use. Squid will need some space to
write swap and other temporary files, so I am going to allocate only 100GB, leaving 11GB free for
these purposes. (100 * 1024) * 1024 = 104857600kB, so:

(((104857600 / 13) / 256) / 256) * 2 = 246.153846
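The same arithmetic, checked in a few lines of Python:

```python
import math

# z = (((x / y) / 256) / 256) * 2, where x is the cache size in kB
# and y the average object size in kB (13kB assumed above).
def level1_dirs(cache_kb, avg_object_kb=13):
    return cache_kb / avg_object_kb / 256 / 256 * 2

cache_kb = 100 * 1024 * 1024       # 100GB expressed in kB
z = level1_dirs(cache_kb)
print(round(z, 6), math.ceil(z))   # 246.153846, rounded up to 247
```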

At long last we have the proper values to plug into our cache_dir directive:
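With 246.15 rounded up to 247 level 1 directories and 100GB expressed as 102400MB, the directive
would look something like this (the aufs storage scheme and the Debian cache path are assumptions -
plain ufs works too):

```
cache_dir aufs /var/spool/squid3 102400 247 256
```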

By default, Squid will only cache files 4MB or smaller. This is a good optimisation for performance,
but bad if you are looking to save bandwidth. Squid can be instructed to cache more aggressively,
for example:
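One such knob is maximum_object_size (the 100 MB value here is an illustrative assumption):

```
# Cache objects of up to 100 MB instead of the default 4 MB
maximum_object_size 100 MB
```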

Bandwidth Restriction

Squid has a method of preventing a single user or small group of users from hogging all the
bandwidth, or indeed to prevent your web users as a whole from swamping your Internet link.
The feature is called 'delay pools'.

Important: Note carefully the difference between 'b' (one bit) and 'B' (one byte/eight bits).
Squid uses only B (bytes) per second, whereas Internet links are normally talked about in
terms of bits (b) per second. Things will get confusing very quickly if you mix them up!

If you want to impose an overall limit on Squid's bandwidth of, say, 6Mbps then this can be done
very simply:
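A sketch of such a limit, assuming a single class 1 delay pool applied to all clients (6Mbps is
750000 bytes per second):

```
# One pool, class 1: a single aggregate bucket shared by everyone
delay_pools 1
delay_class 1 1
# restore/max: refill at 750000B/s (6Mbps), bucket holds at most 750000B
delay_parameters 1 750000/750000
# Apply the pool to all traffic ('all' is predefined in Squid 3.x)
delay_access 1 allow all
```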

This is fine, but it will only limit bandwidth in a simplistic way. It's still possible for one
user to hog all of that bandwidth to the detriment of other users. It's possible to prevent this,
but it's necessary to have a more detailed knowledge of how Squid deals with bandwidth.

The overall bandwidth available to Squid comes from a delay pool which holds 200MB. This pool
refills at a rate of 20Mbps. This means that our users as a whole may download 200MB at a rate in
excess of 20Mbps before any bandwidth controls activate. This helps Squid to respond to short
spikes in demand of the sort that can occur after a network outage or similar event.

Each of our users has a bandwidth bucket with which they may dip into the pool. Each bandwidth
bucket holds 20MB. An individual user can download a 20MB file at unrestricted speed, provided that
there is sufficient bandwidth left in the delay pool. After this 20MB bucket is exhausted, or the
delay pool becomes empty, the user will be limited to 2Mbps.

The result for the end user is that small file downloads will be very fast, so normal web browsing
will be very responsive. Those who download large files all day will find their connection rate
limited so that they won't be able to impinge on other users' bandwidth.

Here's how it looks in the Squid config after all bits have been converted to bytes:
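A sketch of that configuration, assuming a class 2 pool applied to all clients: 20Mbps is
2500000B/s, 200MB is 209715200B, 2Mbps is 250000B/s, and 20MB is 20971520B.

```
delay_pools 1
# Class 2: one aggregate bucket plus one bucket per individual IPv4 client
delay_class 1 2
# aggregate restore/max, then per-user restore/max:
# refill 2500000B/s (20Mbps) into a 209715200B (200MB) pool;
# refill 250000B/s (2Mbps) into each user's 20971520B (20MB) bucket
delay_parameters 1 2500000/209715200 250000/20971520
delay_access 1 allow all
```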

I've chosen these numbers mainly because the maths is easy. An element of trial and error will be
needed to make this work for you.

More Than One Squid

If your Squid proxy stops for any reason, you're likely to have lots of users complaining. You can
guard against this by having multiple servers running Squid and using DNS to round robin between
them. But what about the cache? If we don't tell each squid about the other, each will end up
maintaining independent but similar caches. Squid has a mechanism to deal with this. Here's an
example of how it would be configured on two Squids, proxy1 and proxy2.

On proxy1:

# Make squid listen for HTCP requests:
htcp_port 4827
# Tell it about the other Squid:
# proxy-only tells squid not to cache stuff it requests from this peer - that would be pointless
cache_peer proxy2.spruce.toastputer.net sibling 3128 4827 proxy-only htcp
# The other squid should only access stuff we have cached to avoid 'tromboning'.
acl othersquid src 172.16.0.8/32
miss_access deny othersquid

On proxy2:

# Make squid listen for HTCP requests:
htcp_port 4827
# Tell it about the other Squid:
# proxy-only tells squid not to cache stuff it requests from this peer - that would be pointless
cache_peer proxy1.spruce.toastputer.net sibling 3128 4827 proxy-only htcp
# The other squid should only access stuff we have cached to avoid 'tromboning'.
acl othersquid src 172.16.0.7/32
miss_access deny othersquid