Standard deviation (stddev) is a measure of the width of a normal distribution: one stddev on each side of the mean covers about 68%, and two stddevs about 95%. Normal distributions are sometimes called Gaussian curves or bell curves.

But the latter is faster, especially if @data is large, since it sorts the numbers only once internally.

Example:

Data: 1, 4, 6, 7, 8, 9, 22, 24, 39, 49, 555, 992

Average (or mean) is 143

Median is 15.5 (the average of 9 and 22, which both lie in the middle)

The 25-percentile is 6.25, which is between 6 and 7, but closer to 6.

The 75-percentile is 46.5, which is between 39 and 49, but closer to 49.

Linear interpolation is used to find the 25- and 75-percentile, and any other x-percentile that doesn't fall exactly on one of the numbers in the set.

Interpolation:

As you saw, 6.25 is closer to 6 than to 7 because the point 25% along the set of twelve numbers is closer to the third number (6) than to the fourth (7). The median (50-percentile) is also really interpolated, but it always falls in the middle of the two center numbers when the count of numbers is even.

However, there are two methods of interpolation:

Example: we have only three numbers, 5, 6 and 7.

Method 1: The most common is to say that 5 and 7 lie on the 25- and 75-percentile. This method is used in Acme::Tools.

Method 2: In Oracle databases, the least and greatest numbers always lie on the 0- and 100-percentile.

One argument for why Oracle's (and others'?) definition is not the best is to think of your data as, for instance, temperature measurements. If you place the highest temperature on the 100-percentile, you are in effect saying that there can never be a higher temperature in future measurements.

A quick, non-exhaustive Google survey suggests that method 1 is the most used.

The larger the data sets, the less difference there is between the two methods.
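
A minimal sketch of the two interpolation methods (the position formulas here are my reading of the definitions above, not code from Acme::Tools): method 1 places the x-percentile at position x/100 * (n+1) in the sorted list, method 2 at x/100 * (n-1) + 1.

sub percentile_sketch {
    my ($p, $method, @data) = @_;
    my @s = sort { $a <=> $b } @data;
    my $pos = $method == 1 ? $p/100 * (@s + 1)      # method 1: 5 and 7 on 25-/75-pct for n=3
                           : $p/100 * (@s - 1) + 1; # method 2: min/max on 0-/100-pct (Oracle)
    my $i = int $pos;                               # whole part: 1-based index of lower neighbour
    my $f = $pos - $i;                              # fraction: how far towards the upper neighbour
    return $s[$i-1] if $f == 0;                     # landed exactly on one of the numbers
    return $s[$i-1] + $f * ($s[$i] - $s[$i-1]);     # linear interpolation between the two
}
# Note: no extrapolation outside the data and no 0/100 croak, unlike percentile() described here.
print percentile_sketch(25, 1, 1,4,6,7,8,9,22,24,39,49,555,992), "\n"; # 6.25, as in the example
print percentile_sketch(25, 2, 5, 6, 7), "\n";                         # 5.5 (Oracle style)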

Extrapolation:

In method 1, when you want a percentile outside the range covered by interpolation, the two smallest (or two largest) numbers are used to extrapolate from. For instance, in the data set 5, 6, 7, an x-percentile with x < 25 lies below 5.

If you feel tempted to go below 0 or above 100, percentile() will die (or croak, to be more precise).

Another method could be to use "soft curves" instead of "straight lines" in the interpolation, perhaps B-splines or Bézier curves. This is not used here.

For large data sets, Hoare's algorithm (quickselect) would be faster than the simple, straightforward implementation used in percentile() here, since it does not sort all the numbers fully.
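
A minimal sketch of that idea, quickselect (not the actual percentile() implementation): partition around a pivot and recurse only into the side that holds the k-th smallest element, so most of the numbers are never fully sorted.

# return the k-th smallest element (k is 0-based) without fully sorting
sub quickselect {
    my ($k, @a) = @_;
    my $pivot = $a[rand @a];                 # random pivot
    my @lo = grep $_ <  $pivot, @a;          # smaller than pivot
    my @eq = grep $_ == $pivot, @a;          # equal to pivot
    my @hi = grep $_ >  $pivot, @a;          # larger than pivot
    return $k < @lo       ? quickselect($k, @lo)
         : $k < @lo + @eq ? $pivot
         :                  quickselect($k - @lo - @eq, @hi);
}
print quickselect(5, 1,4,6,7,8,9,22,24,39,49,555,992), "\n"; # 9 (the 6th smallest)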

Fifth argument: the maximum number of iterations before resolve() gives up and carps. Default is 100 (if the fifth argument is not given or is undef). The number 0 means infinite here. If the derivative at the start position is zero or close to zero, more iterations are typically needed.

Sixth argument: a number of seconds to run before giving up. If both the fifth and sixth arguments are given and > 0, resolve() stops at whichever limit is reached first.

Returns the string in the first input argument, but with the pairs of search-replace strings (or rather regexes) applied.

Works like replace() in Oracle, or rather regexp_replace() in Oracle 10, except that this replace() accepts more than three arguments.

Examples:

print replace("water","ater","ine"); # Turns water into wine
print replace("water","ater"); # w
print replace("water","at","eath"); # weather
print replace("water","wa","ju",
"te","ic",
"x","y", # No x is found, no y is returned
'r$',"e"); # Turns water into juice. 'r$' says that the r it wants
# to change should be the last letters. This reveals that
# second, fourth, sixth and so on argument is really regexs,
# not normal strings. So use \ (or \\ inside "") to protect
# the special characters of regexes. You probably also
# should write qr/regexp/ instead of 'regexp' if you make
# use of regexps here, just to make it more clear that
# these are really regexps, not strings.
print replace('JACK and JUE','J','BL'); # prints BLACK and BLUE
print replace('JACK and JUE','J'); # prints ACK and UE
print replace("abc","a","b","b","c"); # prints ccc (not bcc)

If the first argument is a reference to a scalar variable, that variable is changed "in place".

Returns a pseudo-random number with a Gaussian distribution instead of the uniform distribution of perl's rand() or random() in this module. The algorithm is a variation of the one at http://www.taygeta.com/random/gaussian.html, which is both faster and better than adding up a long series of rand() calls.

Uses perl's rand() function internally.

Input: 0 - 3 arguments.

First argument: the average of the distribution. Default 0.

Second argument: the standard deviation of the distribution. Default 1.

Third argument: If a third argument is present, random_gauss returns an array of that many pseudo-random numbers. If there is no third argument, a number (a scalar) is returned.

Output: one or more pseudo-random numbers with a Gaussian distribution, also known as a bell curve or normal distribution.

Example:

my @I=random_gauss(100, 15, 100000); # produces 100000 pseudo-random numbers, average=100, stddev=15
#my @I=map random_gauss(100, 15), 1..100000; # same but more than three times slower
print "Average is: ".avg(@I)."\n"; # prints a number close to 100
print "Stddev is: ".stddev(@I)."\n"; # prints a number close to 15
my @M=grep $_>100+15*2, @I; # those above 130
print "Percent above two stddevs: ".(100*@M/@I)."%\n"; #prints a number close to 2.2%

Compresses the input (text or binary) and returns a base64-encoded string of the compressed binary data. There is no known limit on input length; several MB have been tested, as long as you've got the RAM...

Input: One or two strings.

First argument: The string to be compressed.

Second argument (optional): A dictionary string.

Output: a base64-encoded string of the compressed input.

Using the optional dictionary string will result in an even more compressed output if the dictionary string is somewhat similar to the string that is compressed (the data in the first argument).

If x relatively similar strings are to be compressed, e.g. x automatic email responses to some action by a user, it will pay off to choose one of those x as a dictionary string and store it as such. (You must also use the same dictionary string when decompressing using "unzipb64".)
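
A minimal sketch of that dictionary usage, based on the argument descriptions above ($template and $response are hypothetical variables):

use Acme::Tools;
my $template = "Dear customer, thank you for contacting us about ...";  # one typical response
my $response = "Dear customer, thank you for contacting us about your invoice ...";
my $zipped   = zipb64($response, $template);  # compress with $template as dictionary
my $back     = unzipb64($zipped, $template);  # the same dictionary string is needed here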

The returned string is base64 encoded, that is, the output is about 33% larger than it strictly has to be. The advantage is that the string can more easily be stored in a database (without the hassle of CLOB/BLOB), and is perhaps easier to transfer in HTTP POST requests (it still needs some URL-encoding, normally). See "zipbin" and "unzipbin" for the same without the base64 encoding.

gzip() is really the same as Compress::Zlib::memGzip(), except that gzip() just returns the input string if for some reason Compress::Zlib could not be required (not installed or not found). (Compress::Zlib is a built-in module in newer perl versions.)

gzip() uses the same compression algorithm as the well-known GNU program gzip found in most unix/linux/cygwin distros, except that gzip() does this in memory. (Both use the C library zlib.)
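
A minimal sketch of the same in-memory round trip using Compress::Zlib directly (without the fallback that gzip() adds):

use Compress::Zlib ();                               # the module gzip() delegates to
my $gzipped = Compress::Zlib::memGzip("some text");  # gzip-compress in memory
my $text    = Compress::Zlib::memGunzip($gzipped);   # and decompress again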

bzip2() and bunzip2() work just like gzip() and gunzip(), but use a different compression algorithm. It usually compresses better but more slowly than the gzip algorithm, especially for compression; decompression speed differs less.

ipaddr() memoizes the results internally (using the %Acme::Tools::IPADDR_memo hash), so only the first lookup of a particular IP number might take some time.

A few DNS lookups can take several seconds, although most are done in a fraction of a second. Due to this slowness, medium- to high-traffic web servers should probably turn off hostname lookups in their logs and just log IP numbers, by using HostnameLookups Off in Apache httpd.conf, and then use ipaddr() afterwards if necessary.

Zero or one input argument: a string of the type often found behind the first question mark (?) in URLs.

This string can have one or more parts separated by & chars.

Each part consists of key=value pairs (with the first = char being the separation char).

Both key and value can be url-encoded.

If there is no input argument, webparams uses $ENV{QUERY_STRING} instead.

If also $ENV{QUERY_STRING} is lacking, webparams() checks if $ENV{REQUEST_METHOD} eq 'POST'. In that case $ENV{CONTENT_LENGTH} is taken as the number of bytes to be read from STDIN and those bytes are used as the missing input argument.

The environment variables QUERY_STRING, REQUEST_METHOD and CONTENT_LENGTH are typically set by a web server following the CGI standard (which Apache and most others can follow) or by Apache in mod_perl. Although you are probably better off using CGI, or $R->args() or $R->content() in mod_perl.

Output:

webparams() returns a hash of the key/value pairs in the input argument, URL-decoded.

If an input string has more than one occurrence of the same key, that key's value in the returned hash will be the values concatenated, separated by a comma (,) char.

Examples:

use Acme::Tools;
my %R=webparams();
print "Content-Type: text/plain\n\n"; # or rather \cM\cJ\cM\cJ instead of \n\n to be http-compliant
print "My name is $R{name}";

Store those four lines in a file in the directory designated for CGI scripts on your web server (or perhaps giving the file a .cgi extension is enough), chmod +x /.../cgi-bin/script, and the URL http://some.server.somewhere/cgi-bin/script?name=HAL will print My name is HAL on the web page.

First argument: the html where a <table> is to be found and converted.

Second argument (optional): if the html contains more than one <table> and you do not want the first one, a second argument is a way of telling ht2t() which one to capture: the one with this word or string occurring before it.

Output: An array of arrayrefs.

ht2t() is a quick and dirty way of scraping (or harvesting, as it is also called) data from a web page. Look to HTML::Parse to do this more accurately.

If Term::ANSIColor is not installed or not found, ansicolor() returns the input string with every ¤ and its following code letters removed. (That is: ansicolor() is safe to use even if Term::ANSIColor is not installed; you just don't get the colors.)

Checks whether a credit card number (CCN) has correct control digits according to the LUHN algorithm from 1960. This method of control digits is used by MasterCard, Visa, American Express, Discover, Diners Club / Carte Blanche, JCB and others.

Input:

A credit card number. Can contain non-digits, but they are removed internally before checking.

Output:

Something true or false.

Or more accurately:

Returns undef (false) if the input argument is missing digits.

Returns 0 (zero, which is false) if the digits are not correct according to the LUHN algorithm.

Returns 1 or the name of a credit card company (true either way) if the last digit is a correct control digit for this ccn.

The name of the credit card company is returned like this (without the ' character)

The first six digits are the Issuer Identifier, that is, the bank (probably). The rest is the "account number", except the last digit, which is the control digit. The maximum length of credit card numbers is 19 digits.

This uses the LUHN algorithm (also known as mod 10) from 1960, which is used internationally in control digits for credit card numbers, and in Canadian social security ID numbers as well.

The algorithm, as described in Phrack (47-8) (a long time hacker online publication):

"For a card with an even number of digits, double every odd numbered
digit and subtract 9 if the product is greater than 9. Add up all the
even digits as well as the doubled-odd digits, and the result must be
a multiple of 10 or it's not a valid card. If the card has an odd
number of digits, perform the same addition doubling the even numbered
digits instead."

If you have a huge directory with tens or even hundreds of thousands of files, readdirectory() uses more memory than perl's opendir/readdir. This isn't usually a concern anymore on "normal" modern computers, but it might be the rationale behind perl's more tedious way, created back in the 80s. The same argument often goes for file slurping.

For up to five input arguments, permutations() is probably as fast as it can get in this pure perl implementation (see the source). For more than five, it could be faster. How fast is it now? Running with different n, this many calls took that many seconds:

If the first argument is a coderef, that sub will be called for each permutation, and the returns from those calls will be the real return from permutations(). For example, this:

print for permutations(sub{join"",@_},1..3);

...will print the same as:

print for map join("",@$_), permutations(1..3);

...but the first of those two uses less RAM if the 3 had been, say, 9. Change the 3 to 10, and many computers won't have enough memory for the latter.

The examples print:

123
132
213
231
312
321

If you just want to, say, calculate something for each permutation, but are not interested in the list of them, simply don't take the return value. That is:

my $ant;
permutations(sub{$ant++ if $_[-1]>=$_[0]*2},1..9);

...is the same as:

$$_[-1]>=$$_[0]*2 and $ant++ for permutations(1..9);

...but the first uses next to no memory compared to the latter. They have about the same speed. (The examples just count the permutations where the last number is at least twice as large as the first.)

permutations() was created to find all combinations of a person's name. This is useful in "fuzzy" name searches with String::Similarity when you cannot be certain which are the first, middle and last names. In foreign or unfamiliar names it can be difficult to know that.

The default is 3, but here 4 is used instead, given as the second, optional input argument:

print join ", ", trigram("Kjetil Skotheim", 4);

And this prints:

Kjet, jeti, etil, til , il S, l Sk, Sko, Skot, koth, othe, thei, heim

trigram() was created for "fuzzy" name searching. If you have a database of many names, addresses, phone numbers, customer numbers etc., you can use trigram() to search among all of those at the same time, even if the search form has only one input field: one general search box.

Store all the trigrams of the trigram-indexed input fields coupled with each person, and when you search, take each trigram of your query string and add up the lists of people that have that trigram. The search result should then be sorted so that the persons with the most hits are listed first. Both the query strings and the indexed database fields should have a space added first and last before trigram()-ing them.
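
A minimal sketch of that indexing and scoring, assuming a %persons hash of id => name (the data here is hypothetical; only trigram() is from this module):

use Acme::Tools;
my %persons = (1 => "Kjetil Skotheim", 2 => "Kjell Skogheim");
my %index;                                                     # trigram => array of person ids
for my $id (keys %persons) {
    push @{ $index{$_} }, $id for trigram(" $persons{$id} ");  # note the added spaces
}
my $query = "Skotheim";
my %hits;                                                      # person id => number of shared trigrams
$hits{$_}++ for map @{ $index{$_} || [] }, trigram(" $query ");
my @best_first = sort { $hits{$b} <=> $hits{$a} } keys %hits;  # most hits first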

"The Euclidean algorithm (also called Euclid's algorithm) is an algorithm to determine the greatest common divisor (gcd) of two integers. It is one of the oldest algorithms known, since it appeared in the classic Euclid's Elements around 300 BC. The algorithm does not require factoring."

Input: two or more positive numbers (integers, that is, without decimals)

Output: an integer

Example:

print gcd(12, 8); # prints 4

Because the prime factorization of 12 is 2 * 2 * 3 and of 8 is 2 * 2 * 2, the common ('overlapping') part for both 12 and 8 is 2 * 2. The result is 4.
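
A minimal sketch of Euclid's algorithm for two arguments (gcd() in this module also accepts more than two):

# repeatedly replace (a, b) with (b, a mod b) until b is 0; a is then the gcd
sub gcd2 {
    my ($a, $b) = @_;
    ($a, $b) = ($b, $a % $b) while $b;
    return $a;
}
print gcd2(12, 8);  # prints 4: (12,8) -> (8,4) -> (4,0)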

The rest of the arguments is a list that also signals how many columns from the left in each row end up to the left of the data table; the rest end up at the top, and the last element of each row ends up as data.

Returns a data structure as a string. See also Data::Dumper. (serialize() was created a long time ago, before Data::Dumper appeared on CPAN; before CPAN even...)

Input: One to four arguments.

First argument: a reference to the structure you want serialized.

Second argument (optional): the name the structure will get in the output string. If the second argument is missing, undef or '', the structure will get no name in the output.

Third argument (optional): the returned string is also written to a file with the name given in this argument. Putting a > char in front of the filename will append to that file instead. Use '' or undef to not write to a file if you want to use a fourth argument.

Fourth argument (optional): a number signalling the depth at which newlines are used in the output. The default is infinite (some big number), so no extra newlines are output.

Output: a string containing the perl-code definition that recreates the data structure. The input reference (first input argument) can be to an array, hash or a string. Those can contain other refs and strings in a deep data structure.

Limitations:

- Code refs are not handled (just returns sub{die()})

- Regex, class refs and circular recursive structures are also not handled.

Typical usage:

- Storing arrays, hashes and data structures of those in a file or database, or sending them over the net

- eval-ing an earlier stored string to get the data structure back

Be aware of the security implications of eval-ing a perl code string stored somewhere unauthorized users can change it! You are probably better off using YAML::Syck or Storable, without enabling the CODE options, if you have such security concerns. See perldoc Storable or perldoc B::Deparse for how to decompile perl.
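
A minimal sketch of the store-and-eval round trip, assuming serialize() names the structure as described above:

use Acme::Tools;
my %hash = (a => 1, b => [2, 3]);
my $code = serialize(\%hash, 'hash');  # perl code that recreates the structure under the name 'hash'
# ...store $code somewhere you trust, then later:
eval $code;                            # recreates the structure (see the security note above)
die $@ if $@;                          # eval reports compile/runtime errors in $@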

Valid for any Gregorian year. Dates repeat themselves after 70499183 lunations = 2081882250 days = ca 5699845 years... but before that the earth will have a different rotation time around the sun and a different spin time around its own axis...

Bloom filters can be used to check whether an element (a string) is a member of a large set using much less memory or disk space than other data structures, trading speed and accuracy for memory usage. While risking false positives, Bloom filters have a very strong space advantage over other data structures for representing sets.

In the example below, a set of 100000 phone numbers (or any strings of any length) can be "stored" in just 11992 bytes, if you accept that you can only check the data structure for the existence of a string, and accept false positives at an error rate of 0.01 (that is one percent; error rates are given as numbers larger than 0 and smaller than 1).

You cannot retrieve the strings in the set without using "brute force" methods, and even then you would get slightly more strings than you put in, because of the error rate inaccuracy.

To enable deleting, be sure to initialize the Bloom filter with the numeric counting_bits argument. The number of bits could be 2 or 3 for small filters with a small capacity (a small number of keys), but setting the number to 4 ensures that even very large filters with very small error rates will not overflow.

Acme::Tools does not currently support counting_bits => 3, so 4 or 8 are the only practical alternatives.

...that is: m = the best number of bits in the filter and k = the best number of hash functions, optimized for the given capacity (n) and error_rate (p). Note that k depends only on the error_rate. At an error rate of about two percent, the Bloom filter needs just about the same number of bytes as the number of keys.
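
For reference, a short sketch of the standard Bloom filter sizing formulas, which I assume are the m and k referred to above:

my ($n, $p) = (100_000, 0.02);            # capacity and error_rate
my $m = -$n * log($p) / log(2)**2;        # optimal number of bits in the filter
my $k = log(2) * $m / $n;                 # optimal number of hash functions: -log2(p)
printf "m = %d bits = %d bytes, k = %.1f\n", $m, $m/8, $k;  # ~1 byte per key at p = 0.02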

The object-oriented (OO) interface has the same methods as the bf...-subs, without the bf prefix in the names. bfretrieve is not available as a method, although bfretrieve, Acme::Tools::bfretrieve and Acme::Tools::BloomFilter::retrieve are synonyms.

Since md5 returns 128 bits and most medium- to large-sized Bloom filters need only a 32-bit hash function, the result from md5() is split (unpack-ed) into four parts of 32 bits each, and these are treated as if four hash functions were called. Using a different salt on the key for each md5 call results in different hash functions.
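
A minimal sketch of that splitting (the salting scheme here is illustrative, not necessarily the one used internally):

use Digest::MD5 qw(md5);
my $key = "some key";
my @hash;                                        # collect 32-bit hash values
for my $salt (0, 1) {                            # each salted md5 yields 4 hash functions
    push @hash, unpack "N4", md5($salt . $key);  # 128 bits -> four 32-bit unsigned ints
}
# @hash now holds 8 values usable as 8 hash functions of $key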

Digest::SHA512 would have been better, but it's slower than Digest::MD5.

String::CRC32::crc32 is faster than Digest::MD5, but not 4 times faster: