xldlas: A Program for Statistics

xldlas offers a straightforward way to summarize data, plot it and perform regressions on it—and it was written for Linux.

Linux is a virtually unparalleled
platform for using freely distributable software. The kernel source
is free, the standard utilities are free and so is the X Window
System. The whole concept of using free software is incredibly
appealing, and many users are tempted to try running their systems
without any commercial products whatsoever. Yet this desire is
often thwarted by a single missing application; desktop publishing
and presentation software are commonly cited as current “holes” in
the Linux arsenal.

I was faced with such a problem when I decided to abandon the
MS-DOS partition on my hard drive and go all Linux. Since I work
with a fair amount of statistical information, I needed a
straightforward way to summarize data, plot it and perform
regression as needed. gnuplot is great for plotting, but that's all
it does. Octave and MuPad have powerful numerical features, but
they are overkill for simple statistical chores. Unable to find a
program that fit this niche, I decided to write one. The result is
xldlas, a program for statistics. In the grand Unix tradition, its
name is a pseudo-acronym which stands for “x lies, damned lies, and
statistics.” The first public
release in October 1996 met with quite positive feedback from
users, and one of the beta testers (Hans Zoebelein) suggested an
article in Linux Journal might be a good way
to introduce xldlas to a wider audience. The people at
LJ agreed, and asked me to write this
overview. The program runs under the X Window System, and is built
using the XForms library. You'll find information on how to
download xldlas and associated software at the end of this
article.

Using xldlas

The philosophy behind xldlas is to offer standard statistical
tools via an easy-to-use point and click interface. To facilitate
this approach, common commands are grouped together into a set of
menus. In addition, frequently used commands are available via
buttons (see Figure 1).

Like most statistics packages, xldlas handles a random
variable as a vector of values. So a single variable name can refer
to dozens, hundreds or thousands of observations.* By grouping data
points together under variable names, it is easy to perform
relatively complex operations by selecting a few variables and
clicking on the relevant command.

*By default, xldlas has a limit of 100 variables of 10,000
observations each. These constraints can easily be adjusted by
changing the values for MAX_VARS and MAX_OBS in the source code
file xldlas.h.

Of course, before you can perform any kind of statistical
operations, you have to get data into xldlas. Since ASCII is the de
facto standard for exchanging information under Linux, xldlas
allows you to read in space-delimited data from a text file by
using the Import command. You supply a file name, and tell xldlas
whether the data is in column or row format. The import routine
automatically figures out how many variables and observations there
are, and reads in the data. To take a concrete example, suppose you
have a file which contains space-delimited data on rainfall,
temperature and barometric pressure for a single location. After
importing this file, xldlas will have three variables in memory,
which will be called unknown0, unknown1 and unknown2. You can
change these names to anything you like using the Rename command,
which is accessible from the Data menu.

In addition to this simple
ASCII format, xldlas can read and write sets of data in its own
proprietary file format. By convention, these files have an .lda
extension. Since variable names, descriptions and other useful
information are stored in these files, it's generally a good idea
to save all your data this way if you plan on using xldlas
frequently. The Load, Save and Import commands can all be found in
the File menu.

To input data by hand, erase variables or perform
any kind of editing, there are a number of related commands grouped
together in the Data menu. Of these, the most frequently used is
probably the Describe command, which generates a table in the main
xldlas window showing you the name, number of observations, and a
description of every variable currently in memory. In addition to
changing observation values, the Edit command can also be used to
enter a description for a variable.
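
To make the Import step described earlier concrete, here is a minimal Python sketch of reading space-delimited data in either column or row layout. It is not xldlas's own code (xldlas is a compiled X program); the function name `import_ascii` and the `layout` parameter are invented for illustration, and only the unknown0, unknown1, ... naming scheme comes from the description above.

```python
def import_ascii(text, layout="column"):
    """Parse space-delimited numbers into named variables.

    layout="column": each column is a variable, each line an observation.
    layout="row":    each line is a variable, each field an observation.
    (Both the function and its parameter are hypothetical.)
    """
    rows = [[float(field) for field in line.split()]
            for line in text.splitlines() if line.strip()]
    if layout == "column":
        # Column layout only makes sense if every line has the same width.
        if len({len(r) for r in rows}) > 1:
            raise ValueError("lines have differing numbers of fields")
        series = [list(col) for col in zip(*rows)]
    elif layout == "row":
        series = rows
    else:
        raise ValueError("layout must be 'column' or 'row'")
    # Variables get placeholder names, to be renamed later by the user.
    return {f"unknown{i}": values for i, values in enumerate(series)}


# Made-up rainfall, temperature and pressure readings, one line per day.
sample = """\
12.5 18.2 1013.1
0.0 21.4 1009.8
3.2 19.9 1011.5
"""
data = import_ascii(sample, layout="column")
for name, values in data.items():
    print(name, values)
```

The number of variables and observations falls out of the shape of the parsed data, which is all that "automatically figures out" needs to mean for a rectangular text file.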

Another frequently used item in the Data menu is the Generate
command. This routine allows you to perform mathematical
transformations on existing data. To continue with the weather
example from above, suppose we want to convert our rainfall
variable from millimeters to centimeters. With a few clicks of the
mouse, we can easily accomplish this task. We could also add some
random noise, find the log of the data, or what have you. It's a
far cry from Mathematica, but for simple operations the Generate
command is quick and easy to use.
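
For readers who think in code, the transformations mentioned above amount to element-by-element operations on a vector. The following Python sketch shows the idea with made-up rainfall values; it illustrates the arithmetic only and says nothing about how the Generate command is implemented or what syntax it accepts.

```python
import math
import random

# Hypothetical rainfall observations in millimeters.
rainfall_mm = [12.5, 0.0, 3.2, 7.8]

# Unit conversion: there are 10 millimeters in a centimeter.
rainfall_cm = [x / 10.0 for x in rainfall_mm]

# Natural log is only defined for positive values; this sketch marks
# the others as missing (None) rather than failing.
log_rainfall = [math.log(x) if x > 0 else None for x in rainfall_mm]

# Additive Gaussian noise, seeded so that repeated runs agree.
rng = random.Random(0)
noisy_rainfall = [x + rng.gauss(0.0, 0.5) for x in rainfall_mm]

print(rainfall_cm)
print(log_rainfall)
```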

Once you have your data loaded, edited and transformed, the
next logical step is to perform some kind of statistical work on
it. To get a tabular summary of a single variable, including mean,
variance, skewness and kurtosis, there's the Summarize command. If
you want to check multiple variables for linear relationships, the
Correlation command will produce a table of Pearson coefficients.
Similarly, the ANOVA command lets you perform one-way and two-way
analyses of variance by simply selecting variable names with your
mouse and clicking the Go button.
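
The quantities reported by Summarize and Correlation can be written down in a few lines. The Python sketch below uses one common set of definitions: sample variance with an n-1 divisor, and skewness and kurtosis computed from central moments (so kurtosis is 3 for a normal distribution). Whether xldlas uses exactly these conventions is an assumption, and the function names are invented for the example.

```python
import math

def summarize(values):
    """Return basic summary statistics for one variable (needs n >= 2)."""
    n = len(values)
    mean = sum(values) / n
    # Central moments with divisor n.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "n": n,
        "mean": mean,
        "variance": m2 * n / (n - 1),  # sample variance (divisor n-1)
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,      # not "excess" kurtosis
    }

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length variables."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up temperature and pressure readings.
temperature = [18.2, 21.4, 19.9, 23.0, 20.5]
pressure = [1013.1, 1009.8, 1011.5, 1008.2, 1010.9]
print(summarize(temperature))
print(pearson(temperature, pressure))
```

A table of Pearson coefficients is then just `pearson` applied to every pair of selected variables.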

The workhorse of statistical techniques, ordinary least
squares regression, is available via the Regress command. Just
select a single variable from the dependent browser, any number
from the independent browser, and press Go. If you want to store
fitted values, then you can enter a new variable name in the
regression window. The output of the regression command is a set of
three tables, which summarize the fit of the regression, break down
the sum of squares deviations and list coefficient estimates.
Relevant t-statistics and their associated probabilities are
automatically included, as are the F statistic and its significance
level for a joint test of all the estimates.
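
As a rough illustration of what those three tables contain, here is a Python sketch of ordinary least squares with a single independent variable. It is a simplification: the Regress command accepts any number of independent variables, the function name `simple_ols` is invented, and the probabilities attached to the t and F statistics are omitted because they need distribution functions that the Python standard library does not provide.

```python
import math

def simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares (needs n >= 3)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    fitted = [intercept + slope * a for a in x]

    # Sum of squares breakdown: total = model + residual.
    ss_total = sum((b - my) ** 2 for b in y)
    ss_resid = sum((b - f) ** 2 for b, f in zip(y, fitted))
    ss_model = ss_total - ss_resid

    dof = n - 2                      # residual degrees of freedom
    mse = ss_resid / dof             # residual mean square
    se_slope = math.sqrt(mse / sxx)  # standard error of the slope
    return {
        "intercept": intercept,
        "slope": slope,
        "fitted": fitted,
        "r_squared": ss_model / ss_total,
        "ss_model": ss_model,
        "ss_resid": ss_resid,
        "t_slope": slope / se_slope,
        "f_stat": ss_model / mse,    # one regressor, so 1 numerator d.o.f.
    }

# Made-up data that is close to, but not exactly, a straight line.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
result = simple_ols(x, y)
print(result["intercept"], result["slope"], result["r_squared"])
```

With only one independent variable, the joint F test and the t test on the slope carry the same information (F equals t squared); the two differ once several independent variables are selected.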

xldlas also offers two
experimental data fitting routines based on
artificial intelligence techniques. The first, GA Fit, uses genetic
algorithms to build a fit equation that minimizes the sum of
squares between fitted values and actual observations of a given
dependent variable. The second, NN Fit, creates a back-propagation
neural network using selected independent variables for the input
layer, and a single dependent variable for the output layer. In
both cases, the fitted values from these techniques can be stored
under a supplied variable name. These routines are sometimes useful
for exploring non-linear relationships in data that are generally
difficult to examine using standard OLS regression.*

*Although not part of the “standard” statistical toolkit,
these sorts of AI techniques are becoming increasingly common in
various contexts and are great for data mining. Their
implementations in xldlas are fairly rudimentary, but more
sophisticated versions are likely if users request
them.

In addition to manipulating data and performing analysis,
xldlas allows you to graph variables. All of xldlas's graphical
output is actually performed by gnuplot, an application which is
included in all major Linux distributions. Two graphing commands
are implemented: Plot and Histogram. The former lets you create
line and scatter plots, while the latter generates a histogram
describing a variable's distribution. Both sorts of graphs can be
titled and labeled, and they can be saved in any format supported
by whatever version of gnuplot is installed on your system. In
addition, you can set point and line styles, and the Histogram
routine includes an optional feature which will superimpose a
normal distribution with the same mean and variance as the data
being graphed.
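
The superimposed normal curve is easy to reason about numerically. The Python sketch below bins a variable into equal-width intervals and computes, for each bin, the count a normal distribution with the same mean and variance would predict. The function names and the choice of bin count are invented for the example; in xldlas itself the drawing is handed to gnuplot, and how it chooses bins is not covered here.

```python
import math

def histogram(values, bins=5):
    """Return (edges, counts) for equal-width bins spanning the data."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; nothing to bin")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin, not a bin of its own.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

def normal_overlay(values, edges):
    """Expected count per bin under a normal with the data's mean/variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    width = edges[1] - edges[0]
    centers = [(a + b) / 2 for a, b in zip(edges, edges[1:])]
    # Approximate each bin's probability as density at center times width.
    return [n * width * math.exp(-(c - mean) ** 2 / (2 * var))
            / math.sqrt(2 * math.pi * var) for c in centers]

# Made-up observations.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5]
edges, counts = histogram(values, bins=4)
expected = normal_overlay(values, edges)
for left, count, exp_count in zip(edges, counts, expected):
    print(f"{left:5.2f}  observed={count}  normal={exp_count:.2f}")
```

Comparing the two columns gives a quick visual check of how far the data departs from normality, which is the purpose of the overlay.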

xldlas also provides fairly
powerful logging facilities. The Log command allows you to echo all
of xldlas's output to an ASCII file. A more powerful tool is the
TeXLog command, which allows you to create a plain TeX format log
file with a user-supplied name. All subsequent output, such as
regression tables, is written to this file in TeX format. Under
xldlas's default configuration, all saved graphs are also included
as Encapsulated PostScript insertions. This makes writing
statistical papers (such as homework assignments) quite fast and
efficient, since much of the time-consuming TeX markup is done
automatically.

Finally, all xldlas commands are documented on-line in the
Help menu. There are also a number of on-line tutorials, which many
users of xldlas have found to be a very useful introduction.
