gunglichen has asked for the
wisdom of the Perl Monks concerning the following question:

Hello All,
Have any of you used SAS in the past? I would like to evaluate Perl as a viable alternative. Does Perl have powerful data manipulation and summarisation abilities? If it does, could you please point me in the right direction to start learning it?
Thanks

GNU R, like S+, is a dedicated statistical language similar in
purpose to SAS, but intentionally similar to S, another
statistical language more popular in academia than in
business. Most people like the esthetics of S better than
SAS, but SAS is very widely ported, marketed, and supported,
and is exceedingly complete.

SPSS and Minitab are similar in purpose to SAS as well,
but I know little about them so won't say anything. Search
Google.

To Perlers who know little about SAS but seem to need to
have some opinion of it, SAS talks to many databases,
presents to the user an extended dialect of SQL enhanced
for statistical tasks, and provides a massive library of statistical
and data processing functions. It imports data from arbitrary formats - part of the language is specifying
input format and it is effective as a parser - and it works
across databases and flat files of arbitrary format.
It works with extremely large data sets efficiently.
No special syntax or logic is needed to work with datasets
far larger than would fit into virtual memory, though
many operations require a large amount of temporary space.
The companies I've worked for or know people who work at
often process terabytes a day. One place, in the medical
information industry, has thousands of people on staff
doing statistics on data, a good chunk of them using SAS.
SAS is used for anything related to statistics - sometimes
marketing, but also drug interaction research, stock
market speculation, financial planning, insurance
(a major purchaser of supercomputers), and numerous
other things. It tends to be used by people who are
statisticians but not necessarily programmers. SAS is very
old and very mature. It was originally written in FORTRAN,
and lived for a long period of time as a mix of FORTRAN and
C, though I'm told the FORTRAN parts have been rewritten.
Like a lot of old software, it is very reliable and
of very high quality, and has evolved a lot through
continued pressure, though it may not be very consistent.
Other statistics programs have moved in on SAS lately -
web based applications that push down buggy ActiveX
plugins, and attempt to graft "5GL" logic onto the process,
making design visual. They are extremely limited,
extremely buggy, and dumb. They try to do queries for
you, but screw it up, so you have to muck through their
busted SQL trying to fix it, without the option of
rewriting it, as the tool would no longer be able to understand
the SQL and would then no longer be visual. Microstrategy
is an example. Its output looks pretty, and it does
simple things easily, but man... I'm just trying to
put what SAS is into perspective.

To actually answer your question - is Perl a viable
alternative to SAS - I'd say "no". Perl could not replace
SAS. They are too different: SAS is only marginally a
language but is primarily a library of integrated
routines with a lot of backend, and SAS is very good at
what it does. Perl couldn't replace SAS. If you wanted to
know if Perl could replace SAS for your particular
application of SAS, that is an entirely different
question, and it depends on what you're doing with SAS.
Very likely you're using only a small portion of SAS,
making it much easier. Still, if you're employing non-programmer statisticians, they won't be comfortable
with Perl. Better to use R (still far less complete, but
at least specialized). If you yourself have some basic
statistical things that you want to do and you're
able to program in a "traditional" C-like programming
language, you'll find yourself writing a lot more Perl
than you would SAS to do the same job, but PDL (Perl Data Language), PDL::R (some R functions for PDL), and lots of things under Math:: on CPAN will go a long way. You'll
need a database - no bones about it - and it will need
to do subqueries.
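
To give a rough feel for the "a lot more Perl" point, here is a minimal sketch of an ordinary least-squares line fit in plain Perl, using only the core List::Util module. The sub names (mean, linear_fit) and the sample data are made up for illustration, and this is not how PDL or any CPAN module does it; with PDL you would not hand-roll the loop at all.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Arithmetic mean of a list of numbers.
sub mean {
    my @v = @_;
    return sum(@v) / scalar(@v);
}

# Ordinary least-squares fit of y = intercept + slope * x.
# Takes two array refs of equal length; returns (slope, intercept).
sub linear_fit {
    my ($x, $y) = @_;
    die "need two lists of equal length with at least 2 points\n"
        unless @$x == @$y && @$x >= 2;

    my ($mean_x, $mean_y) = (mean(@$x), mean(@$y));
    my ($sxy, $sxx) = (0, 0);
    for my $i (0 .. $#$x) {
        $sxy += ($x->[$i] - $mean_x) * ($y->[$i] - $mean_y);
        $sxx += ($x->[$i] - $mean_x) ** 2;
    }
    die "all x values are identical\n" if $sxx == 0;

    my $slope = $sxy / $sxx;
    return ($slope, $mean_y - $slope * $mean_x);
}

# Made-up sample data in which y is exactly 2x + 1.
my @x = (1, 2, 3, 4, 5);
my @y = (3, 5, 7, 9, 11);
my ($slope, $intercept) = linear_fit(\@x, \@y);
printf "slope=%.2f intercept=%.2f\n", $slope, $intercept;
```

That is some thirty lines for a single fit, with no diagnostics, no missing-value handling, and everything held in memory, which is roughly the gap being described.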

If you're just learning statistics: You can go the
traditional way, and buy a book on statistics and a
calculator with statistical functions (or equivalent
software), in which case you're exposed to performing
the functions and not so concerned with processing
large amounts of data. If you just have a lot of data
to process, you probably don't need statistics at all -
a good database application will do you. Somewhere in the
middle, a lot of statistics tasks are very common:
finding products that sell well together and should be
co-promoted, or optimizing variables (number of flights
an airline should make between two cities in a month,
price to market a product at), computing customer churn
and optimizing customer service for maximum profit
(minimum churn, minimum cost). You wait on hold for an
hour before you can get a rep and you think the company
is just really busy? It is all completely intentional.
They know exactly how much customer service costs and they
know how much business they will lose when they provide
different levels and they've intentionally picked exactly
that level of service. Most people have no idea what
a prominent role statistics play in their consumer
experience...

Anyway, I hope this background and these pointers help
with whatever you're trying to do. If you expound on
what you're trying to do, someone will probably be able
to give a less broad, more helpful tutorial.

gunglichen, you don't explain what your application is and how you currently use SAS, so it's hard for me to make specific comments, but I'm going to drivel on anyway :-)

As well as Perl, I use SAS at work - it's used as the basis of our (multi-terabyte) datawarehouse, which I personally think is pretty horrific, and for marketing/customer research analysis on substantial amounts of data where it really shines.

I'm not really a fan of SAS - I generally don't use it as a statistical package, but as a datawarehouse/datastorage system.
Statisticians, analysts and pharmaceutical users seem to be the type of folk that will get the most from SAS - it's where SAS grew up, and it shows.
There seems to be a push for SAS to move into the database/warehouse area, but I'm not too impressed.

The basic SAS does not handle parallel processing, concurrency, transactions and the like that you'd expect from an RDBMS.
There are additional SAS packages that help, but I'd rather use an RDBMS.

If you want to calculate regressions, aggregations, perform summarisations, and more analytical functions than I understand, SAS is one of those bits of kit that'll do the job.

Perl is neither an RDBMS nor a complicated statistics package.
It's a case of the right tool for the job.
Some of the SAS programmers I work with try and do everything in SAS, and it gets nasty quickly.
My colleagues are getting bored of me telling them they should be using Perl for that ;-)

SAS does have text processing capabilities, but I almost always extract the data and munge it using Perl, or stick it through some other bespoke software.

If you're using SAS to perform transforms, data storage, basic reporting, comparisons etc, maybe Perl plus some kind of database (from CSV to RDBMS depending on requirements) would do the job.
If you're storing, retrieving and querying data, use an RDBMS (and maybe Perl as a glue language).
If you want hardcore statistics, use SAS.
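
As a rough illustration of the "transforms and basic reporting" end of that scale, here is a minimal sketch of a per-group summary in plain Perl. The input format ("group,value" lines), the sub name summarise, and the sales figures are all made up for the example; real CSV with quoting would want a proper parser from CPAN rather than a bare split.

```perl
use strict;
use warnings;

# Summarise comma-separated "group,value" lines into per-group
# count, total, min and max. Returns a hash ref keyed by group.
# Naive split on commas: fine for this toy data, not for quoted CSV.
sub summarise {
    my @lines = @_;
    my %stats;
    for my $line (@lines) {
        chomp $line;
        next unless length $line;
        my ($group, $value) = split /,/, $line;
        my $s = $stats{$group} //= { n => 0, total => 0, min => $value, max => $value };
        $s->{n}++;
        $s->{total} += $value;
        $s->{min} = $value if $value < $s->{min};
        $s->{max} = $value if $value > $s->{max};
    }
    return \%stats;
}

# Made-up sales figures by region.
my @data = ("north,10", "south,4", "north,30", "south,6", "east,5");
my $stats = summarise(@data);

# Print a fixed-width report, one row per region.
printf "%-6s %3s %6s %6s %4s %4s\n", qw(region n total mean min max);
for my $region (sort keys %$stats) {
    my $s = $stats->{$region};
    printf "%-6s %3d %6d %6.1f %4d %4d\n",
        $region, $s->{n}, $s->{total}, $s->{total} / $s->{n}, $s->{min}, $s->{max};
}
```

Once the data no longer fits comfortably in a hash, the same grouping is better pushed down into the database as a GROUP BY, with Perl left to do the formatting.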

There is no way you could hope to replicate all of the statistical and reporting functionality or even a respectable subset that SAS offers using Perl.

Whenever I see phrases like "SAS provides a complete framework for delivering high-value enterprise intelligence that optimally integrates individual technology components within your existing infrastructure.", I assume they've been generated by some jargon program. Having looked at a few pages on the site, I *still* can't work out what SAS actually does...

Does Perl have powerful data manipulation and summarisation abilities

Yes, hence the backronym "Practical Extraction and Report Language".

...the right direction to start learning it

Check out "Learning Perl", written by our very own merlyn, published by O'Reilly and available pretty much everywhere. (Hey merlyn - do I get any commission if he/she buys one? {g} )

AFAIK they specialize in making data aggregations (for relational data mostly). That is data that is preprocessed in some way to reduce its volume. An example can be a log file with timestamps accurate to a second. You rarely need that accuracy - so you can sum the number of records of each type in every hour and keep just those sums. You'll still be able to extract most of the information you need from this data, but you have reduced the volume of the data - and those extractions will be much faster.
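
The hourly roll-up described above can be sketched in a few lines of Perl. The log line format ("YYYY-MM-DD HH:MM:SS TYPE ...") and the sub name hourly_counts are assumptions made up for this example; a real log would need its own pattern.

```perl
use strict;
use warnings;

# Roll up per-second log records into per-hour counts by record type.
# Assumed (hypothetical) line format: "YYYY-MM-DD HH:MM:SS TYPE rest..."
sub hourly_counts {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        # Capture the date plus hour and the record type;
        # minutes and seconds are matched but thrown away.
        next unless $line =~ /^(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2}\s+(\S+)/;
        my ($hour, $type) = ("${1}:00", $2);
        $count{$hour}{$type}++;
    }
    return \%count;
}

# Made-up sample records.
my @log = (
    "2003-05-01 09:00:01 LOGIN user=alice",
    "2003-05-01 09:15:42 LOGIN user=bob",
    "2003-05-01 09:59:59 ERROR disk full",
    "2003-05-01 10:00:00 LOGIN user=carol",
);

my $counts = hourly_counts(@log);
for my $hour (sort keys %$counts) {
    for my $type (sort keys %{ $counts->{$hour} }) {
        print "$hour $type $counts->{$hour}{$type}\n";
    }
}
```

Lines that do not match the pattern are silently skipped here; whether to skip, count, or die on them is a design choice that depends on how much you trust the log.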

First and foremost, SAS is statistical software. If SAS is a statistics professor, Perl is a baby.

On the other hand, Perl is a much nicer general glue language. SAS has pretty good macros, but as a programming language it's not quite as good as Splus or R. (R is a really good free, open-source statistical package, by the way; not as bloated as SAS.)

In general, you use Perl to automate stuff around your statistical work, which is best done by SAS.

There are some things that both SAS and Perl can provide, such as database access and management as well as web publishing.

Many software packages that come with web publishing turn out to have clumsy web publishing capability, yet the price tags are enormously high. In that case, you're better off using Perl to get the data out of your software and publish it.

There are many other things SAS provides, so I'm not sure what else you're comparing.

In short, Perl can complement SAS more than SAS complements Perl (in the sense that you probably won't be controlling Perl from within SAS). Yet you'll never do in Perl what you'll do in SAS, which is really good at what it's doing (for a dear price).

I have about 10 years' experience with SAS and am currently working through the exercises in "Learning Perl." SAS is no longer, from a revenue standpoint, a "Statistical" package. The "data base" part is huge and SAS is always trying to get their fingers into every pie (including hosting). They do a very good job. Their people are top notch.
If I knew as much Perl as I know SAS, I would use Perl for the web, not SAS, because Perl is a programming language and is supported by the top minds in web programming.

Anonymous Monk - I went the other way. I had a Perl background before I was introduced to SAS and wondered how the product could ever justify the license fee, especially as my colleagues wrote everything in proc sql steps and used Perl Pattern Matching for anything more complicated. (I also thought it was rather cheeky how SAS have incorporated Perl Pattern Matching into their product, but that's a separate issue).

Consequently, I developed a Perl package to provide SAS-style data manipulation natively in Perl (using a MySQL back end to hold the data). OK, I'm biased, but I think it's rather cool.

Would anyone be interested? It's still in a very early stage and there is minimal error handling. I'm happy to make the code available to others, as I've taken loads from the Perl community without really contributing...but I'm a bit daunted by CPAN as I've never released anything before .....but if anyone was interested or wanted a copy....get in touch.

Update: the above list of links came from my personal list of resources. I put in a couple of hours to adapt it and make a new, more nicely formatted node that contains these and a lot more useful links and information: Where and how to start learning Perl.

I sure hope so, as we are looking to replace SAS in our workplace with Perl. The thing about SAS is that the licensing deal is horrible. They license not per site, but per server. And the licensing isn't cheap, either (I've heard 6-digit figures per license). I've always argued that SAS was nothing more than a crappy database with some "nice" reporting. This is nothing that, with a little bit of foresight, can't be accomplished with any other language. Perl (combined with an RDBMS of your choice) comes to mind as a good candidate because of its ease of database connectivity and its reputation as a reporting language.

There is a cost vs benefit trade-off here that you may want to look at. If SAS (I don't know, I have never looked at it) provides all of the functionality you are looking for out of the box, and is supported, it may be more cost effective to buy the product vs spending the time to develop and support it in house. You have to decide how much time it would take your team to replicate all of the features, test, fix, document, support -- in the long haul it may be the same amount of cost or even more. On the other hand, if SAS does not provide all of the functionality you need or is not extendable, then you may have a better business case right off the bat.

What about the total cost in $x months/years? Is it worth spending the time and money now, as opposed to dealing with a product that you can't get under the hood of? What is their turnaround time for bugs? Do they have a "forced" upgrade path, which you *must* follow in order to maintain your support contract? Do they end up providing Platinum service to a few of their customers, while the smaller shops have to deal with stuff as is? If you follow their upgrade path, and need a new server, do you need a new license? And so on and so forth.

I understand the framework you are coming from, and prefer that if people don't really understand a problem space, they go with a canned product, at least initially. But if the organization is fairly knowledgeable about what they want (I can hear the groans), then doing in-house devel works out in the long run. The turnaround time should be much, much shorter on bugs and feature requests, even if, instead of a true in-house product, the organization maintains a business relationship with a contractor who wrote the program. It really depends on how fast and/or frequently changes occur to the core business practices and goals, and how this product ties into the scheme of things.

This is a very old thread, but I wanted to point out that PDL is an excellent Perl extension that handles major number crunching, and a number of years ago Maggie Xiong created PDL::Stats, a wrapper around the statistical functions provided by the GNU Scientific Library. You can certainly accomplish a great deal of statistical analysis using Perl and these extensions, but there is no single source of documentation that will tell you how to stitch together database management, statistical analysis, and data visualization. All of these, I presume, are tied together nicely with SAS. To start learning about PDL, you should check out the PDL Book. The table of contents along with a download of the PDF are listed here: http://pdl.perl.org/content/pdl-book-toc.html