Are you going to do heavy text parsing of non-xml documents online, when serving 300+ visitors a minute?
–
Your Common Sense, Mar 29 '10 at 14:48

Shrapnel, yes, users will be posting data that will need to be parsed on the fly. I can realistically expect 20-100 of such posts per minute during heavy loads.
–
Clinton, Mar 29 '10 at 14:52

It is not my intention to start a popularity contest, I am merely seeking pros and cons of using said languages for the requirements I've outlined in the OP. I'll adjust the question accordingly.
–
Clinton, Mar 29 '10 at 15:37

Unless you already have an architecture for this in mind, it seems like asking about the best way to architect this may be a useful question, and may well inform the answers to this question. For example, submitting the documents to be processed to a queue to be processed inline may alleviate some of the front-end load problem, as long as they don't immediately need to be available. It also allows for separate languages to be used for the web and processing systems, so each could play to their own strengths.
–
kbenson, Mar 29 '10 at 18:54
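
The queue idea in the comment above can be sketched roughly as follows. This is only an illustration in Python using the standard library's in-process `queue` and `threading` modules; a real deployment would more likely use an external message queue. The names `parse_document`, `worker` and `process_in_background` are hypothetical, and the word count stands in for whatever parsing the posts actually need.

```python
import queue
import threading


def parse_document(text):
    # Hypothetical stand-in for the real parsing step: count the words.
    return len(text.split())


def worker(jobs, results):
    # Drain the queue until the sentinel (None, None) arrives.
    while True:
        doc_id, text = jobs.get()
        if doc_id is None:
            jobs.task_done()
            break
        results[doc_id] = parse_document(text)
        jobs.task_done()


def process_in_background(posts):
    # The "front end" only enqueues work; the worker thread does the parsing.
    jobs = queue.Queue()
    results = {}
    thread = threading.Thread(target=worker, args=(jobs, results))
    thread.start()
    for doc_id, text in posts:
        jobs.put((doc_id, text))
    jobs.put((None, None))  # tell the worker to stop
    thread.join()
    return results
```

Because the web tier only enqueues, the worker could just as well be written in a different language from the front end, which is the point the comment makes.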

@Robert P. "then you decide to use pcre library via COBOL's C bindings... now you have THREE problems" >-)
–
DVK, Mar 30 '10 at 7:54

However, Perl, Python or Ruby or even ServerSide JavaScript (...) should all be capable of doing what you are asking for either. PHP has its quirks, and so do the other languages. If you are a Java guy, you might like Ruby for its syntax, but then again, only you can decide.

PHP 5 has the SimpleXML Class which makes working with XML very easy.
–
Xeoncross, Mar 29 '10 at 14:58

Gordon, thank you for these excellent references. While they are definite pros for PHP, can you outline any cons I might encounter? Another answer mentions problems with UTF8, can you confirm or deny such problems exist?
–
Clinton, Mar 29 '10 at 15:58

@Sinan I really don't understand your angry tone nor your definition of "Subjective" here. Gordon is offering hard info why PHP could, among a broad variety of other languages, fulfill the requirements stated by the OP. He is not saying "it will work better than (Perl|Ruby|any other language)", nor "I would go with PHP"
–
Pekka 웃, Mar 29 '10 at 16:38

@Sinan - I must admit that while I generally find PHP vs. Perl fanboyism as distasteful as you do, this answer is in fact VERY inoffensive to me because of the explicit "other languages should all be capable of doing what you are asking for either" from the get-go. That does make it slightly less useful for the OP's problem, though, since it does not in fact provide any marginal reasons for choosing one or the other :)
–
DVK, Mar 30 '10 at 8:01

I'd go with Perl. The LibXML series of modules gives a variety of interfaces (DOM, XPath, XSLT, etc.) backed by a fast C parser.

Perl's regex support for slicing and dicing text is pretty much unmatched by any other language. If you expect to do lots of arbitrary text processing, and are at least a little familiar with regex, you will thank yourself.

There are also a series of great web frameworks for Perl, including the simple but powerful Mojolicious framework, and the comprehensive Catalyst framework. There's always the ancient and stable CGI library, but Mojolicious or Catalyst would probably be better choices.

Just to be crystal clear if you don't already know this: whether you use Perl or PHP or something else, NEVER EVER use a DOM XML parser for large XML documents unless your server has unlimited memory :)
–
DVK, Mar 30 '10 at 8:02
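
To illustrate the alternative to a DOM parser that the comment above hints at, here is a minimal streaming sketch. It is written in Python with the standard library's `xml.etree.ElementTree.iterparse` rather than in Perl or PHP; the function name `count_items` and the `item` tag are made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def count_items(source, tag="item"):
    # Walk the document as a stream of "end" events instead of
    # building the whole tree in memory first.
    count = 0
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Drop the element's text and children once it is handled.
            elem.clear()
    return count


sample = io.BytesIO(b"<root><item>a</item><item>b</item><other/></root>")
print(count_items(sample))  # prints 2
```

Note that `elem.clear()` only empties each processed element; the emptied elements stay attached to the root, so for truly huge files you would probably also want to detach them from their parent.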

It is, indeed, very much a subjective question. I can totally conceive that in 2010, Perl or PHP (and even Python or Ruby) could equally serve you for such a project. The difference is not going to come from the language itself as much as the tools, best practices and community.

Among these languages, I am most familiar with Perl, so let me try to offer an answer from that perspective, regarding your needs.

Text and XML parsing: Perl has very robust support for text parsing of even very long files (as long as you don't slurp), and allows powerful, clear and easy regex programming. It has clear built-in Unicode support and standard trans-encoding tools (the Encode module), which is very handy when it comes to user interfaces. It also has a direct binding for libxml2 in the form of a standard, fast and well-maintained module: XML::LibXML.

Relational DB Support: In addition to the standard database interface (DBI), which allows direct SQL queries to a number of DBMSes, there are a number of frameworks that make DB-to-Webdoc management easier while remaining powerful, the most famous probably being Catalyst.

HTML Document presentation: Mason is my favorite web application delivery engine. The integration with Perl is so elegant, yet it does not sacrifice templating patterns or language features.
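
The "don't slurp" advice in the text-parsing point above is language independent. A rough sketch of the idea, in Python rather than Perl: read the input one line at a time and apply a precompiled regex to each line, so memory use does not grow with file size. The `key = value` format, the `KEY_VALUE` pattern and the function name `parse_lines` are assumptions made purely for illustration.

```python
import io
import re

# Hypothetical line format: "key = value", anything else is skipped.
KEY_VALUE = re.compile(r"^(\w+)\s*=\s*(.+?)\s*$")


def parse_lines(stream):
    # Iterating over the stream yields one line at a time,
    # so the whole file is never held in memory at once.
    pairs = {}
    for line in stream:
        match = KEY_VALUE.match(line)
        if match:
            pairs[match.group(1)] = match.group(2)
    return pairs


sample = io.StringIO("name = Clinton\n# a comment\nlang=Perl\n")
print(parse_lines(sample))
```

The same function works unchanged on a file opened with `open(path, encoding="utf-8")`, since file objects are iterated line by line as well.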

As it appears the bulk of your work will be processing data rather than presentation, in my opinion this is what Perl does best. Perl performs very well with regular expressions, and the vast array of modules on CPAN can help you parse commonplace formats. There are also a good few frameworks in Perl that will make life easier in the presentation of the data. The major disadvantage for a newcomer is that, with tens of distributions on CPAN for each of the various problems you may encounter (XML parsing, web framework, ORM etc.), it can be hard to decide which one to use. Thanks to Plack/PSGI, talking to web servers from Perl has gotten much, much better in recent times.

It's important to note that "load" is a problem that is completely language agnostic: it is not the language you choose but how you engineer your system that will determine how well it handles increased load. Perl, Java and PHP have all been used in small setups all the way through to some of the most heavily trafficked websites on the net. If growth is among your future needs, decouple where appropriate and design for future expansion first. Multiple database servers, caching, and message/work queues can all be used at small scale, and putting them in while things are small is easier than having to rewrite or quickly hack them in when demand grows.

As far as I'm aware, the PCRE library behind PHP's regex functions (which I would assume is what you'll use) was modeled on Perl's regular expressions. So if you have a lot of non-XML parsing, you need to test both and see which one runs faster. I'm not sure which one is faster for your needs.

They both handle XML well (finally).

However, PHP simply has a massive community. No other scripting language on the planet has one as large. So if that matters to you, use PHP, since you can find everything under the sun about it.

However, Perl also has a large following and I'm sure there are plenty of tutorials for everything you would want to do.

Python is also a language you might want to look into. Heck, since everyone realized Ruby was God's gift to the world, it has exploded too! You can honestly do what you want in any language, so you need to look at the syntax of each of them and figure out which one you like best. From there you can run a simple example benchmark in each one to see which language is the fastest for your needs.
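
A "simple example benchmark" of the kind suggested above could look something like this in Python, using the standard `timeit` module. The sample data, the two parsing strategies and the helper names are all invented for illustration; the idea is just to time the same small parsing task in each candidate language and compare.

```python
import re
import timeit

# Made-up sample input: 1000 repetitions of two key=value tokens.
SAMPLE = "user=clinton lang=perl " * 1000
PATTERN = re.compile(r"(\w+)=(\w+)")


def parse_with_regex():
    # Extract (key, value) pairs with a precompiled regex.
    return PATTERN.findall(SAMPLE)


def parse_with_split():
    # Extract the same pairs with plain string splitting.
    return [tuple(token.split("=", 1)) for token in SAMPLE.split()]


def benchmark(func, repeats=20):
    # Total wall-clock seconds for `repeats` calls of func.
    return timeit.timeit(func, number=repeats)


if __name__ == "__main__":
    for func in (parse_with_regex, parse_with_split):
        print(func.__name__, benchmark(func))
```

Timings from a toy like this only say something about this particular input, so it is worth repeating the test with data that resembles the real posts.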

Whatever you do, don't use a "framework" like WordPress or Drupal. They are CMSes, not frameworks, and they are slow and bloated. WordPress takes 8MB just to load the index page!

We had a PHP project and a guy from Java joined us and was up and running in a week or two once he got the hang of everything.

All the languages mentioned should be usable for your purpose. But as far as I know, PHP can be a little bit tricky with UTF8 strings (e.g. getting the right string length when a UTF8 character consists of multiple bytes). But I'm sure some guys will provide good solutions for PHP via comments soon :-)

My personal favorite is Ruby, as it provides really easy and powerful APIs (so-called gems) for all your needs.

Some of the non-xml data being posted by users will be in German or Russian, and therefore I need the parsing to properly handle such cases. Is UTF8 character handling a known problem with PHP?
–
Clinton, Mar 29 '10 at 15:31

UTF8 is not supported by native strings in PHP5. So you might run into trouble if you use them (e.g. strpos() returns a byte offset and not a character offset). You would have to use dedicated UTF8 string functions, or wait for PHP6, which is expected to support UTF8 in native strings; we will see.
–
Achim Tromm, Mar 29 '10 at 16:07

PHP 5 does not have native support for Unicode or multibyte strings, unlike Perl and Python, but there is the mbstring module. This problem will be fixed in PHP 6, but that hasn't been released yet.
–
Leon Timmermans, Mar 29 '10 at 16:10
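
The byte-versus-character distinction these comments describe can be shown with a small sketch. It uses Python 3 as a neutral illustration, not PHP: a Python `str` counts characters, while its UTF-8 encoded `bytes` form counts bytes, which is roughly the difference between multibyte-aware and byte-oriented string functions. The helper name `char_and_byte_lengths` is made up for the example.

```python
def char_and_byte_lengths(text):
    # Number of characters vs. number of bytes in the UTF-8 encoding.
    return len(text), len(text.encode("utf-8"))


german = "Größe"
russian = "Привет"

print(char_and_byte_lengths(german))   # prints (5, 7)
print(char_and_byte_lengths(russian))  # prints (6, 12)

# Offsets differ in the same way: the final "e" of "Größe" is at
# character index 4 but at byte index 6 in the UTF-8 encoding.
print(german.find("e"), german.encode("utf-8").find(b"e"))  # prints 4 6
```

Any parser that slices German or Russian input by byte offsets while assuming character offsets risks cutting a character in half, which is why the multibyte-aware functions matter for this use case.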

Depending on your needs, you may want to consider a framework that already supports caching. Drupal is one example, but there are many others. Most frameworks are extensible, so you can add plugins to handle all the parsing and presentation.

I think language is less important than the framework you choose. I would personally choose PHP over Perl, because I think it is more applicable in the real world. Python is another beautiful scripting language, but PHP has the most traction in the web world. If your goal is to make your skill set more marketable, go with PHP.

Seriously, you are that picky about nomenclature?
–
vfilby, Mar 29 '10 at 15:01

Yes, we are that picky about referring to the language by its correct name.
–
Sinan Ünür, Mar 29 '10 at 15:20

When anyone refers to 'PERL', there's a good chance that they are unfamiliar with modern Perl. The same goes for people who always refer to it only as a 'scripting language'.
–
plusplus, Mar 30 '10 at 8:39

I'll admit that I haven't used Perl in 5 years, but that doesn't change my argument. If you want more experience, then the language with the most market penetration is the one you should go with. It makes you more marketable as a developer.
–
vfilby, Mar 31 '10 at 17:28

OK, so everyone has been subjective in their answers, so I'll add mine too.

Use Java: the core supports all you need (no frameworks needed), it's free, open source, and it's 2 to 3 times faster than Perl or PHP.

Seriously...
PHP is designed for Web projects, it's easy, and it supports all you need to do (try Zend Framework). It has a decent learning curve (Java is harder to learn), and there is a huge community of developers out there to help you if you run into something unexpected (bigger than Pearl's and Java's). On performance, it's a little slower than Pearl (I'm talking about plain old PHP scripts, no weird voodoo optimizations), but it's enough for what you probably need.

In the end, I'm pretty sure you will get a smaller, more consistent app if you use PHP (and follow all the coding and design best practices) than you will ever get using Perl.

(Java is way better... but I don't want to be verbally lynched by some PHP zealot)

The question rules out one and only one language … Java. And "Pearl"? Really?
–
Quentin, Mar 29 '10 at 19:15

Well, as I thought, I just got lynched by all the zealots available. No one really read my answer through... =) C'mon, the first paragraph is a JOKE, people! In retrospect not a very good one considering the results, but try to read the rest... shame on you people! =P
–
Chepech, Apr 9 '10 at 19:05