Posted
by
timothy
on Thursday October 17, 2013 @12:28PM
from the english-then-chinese dept.

New submitter longhunt writes "I just started my second year of grad school and I am working on a project that involves a computationally intensive data mining problem. I initially coded all of my routines in VBA because it 'was there'. They work, but run way too slow. I need to port to a faster language. I have acquired an older Xeon-based server and would like to be able to make use of all four CPU cores. I can load it with either Windows (XP) or Linux and am relatively comfortable with both. I did a fair amount of C and Octave programming as an undergrad. I also messed around with Fortran77 and several flavors of BASIC. Unfortunately, I haven't done ANY programming in about 12 years, so it would almost be like starting from scratch. I need a language I can pick up in a few weeks so I can get back to my research. I am not a CS major, so I care more about the answer than the code itself. What language suggestions or tips can you give me?"

I have a friend who works for a company that does gene sequencing and other genetic research and, from what he's told me, the whole industry uses mostly python. You probably don't have the hardware resources that they do, but I'd bet you also don't have data sets that are nearly as large as theirs are.

You might also get better results from something less general purpose like Julia [julialang.org], which is designed for number crunching.

This is certainly the way of the future, not just for gene sequencing but many other quantitative sciences, although a complete answer would be Python and C++, because numpy/scipy can't do everything and Python is still very slow for number-crunching. It's best to start with just Python, but eventually some C++ knowledge will be helpful. (Or just plain C, but I can't see any good reason to inflict that on myself or anyone else.)
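A quick sketch of the "Python is slow for number-crunching" point (illustrative only, not anyone's actual pipeline): the arithmetic below is identical in both functions, but the numpy version runs its loop inside compiled code, which is typically one to two orders of magnitude faster on large arrays.

```python
import numpy as np

def dot_loop(a, b):
    # Pure-Python loop: every iteration pays interpreter overhead.
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

def dot_numpy(a, b):
    # Same computation, but the loop runs inside numpy's compiled code.
    return float(np.dot(a, b))

a = np.arange(1000, dtype=float)
b = np.arange(1000, dtype=float)
```

Both give the same answer; the difference only shows up in runtime once the arrays get big.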

a complete answer would be Python and C++, because numpy/scipy can't do everything and Python is still very slow for number-crunching.

The problem with using the mix (when you actually write the C++ code yourself) is that debugging it is a major pain in the ass - you either attach two debuggers and simulate stepping across the boundary by manually setting breakpoints, or you give up and resort to printf debugging.

OTOH, if Windows is an option, PTVS is a Python IDE that can debug Python and C++ code side by side [codeplex.com], with cross-boundary stepping etc. It can also do Python/Fortran debugging with a Fortran implementation that integrates into VS (e.g. the Intel one).

(full disclosure: I am a developer on the PTVS team who implemented this particular feature)

How do you write C++ code for use from Python such that it's not an independent module?

Anyway, regardless of how you architect it, in the end you'll have a Python script feeding data to your C++ code. If something goes wrong, you might want to debug said C++ code specifically as it is called from Python (i.e. with that data). Even if you don't ever have to cross the boundary between languages during debugging, there are still benefits to be had from a debugger with more integrated support - for example, it

Yes, I did my master's thesis using simpy [readthedocs.org] / scipy [scipy.org], integrated with lp_solve for the number crunching, all of which was a breeze to learn and use. It was amazing banging out a new recursive algorithm crawling a new object structure and just having it work the first time without spending several precious cycles bugfixing syntax errors and chasing down obscure stack overflows.

I used the psyco JIT compiler (unfortunately 32-bit only) to get ~100x boost in runtime performance (all from a single import statement, woo), which was fast enough for me... these days I think you can get similar boosts from running on PyPy [pypy.org]. Of course, if you're doing more serious number crunching, python makes it easy to rewrite your performance-critical modules in C/C++.

I also ended up making a LiveCD and/or VM of my thesis, which was a good way of wrapping up the software environment and dependencies, which could quickly grow outdated in a few short years.

Yep. High level languages such as python are great for letting you focus on the domain-specific task you want to accomplish without spending years learning all the little poorly-documented compiler-specific idiosyncrasies of compilers and preprocessors and template languages. Once you're through the prototyping phase and have your interface definitions and unit tests set up, you can then toss things one module at a time over to one of those software weenies to turn into hand-optimized production code. A

"This is certainly the way of the future, not just for gene sequencing but many other quantitative sciences, although a complete answer would be Python and C++, because numpy/scipy can't do everything and Python is still very slow for number-crunching."

I mostly agree with your conclusion, but for somewhat different reasons. I don't believe Python is "the wave of the future", but rather I'd recommend it because it has been in use by the scientific community for far longer than other similar languages, like Ruby. Therefore, there will be more pre-built libraries for it that a programmer in the sciences can take advantage of.

I also agree that some C should go along with it, for building those portions of the code that need to be high performance. I would choose C over C++ for performance reasons. If you need OO, that's what Python is for. If you need performance, that's what the C is for. C++ would sacrifice performance for features you already have in Python.

If it were entirely up to me, however -- that is to say, if there weren't so much existing code for the taking out there already -- I'd choose Ruby over Python. But that's just a personal preference.

I agree with you that doing the number crunching is best in a language designed for that but I don't think C is the answer because it was primarily designed for systems programming, not numeric. If you really need efficient number crunching, go with FORTRAN, especially as the OP says that he already has experience with it.

I have a few points to add.

1) Compiled language vs. scripting language. In general, any compiled language is going to run faster than any scripting language. But you will probably spend more time coding and debugging to get your analysis running with a compiled language. It is useful to think about how important performance is to you relative to the value of your own time. Are you going to be doing these data mining runs repeatedly? Is it worth
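One way to make that time-vs-performance tradeoff concrete is to measure before porting anything. A minimal sketch using the stdlib `timeit` module (the `work` function here is a stand-in for your own routine):

```python
import timeit

def work(n):
    # Placeholder for a computationally intensive routine.
    return sum(i * i for i in range(n))

# Time 100 repetitions; if the total is seconds, porting to a compiled
# language may not be worth your coding time. If it's hours, it is.
elapsed = timeit.timeit(lambda: work(10_000), number=100)
print(f"100 runs of work(10000): {elapsed:.3f}s")
```

Profiling the existing VBA the same way (even crudely, with timers) would tell you which routines are actually worth porting.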

I use Sage. When Python isn't fast enough, I can essentially write in C with Cython. It's gloriously easy. Have some trivially parallelizable data mining? Just use the @parallel decorator. Sage comes with a slew of fast mathematical packages, so your toolbox is massive, and you can hook it all in to your Cython code with minimal overhead.

Sage is okay for small-midsize projects, as is R (both benefit from being free). On the whole, though, I'd really recommend Mathematica, which is purpose-built for that type of project, makes it trivial to parallelize code, is a functional language (once you learn, I doubt you'll want to go back) and scales well up to fairly large data sets (10s of gigs).

For research engineering, I use Java to run the numerical examples of the algorithms I develop although most of the authors in the journals I publish in are using Matlab for this purpose (ewwwwww!). Long time ago I was a Turbo Pascal person as were engineering colleagues who crossed over to Matlab seeking the same kind of ease-of-use. Me, I transitioned to Delphi but now I am with Java and Eclipse -- the Turbo Pascal of the 21st century.

For numeric-intensive work, I can get within 20% of the speed of C++ using the usual techniques -- minimize garbage collection by allocating variables once, use the "server" VM, perform "warmup" iterations in benchmark code to stabilize the JIT. I use the Eclipse IDE, copy and paste numeric results from the Console View into a spreadsheet program, and voila, instant journal article tables.

I have a friend who works for a company that does gene sequencing and other genetic research and, from what he's told me, the whole industry uses mostly python.

I think your friend is mistaken. Though it's essential to know a scripting language, most of the computationally expensive stuff in sequence analysis is done with code written in, as you might expect, C, C++, or Java. Perl and Python are used more for glue code, building analysis pipelines, and processing the output of the heavy duty tools for various downstream applications. R is used heavily for statistics, and especially for anything involving microarrays.

It depends on what exactly his computationally intensive part is. It may be something that can be trivially implemented in Python in terms of standard numpy operations, for example, with performance that's "good enough".

This is exactly the right answer. Never write code that someone else has already written. If you can compose standard operations to do your calculations, then do so in a high-level language. Spend more time thinking and less time coding.
OTOH, if you need to code up something custom and you're REALLY sure that you can't use standard operations to do it, then think again about whether or not you can do it with standard operations. You probably can. But, if you can't, then go with FORTRAN. Or maybe C o
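To make "compose standard operations" concrete, here's a hedged sketch (pairwise distances chosen purely as an illustration): the hand-written double loop and the broadcast version compute the same matrix, but the latter is built entirely from standard numpy operations and never drops into a Python loop.

```python
import numpy as np

def pairwise_dist_loop(pts):
    # The "custom" version: explicit double loop over all point pairs.
    n = len(pts)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = np.sqrt(np.sum((pts[i] - pts[j]) ** 2))
    return out

def pairwise_dist_numpy(pts):
    # Same result, composed from standard broadcasting operations.
    diff = pts[:, None, :] - pts[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

pts = np.array([[0.0, 0.0], [3.0, 4.0]])
```

When a calculation can be expressed this way, the high-level version is usually both shorter and faster than a hand-rolled loop.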

I used it to wrap some crazy magnetometer processing code written in Fortran into a nice Python program. I ripped out all the I/O from the Fortran code and moved it into the Python layer. It worked great. Fortran is AWESOME at number crunching but SUCKS ASS at IO or well pretty much anything else, hence Python.

if you don't care about having your code be maintained or extended by anyone under age 30

1. There are plenty of programmers over age 30.
2. Someone who is 30 today likely finished his BSc in 2005. Do you think Fortran was much more popular then?
3. People under age 30 learn Fortran if they're involved in HPC. It's still widely used, and has advantages over C/C++ (easy, built-in parallelization, etc.).

don't plan on doing any custom visualization beyond GNUplot

There are lots of other programs you can use besides GNUplot. In serious HPC graphics are often considered a back end that runs separately from the main program, and sometimes on a different machine

Was that supposed to be a crack about popularity? Because auditing Fortran is no worse than most other languages, and it can be argued that Fortran is better than most in terms of being able to validate models.

Not really. My first job while still green and fresh out of high school was an internship with Lockheed Martin, working on hundreds of thousands of lines of meteorological software code that was used by NASA and was written in FORTRAN. I went in without ever having seen it before in my life, and was able to pick it up easily enough so that I was productive within a couple of weeks. I recall that having the first few columns of each line reserved for special uses threw me off the first time I saw it, as did

And if you think FORTRAN is some ancient esoteric language, you're ignorant as well. The most recent standard, ISO/IEC 1539-1:2010, informally known as Fortran 2008, was approved in September 2010.

Fortran is, for better or worse, the only major language out there specifically designed for scientific numerical computing. Its array handling is nice, with succinct array operations on both whole arrays and on slices, comparable with matlab or numpy but super fast. The language is carefully designed to make it very difficult to accidentally write slow code -- pointers are restricted in such a way that it's immediately obvious if there might be aliasing, to take the standard example -- and so the optimizer can go to town on your code. Current incarnations have things like coarray fortran, and do concurrent and forall built into the language, allowing distributed memory and shared memory parallelism, and vectorization.
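For readers who know numpy but not Fortran, those whole-array and slice operations map closely onto numpy syntax. A numpy analog, shown purely for comparison (Fortran would write the slice assignment as `a(2:4) = 0`):

```python
import numpy as np

b = np.arange(6, dtype=float)
c = np.ones(6)

a = b + c      # whole-array operation, no explicit loop
a[1:4] = 0.0   # assign to a slice in place
```

The difference is that Fortran compiles these directly to tight machine code, while numpy dispatches each operation to a precompiled C routine.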

The downsides of Fortran are mainly the flip side of one of the upsides mentioned; Fortran has a huge long history. Upside: tonnes of great libraries. Downsides: tonnes of historical baggage.

If you have to do a lot of number crunching, Fortran remains one of the top choices, which is why many of the most sophisticated simulation codes run at supercomputing centres around the world are written in it. But of course it would be a terrible, terrible, language to write a web browser in. To each task its tool.

Those great libraries are spread across several different "FORTRAN"s. gfortran. gfortran44. Intel's fortran. f77. f90. PGI pgif90. etc. etc etc.

Gfortran is woooonderful. It allows complete programming idiots to write functional code, since the libraries all do wonderful input error checking. Want to extract a substring from the 1 to -1 character location? gfortran will let you do it. Quite happily. Not a whimper.

PGI pgif90 will not. PGI writes compilers that are intended to do things fast. Input error checking takes time. If you want the 1 to -1 substring, your program crashes. PGI assumes you know not to do something that stupid, and it forces you to write code that doesn't take shortcuts.

So, if you get a program from someone else that runs perfectly for them, and you want to use it for serious work and get it done in a reasonable amount of time so you compile it with pgif90, you may find it crashes for no obvious reason. And then you have to debug seriously stupidly written code wondering how it could ever have worked correctly, until you find that it really shouldn't have worked at all. They want to extract every character in an input line up to the '=', and they never check to see if there wasn't an '=' to start with. 'index' returns zero, and they happily try to extract from 1 to index-1. Memcpy loves that.

The other issue is what is an intrinsic function and what isn't. I've been bitten by THAT one, too.

And someone I work with was wondering why code that used to run fine after being compiled with a certain compiler was now segment faulting when compiled with the same compiler, same data. Switching to the Intel compiler fixed it.

Sigh. But yes, FORTRAN is a de-facto standard language for modeling earth sciences, even if nobody can write it properly.

In part, this is because Intel has a compiler for it. On commodity hardware (as in desktop, laptop), you will generally get the best performance running an Intel CPU and using an Intel compiler. That means C/C++ or FORTRAN, as they are the only languages for which Intel makes compilers. C++ is easy to see, since so much is written in it but why would they make a FORTRAN compiler? Because as you say, serious science research uses it.

When you want fast numerical computation on a desktop, FORTRAN is a good cho

Agreed. There are also OpenMP implementations for doing your parallel processing. If you're running on a Xeon processor then I would SERIOUSLY consider Intel's linux fortran compiler as it will provide the best performance by far.

It's totally possible to use Python and Fortran side by side. Fortran for heavy computational tasks, Python (with numpy) for glue wrapper code that loads the data and massages it into the desired shape before handing it over to that super-fast Fortran routine, and then visualizes the result
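A minimal sketch of that division of labor, assuming numpy is available: numpy's linear algebra already calls into compiled LAPACK routines (historically written in Fortran), so the Python-as-glue pattern often requires no Fortran of your own. For genuinely custom kernels, f2py (shipped with numpy) generates the Python wrapper around your Fortran subroutine.

```python
import numpy as np

# The solve below dispatches to a compiled LAPACK routine (gesv),
# so the heavy lifting happens outside the Python interpreter.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)
```

Python handles the loading, massaging, and plotting; the flops happen in compiled code either way.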

If he wrote it in VBA, I'm pretty sure he can rewrite it into a native extension of some kind and use it from the same environment. Some industries love those since you expose the functionality to many users who want or need to work in that user environment.

>> Hadoop is extremely useful, but for a student who managed to scrounge up a single Xeon machine, it's entirely ill-suited

Go back and read the problem again: "would like to be able to make use of all four CPU cores"

Here's a guy seeking parallelization... and he may not know that you don't have to throw big (potentially expensive) multicore processors against the problem - he could throw multiple (cheaper?) computers against it.

Second this. There are numerous languages out there that are tailor-made for specific kinds of problems. You didn't quite share enough to narrow down what kinds of problems you need to solve, but the R project is geared toward number crunching, albeit with a significant bent toward statistics and graphic display.

If that's not pointed in the right direction, some other language might be. Alternatively, there are a lot of libraries out there for the more popular languages that could help with what you're doing. Heck, 12 years ago we didn't even have the boost libraries for C++. It's difficult for me to imagine using that language without them now.

R is by far the best solution that I've found for statistical analysis and data mining. It's ugly, inconsistent, quirky and old fashioned but it's absolutely brilliant.

The whole syntax of R is based around processing data sets without ever needing to worry about loops. Read up on data tables - not data frames - in R and you'll learn how to filter data, aggregate it, add columns, perform a regression and beautifully plot the results all in one line of code. The Zoo package will sort out your time series anal

Modelling: hard-core finite element simulations or the like. Then C or Fortran, and you will be linking with the math libraries.
Log processing: a lot of other stuff where you will be parsing data logs and doing statistics. So Perl or Python, then Octave.
Data mining: Python or another SQL front end.

Well, if your problems require statistical computing, R is the language to use. For general scientific computing, the last I checked Octave was still valid. As for multi-core processing, only a few languages and compilers support platforms like OpenMP: Fortran, C, and C++.

It sounds like you are saying a more specific version of what I was going to post.

A little research goes a long way and libraries may be more important than language. I don't care how nice the language is.... the less underlying mechanisms I need to implement, and the faster I can get into the meat of what I am working on, the better.

If you want to do RSA encryption in your code (for example) your best bet is NOT to pick a language where you can't find an RSA implementation (Applesoft basic? lol not sure wh

For completeness, it should also be noted that both C and C++ work with MPI and CUDA. Fortran can theoretically be faster than C or C++ as its compiler can optimize more aggressively (due to the lack of pointer aliasing in Fortran), but I don't have any hard data for how much of a difference it would make in actual runtime speeds.

If you have to do the whole thing from scratch then Fortran is the fastest platform. I can't say I've met anyone who enjoyed Fortran, but it's wicked fast.

True, but the only place where this *really* matters is programming for repetitive calculations on massively parallel supercomputers. For anything else, there is a tradeoff between program speed and developer speed, and ultimately it's cheaper to buy more computers than hire more programmers.

First suggestion: Python. Lots of nice stuff for science (NumPy, SciPy), lots of other goodies, easy to learn, many people to ask or places to get help from. Plus you can explore data interactively ("Yes Wednesday, play with your data!").

Beyond that: CERN uses a lot of Java (sorry folks, true), and they have good (and fast) tools. I'm doing a project right now where I am using Jython since it is supported by the main (Java) software I have to use. I like jhepwork/SCaVis quite a bit, if you are into plotting stuff on

Most of the cutting edge data mining I've seen is done using R (which acts as a scripting wrapper for the C or Fortran code that the fast analysis libraries are coded in), or alternatively in python. Some people swear by MatLab if they have trained in it (so your octave would come in handy there). Have a look at some discussions at places like kaggle.com to see what the competitive machine learning community uses (if that is what you mean by data mining).

This is the correct advice: Use whatever language is most common in your research area, so you can benefit from the most existing source code. This will almost certainly be a high-level scripting language like R, MATLAB or Python, with the ability to drop down to C, FORTRAN and CUDA for the small parts of the code that need optimization. (In my case: electrical engineering = MATLAB + C and CUDA mex files)

A lot of people will propose a language because it is their favorite. Others because they believe it is very easy to learn. I will give you a third line of thought.

I would not look for a language in this case, I would look for a library, then teach myself whatever language is easiest/quickest to access it. I would try to profile what you are building, figure out where the bottlenecks are likely to be (profiling your existing mockup can help here but dont trust it entirely) and try to find the best stable well-designed high performance library for that particular type of code.

I am not sure how much that helps since unless the person is doing something very specific, chances are it will just shift the problem into 'which library is best' debate, which will again mostly involve people suggesting libraries they like or because they believe they are easy to learn.

If you really want to do heavy lifting, you can't beat Fortran. Just stay away from Fortran 77; it's a hot mess. Fortran 90 and later are much easier to use, and they're supported by the main compilers: gfortran and Intel's ifort.

ifort is Intel's Fortran compiler. It's the fastest out there, and it runs on Windows and Linux. Furthermore, you can get it as a free download for some types of academic use. (Search around Intel's website -- it's hard to find.) That said, I usually use gfortran -- which is free and

Use KNIME and you can probably do 90% of what you want by dragging and dropping a few new nodes and joining them up. KNIME does all the complicated memory caching for large filesets for you, and you can write your own Java functions to plug into it if you need something special.

R, MATLAB, SAS, Python, there's a bunch of languages you can use, and a bunch of ways to store the data (RDBMS, NOSQL, Hadoop, etc.). It really comes down to what kind of access to the data you have, how it's presented, what other resources you have available to you, and what you want to do with it.

In general, for flat-out speed, toss interpreted languages (Perl, Python, Java, etc.) out the door. You'll want something that compiles to machine code, esp. if you are running on older hardware. Crunching numbers, complex math, matrices? Then Fortran is the beast. If your data is arranged in lists, consider Lisp, then pick something else as it will likely gi

Oh please! It's not like Lisp doesn't have any other data structure, is it? You can have your multidimensional numerical arrays in CL quite easily. (I'm saying neither "use CL" nor "don't use CL", merely that your argument is pretty weak. It's easier to learn to work with lists in the language you already know (unless it's COBOL!) than to learn an entirely different one just because of lists.)

Since you mention VBA, I suspect that your data is in Excel spreadsheets? If you want to try to speed this up with minimum effort, then consider using Python with Pyvot [codeplex.com] to access the data, and then numpy [numpy.org]/scipy [scipy.org]/pandas [pydata.org] to do whatever processing you need. This should give you a significant perf boost without the need to re-architect everything or change your workflow much.
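If exporting the spreadsheet to CSV is an acceptable first step, even the standard library gets you started (a hedged sketch with made-up inline sample data; in practice you'd open your exported file instead):

```python
import csv
import io

# Hypothetical CSV export of a spreadsheet; stands in for open("data.csv").
sample = io.StringIO("sample,value\nA,1.5\nB,2.5\nC,4.0\n")

rows = list(csv.DictReader(sample))
values = [float(r["value"]) for r in rows]
mean = sum(values) / len(values)
```

From there, `numpy.loadtxt` or `pandas.read_csv` do the same job with far less ceremony once the scientific stack is installed.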

In addition, using Python this way gives you the ability to use IPython [ipython.org] to work with your data in interactive mode - it's kinda like a scientific Python REPL, with graphing etc.

If you want an IDE that can connect all these together, try Python Tools for Visual Studio [codeplex.com]. This will give you a good general IDE experience (editing with code completion, debugging, profiling etc), and also comes with an integrated IPython console. This way you can write your code in the full-fledged code editor, and then quickly send select pieces of it to the REPL for evaluation, to test it as you write it.

FORTRAN used to be it back in the day, but nowadays Matlab is the stuff that many engineers use for scientific computing. Many of the math libraries are very good in Matlab and don't require you to be a computer scientist to make them run fast. I used to work with scientists in my old lab to port their Matlab code to run on HPC clusters, porting them to FORTRAN or C. Often the Matlab libraries smoked the BLAS/Atlas packages that you find on Linux/UNIX machines, for instance. The same would hold true for Octave since they just build on the standard GNU math packages like BLAS.

If you want to be able to ask someone for help then it would be best to use the same tools they use. The point is that any programming language will work. Some languages are easier than others, but the difference is negligible compared to the advantage of being able to ask your peers for assistance.

It's more powerful, concise, and consistent than most languages. However, R and Matlab have larger user communities and this is an important consideration.

There was a note on the J-forum a few months ago from an astronomer who uses J to "...compute photoionization models of planetary nebulae." His code to do this is about 500 lines in about 30 modules and uses some multi-dimensional datasets, including a four-dimensional one of "...2D grids of the collisional cooling by each of 16 ion

I'm an MSEE and I've been working in the digital signal processing realm for the last 10 years since graduating. I should mention that I haven't done a lot of low level hardware work; I haven't programmed actual DSP cards or played with CUDA. I have written software that did real-time signal processing just on a GPU. Everyone in my industry at this point uses C or C++. There is some legacy FORTRAN, and I shudder when I have to read it. Some old types swear by it, but it's fallen out of favor mostly just because it's antiquated and most people know C/C++ and libraries are available for it.

At some point you have to decide what your strength will be. I love learning about CS and try to improve my coding skills, but it's just not my strength. I'm hired because of my DSP knowledge, and I need to be able to program well enough to translate algorithms to programs. If you really want to squeeze out performance then you'll probably want to learn CUDA, assembly, AVX/SSE, and DSP specific C programming. But I haven't delved to that level because, honestly, we have a somewhat different set of people at the company that are really good in those realms.

Of course, it would be great if I could know everything. But at the moment it's been good enough to know C/C++ for most of our real time signal processing. If something is taking a really long time, we might look at implementing a vectorized version. I would like to learn CUDA for when I get a platform that has GPUs but part of me wonders if it's worth it. The reason C/C++ has been enough so far is that compilers are getting so good that you really have to know what you're doing in assembly to beat them. Casual assembly knowledge probably won't help. I might be wrong, but I envision that being the case in the not too distant future with GPUs and parallel programming.

Do you have access to MATLAB or a similar analysis tool? Many universities have licenses, and overall it seems like it might be a good choice for you. These programs usually have a lot of built-in functionality that will be difficult to reproduce if you are not an experienced scientific programmer.

I haven't done ANY programming in about 12 years, so it would almost be like starting from scratch.

This is probably a bigger problem than choosing which language to use. If you don't know how to program properly and efficiently, it doesn't matter which language you choose. If you go this route I'd suggest taking a course to refresh or upgrade your skills. Since you're familiar with C that might be a good language to focus on in the course. Another factor is if you have to work with any existing libraries it might limit your choices. I program in C, FORTRAN, and VB and find that for computationally intensive programs C is usually the best fit, sometimes FORTRAN, and never VB.

No Matlab. Not portable, not open, and it perpetuates a vendor lock-in for quantitative scientists/engineers every bit as bad and destructive as the stranglehold Windows has enjoyed on the desktop for decades.

I think you're over-stating things a touch. Some of the core stuff is closed source, but most of the functions are open, meaning that they are readable .m scripts, e.g. if you're worried about how MATLAB implements ANOVA then you read the file and check. You can modify if needed. So MATLAB is open enough in most normal usage scenarios. You're not really locked in, given that we have Octave.

Python is more readable, more enjoyable to code, has equivalent IDEs available (Spyder), far more user-friendly features, you can use your code literally anywhere you go without worrying about a Matlab license, and the SciPy Stack has reached functional feature parity with Matlab (and is evolving well beyond in certain areas).

I like Python and I've spent some time learning it recently and ported some of MATLAB code. Python is not a panacea, h

Personally, I would do it in C unless you have Fortran libraries you want to use, then I'd use Fortran. However, if you have existing VBA code you want to leverage, I'd just use VB.Net, import the core parts of the code and run with it. There's a moderately steep learning curve going from VB6 or VBA to VB.Net; but, it'll be much less effort than learning a new language.

If you are working in academia, then you probably have access to Matlab. Matlab, as a language, has both scripting abilities and programming abilities. The scripting was born from Matlab's roots in Unix, which makes it handy for batch processing lots of files. Its programming functions started off as C, but have since incorporated features from C++, Python, and Java. The programming side of it has, in my opinion, more structure and formalism than Python, but makes certain things like file IO and data visualization (i.e., graphing) easier than straight-up C/C++. The basics of using it can be picked up in an afternoon, and the sky's the limit from there. There are lots of well-written and documented functions built in; specialized toolboxes can be had for additional fees. There's a fair bit of user-generated code out there. Plus, I expect you can find a lot of people around you who know plenty about it.

I worked as a sysadmin for a high energy physics group at the Beckman Center. Day and night, it was Fortran, on big whopping clusters, doing Monte Carlo simulations.

Though it ~was~ many years ago.

Elsewhere, I worked for a company doing datamining on massive datasets, over a terabyte of data back in 2000, per customer, with multiple customers and daily runs on 1-5 gig subsets. We used C + big math/vector/matrix libs for the processing because nothing else could come close, and Perl or Java for the data mana

Don't use a programming language. Use a tool like Matlab or Mathematica instead. These tools are well designed for scientific computing and have sufficient scripting built in to support the programming-language-like functionality you're probably looking for.

You won't be able to call yourself a programmer. But you're not a programmer, you're a scientist.

I run lots of statistical analyses. Most of the code is in R with some wrappers in Perl and some specific libraries in C. The R and Perl code is pretty much all my own. The C is almost entirely open source software with very minor changes to specify different libraries (I'm experimenting with some GPU computing code from NVidia). Most of the people who are doing similar things are using Python with R (or more specifically, the people I know who are doing the same thing are using Python/R).

An average run with a given data set takes approximately 20 minutes to complete on an 8-core AMD 8160. About 80% of the run is multi-threaded and all cores are pegged. The last bit is constrained mainly by network and disk speed.

You may consider using something like Java/Hadoop depending on your data and compute requirements. Though my Java code is just a step above the level of a grunting walrus, I've found that the performance is actually not that bad and can be pretty good in some cases.

Not only is C# easy to learn, and easy to both read and write, it also runs at a fairly high speed when compiled. To make use of multiple CPU cores, C# has a neat feature named Parallel.For. If your algorithm scans across a 2D data array using a for loop, Parallel.For will automatically partition the loop iterations across your CPU cores, with each chunk computed by a different core, resulting in a much faster overall computation. I develop algorithms in C# and highly recommend it.
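The same split-the-loop-across-cores pattern exists in Python (the language most of this thread recommends), via the standard library's multiprocessing.Pool. A minimal sketch — the per-row function and the data here are invented for illustration, not part of anyone's actual code:

```python
from multiprocessing import Pool

def row_sum_of_squares(row):
    # Stand-in for whatever per-row computation the real algorithm does.
    return sum(x * x for x in row)

if __name__ == "__main__":
    # Each row is independent, so rows can be farmed out to worker
    # processes in parallel, one chunk per core.
    data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    with Pool(processes=4) as pool:
        results = pool.map(row_sum_of_squares, data)
    print(results)  # [14, 77, 194]
```

Note that unlike Parallel.For's threads, Pool uses separate processes, which sidesteps Python's global interpreter lock at the cost of pickling the data between workers.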

In order to realize all possible performance from your hardware, I would suggest Linux over XP.
With Xeons going 64-bit around 2004, it would have to be really old to be 32-bit only.
And even if it is an ancient 32-bit-only Xeon, XP is still going to have issues using more than about 3.5 GB of RAM.
XP process management seems weak to me compared to the Linux side of things.

I don't have a favorite brand of Linux to recommend; I would ask your professors and fellow researchers if they have a preference.

It sounds like you have control of the whole machine, which makes you the sysadmin. You don't only get to choose the programming language; you have to design a workflow. The programming language will fall out of designing your plan of attack. You have to do so within the limitations of your advisor's budget, the assistance you can beg, and so on. Take comfort in the fact that procedural languages are, deep down, 98% the same with different words for things; it is the libraries that get confusing.

The problem with this question is that "scientific computing" is an over-broad term. The truth is that certain languages have found specific niches in different aspects of scientific computing. Bioinformatics, for example, tends to involve R, Python, Java, and Perl (the prominence of each depends largely on the application). Big-data analytics typically involves Java or languages built on Java (Scala, Groovy). Real-time data processing is generally done in Matlab. Pharmacokinetics, some physics, and some computational chemistry are often done in FORTRAN. Instrumentation is generally controlled using C, C++, or VB.NET. Visualization is done in R, D3 (JavaScript), or Matlab. Validated clinical biostatistics are all done in SAS (!).

Python is a nice, simple-to-learn start, very powerful, and the NumPy package is important to learn for scientific computing. R is the language of choice for many types of statistical and numerical analysis. Those are a good place to start, if incomplete. From there, I'd look at the specific fields of interest and at what the common applications and code base are for those.
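To make the NumPy point concrete: the biggest single win when porting loop-heavy VBA-style code is usually replacing element-by-element loops with vectorized array expressions. A hedged sketch — the computation here is invented purely to show the shape of the rewrite:

```python
import numpy as np

# Loop version, in the style of code ported line-by-line from VBA/BASIC.
def sum_squared_diff_loop(a, b):
    total = 0.0
    for i in range(len(a)):
        total += (a[i] - b[i]) ** 2
    return total

# Vectorized version: one NumPy expression, no Python-level loop.
# NumPy pushes the arithmetic down into compiled C code.
def sum_squared_diff_numpy(a, b):
    return float(np.sum((a - b) ** 2))

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 0.0, 1.0])
print(sum_squared_diff_loop(a, b))   # 8.0
print(sum_squared_diff_numpy(a, b))  # 8.0
```

On arrays of realistic size the vectorized form is typically one to two orders of magnitude faster than the interpreted loop, which is often enough to make "Python is slow" a non-issue.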

With regard to the OS, that's pretty easy: Linux (though OS X is a reasonable substitute). Nearly all scientific computing is done in a UNIX-like environment.

I suspect that VB is NOT your problem here. But, if you have a VB program that is too slow, then I'm going to suggest you do the following:

1. Profile your program and see if you can figure out what's taking up all the processing time. It may be possible to change the program you already have slightly and get the performance you need. It would be a shame to go through all the trouble of learning a new language and recoding the whole thing if replacing some portion of your code will fix it. Do you have a geometric solution implemented when a non-geometric solution exists?

2. Consider adding hardware - It's almost ALWAYS cheaper to throw hardware at it than to re-implement something in a language you are learning.

3. Rewrite your program in VB - This time, look for ways to make it perform faster (you did profile it, right? You know what is taking all the time, right?). Can you multi-thread it, or adjust your data structures to something more efficient?

4. Throw hardware at it - I cannot stress this enough: it's almost ALWAYS easier to throw hardware at it, unless you really have a problem with geometric growth in required processing and you are just trying to run bigger data sets.

5. If 1-4 don't fix it, then I'm guessing you are in serious trouble. If you really do not have a geometric problem, you *MIGHT* be able to learn C/C++ well enough to get an acceptable result if you re-implement your program. C/C++ will run circles around VB when properly implemented, but it can be a challenge to use C/C++ if your data structures are complex.

6. Throw hardware at it - seriously.

Unless you really just have a poorly written VB program, or you are really running some geometric algorithm against larger data sets (in which case you are going to be stuck waiting no matter what you do), getting better hardware may be your only viable option. I would NOT recommend picking up a new language over VB just for performance unless it is simply your only option. If you do decide to switch, use C/C++, but I would consider that a very high-risk approach and the very last resort.
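Step 1 above (profile first) carries over to whatever language the port ends up in. In Python, for instance, the standard library's cProfile and pstats modules do this out of the box — a minimal sketch, where slow_part and analysis are made-up placeholders for the real routines:

```python
import cProfile
import io
import pstats

def slow_part(n):
    # Placeholder for the routine suspected of eating the runtime.
    return sum(i * i for i in range(n))

def analysis():
    # Placeholder for the overall data-mining run.
    return [slow_part(10_000) for _ in range(50)]

profiler = cProfile.Profile()
profiler.enable()
analysis()
profiler.disable()

# Report the five entries with the highest cumulative time, so the
# hot spot is obvious before any rewriting starts.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

If the report shows 95% of the time in one function, that function is the only thing worth rewriting (or throwing hardware at).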

You know C. C is simple, as fast as any alternative, straightforward to optimize (aside from pointer abuse), and you always know what the compiler/runtime is doing. And parallel frameworks like pthreads or CUDA are best served by C/C++. Why use anything else?

Another thought: scientific libraries. If you need external algorithms, then your chosen language should support the libraries you need. C/C++ are well served by many fast machine-learning libraries such as FANN, LIBSVM, and OpenCV, not to mention CBLAS, LINPACK, etc.

Before you C++ kids want to tell me something, read up on Mr. Kuck and his optimizers. Fortran optimizers did things about 20 years ago that C++ optimizers still cannot do.

Such as? (Please understand that I'd opt for Fortran instead of C++ for numerics any day of the week myself. But I think this is mostly a fallacy nowadays - I'm pretty sure the Intel stuff shares a major part between the two compilers.)