I'm in Computational Biology, and I'd say that the most valuable skills you should learn (and the ones most often seen in this field) are more mathematical and/or statistical than "big data." Understanding how to properly normalize your data or calculate a p-value will take you much further than being able to spin up a 100-node Hadoop instance in most labs.

I think you should spend the first year on your home PC. Download RStudio and work through a few R tutorials, then find some data and questions that interest you and poke around. Post your results to a blog so you'll have something to show for the time you spend, and release the code on GitHub so it's open to future employers.

I'd say get comfortable with a data analysis language (R will probably serve you best currently) and a data manipulation language (Python, Perl, etc.), and start asking questions of the data that's around you (your email archives, a log of the sites you visit, your spending records, etc.). Once you've found a dataset that well-designed algorithms on a single machine can't handle, then look at Hadoop and other "big data" projects.
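To make that concrete, here's a minimal sketch of the kind of question-asking described above, in Python with made-up spending records (the data, categories, and amounts are purely illustrative):

```python
from collections import defaultdict

# Hypothetical spending records: (month, category, amount) --
# stand-ins for whatever personal data you have lying around.
records = [
    ("2024-01", "groceries", 212.50),
    ("2024-01", "transport", 64.00),
    ("2024-02", "groceries", 198.75),
    ("2024-02", "transport", 71.20),
]

# Total spend per category -- the sort of simple aggregate question
# worth asking long before reaching for any "big data" tooling.
totals = defaultdict(float)
for month, category, amount in records:
    totals[category] += amount

for category, total in sorted(totals.items()):
    print(f"{category}: {total:.2f}")
```

Nothing here needs more than the standard library; the point is to build the habit of framing a question, computing an answer, and sanity-checking it.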

When you're ready, I'd steer you towards next-generation sequencing (NGS) data. Most of the bioinformatics questions being asked (and funded) now have at least some interaction with NGS, and analysts capable of working with that data are highly valuable. Check out the 1000 Genomes Project when you're ready to start playing with free sequencing data.

I think most people pick these skills up at work -- seeing these practices employed by others. I guess the short answer would be: go work for a company which is already employing people who understand these topics and learn from them for a couple of years.

Of course, that doesn't help you if you're not looking to change careers...

An anonymous reader writes: I'm a fairly recent CS graduate with a few years of work experience in academia under my belt. I find myself in charge of a team of three developers working on academic software. While I have a good understanding of the theoretical underpinnings of CS (algorithm complexity, etc.), I'm completely out of touch with recent trends in software development (Agile, continuous integration, automated testing, persistence frameworks, dependency injection, and the rest). How should a programmer go about picking up the recent developments in this field? Is there any training or book you could recommend that covers many of these topics? I'm feeling a bit overwhelmed by the prospect of taking a week-long course in each of these topics (if that were even an option!)

An anonymous reader writes: I am a mid-career IT professional in the middle of a transition from IT to a domain within the biological sciences. My planned academic route to the new domain will take at least 3-5 years to finish. In the interim, I want to work in (and earn from) the IT domain of Big Data/Data Science, since that is more aligned with the skills I'll need in my target domain: data analysis, visualization, signal processing, imaging, simulation, etc. The problem is that, apart from early-career stints, I have very little and only surface-level experience with these topics. So I want to ask Slashdot for suggestions on the tasks I've set myself to accomplish this transition. Specifically:

What are the foundational topics I need to learn? Which parts of math, statistics, machine learning, text analysis, scientific programming...?

What books to read?

What courses (preferably open/online) to take?

I want to set up an online portfolio of big-data projects that I work on, to showcase the skills I acquire in this domain. What are some of the more challenging, topical, and novel application areas and open problems to showcase in a portfolio, such that it is distinctive and interesting? E.g., consumer behavior, neuro-/bio-informatics, socio-economic trends...

How do I find sources of open/non-proprietary data sets to use for my portfolio projects?

What hosting resources do I need to set up a portfolio of big-data projects? Any suggestions on specific hosting providers?

Interesting. I view this from a completely different perspective: if DNA sequencing really is outpacing Moore's Law, that just means that the results become disposable. You use them for your initial analysis and store whatever summarized results you want from this sequence, then delete the original data.

If you need the raw data again, you can just resequence the sample.

The only problem with this approach, of course, is that samples are consumable; eventually there wouldn't be any more material left to sequence. So this wouldn't be appropriate in every situation.

I assume you're talking about incoming data, not the final DNA sequence. As I understand it, the final result is 2 bits per base pair and about 3 billion base pairs, so about a CD's worth of data per human. And if you were talking about a genetic database, I guess 99%+ is common, so you could just store a "reference human" and diffs against that. So at 750 MB for the first person and 7.5 MB for each additional person, I guess you could store 200,000-300,000 full genetic profiles on a 2 TB disk -- and the whole human race in a few dozen petabytes.
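The arithmetic above is easy to check; here's a quick back-of-the-envelope script using the parent's own figures (3 billion base pairs, 2 bits each, ~1% variation against a reference):

```python
base_pairs = 3_000_000_000   # ~3 billion bp in a human genome
bits_per_bp = 2              # A/C/G/T fits in 2 bits

full_genome_mb = base_pairs * bits_per_bp / 8 / 1_000_000
print(full_genome_mb)        # 750.0 -- matches the CD-sized estimate

# If 99%+ is shared, storing diffs against a reference costs ~1%:
diff_mb = full_genome_mb * 0.01
print(diff_mb)               # 7.5 MB per additional person

# People that fit on a 2 TB disk: one full genome, the rest as diffs.
disk_mb = 2_000_000
people = 1 + int((disk_mb - full_genome_mb) / diff_mb)
print(people)                # 266567 -- within the 200,000-300,000 guess
```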

The incoming data is image-based, so yes, it will be huge. Regarding the sequence data: yes; in its most condensed format it could be stored in 750MB. There are a couple of issues that you're overlooking, however:
1. The reads aren't uniform quality -- and methods of analysis that don't consider the quality score of a read are quickly being viewed as antiquated. So each two-bit "call" also carries a few more bits representing the confidence in that call.
2. This technology is based on redundant reads. To get to an acceptable level of quality, you want at least ~20 (+/- 10) reads at each exonic locus.
So that 750MB you mention for a human genome grows by a factor of 20, then by another factor of 2 or 3, depending on how you store the quality scores.
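To put a number on that growth, here's a rough sketch using the figures above (the 2.5x quality-score overhead is a midpoint assumption between the "factor of 2 or 3" mentioned):

```python
condensed_mb = 750        # 2 bits/bp for one condensed human genome
coverage = 20             # ~20 redundant reads per locus
quality_overhead = 2.5    # assumed midpoint of the 2-3x for quality scores

raw_mb = condensed_mb * coverage * quality_overhead
print(raw_mb / 1000)      # 37.5 -- i.e. ~37.5 GB per genome, not 750 MB
```

So the condensed 750 MB figure understates the working data by roughly fifty-fold, before you even count the image-based incoming data.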

Your suggestion of deduplicating the experiments could work, but definitely not as well as you'd think, because of all the "noise" inherent in the two steps above.

If you really just wanted the unique portions of a sample, you could use a SNP array, which reads the sample only at specific locations known to differ between individuals. Even with the advances in the technology, though, the cost of sequencing a genome still isn't negligible; for most labs, it's still cheaper to store the original data for reanalysis later.

I often stumble across some product on Wikipedia that I'm interested in buying (album, book, etc.). I actually would find it very convenient if such pages had a "Purchase this Item" link.
I'm sure Amazon would kick in a few million for that privilege, or you could use their pre-existing referral program. I think most users would view those links as added value to Wikipedia.

And in light of this, why are we assuming GPS? I can't find GPS satellites through the metal in my car roof, let alone through my entire car. Isn't it more likely that they're just tracking the cellular connections?

I think there is software for that, though it may not be out in public -- maybe the developers are just keeping it private. Nevertheless, there's a high probability that the software you need already exists. What's more important is your ETA, so you may want to look for other means in the meantime.

jda104 writes: I work with a group of about a dozen "data analysts," most of whom have some informal programming experience. We currently have an FTP server set up for file/code sharing but, as the projects get more complicated, the number of outdated versions of code and data floating around among group members has become problematic; we're looking for a more robust solution to manage our files.

I see this as a great opportunity to introduce a revision control system, though there will surely be a bit of a learning curve for non-programmers. I've primarily worked with Subversion (+TortoiseSVN), but I would rather not spend my time manually resolving file conflicts and locking issues for each user, and anything beyond commit, update, and revert (such as branching, merging, etc.) would probably go unused.

We're definitely not "software developers," but we write many Perl and R scripts to process datasets that can be many dozens of GBs. The group's personal machines are evenly split between Windows and Macs and our servers are all Linux, currently.

Is there a revision control system that "just works" — even for non-programmers? Or should we just head in a different direction (network share, rsync, etc.)?
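For what it's worth, the commit/revert loop the poster cares about is small under Git. Here's a minimal sketch (assuming Git were the system chosen; the repo location, filename, and identity are all made up for illustration):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git config user.email "analyst@example.com"
git config user.name "Analyst"

echo "v1" > clean_data.R
git add clean_data.R
git commit -q -m "Initial cleaning script"   # the everyday "commit"

echo "broken edit" > clean_data.R            # a bad local change...
git checkout -- clean_data.R                 # ...reverted in one step
cat clean_data.R                             # prints: v1
```

That three-command vocabulary (add, commit, checkout) is roughly the Git equivalent of the commit/update/revert subset mentioned above; GUI front ends can hide even that. Note that Git is not great with many-dozens-of-GB data files, so it may fit the scripts better than the datasets.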

A big-ass Oracle or IBM DB2 instance can do the job if you pay enough for tuning.

Why is it that, ever since key-value DBs came into vogue, relational databases instantly got perceived as so neanderthal?

A normal-ass Oracle database would surely be just fine for storing a no-fly list which, by necessity, has orders of magnitude fewer entries than the world's 6.whatever billion names; I'm guessing it would do so without much tuning, too.