Friday Philosophy – Software being Good is not Good Enough
February 5, 2010

In a previous Friday Philosophy on In Case of Emergency I mentioned that something being simply a “Good Idea” with technology is not good enough. Even being a GREAT idea is not enough. It also has to be:

1. Easy and simple to use. After all, using that funny stick thing behind your steering wheel in a car, to indicate which direction you are turning, seems to be too much of an effort for many people. If installing some bit of software or running a web page is more than a little bit of effort, most people will not bother.

2. Quick. No one has patience anymore, or spare time. This will probably be the fourth year in a row I do not plant any vegetables in our garden as I need to spend a day or two clearing and digging over the spot for said veg. You can’t beat home-grown veg. Similarly, I won’t use a web page that takes as long to load as it does to plant a carrot seed.

3. Known about. There could be a really fantastic little program Out There that allows you to take a screen shot, add a title and comment and pastes it straight into a document for you, converting to half a dozen common formats on the fly. But I do not know about it. {ScreenHunter is pretty good, I have to say, and when I have shown it to people a lot of them like it}.

4. Popular. This is not actually the same as “known about”. For a stand-alone application to be good for you, you just need to know that it exists. Like maybe a free building architecture package. Whether thousands of people use it is moot; so long as you can get your extension drawings done in it with ease, that makes it great. But something that relies on the community, like a service to rate local eateries, unless lots of people use it and add ratings, well who cares. There are dozens (if not hundreds) of such community “good ideas” started every day but unless enough people start to use one, it will fizzle out, as the vast majority of them do.

Point 4 is highly relevant to “In Case Of Emergency” as it is simple, quick and relatively well known about. It just needs to be ubiquitous.

I became very aware of point 3 a few years ago and also of the ability for very clever people to be sometimes very stupid when it comes to dealing with their fellow humans.

I was working on a database holding vast quantities of DNA information. If you don’t know, DNA information is basically represented by huge long strings of A, C, T and G. So something like AACTCGTAGGTACGGGTAGGGGTAGAGTTTGAGATTGACTGAGAGGGGGAAAAATGTGTAGTGA…etc, etc, etc. These strings are hundreds, thousands, hundreds of thousands of letters long. And Scientists like to search against these strings. Of which there are millions and millions. Not for an exact match mind, but kind-of-similar, fuzzy matches, where for example 95% of the ACTGs match but some do not. It’s called a BLAST match.

Anyway, suffice to say, it takes a lot of compute power to do this and a fair amount of time to run. There was a service in America which would allow you to submit a BLAST query and get the answer in 20 minutes or so {I have no idea how fast it is now}.

Some extremely clever chaps I had the pleasure of working with came up with a faster solution. Same search, under 5 seconds. Now that is GREAT. We put together the relevant hardware and software and started the service. Now I thought it went beyond Good or even Great. It was Amazing (and I mean it, I was amazed we could do a fuzzy search against a billion such strings in 2 or 3 seconds using a dozen or so PC-type servers).

No one used it. This was because almost no one knew about it and there was already this slow service people were used to using. People who used the old service never really thought to look for a new one and the chances were they would not have found ours anyway.

I pushed for more to be made of this new, faster service, that it should be advertised to the community, that it should be “sold” to people (it was free to use, by “sold” I mean an attempt made to persuade the scientific community it was worth their while investigating). The response I was given?

“If the service is worth using, people will come and use it”.

No they won’t. And indeed they didn’t. It was, I felt, a stupid position to take by an incredibly intelligent person. How were people to know it existed? Were they supposed to just wake up one morning knowing a better solution was out there? The internet pixies would come along in the night and whisper about it in your ear? In the unlikely event of someone who would be interested in it just coming across it, were they then going to swap to using it? After all no one else seemed to know about it and it was 2 orders of magnitude faster, suspiciously fast, how could it be any good?

The service got shut down as it was just humming in the corner consuming electricity. No one knew it existed, no one found it, no one came. I can’t help but wonder how much it could have helped the scientific community.

There must be thousands of other “failed” systems across the IT world that never took off just because the people who could use it never knew it existed. Depressing huh?

Thanks for these thought-provoking comments. Is
the BLAST alternative the FASTA algorithm, or
did you have something else in mind?

I found a comment in Dr. Xuhua Xia’s
“Bioinformatics and the Cell” mentioning that
FASTA is more frequently used in European data
centers, while BLAST is used in America. Perhaps
the root cause of this is that Americans might
start by following the nih.gov or nist.gov
pointers, so searches lead you to the U.S.
government’s free BLAST search engine.

I notice at the top of your page that you’re
“Yet Another Oracle Blog”… Apropos of Oracle,
I was talking with a researcher who works for
Oracle, and who I inferred helped design
Oracle’s BLAST implementation. Notice that
Oracle doesn’t have a native package for faster
search methodologies, only BLAST.

I’m sure it took you longer to write this page
than to plant a carrot seed, but I’m sure more
people will benefit!

Thanks for that.
Hmmm, BLAST or FASTA search? I should know about all this but it is now 3 or 4 years ago and I was much more on the database side of things than the proper science, more is the pity. Let me cast my mind back.
FASTA as far as I was aware was the format of the data. So our strings of DNA were also known as FASTA files. There were the 4 letters ACTG that represent Adenine, Cytosine, Thymine and Guanine and then various other letters which represented things like {Adenine or Thymine} or {Guanine or Cytosine} where there was some doubt. Oh, and “-” represented some more unknown characters.
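As a small illustration of those “either/or” letters: in the standard IUPAC nucleotide codes, W stands for A or T, S for G or C, and N for any base. The Python sketch below checks a sequence against a pattern written with such codes. The `IUPAC` table here is deliberately partial and the function name `matches` is made up for the example.

```python
# Partial table of IUPAC nucleotide codes (only a few ambiguity letters shown).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT",    # A or T
    "S": "CG",    # C or G
    "N": "ACGT",  # any base
}

def matches(pattern, sequence):
    """True if every base in sequence is allowed by the code at that position."""
    return len(pattern) == len(sequence) and all(
        base in IUPAC[code] for code, base in zip(pattern, sequence)
    )

print(matches("AWSN", "ATGC"))  # W allows T, S allows G, N allows C
print(matches("AWSN", "AGGC"))  # W does not allow G
```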
Similarly proteins are made up of amino acids of which there are 20 main amino acids and each one can be represented by a letter of the alphabet, so you can represent a protein string by a list of characters.
BLAST is the standard DNA (and also protein) searching/matching algorithm, which I think {think} works against FASTA data. And is what Oracle built into the database for version 10. Yes, you can BLAST search database strings natively within the Oracle database. It was very good of them to do that and I can see potential for it, if it is part of what you want to do with your analysis of the data in your database, but most people who need to do BLAST searching need to do lots and lots of it, so have built some sort of facility to do it.
The terribly cunning system I worked on used SSAHA. You can check out SSAHA {more accurately, SSAHA2} at http://www.sanger.ac.uk/resources/software/ssaha2/. That in turn links to a paper on the subject written by Zemin Ning, Tony Cox and Jim Mullikin, the clever chaps behind it.
SSAHA uses hash tables to do its search. You basically choose a string length, say 7 characters, and make up hash tables of all the possible permutations – AAAAAAA, AAAAAAC, AAAAAAG, AAAAAAT, AAAAACA… to TTTTTTT. Against each entry you then record where in your database of DNA strings you saw each occurrence of, e.g., AAAAAAG. Spread the hash tables across a few servers in memory (as it is cheaper to have 12 servers with 4 GB memory than one with 48GB of memory) and you scan the servers in parallel with fiendish C code.
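To make the idea concrete, here is a minimal Python sketch of that kind of hash index. It is emphatically not SSAHA's actual code (that was C, spread over a dozen servers); it is a single-machine toy that records every overlapping 7-letter word, which is a simplification. The names `build_index` and `find_candidates` and the two tiny sequences are invented for the illustration.

```python
from collections import Counter, defaultdict

def build_index(sequences, k=7):
    """Map each k-letter word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index

def find_candidates(index, query, k=7):
    """Count, per database sequence, how many k-letter words of the query hit it."""
    hits = Counter()
    for offset in range(len(query) - k + 1):
        for seq_id, _ in index.get(query[offset:offset + k], ()):
            hits[seq_id] += 1
    return hits

# Toy "database" of two short DNA strings (made up for this example).
database = {
    "seq1": "AACTCGTAGGTACGGGTAGGGG",
    "seq2": "TTTGAGATTGACTGAGAGGGGG",
}
index = build_index(database)

# The query matches part of seq1 except for one changed letter.
hits = find_candidates(index, "CGTAGGTACGGCTAGG")
print(hits.most_common())
```

Even with the one mismatched letter, the query still shares five 7-letter words with seq1 and none with seq2, so seq1 floats to the top as the candidate worth a closer look. That is the gist of why the lookup is fast: each word is a single hash probe rather than a scan of the database.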
Thanks for your comments and I hope you sometimes find useful stuff here.

Thanks for the pointer to SSAHA2. I take it that’s the
much faster approach to which you refer above. You’ve
left me curious!

You’re right that FASTA is the name of a file format. It’s
raw and unannotated data, so can easily be loaded into an
Oracle CLOB.

FASTA is also the name of an algorithm, which grew out of
FASTP. The FASTA algorithm extends FASTP to a larger
group of problems such as Expressed-Protein-to-DNA. That
problem is algorithmically much harder but is important to
some researchers (for example, if the researcher is
searching for convergent evolution that produced like
proteins; there’s no reason that the DNA in the two
animals would look alike). Like BLAST, it’s much faster
than what preceded it, but has a finite possibility of
missing items or prioritizing them incorrectly (like those
algorithms that say a number is “probably prime”)

Perhaps SSAHA2 handles protein-to-DNA as well. If an
algorithm does DNA-to-DNA only, the problem gets a lot
cheaper! So that may be part of the explanation of the
popularity of the BLAST and FASTA implementations. But if
DNA:DNA is all you need, go with what’s fast!

Anyway, back to your main point, I’m sure there are a lot of
great and free tools out there that missed out even on
word-of-mouth advertising. I wonder if the growth of
social networking helps address this…

You know more about Bioinformatics than you let on, or else you have been checking it out :-). Bioinformatics and Genetics are absolutely fascinating topics. I did a degree in Genetics way back in the 80’s, which is why I went and worked at the Wellcome Trust Sanger Institute; it was like going back to my “roots”, though I never got back into the science in the way I would have liked to. The Day Job of databases took up more than the day as it was.

Yes, there are various forms of the FASTA searches (I refreshed my memory on it), allowing you to search for polypeptides from DNA and DNA from polypeptides. The latter is not a simple search, as most amino acids can be coded for by several DNA triplets (codons). E.g. the DNA triplet GCT codes for the amino acid Alanine, but Alanine can be coded for by not just GCT but also GCG, GCC and GCA. See Wikipedia on DNA coding for details. Thus converting DNA to amino acids is much simpler than the other way around.
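A few lines of Python show why one direction is easy and the other is not. The codon table below is deliberately partial (just Alanine's four codons plus the single codons for Methionine and Tryptophan), and the function names are made up for the example.

```python
# Partial codon table: only the entries needed for this illustration.
CODON_TO_AMINO = {
    "GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "ATG": "Met",
    "TGG": "Trp",
}

def translate(dna):
    """DNA to amino acids: one table lookup per three-letter codon."""
    return [CODON_TO_AMINO[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3)]

def codons_for(amino):
    """All codons in the table that code for the given amino acid."""
    return sorted(c for c, a in CODON_TO_AMINO.items() if a == amino)

def count_dna_candidates(protein):
    """How many different DNA strings would translate to this protein."""
    total = 1
    for amino in protein:
        total *= len(codons_for(amino))
    return total

protein = translate("ATGGCTGCG")
print(protein)                        # forward direction: exactly one answer
print(codons_for("Ala"))              # backward: four choices for one Alanine
print(count_dna_candidates(protein))  # choices multiply along the protein
```

Going forward, every codon gives exactly one amino acid. Going backward, each Alanine alone offers four possibilities, and the possibilities multiply along the length of the protein, so a protein-to-DNA search has far more candidates to consider.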

SSAHA was indeed the core of the much, much faster search system. I could see it being of use in general searches but it is very well suited to working with a restricted number of “letters” as it has to have these hash tables of all combinations. For 4 letters a hash table of all combinations 7 characters long is 4 to the power 7 (4*4*4*4*4*4*4), which is 16384 combinations. With 26 letters in normal English and the common punctuation characters you are probably looking at 40 characters at least. So to have a hash table of 7 characters you are looking at 40 to the power 7, which is 163840000000 combinations (I think).
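The sums are easy to check with a couple of lines of Python. The alphabet size of 40 is the rough guess from the paragraph above, not a measured figure, and `table_size` is just a name picked for the example.

```python
def table_size(alphabet_size, word_length):
    """Number of distinct words of the given length over the alphabet."""
    return alphabet_size ** word_length

dna_entries = table_size(4, 7)    # A, C, G, T
text_entries = table_size(40, 7)  # assumed ~40 letters and punctuation marks

print(dna_entries)
print(text_entries)
print(text_entries // dna_entries)  # how many times bigger the text table is
```

So the same trick on ordinary text would need a table ten million times larger than the DNA one for the same word length.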

Last I knew, SSAHA2 allowed only for DNA matching but the person to check with is Zemin Ning at the Sanger Institute. I think he and his team were considering expanding the algorithm to cope with DNA codon to amino acid mapping but Zemin always had a lot of things that needed his attention. If you want to follow up on this, let me know and I can put you in touch with Zemin. If you want to see something about the TraceSearch system that was set up, see this press release. It really was brilliant. I’m not mentioned as it was not my work, but the trillion base Trace Archive database it was built from, that was my baby. It was all in Oracle and was a 25TB database. Last I knew it was up around the 100TB limit.