A real techie post today. I’ve just spent most of an afternoon trying to track down a complete list of steps for changing your default data source in JBoss 4.0.4.GA from Hypersonic to MySQL. I’ve had to visit at least a couple dozen forum posts, but have only found the 5 linked in the steps below to be useful in answering my question.

However, none of those 5 had the entire answer, which leads me to the reason for this post. I hope it can help other people out!

(From OnJava.com): With your MySQL driver jarfile downloaded and added to your classpath, copy it to the [jboss-location]/server/default/lib directory.

(From OnJava.com and EPFL): Copy [jboss-location]/docs/examples/jca/mysql-ds.xml to the [jboss-location]/server/default/deploy directory. Modify the <local-tx-datasource> element inside the mysql-ds.xml configuration file to match the following (note there are some parts below which you will have to change!):
<local-tx-datasource>
<jndi-name>DefaultDS</jndi-name>
<connection-url>jdbc:mysql://your-hostname:your-port/your-database-name</connection-url>
<driver-class>com.mysql.jdbc.Driver</driver-class>
<user-name>your-user</user-name>
<password>your-pass</password>
<security-domain>MySqlDbRealm</security-domain>
<exception-sorter-class-name>org.jboss.resource.adapter.jdbc.vendor.MySQLExceptionSorter</exception-sorter-class-name>
<!-- should only be used on drivers after 3.22.1 with "ping" support
<valid-connection-checker-class-name>org.jboss.resource.adapter.jdbc.vendor.MySQLValidConnectionChecker</valid-connection-checker-class-name>
-->
<!-- sql to call when connection is created
<new-connection-sql>some arbitrary sql</new-connection-sql>
-->
<!-- sql to call on an existing pooled connection when it is obtained from pool - MySQLValidConnectionChecker is preferred for newer drivers
<check-valid-connection-sql>some arbitrary sql</check-valid-connection-sql>
-->
<!-- corresponding type-mapping in the standardjbosscmp-jdbc.xml (optional) -->
<metadata>
<type-mapping>mySQL</type-mapping>
</metadata>
</local-tx-datasource>

Next we have to change and replace some files in [jboss-location]/server/default/conf:

(From OnJava.com, Experts’ Exchange and EPFL): In standardjaws.xml change the following two elements, and leave the rest in place:
<datasource>java:/DefaultDS</datasource>
<type-mapping>mySQL</type-mapping>

New Scientist released an article last week summarizing the work of Finlayson et al. in Nature. Their work shows that “the Neanderthals survived in isolated refuges well after the arrival of modern humans in Europe.” Gorham’s Cave, Gibraltar, was systematically – and deeply – excavated by the authors between 1999 and 2005 over an area of 29 square meters. There was a low population density for both Neanderthals and humans during the time that they both lived in the area, and “the late survival of Neanderthals and the arrival of modern humans was a mosaic process in which pioneer groups of moderns and remnant groups of Neanderthals together occupied a highly heterogeneous region for several thousand years”. Up until this paper, the survival of Neanderthals past 35,000 years ago had not been proven. However, this new data proves that Neanderthals used Gorham’s Cave until 28,000 years ago. As modern humans began moving into Europe around 32,000 years ago, this makes an overlap of at least 4,000 years.

It doesn’t sound like much, in the larger context of the evolution of the species, but 4,000 years is still – obviously – a long time. One can imagine, even with low population densities, many encounters between groups of “moderns” and remnant groups of Neanderthals. This can lead to trying to imagine what the answer would be to one of the “ultimate” questions: what would it be like to meet another sentient species? Whether via sci-fi or prehistory, it makes for some fantastic daydreams.

The second day of IB2006 was the longest of the three days, and the only “full” day. From my point of view only, the talks were more relevant and interesting to my work. The second evening was also the conference dinner, which was very sociable and the conversations continued straight through the dinner and into late in the night back at the conference hotel. But back to the day itself: there was a fantastic keynote by Pedro Mendes, and a number of other interesting talks. The highlights are presented below.

Systems biology, in Pedro’s view, is the study of a system through synthesis or analysis, using quantitative and/or high-throughput data. The origins of systems biology go back as early as the 1940s, with a large amount of work done in the 1960s-70s. It didn’t really take off during this time due to a lack of computing power and a lack of the experimental “ability” to get the large amounts of data required.

Pedro is interested in the top-down modeling approach because there is a large amount of data, with a lot of numbers, and people naturally want to make models from them. Many people think this isn’t the way to build models, but he believes otherwise. In bottom-up modeling you start with a small number of known reactions and variables, while in top-down modeling you start at a coarse-grained level, with loads of data, and you try to work “backwards” (compared to traditional modeling procedures) to find the steps that will produce the HTP data you started with. In other words, it derives elementary parts from studying the whole.

Colin gave an interesting talk on the availability and usefulness of a web-based stochastic simulator. You can create, access and run SBML models via web services (and a web-page front-end to the web services). Their aim is to make their own models available to other researchers and also to provide a framework for others to build their own models. In general, their models can be envisaged as networks of individual biochemical mechanisms. Each mechanism is represented by a system of chemical equations, quantified by substrate and product concentrations and the associated reaction rates. The connected series of reactions are then simulated in time. Simulation may be stochastic or deterministic depending on species concentration. They have funding for another 6 years and are planning many additions to the tool.
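To make “stochastic simulation of a reaction system” a little more concrete, here is a minimal sketch of the classic Gillespie direct method for a single reversible reaction A <-> B. This is not the simulator from the talk: the State struct, the simulate function, and the rate constants kf and kr are all names I have made up for illustration, and the real tool works on full SBML models rather than a hard-coded reaction.

```cpp
#include <random>
#include <vector>

// One recorded point of a trajectory (hypothetical structure).
struct State {
    double time;
    long a;  // molecules of species A
    long b;  // molecules of species B
};

// Gillespie direct method for A -> B (rate constant kf) and
// B -> A (rate constant kr), run until tEnd or until nothing can react.
std::vector<State> simulate(long a0, long b0, double kf, double kr,
                            double tEnd, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);

    std::vector<State> trajectory;
    trajectory.push_back(State{0.0, a0, b0});

    double t = 0.0;
    long a = a0;
    long b = b0;
    while (true) {
        // Propensity of each reaction given the current molecule counts.
        double forward = kf * static_cast<double>(a);
        double reverse = kr * static_cast<double>(b);
        double total = forward + reverse;
        if (total <= 0.0) break;  // no reaction can fire any more

        // Waiting time to the next event is exponential with rate `total`.
        std::exponential_distribution<double> wait(total);
        t += wait(rng);
        if (t > tEnd) break;

        // Pick which reaction fires, in proportion to its propensity.
        if (uniform(rng) * total < forward) {
            --a; ++b;
        } else {
            ++a; --b;
        }
        trajectory.push_back(State{t, a, b});
    }
    return trajectory;
}
```

Calling simulate(100, 0, 1.0, 0.5, 5.0, 42u) gives one possible trajectory starting from 100 molecules of A. A different seed gives a different one, which is the point of simulating stochastically when molecule counts are low.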

Multi-model inference of network properties from incomplete data (Michael Stumpf)
Estimates for rates of false positives range from 20 to 60%, and in connection with this he recalled a quote he had once read stating that gene expression is as close to scientific fraud as is accepted by the scientific establishment. At least at the moment, it appears to be a trade-off between data quality and data quantity. In other words, you must take noise into account in any analytical work you do.

For most species, you only have interaction data for a subset of the proteome. Missing such data means that you can get quite different networks (currently known versus “actual” network). This affects summary statistics, among many other things. They discovered that, generally, inference for networks comprising less than 80% of the full graph should be treated with caution; above that value, however, the inference model they developed is very useful. Given a subnet, it is possible to predict some properties of the true network if we know the sampling process (independent of the process by which the network has grown). For different data sets, there seems to be a huge difference between experimental labs in how each has mapped parts of the interactome. Overall, however, performing this test on multiple subnets from different PPI experiments is a good way of estimating total interactome size. There are limitations, though: it ignores multiple splice variants and domain architecture, so any organism affected by these will not necessarily have as good a result. By interrogating all these different models and averaging over them, useful estimates of total interactome size are possible. Useful estimates can even be retrieved when using partial data, as long as the number of nodes is at least 1000.
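As a rough illustration of predicting a property of the true network from a subnet, here is a small sketch that scales up the number of interactions seen in a sample. It assumes the simplest possible sampling process: the proteins in each subnet are a uniform random sample of the proteome, so the fraction of node pairs that interact is the same, on average, in the sample and in the whole network. That assumption, the Subnet struct and the function names are mine, not necessarily what was presented in the talk, and any numbers you feed in are up to you.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// A sampled subnetwork: how many proteins were covered and how many
// interactions were observed among them (hypothetical structure).
struct Subnet {
    double nodes;
    double edges;
};

// Scale the observed edge count by the ratio of possible node pairs in
// the full network to possible node pairs in the sample. Only valid
// under the uniform random node sampling assumption stated above.
double estimateTotalEdges(const Subnet& s, double totalNodes) {
    if (s.nodes < 2.0 || totalNodes < s.nodes) {
        throw std::invalid_argument("need 2 <= sampled nodes <= total nodes");
    }
    double sampledPairs = s.nodes * (s.nodes - 1.0);
    double allPairs = totalNodes * (totalNodes - 1.0);
    return s.edges * allPairs / sampledPairs;
}

// Average the per-dataset estimates, as a simple stand-in for combining
// subnets that come from different PPI experiments.
double averageEstimate(const std::vector<Subnet>& subnets, double totalNodes) {
    if (subnets.empty()) {
        throw std::invalid_argument("no subnets given");
    }
    double sum = 0.0;
    for (std::size_t i = 0; i < subnets.size(); ++i) {
        sum += estimateTotalEdges(subnets[i], totalNodes);
    }
    return sum / static_cast<double>(subnets.size());
}
```

The smaller the sample, the more the observed edge count gets multiplied, so any noise in a small subnet is magnified along with it. That fits the advice to treat inference from low-coverage networks with caution.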

Other interesting talks included Stuart Moodie’s discussion of the current state of affairs in standardizing systems biology graphical notation and visualization (SBGN, Kitano and others), Jan Baumbach’s work on performing knowledge ‘transfers’ for transcriptional regulatory networks from a model species to 3 other similar species important to human pathogen studies, Jan Kuentzer’s biological information system using both C++ and Java called BN++, an eye-opening overview of the current status of biocomputing at Singapore’s Biopolis by Gunaretnam Rajagopal, a lovely swooping demo of a targeted projection pursuit tool for gene expression visualization by Joe Faith, and a wonderfully presented (which in my mind equates to “easily understood” because of her skill as a speaker) statistical talk on modeling microarray data and interpreting and communicating biological results by Yvonne Pittelkow. (Yes, a couple of those were from day one, but they still deserved a mention!)

The 3rd Integrative Bioinformatics Workshop began yesterday at Rothamsted Research Centre in Harpenden, England. Rothamsted Research is a lovely campus reminiscent of the Wellcome Trust Genome Campus just south of Cambridge. It has a long history of plant research as well as having been the workplace of the mathematician and statistician Ronald Fisher.

Keynote speech
Day One was a single afternoon session with one keynote and seven 25-minute talks. The keynote was an entertaining overview of current methods in semantic integration, and an update on the status of SRS by Steve Gardner from Biowisdom, the new owners of SRS (Lion Biosciences has folded). His classifications of integration methods are:

rules-based (eg SRS)

data warehousing – his opinion is that this method is best for repetitive analysis, or “same analysis, different data”. As this is not the sort of work that is often done in bioinformatics, he believes this technique is not as useful as the others. However, it is a common method of integration in bioinformatics anyway.

What these methods are not is scalable – it is his opinion that none allow you to understand the meaning of the data. If you don’t understand the concepts or relationships inherent in the data, then it is difficult to do a useful integration.

Semantics is about 1) disambiguation and 2) aggregation. He says that relationships should be more than is_equivalent_of, is_same_as, is_a and is_part_of, and that more descriptive relationships should be used. He posits that current tools don’t have useful search results anymore. When you search PubMed, you are not getting meaningful answers. You get loads of hits, but only take a few out of the “top 10”. He ended with a self-explanatory statement about the benefits of semantic integration: “If you can build on semantic consistency, you can get quite a lot for free.”

It was an interesting talk, but I am not sure I agree with everything he says. For one thing, I believe that data warehousing does have a place in modern bioinformatics: look at ONDEX, for example. However, various discussions during day 2 of this workshop made it clear that his definition of “data warehousing” and mine were actually different. The sort of data warehousing that he described as not useful to bioinformatics is the sort where all data sources are forced into a single, but NOT unified, schema. There is no attempt to actually integrate these sources into a unified schema where the semantically identical terms are stored in the same location. My definition of data warehousing has always been multiple data sources integrated into a unified schema, of which ONDEX is an example. So, with these revised definitions, I don’t see as much controversy.

He is also very positive about OWL Full, the most expressive of the OWL sublanguages. However, OWL Full offers no computational guarantees (reasoning over it is undecidable), which is one of the reasons why OWL-DL is considered a good compromise.

My lack of understanding of his definition of data warehousing leads me to a final point: there are many data integration methods out there, but even more synonyms for data integration methods. It seems many papers create a new term rather than using currently available ones, and some of those which do re-use terms don’t always agree on the definition. Perhaps an ontology of data integration terms is required? 🙂

Last week I got to see, touch, and hear about an Enigma machine. It was a really amazing experience. I meant to write about it then, but a variety of things (including having to decompile a variety of .class files because IntelliJ, which up until now was a picture-perfect IDE, emptied the corresponding .java files) came up. A very good post on the same experience can be found in Dan Swan’s blog, so I won’t duplicate that work here.

Suffice it to say it stuck in my head, especially since I had recently read Cryptonomicon (Neal Stephenson). Simon Singh was recommended as the author of a very good non-fiction history of codebreaking. Another recommended book in this area is Jack Copeland’s Colossus, on the Colossus machine, which was not used for breaking Enigma, but rather for breaking a completely different cipher from the Lorenz SZ 40/42 cipher machine.

p.s. Yes, I had a backup of my java files, but no, they were a couple of days old and therefore it was quicker to decompile using jad.

The entire university was evacuated this morning around 10:20am. What was originally thought to be another fire drill was announced (after around 20 minutes) to actually be a bomb scare. Everyone in the University was moved from the fire-safe positions to (I am guessing) bomb-safe positions at Exhibition Park nearby. Supposedly an anonymous phone call had been made to the University this morning, warning that a bomb would go off around 11:00am. Most of what I’m saying are rumours that went around the people waiting at the park, so it is unclear what was really going on.

It was a surreal situation: first, the migration en masse of a very large number of academics down the street and across a junction to the park. Everyone actually used the crosswalk! I would say that was a very British thing to do, except that there were many people who weren’t British among the group. There was certainly no panic. Then, when we got to the park, the only entertainment (initially at least) was all the young men who were practicing their cycling skills in the skatepark. Later, of course, there were the police, TV crew and firemen to watch, as well as our fellow academics. People-watching at its finest! Believe me, seeing a group of about 20 cleaning ladies dressed in neat blue-and-white checked dresses sitting inside the skatepark, themselves watching the bike tricks, was definitely a memorable moment in time.

Once our half of the university was re-opened, walking back en masse was another interesting experience. There were so many of us that it seemed like we were a particularly oddly-dressed section of the Great North Run. We even went over the police tape that had earlier been barring the road, which made it seem like we were all crossing a finish line. I should have taken a picture with my phone, but I still forget such new-fangled technology is sitting in my pocket.

By 12:15, half of the university was back in their offices: the other buildings, including the medical school, still hadn’t been cleared. Conflicting rumours were passed around while we were at Exhibition Park: some said the caller identified the Medical School as the location of the bomb, others said the new Devonshire building, while still others said no specific building was named, which was why the entire University needed to be cleared. I don’t know whether it was just a prank or something more.

In any case, as long as it remained a threat and nothing concrete, I could think of worse ways to spend more than two hours than in a park chatting with friends.

Update: There is now a news item on the Newcastle University website (Newcastle staff only). Basically, the link says that they received a warning for a bomb threat that they considered serious. I’ll post a link to a public site if one becomes available. By 15:00 all University buildings had been reoccupied.

I couldn’t resist posting on the topic of C++ for-loops, as described by this wonderfully irreverent Reg Developer article. I shall leave you to read it, rather than summarizing it in detail here, but as a quick one-liner, it goes into the argument of incrementing counters versus incrementing iterators, and then finishes by touching on using algorithms over iterators.

As someone who came to C++ via the unintuitive path of Java -> C -> C++ (my coursework was more “advanced”, if I won’t get torn to pieces saying that, than the coding required in my first job), good ol’ i++ was one of my best friends. Admittedly, using other aspects of “pure” C++ was not problematic: I loved lording the use of templates in vectors, lists etc. over the Java folks. No casting in and out of Object for me! Shame Java had to go and sort out their language to allow that: no lording has happened recently.

Back to the subject at hand. I have to agree with the author of the Reg article, and say that about 90% of the time I try to “be good” and use an algorithm instead of for loops, I end up writing my own functors. (Ok ok – so the first few times I wrote a functor because I thought it was fun, not because it was necessary – I still can count that in the percentages, can’t I?)
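For anyone who hasn’t met the three styles side by side, here is a small sketch of my own (not taken from the Reg article) that sums a vector with a counter, with an iterator, and with std::for_each plus a hand-written functor. The function names and the Summer functor are made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Style 1: good ol' i++ with an index counter.
int sumWithCounter(const std::vector<int>& v) {
    int total = 0;
    for (std::size_t i = 0; i < v.size(); i++) {
        total += v[i];
    }
    return total;
}

// Style 2: incrementing an iterator instead of a counter.
int sumWithIterator(const std::vector<int>& v) {
    int total = 0;
    for (std::vector<int>::const_iterator it = v.begin(); it != v.end(); ++it) {
        total += *it;
    }
    return total;
}

// Style 3: an algorithm, which needs a functor to carry the running total.
struct Summer {
    int total;
    Summer() : total(0) {}
    void operator()(int x) { total += x; }
};

int sumWithAlgorithm(const std::vector<int>& v) {
    // std::for_each returns the functor after the last call, so the
    // accumulated total can be read from the returned object.
    return std::for_each(v.begin(), v.end(), Summer()).total;
}
```

All three return the same answer for the same vector. For a plain sum, std::accumulate from <numeric> would do the job without a functor at all, but the moment the loop body does anything less standard you are back to writing your own.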

In short: looping algorithms – more fun than they look, but a guilty pleasure as I still cannot quite justify them for the simpler cases. But then the answer is almost never an “all or nothing” programming choice. (I’m sure there’s a completely “all or nothing” choice out there, enjoying its role as the exception that proves the rule.) As a biologist and a programmer (read “bioinformatician”, which is just too tough to say), I find I like this result, in keeping with the biologist’s perspective: the messy answers are the best ones.