
Tuesday, December 25, 2012

If you are:

someone who wants to deploy virtual servers on a newly acquired multi-core server using RHEL 6 and nothing more than the Linux KVM and RedHat's basic virt-manager application, and/or

you wish to gain an understanding of KVM's virtual networking architecture

then this article/technical walkthrough is for you. Most of these techniques will work on other Linux distributions besides RHEL 6. Admittedly, there are more user-friendly free and commercial tools that allow you to deploy virtual machines. The usual suspects (VMware, RedHat, Oracle, Parallels) provide industrial-strength solutions with intuitive point-and-click interfaces that make the setup of virtual machines an easy task.

However, I like to keep my production server software stack as simple as possible. Those of you who have had to troubleshoot VM performance or other problems and faced the 'ping-pong' between the virtualization and the OS vendors will know what I mean. Thus, I use KVM/qemu and virt-manager to cater for my VM needs. The downside is that these tools are less intuitive for the newcomer, but with a little bit of good documentation and practice, they can be effective. I draw this conclusion after looking around in various technical support threads and after browsing RedHat's documentation on the subject. The threads seem to confuse the various virtual switching modes and techniques, when things could be done more easily with interface bridging. The same can be said for RedHat's Virtualization Administration Guide, which does a fairly good job detailing the Routed, NAT and Isolated virtual networking modes (Chapter 18); however, it fails to mention how bridging could be used for hosting virtual servers. I am going to spend the rest of the article explaining this in detail.

The Theory

Let's be more specific now and explain what I mean when I say I need to deploy a fully networked virtual server. When you use the virt-manager application, it's easy to deploy a network enabled guest OS by means of using Network Address Translation (NAT). In fact, NAT (IP Masquerading, a specific mode of NAT) is the default guest OS virtual networking mode, using the IP address of the physical host server.

Figure 1

The figure above displays the networking data path traversal from the VM guests all the way to the physical network/VLAN when using the default virtual networking mode (NAT). Starting at the bottom of the figure, each guest has been assigned a virtual network interface (vnetx). This is essentially a software implementation of an interface which is part of a virtual switch. At the other end of the virtual switch, a virtual bridge interface (virbr0) merges the traffic from the VMs and interfaces to the IPTABLES module, which performs the actual NAT. At the end, you have the eth0 physical interface which carries the packets to the actual wire.
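Before changing anything, it helps to see how libvirt itself describes this default NAT setup; virsh can dump the definition of the default network, which ships with libvirt on a stock RHEL 6 install:

```shell
# List libvirt virtual networks and dump the XML of the default NAT network
virsh net-list --all
virsh net-dumpxml default
```

The dumped XML contains the virbr0 bridge name and a forward element with mode 'nat', matching the data path of Figure 1.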

In this scenario, your guest OS will have outbound network connectivity. Should you wish to enable inbound network connectivity, however, you will find it blocked. It is possible to perform other tricks and enable port forwarding/SNAT/DNAT to allow inbound connections; however, this is cumbersome. As a result, my definition of deploying a proper virtual server resembles the following aspects of a true physical server:

You have a physical MAC address tied to a network/VLAN broadcast domain

You can deal with that MAC address in any way you would deal with a true physical NIC: ARP, assign a static IP, (static) DHCP, etc.

You can have unrestricted outbound and inbound network access within that network/VLAN broadcast domain, a must requirement for a server system.

In order to achieve this, we need to employ the technique of interface bridging. For references on bridges, you can consult a variety of sources such as:
i) The IEEE 802.1D standard
ii) The older (out of date but still useful) Ethernet Bridge + netfilter HOWTO from TLDP.
iii) A copy of A. S. Tanenbaum's classic Computer Networks textbook.
However, before explaining how this works, let's throw in a realistic production environment scenario.

Figure 2

Figure 2 displays the network topology of a production VM server scenario. There are two networks. One is an internal Class C (192.168.14/24), where hosts may or may not have outbound connectivity. Inbound connectivity to this network is prohibited by the top server, which offers FTP, DMZ, FIREWALL, DHCP, and DNS services on the INTERNAL net. The other network is a world-routable Class B (129.230/16).

The VM host server needs to serve a number of virtual servers that have different network access criteria:

Guest_01: Linux server to run a LAMP stack, which needs to be exposed beyond the internal network.

Guest_02: Development Windows 7 box, which needs to be accessible via non-standard port ranges on the internal network, but also needs Internet access.

Guest_03: Legacy SCADA Windows XP based system which needs to be accessible only via the internal network.

Clearly, Guest_01 is the least restricted system, so it makes sense to place it on the INTERNET/EXTERNAL Class B net. Guest_02 needs some protection so the outside world cannot reach it; it should only reach the outside world by means of IP Masquerading, using the publicly routable IP of the FTP/DMZ/FIREWALL/DHCP/DNS server (129.230.135.131). Thus, it is a candidate for the INTERNAL Class C net. The same goes for Guest_03, which is the most isolated environment we need to protect, accessible only by INTERNAL network hosts.

At this point, it is useful to modify Figure 1 to illustrate the virtual network data path of our new scenario.

Figure 3

Figure 3 above illustrates the virtual network data path of our production scenario (Figure 2). In this case, instead of the virbr0 we have bridging modules bound to physical interfaces. Each physical interface is connected to the proper network/VLAN and has a bridge bound to it (we will illustrate how this is done). The role of the bridge is to create a data channel and forward traffic between the vnetx interfaces of the virtual switch and the physical interfaces. The objective is to enable the MAC address of the Guest_X machines to connect to the actual physical network/VLAN, as stated earlier. As a result, via bridge br3, we enable the virtual servers Guest_02 and Guest_03 for the internal network and via br4, we connect Guest_01 to the external world.

The Practice

The previous section presented the theory. It's time now for the hands-on practical part. First of all, if you are dealing with a fresh installation, make sure you yum install the following groups, in order to have the full range of virtualization utilities and install your guests.
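The list of groups did not survive in this copy of the article. On a stock RHEL 6 repository, the relevant yum groups are typically the ones below; the exact names are worth verifying with `yum grouplist` on your system:

```shell
# Install the KVM/libvirt virtualization stack (group names per stock RHEL 6; verify with: yum grouplist)
yum groupinstall "Virtualization" "Virtualization Client" "Virtualization Platform" "Virtualization Tools"
```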

The next thing you should ensure is that you have enough physical network interfaces on your VM host server. In order to implement our production scenario, Figure 2 indicates clearly that we need four Ethernet NIC ports: Two of them (eth2, eth3) are used to enable the server to have IP connectivity and routing on both networks. In contrast, eth4 and eth5 will be dedicated to carry the virtual server traffic.

We will not need IP addresses for interfaces eth4 and eth5. They will be brought up only to carry the bridged VM traffic. Make sure you identify the NIC ports properly and connect them to the proper network/VLAN Ethernet switch ports. To do that, you can remove their network cables and use the ethtool command to blink the NIC lights on the server side by doing a:

ethtool -p eth4

and

ethtool -p eth5

to identify the respective NIC ports. The next step is to connect them to the proper switch ports. In principle, once you identify the NIC port side with ethtool, you should be OK. In practice, it is easy to make mistakes in messy/unlabelled network panels. Thus, after connecting the cables to the switch ports, one easy check is to bring the interface into promiscuous mode and watch for traffic indicating you are indeed on the right network/VLAN, by doing things like:

tcpdump -i eth4

Now that the cables are connected properly, we can start configuring the Ethernet bridges. A bridge is just another interface, and the best way to configure this on a RHEL 6 system is by getting your hands dirty. Go right under the /etc/sysconfig/network-scripts directory and use your favourite text editor (vim, nano, Emacs) to create two files, one for each bridge interface device:

ifcfg-br3 with the following contents:

DEVICE=br3
BOOTPROTO=none
TYPE=Bridge
ONBOOT=yes
DELAY=0

ifcfg-br4 with the following contents:

DEVICE=br4
BOOTPROTO=none
TYPE=Bridge
ONBOOT=yes
DELAY=0

This takes care of the bridge interface declaration. What's left is to associate the newly defined bridges with the right physical interface. Thus, under the same directory (/etc/sysconfig/network-scripts), we create two more files:

ifcfg-eth4 with the following contents:

DEVICE=eth4
HWADDR=00:10:18:31:5A:5B
NM_CONTROLLED=no
ONBOOT=yes
BRIDGE=br3

ifcfg-eth5 with the following contents:

DEVICE=eth5
HWADDR=00:10:18:19:4F:5C
NM_CONTROLLED=no
ONBOOT=yes
BRIDGE=br4

In short, with these four files we ensure that we have a persistent config where all interfaces (bridges and physical ones) are up on boot and we associate br3 to eth4 and br4 to eth5 (Figure 3). Fans of the brctl utility could also achieve the same result by doing a:
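The brctl commands themselves are not reproduced in this copy; a runtime-only equivalent of the four ifcfg files (nothing done this way persists across a reboot) would be along these lines:

```shell
# Runtime-only equivalent of the ifcfg files above (lost on reboot)
brctl addbr br3
brctl addif br3 eth4
brctl addbr br4
brctl addif br4 eth5
ifconfig eth4 0.0.0.0 up
ifconfig eth5 0.0.0.0 up
ifconfig br3 up
ifconfig br4 up
```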

and check that the bridges and physical interfaces are up and available by issuing an ifconfig command. If all is well, you should see output like the one below (I have excluded some of the non-relevant output for brevity):

Note that all relevant interfaces are up and do not have an IP address. The second thing you should note is that each bridge interface has the same MAC address as the physical interface it is associated with.

If you have reached that point, you are almost done. What you need to do now is to build your virtual machines. I assume you are familiar with how to build VMs on virt-manager. If not, I have written a quick summary of the procedures. Alternatively, if you have already existing VMs, you could reconfigure their networking to use the bridge interfaces.
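For an already existing VM, the same result can also be achieved outside virt-manager by editing the domain definition (virsh edit Guest_02, for example) so that its network interface attaches to the bridge. A typical interface stanza, in the spirit of our scenario (the model line is optional and illustrative), looks like this:

```xml
<!-- Inside the domain XML: attach the guest NIC to bridge br3 -->
<interface type='bridge'>
  <source bridge='br3'/>
  <model type='virtio'/>
</interface>
```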

Figure 4

Figure 4 above illustrates the network config for Guest_02. Make sure that the 'Source device' is one of the available vnet interfaces that connect to br3 and apply the changes. You can do the same for the rest of the virtual server VMs. When you are done, you can check the final configuration with the brctl utility by doing a:

brctl show

and you should get output similar to the one below:

Figure 5

Note the interfaces column, which should correctly list all the physical and vnet interfaces associated with each bridge. When you fire up any of the virtual servers, you should be able to see it with its vnet interface's MAC address on the virtual network. Let's take Guest_02 as an example. From our VM host server console, we type:
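The console commands and output are not reproduced in this copy. A check in this spirit, assuming the hypothetical address 192.168.14.112 was leased to Guest_02 by the internal DHCP server, would be:

```shell
# 192.168.14.112 is a hypothetical address; use whatever your DHCP server actually assigned to Guest_02
ping -c 2 192.168.14.112
arp -n | grep 192.168.14.112
```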

Note Guest_02's MAC address from Figure 4. That's the one replying and bridged into the internal network. This means that for all intents and purposes, Guest_02 is just another server on the internal network. Mission accomplished.

Friday, August 3, 2012

EMBOSS Database configuration

Part 1 of this article series covered a basic installation of EMBOSS from sources. The configuration of EMBOSS databases merits a separate Part, as it requires some knowledge of the indexing process and the various mechanisms to download and index flat file databases. Correspondence from the EMBOSS mailing list shows that this is a topic that frequently confuses users and admins. Thus, we are going to take a detailed look at it.

Remote data access methods and the emboss.default file

If you would like a recap of what a flatfile database is and what EMBOSS can do for you in terms of accessing indexed flatfile databases, you might like to take a look at some of the lectures I have given on the subject (slides, video). EMBOSS is not the fastest or most efficient way to index your flatfile databases. You should look at something like MRS and similar systems for a more efficient way to index and perform comprehensive queries on flatfile databases. In fact, EMBOSS can access MRS-indexed databases and, in my opinion, this is better than a pure EMBOSS index system in many respects (speed of indexing/querying the index, storage efficiency, etc). Nevertheless, EMBOSS does its job, and this section describes only the process of indexing flatfile databases using exclusively EMBOSS utilities.

One thing you need to understand is that in order to have access to indexed flatfile databases, you do not always have to index them locally. The EMBOSS applications support a variety of remote data retrieval methods to many useful datasets. Amongst the most popular of them we have:

MRS methods (mrs, mrs3 and mrs4): These allow you to search an MRS based index on a local or remote server.

To understand how to engage/activate these different data access methods, you will need to become familiar with the 'emboss.default' file. Part 1 of this article mentioned that the EMBOSS installation directory was under /usr/lsc/emboss. You will need to navigate to the following directory:

/usr/lsc/emboss/share/EMBOSS

When you install EMBOSS for the first time on your system, you will see, amongst others, two files:

The 'emboss.default.template' file: This is a sample configuration file which shows the EMBOSS admin how to define databases. We will explain more as we go, but you can use this file as a reference to see many examples of how to properly configure various types of EMBOSS databases.

The 'emboss.standard' file: This file also contains valid EMBOSS database configuration entries; its database definitions are included by default in your current setup.

The idea is that you have some default entries in the emboss.standard file which are included in your database list. So, if on your shell you issue a:

showdb

you will immediately get the following list of database entries by default:

If you wish to define any additional databases beyond this default list, you should create an emboss.default file, using the file 'emboss.default.template' as your reference (we are going to explain how shortly).

For now let's focus on these default databases defined by the emboss.standard file. They are a good example of how the new EMBOSS 6.5 enables remote data access from a variety of global public servers out of the box (I assume your Internet connection is working, right?). Let's use the EDAM ontology to retrieve data about an identifier. To do that I choose the ontotext EMBOSS application and I type:

ontotext edam_data:0849

The resulting file (0849.ontotext) contains the info which is retrieved from available servers. Let's take a look at the emboss.standard file to see how the edam_data database is defined:

The DB definition part: It defines the name, type, format, access method and various fields of the database record.

The RES (resource definition) part: Where the length of the various record fields is defined in the index. (note that RES definitions are normally found towards the end of the file).

The DB and RES fields go together for each database definition. In addition, for remote data access methods, a SERVER definition might be necessary to enable access to remote information repositories.

Step 9: The 'emboss.default' file does not yet exist, so create it under the directory where the emboss.default.template file resides. From now on, you will be editing the emboss.default file to define all aspects of the EMBOSS database configuration. Start with a minimal file like the one below:
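The minimal file itself is not reproduced in this copy; a sketch in the spirit of the emboss.default.template examples would look something like the following (the access method and field names are assumptions that should be verified against the template):

```
# Minimal emboss.default sketch; verify method and field names against emboss.default.template
DB martensembl [
   type: "Nucleotide"
   method: "biomart"
   url: "http://www.biomart.org/biomart/martservice"
   dbalias: "hsapiens_gene_ensembl"
   comment: "Homo sapiens genes from Ensembl via the Biomart service"
]
```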

Shown here, we have defined the database 'martensembl', which can retrieve entries remotely from the Homo sapiens Ensembl gene repository. Save the file and go back to your shell. You can repeat the 'showdb' command and verify that you can see the newly defined martensembl database. Now, test it by typing:

seqret martensembl:ENST00000380152

The resulting fasta file should contain the info you require, and this was all the way from the remote Biomart server. Congratulations, you just set up your first remote database access in EMBOSS!

Browsing remote access repositories is a good idea, and the EMBOSS team was right to enable this functionality. However, accessing remote datasets does not always work well if:

You go into a place where Internet availability is sketchy or of limited bandwidth capacity.

The datasets you need to access involve millions of sequences or Gigabytes of information.

In these cases, your only reliable option is to set up a database locally and build a flatfile database index. This is explained in the next section.

How to define a local flatfile database index

What was said in the previous section about the main parts of an EMBOSS database definition in the emboss.standard file also applies to the emboss.default file. Let's give an example of how you can format the latest Uniprot/sprot database, in three steps:

Step A: Download and uncompress the latest file into your flatfile index area, a directory where you should have plenty of space to hold your flatfiles and the produced indices of your datasets. The file lies on the EBI FTP server. On the command line, you could do a:
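A concrete version of this step might look like the following; the EBI FTP path reflects the layout at the time of writing and the local directory is illustrative, so double-check both:

```shell
# Fetch and uncompress the Swiss-Prot flatfile into the flatfile area (local path is illustrative)
cd /storage/flatfiles
wget ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz
gunzip uniprot_sprot.dat.gz
```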

The first two lines are optional and provide an alias for the directory locations where you have uncompressed the flatfile and you are going to produce the index. After that you have the database (DB sprot) definition. It is a protein sequence database (type: P). The fields specification is important. It lists all the indices that are going to be produced. So, we know that we will be able to search the database by sprot IDs (id), accession number (acc), sequence version (sv), descriptive text from the sequence header (des), keyword (key) and taxonomy info (org).
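The definition block itself did not survive in this copy; a sketch consistent with the fields just described (directive names should be verified against emboss.default.template) could look like this:

```
# Sketch only; verify directive names and values against emboss.default.template
SET flatfiledir /storage/flatfiles
SET indexdir /storage/indices

DB sprot [
   type: P
   format: swiss
   method: emboss
   directory: $flatfiledir
   file: uniprot_sprot.dat
   indexdirectory: $indexdir
   fields: "id acc sv des key org"
   comment: "UniProt/Swiss-Prot, locally indexed"
]

RES sprot [
   type: Index
   idlen: 15
   acclen: 15
   svlen: 20
   keylen: 25
   deslen: 25
   orglen: 25
]
```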

Each of these index fields has a defined length as part of the associated RES (resource definition) entry. Note that it is important to define both the DB and the RES blocks. If you do not, for example if you forget to define the RES record, the EMBOSS applications will complain with an error message similar to this one until you resolve the issue:

EMBOSS An error in ajnam.c at line 9126:unknown resource 'sprot'

For now, save the file and do a showdb to verify that you can see the 'sprot' database. If you have omitted or misconfigured any important parts of the definition, the command should complain with informative errors.

The same procedure could be used for nucleotide databases (type: N). Remember, you have the emboss.default.template as your guide. I hope you have a better understanding of how you can setup local databases in EMBOSS now.

Tuesday, July 31, 2012

Every 15th of July, the EMBOSS team at EBI releases a fresh version of the European Molecular Biology Open Software Suite (EMBOSS). Started and shaped by the EMBnet community, EMBOSS is one of the most versatile systems to perform sequence analysis and a variety of bioinformatics pipeline tasks, as it copes with a variety of file formats and contains a plethora of applications.

Most of the procedures outlined here are described in more detail by the 'EMBOSS User's Guide: Practical Bioinformatics' book, written by the EMBOSS authoring team. While this is an excellent publication, books quickly get out of date as software evolves. In addition, the on-line EMBOSS administration documentation is out of date. As a result, I felt that this two-part article series (Part 2 covers the task of enabling data access in EMBOSS, including local flatfile database setup) would be a quick startup guide for those that have to administer EMBOSS installations.

This year the version clock has turned into 6.5. In this Part, I shall be going through an installation from the sources on a production Linux server, covering all aspects of the system configuration, including the formatting of databases. There might be binary/prebuilt packages available for your Linux distribution. However, I always maintain the principle of building the latest binaries from the sources. This gives you the latest and the greatest with a little bit of extra effort.

Most of the steps below can be automated with simple scripts. However, the process of going through a manual installation of EMBOSS should make you aware of the different system components. Once you have an understanding of the system, it is then wise to automate/script these steps.

What kind of hardware you will need

EMBOSS is a fairly modest system to install in terms of hardware requirements. The only thing that can stretch the hardware envelope is how much data you would like to index. If your server should host/index the entire EMBL/Genbank databases, you will need plenty of disk space (I advise you to have at least 3-4 Tbytes to spare; yes, you read that right).

Memory and CPU wise, 8 cores with 32-64 Gigs of RAM should be enough to keep most user loads happy (30-40 users) on a production server setup. Your workload dictates the hardware requirements: if you are trying to do a global alignment of large sequences, you might easily eat up 64 Gigs of RAM, whereas basic sequence processing could be performed on a dual-core laptop with 4 Gigs of RAM. By and large, the figures I suggest here should meet most requirements. If you have the task of specing an EMBOSS server, your best bet to get it right is to talk to your scientists and ask what sort of operations they will be performing, to get an accurate picture of the hardware specs.

The downloading of the sources

Prior to starting, I ensure that my Linux system has most of the development libraries installed. Some EMBOSS applications can be sensitive to missing libraries like libpng, libjpeg, etc. You will also need to ensure that you have your C/C++ compilers installed (gcc/g++).

EMBOSS is a large system. Apart from the core EMBOSS packages, there is an entire array of third party applications that are bundled together with the EMBOSS core applications (some examples: PHYLIP, MEME, IPRSCAN). These are the EMBASSY tools. This is a detail for most users, who collectively refer to the entire package as EMBOSS. However, when you go to download the source EMBOSS tarball, it does not contain these additional packages. This means that if you want to have the full array of EMBOSS/EMBASSY applications, you will have to go through the following steps:

1) Go to the main EMBOSS FTP download server and download the latest EMBOSS tarball (normally named emboss-latest.tar.gz). In my case, it points to EMBOSS-6.5.7.

2) After downloading this to my source dir, I unpack it by doing a:

tar -xzvf EMBOSS-6.5.7.tar.gz

3) I then cd into the EMBOSS-6.5.7 dir and, at the top level of the sources, I do a:

mkdir embassy

4) Under the newly created embassy directory, I then download the tarballs of the EMBASSY packages (version info will vary, but the base name of each package should be more or less the same): CBSTOOLS, CLUSTALOMEGA, DOMAINATRIX, DOMALIGN, DOMSEARCH, EMNU, ESIM4, HMMER, IPRSCAN, MEME, MSE, PHYLIPNEW, SIGNATURE, STRUCTURE, TOPO, VIENNA. I unpack each of the tarballs with the same command as in step 2, under the embassy subdirectory. Once I am done, I can delete the remaining *.tar.gz files.
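Unpacking all sixteen tarballs one by one is tedious; a small loop, run from inside the embassy directory, does the same thing:

```shell
# Extract every EMBASSY tarball in the current (embassy) directory, then remove the archives
for tarball in *.tar.gz; do
    tar -xzf "$tarball"
done
rm -f *.tar.gz
```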

5) At this point, it might be wise to create a tarball with all the sources properly laid out under the embassy subdirectory, by going above the EMBOSS-6.5.7 directory and doing a:

tar -cvf embossembassy65.tar EMBOSS-6.5.7/

This will create the file embossembassy65.tar. This is handy in case you wish to erase the whole source tree and start from scratch, and/or repeat the installation on other systems without having to go through steps 1-4 again to assemble the source tree.

Configure and compile

We are now ready to start configuring the various packages and eventually compiling them into the EMBOSS/EMBASSY binary applications we shall be using. In my system, I choose that the directory holding the binaries and the produced libraries should be under:

/usr/lsc/emboss

You are free to choose what you wish on your system.

6) Thus, I cd into the top level of the EMBOSS-6.5.7 directory and issue a:

./configure --prefix=/usr/lsc/emboss; make; make install

In one sentence, this tells the configure process where to place the produced files and instructs the system to compile and place the produced applications under that location. Grab a cup of tea/coffee/beer, as this will take some time. If all goes well and you see no errors in the terminal output, you should see the first installed binary applications under the /usr/lsc/emboss/bin directory. In my case, I verify that I have functioning applications by executing embossversion:

./embossversion
Report the current EMBOSS version number
6.5.7.0

This means that I am on good ground and can continue with the installation of the rest of the applications.

One detail new to the process of installing EMBOSS as of version 6.5.x is the automatic kick-in of the embossupdate application, which you will note in the final output lines of a successful step 6 operation:

... make[3]: Entering directory `/usr/lsc/sources/EMBOSS-6.5.7'
/usr/lsc/emboss/bin/embossupdate
Checks for more recent updates to EMBOSS
EMBOSS 6.5.7.0 is the latest version (6.5.0.0)

Basically, the EMBOSS install process will check for patches and updates to the source code, a process performed manually by EMBOSS admins before. This is a very welcome addition and eases the process of receiving up-to-date code, in order to address bug fixes and enhancements.

If you do not get to the point where you see the emboss applications and you see errors as part of the make process, the most likely scenario is that you are missing some development library or tool. You can get help by posting a request for help to the EMBOSS mailing list.

What you need to do now is repeat step 6 for every subdirectory under the embassy directory and gradually watch the new applications being added to the bin folder.
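Repeating step 6 by hand for each package is error-prone; a loop along these lines (assuming all EMBASSY sources sit under the embassy subdirectory, as laid out in step 4) automates it:

```shell
# Configure, build and install each EMBASSY package in turn; stop at the first failure
for pkg in embassy/*/; do
    ( cd "$pkg" && ./configure --prefix=/usr/lsc/emboss && make && make install ) || break
done
```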

Post installation configuration

You should have installed by now all the applications of core EMBOSS and EMBASSY packages from source. After this process, you should start configuring your system so you can make the applications available.

7) Make sure that the emboss bin folder is in a system-wide path, to ensure that all users can reference the applications. For my systems, all the freshly compiled applications reside under the /usr/lsc/emboss/bin folder. Hence, this is the folder I enter into the system-wide PATH. In my server's /etc/profile.d/bash_login.sh, there is a line that contains the following:

export PATH=$PATH:/usr/lsc/emboss/bin

8) Make sure you install all the application dependencies for the EMBOSS/EMBASSY applications you are going to use. A number of EMBOSS/EMBASSY applications are wrappers around third-party packages. This means that the EMBOSS/EMBASSY application will not function unless you install its required dependencies. This is normally simple. I am not going to mention all the dependencies now, but a few examples from my userbase are the following:

- emma, which requires the installation of the Clustalw tool.
- eiprscan, which requires the installation of the iprscan tool.
- ememe, which requires the installation of the meme tool.

Each of these installations might involve an entire set of separate procedures and instructions, but you get the picture.

Part 2 of this article will examine how to configure the EMBOSS databases.

Friday, January 6, 2012

In Part 1 of the article series, we examined the basics of what MRS is and its computer hardware requirements. It is about time we get our hands dirty and install a production MRS server.

A basic production setup

The image above illustrates a basic production setup for MRS. You do not have to follow this setup, you could have a single server to handle everything. However, the above setup has a number of advantages that I shall explain.

There are two servers here. The front-end one serves the user queries, whereas the back-end server is used for the MRS index build process. You will notice that the front end is more beefed up hardware-wise than the back end. This is because (as explained in Part 1) the MRS queries can scale in terms of CPU, I/O and RAM. That is not the case with the index building process, which, beyond 8 cores and the I/O it can generate, will not scale to a large number of CPUs/cores. As a result, it makes sense to have the most capable machine at the query response end and keep an 8-core CPU with an adequate amount of RAM to crunch your datasets periodically.

The disk I/O setup reflects the same need/trend. I would recommend placing your disks at the front-end machine and having a capable disk controller (Directly Attached Storage SAS, Fibre Channel, Fibre Channel over Ethernet). The back-end machine can access these disks to build the index by means of a well-performing NFS setup over 10 Gigabit Ethernet. Plain Gigabit Ethernet should also be acceptable; however, I found that a "jumbo frame" enabled 10 Gigabit Ethernet setup cuts the index generation time by 40-60% on average compared to plain Gigabit Ethernet.

This setup is designed to achieve two things:

To place the performance where is mostly needed (MRS queries), especially if MRS is used as part of a pipeline (command-line or Galaxy based).

To reduce the impact of the index generation process on a busy/hard-working server that is hit by queries.

The disadvantage is of course that you have to keep two MRS instances running, so what I describe below should be applied to both servers in order to keep things in sync. However, you will see that once you get a basic instance up and running, most of your attention will turn to post-installation issues and not really on keeping two instances in sync, installation-wise.

Software prerequisites

Before we get to the specifics of an MRS server installation, let us go through some important software requirements for installing on a RHEL 6 platform. If your distro is Redhat based (Fedora, CentOS, and Scientific Linux are some of the most well known free derivatives of RHEL), the instructions should carry you through to a functional MRS installation. If your distro is not RHEL based, you can at least have a good appreciation of what building blocks are required for the proper operation of the system. Here is a list of them:

Working as the root user on a RHEL 6 platform, most of these components can be easily installed with the yum package manager, with the exception of the Boost library and snarf:

yum install perl-XML-LibXSLT
yum install libarchive libarchive-devel

Starting with the gcc compiler: due to some code optimization bugs, there were problems when attempting to compile MRS and its prerequisites with a compiler more recent than the 4.4.x series of gcc. By mid-January 2012, this issue was addressed and it is now possible to use compilers more recent than 4.4.x. Nevertheless, the RedHat default 4.4 gcc compiler (in my case, 4.4.6 20110731 (Red Hat 4.4.6-3)) is a stable choice.

At the time of writing, RHEL 6.2 (Santiago) is equipped with Boost version 1.41 as part of its default yum package repository. That's too old for MRS, which means that we have to uninstall the yum-related Boost packages and install the Boost libs from source.

tar xvfz boost_1_48_0.tar.gz
cd boost_1_48_0
./bootstrap.sh --prefix=/usr/lsc/libs
./b2 install

(Note: Earlier versions of libzeep had a problem with Boost versions >1.47 and would not build. Around mid-January 2012, it became possible to use Boost version 1.48.)

At that point, make sure that your shared library config (normally /etc/ld.so.conf) contains the /usr/lsc/libs/lib path, and then run ldconfig. Check with an ldconfig -p | grep /usr/lsc/libs/ to see that the Boost shared libraries are in place.

For snarf, you need to install the utility in the system wide PATH.

MRS installation

At this point, we should be ready to start installing MRS itself. Libzeep is the first part of installing MRS. It is a bespoke W3C-compliant XML processor that enables MRS to talk SOAP, which allows users to query an MRS server using web services. Still working as the root user, get the latest version (at the time of writing, 2.6.3) from the CMBI SVN server:

svn co https://svn.cmbi.ru.nl/libzeep/trunk

(revision 337)

Modify the makefile and set the following parameters, keeping in mind the prefix under which you want libzeep to install:
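The exact variable names vary between libzeep revisions, so treat the following as a hypothetical sketch rather than the literal makefile contents; the point is simply to direct the install prefix and the Boost location at the paths chosen earlier:

```make
# Hypothetical variable names -- check the makefile in your libzeep checkout.
PREFIX        = /usr/lsc/libs
BOOST_LIB_DIR = /usr/lsc/libs/lib
BOOST_INC_DIR = /usr/lsc/libs/include
```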

We are ready to install the actual MRS code now. Grab the latest version from the CMBI SVN server:

svn co https://svn.cmbi.ru.nl/mrs/trunk

(checks out revision 1430)

Note: The MRS SVN repository is an active project and, as such, the developers might be in the process of cleaning up or modifying the code. If you check out the latest sources from the CMBI SVN server, something might break or fail to compile. When in doubt, please consult the MRS mailing list and verify the latest known working version. At the time of writing, revision 1430 is known to be a working MRS version. If you wish to use it as a reference, you can issue the command:

svn co -r 1430 https://svn.cmbi.ru.nl/mrs/trunk

Then I shall make the directory that will hold the MRS binary utilities, as well as the directory where I am going to store the datasets (the large multi-TB volume we talked about in Part 1 of this article series):

mkdir /usr/lsc/mrs

mkdir /storage/tools/mrsdata

Then I am ready to initiate the configuration of the sources by selecting the prefix and the data directory location, as well as pointing to the location of the Boost libraries which I installed from source, just to be safe and ensure the MRS configure routine finds the right library paths, as shown below:
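The original post shows the configure invocation as a screenshot; the option names below are assumptions for illustration only, so run ./configure --help in the MRS source tree to get the exact spelling for your revision:

```shell
cd trunk
# Option names are hypothetical -- verify them with ./configure --help.
./configure --prefix=/usr/lsc/mrs \
            --data-dir=/storage/tools/mrsdata \
            --boost=/usr/lsc/libs
```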

Various checks will be performed and if no errors are returned at this stage, the configure command should be followed by a:

make; make install; ldconfig

If the compilation stage finishes with no errors (you will see plenty of warnings and you can normally safely ignore them), you have just completed the MRS installation stage. Congratulations!

Post installation config check and orientation

In this section, we will discuss what you should see and check prior to using MRS for the first time. After completing the make install and ldconfig steps as described above, you should familiarize yourself with the directory layout of MRS. So, let us take a tour of the various MRS directories.

First of all, under the installation prefix (it was /usr/lsc/mrs) you should see the following directories:

bin: This is where the MRS utilities reside: mrs-blast, mrs-config-value, mrs-mirror, mrs-run-and-log, mrs-build, mrs-lock-and-run, mrs-query, mrs-test and mrs-update. These are the tools you will use to configure and query the various MRS datasets.

lib: This directory is meant to hold MRS library modules, but it is empty on 64bit systems (x86_64).

lib64: On 64-bit systems (x86_64), this contains the MRS.so shared library, as well as the MRS.pm Perl module, a vital module referenced by the dataset parsing scripts (share directory).

sbin: Here you should have the mrs-ws binary, which is the MRS SOAP web services module.

share: This directory contains a series of Perl parsers, one for each databank MRS supports.

Next, you should navigate your shell to the /usr/local/etc/mrs directory. Under this directory, you should find a series of important configuration files. I shall not go into the details of their syntax in this article, but very briefly:

databank.info: This file instructs MRS how to fetch (location and method) and generate index for various databanks you can offer/query under MRS.

mrs-config.xml: This XML-formatted file (its DTD schema is in mrs-config.dtd) controls various operational parameters of MRS, such as the locations of the various MRS directories (most of them are auto-generated by the configure step of the MRS sources), the paths of externally used utilities (clustalw, NCBI BLAST), as well as the port number and URL location of the MRS SOAP web services servers. Later articles will explain these parameters in more detail.

Both of the previously mentioned files come with a sample you can use for reference (*.dist files).

If you can see all of these things at this point, you are on the right track to fire up MRS for the first time and check it out. Navigate back to the /usr/lsc/mrs/bin directory; we are going to fetch a simple databank and watch MRS generate its index so that we can query the database. We shall do that by running the mrs-update utility:

./mrs-update enzyme

and if everything was compiled properly, MRS will issue the following output:

After that, we can navigate into the MRS data directory to gain a basic understanding of what happens every time MRS generates the index of a databank. Under the data directory (in my case, as indicated by the output above, /storage/tools/mrsdata), you will find the following sub-directories:

mrs: This is where the MRS index files are produced and stored. Each databank has a number of associated .cmp files, together with an associated dictionary file .dict. For the enzyme databank, the produced files are enzyme.cmp and enzyme.dict.

raw: This directory holds the flatfiles of the databanks. These are downloaded from the URL and with the method specified in the databank.info file.

status: A useful directory for the MRS administrator, as it holds important logs about the status of the mrs-update process. For each MRS-hosted databank you can see the fetch_log (did the flatfile download procedure complete?) and the mrs_log, which outlines whether the MRS index generation completed properly. Finally, if all went well, an mrs_done file is created to indicate that MRS updated the databank successfully. The logs for each databank auto-rotate.

docroot: This directory holds the CSS/HTML and web content of the MRS HTTP server. We will describe how to fire-up this server shortly, together with the system SOAP functionality.

flags: This directory is used internally by MRS to sync certain procedures of the databank fetching process.

blast: This directory contains the NCBI BLAST database index for each databank, making it possible to BLAST databases via the MRS system.

Hence, every time you mrs-update a databank, the latest flatfiles are fetched automatically into the raw directory. After that, the mrs-build utility invokes the MRS parsers and creates the index under the mrs directory.
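The status files described above lend themselves to a quick scripted check. This sketch assumes a per-databank subdirectory under status/ holding the mrs_done marker and the mrs_log; adjust the paths to match what you actually see under your data directory:

```shell
# Hypothetical layout: $MRS_DATA/status/<databank>/{fetch_log,mrs_log,mrs_done}
MRS_DATA=/storage/tools/mrsdata
db=enzyme
if [ -e "$MRS_DATA/status/$db/mrs_done" ]; then
    echo "$db: last update completed"
else
    echo "$db: update incomplete, last lines of mrs_log:"
    tail "$MRS_DATA/status/$db/mrs_log"
fi
```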

If you wish to see which databanks you can fetch/update with the mrs-update utility, here is a list of them:

I leave the index generation of the rest as an exercise for the reader, with two hints:

Do not attempt to start multiple mrs-update processes in parallel. Remember, the index generation process does not scale.

Some of the largest databanks (embl_release, genbank, dbest, pdb, hssp) will require entire days to download and index. Thus, what I tend to do is issue something like nohup ./mrs-update embl_release & to ensure that the process will not be interrupted by a terminal session timeout or disconnection.
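Since parallel runs do not help, several databanks can be updated one after another from a single detached shell; the databank names below are just examples taken from the list above:

```shell
cd /usr/lsc/mrs/bin
# Update several databanks sequentially, surviving a terminal disconnect.
nohup sh -c 'for db in enzyme pdb hssp; do ./mrs-update "$db"; done' \
      > /tmp/mrs-update.log 2>&1 &
```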

Querying and firing up the MRS web and SOAP server

If you have followed all the previous instructions, you should have installed MRS and indexed one or more databanks. What about querying them, to ensure that MRS does its job? After all that effort, you should really experience the power and simplicity of MRS searches.

The most intuitive way is to fire up the MRS web server. Before you do, you should consider making a non-privileged user. Up to this point, we have been working as root; however, opening an HTTP/SOAP port bound to a process with superuser credentials is not the best thing for the security of your server. What I do is create a normal system user:

useradd -d /home/users/mrsuser mrsuser

I assign a secure password to the user. As root, I make sure that this user has access to the /var/log/mrsws.log file, which logs the queries that hit the MRS SOAP server:

touch /var/log/mrsws.log

chown mrsuser /var/log/mrsws.log

and also change the ownership of the /usr/lsc/mrs directories to mrsuser, recursively:

chown -R mrsuser /usr/lsc/mrs

After that, I switch to the mrsuser account and start the MRS SOAP server from the sbin directory:

su - mrsuser

cd /usr/lsc/mrs/sbin

nohup ./mrs-ws &

This starts a number of mrs-ws server processes with the credentials of the mrsuser account rather than root. Make sure no firewall blocks the port between your desktop and the server, then point a recent web browser (Firefox 8, Chrome) at the IP of your server, following the URL convention:

http://IP_of_your_server:18080
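Before involving a browser, a quick way to check that the server answers at all (assuming the default port 18080 from the URL above) is:

```shell
# Expect an HTTP status code such as 200 if mrs-ws is up and serving.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:18080/
```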

If you hit the Status tab, you will see the MRS web environment. You can enter your search terms in the top bar and search against all or specific databases.

There are other ways to search the databases, which will be outlined in Part 3 of this article series. However, you now have the knowledge to kickstart MRS in a basic way. In the next article, we will discuss the production usage of MRS.