Pages

Saturday, August 23, 2014

Hi all,

It has been a while since my last post, but the reason is worth the long time absent. Since January I am co-leading the bioinformatics and I.T department of one of Genomika Diagnósticos.

Genomika is one of most advanced clinical genetics laboratory in Brazil. Located in Recife, Pernambuco,in the Northeast of Brazil, it provides cutting edge molecular testing in cancer samples to better define treatment options and prognosis, making personalized cancer management a reality. It also has a vast menu of tests to evaluate inherited diseases, including cancer susceptibility syndromes and rare disorders. Equipped with state-of-the-art next-generation sequencing instruments and a world-class team of specialists in the field of genetic testing, Genomika focus on test methods that improve patient care and have immediate impact on management. There is available a pitch video about our lab and one of our exams (unfortunately the spoken language is portuguese).

Our video about sequencing exams spoken in portuguese

My daily work is to provide tools; infra-structure and systems to support our clients and teams in the lab. One of major teams is the molecular biology sector. It is responsible for the DNA sequencing exams, which includes targeted-panels, specific genes or exons or whole exome. Each one of those genetic tests, before delivered to the patient and the doctor, goes under several data pre-processing and analysis steps organised in a ordered set of sequential steps, which we call a pipeline.

There is a customised pipeline for clinical sequencing; where we bioinformaticians and specialists study the genetic basis of human phenotypes. In our lab pipeline we are interested on selecting and capturing the protein-coding portion of the genome (we call the exome). This region responsible for only 3% of our human DNA can be used to elucidate the genetic causes of many human diseases, starting from single gene disorders and moving on more complex genetic disorders, including complex traits and cancer.

Clinical Sequencing Pipeline overview

For this task, we use several tools that must handle with large volumes of data, specially because of the new next-generation DNA sequencing machines (yeah we have one at our lab from Illumina). Those machines are capable of producing in shorter times and lower costs large amount of NGS data.

Taking those challenges into account, we perform our sequencing, alignment, detection and data analysis of human samples in order to seek variants. This study we call variant analysis. Variant analysis looks for variant information, that is, possible mutations that may be associated to genetic diseases . Let's consider as examples of mutation or variant as follows: a change of nucleotide (A for T) (single nucleotide variant or SNV) or even a small insertion or deletion (INDEL's) that can impact the functional activity of the protein. Looking after variants and even further seek and identify those related to diseases or genetic disorders is a big challenge in terms of technology, tools and interpretation.

The reference in the genome at bottom; the variants above.
In this example there's a possible exchange of A to G (SNV) in a specific position of the genome.

In our lab we are developing a streamlined, highly automated pipeline for exome and targeted panel regions data analysis. In our pipeline we handle multiple datasets and state of the art tools that are integrated in a custom pipeline for generating, annotating and analyzing sequence variants.

We named our internal pipeline tool as MIP (Mutation Identification pipeline). Some minimal requirements we stablished for MIP in order to use it with maximum performance and productivity.

1. It must be automatic; with a limited team like ours (2 or 3 bioinformaticians) we need a efficient service that is capable to execute the complete analysis without typing commands at terminals calling software or converting files among several data formats.

MIP pipeline overview for clinical sequencing. All those steps requires tools and files in a specific format.
Our engine would be capable of manage and execute all or some of those steps with
specific parameters defined by the specialist .

2. It must be user-oriented; it mean that MIP platform must provide an easy-to-use interface, that any researcher of lab could use the system and start out-of-box their sequencing analysis. For biologists and geneticists it would allow them to focus their work on what matters: the downstream experiments.

3. Scalable-out architecture; More and more hight throughput sequencing data is pulled out from NGS instruments, so MIP must be designed to be a building block for a scalable genomics infrastructure. It means that we must work with distributed and parallel approaches and the best practices from high-performance computing and big data to efficient take advantage of all resources available at our infra-structure while thinking on continuous optimization in order to minimize the network and shared disk I/O footprint.

My draft proposal to our exome sequencing pipeline

4. Rich-detailed reports and smart software and dataset updates; In order to maintain our execution engine working healthy, it requires that our software stack always being updated. Since our engine is written on top of numerous open-source biological and big data packages, we need a self-contained management system that could not only check for any new versions but also with a few clicks start any update and perform a post-check for any possible corruptions at the pipeline. In addition to the third-party genomics software used on MIP, we are also developing our tool for variant annotation. So it stands for an engine that could query and analyze several genomic dataset, generate real-time interactive reports where the researchers could filter out variants based on specific criteria and output in formats of QC reports, target and sequencing depth information, descriptions of the annotations and variants hyperlinked to public datasets in order to get further details about a variation.

Example of web interface where a researcher could select any single or combination of annotations to display. Links to the original datasources are readily available (Figure from WEP annotation system)

5. Finally, we think the most important requirement nowadays to MIP is the integration with our current LMS (Laboratory management system), in order to put the filtered variants as input to our existing laboratory report analysis and publishing workflow. It means more productivity and automation with our existing infrastructure.

MIP could be also be acessible via RESful API, where the runs output

would be interchanged with our external LMS solution.

As you may see, there's a huge effort on coding, design and infrastructure to meet those requirements. But we are thrilled to make this happen! One of our current works in this project is the genv tool. Genv is what we call our genomika environment builder. The basic idea behind it is a tool written in python and fabric package, that provides instant access to biological software, programming libraries and data. The expected result is a fully automated infrastructure that installs all software and data required to start MIP pipeline. We are thinking of also arranging pre-built images with Docker. Of course I will need a whole post to explain more about it!

To sum up, I hope I could summarise one of the projects I've been working this first semester. At Genomika Diagnósticos we are facing big scientific challenges and the best part is that those tools are helping our lab to provide a next level of health information to the patients, all from our DNA!

If you are interested on working with us, keep checking our github homepage with any open positions at our bioinformatics team.

Friday, March 21, 2014

Hi all,

It has been a while since my last post at the blog. No, I didn't abandon the blog! It has happened many events at my life since november that I decided to pause a bit my posts marathon and started to organize my work life! The best news this year is that I am now facing new challenges on machine learning, data mining, big data and now on: bioinformatics!! That's right. I am now CTO of Genomika Diagnósticos, a brazilian genetics laboratory at Recife, Pernambuco. The laboratory combines the state-of-art genetic testing with comprehensive interpretation of test results by specialists, geneticists to provide clinically relevant molecular tests for a variety of genetic disorders and risk factors.

My work there now is work with NGS (next-gen sequencing) tools to support the exome and genome sequencing to analyse genes and exons in panels to detect any significant genetic variations, which are candidates to cause the patient's phenotype. There are a lot of work to do, so in the next weeks I will post some tutorials about bioinformatics, machine learning, parallelism and big data applied on genoma sequencing.

This field is a novel study field and there are many applications related to disease detection, prevention, and treatment. Could you imagine that sequencing DNA would cost more than $10,000 dollars in 2001 and it has been decreasing exponentially the cost of the procedure.

My next posts will talk about how DNA sequencing works and how machine learning and data ming can be applied in this exciting and promising field!

Search in this blog

Join the Brazilian Python Conference PythonBrasil 2013

Marcel Caraciolo

I am a brazilian data scientist, entrepreneur, python hacker and technology consultant. Nowadays I work with data-centric applications, specially in machine learning, recommender systems and bioinformatics. I am also interested in distributed computing, high performance and data visualization, educational and bioinformatics ventures.

Until 2013 I was the co-founder of two companies Atepassar.com, a social network for students in Brazil and co-founder of PyCursos, a on-line startup for python training and on-line courses. In 2014, I assumed a new position at Genomika Diagnósticos, a brazilian genetics tests laboratory, as CTO.