15 May 2018

I am investigating bioinformatics1 tools to analyze Whole Genome Sequence (WGS) data. I have access to a WGS for someone who has also tested at several genealogy testing companies. I want to do some comparisons between the raw data from the genealogy testing companies and the WGS, checking for accuracy of the reads. To satisfy my curiosity, I plan to investigate some of the medical implications and traits discussed in scientific papers.
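A comparison like this could be sketched in Python roughly as follows. This is only a minimal illustration under stated assumptions, not a finished tool: it assumes the testing-company raw data is a tab-separated text file with rsid, chromosome, position, and genotype columns, that the WGS calls have already been loaded into a dictionary keyed by chromosome and position, and that both files use the same genome build and strand. The function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Site = Tuple[str, int]  # (chromosome, position)


def parse_raw_data(lines: Iterable[str]) -> Dict[Site, str]:
    """Read tab-separated raw data lines: rsid, chromosome, position, genotype.

    The column layout is an assumption; comment lines starting with '#'
    and any header row without a numeric position are skipped.
    """
    calls: Dict[Site, str] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 4 or not fields[2].isdigit():
            continue
        chrom, pos, genotype = fields[1], int(fields[2]), fields[3]
        calls[(chrom, pos)] = genotype
    return calls


def normalize(genotype: str) -> str:
    """Treat a genotype as an unordered pair of alleles ('GA' equals 'AG')."""
    return "".join(sorted(genotype.upper()))


def compare_calls(chip: Dict[Site, str], wgs: Dict[Site, str]) -> dict:
    """Count matching and mismatching genotypes at sites present in both sets.

    Sites missing from the WGS calls, or with a no-call ('-') in either
    source, are counted as skipped rather than as mismatches.
    """
    matches = mismatches = skipped = 0
    discordant: List[Site] = []
    for site, chip_gt in chip.items():
        wgs_gt = wgs.get(site)
        if wgs_gt is None or "-" in chip_gt or "-" in wgs_gt:
            skipped += 1
            continue
        if normalize(chip_gt) == normalize(wgs_gt):
            matches += 1
        else:
            mismatches += 1
            discordant.append(site)
    compared = matches + mismatches
    return {
        "matches": matches,
        "mismatches": mismatches,
        "skipped": skipped,
        "concordance": matches / compared if compared else None,
        "discordant_sites": discordant,
    }
```

The list of discordant sites is the interesting part: each one would need a closer look to decide whether the chip or the WGS read is the more likely error.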

Once I have multiple WGSs from relatives, I plan to check whether the segments that the testing companies report as matching really do match completely in the higher-resolution data. I am interested in how closely the statistical predictions on linkage disequilibrium and crossovers mirror what is seen in real multi-generational family studies. For example, in the shared segments marked below, not every SNP is tested. Only some of the SNPs in a segment are tested, and we assume the non-tested SNPs match based on statistical predictions.
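One way to test that assumption with WGS data from two relatives might look like the Python sketch below. It is a simplification under my own assumptions, not a description of any company's matching algorithm: it treats two people as compatible at a position when they share at least one allele (often called half-identical matching), and it assumes each person's genotypes for one chromosome are held in a dictionary keyed by position. The function and parameter names are hypothetical.

```python
from typing import Dict, List, Set


def shares_allele(gt_a: str, gt_b: str) -> bool:
    """True if the two genotypes have at least one allele in common."""
    return bool(set(gt_a) & set(gt_b))


def check_segment(
    person_a: Dict[int, str],
    person_b: Dict[int, str],
    tested_positions: Set[int],
    start: int,
    end: int,
) -> dict:
    """Compare two people across a reported segment (start..end inclusive).

    Positions on the testing company's chip (tested_positions) are tallied
    separately from positions seen only in the WGS data, so the tested
    result can be compared with what the untested SNPs actually show.
    """
    tally = {
        "tested": {"compatible": 0, "incompatible": 0},
        "untested": {"compatible": 0, "incompatible": 0},
    }
    untested_conflicts: List[int] = []
    # Only positions called in both people can be compared.
    for pos in sorted(set(person_a) & set(person_b)):
        if pos < start or pos > end:
            continue
        group = "tested" if pos in tested_positions else "untested"
        if shares_allele(person_a[pos], person_b[pos]):
            tally[group]["compatible"] += 1
        else:
            tally[group]["incompatible"] += 1
            if group == "untested":
                untested_conflicts.append(pos)
    tally["untested_conflicts"] = untested_conflicts
    return tally
```

An untested position where the two people are opposite homozygotes (for example AA and GG) is the case to watch for, since it could indicate a read error or a segment that is not continuous after all.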

By the way, just as with some of the best genealogy articles, the reference notes in the article quoted below (Russell et al.) led me to several additional sources I now need to consult.

As a woman, I find this sentence especially depressing: "... the proportion of female contributors decreases for high-profile repositories and with seniority level in author lists".2 I hope this changes and more women participate in bioinformatics.

I am impressed with how many databases and tools are out there for DNA analysis. I did not realize there are over 1,700 bioinformatics repositories and "23 'high profile' GitHub repositories containing source code for popular and highly respected bioinformatic tools."3 "Our analysis points to simple recommendations for selecting bioinformatic tools from among the thousands available."4 Some of these will not be useful for genealogy, but some will.

One tool aimed at the genetic genealogy community is Thomas Krahn's tool for annotating a BigY VCF file and identifying derived and novel SNPs.5 Thomas kindly shared this tool so others can do the analysis instead of having it done by his company YSEQ.net.

Some of the discussions in the scientific world parallel those we are having in the genealogy world.

"In recent years, the explosion of genomic data and bioinformatic tools has been accompanied by a growing conversation around reproducibility of results and usability of software. Reproducibility requires that authors publish original data and a clear protocol to allow repetition of the analysis in a paper."6 In the genealogy world we are discussing publicly available DNA data, such as on GEDmatch.com, allowing DNA analysis to be reproduced and referenced from a publication.

"The bioinformatics field embraces a culture of sharing — for both data and source code — that supports rapid scientific and technical progress."7 In the genealogy world we are discussing privacy issues versus sharing data, especially with the recent proliferation of stories on law enforcement use of genealogy databases.

I have been musing on whether to learn Python or Ruby. A recent discussion with a young programmer had me leaning towards Python. Since the "greatest amount of code in the main dataset was in Javascript, followed by Java, Python, C++, and C"8 maybe I will stay with JavaScript and Java, which I already know, if I develop any new tools for web usage. I have a few tools I wrote in Perl for my own use that I hope to clean up and share eventually.

In addition to adding to my knowledge of my family tree, DNA is forcing me to upgrade my data analysis knowledge and my familiarity with computer tools. I hope all of this study helps keep my mind active and reduces those "senior moments" that seem to occur more frequently with the years.

1. The science of collecting and analyzing complex biological data such as genetic code.
2. Pamela H Russell, et al., "A large-scale analysis of bioinformatics code on GitHub," 15 May 2018, bioRxiv pre-publication, https://doi.org/10.1101/321919, line 35.
3. Ibid., line 27.
4. Ibid., line 148.
5. Thomas Krahn, "bigY_hg38_pipeline.sh," GitHub Gist (https://gist.github.com/tkrahn/283462028c61cd213399ba7f6b773893).
6. Russell, "A large-scale analysis of bioinformatics code on GitHub," line 84.
7. Ibid., line 120.
8. Ibid., line 208.
All statements made in this blog are the opinion of the post author. This blog is not sponsored by any entity other than Debbie Parker Wayne nor is it supported through free or reduced price access to items discussed unless so indicated in the blog post. Hot links to other sites are provided as a courtesy to the reader and are not an endorsement of the other entities except as clearly stated in the narrative.
To cite this blog post:
Debbie Parker Wayne, "Whole Genome Sequence (Part 2) - Analysis Tools," Deb's Delvings, 15 May 2018 (http://debsdelvings.blogspot.com/ : accessed [date]).

About Me

Debbie Parker Wayne, Certified Genealogist®

East Texas, USA

I am the owner of Wayne Research, a genealogical research service. Laws affecting family history and genetic genealogy (DNA) are areas of special interest to me. Many of my posts will be in those areas as well as topics of general interest to genealogists. Contact me through my Web site at debbiewayne.com.

Credentials

The words Certified Genealogist and letters CG are registered certification marks, and the designations CGL and Certified Genealogical Lecturer are service marks of the Board for Certification of Genealogists®, used under license by board certificants after periodic evaluation.