JRuby version 1.6.3 or later - An appropriate Java heap size may have to be specified when invoking JRuby, especially when you use Bio::Ucsc::File::Twobit. Try "jruby -J-Xmx3g your_script.rb" to allow a heap of up to 3 GB.

Ruby version 1.8.7 and earlier are no longer supported by UCSC API v0.6.0 and later because Ruby on Rails and ActiveRecord version 4.0 do not support these old Rubies.

Change Log

BUG-FIX (v0.6.2): Fixed NoMemoryError in Bio::Ucsc::File::Twobit.open. File.open is now used internally instead of Kernel.open.

UPDATE (v0.6.1): Support the new human hg38 reference. Tables added after this release will be supported in later versions.

NEW (v0.6.0): Support ActiveRecord v4.0 (a part of Ruby on Rails v4.0). Other related library dependencies are also updated. To suppress deprecation warnings for using dynamic finders such as 'find_by_name_and_chrom', use ActiveSupport::Deprecation.silenced = true. Because ActiveRecord v4.0 does not support Ruby v1.8.7 and earlier, Ruby UCSC API no longer supports these older Rubies.

UPDATE (v0.6.0): Ruby UCSC API is now licensed under the MIT License. See also the LICENSE.txt file.

See 'ChangeLog.md' for older changes.

How to Use

Basics

A database of a genome assembly is represented as a module in the Bio::Ucsc module. For example, the human hg19 database is referred to as Bio::Ucsc::Hg19.

Before using a database, establish a connection to the database. For example, Bio::Ucsc::Hg19.connect.

A table in a database is represented as a class in the database module. For example, the snp138 table in the hg19 database is referred to as Bio::Ucsc::Hg19::Snp138.

Queries to a field (column) in a table are represented by class methods of the table class. For example, finding the first record (row) of the snp138 table in the hg19 database is Bio::Ucsc::Hg19::Snp138.first.

Queries using genomic intervals are supported by the named scopes ".with_intervals" and ".with_intervals_excl" of the table class; ".with_intervals_excl" omits annotations that are only partially included in the interval. These methods accept a genomic interval string such as chr1:1233-5678. If the queried table has a "bin" column, the bin index system is automatically used to speed up the query.
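The following is a minimal, self-contained sketch of the two ideas in the paragraph above: parsing an interval string and listing the bins that can contain overlapping annotations. It does not use the bio-ucsc gem. The names Interval, parse_interval, and overlapping_bins are hypothetical, and the sketch assumes the usual UCSC conventions (1-based inclusive coordinates in interval strings, 0-based half-open coordinates in tables, and the standard binning scheme with a 128 kb smallest bin). The gem's internal implementation may differ.

```ruby
# Hypothetical helpers for illustration; not part of the bio-ucsc API.
Interval = Struct.new(:chrom, :zero_start, :zero_end)

# Offsets and shifts of the standard UCSC binning scheme
# (smallest bins span 128 kb; each level up is 8 times larger).
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0].freeze
BIN_FIRST_SHIFT = 17
BIN_NEXT_SHIFT = 3

# Parse "chr1:1233-5678" (1-based, inclusive) into a 0-based,
# half-open interval as stored in UCSC tables.
def parse_interval(text)
  match = /\A([^:\s]+):(\d+)-(\d+)\z/.match(text.delete(','))
  raise ArgumentError, "malformed interval: #{text.inspect}" unless match

  first = match[2].to_i
  last = match[3].to_i
  raise ArgumentError, "invalid range: #{text.inspect}" if first < 1 || last < first

  Interval.new(match[1], first - 1, last)
end

# List every bin, at every level, that may hold a feature
# overlapping the 0-based half-open range [zero_start, zero_end).
def overlapping_bins(zero_start, zero_end)
  first = zero_start >> BIN_FIRST_SHIFT
  last = (zero_end - 1) >> BIN_FIRST_SHIFT
  bins = []
  BIN_OFFSETS.each do |offset|
    bins.concat(((offset + first)..(offset + last)).to_a)
    first >>= BIN_NEXT_SHIFT
    last >>= BIN_NEXT_SHIFT
  end
  bins
end

interval = parse_interval('chr1:1233-5678')
bins = overlapping_bins(interval.zero_start, interval.zero_end)
# bins => [585, 73, 9, 1, 0]
```

A query can then restrict the "bin" column to this short list before comparing start and end coordinates, which is what makes the bin index faster than scanning the whole table.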

Fields in a retrieved record can be accessed by using instance methods of a record object. For example, the name field of a table record stored in the "result" variable is result.name.

Sample Code

First, load the API and establish a connection to a database.

require 'bio-ucsc'
DB = Bio::Ucsc::Hg19
DB.connect

# Suppress deprecation warnings for using dynamic finders such as 'find_by_name_and_chrom'.
# This style is deprecated in ActiveRecord 4.0.
ActiveSupport::Deprecation.silenced = true

Notes of Exceptions in Table Support

Table names starting with a number: Because Ruby class names cannot start with a number, use the table class name prefixed with "T" (T for Table). Thus, the "2micron_est" table is supported by the T2micron_est class.
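As an illustration of the naming rule, here is a small sketch that maps a table name to the class name used in this document. The helper table_class_name is hypothetical and not part of the API. The rule of capitalizing the first letter is inferred from the examples in this README (snp138 becomes Snp138), so treat it as an assumption.

```ruby
# Hypothetical helper for illustration; not part of the bio-ucsc API.
# Capitalizes the first character and prefixes "T" when the table
# name starts with a digit, since Ruby constants must start with
# an uppercase letter.
def table_class_name(table)
  raise ArgumentError, 'empty table name' if table.empty?

  name = table[0].upcase + table[1..-1]
  name =~ /\A\d/ ? "T#{name}" : name
end

table_class_name('snp138')      # => "Snp138"
table_class_name('2micron_est') # => "T2micron_est"
```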

Accessing chromosome-specific tables: For example, the 'rmsk' table in hg18 is actually split into 'chr1_rmsk', 'chr2_rmsk', and so on. There are two ways to access them.

(1) Access the split tables directly. They behave like any other regular table. However, you have to manage each split table yourself.

(2) Use abstract table classes (e.g., 'Rmsk') and their class methods .find_by_interval or .find_all_by_interval. These methods look up the corresponding split table automatically. However, you cannot combine them with other find_by_[field] methods. Moreover, if you need to search a whole chromosome or several chromosomes, you have to access the split tables individually and merge the results yourself.

Fortunately, recent databases, including hg19, seem not to use chromosome-specific tables.
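To make the naming of split tables concrete, the sketch below derives the chromosome-specific table name from an interval string and groups several intervals by the table they would hit. Both helpers (split_table_name and group_by_split_table) are hypothetical and only illustrate the naming pattern described above; they are not how the API resolves tables internally.

```ruby
# Hypothetical helpers for illustration; not part of the bio-ucsc API.

# "rmsk" plus "chr1:1-1000" gives "chr1_rmsk".
def split_table_name(table, interval)
  chrom = interval.split(':', 2).first
  "#{chrom}_#{table}"
end

# Group intervals by the split table that holds their chromosome,
# so each table can be queried once and the results merged by hand.
def group_by_split_table(table, intervals)
  intervals.group_by { |interval| split_table_name(table, interval) }
end

groups = group_by_split_table('rmsk', ['chr1:1-1000', 'chr2:500-900', 'chr1:5000-6000'])
# groups.keys => ["chr1_rmsk", "chr2_rmsk"]
```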

For the honey bee ApiMel2 database, the Group_chainDm2 and Group_chainDm2Link tables are accessible using the find(_all)_by_interval class methods of the ChainDm2 and ChainDm2Link classes.

Special field (column) names: Field names such as attribute, valid, validate, class, method, methods, and type cannot be accessed using instance methods because they collide with method names that ActiveRecord uses internally. Instead, use hash-style access, such as result[:type].

Details in "with_intervals"

When a table class is referenced for the first time, the API prefetches the table to get a list of fields and dynamically defines a class using the following algorithm.

If chrom/chromStart/chromEnd fields exist (BED table), the API uses them for interval queries.

If tName/tStart/tEnd fields exist (PSL table), the API uses them for interval queries.

If chrom/txStart/txEnd fields exist (genePred table), the API uses them for interval queries.

If genoName/genoStart/genoEnd fields exist (RMSK table), the API uses them for interval queries.

If the table has the "bin" column, the API calculates the bin index to build the query.

Otherwise, the API does not support interval queries and supports only ActiveRecord's standard methods such as "find_(all_)by_[field name]".
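The selection rules above can be sketched as a small lookup over the list of column names. This is an illustration only: INTERVAL_COLUMNS and interval_strategy are hypothetical names, and the sketch assumes the rules are checked in the order listed. It does not reproduce the API's actual class-generation code.

```ruby
# Hypothetical sketch for illustration; not part of the bio-ucsc API.
# Each entry pairs a table type with the chrom/start/end columns
# that identify it, in the order the rules are listed above.
INTERVAL_COLUMNS = [
  [:bed,       %w[chrom chromStart chromEnd]],
  [:psl,       %w[tName tStart tEnd]],
  [:gene_pred, %w[chrom txStart txEnd]],
  [:rmsk,      %w[genoName genoStart genoEnd]]
].freeze

# Return the columns to use for interval queries, or nil when the
# table supports only the standard ActiveRecord finders.
def interval_strategy(columns)
  kind, fields = INTERVAL_COLUMNS.find { |_, needed| (needed - columns).empty? }
  return nil unless kind

  {
    kind: kind,
    chrom: fields[0],
    start: fields[1],
    end: fields[2],
    use_bin: columns.include?('bin')
  }
end

interval_strategy(%w[bin chrom chromStart chromEnd name])
# => {:kind=>:bed, :chrom=>"chrom", :start=>"chromStart", :end=>"chromEnd", :use_bin=>true}
```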

Table Associations

Ruby UCSC API supports two ways to define table associations (relations): manual and automatic. Manual definition lets you define only the minimal set of associations you need. Automatic definition is easier to use, but it may define a huge number of associations, so you may have to restrict the set of databases before running it.

Manual definition of table associations

See samples/snp2gene.rb, which defines associations using the has_one/has_many methods. class_eval is used so that the definitions are added to the existing table class rather than replacing it.

Next, you can overwrite all.joiner variables with the Joiner#variables method. For example, the "gbd" variable means "all databases". Overwriting this variable restricts the databases used in table associations and makes automatic definition faster. Unconnected databases and undefined tables are ignored during definition.

Then, you can access an associated table using a method. Note that automatic definition always uses "has_many". Thus, results are always returned as an array.