Multiple Alignment Format

The Multiple Alignment Format, described by
UCSC, stores a
series of multiple alignments in a single file. Suitable for
whole-genome to whole-genome alignments, metadata such as source
chromosome, start position, size, and strand can be stored.

Getting the AlignIO code from GitHub

If you can’t wait for Biopython 1.69 to be released, get the latest
Biopython from GitHub:

First, clone the repository with git from the command line, like so:

git clone git@github.com:biopython/biopython.git

This will give you the default master branch. Then install from source.

Reading in a MAF file

Parsing a MAF file is similar to any other alignment file in AlignIO.
Additional data, however, is stored as a dict in the .annotations
property of SeqRecords belonging to returned MultipleSeqAlignment
objects.

Annotations available in SeqRecords

Key

Type

Value

start

integer

The start position in the source sequence of this alignment

size

integer

The ungapped length of this sequence

strand

enum(“+”, “-“)

The strand this sequence originates from on the source sequence/chromosome

srcSize

integer

The total length of the source sequence/chromosome

Example

fromBioimportAlignIOformultiple_alignmentinAlignIO.parse("chr10.maf","maf"):print("printing a new multiple alignment")forseqrecinmultiple_alignment:print("starts at %s on the %s strand of a sequence %s in length, and runs for %s bp"% \
(seqrec.annotations["start"],seqrec.annotations["strand"],seqrec.annotations["srcSize"],seqrec.annotations["size"]))

MafIndex

Biopython may soon provide an interface for fast access to the multiple
alignment of several sequences across an arbitrary interval: for
example, chr10:25,079,604-25,243,324 in mm9. As MAF files are available
for entire chromosomes, they can be indexed by chromosome position and
accessed at random. This functionality would be available in the class
Bio.AlignIO.MafIO.MafIndex.

Creating or loading a MAF index

Indexes are created by determining the chromosome start and end position
for a specific sequence name (generally a species), which must appear in
every alignment block in the file. An index can be generated for only
one species at a time. In whole-genome alignments generated by Multiz,
the chromosome of one species is generally used as the reference to
which other species are aligned. This reference species will appear in
every block, and should be used as the target_seqname parameter. For
UCSC multiz files, the form of species.chromosome is used.

To index a MAF file, or load an existing index, create a new
MafIO.MafIndex object. If the index database file sqlite_file
does not exist, it will be created, otherwise it will be loaded.

# index mouse chr10 from UCSC and store it in a file for later usefromBio.AlignIOimportMafIO# idx = MafIO.MafIndex(sqlite_file, maf_file, target_seqname)idx=MafIO.MafIndex("chr10.mafindex","chr10.maf","mm9.chr10")

Retrieving alignments overlapping a given interval

The MafIO.MafIndex.search() generator function accepts a list of
start and end positions, and yields MultipleSeqAlignment objects that
overlap the given intervals. This is particularly useful for obtaining
alignments over the multiple exons of a single transcript, eliminating
the need to retrieve an entire locus.

# count the number of bases in danRer5 (Zebrafish) that align to the# Pcmt1 locus in mousefromBio.AlignIO.MafIOimportMafIndexidx=MafIndex("chr10.mafindex","chr10.maf","mm9.chr10")results=idx.search([7350034],[7383048])total_bases=0formultiple_alignmentinresults:forseqrecinmultiple_alignment:ifseqrec.id.startswith("danRer5"):# don't count gaps as basestotal_bases+=len(str(seqrec.seq).replace("-",""))print("a total of %s bases align"%total_bases)

Retrieving a pre-spliced alignment over a given set of exons

The MafIO.MafIndex.get_spliced() function accepts a list of start
and end positions representing exons, and returns a single
MultipleSeqAlignment object of the in silico spliced transcript from
the reference and all aligned sequences. If part of the sequence range
is not found in a particular species in the alignment, dashes (“-“) are
used to fill the gaps, or “N”s if the sequence is not present in the
reference (target_seqname) sequence. If strand is opposite that in
the reference sequence, all sequences in the returned alignment will be
reverse complemented.

# convert the alignment for mouse Foxo3 (NM_019740) from MAF to FASTAfromBioimportAlignIOidx=AlignIO.MafIO.MafIndex("chr10.mafindex","chr10.maf","mm9.chr10")multiple_alignment=idx.get_spliced([41905591,41916271,41994621,41996331],[41906101,41917707,41995347,41996548],strand="+")AlignIO.write(multiple_alignment,"mm9_foxo3.fa","fasta")

# find every gene on chr10 in the current UCSC refGene database,# retrieve its spliced multiple alignment, and write it to# a FASTA file in the current directory## depends: MySQLdbimportMySQLdbfromBioimportAlignIO# connect to UCSC's live MySQL databasemysql_conn=MySQLdb.connect(host="genome-mysql.cse.ucsc.edu",user="genome",passwd="",db="mm9")db_conn=mysql_conn.cursor(MySQLdb.cursors.DictCursor)# load MAF indexidx=AlignIO.MafIO.MafIndex("chr10.mafindex","chr10.maf","mm9.chr10")# fetch all records on chr10db_conn.execute("SELECT * FROM refGene WHERE chrom = 'chr10'")forrecordindb_conn.fetchall():multiple_alignment=idx.get_spliced(map(int,record["exonStarts"].split(",")[:-1]),map(int,record["exonEnds"].split(",")[:-1]),strand=record["strand"])print("writing %s.fa"%record["name"])AlignIO.write(multiple_alignment,"%s.fa"%record["name"],"fasta")