Wow, the more I dig into SAM, the more it looks like a horse put together by a committee. These bit flags are ridiculous for a human-readable text alignment format.

It was put together by a committee. The flag field, in my opinion, is left over from the internal C representation, which was created for compactness (then compressed into SAM). You will encounter the internal format if you program using the SAMTools C API, but the internal (compact) data structure should not have been exposed to the user when viewing SAM (not BAM) files. Just wait until you try to get anything clarified, changed, or included in the specification.

I particularly like the TCGA BAM specification for the BAM headers. It uses most of the header types and their associated tags to capture things like sequencing platform, reference contig details (MD5s, URLs, ...), samples, institution, library, etc.

If SAM were designed by one person, it would not be widely accepted. There are pros and cons. As to the bit flag, have you tried "samtools view -X"?

It is widely accepted because good tools (i.e. lh3's implementation in SAMTools and now Picard) were available early and were used by the first major sequencing efforts (e.g. 1000 Genomes). I in no way mean to lambaste the creators of SAM, only to be honest about its development and implementation.

Yes, I did try Picard's SAMFileReader, but it has a major memory management problem when trying to parse RNA-Seq data that contains alignments to chromosomes and splice junctions. These alignment files contain headers with 2.3 million splice junction 'chromosomes' that cause the SAMFileReader to quickly throw an out-of-memory error. The SAM format requires one '@SQ' line for each splice junction. Ouch.

Thus, I had to create a lightweight file parser that would skip such header info. I also wanted to get a feel for whether SAM could be used as a universal short read alignment format. I'm part of two projects (GenoViz and a $100M NHLBI cardiac project) that are looking to see if SAM could serve that function.
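A rough sketch of what such a header-skipping reader could look like in Python. This is not the parser mentioned above, just an illustration; the name iter_alignments and the choice of returned fields are made up, and it assumes tab-separated SAM text with at least the 11 mandatory columns.

```python
import io


def iter_alignments(handle):
    """Yield (qname, flag, rname, pos, mapq, cigar) for each alignment line.

    Hypothetical helper: header lines are skipped one at a time, so millions
    of @SQ lines never accumulate in memory.
    """
    for line in handle:
        # Header lines (@HD, @SQ, @RG, ...) all start with '@'.
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record, ignore it
        yield (fields[0], int(fields[1]), fields[2],
               int(fields[3]), int(fields[4]), fields[5])


# Tiny in-memory example standing in for a real SAM file.
sample = io.StringIO(
    "@HD\tVN:1.0\n"
    "@SQ\tSN:junction_1\tLN:100\n"
    "read1\t16\tchr1\t100\t255\t36M\t*\t0\t0\tACGT\tIIII\n"
)
for record in iter_alignments(sample):
    print(record)
```

Because it is a generator, only one line is held at a time, at the cost of not knowing the reference lengths that the @SQ lines would have provided.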

I should say that now that I understand bitwise flags, they are a pretty clever trick for compressing a bunch of boolean flags in a binary file. For SAM spec 2, though, they should be removed from the text format.

What would a flag of 0 imply in SAM output? I used Novoalign and the only flags I see are 0, 4, and 16. 4 is an unmapped read and 16 is for the strand, but how do I interpret 0?

Also, how can I identify the reads that are *not* uniquely mapped? I read that the 5th column (MAPQ) should help to determine multiply-mapped reads. Is MAPQ=0 an indication that the read is multiply mapped?

Thanks

I have a similar question. All I see are 0, 4, 16, and 20. I do not understand how to interpret 20. I know that in hexadecimal it would be 0x14, which should mean it is a combination of both strand (0x10) and not mapped (0x4). Please correct me if I'm wrong. I see a MAPQ score and also a reference hit, so does it still mean the read mapped? Thanks
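For the flag questions above, a small Python sketch that splits a FLAG value into its bits may help. The bit meanings are taken from the SAM spec; the names FLAG_BITS and decode_flag and the label wording are just made up for this example.

```python
# (bit, meaning) pairs as defined for the SAM FLAG field.
FLAG_BITS = [
    (0x1, "read is paired"),
    (0x2, "mapped in a proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first read in pair"),
    (0x80, "second read in pair"),
    (0x100, "not primary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "PCR or optical duplicate"),
]


def decode_flag(flag):
    """Return the meanings of all bits that are set in a FLAG value."""
    return [meaning for bit, meaning in FLAG_BITS if flag & bit]


for flag in (0, 4, 16, 20):
    print(flag, hex(flag), decode_flag(flag))
```

So 0 simply means that none of the bits are set: an unpaired read, mapped, on the forward strand. 20 is 16 + 4, i.e. the unmapped bit together with the reverse-strand bit. As far as I can tell from the current spec, the 0x4 bit is what decides whether a read counts as unmapped, regardless of what the reference and MAPQ columns contain.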

I'd go further and say the flag (and other things) will need redoing to cope with more than just paired reads - it will need to cope with N-tuples of reads each separated by an insert of some estimated size (e.g. Strobe Reads from Pacific Biosciences, or what Helicos calls dark fill).

Hi guys,
I find the SAM FLAG encoding a very clever way of storing the alignment information. But I am confused by the negative sign of the insert field in the following paired-end example:
The manual says a negative sign in the insert field means the mapping position is smaller than the current one. But what I see is the reverse.
Also, in the following pair, the mapping position fields are equal for both reads (2005683). But the reads do not actually map to the same position; they only overlap.

Please use "samtools view -X" to see a human readable FLAG. I agree that not specifying a better FLAG field initially is a shortcoming, but it is too late to change the spec at the moment. samtools view -X comes as a temporary hack which I find useful.

Could you suggest a better format for the aux fields, or a way to make SAM simpler? Note that SAM should be both human readable and machine readable. The current form is the best we have come up with so far. GenBank/EMBL files are human readable, but they cause a lot of trouble in parsing, and we do not want to go that way again. I think the best solution to human readability is not to change the spec, but to write a script that prints a SAM alignment over multiple lines in a readable way. If you want to contribute such a script, that would be great. Thanks.

Hello everyone, the information above helped me better understand the flag in the SAM format. But I still have problems fully understanding some of the flag bits, like:
0x0002 the read is mapped in a proper pair
0x0004 the query sequence itself is unmapped
0x0008 the mate is unmapped
I don't know what "a proper pair" means, or the difference between "pair" and "mate". Could anyone help explain them?
Another question: I used TopHat on my paired-end Illumina reads, but the SAM file produced by TopHat looks like this:
FC30W3GAAXX:7:53:723:1789#0 73 1 487961 3 62M1849N13M * 0 0 ATCAGCTTCATTCCCTCAACAGTGTTCTTCTTCAACGGGCAGCACATGAAGGTCGACTATGGATCTCCAGATCAC 84AB:B@:=A=-9BB?BB>@7>A@ABBBBB=;@B:BABBB?B>@AC@@AAB=CCA6@>?ABB>9@@ACCCA@C@B NM:i:7 XS:A:+ NH:i:2
FC30W3GAAXX:7:89:981:2025#0 137 1 487982 3 41M1849N34M * 0 0 GTGTTCTTCTTCAACGGGCAGCACATGAAGGTCGACTATGGATCTCCAGATCACACCAAGTTTGTGGGAAGCTTC 8:;?6886<=:><6>8<=?>A:7=;8A?@@:BAA=A@@A@A@AAB@?@@@@B;B@AABA@BBABCCCBB@CBA;< NM:i:4 XS:A:+ NH:i:2
I want to know whether columns 7-9 (* 0 0) indicate that my data were not treated as paired-end.
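One way to check this is to look at the pairing bits in the two FLAG values above (73 and 137). A minimal Python sketch, using the bit definitions from the SAM spec; the function name describe_pairing and the dictionary keys are made up for illustration.

```python
def describe_pairing(flag):
    """Report only the pairing-related bits of a SAM FLAG value."""
    return {
        "paired": bool(flag & 0x1),          # read came from a paired run
        "proper_pair": bool(flag & 0x2),     # aligner judged both mates consistent
        "mate_unmapped": bool(flag & 0x8),   # the other read of the pair has no hit
        "first_in_pair": bool(flag & 0x40),
        "second_in_pair": bool(flag & 0x80),
    }


for flag in (73, 137):
    print(flag, describe_pairing(flag))
```

Both values have the 0x1 bit set, so the reads do appear to have been treated as paired-end: 73 = 64 + 8 + 1 (first in pair, mate unmapped) and 137 = 128 + 8 + 1 (second in pair, mate unmapped). With the mate unmapped there is no mate position to report, which would be consistent with "* 0 0" in columns 7-9. On terminology: the "mate" is the other read of the same pair, and "proper pair" is the aligner's own judgement that both mates mapped with a plausible orientation and distance, so the exact criteria depend on the aligner.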