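Structure and Meaning of the SAM Format: NGS Data Formats 02, SAM/BAM Explained in Detail
These are my study notes on the SAM specification and the SAM optional-tags specification (SAMtags). They describe in detail the SAM/BAM file formats used in high-throughput sequencing.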
What this article covers
1. Introduction to the SAM/BAM formats
2. Terminology and concepts
3. The header section in detail
4. The alignment section in detail
Column 1: QNAME
Column 2: FLAG
Column 3: RNAME
Column 4: POS
Column 5: MAPQ
Column 6: CIGAR
Column 7: RNEXT
Column 8: PNEXT
Column 9: TLEN
Column 10: SEQ
Column 11: QUAL
Column 12 onward: Optional fields
1.1 Additional Template and Mapping data (alignment-related information)
1.2 Metadata (related to the SAM header section; describes how the read was sequenced)
1.3 Barcodes (UMI / single-cell cell barcode)
1.4 Original data
1.5 Annotation and Padding
1.6 Technology-specific data
2 Locally-defined tags
1. Introduction to the SAM/BAM formats
The SAM format was designed so that data from different sequencing platforms, processed by different aligners, can be stored in one common format.
SAM (Sequence Alignment/Map format) files store the results of aligning sequencing reads to a reference genome. The fields on each line are TAB-delimited, and a file consists of a header section followed by an alignment section.
BAM (Binary Alignment/Map format) is the binary form of SAM, compressed with the BGZF library.
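2. Terminology and concepts
This section makes the rest of the SAM format easier to follow; the concepts below recur throughout.
Template: a DNA/RNA sequence, part of which is sequenced on a sequencing machine or assembled from raw sequences. (In other words, the sequence read on the sequencer, or a longer sequence assembled from raw sequences, is part of the template. Judging from later sections, for Illumina paired-end sequencing the template is the insert fragment.)
Segment: a contiguous sequence or subsequence. (From context, a segment can be a complete read or a part of a read.)
Read: a raw sequence that comes off a sequencing machine. A read may consist of multiple segments (during alignment a read may be split into several pieces that map to different positions on the reference; each piece is a segment). For sequencing data, reads are indexed by the order in which they are sequenced.
Linear alignment: an alignment of a read to a single reference sequence that may include insertions, deletions, skips and clipping, but no change of direction (for example, one part of the read aligned to the forward strand and another part to the reverse strand). A linear alignment can be represented in a single SAM record. (That is, one SAM record holds exactly one linear alignment.)
Chimeric alignment: an alignment of a read that cannot be represented as a linear alignment. A chimeric alignment is represented as a set of linear alignments that do not have large overlaps (each piece is itself a linear alignment; the remark about large overlaps distinguishes this case from multiple mapping). Typically one of the linear alignments is considered the "representative" alignment, and the others are called "supplementary" and are marked with the supplementary alignment flag (representative and supplementary form a pair that belongs to chimeric alignments). All SAM records of a chimeric alignment have the same QNAME and the same values for the 0x40 and 0x80 flag bits (see Section 1.4 of the SAM spec). This can look surprising at first, but these bits say which read of the template a record belongs to (read1 or read2), not which piece of the chimeric alignment it is, so every record derived from the same read carries the same values. Which linear alignment is treated as representative is an arbitrary choice. (The pieces of a chimeric alignment are therefore fairly independent of each other: they may not even lie on the same strand. A read whose parts align to different chromosomes is also necessarily chimeric, because a linear alignment is defined against a single reference sequence.)
Read alignment: a linear alignment or a chimeric alignment that is the complete representation of the alignment of a read.
Multiple mapping: because of repeats and similar features, the correct position of a read on the reference genome may be ambiguous. In that case a read may have several alignments. One of them is considered primary, and all the others have the secondary alignment flag set in their SAM records. All these records have the same QNAME and the same values for the 0x40 and 0x80 flag bits. The alignment designated primary is usually the best one; if several are equally good, one is chosen arbitrarily (primary and secondary form a pair that belongs to multiple mapping). (Note from the spec: chimeric alignment is primarily caused by structural variations, gene fusions, misassemblies, RNA-seq or experimental protocols, and is more frequent with longer reads. Long reads make chimeric alignments easier to detect, which is why third-generation sequencing is a stronger tool for finding chromosomal structural variation. The linear alignments that make up a chimeric alignment are largely non-overlapping; each may have high mapping quality and is informative for SNP/INDEL calling. Multiple mappings, in contrast, are caused primarily by repeats and are less frequent with longer reads. If a read has multiple mappings, they overlap each other almost entirely; except for the single best mapping, all the others get mapping quality below Q3 and are ignored by most SNP/INDEL callers.)
1-based coordinate system: a coordinate system in which the first base of a sequence is 1. A region is specified by a closed interval; for example, the region between the 3rd and the 7th bases inclusive is [3,7]. The SAM, VCF, GFF and Wiggle formats use the 1-based coordinate system.
0-based coordinate system: a coordinate system in which the first base of a sequence is 0. A region is specified by a half-closed, half-open interval; for example, the region between the 3rd and the 7th bases inclusive is [2,7). (Not [3,8): counting from 0, the 3rd base has index 2 and the 7th base has index 6, so the index range [2,7) covers bases 3 through 7.) The BAM, BCFv2, BED and PSL formats use the 0-based coordinate system.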
Phred scale: given a probability 0 < p ≤ 1, the Phred scale of p equals −10 log10 p, rounded to the closest integer.
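The coordinate and Phred conventions defined above come up constantly when reading SAM files, so here is a minimal Python sketch of both. The function names are made up for this illustration and do not come from any SAM library; the Phred function assumes the usual definition, −10·log10(p) rounded to an integer.

```python
import math


def to_zero_based_half_open(start, end):
    """Convert a 1-based closed interval [start, end] to 0-based half-open [start-1, end)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def to_one_based_closed(start, end):
    """Convert a 0-based half-open interval [start, end) to 1-based closed [start+1, end]."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return start + 1, end


def phred(p):
    """Phred-scale a probability 0 < p <= 1: -10 * log10(p), rounded to an integer."""
    if not 0 < p <= 1:
        raise ValueError("p must satisfy 0 < p <= 1")
    return round(-10 * math.log10(p))


def error_probability(q):
    """Inverse of phred(): the probability that corresponds to a Phred value q."""
    return 10 ** (-q / 10)


# Bases 3 to 7 are [3, 7] in 1-based terms and the slice [2:7] in Python.
start, end = to_zero_based_half_open(3, 7)
print(start, end, "ACGTACGTAC"[start:end])  # 2 7 GTACG
print(phred(0.01))                          # 20
```

Python slices are already 0-based and half-open, which is why the converted interval can be used directly as a slice.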
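3. The header section in detail
This section holds the metadata of a SAM/BAM file. It is optional and may be omitted. Every header line starts with @ followed by a two-letter record type code; fields are separated by \t, and each field has the form TAG:VALUE (except on @CO lines). Each line matches one of the regular expressions /^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/ or /^@CO\t.*/. There are five record types: HD, SQ, RG, PG and CO, of which the first four are the commonly used ones. In the specification's table of header tags, a tag marked with * is mandatory whenever its record type is present; for example, if an @HD line is present it must contain VN.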
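The two regular expressions quoted above can be applied directly. The sketch below validates a header line with them and splits it into a record type and a TAG-to-VALUE dictionary; `parse_header_line` and the example header values are hypothetical, chosen only for illustration.

```python
import re

# The two patterns quoted from the SAM specification.
HEADER_RE = re.compile(r"^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$")
COMMENT_RE = re.compile(r"^@CO\t.*")


def parse_header_line(line):
    """Return (record_type, fields) for one SAM header line, or raise ValueError."""
    line = line.rstrip("\n")
    if COMMENT_RE.match(line):
        # @CO lines carry free text rather than TAG:VALUE fields.
        return "CO", {"text": line[4:]}
    if not HEADER_RE.match(line):
        raise ValueError(f"not a valid SAM header line: {line!r}")
    record_type, *fields = line.split("\t")
    # Split on the first colon only, because values may contain colons.
    tags = dict(field.split(":", 1) for field in fields)
    return record_type[1:], tags


print(parse_header_line("@HD\tVN:1.6\tSO:coordinate"))
print(parse_header_line("@SQ\tSN:chr1\tLN:1000"))
print(parse_header_line("@CO\tany free text"))
```

This only checks the line syntax; it does not enforce which tags are mandatory for each record type.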
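4. The alignment section in detail
Overview of the alignment section
This is the core of a SAM file. Each line represents the linear alignment of a segment. A line has 11 mandatory fields, followed from column 12 onward by any number of optional fields, all TAB-separated. When the information for a mandatory field is unavailable, a string field is set to ‘*’ and an integer field to ‘0’. The 11 mandatory fields are described one by one below.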
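To make the column layout concrete, here is a small sketch that splits one alignment line into its 11 mandatory fields plus the remaining optional fields. The `SamRecord` type and the example line are invented for this illustration; real pipelines would normally use an existing library instead.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    optional: List[str]  # raw TAG:TYPE:VALUE strings from column 12 onward


def parse_alignment_line(line):
    """Split a TAB-separated SAM alignment line into typed fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError(f"expected at least 11 columns, got {len(cols)}")
    return SamRecord(
        cols[0], int(cols[1]), cols[2], int(cols[3]), int(cols[4]),
        cols[5], cols[6], int(cols[7]), int(cols[8]), cols[9], cols[10],
        cols[11:],
    )


# A made-up record: read1 of a proper pair, mate 100 bp downstream.
line = "read1\t99\tchr1\t100\t60\t5M\t=\t200\t105\tACGTA\tIIIII\tNM:i:0"
rec = parse_alignment_line(line)
print(rec.qname, rec.flag, rec.pos, rec.optional)
```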
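The alignment section field by field
Column 1: QNAME
The query template name. Records with the same QNAME are regarded as coming from the same template. ‘*’ means the information is unavailable. Usually this field is taken from the first line of the FASTQ record. A chimeric alignment or multiple mapping causes the same QNAME to appear on several lines of the SAM file.
Column 2: FLAG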
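The FLAG shown in a SAM record is either one of the values listed below or the sum of several of them. A single value has the meaning given next to it; a sum combines the meanings of all its components. The individual values are:
1 (0x1): the template has multiple segments in sequencing, i.e. the read comes from paired-end sequencing; the bit is 0 for single-end reads;
2 (0x2): each segment is properly aligned according to the aligner (the read is part of a proper pair);
4 (0x4): the read is unmapped;
8 (0x8): the other read of the pair (the mate) is unmapped;
16 (0x10): the read aligns to the reverse strand of the reference (it is stored reverse-complemented);
32 (0x20): the mate aligns to the reverse strand of the reference;
64 (0x40): read1 of a pair (the first segment in the template);
128 (0x80): read2 of a pair (the last segment in the template);
256 (0x100): secondary alignment; a read may align to several positions on the reference, one of which is the primary alignment while the others are secondary;
512 (0x200): the read did not pass filters (base quality, platform or vendor quality checks);
1024 (0x400): the read is a PCR duplicate (from library construction) or an optical duplicate (from sequencing);
2048 (0x800): supplementary alignment; the record is one piece of a chimeric alignment and covers only part of the read.
If a FLAG value is not one of those listed above, it is a sum of several of them and can be decoded with an online FLAG lookup tool.
For example, FLAG 88 = 8 (0x8) + 16 (0x10) + 64 (0x40), so it combines the three corresponding meanings.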
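Decoding a FLAG amounts to testing each bit in turn. The sketch below does that for the twelve bits listed above; the short names in `FLAG_BITS` are labels picked for this example, not official identifiers.

```python
# (bit value, short label) for each FLAG bit; labels are illustrative only.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "read1"),
    (0x80, "read2"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    if flag < 0:
        raise ValueError("FLAG must be non-negative")
    return [name for bit, name in FLAG_BITS if flag & bit]


print(decode_flag(88))  # ['mate_unmapped', 'reverse', 'read1']
print(decode_flag(99))  # ['paired', 'proper_pair', 'mate_reverse', 'read1']
```

The first call reproduces the FLAG 88 example: 8 + 16 + 64.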
Some other commonly seen FLAG values
One of the reads is unmapped (only one read of the pair aligned):
73, 133, 89, 121, 165, 181, 101, 117, 153, 185, 69, 137
Both reads are unmapped:
77, 141
Mapped within the insert size and in correct orientation:
99, 147, 83, 163
Mapped within the insert size but in wrong orientation:
67, 131, 115, 179
Mapped uniquely, but with wrong insert size:
81, 161, 97, 145, 65, 129, 113, 177
Column 3: RNAME
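The reference sequence name of the alignment, usually a chromosome name (for human: chr1 to chr22, chrX, chrY, chrM). When @SQ header lines are present, RNAME (if not ‘*’) must appear as the SN tag of one of them. An unmapped read without coordinates has ‘*’ in this field. If RNAME is ‘*’, no assumptions can be made about POS and CIGAR.
Column 4: POS
The 1-based leftmost position at which the read aligns to the reference; the smallest mapped position is 1. POS is 0 for an unmapped read without coordinates, and in that case no assumptions can be made about RNAME and CIGAR.
Column 5: MAPQ
MAPping Quality: the quality score of the alignment. The higher the value, the more reliable the position to which the read was mapped. It equals −10 log10 Pr(the mapping position is wrong), rounded to the nearest integer. A value of 255 means that the mapping quality is not available. For example, MAPQ 20 (Q20) corresponds to an error probability of 0.01: 20 = −10 × log10(0.01) = −10 × (−2).
Column 6: CIGAR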
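Short for Compact Idiosyncratic Gapped Alignment Report; it describes how the read aligns to the reference. Each number in a CIGAR string is a count of bases, and each letter is an operation with the following meaning:
M: alignment match (can be a sequence match or a mismatch);
I: insertion to the reference;
D: deletion from the reference;
N: skipped region from the reference (for example an intron in RNA-seq);
S: soft clipping (the clipped bases are present in SEQ);
H: hard clipping (the clipped bases are not present in SEQ);
P: padding (silent deletion from the padded reference);
=: sequence match;
X: sequence mismatch.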
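One practical use of the CIGAR operations is to work out how many read bases and how many reference bases an alignment covers. The sketch below records, for each operation, whether it consumes the query (SEQ) and whether it consumes the reference, following the SAM specification; the function names are hypothetical.

```python
import re

# op -> (consumes query, consumes reference), as defined in the SAM spec.
CIGAR_OPS = {
    "M": (True, True), "I": (True, False), "D": (False, True),
    "N": (False, True), "S": (True, False), "H": (False, False),
    "P": (False, False), "=": (True, True), "X": (True, True),
}


def parse_cigar(cigar):
    """Turn '3M1D2M' into [(3, 'M'), (1, 'D'), (2, 'M')]; '*' gives an empty list."""
    if cigar == "*":
        return []
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def cigar_lengths(cigar):
    """Return (query_length, reference_length) covered by a CIGAR string."""
    query = ref = 0
    for n, op in parse_cigar(cigar):
        uses_query, uses_ref = CIGAR_OPS[op]
        if uses_query:
            query += n
        if uses_ref:
            ref += n
    return query, ref


print(cigar_lengths("3M1D2M1I1M"))  # (7, 7)
```

For a mapped read, the alignment ends on the reference at POS + reference_length − 1, and query_length should equal the length of SEQ.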
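For example, 3M1D2M1I1M means: 3 bases aligned (3M), then 1 base deleted from the reference (1D), then 2 bases aligned (2M), then 1 base inserted relative to the reference (1I), then 1 base aligned (1M).
Column 7: RNEXT
The reference sequence name of the alignment of the other read of the pair (the mate) in paired-end sequencing. ‘*’ means the information is unavailable (for example, for single-end data), and ‘=’ means RNEXT is identical to RNAME. If RNEXT is neither ‘*’ nor ‘=’, it must appear as the SN tag of an @SQ line in the header section. Columns 3 and 7 together show whether a read mapped to the reference and whether read1 and read2 mapped to the same reference sequence.
Column 8: PNEXT
In paired-end sequencing, the 1-based leftmost position at which the mate aligns to the reference; it is 0 when the information is unavailable.
Column 9: TLEN
The signed observed template length, which for paired-end data corresponds roughly to the insert size of the library fragment. It is positive for the leftmost read of the pair, negative for the rightmost, and 0 for single-end reads or when the information is unavailable.
Column 10: SEQ
The base sequence of the read, i.e. line 2 of the FASTQ record (stored reverse-complemented when the read aligns to the reverse strand).
Column 11: QUAL
The base qualities, i.e. line 4 of the FASTQ record, as ASCII characters of Phred quality + 33.
Column 12 onward: Optional fields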
Optional fields may occupy several columns separated by \t, and not every line has them. For example, the optional columns of five consecutive records might look like this:
XT:A:R NM:i:0 X0:i:4 XM:i:0 XO:i:0 XG:i:0 MD:Z:50 XA:Z:chr1,+102573964,50M,0
XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
# (this record has no optional fields)
XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:50
Each field has the form TAG:TYPE:VALUE, where TAG is a two-character string matching [A-Za-z][A-Za-z0-9];
TYPE is one of A (character), B (general array), f (real number), H (hexadecimal array), i (integer), or Z (string);
VALUE depends on TYPE: when TYPE is i, VALUE is an integer, and so on.
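The TAG:TYPE:VALUE layout maps naturally onto a small parser. The sketch below converts one optional field into a (tag, value) pair with a Python type chosen per TYPE code; `parse_optional_field` is a hypothetical helper, and treating every non-`f` subtype of a B array as an integer array is a simplification made for this example.

```python
def parse_optional_field(field):
    """Parse 'TAG:TYPE:VALUE' into (tag, python_value)."""
    # Split on the first two colons only: Z values may contain colons.
    tag, typ, value = field.split(":", 2)
    if typ == "i":
        return tag, int(value)
    if typ == "f":
        return tag, float(value)
    if typ == "A":
        if len(value) != 1:
            raise ValueError(f"type A needs exactly one character: {field!r}")
        return tag, value
    if typ == "Z":
        return tag, value
    if typ == "H":
        return tag, bytes.fromhex(value)
    if typ == "B":
        # First item is the element subtype, e.g. 'S' (uint16) or 'f' (float).
        subtype, *items = value.split(",")
        convert = float if subtype == "f" else int
        return tag, [convert(x) for x in items]
    raise ValueError(f"unknown TYPE {typ!r} in {field!r}")


for f in ["XT:A:U", "NM:i:0", "MD:Z:50", "XA:Z:chr1,+102573964,50M,0"]:
    print(parse_optional_field(f))
```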
The predefined tags in detail
The predefined tags fall into six categories, described below.
1.1 Additional Template and Mapping data (alignment-related information)
AM:i:score The smallest template-independent mapping quality of any segment in the same template as
this read. (See also SM.)
AS:i:score Alignment score generated by aligner.
BQ:Z:qualities Offset to base alignment quality (BAQ), of the same length as the read sequence. At the
i-th read base, BAQi = Qi - (BQi - 64) where Qi is the i-th base quality.
CC:Z:rname Reference name of the next hit; ‘=’ for the same chromosome.
CG:B:I,encodedCigar Real CIGAR in its binary form if (and only if) it contains >65535 operations. This
is a BAM file only tag as a workaround of BAM’s incapability to store long CIGARs in the standard
way. SAM and CRAM files created with updated tools aware of the workaround are not expected to
contain this tag. See also the footnote in Section 4.2 of the SAM spec for details.
CP:i:pos Leftmost coordinate of the next hit.
E2:Z:bases The 2nd most likely base calls. Same encoding and same length as SEQ. See also U2 for
associated quality values.
FI:i:int The index of segment in the template.
FS:Z:str Segment suffix.
H0:i:count Number of perfect hits.
H1:i:count Number of 1-difference hits (see also NM).
H2:i:count Number of 2-difference hits.
HI:i:i Query hit index, indicating the alignment record is the i-th one stored in SAM.
IH:i:count Number of alignments stored in the file that contain the query in the current record.
MC:Z:cigar CIGAR string for mate/next segment.
MD:Z:[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)* String for mismatching positions.
The MD field aims to achieve SNP/indel calling without looking at the reference. For example, a string
‘10A5^AC6’ means from the leftmost reference base in the alignment, there are 10 matches followed
by an A on the reference which is different from the aligned read base; the next 5 reference bases are
matches followed by a 2bp deletion from the reference; the deleted sequence is AC; the last 6 bases are
matches. The MD field ought to match the CIGAR string.
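The MD grammar given above is simple enough to tokenize with a regular expression. The following sketch turns an MD value into a list of events and sums the reference bases it covers; the event names and function names are made up for this illustration.

```python
import re

MD_FULL = re.compile(r"[0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*")
MD_TOKEN = re.compile(r"(\d+)|\^([A-Z]+)|([A-Z])")


def parse_md(md):
    """Split an MD value into ('match', n), ('mismatch', ref_base), ('deletion', ref_bases)."""
    if not MD_FULL.fullmatch(md):
        raise ValueError(f"malformed MD value: {md!r}")
    events = []
    for number, deleted, mismatch in MD_TOKEN.findall(md):
        if number:
            if int(number) > 0:  # a run of 0 matches is only a separator
                events.append(("match", int(number)))
        elif deleted:
            events.append(("deletion", deleted))
        else:
            events.append(("mismatch", mismatch))
    return events


def md_reference_length(md):
    """Number of reference bases described by an MD value."""
    total = 0
    for kind, value in parse_md(md):
        total += value if kind == "match" else len(value)
    return total


print(parse_md("10A5^AC6"))
print(md_reference_length("10A5^AC6"))  # 24
```

For the example string this gives 10 matches, a mismatch against reference base A, 5 matches, a deletion of AC and 6 matches, which is 10 + 1 + 5 + 2 + 6 = 24 reference bases.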
MQ:i:score Mapping quality of the mate/next segment.
NH:i:count Number of reported alignments that contain the query in the current record.
NM:i:count Number of differences (mismatches plus inserted and deleted bases) between the sequence and reference, counting only (case-insensitive) A, C, G and T bases in sequence and reference as potential matches, with everything else being a mismatch (together with the CIGAR field this can be used to work out the number of mismatched bases). Note this means that ambiguity codes in both
sequence and reference that match each other, such as ‘N’ in both, or compatible codes such as ‘A’ and
‘R’, are still counted as mismatches. The special sequence base ‘=’ will always be considered to be a
match, even if the reference is ambiguous at that point. Alignment reference skips, padding, soft and
hard clipping (‘N’, ‘P’, ‘S’ and ‘H’ CIGAR operations) do not count as mismatches, but insertions and
deletions count as one mismatch per base. Note that historically this has been ill-defined and both data and tools exist that disagree with this definition.
PQ:i:score Phred likelihood of the template, conditional on the mapping locations of both/all segments
being correct.
Q2:Z:qualities Phred quality of the mate/next segment sequence in the R2 tag. Same encoding as QUAL.
R2:Z:bases Sequence of the mate/next segment in the template. See also Q2 for any associated quality
values.
SA:Z:(rname ,pos ,strand ,CIGAR ,mapQ ,NM ;)+ Other canonical alignments in a chimeric alignment, formatted
as a semicolon-delimited list. Each element in the list represents a part of the chimeric alignment.
Conventionally, at a supplementary line, the first element points to the primary line. Strand is
either ‘+’ or ‘-’, indicating forward/reverse strand, corresponding to FLAG bit 0x10. Pos is a 1-based
coordinate.
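Because the SA value is a semicolon-delimited list of comma-separated entries, it can be unpacked with two splits. The sketch below takes the value part of the tag (the text after `SA:Z:`) and returns one tuple per entry; the type name, function name and example coordinates are invented for this illustration.

```python
from typing import List, NamedTuple


class ChimericPart(NamedTuple):
    rname: str
    pos: int      # 1-based
    strand: str   # '+' or '-'
    cigar: str
    mapq: int
    nm: int


def parse_sa(value) -> List[ChimericPart]:
    """Parse the value of an SA tag: 'rname,pos,strand,CIGAR,mapQ,NM;' repeated."""
    parts = []
    for entry in value.split(";"):
        if not entry:  # the list ends with a trailing ';'
            continue
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        if strand not in ("+", "-"):
            raise ValueError(f"bad strand in SA entry: {entry!r}")
        parts.append(ChimericPart(rname, int(pos), strand, cigar, int(mapq), int(nm)))
    return parts


for part in parse_sa("chr2,1000,+,30S70M,60,1;chr5,500,-,70M30S,30,0;"):
    print(part)
```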
SM:i:score Template-independent mapping quality, i.e., the mapping quality if the read were mapped as
a single read rather than as part of a read pair or template.
TC:i: The number of segments in the template.
TS:A:strand Strand (‘+’ or ‘-’) of the transcript to which the read has been mapped.
U2:Z: Phred probability of the 2nd call being wrong conditional on the best being wrong. The same
encoding and length as QUAL. See also E2 for associated base calls.
UQ:i: Phred likelihood of the segment, conditional on the mapping being correct.
1.2 Metadata (related to the SAM header section; describes how the read was sequenced)
RG:Z:readgroup The read group to which the read belongs. If @RG headers are present, then readgroup
must match the RG-ID field of one of the headers.
LB:Z:library The library from which the read has been sequenced. If @RG headers are present, then library
must match the RG-LB field of one of the headers.
PG:Z:program id Program. Value matches the header PG-ID tag if @PG is present.
PU:Z:platformunit The platform unit in which the read was sequenced. If @RG headers are present, then
platformunit must match the RG-PU field of one of the headers.
CO:Z:text Free-text comments.
1.3 Barcodes (UMI / single-cell cell barcode)
DNA barcodes can be used to identify the provenance of the underlying reads. There are currently three
varieties of barcodes that may co-exist: Sample Barcode, Cell Barcode, and Unique Molecular Identifier
(UMI).
• Despite its name, the Sample Barcode identifies the Library and allows multiple libraries to be combined
and sequenced together. After sequencing, the reads can be separated according to this barcode and
placed in different “read groups” each corresponding to a library. Since the library was generated from
a sample, knowing the library should inform of the sample. The barcode itself can be included in the
PU field in the RG header line. Since the PU field should be globally unique, it is advisable to include
specific information such as flowcell barcode and lane. It is not recommended to use the barcode as
the ID field of the RG header line, as some tools modify this field (e.g., when merging files).
• The Cell Barcode is similar to the sample barcode but there is (normally) no control over the assignment
of cells to barcodes (whose sequence could be random or predetermined). The Cell Barcode can help
identify when reads come from different cells in a “single-cell” sequencing experiment. (In single-cell sequencing, this is the tag that traces a read back to its cell of origin.)
• The UMI is intended to identify the (single- or double-stranded) molecule at the time that the barcode
was introduced. This can be used to inform duplicate marking and make consensus calling in ultra-deep
sequencing. Additionally, the UMI can be used to (informatically) link reads that were generated
from the same long molecule, enabling long-range phasing and better informed mapping. In some
experimental setups opposite strands of the same double-stranded DNA molecule get related barcodes.
These templates can also be considered duplicates even though technically they may have different
UMIs. Multiple UMIs can be added by a protocol, possibly at different time-points, which means that
specific knowledge of the protocol may be needed in order to analyze the resulting data correctly. (In RNA-seq, UMIs make “absolute quantification” of the original RNA molecules possible.)
BC:Z:sequence Barcode sequence (Identifying the sample/library), with any quality scores (optionally)
stored in the QT tag. The BC tag should match the QT tag in length. In the case of multiple unique
molecular identifiers (e.g., one on each end of the template) the recommended implementation
concatenates all the barcodes and places a hyphen (‘-’) between the barcodes from the same template.
QT:Z:qualities Phred quality of the sample barcode sequence in the BC tag. Same encoding as QUAL,
i.e., Phred score + 33. In the case of multiple unique molecular identifiers (e.g., one on each end of
the template) the recommended implementation concatenates all the quality strings with spaces (‘ ’)
between the different strings from the same template.
CB:Z:str Cell identifier, consisting of the optionally-corrected cellular barcode sequence and an optional
suffix. The sequence part is similar to the CR tag, but may have had sequencing errors etc corrected.
This may be followed by a suffix consisting of a hyphen (‘-’) and one or more alphanumeric characters to form an identifier. In the case of the cellular barcode (CR) being based on multiple barcode sequences
the recommended implementation concatenates all the (corrected or uncorrected) barcodes with a
hyphen (‘-’) between the different barcodes. Sequencing errors etc aside, all reads from a single cell
are expected to have the same CB tag.
CR:Z:sequence+ Cellular barcode. The uncorrected sequence bases of the cellular barcode as reported
by the sequencing machine, with the corresponding base quality scores (optionally) stored in CY.
Sequencing errors etc aside, all reads with the same CR tag likely derive from the same cell. In the case
of the cellular barcode being based on multiple barcode sequences the recommended implementation
concatenates all the barcodes with a hyphen (‘-’) between the different barcodes.
CY:Z:qualities+ Phred quality of the cellular barcode sequence in the CR tag. Same encoding as QUAL,
i.e., Phred score + 33. The lengths of the CY and CR tags must match. In the case of the cellular
barcode being based on multiple barcode sequences the recommended implementation concatenates all
the quality strings with spaces (‘ ’) between the different strings.
MI:Z:str Molecular Identifier. A unique ID within the SAM file for the source molecule from which this
read is derived. All reads with the same MI tag represent the group of reads derived from the same
source molecule.
OX:Z:sequence+ Raw (uncorrected) unique molecular identifier bases, with any quality scores (optionally)
stored in the BZ tag. In the case of multiple unique molecular identifiers (e.g., one on each end of the
template) the recommended implementation concatenates all the barcodes with a hyphen (‘-’) between
the different barcodes.
BZ:Z:qualities+ Phred quality of the (uncorrected) unique molecular identifier sequence in the OX tag.
Same encoding as QUAL, i.e., Phred score + 33. The OX tags should match the BZ tag in length. In the
case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended
implementation concatenates all the quality strings with a space (‘ ’) between the different strings.
RX:Z:sequence+ Sequence bases from the unique molecular identifier. These could be either corrected or
uncorrected. Unlike MI, the value may be non-unique in the file. Should be comprised of a sequence of
bases. In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the
recommended implementation concatenates all the barcodes with a hyphen (‘-’) between the different
barcodes. If the bases represent corrected bases, the original sequence can be stored in OX (similar to OQ storing the original qualities of bases.)
QX:Z:qualities+ Phred quality of the unique molecular identifier sequence in the RX tag. Same encoding
as QUAL, i.e., Phred score + 33. The qualities here may have been corrected (Raw bases and qualities
can be stored in OX and BZ respectively.) The lengths of the QX and the RX tags must match. In the
case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended
implementation concatenates all the quality strings with a space (‘ ’) between the different strings.
1.4 Original data
OA:Z:(RNAME,POS,strand,CIGAR,MAPQ,NM ;)+ The original alignment information of the record
prior to realignment or unalignment by a subsequent tool. Each original alignment entry contains
the following six field values from the original record, generally in their textual SAM representations,
separated by commas (‘,’) and terminated by a semicolon (‘;’): RNAME, which must be explicit
(unlike RNEXT, ‘=’ may not be used here); 1-based POS; ‘+’ or ‘-’, indicating forward/reverse strand
respectively (as per bit 0x10 of FLAG); CIGAR; MAPQ; NM tag value, which may be omitted (though
the preceding comma must be retained).
In the presence of an existing OA tag, a subsequent tool may append another original alignment entry
after the semicolon, adding to, rather than replacing, the existing OA information.
The OA field is designed to provide record-level information that can be useful for understanding the
provenance of the information in a record. It is not designed to provide a complete history of the
template alignment information. In particular, realignments resulting in the removal of Secondary
or Supplementary records will cause the loss of all tags associated with those records, and may also
leave the SA tag in an invalid state.
OC:Z:cigar Original CIGAR, usually before realignment. Deprecated in favour of the more general OA.
OP:i:pos Original 1-based POS, usually before realignment. Deprecated in favour of the more general OA.
OQ:Z:qualities Original base quality, usually before recalibration. Same encoding as QUAL.
1.5 Annotation and Padding
The SAM format can be used to represent de novo assemblies, generally by using padded reference sequences and the annotation tags described here. See the Guide for Describing Assembly Sequences in the SAM Format Specification for full details of this representation.
CT:Z:strand;type(;key(=value)?)*
Complete read annotation tag, used for consensus annotation dummy features.
The CT tag is intended primarily for annotation dummy reads, and consists of a strand, type and zero or
more key=value pairs, each separated with semicolons. The strand field has four values as in GFF3,
and supplements FLAG bit 0x10 to allow unstranded (‘.’), and stranded but unknown strand (‘?’)
annotation. For these and annotation on the forward strand (strand set to ‘+’), do not set FLAG bit
0x10. For annotation on the reverse strand, set the strand to ‘-’ and set FLAG bit 0x10.
The type and any keys and their optional values are all percent encoded according to RFC3986 to
escape meta-characters ‘=’, ‘%’, ‘;’, ‘|’ or non-printable characters not matched by the isprint() macro
(with the C locale). For example a percent sign becomes ‘%25’.
PT:Z:annotag(\|annotag)*
where each annotag matches start;end;strand;type(;key(=value)?)*
Read annotations for parts of the padded read sequence. The PT tag value has the format of a series of annotation tags separated by ‘|’, each annotating a sub-region of the read. Each tag consists of start, end, strand, type and zero or more key=value pairs, each separated with semicolons. Start and end are 1-based positions between one and the sum of the M/I/D/P/S/=/X CIGAR operators, i.e., SEQ length plus any pads. Note any editing of the CIGAR
string may require updating the PT tag coordinates, or even invalidate them. As in GFF3, strand is
one of ‘+’ for forward strand tags, ‘-’ for reverse strand, ‘.’ for unstranded or ‘?’ for stranded but unknown strand. The type and any keys and their optional values are all percent encoded as in the CT tag.
1.6 Technology-specific data
FZ:B:S,intensities Flow signal intensities (the raw signal-intensity data recorded by the sequencer) on the original strand of the read, stored as (uint16_t)
round(value * 100.0).
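The FZ storage rule, (uint16_t) round(value * 100.0), can be sketched as follows. The helper names are hypothetical, and the sketch assumes non-negative intensities; it rounds with `int(x + 0.5)` to mimic C's `round()` for non-negative values, because Python's built-in `round()` rounds halves to the nearest even number instead.

```python
def encode_flow_intensities(values):
    """Build an FZ optional field from floating-point flow intensities."""
    encoded = []
    for value in values:
        if value < 0:
            raise ValueError("this sketch assumes non-negative intensities")
        stored = int(value * 100.0 + 0.5)  # round half up, like C round() for x >= 0
        if stored > 0xFFFF:
            raise ValueError(f"{value} does not fit in a uint16 after scaling")
        encoded.append(stored)
    return "FZ:B:S," + ",".join(str(x) for x in encoded)


def decode_flow_intensities(field):
    """Recover approximate intensities from an FZ optional field."""
    prefix = "FZ:B:S,"
    if not field.startswith(prefix):
        raise ValueError("not an FZ:B:S field")
    return [int(x) / 100.0 for x in field[len(prefix):].split(",")]


field = encode_flow_intensities([1.02, 0.0, 2.5])
print(field)  # FZ:B:S,102,0,250
print(decode_flow_intensities(field))
```

The scaling keeps two decimal places, so decoded values are only accurate to 0.01.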
1.6.1 Color space
CM:i:distance Edit distance between the color sequence and the color reference (see also NM).
CS:Z:sequence Color read sequence on the original strand of the read. The primer base must be included.
CQ:Z:qualities Color read quality on the original strand of the read. Same encoding as QUAL; same
length as CS.
2 Locally-defined tags
You can freely add new tags. Note that tags starting with ‘X’, ‘Y’, or ‘Z’ and tags containing lowercase letters in either position are reserved for local use and will not be formally defined in any future version of this specification. If a new tag may be of general interest, it may be useful to have it added to this specification. Additions can be proposed by opening a new issue at https://github.com/samtools/hts-specs/issues and/or by sending email to samtools-devel@lists.sourceforge.net.
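The rule for locally-defined tags can be checked mechanically. The sketch below is one way to do it, assuming tags follow the two-character pattern [A-Za-z][A-Za-z0-9]; `is_local_tag` is a name made up for this example.

```python
import re


def is_local_tag(tag):
    """True if a tag is reserved for local use: starts with X, Y or Z, or contains a lowercase letter."""
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9]", tag):
        raise ValueError(f"not a valid two-character tag: {tag!r}")
    return tag[0] in "XYZ" or any(ch.islower() for ch in tag)


for tag in ["XT", "X0", "NM", "MD", "xs", "Nm"]:
    print(tag, is_local_tag(tag))
```

By this rule the aligner-specific XT, X0 and XA tags in the earlier example lines are local tags, while NM and MD are standard ones.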