（序列）一致性与概况 · 用Python解决生物信息学问题（To Solve Bioinformatics Problems with Python）

所属专题：[字符串算法](README.md) &emsp; ## 问题我们有一系列长度均为`$ n $`的DNA序列，它们的序列概况矩阵（profile matrix）是一个`$ 4\times n $`的矩阵`$ P $`. 其中，矩阵元素`$ P_{1,j} $`表示碱基A在各个序列的第`$ j $`位出现的次数；矩阵元素`$ P_{2,j} $`表示碱基C在各序列的第`$ j $`位出现的次数，以此类推。一致性序列（consensus sequence）`$ c $`是一个长度同样为`$ n $`的序列，由我们所掌握的序列集合中，各个位置出现次数最多的元素所组成。因此，序列`$ c $`的第`$ j $`位对应于序列概况矩阵第`$ j $`列中元素值最大的那个符号（碱基/氨基酸）。当然，某些位置上可能有不止一个最常见的符号，这使得存在多个可能的一致性序列。 DNA序列、序列概况矩阵（profile matrix）、一致性序列（consensus sequence）的关系如下所示： ***** （1）多个等长的DNA序列： :-: A T C C A G C T G G G C A A C T A T G G A T C T A A G C A A C C T T G G A A C T A T G C C A T T A T G G C A C T （2）序列概况矩阵（profile matrix） :-: **A** &emsp;&ensp; 5 1 0 0 5 5 0 0 &emsp;&emsp; **C** &emsp;&ensp; 0 0 1 4 2 0 6 1 &emsp;&emsp; **G** &emsp;&ensp; 1 1 6 3 0 1 0 0 &emsp;&emsp; **T** &emsp;&ensp; 1 5 0 0 0 1 1 6 &emsp;&emsp; （3）一致性序列（consensus sequence） :-: A T G C A A C T **输入：** 包含多个等长的DNA序列的FASTA文件，序列不超过10条，长度均不超过1 kb. **输出：** 第一行是一致性序列，第二行至第五行是序列概况矩阵的内容。（如果存在多个一致性序列，则输出符合条件的任一个均可） **样例数据：** ~~~ >Rosalind_1 ATCCAGCT >Rosalind_2 GGGCAACT >Rosalind_3 ATGGATCT >Rosalind_4 AAGCAACC >Rosalind_5 TTGGAACT >Rosalind_6 ATGCCATT >Rosalind_7 ATGGCACT ~~~ **样例输出：** ~~~ ATGCAACT A: 5 1 0 0 5 5 0 0 C: 0 0 1 4 2 0 6 1 G: 1 1 6 3 0 1 0 0 T: 1 5 0 0 0 1 1 6 ~~~ &emsp; ## 背景知识该问题涉及序列概况矩阵与一致性序列的原理与应用。完成多序列比对后，通过构建序列概况矩阵，可以找到其中频数较高的序列组成模式；而一致性序列则是构建了一个平均情况下出现频率最高的序列组成，它代表了各个给定序列最相互接近的共同特征，这在序列的同源性分析中是很重要的概念。详情请查阅ROSALIND网站上[关于该问题的背景说明](http://rosalind.info/problems/cons/)。 &emsp; ## 解答 ``` def read_fa(fa): """读入FASTA格式的文件对象fa，将其中的序列用字典结构存储""" seq = {} for line in fa: if line.startswith('>'): seqid = line.replace('>', '').strip() seq[seqid] = '' else: seq[seqid] += line.strip() return seq def profile_matrix(seq): """读入字典结构的多个序列，构建其Profile matrix""" sequences = list(seq.values()) seq_pos = list(zip(*sequences)) profile = {'A: ':[], 'C: ':[], 'G: ':[], 'T: ':[]} for pos in seq_pos: profile['A: '].append(pos.count('A')) profile['C: '].append(pos.count('C')) profile['G: '].append(pos.count('G')) profile['T: '].append(pos.count('T')) return profile def consensus_seq(seq): """读入字典结构的多个序列，构建一致性序列""" sequences = list(seq.values()) seq_pos = list(zip(*sequences)) base_freq = map(lambda x:dict((base, x.count(base)) for base in "ACGT"), seq_pos) conseq = ''.join([max(x, key=x.get) for x in base_freq]) return conseq ## --main-- with open("rosalind_cons.txt", 'r') as f1: dnaseqs = read_fa(f1) conseq = consensus_seq(dnaseqs) profile_m = profile_matrix(dnaseqs) with open("rosalind_cons_out.txt", 'w') as f2: f2.write(conseq + '\n') for base,profile in profile_m.items(): f2.write(base + ' '.join([str(x) for x in profile]) + '\n') ```