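What is the difference between identity and similarity? I realized I was not very clear on these two concepts myself, so I did a bit of homework. My notes are below.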

My first reaction was to look them up in the BLAST glossary:

Identity – The extent to which two (nucleotide or amino acid) sequences are invariant.

Similarity – The extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity and/or conservation. In BLAST similarity refers to a positive matrix score.

Yet the BLAST output has no "similarity" field at all, which is odd.

>sp|P05120|PAI2_HUMAN PLASMINOGEN ACTIVATOR INHIBITOR-2, PLACENTAL (MONOCYTE ARG- SERPIN).
Length = 415
Score = 176 (80.2 bits), Expect = 1.8e-65, Sum P = 1.8e-65
Identities = 38/89, Positives = 50/89

Query:   1 QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQ
           +I +LL   S D DT +VLVNA+YFKG WKT F  +     PF V
Sbjct: 180 KIPNLLPEGSVDGDTRMVLVNAVYFKGKWKTPFEKKLNGLYPFRVNSA

Identities correspond to exact matches and positives are similarities based on the scoring matrix used. (From the BLAST tutorial.)

So positives are evidently a matrix-adjusted form of similarities. Putting the two sources together makes it clear:

identities -> exact matches
positives -> similarities based on the matrix

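When comparing nucleotide sequences, the four bases A, T, C and G are assumed to occur with equal probability: any two identical bases score one point and every substitution scores zero. That is a very simple substitution matrix, and under it identities and similarities (positives, in BLAST terms) are the same, because with such a simple matrix both are computed the same way. When comparing protein sequences the substitution matrix is BLOSUM: identical amino acids score high, similar amino acids score lower, and dissimilar ones score zero or below. Here identities and positives are computed differently, so the two numbers differ.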
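To make the distinction concrete, here is a minimal Python sketch that counts identities and positives for the 48-column alignment fragment shown in the BLAST output above. The function names are my own, and `SIMILAR_PAIRS` is a hand-copied subset of positive BLOSUM62 entries covering only the substitutions that occur in this fragment; every other non-identical pair is assumed to be non-positive. It is an illustration of the counting rule, not how BLAST itself is implemented.

```python
def count_identities_and_positives(query, subject, score):
    """Count identical columns and columns whose matrix score is positive."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    identities = sum(1 for a, b in zip(query, subject) if a == b)
    positives = sum(1 for a, b in zip(query, subject) if score(a, b) > 0)
    return identities, positives


def nucleotide_score(a, b):
    # The simple scheme from the text: 1 for a match, 0 for any substitution.
    return 1 if a == b else 0


# ASSUMPTION: partial table, only the positive non-identical pairs that
# appear in this fragment. A real BLOSUM matrix has many more entries.
SIMILAR_PAIRS = {
    frozenset("IV"): 3,
    frozenset("LM"): 2,
    frozenset("QK"): 1,
    frozenset("DN"): 1,
    frozenset("EK"): 1,
}


def toy_protein_score(a, b):
    if a == b:
        return 4  # placeholder; real diagonal values vary by residue
    # ASSUMPTION: any pair not listed above is treated as non-positive.
    return SIMILAR_PAIRS.get(frozenset((a, b)), -1)


QUERY = "QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQ"
SBJCT = "KIPNLLPEGSVDGDTRMVLVNAVYFKGKWKTPFEKKLNGLYPFRVNSA"

protein_counts = count_identities_and_positives(QUERY, SBJCT, toy_protein_score)
dna_counts = count_identities_and_positives("ATCGATCG", "ATCGTTCA", nucleotide_score)

print(protein_counts)  # identities and positives differ for the protein fragment
print(dna_counts)      # with the 1/0 scheme the two counts coincide
```

For this fragment the protein counts come out as 23 identities and 28 positives (the 23 letters plus the 5 "+" signs on the middle line), while the nucleotide toy example gives the same number for both.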

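Statistical similarity and homology in the biological sense are different things again. At this point I also Googled "homology" and "similarity", and up came one big line of text: Similarity is NOT equal to Homology. Someone had made a dedicated web page just to stress that the two are not the same, which is well worth keeping in mind.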

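(2008.10.1) Someone left a comment, so I took another look and found that the link to the "Similarity is NOT equal to Homology" page was dead. I recovered the page through the Wayback Machine and pasted it below.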

Similarity is NOT equal to Homology

IDENTITY – The extent to which two sequences are invariant.

SIMILARITY – The extent to which sequences are related. Similarity makes
no statement about descent from a common ancestor. (Convergent versus
Divergent evolution.)

HOMOLOGY – Sequence similarity that can be attributed to descent from a
common ancestor.

There are Two Types of Homology

ORTHOLOGOUS – Homologous sequences in different species. These sequences
usually retain the same function in the two species.

PARALOGOUS – Homologous sequences in the same species that arose by
means of gene duplication. Divergence of function is more common between
paralogues.

Why is this important? Homology is a matter of opinion, not directly
measurable or observable. Similarity is a direct measurement and can be
discussed in terms of percentages.

(See Reeck et al., Cell 50: 667.)

In addition, on the difference between Score and bit score:

BLAST Score

BLAST scores rely on extensive theory. We start by making the following assumptions:

  • The BLAST score is scoring local ungapped alignments. The theory of
    scoring here is well understood.
  • The database sequences are assumed to be evolutionarily unrelated,
    i.e. independent of one another.
  • The alignment starts at specific positions along the query and the
    database record.
  • The score matrix must give, on average, a negative score. Were this
    not the case, long alignments would tend to have high scores
    regardless of whether the aligned segments were related, and the
    statistical theory would break down.

Figure 5.10: Random walk: The score for a match is +2 and the penalty
for a mismatch is -1. As shown, the expectation for the whole walk is
negative. The probability that the top score will be larger than x
decreases exponentially with x.
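The random walk in the caption can be sketched in a few lines of Python. The +2 and -1 step scores come from the caption; the match probability of 0.25 (four equally likely bases), the walk length, and the number of walks are my own assumptions for illustration. The sketch computes the expected score per step and an empirical tail of the top score, which should shrink as the threshold grows.

```python
import random

MATCH_SCORE = 2      # from the figure caption
MISMATCH_SCORE = -1  # from the figure caption


def expected_step_score(p_match):
    """Mean score of a single step of the walk."""
    return p_match * MATCH_SCORE + (1 - p_match) * MISMATCH_SCORE


def top_score(n_steps, p_match, rng):
    """Highest cumulative score reached by one walk that starts at 0."""
    total = 0
    best = 0
    for _ in range(n_steps):
        total += MATCH_SCORE if rng.random() < p_match else MISMATCH_SCORE
        best = max(best, total)
    return best


P_MATCH = 0.25  # ASSUMPTION: four equally likely bases, independent sequences
rng = random.Random(0)  # fixed seed so the run is repeatable

drift = expected_step_score(P_MATCH)
tops = [top_score(200, P_MATCH, rng) for _ in range(2000)]

# Fraction of walks whose top score exceeds x, for a few thresholds x.
tail = {x: sum(t > x for t in tops) / len(tops) for x in (0, 2, 4, 6, 8)}

print("expected score per step:", drift)
for x, frac in tail.items():
    print(f"P(top score > {x}) ~ {frac:.3f}")
```

With these assumptions the expected step score is 0.25 * 2 - 0.75 * 1 = -0.25, so the walk drifts downward and high top scores become rare.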


When searching a query of length m in a database of total length n one
performs m*n random walk experiments, each with exponentially decreasing
probability of achieving a score s. Thus, the E-value for score s is:

$$E = K m n e^{-\lambda s}$$

where $\lambda$ and K are constants:

  • scaling factor K – correction for dependency and bias of the scoring
    scheme.
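As a small worked illustration, the sketch below evaluates the E-value assuming the standard Karlin-Altschul form E = K m n e^(-lambda s). The function name is mine, and the values of `LAMBDA` and `K` are placeholders of the order commonly quoted for ungapped BLOSUM62 scoring; the real constants depend on the scoring matrix and gap costs in use.

```python
import math


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# ASSUMPTION: illustrative constants only, not taken from the text.
LAMBDA = 0.318
K = 0.134

small_db = e_value(50, 100, 2_000_000, LAMBDA, K)
large_db = e_value(50, 100, 4_000_000, LAMBDA, K)
higher_score = e_value(60, 100, 2_000_000, LAMBDA, K)

print(f"E (db = 2e6 residues): {small_db:.3g}")
print(f"E (db = 4e6 residues): {large_db:.3g}")
print(f"E (score 60 instead of 50): {higher_score:.3g}")
```

Doubling the database length doubles E for the same raw score, and raising the score lowers E exponentially, which is why the same alignment can look more or less significant depending on what it was searched against.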

Indeed, the E-value depends on the lengths of the query and the
database: the same alignment would have a different E-value if these
lengths were different. The E-value is also exponential in the score, so
it is instructive to move it onto a logarithmic scale, called the bit
score.

The bit score B is computed from the E-value E by $E = m n 2^{-B}$.
Equating this with $E = K m n e^{-\lambda s}$ shows that the bit score is
linear in the raw score s:

$$B = \frac{\lambda s - \ln K}{\ln 2}$$

In contrast to raw scores, which have little meaning without K and
$\lambda$, the bit score is measured in standard units (see e.g. [17]).
Naturally, the significance of a given bit score still depends on the
sizes of the query and the database.
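The relation between raw score, bit score and E-value can be checked numerically. This sketch assumes E = K m n e^(-lambda s) and E = m n 2^(-B), which together give B = (lambda s - ln K) / ln 2. Function names and the constants `LAMBDA` and `K` are illustrative assumptions, not values taken from the text.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw score to bits: B = (lam * s - ln k) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_raw(raw_score, query_len, db_len, lam, k):
    return k * query_len * db_len * math.exp(-lam * raw_score)


def e_value_from_bits(bits, query_len, db_len):
    return query_len * db_len * 2.0 ** (-bits)


# ASSUMPTION: illustrative constants and search-space sizes.
LAMBDA, K = 0.318, 0.134
M, N = 100, 2_000_000

bits = bit_score(50, LAMBDA, K)
e_raw = e_value_from_raw(50, M, N, LAMBDA, K)
e_bits = e_value_from_bits(bits, M, N)

print(f"bit score: {bits:.2f}")
print(f"E from raw score: {e_raw:.6g}")
print(f"E from bit score: {e_bits:.6g}")
```

Both routes give the same E-value, but the bit-score route needs only m and n, which is what makes bit scores comparable across scoring systems.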

Again, as mentioned before, one can ask for the P-value (the probability
of the observed number of records with a known E-value or lower). Define
the random variable $Y_E$ to be the observed number of pairs achieving
E-value E or better. $Y_E$ follows a Poisson distribution with mean E.
The probability that $Y_E$ equals r is

$$P(Y_E = r) = \frac{e^{-E} E^{r}}{r!}$$

and the probability that $Y_E$ is 0, i.e. that no pair achieves E-value E
or better, is $e^{-E}$. Specifically, the chance of finding zero
alignments with score >= s is $e^{-E}$, so the probability of finding at
least one such alignment is $1 - e^{-E}$. This is the P-value associated
with the score s (see e.g. [17]). Note that this model assumes an i.i.d.
trial for each database position.
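A short Python sketch of the Poisson model, assuming the number of hits with E-value E or better is Poisson with mean E, so that P = 1 - e^(-E). The function names are mine. `math.expm1` is used so that very small E-values do not round to zero.

```python
import math


def poisson_pmf(r, e):
    """Probability of exactly r chance hits when the expected number is e."""
    return math.exp(-e) * e ** r / math.factorial(r)


def p_value(e):
    """Probability of at least one chance hit: 1 - exp(-e)."""
    # expm1(x) = exp(x) - 1, accurate even when x is extremely close to 0.
    return -math.expm1(-e)


p_tiny = p_value(1.8e-65)
p_one = p_value(1.0)

print(f"P for E = 1.8e-65: {p_tiny:.3g}")
print(f"P for E = 1:       {p_one:.3f}")
print(f"P(no hits | E = 1): {poisson_pmf(0, 1.0):.3f}")
```

For very small E the P-value is practically equal to E, which is consistent with the BLAST output quoted earlier reporting the same figure, 1.8e-65, for both Expect and Sum P. For E around 1 or larger the two diverge, since P can never exceed 1.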
