My first reaction was to look up the BLAST glossary:

Identity – The extent to which two (nucleotide or amino acid) sequences are
invariant.

Similarity – The extent to which nucleotide or protein sequences
are related. The extent of similarity between two sequences can be based
on percent sequence identity and/or conservation. In BLAST similarity
refers to a positive matrix score.


(MONOCYTE ARG- SERPIN). Length = 415
Score = 176 (80.2 bits), Expect = 1.8e-65, Sum P = 1.8e-65
Identities = 38/89, Positives = 50/89
Query: 1
         +VLVNA+YFKG WKT F + PF V
Sbjct: 180

Identities correspond to exact matches and positives are similarities
based on the scoring matrix used. (from the BLAST tutorial)


identities -> exact matches
positives -> similarities based on the matrix
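To make the distinction concrete, here is a minimal Python sketch that counts identities and positives over an already aligned pair of sequences. The scores in `TOY_SCORES` and the helper names are hypothetical and only illustrate the idea; a real BLAST search would use a full matrix such as BLOSUM62.

```python
# Toy substitution scores for a few residue pairs (hypothetical values,
# NOT a real BLOSUM/PAM matrix); every unlisted mismatch scores -1.
TOY_SCORES = {
    ("I", "V"): 3,
    ("K", "R"): 2,
    ("D", "E"): 2,
}


def pair_score(a, b):
    """Score one aligned residue pair under the toy matrix."""
    if a == b:
        return 4  # in this toy matrix every exact match scores positively
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), -1))


def count_identities_and_positives(query, subject):
    """Return (identities, positives, alignment_length) for two aligned strings.

    Gap columns ('-') count toward the length but are never identities
    or positives.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    identities = positives = 0
    for a, b in zip(query, subject):
        if a == "-" or b == "-":
            continue
        if a == b:
            identities += 1
        if pair_score(a, b) > 0:
            positives += 1
    return identities, positives, len(query)


ident, pos, length = count_identities_and_positives("IKDAW", "VRDGW")
print(f"Identities = {ident}/{length}, Positives = {pos}/{length}")
# prints: Identities = 2/5, Positives = 4/5
```

Every identity is also a positive here, because exact matches score above zero in the toy matrix, while conservative substitutions such as I/V add to the positives only.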

When comparing protein sequences, once a substitution matrix is applied,
the two are calculated in the same way.

(2008.10.1) I saw someone comment on this again and took another quick look:

Similarity is NOT equal to Homology

IDENTITY – The extent to which two sequences are invariant.

SIMILARITY – The extent to which sequences are related. Similarity makes
no statement about descent from a common ancestor. (Convergent versus
Divergent evolution.)

HOMOLOGY – Sequence similarity that can be attributed to descent from a
common ancestor.

There are Two Types of Homology

ORTHOLOGOUS – Homologous sequences in different species. These sequences
usually retain the same function in the two species.

PARALOGOUS – Homologous sequences in the same species that arose by
means of gene duplication. Divergence of function is more common between
paralogs than between orthologs.

Why is this important? Homology is a matter of opinion, not directly
measurable or observable. Similarity is a direct measurement and can be
discussed in terms of percentages.

(See Reeck et al., Cell 50: 667.)

Also, the difference between the score and the bit score:

BLAST Score

BLAST scores rely on extensive theory. We start by making
the following assumptions:

  • The BLAST score is scoring local ungapped alignments, where the
    theory of scoring is well understood.
  • The database sequences are assumed to be evolutionarily unrelated,
    i.e. independent of one another.
  • The alignment starts at specific positions along the query and the
    database record.
  • The score matrix must give, on average, a negative score. Were this
    not the case, long alignments would tend to have high scores
    regardless of whether the aligned segments were related, and the
    statistical theory would break down.

Figure 5.10: Random walk: The score for a match is +2 and the penalty
for a mismatch is -1. As shown, the expected value of the whole walk is
negative. The probability that the top score will be larger than x
decreases exponentially with x.
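The random walk in the figure can be imitated with a short simulation. The sketch below is illustrative only: it assumes uniformly random nucleotides (so a match occurs with probability 0.25) and treats the walk as resetting to zero whenever it would go negative, which is one common way to model a local alignment score. Neither assumption comes from the figure itself.

```python
import random

MATCH, MISMATCH = 2, -1  # scores from the figure caption
P_MATCH = 0.25           # assumption: uniform random nucleotides


def expected_step_score(p_match=P_MATCH):
    """Expected score of a single step of the walk."""
    return p_match * MATCH + (1 - p_match) * MISMATCH


def top_score(steps, rng):
    """Highest value reached by a walk that resets to 0 instead of going negative."""
    current = best = 0
    for _ in range(steps):
        step = MATCH if rng.random() < P_MATCH else MISMATCH
        current = max(0, current + step)
        best = max(best, current)
    return best


def tail_fractions(trials=2000, steps=200, seed=0):
    """Fraction of simulated walks whose top score exceeds each threshold x."""
    rng = random.Random(seed)
    tops = [top_score(steps, rng) for _ in range(trials)]
    return {x: sum(t > x for t in tops) / trials for x in (2, 4, 6, 8, 10)}


print("expected score per step:", expected_step_score())
for x, frac in tail_fractions().items():
    print(f"P(top score > {x}) ~ {frac:.3f}")
```

The per-step expectation is negative under these assumptions, and the printed tail fractions shrink as x grows, which is the qualitative behaviour the caption describes.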


When searching a query of length m in a database of total length n, one
performs m*n random walk experiments, each with an exponentially decreasing
probability of achieving a score s. Thus, the E-value for score s is:

$$E = K m n e^{-\lambda s}$$

where $\lambda$ and K are constants:


  • scaling factor K – correction for dependency and bias of the scoring
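As a rough illustration of the standard Karlin-Altschul form $E = K m n e^{-\lambda s}$, the sketch below evaluates the E-value for a given raw score. The values of `LAMBDA` and `K` are placeholders chosen for illustration; in practice they depend on the scoring matrix and gap penalties, and the function name is hypothetical.

```python
import math

# Placeholder constants (assumed for illustration, not taken from the text).
LAMBDA = 0.318
K = 0.134


def e_value(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


small_db = e_value(50, query_len=100, db_len=1_000_000)
large_db = e_value(50, query_len=100, db_len=2_000_000)
print(f"E (1M residues) = {small_db:.3g}")
print(f"E (2M residues) = {large_db:.3g}")
```

Doubling the database length doubles the E-value for the same raw score, and each additional $\ln 2 / \lambda$ of raw score halves it.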

Indeed, the E-score scales with the lengths of the query and the
database: the same alignment would have a different E-score if these
lengths were different. Also, the E-score is exponential in the score, so
it is instructive to normalize it onto a logarithmic scale, called the
bit score.

The bit score B is computed from the E-score E by $E = m n 2^{-B}$. Obviously,
the bit score is linear in the raw score s:

$$B = \frac{\lambda s - \ln K}{\ln 2}$$

In contrast to raw scores, which have little meaning without K and $\lambda$,
the bit score is measured in standard units (see e.g. [17]). Naturally,
the meaning of the bit score depends on the sizes of the query and the
database.
Again, as mentioned before, one can ask for the P-value (the probability
of observing at least one record with a given E-value or lower). Define
the random variable $Y_E$ to be the observed number of pairs achieving
E-value E or better. $Y_E$ is Poisson distributed with mean E. The
probability that $Y_E$ equals r is

$$P(Y_E = r) = \frac{e^{-E} E^r}{r!}$$

and the probability that $Y_E$ is 0 is equivalent to the probability that
even the best alignment falls short of E-score E, which is $e^{-E}$.
Specifically, the chance of finding zero alignments with score >= S is
$e^{-E}$, so the probability of finding at least one such alignment is
$1 - e^{-E}$. This is the P-value associated with the score S (see e.g.
[17]). Note that this model assumes an i.i.d. trial for each database
position.
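The Poisson relationship between E-value and P-value is easy to check numerically. The following is a minimal sketch with hypothetical function names. It uses `math.expm1` so that $1 - e^{-E}$ stays accurate when E is very small.

```python
import math


def poisson_pmf(r, e_value):
    """Probability of seeing exactly r chance hits when the expected count is e_value."""
    return math.exp(-e_value) * e_value ** r / math.factorial(r)


def p_value(e_value):
    """Probability of at least one chance hit: 1 - exp(-E)."""
    return -math.expm1(-e_value)


for e in (1e-5, 0.1, 1.0, 5.0):
    print(f"E = {e:g}: P(no hits) = {poisson_pmf(0, e):.4g}, P-value = {p_value(e):.4g}")
```

For very small E the P-value is nearly equal to E itself, which is consistent with the Expect and Sum P fields both showing 1.8e-65 in the example output quoted earlier. For large E the P-value approaches 1.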