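What is the difference between identity and similarity? I realized I was not very clear on these concepts myself, so I did a little homework, summarized below.

My first reaction was to look them up in the BLAST glossary.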

Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.

Similarity: The extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity and/or conservation. In BLAST similarity refers to a positive matrix score.

Yet the BLAST output has no "similarity" item at all, which is odd.

>sp|P05120|PAI2_HUMAN PLASMINOGEN ACTIVATOR INHIBITOR-2, PLACENTAL (MONOCYTE ARG- SERPIN).
 Length = 415

 Score = 176 (80.2 bits), Expect = 1.8e-65, Sum P = 1.8e-65
 Identities = 38/89, Positives = 50/89

Query: 1   QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQ
           +I +LL   S D DT +VLVNA+YFKG WKT F  +     PF V
Sbjct: 180 KIPNLLPEGSVDGDTRMVLVNAVYFKGKWKTPFEKKLNGLYPFRVNSA

Identities correspond to exact matches and positives are similarities based on the scoring matrix used. (From the BLAST tutorial.)

So positives are simply similarities as judged by the scoring matrix. Putting the two together makes it clear:

identities -> exact matches
positives -> similarities based on the matrix

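When comparing nucleotide sequences, the four bases A, T, C and G are assumed to occur with equal probability. Any two identical bases score one point and any substitution scores zero, which is a very simple substitution matrix. In this case identities and similarities (called positives in BLAST) are the same, because with such a simple substitution matrix the two are computed in the same way.

When comparing protein sequences, the substitution matrix is BLOSUM: identical amino acids score high, similar amino acids score lower but still positive, and dissimilar pairs score zero or below. Here identities and positives are computed differently, so the two numbers differ.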
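To make the distinction concrete, here is a minimal Python sketch that counts identities and positives for the 48 alignment columns visible in the BLAST hit above (the full hit has 89 columns, so the counts are smaller than 38 and 50). The names `POSITIVE_PAIRS`, `midline` and `count_identities_and_positives` are made up for this illustration. `POSITIVE_PAIRS` is an assumption: it lists only the non-identical pairs in this fragment that score above zero in BLOSUM62, where a real tool would look up the full matrix.

```python
# First 48 of the 89 aligned columns from the BLAST hit shown above.
QUERY = "QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQ"
SBJCT = "KIPNLLPEGSVDGDTRMVLVNAVYFKGKWKTPFEKKLNGLYPFRVNSA"

# Assumption: the only non-identical pairs in this fragment with a positive
# BLOSUM62 score (the '+' columns). This is NOT the full matrix.
POSITIVE_PAIRS = {frozenset(pair) for pair in ("QK", "DN", "LM", "IV", "EK")}


def midline(a: str, b: str) -> str:
    """Build a BLAST-style midline: letter = identity, '+' = positive, ' ' = neither."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for x, y in zip(a, b):
        if x == y:
            marks.append(x)
        elif frozenset((x, y)) in POSITIVE_PAIRS:
            marks.append("+")
        else:
            marks.append(" ")
    return "".join(marks)


def count_identities_and_positives(a: str, b: str) -> tuple[int, int]:
    """Identities are exact matches; positives also include the '+' columns."""
    line = midline(a, b)
    identities = sum(ch.isalpha() for ch in line)
    positives = identities + line.count("+")
    return identities, positives


if __name__ == "__main__":
    print(midline(QUERY, SBJCT))
    print(count_identities_and_positives(QUERY, SBJCT))  # (23, 28)
    # With a plain match/mismatch nucleotide scheme there are no '+' columns,
    # so the two counts coincide.
    print(count_identities_and_positives("ATCGGA", "ATCTGA"))  # (5, 5)
```

The nucleotide call at the end shows the first case described above: with no "similar but not identical" pairs, positives and identities are the same number.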

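As for statistical similarity versus homology in the biological sense, those are different again. With that in mind I googled homology and similarity and, well, found one line in very large type: Similarity is NOT equal to Homology. Someone made a separate web page just to stress that the two are not the same thing, which is well worth keeping in mind.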

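(2008.10.1) I saw that someone had left a comment, so I took a look: the link to the "Similarity is NOT equal to Homology" page had gone dead. I recovered the page through the Wayback Machine and pasted it below.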

Similarity is NOT equal to Homology

IDENTITY – The extent to which two sequences are invariant.

SIMILARITY – The extent to which sequences are related. Similarity makes

no statement about descent from a common ancestor. (Convergent versus

Divergent evolution.)

HOMOLOGY – Sequence similarity that can be attributed to descent from a

common ancestor.

There are Two Types of Homology

ORTHOLOGOUS – Homologous sequences in different species. These sequences

usually retain the same function in the two species.

PARALOGOUS – Homologous sequences in the same species that arose by

means of gene duplication. Divergence of function is more common between

paralogues.

Why is this important? Homology is a matter of opinion, not directly

measurable or observable. Similarity is a direct measurement and can be

discussed in terms of percentages.

(See Reeck et al. Cell 50: 667)

In addition, the difference between the Score and the bit score:

BLAST Score

BLAST scores rely on extensive theory. We start by making the following assumptions:

- The BLAST score is scoring local ungapped alignments. The theory of scoring here is well understood.
- The database sequences are assumed to be evolutionarily unrelated, i.e. independent of one another.
- The alignment starts at specific positions along the query and the database record.
- The score matrix must give, on average, a negative score. Were this not the case, long alignments would tend to have a high score independently of whether the segments aligned were related, and the statistical theory would break down.

Figure 5.10: Random walk. The score for a match is +2 and the penalty for a mismatch is -1. As shown, the expected score for the whole walk is negative. The probability that the top score will be larger than x decreases exponentially with x.
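The random walk in the caption can be sketched in a few lines of Python. The +2 and -1 scores come from the caption. The match probability of 1/4 is an assumption (four equally likely bases), and the function names are made up for this illustration. The "top score" here is the largest rise above an earlier minimum of the walk, which is the score of the best local segment.

```python
import random

MATCH_SCORE = 2      # from the figure caption
MISMATCH_SCORE = -1  # from the figure caption
P_MATCH = 0.25       # assumption: four equally likely bases


def expected_step_score(p_match: float = P_MATCH) -> float:
    """Average score of one aligned position between unrelated sequences."""
    return p_match * MATCH_SCORE + (1.0 - p_match) * MISMATCH_SCORE


def random_walk(steps: int, seed: int = 0) -> list[int]:
    """Cumulative score along an alignment of random, unrelated positions."""
    rng = random.Random(seed)
    position = 0
    path = [position]
    for _ in range(steps):
        position += MATCH_SCORE if rng.random() < P_MATCH else MISMATCH_SCORE
        path.append(position)
    return path


def top_score(path: list[int]) -> int:
    """Largest rise above an earlier minimum, i.e. the best local segment score."""
    best = 0
    lowest = path[0]
    for value in path:
        lowest = min(lowest, value)
        best = max(best, value - lowest)
    return best


if __name__ == "__main__":
    walk = random_walk(10_000)
    print("expected score per step:", expected_step_score())  # -0.25
    print("final position:", walk[-1])
    print("top local score:", top_score(walk))
```

Because each step loses 0.25 on average, the walk drifts downward, and a high top score can only come from a short lucky stretch.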

When searching a query of length m in a database of total length n, one

performs m*n random walk experiments, each with an exponentially decreasing

probability of achieving a score S. Thus, the E-value for score s is:

$E = K m n e^{-\lambda s}$

$\lambda$ and K are constants:

- $\lambda$: scaling factor
- K: correction for dependency and bias of the scoring scheme.
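As a small worked example, the E-value in its standard Karlin-Altschul form, E = K m n exp(-lambda s), can be computed directly. This is only a sketch: the values of `LAMBDA` and `K` below are illustrative assumptions of the size often quoted for ungapped protein scoring. They are not taken from the text above or from any particular BLAST run.

```python
import math

# Assumptions: illustrative Karlin-Altschul parameters, not values from the text.
LAMBDA = 0.318
K = 0.134


def e_value(raw_score: float, query_len: int, db_len: int,
            lam: float = LAMBDA, k: float = K) -> float:
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    # Same raw score, two database sizes: the E-value scales with n.
    print(e_value(50, query_len=100, db_len=1_000_000))
    print(e_value(50, query_len=100, db_len=2_000_000))
    # A higher raw score makes a chance hit exponentially less likely.
    print(e_value(60, query_len=100, db_len=1_000_000))
```

Doubling the database size doubles the E-value for the same raw score, which is why the same alignment looks less significant in a larger database.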

Indeed, the E-score is normalized by the lengths of the query and the database: the same alignment would have a different E-score if these lengths were different. Also, the E-score is exponential in the raw score, so it is instructive to consider a normalization of the E-score onto a logarithmic scale, called the Bit-score.

The Bit-score B is computed from the E-score E by $E = m n 2^{-B}$. Obviously, the Bit-score is linear in the raw score s:

$B = \frac{\lambda s - \ln K}{\ln 2}$
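A quick numerical check of this relationship, as a sketch: if B = (lambda s - ln K) / ln 2, then m n 2^(-B) gives the same E-value as K m n exp(-lambda s). The values of `LAMBDA` and `K` are again illustrative assumptions, and the function names are made up for this example.

```python
import math

# Assumptions: illustrative Karlin-Altschul parameters, not values from the text.
LAMBDA = 0.318
K = 0.134


def bit_score(raw_score: float, lam: float = LAMBDA, k: float = K) -> float:
    """Convert a raw score s to bits: B = (lambda * s - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_raw(raw_score: float, m: int, n: int,
                     lam: float = LAMBDA, k: float = K) -> float:
    """E = K * m * n * exp(-lambda * s)."""
    return k * m * n * math.exp(-lam * raw_score)


def e_value_from_bits(bits: float, m: int, n: int) -> float:
    """E = m * n * 2^(-B); lambda and K are no longer needed."""
    return m * n * 2.0 ** (-bits)


if __name__ == "__main__":
    s, m, n = 50, 100, 1_000_000
    b = bit_score(s)
    print("bit score:", b)
    print("E from raw score:", e_value_from_raw(s, m, n))
    print("E from bit score:", e_value_from_bits(b, m, n))
```

The two E-values agree up to floating-point rounding, so once a score is expressed in bits, only the sizes m and n are needed to judge it.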

In contrast to raw scores, which have little meaning without K and $\lambda$, the Bit-score is measured in standard units (see e.g. [17]). Naturally, the meaning of the Bit-score depends on the sizes of the query and the database.

Again, as mentioned before, one can ask for the P-value (the probability of the observed number of records with a known E-value or lower). Define the random variable $Y_E$ to be the observed number of pairs achieving E-value E or better. $Y_E$ is Poisson distributed with mean E. The probability of $Y_E$ being r is

$P(Y_E = r) = e^{-E} \frac{E^r}{r!}$,

and the probability of $Y_E$ being 0, i.e. that even the best alignment does not reach E-value E, is $e^{-E}$. Specifically, the chance of finding zero alignments with score >= S is $e^{-E}$, so the probability of finding at least one such alignment is $1 - e^{-E}$. This is the P-value associated with the score S (see e.g. [17]). Note that this model assumes an i.i.d. trial for each database position.
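The Poisson model can be sketched in Python as follows. The function names are made up for this illustration; the only input is the E-value itself.

```python
import math


def poisson_pmf(r: int, e_value: float) -> float:
    """Probability of seeing exactly r chance alignments when E are expected."""
    return math.exp(-e_value) * e_value ** r / math.factorial(r)


def p_value(e_value: float) -> float:
    """Probability of at least one chance alignment: 1 - exp(-E)."""
    # expm1 keeps precision when E is tiny, where 1 - exp(-E) would round badly.
    return -math.expm1(-e_value)


if __name__ == "__main__":
    for e in (10.0, 1.0, 0.01, 1e-10):
        print(f"E = {e:g}  ->  P = {p_value(e):.6g}")
```

For small E the P-value is almost equal to E, which is why the two are often used interchangeably for strong hits. For large E the P-value saturates near 1 while E keeps growing.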