[Paper translation] Clustering by Passing Messages Between Data Points

Paper title / authors: Clustering by Passing Messages Between Data Points (Brendan J. Frey and Delbert Dueck)
Translator: jingxingv

Abstract

Clustering data by identifying a subset of representative examples is important for processing sensory signals and detecting patterns in data. Such “exemplars” can be found by randomly choosing an initial subset of data points and then iteratively refining it, but this works well only if that initial choice is close to a good solution. We devised a method called “affinity propagation,” which takes as input measures of similarity between pairs of data points.

Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We used affinity propagation to cluster images of faces, detect genes in microarray data, identify representative sentences in this manuscript, and identify cities that are efficiently accessed by airline travel.

Affinity propagation found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time.

Main text

Clustering data based on a measure of similarity is a critical step in scientific data analysis and in engineering systems. A common approach is to use data to learn a set of centers such that the sum of squared errors between data points and their nearest centers is small.

When the centers are selected from actual data points, they are called “exemplars.” The popular k-centers clustering technique (1) begins with an initial set of randomly selected exemplars and iteratively refines this set so as to decrease the sum of squared errors.

k-centers clustering is quite sensitive to the initial selection of exemplars, so it is usually rerun many times with different initializations in an attempt to find a good solution. However, this works well only when the number of clusters is small and chances are good that at least one random initialization is close to a good solution.
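To make the sensitivity to initialization concrete, here is a minimal sketch of a k-centers style loop in Python with NumPy. It is not the authors' code: the function name `k_centers`, the `seed` and `max_iter` parameters, and the toy data are assumptions made for illustration, and the refinement step (pick the cluster member with the highest total similarity) is one common way to realize "iteratively refines this set".

```python
import numpy as np

def k_centers(S, k, seed=0, max_iter=100):
    """Illustrative k-centers loop on a similarity matrix S (higher = more similar)."""
    S = np.asarray(S, dtype=float)
    rng = np.random.default_rng(seed)
    # Random initial exemplars: the step the result is sensitive to.
    exemplars = rng.choice(S.shape[0], size=k, replace=False)
    for _ in range(max_iter):
        # Assign every point to its most similar exemplar; exemplars keep themselves.
        labels = exemplars[np.argmax(S[:, exemplars], axis=1)]
        labels[exemplars] = exemplars
        # Within each cluster, the refined exemplar is the member with the
        # highest total similarity from the cluster's members.
        refined = []
        for e in exemplars:
            members = np.flatnonzero(labels == e)
            totals = S[np.ix_(members, members)].sum(axis=0)
            refined.append(members[np.argmax(totals)])
        refined = np.array(refined)
        if np.array_equal(refined, exemplars):
            break
        exemplars = refined
    # Final assignment for the exemplars that were kept.
    labels = exemplars[np.argmax(S[:, exemplars], axis=1)]
    labels[exemplars] = exemplars
    return exemplars, labels

# Toy data (hypothetical): two groups of points on a line.
x = np.array([0.0, 0.3, 0.5, 10.0, 10.4, 10.5])
S = -(x[:, None] - x[None, :]) ** 2   # negative squared error as similarity
exemplars, labels = k_centers(S, k=2, seed=0)
```

Rerunning with different `seed` values mimics the random restarts described above; on harder data, different seeds can end in different local optima.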

We take a quite different approach and introduce a method that simultaneously considers all data points as potential exemplars. By viewing each data point as a node in a network, we devised a method that recursively transmits real-valued messages along edges of the network until a good set of exemplars and corresponding clusters emerges.

As described later, messages are updated on the basis of simple formulas that search for minima of an appropriately chosen energy function. At any point in time, the magnitude of each message reflects the current affinity that one data point has for choosing another data point as its exemplar, so we call our method “affinity propagation.” Figure 1A illustrates how clusters gradually emerge during the message-passing procedure.

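Fig. 1. How affinity propagation works. (A) Affinity propagation is illustrated for two-dimensional data points, where negative Euclidean distance (squared error) was used to measure similarity. Each point is colored according to the current evidence that it is a cluster center (exemplar). The darkness of the arrow directed from point i to point k corresponds to the strength of the transmitted message that point i belongs to exemplar point k. (B) “Responsibilities” r(i,k) are sent from data points to candidate exemplars and indicate how strongly each data point favors the candidate exemplar over other candidate exemplars. (C) “Availabilities” a(i,k) are sent from candidate exemplars to data points and indicate to what degree each candidate exemplar is available as a cluster center for the data point. (D) The effect of the value of the input preference (common for all data points) on the number of identified exemplars (number of clusters) is shown. The value that was used in (A) is also shown, which was computed from the median of the pairwise similarities.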

Affinity propagation takes as input a collection of real-valued similarities between data points, where the similarity $s(i,k)$ indicates how well the data point with index $k$ is suited to be the exemplar for data point $i$.

When the goal is to minimize squared error, each similarity is set to a negative squared error (Euclidean distance): For points $x_i$ and $x_k$, $s(i,k) = -\|x_i - x_k\|^2$. Indeed, the method described here can be applied when the optimization criterion is much more general.
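As a small illustration of this similarity, the following Python sketch builds the full matrix of negative squared Euclidean distances with NumPy. The function name `negative_squared_distances` and the three sample points are assumptions for illustration only.

```python
import numpy as np

def negative_squared_distances(points):
    """Return S with S[i, k] = -||x_i - x_k||^2 for an (N, D) array of points."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]   # shape (N, N, D)
    return -np.sum(diff ** 2, axis=-1)

# Hypothetical sample points in two dimensions.
X = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
S = negative_squared_distances(X)
```

The diagonal of this matrix is zero; in affinity propagation the diagonal entries $s(k,k)$ are later overwritten with the input preferences.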

Later, we describe tasks where similarities are derived for pairs of images, pairs of microarray measurements, pairs of English sentences, and pairs of cities. When an exemplar-dependent probability model is available, $s(i,k)$ can be set to the log-likelihood of data point $i$ given that its exemplar is point $k$. Alternatively, when appropriate, similarities may be set by hand.

Rather than requiring that the number of clusters be prespecified, affinity propagation takes as input a real number $s(k,k)$ for each data point $k$ so that data points with larger values of $s(k,k)$ are more likely to be chosen as exemplars.

These values are referred to as “preferences.” The number of identified exemplars (number of clusters) is influenced by the values of the input preferences, but also emerges from the message-passing procedure. If, a priori, all data points are equally suitable as exemplars, the preferences should be set to a common value; this value can be varied to produce different numbers of clusters.

The shared value could be the median of the input similarities (resulting in a moderate number of clusters) or their minimum (resulting in a small number of clusters).

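A shared preference can be written onto the diagonal of the similarity matrix in a few lines. The sketch below is only an illustration: the function name `set_shared_preference` and its `how` parameter are hypothetical, and taking the median or minimum over the off-diagonal entries only is an assumption, since the text does not say whether the diagonal is included.

```python
import numpy as np

def set_shared_preference(S, how="median"):
    """Return a copy of S whose diagonal s(k,k) holds one shared preference."""
    S = np.array(S, dtype=float)                      # copy, so the input is untouched
    off_diagonal = S[~np.eye(S.shape[0], dtype=bool)]
    # Assumption: the statistic is taken over the off-diagonal similarities.
    value = np.median(off_diagonal) if how == "median" else np.min(off_diagonal)
    np.fill_diagonal(S, value)
    return S

S_raw = np.array([[0.0, -1.0, -4.0],
                  [-1.0, 0.0, -9.0],
                  [-4.0, -9.0, 0.0]])
S_median = set_shared_preference(S_raw, how="median")  # tends to give a moderate number of clusters
S_min = set_shared_preference(S_raw, how="min")        # tends to give a small number of clusters
```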

There are two kinds of message exchanged between data points, and each takes into account a different kind of competition. Messages can be combined at any stage to decide which points are exemplars and, for every other point, which exemplar it belongs to. The “responsibility” r(i,k), sent from data point i to candidate exemplar point k, reflects the accumulated evidence for how well-suited point k is to serve as the exemplar for point i, taking into account other potential exemplars for point i (Fig. 1B).

The “availability” a(i,k), sent from candidate exemplar point k to point i, reflects the accumulated evidence for how appropriate it would be for point i to choose point k as its exemplar, taking into account the support from other points that point k should be an exemplar (Fig. 1C). r(i,k) and a(i,k) can be viewed as log-probability ratios. To begin with, the availabilities are initialized to zero: a(i,k) = 0. Then, the responsibilities are computed using the rule
$r(i,k)\leftarrow s(i,k) - \max\limits_{k'\ \text{s.t.}\ k'\neq k}\{a(i,k')+s(i,k')\} \qquad (1)$

In the first iteration, because the availabilities are zero, $r(i,k)$ is set to the input similarity between point $i$ and point $k$ as its exemplar, minus the largest of the similarities between point $i$ and other candidate exemplars. This competitive update is data-driven and does not take into account how many other points favor each candidate exemplar.

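The responsibility rule (Eq. 1) can be sketched in vectorized form as follows. This is an illustrative implementation, not the authors' code: the function name `update_responsibilities` is an assumption, damping is left out, and the 3 × 3 similarity matrix (with preferences of −2 on the diagonal) is made up for the example. The trick is that the maximum over $k' \neq k$ equals the row maximum of $a + s$ everywhere except in the column that attains it, where the second-largest value is needed.

```python
import numpy as np

def update_responsibilities(S, A):
    """r(i,k) <- s(i,k) - max over k' != k of [a(i,k') + s(i,k')] (Eq. 1, undamped)."""
    AS = A + S
    rows = np.arange(S.shape[0])
    best = np.argmax(AS, axis=1)           # column holding the largest a + s per row
    first_max = AS[rows, best]
    AS[rows, best] = -np.inf               # mask it to find the runner-up
    second_max = AS.max(axis=1)
    R = S - first_max[:, None]             # correct for every k except k == best
    R[rows, best] = S[rows, best] - second_max
    return R

# Hypothetical similarities with a shared preference of -2 on the diagonal.
S = np.array([[-2.0, -1.0, -4.0],
              [-1.0, -2.0, -9.0],
              [-4.0, -9.0, -2.0]])
A = np.zeros_like(S)                       # first iteration: all availabilities are zero
R = update_responsibilities(S, A)
```

With zero availabilities, each entry is the similarity minus the largest competing similarity in the same row, as described for the first iteration.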

In later iterations, when some points are effectively assigned to other exemplars, their availabilities will drop below zero as prescribed by the update rule below.

These negative availabilities will decrease the effective values of some of the input similarities $s(i,k′)$ in the above rule, removing the corresponding candidate exemplars from competition. For $k = i$, the responsibility $r(k,k)$ is set to the input preference that point $k$ be chosen as an exemplar, $s(k,k)$, minus the largest of the similarities between point $i$ and all other candidate exemplars.

This “self-responsibility” reflects accumulated evidence that point $k$ is an exemplar, based on its input preference tempered by how ill-suited it is to be assigned to another exemplar.

Whereas the above responsibility update lets all candidate exemplars compete for ownership of a data point, the following availability update gathers evidence from data points as to whether each candidate exemplar would make a good exemplar:
$a(i,k)\leftarrow\min\Big\{0,\ r(k,k)+\sum\limits_{i'\ \text{s.t.}\ i'\notin\{i,k\}} \max\{0,r(i',k)\} \Big\} \qquad (2)$

The availability a(i,k) is set to the self responsibility r(k,k) plus the sum of the positive responsibilities candidate exemplar k receives from other points. Only the positive portions of incoming responsibilities are added, because it is only necessary for a good exemplar to explain some data points well (positive responsibilities), regardless of how poorly it explains other data points (negative responsibilities). If the self responsibility r(k,k) is negative (indicating that point k is currently better suited as belonging to another exemplar rather than being an exemplar itself), the availability of point k as an exemplar can be increased if some other points have positive responsibilities for point k being their exemplar.

To limit the influence of strong incoming positive responsibilities, the total sum is thresholded so that it cannot go above zero. The “self-availability” a(k,k) is updated differently:
$a(k,k)\leftarrow \sum\limits_{i'\ \text{s.t.}\ i'\neq k} \max\{0,r(i',k)\} \qquad (3)$
This message reflects accumulated evidence that point k is an exemplar, based on the positive responsibilities sent to candidate exemplar k from other points.
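The two availability rules (Eqs. 2 and 3) can be computed together from column sums of the positive responsibilities. The sketch below is illustrative and undamped; the function name `update_availabilities` and the example matrix `R` are assumptions made for this example.

```python
import numpy as np

def update_availabilities(R):
    """Apply Eq. 2 to off-diagonal entries and Eq. 3 to the diagonal (undamped)."""
    Rp = np.maximum(R, 0.0)
    np.fill_diagonal(Rp, R.diagonal())     # keep r(k,k) itself, not its positive part
    column_sums = Rp.sum(axis=0)           # r(k,k) + sum over i' != k of max(0, r(i',k))
    A = column_sums[None, :] - Rp          # remove the i' = i term from each sum
    self_avail = A.diagonal().copy()       # Eq. 3: sum over i' != k of max(0, r(i',k))
    A = np.minimum(A, 0.0)                 # Eq. 2: threshold at zero
    np.fill_diagonal(A, self_avail)        # the self-availability is not thresholded
    return A

# Hypothetical responsibilities for three points.
R = np.array([[-1.0, 1.0, -3.0],
              [1.0, -1.0, -8.0],
              [-2.0, -7.0, 2.0]])
A = update_availabilities(R)
```

For example, a(0,1) here is min{0, r(1,1) + max(0, r(2,1))} = min{0, −1 + 0} = −1, while the self-availability a(0,0) sums only the positive responsibilities sent to point 0, giving 1.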

The above update rules require only simple, local computations that are easily implemented (2), and messages need only be exchanged between pairs of points with known similarities. At any point during affinity propagation, availabilities and responsibilities can be combined to identify exemplars. For point i, the value of k that maximizes a(i,k) + r(i,k) either identifies point i as an exemplar if k = i, or identifies the data point that is the exemplar for point i.
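The decision rule in this paragraph amounts to a row-wise argmax over a(i,k) + r(i,k). A minimal sketch, with a hypothetical function name `assign_exemplars` and made-up message values:

```python
import numpy as np

def assign_exemplars(A, R):
    """For each point i, pick the k that maximizes a(i,k) + r(i,k)."""
    choice = np.argmax(A + R, axis=1)
    # Points that choose themselves are the current exemplars.
    exemplars = np.flatnonzero(choice == np.arange(len(choice)))
    return choice, exemplars

# Hypothetical message values for three points.
A = np.array([[2.0, -1.0, 0.0],
              [0.0, -3.0, 0.0],
              [0.0, 0.0, 1.0]])
R = np.array([[1.0, -2.0, -4.0],
              [3.0, -1.0, -5.0],
              [-6.0, -2.0, 2.0]])
choice, exemplars = assign_exemplars(A, R)
```

In this example points 0 and 2 select themselves, so they are exemplars, and point 1 is assigned to point 0.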

The message-passing procedure may be terminated after a fixed number of iterations, after changes in the messages fall below a threshold, or after the local decisions stay constant for some number of iterations. When updating the messages, it is important that they be damped to avoid numerical oscillations that arise in some circumstances.

Each message is set to $\lambda$ times its value from the previous iteration plus $1 - \lambda$ times its prescribed updated value, where the damping factor $\lambda$ is between 0 and 1. In all of our experiments (3), we used a default damping factor of $\lambda = 0.5$, and each iteration of affinity propagation consisted of (i) updating all responsibilities given the availabilities, (ii) updating all availabilities given the responsibilities, and (iii) combining availabilities and responsibilities to monitor the exemplar decisions and terminate the algorithm when these decisions did not change for 10 iterations.
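Putting steps (i) to (iii) together with damping gives a compact loop. The following Python sketch is an illustration under stated assumptions rather than the authors' implementation: the function name `affinity_propagation`, the `max_iter` cap, and the toy data are hypothetical, while the damping factor of 0.5 and the 10-iteration stability check follow the description above.

```python
import numpy as np

def affinity_propagation(S, damping=0.5, max_iter=200, stable_iters=10):
    """Damped message passing; S must already carry preferences on its diagonal."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    rows = np.arange(n)
    R = np.zeros((n, n))
    A = np.zeros((n, n))                     # availabilities start at zero
    last_choice, unchanged = None, 0
    for _ in range(max_iter):
        # (i) responsibilities from availabilities (Eq. 1), then damp
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1.0 - damping) * R_new
        # (ii) availabilities from responsibilities (Eqs. 2 and 3), then damp
        Rp = np.maximum(R, 0.0)
        np.fill_diagonal(Rp, R.diagonal())
        A_new = Rp.sum(axis=0)[None, :] - Rp
        self_avail = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0.0)
        np.fill_diagonal(A_new, self_avail)
        A = damping * A + (1.0 - damping) * A_new
        # (iii) monitor exemplar decisions and stop once they stay constant
        choice = (A + R).argmax(axis=1)
        if last_choice is not None and np.array_equal(choice, last_choice):
            unchanged += 1
            if unchanged >= stable_iters:
                break
        else:
            unchanged = 0
        last_choice = choice
    return choice

# Toy data (hypothetical): two groups of points on a line.
x = np.array([0.0, 0.3, 0.5, 10.0, 10.4, 10.5])
S = -(x[:, None] - x[None, :]) ** 2
off_diagonal = S[~np.eye(len(x), dtype=bool)]
np.fill_diagonal(S, np.median(off_diagonal))   # shared preference: median similarity
labels = affinity_propagation(S)               # labels[i] is the exemplar chosen by point i
```

A larger `damping` value slows the updates but can suppress oscillations; the loop simply returns the latest decisions if `max_iter` is reached first.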

Figure 1A shows the dynamics of affinity propagation applied to 25 two-dimensional data points (3), using negative squared error as the similarity.

One advantage of affinity propagation is that the number of exemplars need not be specified beforehand. Instead, the appropriate number of exemplars emerges from the message passing method and depends on the input exemplar preferences. This enables automatic model selection, based on a prior specification of how preferable each point is as an exemplar.

Figure 1D shows the effect of the value of the common input preference on the number of clusters. This relation is nearly identical to the relation found by exactly minimizing the squared error (2).

We next studied the problem of clustering images of faces using the standard optimization criterion of squared error.

We used both affinity propagation and k-centers clustering to identify exemplars among 900 grayscale images extracted from the Olivetti face database (3).

Affinity propagation found exemplars with much lower squared error than the best of 100 runs of k-centers clustering (Fig. 2A), which took about the same amount of computer time.

We asked whether a huge number of random restarts of k-centers clustering could achieve the same squared error. Figure 2B shows the error achieved by one run of affinity propagation and the distribution of errors achieved by 10,000 runs of k-centers clustering, plotted against the number of clusters.

Affinity propagation uniformly achieved much lower error in more than two orders of magnitude less time. Another popular optimization criterion is the sum of absolute pixel differences (which better tolerates outlying pixel intensities), so we repeated the above procedure using this error measure. Affinity propagation again uniformly achieved lower error (Fig. 2C).

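Fig. 2. Clustering faces. Exemplars minimizing the standard squared error measure of similarity were identified from 900 normalized face images (3). For a common preference of −600, affinity propagation found 62 clusters with an average squared error of 108. For comparison, the best of 100 runs of k-centers clustering with different random initializations achieved a worse average squared error of 119. (A) The 15 images with the highest squared error under either affinity propagation or k-centers clustering are shown in the top row. The middle and bottom rows show the exemplars assigned by the two methods, and the boxes show which of the two methods performed better for that image, in terms of squared error. Affinity propagation found higher-quality exemplars. (B) The average squared error achieved by a single run of affinity propagation and 10,000 runs of k-centers clustering, versus the number of clusters. The colored bands show different percentiles of squared error, and the number of exemplars corresponding to the result from (A) is indicated. (C) The above procedure was repeated using the sum of absolute errors as the measure of similarity, which is also a popular optimization criterion.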

Many tasks require the identification of exemplars among sparsely related data, i.e., where most similarities are either unknown or large and negative.

To examine affinity propagation in this context, we addressed the task of clustering putative exons to find genes, using the sparse similarity matrix derived from microarray data and reported in (4). In that work, 75,066 segments of DNA (60 bases long) corresponding to putative exons were mined from the genome of mouse chromosome 1. Their transcription levels were measured across 12 tissue samples, and the similarity between every pair of putative exons (data points) was computed. The measure of similarity between putative exons was based on their proximity in the genome and the degree of coordination of their transcription levels across the 12 tissues. To account for putative exons that are not exons (e.g., introns), we included an additional artificial exemplar and determined the similarity of each other data point to this “nonexon exemplar” using statistics taken over the entire data set. The resulting 75,067 × 75,067 similarity matrix (3) consisted of 99.73% similarities with values of −∞, corresponding to distant DNA segments that could not possibly be part of the same gene. We applied affinity propagation to this similarity matrix, but because messages need not be exchanged between point i and k if s(i,k) = −∞, each iteration of affinity propagation required exchanging messages between only a tiny subset (0.27% or 15 million) of data point pairs.

Figure 3A illustrates the identification of gene clusters and the assignment of some data points to the nonexon exemplar. The reconstruction errors for affinity propagation and k-centers clustering are compared in Fig. 3B. For each number of clusters, affinity propagation was run once and took 6 min, whereas k-centers clustering was run 10,000 times and took 208 hours. To address the question of how well these methods perform in detecting bona fide gene segments, Fig. 3C plots the true-positive (TP) rate against the false-positive (FP) rate, using the labels provided in the RefSeq database (5). Affinity propagation achieved significantly higher TP rates, especially at low FP rates, which are most important to biologists. At a FP rate of 3%, affinity propagation achieved a TP rate of 39%, whereas the best k-centers clustering result was 17%. For comparison, at the same FP rate, the best TP rate for hierarchical agglomerative clustering (2) was 19%, and the engineering tool described in (4), which accounts for additional biological knowledge, achieved a TP rate of 43%.

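Fig. 3. Detecting genes. Affinity propagation was used to detect putative exons (data points) comprising genes from mouse chromosome 1. Here, squared error is not appropriate as a measure of similarity; instead, similarity values were derived from a cost function measuring the proximity of the putative exons in the genome and the coexpression of the putative exons across 12 tissue samples (3). (A) A small portion of the data and the emergence of clusters during each iteration of affinity propagation are shown. In each picture, the 100 boxes outlined in black correspond to 100 data points, and the 12 colored blocks in each box indicate the transcription levels of the corresponding DNA segment in 12 tissue samples. The box on the far left corresponds to an artificial data point with infinite preference that is used to account for nonexon regions (e.g., introns). Lines connecting data points indicate potential assignments, where gray lines indicate assignments that currently have weak evidence and solid lines indicate assignments that currently have strong evidence. (B) Performance on minimizing the reconstruction error of genes, for different numbers of detected clusters. For each number of clusters, affinity propagation took 6 min, whereas 10,000 runs of k-centers clustering took 208 hours on the same computer. In each case, affinity propagation achieved a significantly lower reconstruction error than k-centers clustering. (C) A plot of true-positive rate versus false-positive rate for detecting exons [using labels from RefSeq (5)] shows that affinity propagation also performs better than k-centers clustering at detecting biologically verified exons.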

Affinity propagation’s ability to operate on the basis of nonstandard optimization criteria makes it suitable for exploratory data analysis using unusual measures of similarity. Unlike metric space clustering techniques such as k-means clustering (1), affinity propagation can be applied to problems where the data do not lie in a continuous space. Indeed, it can be applied to problems where the similarities are not symmetric [i.e., $s(i,k) \neq s(k,i)$] and to problems where the similarities do not satisfy the triangle inequality [i.e., $s(i,k) < s(i,j) + s(j,k)$].

To identify a small number of sentences in a draft of this manuscript that summarize other sentences, we treated each sentence as a “bag of words” (6) and computed the similarity of sentence i to sentence k based on the cost of encoding the words in sentence i using the words in sentence k. We found that 97% of the resulting similarities (2, 3) were not symmetric. The preferences were adjusted to identify (using $\lambda = 0.8$) different numbers of representative exemplar sentences (2), and the solution with four sentences is shown in Fig. 4A.

We also applied affinity propagation to explore the problem of identifying a restricted number of Canadian and American cities that are most easily accessible by large subsets of other cities, in terms of estimated commercial airline travel time. Each data point was a city, and the similarity s(i,k) was set to the negative time it takes to travel from city i to city k by airline, including estimated stopover delays (3). Due to headwinds, the transit time was in many cases different depending on the direction of travel, so that 36% of the similarities were asymmetric. Further, for 97% of city pairs i and k, there was a third city j such that the triangle inequality was violated, because the trip from i to k included a long stopover delay in city j so it took longer than the sum of the durations of the trips from i to j and j to k.

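Percentages like the ones quoted for asymmetric similarities and triangle-inequality violations can be measured directly on a similarity matrix. The sketch below shows one way to do so; the function names and the three-city travel-time matrix are hypothetical, and the exact counting convention used in the paper is not stated here, so treat the definitions in the comments as assumptions.

```python
import numpy as np

def asymmetric_fraction(S):
    """Fraction of unordered pairs {i, k} with s(i,k) != s(k,i)."""
    S = np.asarray(S, dtype=float)
    upper = np.triu_indices(S.shape[0], k=1)
    return np.mean(S[upper] != S.T[upper])

def triangle_violation_fraction(S):
    """Fraction of ordered pairs (i, k) with some j where s(i,k) < s(i,j) + s(j,k)."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    idx = np.arange(n)
    sums = S[:, :, None] + S[None, :, :]   # sums[i, j, k] = s(i,j) + s(j,k)
    sums[idx, idx, :] = -np.inf            # exclude j == i
    sums[:, idx, idx] = -np.inf            # exclude j == k
    via = sums.max(axis=1)                 # best two-leg route from i to k
    off_diagonal = ~np.eye(n, dtype=bool)
    return np.mean(S[off_diagonal] < via[off_diagonal])

# Hypothetical travel times (hours) between three cities; similarity = negative time.
T = np.array([[0.0, 1.0, 5.0],
              [2.0, 0.0, 1.0],
              [5.0, 1.0, 0.0]])
S = -T
frac_asym = asymmetric_fraction(S)
frac_tri = triangle_violation_fraction(S)
```

In this toy matrix only the pair of cities 0 and 1 is asymmetric, and the direct trips between cities 0 and 2 take longer than going through city 1, so both fractions come out to one third.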

When the number of “most accessible cities” was constrained to be seven (by adjusting the input preference appropriately), the cities shown in Fig. 4, B to E, were identified. It is interesting that several major cities were not selected, either because heavy international travel makes them inappropriate as easily accessible domestic destinations (e.g., New York City, Los Angeles) or because their neighborhoods can be more efficiently accessed through other destinations (e.g., Atlanta, Philadelphia, and Minneapolis account for Chicago’s destinations, while avoiding potential airport delays).

Affinity propagation can be viewed as a method that searches for minima of an energy function (7) that depends on a set of N hidden labels, $c_1, \ldots, c_N$, corresponding to the N data points. Each label indicates the exemplar to which the point belongs, so that $s(i,c_i)$ is the similarity of data point i to its exemplar. $c_i = i$ is a special case indicating that point i is itself an exemplar, so that $s(i,c_i)$ is the input preference for point i. Not all configurations of the labels are valid; a configuration c is valid when for every point i, if some other point i′ has chosen i as its exemplar (i.e., $c_{i'} = i$), then i must be an exemplar (i.e., $c_i = i$). The energy of a valid configuration is $E(c) = -\sum_{i=1}^{N} s(i,c_i)$. Exactly minimizing the energy is computationally intractable, because a special case of this minimization problem is the NP-hard k-median problem (8). However, the update rules for affinity propagation correspond to fixed-point recursions for minimizing a Bethe free-energy (9) approximation. Affinity propagation is most easily derived as an instance of the max-sum algorithm in a factor graph (10) describing the constraints on the labels and the energy function (2).
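The validity condition and the energy are easy to evaluate for a given label vector. In the sketch below, the function names `is_valid` and `energy` and the example matrix are assumptions for illustration; returning infinity for an invalid configuration is also an assumption made here for convenience, since the text only defines the energy of valid configurations.

```python
import numpy as np

def is_valid(c):
    """A configuration is valid if every chosen exemplar has chosen itself."""
    c = np.asarray(c)
    return bool(np.all(c[c] == c))

def energy(S, c):
    """E(c) = -sum_i s(i, c_i) for valid c; infinity otherwise (assumed convention)."""
    S = np.asarray(S, dtype=float)
    c = np.asarray(c)
    if not is_valid(c):
        return np.inf
    return -S[np.arange(len(c)), c].sum()

# Hypothetical similarities with preferences of -2 on the diagonal.
S = np.array([[-2.0, -1.0, -4.0],
              [-1.0, -2.0, -9.0],
              [-4.0, -9.0, -2.0]])
e_good = energy(S, [0, 0, 2])   # points 0 and 2 are exemplars, point 1 joins point 0
e_bad = energy(S, [1, 0, 2])    # invalid: point 0 picks 1, but point 1 picks 0
```

Here `e_good` is −(−2 − 1 − 2) = 5, the sum of two preferences and one similarity with the sign flipped.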

In some degenerate cases, the energy function may have multiple minima with corresponding multiple fixed points of the update rules, and these may prevent convergence. For example, if $s(1,2) = s(2,1)$ and $s(1,1) = s(2,2)$, then the solutions $c_1 = c_2 = 1$ and $c_1 = c_2 = 2$ both achieve the same energy. In this case, affinity propagation may oscillate, with both data points alternating between being exemplars and nonexemplars. In practice, we found that oscillations could always be avoided by adding a tiny amount of noise to the similarities to prevent degenerate situations, or by increasing the damping factor.
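One way to add such a tiny amount of noise is sketched below. The function name `add_tiebreak_noise`, the Gaussian noise, and the relative scale of 1e-12 are assumptions chosen for illustration; the text does not specify how the noise was generated.

```python
import numpy as np

def add_tiebreak_noise(S, relative_scale=1e-12, seed=0):
    """Perturb similarities slightly so that exact ties become unlikely."""
    S = np.asarray(S, dtype=float)
    finite = S[np.isfinite(S)]               # ignore -inf entries when sizing the noise
    span = finite.max() - finite.min() if finite.size else 0.0
    rng = np.random.default_rng(seed)
    scale = relative_scale * (span if span > 0 else 1.0)
    return S + scale * rng.standard_normal(S.shape)

# The degenerate two-point case: s(1,2) = s(2,1) and s(1,1) = s(2,2).
S = np.array([[-1.0, -2.0],
              [-2.0, -1.0]])
S_noisy = add_tiebreak_noise(S)
```

The perturbation is many orders of magnitude smaller than the similarities themselves, so it should break the tie without noticeably changing the energy of any configuration.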

Affinity propagation has several advantages over related techniques. Methods such as k-centers clustering (1), k-means clustering (1), and the expectation maximization (EM) algorithm (11) store a relatively small set of estimated cluster centers at each step. These techniques are improved upon by methods that begin with a large number of clusters and then prune them (12), but they still rely on random sampling and make hard pruning decisions that cannot be recovered from. In contrast, by simultaneously considering all data points as candidate centers and gradually identifying clusters, affinity propagation is able to avoid many of the poor solutions caused by unlucky initializations and hard decisions. Markov chain Monte Carlo techniques (13) randomly search for good solutions, but do not share affinity propagation’s advantage of considering many possible solutions all at once.

Hierarchical agglomerative clustering (14) and spectral clustering (15) solve the quite different problem of recursively comparing pairs of points to find partitions of the data. These techniques do not require that all points within a cluster be similar to a single center and are thus not well-suited to many tasks. In particular, two points that should not be in the same cluster may be grouped together by an unfortunate sequence of pairwise groupings.

In (8), it was shown that the related metric k-median problem could be relaxed to form a linear program with a constant factor approximation. There, the input was assumed to be metric, i.e., nonnegative, symmetric, and satisfying the triangle inequality. In contrast, affinity propagation can take as input general nonmetric similarities. Affinity propagation also provides a conceptually new approach that works well in practice. Whereas the linear programming relaxation is hard to solve and sophisticated software packages need to be applied (e.g., CPLEX), affinity propagation makes use of intuitive message updates that can be implemented in a few lines of code (2).

Affinity propagation is related in spirit to techniques recently used to obtain record-breaking results in quite different disciplines (16). The approach of recursively propagating messages (17) in a “loopy graph” has been used to approach Shannon’s limit in error-correcting decoding (18, 19), solve random satisfiability problems with an order-of-magnitude increase in size (20), solve instances of the NP-hard two-dimensional phase-unwrapping problem (21), and efficiently estimate depth from pairs of stereo images (22). Yet, to our knowledge, affinity propagation is the first method to make use of this idea to solve the age-old, fundamental problem of clustering data. Because of its simplicity, general applicability, and performance, we believe affinity propagation will prove to be of broad value in science and engineering.

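Fig. 4. Identifying key sentences and air-travel routing. Affinity propagation can be used to explore the identification of exemplars on the basis of nonstandard optimization criteria. (A) Similarities between pairs of sentences in a draft of this manuscript were constructed by matching words. Four exemplar sentences were identified by affinity propagation. (B) Affinity propagation was applied to similarities derived from air-travel efficiency (measured by estimated travel time) between the 456 busiest commercial airports in Canada and the United States. The travel times for both direct flights (shown in blue) and indirect flights (not shown), including the mean transfer time of up to a maximum of one stopover, were used as negative similarities (3). (C) Seven exemplars identified by affinity propagation are color-coded, and the assignments of other cities to these exemplars are shown. Cities located quite near to exemplar cities may be members of other, more distant exemplars because of the lack of direct flights between them (e.g., Atlantic City is 100 km from Philadelphia, but is closer in flight time to Atlanta). (D) The inset shows that the Canada-USA border roughly divides the Toronto and Philadelphia clusters, because domestic flights are more available than international flights. However, this is not the case on the west coast, as shown in (E), because extraordinarily frequent airline service between Vancouver and Seattle connects Canadian cities in the northwest to Seattle.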

First draft with many shortcomings; some parts are not yet up to standard and are being improved…
