softmax

The aim of Logistic Regression is to separate the classes with linear decision surfaces in the log-odds $\log \frac{\Pr(Y=i|x)}{\Pr(Y=j|x)}$, while keeping the class probabilities summing to 1.

LR follows the Bayes decision rule: a sample is assigned to the class with the largest posterior probability:

$$
G(x) = \arg\max_k \Pr(Y=k|x)
$$

Therefore the decision boundary between classes $i$ and $j$ is determined by

$$
\Pr(Y=i|x) = \Pr(Y=j|x)
$$
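A one-line check that this boundary is indeed linear (using the per-class weights $w_i$, $w_j$ introduced below):

$$
\Pr(Y=i|x) = \Pr(Y=j|x) \iff \log \frac{\Pr(Y=i|x)}{\Pr(Y=j|x)} = (w_i - w_j)\cdot x = 0,
$$

i.e. the $i$-vs-$j$ boundary is the hyperplane $(w_i - w_j)\cdot x = 0$.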

The class probabilities:

$$
\begin{aligned}
\log \frac{P(Y=1|x)}{P(Y=K|x)} &= w_1 \cdot x \\
\log \frac{P(Y=2|x)}{P(Y=K|x)} &= w_2 \cdot x \\
&\cdots \\
\log \frac{P(Y=K-1|x)}{P(Y=K|x)} &= w_{K-1} \cdot x \\
P(Y=1|x) &= P(Y=K|x) \exp(w_1 \cdot x) \\
P(Y=2|x) &= P(Y=K|x) \exp(w_2 \cdot x) \\
&\cdots \\
P(Y=K-1|x) &= P(Y=K|x) \exp(w_{K-1} \cdot x) \\
P(Y=K|x) &= \frac{1}{1+\sum_{j=1}^{K-1} \exp(w_j \cdot x)} \\
P(Y=1|x) &= \frac{\exp(w_1 \cdot x)}{1+\sum_{j=1}^{K-1} \exp(w_j \cdot x)} \\
P(Y=2|x) &= \frac{\exp(w_2 \cdot x)}{1+\sum_{j=1}^{K-1} \exp(w_j \cdot x)}
\end{aligned}
$$

In this derivation every probability is expressed as a multiple of $P(Y=K|x)$. If instead we take

$$
P(Y=k|x) = \frac{\exp(w_k \cdot x)}{\sum_{j=1}^{K} \exp(w_j \cdot x)}
$$

then

$$
P(Y=1|x) = \frac{\exp(w_1 \cdot x)}{\sum_{j=1}^{K} \exp(w_j \cdot x)} \qquad
P(Y=2|x) = \frac{\exp(w_2 \cdot x)}{\sum_{j=1}^{K} \exp(w_j \cdot x)}
$$

and this is exactly the softmax.
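The two parameterizations agree because softmax is unchanged when the same vector $c$ is subtracted from every $w_k$ (a quick check):

$$
\frac{\exp((w_k - c)\cdot x)}{\sum_{j=1}^{K}\exp((w_j - c)\cdot x)}
= \frac{\exp(w_k\cdot x)\,e^{-c\cdot x}}{e^{-c\cdot x}\sum_{j=1}^{K}\exp(w_j\cdot x)}
= \frac{\exp(w_k\cdot x)}{\sum_{j=1}^{K}\exp(w_j\cdot x)}
$$

so choosing $c = w_K$ sets the last class's score to zero and recovers the logistic-regression form with the $1+\sum_{j=1}^{K-1}\exp(w_j\cdot x)$ denominator.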

This feels like the potential functions commonly used in undirected graphical models: assign each class a potential, normalize, and then pick the class with the largest potential.

So

$$
p_j = \frac{\exp(w_j \cdot x)}{\sum_{i=1}^{K} \exp(w_i \cdot x)}
$$
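A minimal NumPy sketch of this formula (function name and shapes are illustrative); subtracting the per-row maximum before exponentiating is the usual trick against overflow and, by the shift invariance noted above, does not change the result:

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax: scores has shape (m, K), one entry w_k . x per class."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)           # p_j = exp(w_j.x) / sum_i exp(w_i.x)

# example: 2 samples, 3 classes; each output row sums to 1
scores = np.array([[2.0, 1.0, 0.1],
                   [1.0, 3.0, 0.2]])
print(softmax(scores))
```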

softmax loss

Classification models are generally fit by (negative log) maximum likelihood; for binary classification, maximum likelihood is exactly the cross-entropy.

$$
\begin{aligned}
L(f(x_i,\theta),y_i) &= -\frac{1}{m} \log \prod_{i=1}^m \prod_{k=1}^K p_{i,k}^{1(y_i = k)} + \lambda \sum_{k=1}^{K} |\theta_k| \\
&= -\frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K 1(y_i = k) \log \frac{\exp(f_k(x_i,\theta))}{\sum_j \exp(f_j(x_i,\theta))} + \lambda \sum_{k=1}^{K} |\theta_k|
\end{aligned}
$$

A simplified version (dropping the regularizer):

$$
\begin{aligned}
L(x_i,\theta,y_i) &= -\frac{1}{m} \log \prod_{i=1}^m \prod_{k=1}^K p_{i,k}^{1(y_i = k)} \\
&= -\frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K 1(y_i = k) \log \frac{\exp(x_i \theta_k)}{\sum_j \exp(x_i \theta_j)}
\end{aligned}
$$

Taking the derivative with respect to $\theta_k$:

$$
\frac{\partial L}{\partial \theta_k} = -\frac{1}{m} \sum_{i=1}^m x_i \left[ 1(y_i = k) - p(y_i = k \mid x_i; \theta_k) \right]
$$
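A small NumPy sketch of the simplified loss and this gradient, with samples as rows of `X` and one weight column per class (the function name and shapes are assumptions for illustration):

```python
import numpy as np

def softmax_loss_grad(X, y, theta):
    """X: (m, d) samples, y: (m,) labels in 0..K-1, theta: (d, K) weights.
    Returns the averaged negative log-likelihood and dL/dtheta."""
    m = X.shape[0]
    scores = X @ theta                                    # x_i . theta_k for every i, k
    scores -= scores.max(axis=1, keepdims=True)           # numerical stability
    p = np.exp(scores)
    p /= p.sum(axis=1, keepdims=True)                     # p(y_i = k | x_i; theta)
    loss = -np.log(p[np.arange(m), y]).mean()             # -(1/m) sum_i log p_{i, y_i}
    indicator = np.zeros_like(p)
    indicator[np.arange(m), y] = 1.0                      # 1(y_i = k)
    grad = -(X.T @ (indicator - p)) / m                   # column k = -(1/m) sum_i x_i [1(y_i=k) - p_ik]
    return loss, grad
```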

Vectorized form

$$
h(\eta) = \frac{1}{\sum_k e^{\eta_k}} \begin{bmatrix} e^{\eta_1} \\ e^{\eta_2} \\ \vdots \\ e^{\eta_K} \end{bmatrix} \qquad
J = -\sum_{k=1}^K 1\{y=k\} \log(h_k) \qquad
\frac{\partial J}{\partial \eta} = h - e_y
$$

where $e_y$ is the column vector whose $y$-th element is 1 and all other elements are 0.

Then, over all samples, the objective and its per-sample gradient are:

$$
L = -\sum_{n=1}^N e_{y_n}^\top \log h(\eta_n) \qquad
\frac{\partial L}{\partial \eta_n} = h(\eta_n) - e_{y_n}
$$
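In code, the per-sample gradient $h - e_y$ is just the probability matrix minus a one-hot matrix; a brief sketch of the backward pass through the softmax layer (variable and function names are assumptions):

```python
import numpy as np

def softmax_backward(eta, y):
    """eta: (N, K) scores, y: (N,) integer labels. Returns dL/deta, one row h - e_y per sample."""
    h = np.exp(eta - eta.max(axis=1, keepdims=True))
    h /= h.sum(axis=1, keepdims=True)                     # h(eta_n) for every sample
    e_y = np.zeros_like(h)
    e_y[np.arange(len(y)), y] = 1.0                       # one-hot e_{y_n}
    return h - e_y                                        # dL/deta_n = h(eta_n) - e_{y_n}
```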

Further reading

探究最陌生的老朋友Softmax

Softmax vs. Softmax-Loss: Numerical Stability

softmax回归

Softmax回归

Caffe Softmax层的实现原理

Softmax与交叉熵的数学意义

ArcFace,CosFace,SphereFace,三种人脸识别算法的损失函数的设计
