softmax

The idea behind Logistic Regression is to separate the classes with linear decision boundaries in the log-odds $\log \frac{Pr(Y=i|x)}{Pr(Y=j|x)}$, while keeping the class probabilities summing to 1.

LR follows the Bayes decision rule: a sample is assigned to the class with the largest posterior probability:

$$
G(x) = \arg\max_k Pr(Y=k|x)
$$

Therefore the decision boundary between classes $i$ and $j$ is determined by

$$
Pr(Y=i|x) = Pr(Y=j|x)
$$

The probability of each class (taking class $K$ as the reference):

$$
\log \frac{P(Y=1|x)}{P(Y=K|x)} = w_1 \cdot x \\
\log \frac{P(Y=2|x)}{P(Y=K|x)} = w_2 \cdot x \\
\cdots \\
\log \frac{P(Y=K-1|x)}{P(Y=K|x)} = w_{K-1} \cdot x \\
P(Y=1|x) = P(Y=K|x) \exp(w_1 \cdot x) \\
P(Y=2|x) = P(Y=K|x) \exp(w_2 \cdot x) \\
\cdots \\
P(Y=K-1|x) = P(Y=K|x) \exp(w_{K-1} \cdot x) \\
P(Y=K|x) = \frac{1}{1 + \sum_{k=1}^{K-1} \exp(w_k \cdot x)} \\
P(Y=1|x) = \frac{\exp(w_1 \cdot x)}{1 + \sum_{k=1}^{K-1} \exp(w_k \cdot x)} \\
P(Y=2|x) = \frac{\exp(w_2 \cdot x)}{1 + \sum_{k=1}^{K-1} \exp(w_k \cdot x)}
$$

Throughout this derivation every class probability is expressed as a multiple of $P(Y=K|x)$. If instead we set

$$
P(Y=K|x) = \frac{\exp(w_K \cdot x)}{\sum_{j=1}^{K} \exp(w_j \cdot x)}
$$

then

$$
P(Y=1|x) = \frac{\exp(w_1 \cdot x)}{\sum_{j=1}^{K} \exp(w_j \cdot x)}, \quad
P(Y=2|x) = \frac{\exp(w_2 \cdot x)}{\sum_{j=1}^{K} \exp(w_j \cdot x)}, \quad \ldots
$$

This is exactly the softmax.

This feels a lot like the potential functions used in undirected graphical models: assign each class a potential, normalize the potentials, and then pick the class whose potential is largest.

So

$$
p_j = \frac{\exp(w_j \cdot x)}{\sum_{i=1}^{K} \exp(w_i \cdot x)}
$$
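To make this concrete, here is a minimal NumPy sketch (not part of the original notes) of the softmax above. Subtracting the maximum score before exponentiating leaves every $p_j$ unchanged, since numerator and denominator are scaled by the same factor, but it avoids the overflow problems discussed in the Overflow and Underflow page:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: shifting by max(z) does not change the
    result (numerator and denominator scale identically) but keeps exp()
    from overflowing."""
    z = np.asarray(z, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# scores w_j . x for K classes; the outputs sum to 1
print(softmax([2.0, 1.0, 0.1]))        # ~ [0.659, 0.242, 0.099]
print(softmax([1000.0, 1000.0, 0.0]))  # no overflow thanks to the max shift
```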

softmax loss

Classification models are generally estimated by (negative log) maximum likelihood; for binary classification, maximizing the likelihood is equivalent to minimizing the cross-entropy.

$$
L(f(x_i,\theta), y_i) = -\frac{1}{m} \log \prod_{i=1}^{m} \prod_{k=1}^{K} p_{i,k}^{1(y_i = k)} + \lambda \sum_{k} |\theta_k| \\
= -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} 1(y_i = k) \log \frac{\exp(f_k(x_i,\theta))}{\sum_{j} \exp(f_j(x_i,\theta))} + \lambda \sum_{k} |\theta_k|
$$

Here is a simplified version:

$$
L(x_i, \theta_k, y_i) = -\frac{1}{m} \log \prod_{i=1}^{m} \prod_{k=1}^{K} p_{i,k}^{1(y_i = k)} \\
= -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} 1(y_i = k) \log \frac{\exp(x_i \theta_k)}{\sum_{j} \exp(x_i \theta_j)}
$$

Differentiating with respect to $\theta_k$ gives:

$$
\frac{\partial L}{\partial \theta_k} = -\frac{1}{m} \sum_{i=1}^{m} x_i \left[ 1(y_i = k) - p(y_i = k \mid x_i; \theta_k) \right]
$$
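As an illustrative sketch (the names `softmax_loss_and_grad`, `Theta`, `X`, and `y` are hypothetical, not from the original notes), the simplified loss and its gradient can be computed as follows, with row $k$ of `Theta` playing the role of $\theta_k$:

```python
import numpy as np

def softmax_loss_and_grad(Theta, X, y):
    """Negative log-likelihood of softmax regression and its gradient.

    Theta : (K, d) weight matrix, row k corresponds to theta_k
    X     : (m, d) design matrix, row i corresponds to x_i
    y     : (m,)   integer labels in {0, ..., K-1}
    """
    m = X.shape[0]
    scores = X @ Theta.T                         # (m, K): x_i . theta_k
    scores -= scores.max(axis=1, keepdims=True)  # stability shift
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)            # p(y_i = k | x_i)

    loss = -np.log(P[np.arange(m), y]).mean()    # -1/m * sum_i log p_{i, y_i}

    Y = np.zeros_like(P)
    Y[np.arange(m), y] = 1.0                     # one-hot indicator 1(y_i = k)
    grad = -(Y - P).T @ X / m                    # (K, d), matches the formula above
    return loss, grad
```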

Vectorized representation

$$
h(\eta) = \frac{1}{\sum_k e^{\eta_k}} \begin{bmatrix} e^{\eta_1} \\ e^{\eta_2} \\ \vdots \\ e^{\eta_K} \end{bmatrix} \\
J = -\sum_{k=1}^{K} 1\{y=k\} \log(h_k) \\
\frac{\partial J}{\partial \eta} = h - e_y
$$

where $e_y$ is the one-hot column vector whose $y$-th entry is 1 and all other entries are 0.

Then, summing over all samples, the objective becomes:

$$
L = -\sum_{n=1}^{N} e_{y_n}^{\top} \log\big(h(\eta^{(n)})\big) \\
\frac{\partial L}{\partial \eta} = \sum_{n=1}^{N} \big(h(\eta^{(n)}) - e_{y_n}\big)
$$
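A minimal end-to-end sketch of the vectorized form, assuming $\eta = Wx$ for each sample and using toy data generated on the fly (all variable names here are hypothetical): the per-sample gradient is exactly $h - e_y$, and the gradient with respect to $W$ follows by the chain rule.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, K = 200, 5, 3                       # toy sizes: samples, features, classes
X = rng.normal(size=(N, d))
y = rng.integers(0, K, size=N)
E = np.eye(K)[y]                          # rows are the one-hot vectors e_y
W = np.zeros((K, d))                      # eta = W x for each sample

def probs(W):
    eta = X @ W.T                         # (N, K) score vectors eta
    eta -= eta.max(axis=1, keepdims=True) # stability shift, softmax unchanged
    H = np.exp(eta)
    return H / H.sum(axis=1, keepdims=True)

for step in range(500):
    H = probs(W)                          # h(eta) per row
    dW = (H - E).T @ X / N                # dL/deta = h - e_y, chain rule to W
    W -= 0.5 * dW                         # plain gradient descent

H = probs(W)
print("final loss:", -np.mean(np.log(H[np.arange(N), y])))
```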

Recommended reading


探究最陌生的老朋友Softmax
Softmax vs. Softmax-Loss: Numerical Stability
softmax回归
Softmax回归
Caffe Softmax层的实现原理
Softmax与交叉熵的数学意义
ArcFace,CosFace,SphereFace,三种人脸识别算法的损失函数的设计