Machine Learning
  • Introduction
  • man
  • Linear model
    • Linear Regression
    • Generalized Linear Models
    • Nonlinear regression
  • bayes
    • bayesian network
    • Variational Bayesian inference
    • Gaussian Process Regression
  • Logistic Regression
    • L1 regularization
    • L2 regularization
    • softmax
    • Overflow and Underflow
  • SVM
    • C-SVM
    • Solving C-SVM
  • EM
    • GMM
  • Maximum Entropy
    • IIS
  • HMM
    • viterbi algorithm
  • CRF
  • Random Forest
    • bagging
    • random forest
  • boosting
    • catboost
    • gradient boosting
    • Newton Boosting
    • online boosting
    • gcForest
    • Mixture models
    • XGBoost
    • lightGBM
    • SecureBoost
  • LDA
  • rank
    • RankNet
    • LambdaRank
    • SimRank
  • Factorization Machine
    • Field-aware Factorization Machine
    • xdeepFM
  • Clustering
    • BIRCH
    • Deep Embedding Clustering
  • Kalman filtering
  • word2vec
  • Association Rule Mining
  • MATH-Mathematical Analysis
    • measure
  • MATH-probability
    • Variational Inference
    • Dirichlet distribution
    • Gibbs Sampling
    • Maximum entropy probability distribution
    • Conjugate prior
    • Gaussian Process
    • Markov process
    • Poisson process
    • measure
    • Gumbel
  • MATH-Linear Algebra
    • SVD
    • SVD for Recommendation
    • PCA
    • Linear Discriminant Analysis
    • Nonnegative Matrix Factorization
  • MATH-Convex optimization
    • Gradient Descent
    • Stochastic Gradient Descent
    • Newton's Method
    • L-BFGS
    • Steepest Descent
    • Coordinate Descent
    • OWL-QN
    • Dual Problem
    • Barrier Method
    • Primal-Dual Interior-Point Method
    • ISTA
    • ADMM
    • SAG
  • MATH-Miscellaneous Notes
    • cost function
    • Learning Theory
    • sampling
    • Entropy
    • variational inference
    • basis function
    • Diffie–Hellman key exchange
    • wavelet transform
    • Graphs
    • Portfolio
    • Kelly Criterion
  • ML-Miscellaneous Notes
    • Features
    • test
    • TF-IDF
    • population stability index
    • Shapley Values
  • Lecture Slides
    • Evolution of the xgboost algorithm
  • Time Series
  • PID
  • graph
    • SimRank
    • community detection
    • FRAUDAR
    • Anti-Trust Rank
    • Struc2Vec
    • graph theory
    • GNN
  • Anomaly Detection
    • Isolation Forest
    • Time Series
  • Dimensionality Reduction
    • Deep Embedded Clustering
  • Federated Learning
  • automl
  • Look-alike
  • KNN
  • causal inference

Steepest Descent Method

At each step, gradient descent solves:

$$\arg\min_d \{\, \nabla f(x^k)^T d \ \ \text{s.t.}\ \|d\| = 1 \,\} = -\arg\max_d \{\, |\nabla f(x^k)^T d| \ \ \text{s.t.}\ \|d\| = 1 \,\}$$
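
Under the Euclidean (L2) norm, for example, the minimizer is the normalized negative gradient, which recovers ordinary gradient descent:

$$d^k = -\frac{\nabla f(x^k)}{\|\nabla f(x^k)\|_2}$$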

Different norms yield different steepest-descent directions. Under the L1 norm, for example:

$$d^k = -\operatorname{sign}\!\left(\frac{\partial f(x)}{\partial x_i}\right) e_i, \qquad i = \arg\max_j \left|\frac{\partial f(x)}{\partial x_j}\right|$$
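
A minimal numpy sketch of this selection rule; `l1_steepest_direction` and the toy quadratic are hypothetical names chosen for illustration, not part of the original notes:

```python
import numpy as np

def l1_steepest_direction(grad):
    """Steepest-descent direction under the constraint ||d||_1 = 1:
    a signed unit vector along the largest-magnitude gradient component."""
    i = np.argmax(np.abs(grad))        # i = argmax_j |df/dx_j|
    d = np.zeros_like(grad)
    d[i] = -np.sign(grad[i])           # d^k = -sign(df/dx_i) * e_i
    return d

# Toy check: f(x) = x1^2 + 10*x2^2, so grad f = (2*x1, 20*x2)
g = np.array([2.0, 20.0])              # gradient at x = (1, 1)
d = l1_steepest_direction(g)
print(d)                               # [ 0. -1.]
print(g @ d)                           # -20.0 = -||g||_inf < 0: a descent direction
```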

Verifying that this is a descent direction:

Assuming $\nabla f(x^k) \neq 0$, substituting $d^k$ directly gives

$$\nabla f(x^k)^T d^k = -\operatorname{sign}\!\left(\frac{\partial f(x)}{\partial x_i}\right)\frac{\partial f(x)}{\partial x_i} = -\left|\frac{\partial f(x)}{\partial x_i}\right| = -\|\nabla f(x^k)\|_\infty < 0$$

Proof:

By Hölder's inequality,

$$|a \cdot b| = |a_1 b_1 + \ldots + a_n b_n| \le |a_1 b_1| + \ldots + |a_n b_n| \le \big(\max_i |a_i|\big)\,(|b_1| + \ldots + |b_n|) = \|a\|_\infty \|b\|_1$$

where $\|a\|_\infty = \max_i |a_i|$. Hence for any $d$ with $\|d\|_1 = 1$,

$$|\nabla f(x^k)^T d| \le \|\nabla f(x^k)\|_\infty \|d\|_1 = \|\nabla f(x^k)\|_\infty,$$

and $d^k$ attains this bound, so it is the steepest-descent direction under the L1 norm.
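
A quick randomized check of the bound, and of the fact that $d^k$ attains it (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=5)                     # stand-in gradient, playing the role of a

# Hölder bound: |g . d| <= ||g||_inf * ||d||_1  (= ||g||_inf when ||d||_1 = 1)
for _ in range(1000):
    d = rng.normal(size=5)
    d /= np.abs(d).sum()                   # normalize so ||d||_1 = 1
    assert abs(g @ d) <= np.max(np.abs(g)) + 1e-12

# The signed coordinate vector d^k attains the bound with a negative sign
i = np.argmax(np.abs(g))
dk = np.zeros(5)
dk[i] = -np.sign(g[i])
print(g @ dk, -np.max(np.abs(g)))          # both equal -||g||_inf
```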

Algorithm

Intuitively, L1-norm steepest descent moves along the single coordinate in which the function changes fastest (a sketch follows below). This naturally leads to coordinate descent and block coordinate descent, introduced next.
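
A minimal sketch of the full iteration with a fixed step size; the function name, toy objective, and parameters are assumptions made for illustration, not from the original notes:

```python
import numpy as np

def steepest_descent_l1(grad_f, x0, lr=0.05, tol=1e-6, max_iter=10_000):
    """L1-norm steepest descent: each step moves along the single coordinate
    with the largest gradient magnitude, i.e. a form of coordinate descent."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_iter):
        g = grad_f(x)
        if np.max(np.abs(g)) < tol:        # near-stationary: ||grad||_inf is small
            break
        i = np.argmax(np.abs(g))           # fastest-changing coordinate
        x[i] -= lr * np.sign(g[i])         # unit-L1-norm step along e_i, scaled by lr
    return x

# Toy quadratic f(x) = x1^2 + 10*x2^2 with gradient (2*x1, 20*x2)
x = steepest_descent_l1(lambda x: np.array([2 * x[0], 20 * x[1]]), x0=[3.0, -2.0])
print(x)   # ends within ~lr of the minimizer (0, 0) under the fixed step
```

With a fixed step the iterates only reach an $O(\text{lr})$ neighborhood of the minimizer; a line search or a decaying step size is needed for exact convergence.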


A tedious derivation of the steepest descent method:

https://zhuanlan.zhihu.com/p/23799012