# Generalized Linear Models

## Natural exponential family

Natural exponential family, see [Exponential family](https://en.wikipedia.org/wiki/Exponential_family): if a probability distribution can be written as $$p(y;\eta) = b(y) \exp(\eta^T T(y) - a(\eta))$$, then y is said to follow an exponential family distribution.\
The Wikipedia article is worth reading closely, since it covers a lot of related material.

The exponential family includes:

* [Normal distribution](https://en.wikipedia.org/wiki/Normal_distribution), including the multivariate normal distribution
* [Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution) (models 0/1 outcomes) and [Categorical distribution](https://en.wikipedia.org/wiki/Categorical_distribution) (models events with k possible outcomes)
* [Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution) (models counting processes)
* [Gamma distribution](https://en.wikipedia.org/wiki/Gamma_distribution) and [Exponential distribution](https://en.wikipedia.org/wiki/Exponential_distribution) (model real-valued intervals between events)
* [Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution) (models fractions, i.e. values between 0 and 1)
* [Dirichlet distribution](https://en.wikipedia.org/wiki/Dirichlet_distribution) (models probability distributions)
* [Wishart distribution](https://en.wikipedia.org/wiki/Wishart_distribution) (a distribution over covariance matrices)

### Normal distribution

$$
N(x;\mu,\sigma) = \frac {1}{\sqrt {2\pi} \sigma} \exp(-\frac {(x-\mu)^2}{2\sigma^2})
$$
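
As a worked check (not part of the original notes), the normal density can be rearranged into the exponential family form $$b(x) \exp(\eta^T T(x) - a(\eta))$$. Expanding the square in the exponent and moving $$\sigma$$ into the exponent gives:

$$
N(x;\mu,\sigma) = \frac {1}{\sqrt {2\pi}} \exp \left( \frac {\mu}{\sigma^2} x - \frac {1}{2\sigma^2} x^2 - \frac {\mu^2}{2\sigma^2} - \ln \sigma \right) \\
\eta = \left( \frac {\mu}{\sigma^2}, \; -\frac {1}{2\sigma^2} \right)^T, \quad T(x) = (x, \; x^2)^T, \quad b(x) = \frac {1}{\sqrt {2\pi}} \\
a(\eta) = \frac {\mu^2}{2\sigma^2} + \ln \sigma = -\frac {\eta_1^2}{4\eta_2} - \frac {1}{2} \ln(-2\eta_2)
$$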

### Bernoulli distribution

$$
p(y|p) = p^y(1-p)^{1-y} = \exp (y \log \frac {p}{1-p} + \log (1-p)) \\
\text{link function:} \eta(p) = \log \frac {p}{1-p} \\
\text{response function:} p = \frac {1}{1+e^{-\eta}}
$$
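
The identities above can be checked numerically. The following is a minimal sketch in plain Python; the function names are illustrative choices, not taken from any library. It assumes $$b(y) = 1$$, $$T(y) = y$$ and $$a(\eta) = \log(1 + e^{\eta})$$, which equals $$-\log(1-p)$$.

```python
import math

def logit(p):
    """Link function: map a probability p in (0, 1) to the natural parameter eta."""
    return math.log(p / (1.0 - p))

def sigmoid(eta):
    """Response function: map the natural parameter eta back to p."""
    return 1.0 / (1.0 + math.exp(-eta))

def bernoulli_pmf(y, p):
    """Standard form p^y (1 - p)^(1 - y) for y in {0, 1}."""
    return p ** y * (1.0 - p) ** (1 - y)

def bernoulli_exp_family(y, p):
    """Exponential family form exp(eta * y - a(eta)) with b(y) = 1."""
    eta = logit(p)
    a = math.log(1.0 + math.exp(eta))  # log-partition, equal to -log(1 - p)
    return math.exp(eta * y - a)

for p in (0.1, 0.5, 0.9):
    for y in (0, 1):
        print(p, y, bernoulli_pmf(y, p), bernoulli_exp_family(y, p))
```

Both columns should agree up to floating-point error, and `sigmoid(logit(p))` should return `p`.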

### Poisson distribution

$$
p(x|\lambda) = \frac {\lambda^x}{x!} e^{-\lambda} = \exp(x \ln \lambda-\lambda) \frac {1}{x!}
$$
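
Reading off the exponential form gives $$\eta = \ln \lambda$$, $$T(x) = x$$, $$a(\eta) = e^{\eta}$$ and $$b(x) = 1/x!$$. The sketch below (plain Python, illustrative names) compares the two forms and also checks, by a finite difference, that the derivative of the log-partition $$a(\eta)$$ recovers the mean $$\lambda$$. The step size `h` is an arbitrary choice for this illustration.

```python
import math

def poisson_pmf(x, lam):
    """Standard form lambda^x e^(-lambda) / x!."""
    return lam ** x / math.factorial(x) * math.exp(-lam)

def log_partition(eta):
    """a(eta) = e^eta for the Poisson distribution."""
    return math.exp(eta)

def poisson_exp_family(x, lam):
    """Exponential family form b(x) exp(eta * x - a(eta)) with b(x) = 1 / x!."""
    eta = math.log(lam)
    return math.exp(eta * x - log_partition(eta)) / math.factorial(x)

def mean_from_log_partition(lam, h=1e-6):
    """Central finite difference of a(eta) at eta = ln(lambda); approximates E[x]."""
    eta = math.log(lam)
    return (log_partition(eta + h) - log_partition(eta - h)) / (2.0 * h)

for x in range(4):
    print(x, poisson_pmf(x, 3.5), poisson_exp_family(x, 3.5))
print(mean_from_log_partition(3.5))
```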

### Student's t-distribution

$$
T(x|\mu,\sigma,\nu) \propto \left[ 1+\frac {1}{\nu} \left( \frac {x-\mu} {\sigma} \right)^2 \right]^{- \frac {\nu+1}{2}}
$$

[Student's t-distribution (Chinese Wikipedia)](https://zh.wikipedia.org/wiki/%E5%AD%A6%E7%94%9Ft-%E5%88%86%E5%B8%83)

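> This distribution resembles the Gaussian in form and makes up for one of its weaknesses: the Gaussian is very sensitive to outliers, whereas the Student t-distribution is more robust. A common choice is ν=4, which performs well on most practical problems. Once ν is 5 or larger, the distribution loses its robustness and rapidly converges to the Gaussian.\
> In particular, when ν=1 it is known as the Cauchy distribution.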
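
The robustness claim can be made concrete by comparing log-densities at an outlier. This is a minimal sketch using only the standard library; the function names are illustrative, and the normalizing constant $$\Gamma(\frac{\nu+1}{2}) / (\Gamma(\frac{\nu}{2}) \sqrt{\nu\pi} \sigma)$$ is the standard one for the location-scale Student t density.

```python
import math

def normal_logpdf(x, mu=0.0, sigma=1.0):
    """Log-density of the Gaussian N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def student_t_logpdf(x, mu=0.0, sigma=1.0, nu=4.0):
    """Log-density of the location-scale Student t with nu degrees of freedom."""
    z = (x - mu) / sigma
    log_norm = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
                - 0.5 * math.log(nu * math.pi) - math.log(sigma))
    return log_norm - (nu + 1.0) / 2.0 * math.log1p(z * z / nu)

# The outlier x = 10 is penalized far less under the t-distribution,
# so a single extreme point pulls a fitted mean much less strongly.
for x in (0.0, 2.0, 10.0):
    print(x, normal_logpdf(x), student_t_logpdf(x, nu=4.0))
```

Setting `nu=1.0` reproduces the Cauchy density, and a very large `nu` gives values close to `normal_logpdf`.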

Gamma distribution: [How should one understand the gamma distribution? (Zhihu)](https://www.zhihu.com/question/34866983)

**Why introduce the exponential family?** MLAPP, page 313

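* In theory, every distribution in the exponential family has a conjugate prior.
* Once a distribution is written in exponential form, it is the maximum-entropy distribution under its own set of constraints.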
* It can be shown that, under certain regularity conditions, the exponential family is the only family of distributions with finite-sized sufficient statistics, meaning that we can compress the data into a fixed-sized summary without loss of information. This is particularly useful for online learning, as we will see later.
* The exponential family is the only family of distributions for which **conjugate priors** exist, which simplifies the computation of the posterior (see Section 9.2.5).
* The exponential family can be shown to be the family of distributions that makes the least set of assumptions subject to some user-chosen constraints (see Section 9.2.6).
* The exponential family is at the core of **generalized linear models**, as discussed in Section 9.3.
* The exponential family is at the core of **variational inference**, as discussed in Section 21.2.
* **Maximum entropy for an exponential family distribution** is equivalent to **maximum likelihood in its exponential form**.
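
The point about fixed-size sufficient statistics can be illustrated with the Gaussian, whose sufficient statistics are the sums of $$x$$ and $$x^2$$ together with the count. The class below is a hypothetical sketch written for this note, not code from MLAPP. It keeps three numbers regardless of how many observations arrive, yet recovers the same maximum likelihood estimates as a batch computation.

```python
class GaussianSuffStats:
    """Running sufficient statistics (n, sum of x, sum of x^2) for a 1-D Gaussian."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_x2 = 0.0

    def update(self, x):
        """Absorb one observation; memory use stays constant."""
        self.n += 1
        self.sum_x += x
        self.sum_x2 += x * x

    def mle(self):
        """Maximum likelihood mean and (biased) variance from the summary alone."""
        mean = self.sum_x / self.n
        var = self.sum_x2 / self.n - mean * mean
        return mean, var

stats = GaussianSuffStats()
for x in [1.0, 2.0, 3.0, 4.0]:
    stats.update(x)  # online: each point is seen once and then discarded
print(stats.mle())
```

For numerically demanding data a more stable running-variance update would be preferable; the direct sums are used here only because they match the sufficient statistics literally.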

## Generalized linear models

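Generalized linear models were introduced to overcome the shortcomings of the linear regression model, and they generalize it.\
First, the **independent variables may be discrete** as well as continuous. A discrete variable can be a 0-1 variable or one that takes several values.\
Compared with the linear regression model, the generalizations are: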

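* The random error term need not be normally distributed. It can follow a binomial, Poisson, negative binomial, normal, gamma, inverse Gaussian, or similar distribution; these are collectively referred to as the exponential family.
* A link function g(⋅) is introduced. The dependent variable is tied to the independent variables through the link function, i.e. g(μ)=Xβ, where μ=E(Y) is the mean of the dependent variable; the link function must be monotonic and differentiable. Common link functions include the identity $$\mu=X\beta$$, the log $$\ln \mu=X\beta$$, the power $$\mu^k=X\beta$$, the square root $$\sqrt {\mu}=X\beta$$, and the logit $$\operatorname{logit}(\mu) = \ln(\frac {\mu}{1-\mu}) = X\beta$$.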

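Different models can be chosen freely to suit the data. The familiar logit model, for example, is the model obtained by using the logit link with a binomially distributed random error term.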
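
A small sketch of how link functions pair with their inverses, assuming the usual convention that the link maps the mean of the dependent variable to the linear predictor Xβ. The `LINKS` table and `predict_mean` helper are hypothetical names introduced for illustration only.

```python
import math

# Each entry is (link g, inverse link g^-1): g maps the mean to the linear
# predictor, and g^-1 maps the linear predictor back to the mean.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "sqrt": (math.sqrt, lambda eta: eta ** 2),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
}

def predict_mean(link_name, x, beta):
    """Compute the predicted mean g^-1(x . beta) for one row of inputs x."""
    _, inverse_link = LINKS[link_name]
    eta = sum(b * v for b, v in zip(beta, x))  # linear predictor
    return inverse_link(eta)

print(predict_mean("log", [1.0, 2.0], [0.0, 0.5]))    # exp(1.0)
print(predict_mean("logit", [1.0, 0.0], [0.0, 1.0]))  # sigmoid(0.0)
```

Choosing the link by name keeps the linear part of the model unchanged while swapping how it is mapped to the mean, which is the flexibility the paragraph above describes.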

### Three assumptions

* y | x; θ ∼ ExponentialFamily(η). That is, given x and θ, the distribution of y follows some exponential family distribution, with parameter η.
* Given x, our goal is to predict the expected value of T(y) given x. In most of our examples, we will have T(y) = y, so this means we would like the prediction h(x) output by our learned hypothesis h to satisfy h(x) = E\[y|x]. (Note that this assumption is satisfied in the choices for hθ(x) for both logistic regression and linear regression. For instance, in logistic regression, we had hθ(x) = p(y = 1|x; θ) = 0 · p(y = 0|x; θ) + 1 · p(y = 1|x; θ) = E\[y|x; θ].)
* The natural parameter η and the inputs x are related linearly: $$\eta = \theta^T x$$. (Or, if η is vector-valued, then $$\eta_i = \theta_i^T x$$.)

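Which generalized linear model results depends on the distribution chosen. Choosing the normal distribution gives the **least squares model**; choosing the Bernoulli distribution gives the **logistic model**. The parameters of the linear part are then estimated with gradient descent or a similar method.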
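
The following sketch shows how one fitting loop can cover both cases. It assumes the canonical link, for which the gradient of the log-likelihood with respect to θ is the sum of $$(y_i - h_\theta(x_i)) x_i$$ for the normal (unit variance) and Bernoulli models alike. The function `fit_glm`, the learning rate, the step count, and the toy data are all illustrative assumptions, not part of the original notes.

```python
import math

def identity(eta):
    """Response function for the normal model (least squares)."""
    return eta

def sigmoid(eta):
    """Response function for the Bernoulli model (logistic regression)."""
    return 1.0 / (1.0 + math.exp(-eta))

def fit_glm(X, y, response, lr=0.1, steps=2000):
    """Batch gradient ascent on the log-likelihood of a canonical-link GLM."""
    theta = [0.0] * len(X[0])
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            eta = sum(t * v for t, v in zip(theta, xi))  # eta = theta^T x
            err = yi - response(eta)                     # y - h_theta(x)
            for j, v in enumerate(xi):
                grad[j] += err * v
        theta = [t + lr * g / n for t, g in zip(theta, grad)]
    return theta

# Toy data; the first column of X is a constant 1 for the intercept.
X_lin = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y_lin = [1.0, 3.0, 5.0, 7.0]  # exactly y = 1 + 2x
print(fit_glm(X_lin, y_lin, identity))

X_log = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y_log = [0.0, 0.0, 1.0, 0.0, 1.0]  # not linearly separable, so the fit is finite
print(fit_glm(X_log, y_log, sigmoid))
```

Only the `response` argument changes between the two fits, which mirrors the statement that the choice of distribution determines the model while the linear part is estimated the same way.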

### Modeling

![](https://2270971654-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-M7DcNFhVrwIk3Tks_pB%2Fsync%2F3b0b6b7522e7302e143a72b8645727fa479baa19.png?generation=1589383930237130\&alt=media)

### Recommended reading

[Generalized linear models](https://mp.weixin.qq.com/s?__biz=MzA4NDEyMzc2Mw==\&mid=2649677334\&idx=3\&sn=9fccb5c53c4be9039425e93c1a7e122e)

[Generalized linear models (Zhihu question)](https://www.zhihu.com/question/28469421)\
[GLM, NON-LINEARITY AND HETEROSCEDASTICITY](http://freakonometrics.hypotheses.org/9593)\
[General linear models, mixed linear models, and generalized linear models](http://bbs.pinggu.org/thread-2996069-1-1.html)\
[From linear models to generalized linear models (1): model assumptions](http://cos.name/2011/01/how-does-glm-generalize-lm-assumption/)\
[From linear models to generalized linear models (2): parameter estimation and hypothesis testing](http://cos.name/2011/01/how-does-glm-generalize-lm-fit-and-test/)

[Unifying distributions: the exponential family of models](https://zhuanlan.zhihu.com/p/148776108)
