5 对数似然
指数似然解释和过程参考
\( \tilde{p} \)是经验分布概率,由样本简单统计得到。
\[ \begin{align}
L_\tilde{P}(P_w) &= \log \prod_{x,y}p(y|x)^{\tilde{p}(x,y)} = \sum_{x,y} \tilde{p}(x,y) \log p(y|x) \\
&= \sum_{x,y} \tilde{p}(x,y) \sum_{k=1} w_k f_k(x,y) – \sum_{x} \tilde{p}(x) \log Z_w(x)
\end{align} \]
5.1 改进的迭代尺度法
按极大估计优化,并且参考MEMM IIS完全相同的方法得到:
\[ \begin{align}
\frac{\partial B}{\partial \delta_k} & = \sum_{x,y} \tilde{p}(x,y) f_k(x,y) – \sum_{x,y} \tilde{p}(x) p_w(y|x) f_k(x,y) \exp(\delta_k f^\#(x,y)) \\
& = E_{\tilde{p}}(f_k) – \sum_{x,y} \tilde{p}(x) p_w(y|x) f_k(x,y) \exp(\delta_k f^\#(x,y))
\end{align} \]
\[ \begin{align}
f_k(y,x) &= \sum_i f_k(y_{i-1}, y_i, x_i) \\
f^\#(x,y) &= \sum_k \sum_i f_k(y_{i-1}, y_i, x_i) \end{align} \]
5.1.1
因每组(x,y) 特征数量并不相同。所以设置\( M – f^\#(x,y) \geq 0 \) 于是可用下式更新.
\[ \delta_k = \frac{1}{M} \log \frac{E_{\tilde{p}(f_k)}}{E_p(f_k)} \]
5.2 拟牛顿法
按惯例定义为最小化对数似然。为减轻过拟合和提高优化速度, 加上\(L2\)正则项。
\[ \begin{align}
min_w f(w) &= -L_\tilde{P}(P_w) + \sum_k \frac{w_k^2}{2 \sigma^2 } \\
&= -\log \prod_{x,y}p(y|x)^{\tilde{p}(x,y)} + \sum_k \frac{w_k^2}{2 \sigma^2 } \\
&= -\sum_{x,y} \tilde{p}(x,y) \log p(y|x) + \sum_k \frac{w_k^2}{2 \sigma^2 } \\
&= \sum_{x} \tilde{p}(x) \log Z_w(x) – \sum_{x,y} \tilde{p}(x,y) \sum_k w_k f_k(x,y) + \sum_k \frac{w_k^2}{2 \sigma^2 } \\
\end{align} \]
梯度:
\[ \nabla f(w) = \left[\frac{\partial f(w)}{\partial w_1},\frac{\partial f(w)}{\partial w_2},\frac{\partial f(w)}{\partial w_3},\cdots \right]^T \]
\[ \begin{align}
\frac{\partial f(w)}{\partial w_k} &= \sum_{x,y} \tilde{p}(x) p_w(y|x) f_k(x,y) – \sum_{x,y} \tilde{p}(x,y) f_k(x,y) + \frac{w_k}{\sigma^2} \\
&= \sum_{x,y} \tilde{p}(x) p_w(y|x) \sum_i f_k(y_{i-1}, y_i, x_i) \\
&\quad – \sum_{x, y} \tilde{p}(x,y) \sum_i f_k(y_{i-1}, y_i, x_i) + \frac{w_k}{\sigma^2} \\
&= \sum_x \tilde{p}(x) \sum_i \sum_{y_{i-1}, y_i} \sum_{y_1,\cdots, y_{i-2}, y_{i+1}, \cdots, y_n} p_w(y_{i-1}, y_i | x) p_w(y_1,\cdots, y_{i-2}, y_{i+1}, \cdots, y_n | x) f_k(y_{i-1}, y_i, x_i) \\
&\quad – \sum_{x, y} \tilde{p}(x,y) \sum_i f_k(y_{i-1}, y_i, x_i) + \frac{w_k}{\sigma^2} \\
&= \sum_x \tilde{p}(x) \sum_i \sum_{s, s^\prime} p_w(y_{i-1}=s, y_i=s^\prime | x) f_k(s, s^\prime, x_i) \\
&\quad – \sum_{x, y} \tilde{p}(x,y) \sum_i f_k(y_{i-1}, y_i, x_i) + \frac{w_k}{\sigma^2}
\end{align} \]
正则化参数依赖于训练数据集大小,通常情况下\( \sigma^2=10 \),此值需扫描尝试,
参考资料:
[1] 李航; 统计机器学习
[2] Ethem Alpaydın: Introduction to Machine Learning Third Edition 15 Hidden Markov Models
[3] Lafferty, J., McCallum, A., Pereira, F. (2001). “Conditional random fields: Probabilistic models for segmenting and labeling sequence data”. Proc. 18th International Conf. on Machine Learning. Morgan Kaufmann. pp. 282–289.
[4] An Introduction to Conditional Random Fields Charles Sutton1 and Andrew McCallum2
Leave a Reply