Something about Attention
A summary of attention mechanisms. Suppose we have hidden state representations $H \in \mathbb{R}^{n\times d}$, where $n$ is the sequence length and $d$ is the dimension of each token's hidden representation.
In Seq2Seq
$$
\begin{aligned}
c_i &= \sum_{j=1}^{T_x}\alpha_{ij}h_j \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \\
e_{ij} &= a(s_{i-1},h_j)
\end{aligned}
$$
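A minimal NumPy sketch of this additive (Bahdanau-style) attention step, assuming the common one-layer alignment model $a(s_{i-1}, h_j) = v^T\tanh(W_s s_{i-1} + W_h h_j)$; the names `W_s`, `W_h`, `v` and the toy shapes are illustrative assumptions, not from the original text:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())        # subtract max for numerical stability
    return e / e.sum()

def bahdanau_context(s_prev, H, W_s, W_h, v):
    """Additive attention: e_j = v^T tanh(W_s s_{i-1} + W_h h_j).

    s_prev : (d_s,)   previous decoder state s_{i-1}
    H      : (T_x, d) encoder hidden states h_1 .. h_{T_x}
    W_s    : (d_a, d_s), W_h : (d_a, d), v : (d_a,)  -- assumed parameter shapes
    """
    e = np.tanh(s_prev @ W_s.T + H @ W_h.T) @ v   # (T_x,) alignment scores e_{ij}
    alpha = softmax(e)                            # attention weights alpha_{ij}
    c = alpha @ H                                 # context vector c_i, shape (d,)
    return c, alpha

# toy usage
rng = np.random.default_rng(0)
T_x, d, d_s, d_a = 5, 8, 8, 16
H = rng.normal(size=(T_x, d))
s_prev = rng.normal(size=d_s)
W_s, W_h, v = rng.normal(size=(d_a, d_s)), rng.normal(size=(d_a, d)), rng.normal(size=d_a)
c, alpha = bahdanau_context(s_prev, H, W_s, W_h, v)
print(c.shape, alpha.sum())   # (8,) 1.0
```

The second formulation, below, scores the current target state $h_t$ directly against each source state $h_s$.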
$$
\begin{aligned}
a_t(s) &= align(h_t, h_s) = \frac{\exp(score(h_t, h_s))}{\sum_{s'}\exp(score(h_t, h_{s'}))}\\
score(h_t, h_s) &=
\begin{cases}
{h_t}^Th_s & \quad \text{dot}\\
{h_t}^TW_ah_s & \quad \text{general}\\
{v_a}^T\tanh(W_a[h_t;h_s]) & \quad \text{concat}
\end{cases}
\end{aligned}
$$
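A sketch of the three score functions and the resulting alignment weights, again in NumPy; the parameter shapes assumed for the general and concat variants (`W_a`, `v_a`) are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def luong_align(h_t, H_s, mode="dot", W_a=None, v_a=None):
    """a_t(s) = softmax_s(score(h_t, h_s)) for the three score variants.

    h_t : (d,)      current target hidden state
    H_s : (T_s, d)  source hidden states, one row per h_s
    W_a : (d, d) for "general", (d_a, 2d) for "concat"  (assumed shapes)
    v_a : (d_a,)    only used by "concat"
    """
    if mode == "dot":
        scores = H_s @ h_t                       # h_t^T h_s for every s
    elif mode == "general":
        scores = H_s @ W_a.T @ h_t               # h_t^T W_a h_s
    elif mode == "concat":
        concat = np.concatenate([np.tile(h_t, (len(H_s), 1)), H_s], axis=1)  # [h_t; h_s]
        scores = np.tanh(concat @ W_a.T) @ v_a   # v_a^T tanh(W_a [h_t; h_s])
    else:
        raise ValueError(mode)
    return softmax(scores)                       # alignment weights a_t(s)

# toy usage
rng = np.random.default_rng(0)
d, d_a, T_s = 4, 6, 3
h_t, H_s = rng.normal(size=d), rng.normal(size=(T_s, d))
print(luong_align(h_t, H_s, mode="dot"))
print(luong_align(h_t, H_s, mode="general", W_a=rng.normal(size=(d, d))))
print(luong_align(h_t, H_s, mode="concat",
                  W_a=rng.normal(size=(d_a, 2 * d)), v_a=rng.normal(size=d_a)))
```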
In Relation Classification
$$
\begin{aligned}
M &= \tanh(H^T)\\
\alpha &= softmax(w^TM)\\
r &= H^T\alpha^T
\end{aligned}
\quad
\implies
\quad
\begin{aligned}
A &= \tanh(H) \quad & (A \in \mathbb{R}^{n\times d})\\
\alpha &= softmax(Aw) \quad & (w \in \mathbb{R}^{d},\ \alpha \in \mathbb{R}^{n})\\
r &= \alpha^TH \quad & (r \in \mathbb{R}^{d})
\end{aligned}
$$
- input: $H \in \mathbb{R}^{n\times d}$
- parameters to train: $w \in \mathbb{R}^{d}$
- output: sentence representation $r \in \mathbb{R}^d$
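A NumPy sketch of the right-hand matrix form; the helper name `attention_pool` is hypothetical:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, w):
    """Sentence representation r from the matrix form above.

    H : (n, d)  hidden states of the n tokens
    w : (d,)    trainable attention vector
    """
    A = np.tanh(H)          # (n, d)
    alpha = softmax(A @ w)  # (n,)  one weight per token
    r = alpha @ H           # (d,)  weighted sum of hidden states
    return r, alpha

# toy usage
rng = np.random.default_rng(0)
n, d = 6, 8
H, w = rng.normal(size=(n, d)), rng.normal(size=d)
r, alpha = attention_pool(H, w)
print(r.shape, alpha.shape)   # (8,) (6,)
```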
HAN (Hierarchical Attention Networks) in Text Classification
$$
\begin{aligned}
u_{it} &= \tanh(W_wh_{it} + b_w)\\
\alpha_{it} &= \frac{\exp({u_{it}}^Tu_w)}{\sum_t \exp({u_{it}}^Tu_w)}\\
s_i &= \sum_t \alpha_{it}h_{it}
\end{aligned}
\quad
\implies
\quad
\begin{aligned}
A &= \tanh(HW+b) \quad & (A \in \mathbb{R}^{n\times d_w})\\
\alpha &= softmax(Au) \quad & (u \in \mathbb{R}^{d_w},\ \alpha \in \mathbb{R}^{n})\\
r &= \alpha^TH \quad & (r \in \mathbb{R}^{d})
\end{aligned}
$$
- input: $H \in \mathbb{R}^{n\times d}$
- parameters to train: $W \in \mathbb{R}^{d\times d_w}$, $b \in \mathbb{R}^{d_w}$, $u \in \mathbb{R}^{d_w}$
- output: sentence representation $r \in \mathbb{R}^d$
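The word-level attention of HAN in its matrix form, sketched in NumPy with the same shapes as in the list above (the bias $b$ is kept explicit):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def han_word_attention(H, W, b, u):
    """Word-level attention for one sentence.

    H : (n, d)       word hidden states
    W : (d, d_w), b : (d_w,), u : (d_w,)  trainable parameters
    """
    A = np.tanh(H @ W + b)   # (n, d_w)   u_it = tanh(W_w h_it + b_w)
    alpha = softmax(A @ u)   # (n,)       alpha_it
    s = alpha @ H            # (d,)       s_i = sum_t alpha_it h_it
    return s, alpha

# toy usage
rng = np.random.default_rng(0)
n, d, d_w = 7, 8, 5
H = rng.normal(size=(n, d))
W, b, u = rng.normal(size=(d, d_w)), rng.normal(size=d_w), rng.normal(size=d_w)
s, alpha = han_word_attention(H, W, b, u)
print(s.shape, alpha.sum())   # (8,) 1.0
```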
Self-Attention in Transformer
$$
\begin{aligned}
Q &= HW_Q \\
K &= HW_K \\
V &= HW_V \\
H' &= softmax\left(\frac{QK^T}{\sqrt{d_k}}\right) V
\end{aligned}
$$
- input: $H \in \mathbb{R}^{n\times d}$
- parameters to train: $W_Q, W_K, W_V \in \mathbb{R}^{d\times d_k}$
- output: next hidden layer representation $H' \in \mathbb{R}^{n\times d_k}$
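A single-head NumPy sketch of this scaled dot-product self-attention (no masking and no multi-head concatenation, which the full Transformer adds on top):

```python
import numpy as np

def softmax(X, axis=-1):
    e = np.exp(X - X.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention.

    H             : (n, d)
    W_Q, W_K, W_V : (d, d_k)
    returns H'    : (n, d_k)
    """
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V     # (n, d_k) each
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)         # (n, n) pairwise similarities
    A = softmax(scores, axis=-1)            # each row sums to 1
    return A @ V                            # (n, d_k)

# toy usage
rng = np.random.default_rng(0)
n, d, d_k = 5, 8, 4
H = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))
print(self_attention(H, W_Q, W_K, W_V).shape)   # (5, 4)
```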