5.7 ELMo — Fundamentos de Deep Learning
https://rramosp.github.io/2021.deeplearning/content/U5.07%20-%20ELMo%20-%20NER.html
Image taken from here. Given \(T\) tokens \((x_1, x_2, \cdots, x_T)\), a forward language model computes the probability of the sequence by modeling the probability of token \(x_k\) given the history \((x_1, \cdots, x_{k-1})\):

\[
p(x_1, x_2, \cdots, x_T) = \prod_{k=1}^{T} p(x_k \mid x_1, \cdots, x_{k-1})
\]

This formulation has been addressed in the state of the art using many different approaches, more recently including approximations based on Bidirectional Recurrent Neural Networks.
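The chain-rule factorization above can be sketched with a toy counting-based bigram model, where the history is truncated to the previous token. The corpus and the `sequence_log_prob` helper below are hypothetical illustrations, not part of ELMo itself:

```python
import math
from collections import Counter

# Hypothetical toy corpus of token sequences.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

# Estimate conditional probabilities p(x_k | x_{k-1}) by counting,
# with "<s>" as the start-of-sequence history.
pair_counts, hist_counts = Counter(), Counter()
for seq in corpus:
    prev = "<s>"
    for tok in seq:
        pair_counts[(prev, tok)] += 1
        hist_counts[prev] += 1
        prev = tok

def sequence_log_prob(seq):
    """log p(x_1,...,x_T) = sum_k log p(x_k | history), bigram history."""
    prev, logp = "<s>", 0.0
    for tok in seq:
        logp += math.log(pair_counts[(prev, tok)] / hist_counts[prev])
        prev = tok
    return logp

# p(the|<s>) * p(cat|the) * p(sat|cat) = 1 * 2/3 * 1/2 = 1/3
print(sequence_log_prob(["the", "cat", "sat"]))
```

A neural language model such as ELMo replaces the counting estimate with a learned network, but the sequence probability is still accumulated token by token exactly as in this loop.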