What can go wrong with tokenizer encodings? Everything! I will share my experience of understanding, misunderstanding, and ultimately learning to work with tokenization in LLMs. I will discuss what surprisal is, its relevance to my research, and its connection to tokenization. The talk will include various examples illustrating how misunderstandings of tokenization can arise, as well as strategies for debugging and preventing these issues.
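To make the surprisal connection concrete, here is a minimal sketch of computing per-token surprisal, -log2 P(token | preceding context), from a causal language model. It assumes the Hugging Face `transformers` library and the public `gpt2` checkpoint, neither of which is specified in the talk description; the decoded output makes the subword/word mismatch visible.

```python
# A minimal sketch, assuming the Hugging Face `transformers` library and the
# public "gpt2" checkpoint (neither is specified in the talk description).
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Tokenization is not the same thing as words."
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits  # shape: (1, seq_len, vocab_size)

# Logits at position i predict the token at position i + 1, so shift to align:
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
targets = ids[0, 1:]
surprisal = -log_probs[torch.arange(targets.size(0)), targets] / math.log(2)

for tok_id, s in zip(targets.tolist(), surprisal.tolist()):
    # The decoded pieces are subword tokens, not words -- exactly the kind of
    # mismatch that leads to the misunderstandings the talk covers.
    print(f"{tokenizer.decode([tok_id])!r}: {s:.2f} bits")
```

Note that each surprisal value belongs to a tokenizer-defined subword, so mapping these values back onto words already requires the kind of care the talk argues for.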
Basic Python and an interest in NLP/LLMs
Computational Cognitive Science researcher at the University of Potsdam, Germany