Unlocking Low-Resource Language Capabilities with Trans-Tokenizers
Suvrakamal Das (~JaynouOliver)
Description:
In this poster, we introduce trans-tokenizers, a cross-lingual vocabulary transfer strategy designed to improve the performance of language models for low-resource languages. Trans-tokenization initializes the token embeddings of a target language as a weighted average of semantically similar token embeddings from a high-resource source language, using a translation resource that covers both languages to build a probabilistic token mapping.
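To make the idea concrete, here is a minimal, hypothetical sketch of the weighted-average initialization (not the exact implementation presented in the poster). It assumes a precomputed probabilistic token mapping and the source model's embedding matrix as a PyTorch tensor; all names are illustrative.

```python
import torch

def init_target_embeddings(
    source_embeddings: torch.Tensor,                      # (source_vocab, hidden) from the source model
    token_mapping: dict[int, list[tuple[int, float]]],    # target_id -> [(source_id, prob), ...]
    target_vocab_size: int,
) -> torch.Tensor:
    """Initialize each target-language token embedding as a probability-weighted
    average of semantically similar source-language token embeddings."""
    hidden = source_embeddings.size(1)
    target_embeddings = torch.empty(target_vocab_size, hidden)
    # Fallback for unmapped tokens: random vectors with a matching scale.
    target_embeddings.normal_(std=source_embeddings.std().item())

    for target_id, pairs in token_mapping.items():
        source_ids = torch.tensor([s for s, _ in pairs])
        weights = torch.tensor([p for _, p in pairs])
        weights = weights / weights.sum()                 # normalize translation probabilities
        target_embeddings[target_id] = (
            weights.unsqueeze(1) * source_embeddings[source_ids]
        ).sum(dim=0)

    return target_embeddings
```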
Attendees will learn about the methodology behind trans-tokenizers, including token alignment using parallel corpora, embedding mapping, and practical applications. We will also discuss integrating trans-tokenizers with models such as LLaMA 3, along with the benefits and challenges of the approach. The session will demonstrate how trans-tokenization can be used to efficiently train and fine-tune large language models (LLMs) for languages with limited resources, making advanced NLP tools accessible to a broader community.
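As a rough illustration of the token-alignment step, the sketch below builds a probabilistic token mapping from word-aligned parallel data using Hugging Face tokenizers. It assumes (source word, target word) pairs have already been produced by a word aligner such as fast_align or eflomal; the checkpoints and paths are placeholders, not choices prescribed by the poster.

```python
from collections import Counter, defaultdict
from transformers import AutoTokenizer

def build_token_mapping(source_tok, target_tok, aligned_word_pairs, top_k=5):
    """Count how often source-language tokens co-occur with target-language
    tokens inside aligned word pairs, then normalize the counts into a
    probabilistic token mapping: target_id -> [(source_id, prob), ...]."""
    counts = defaultdict(Counter)
    for src_word, tgt_word in aligned_word_pairs:
        src_ids = source_tok.encode(src_word, add_special_tokens=False)
        tgt_ids = target_tok.encode(tgt_word, add_special_tokens=False)
        for t in tgt_ids:
            for s in src_ids:
                counts[t][s] += 1

    mapping = {}
    for t, src_counts in counts.items():
        top = src_counts.most_common(top_k)            # keep the strongest links
        total = sum(c for _, c in top)
        mapping[t] = [(s, c / total) for s, c in top]  # normalize to probabilities
    return mapping

# Illustrative usage (checkpoints and paths are assumptions):
# source_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
# target_tok = AutoTokenizer.from_pretrained("path/to/target-tokenizer")
# mapping = build_token_mapping(source_tok, target_tok, aligned_word_pairs)
```

The resulting mapping can then feed the embedding-initialization step sketched above before continued pre-training or fine-tuning on target-language data.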
Prerequisites:
Basic understanding of natural language processing (NLP), knowledge of tokenization methods (e.g., BPE, WordPiece), transformer models (e.g., BERT, GPT, LLaMA), and PyTorch
Speaker Info:
Suvrakamal Das is a Machine Learning Engineer at XRI Global (USA). He is a speaker at the SciPy 2024 conference and has previously worked at companies and research institutions such as Woxsen University.
Speaker Links:
https://bit.ly/m/suvrakamal