Unlocking Low-Resource Language Capabilities with Trans-Tokenizers

Suvrakamal Das (~JaynouOliver)


Description:

In this poster, we introduce trans-tokenizers, a novel cross-lingual vocabulary transfer strategy designed to enhance the performance of language models for low-resource languages. Trans-tokenization initializes the token embeddings of a target language as a weighted average of semantically similar token embeddings from a high-resource source language. The technique leverages a translation resource covering both languages to create a probabilistic token mapping.
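As a rough illustration of the embedding-initialization step, the sketch below shows how target-language embeddings could be built as probability-weighted averages of source-language embeddings. The function name `init_target_embeddings` and the shape of `token_map` are illustrative assumptions, not the exact interface presented in the poster.

```python
import numpy as np

def init_target_embeddings(src_embeddings: np.ndarray,
                           token_map: dict[int, list[tuple[int, float]]],
                           tgt_vocab_size: int) -> np.ndarray:
    """Initialize target-language token embeddings as probability-weighted
    averages of semantically related source-language token embeddings.

    src_embeddings: (V_src, d) embedding matrix of a trained source model.
    token_map: maps each target token id to (source token id, probability)
               pairs derived from a translation resource (assumed input).
    """
    dim = src_embeddings.shape[1]
    tgt_embeddings = np.zeros((tgt_vocab_size, dim), dtype=src_embeddings.dtype)
    for tgt_id, alignments in token_map.items():
        src_ids = [s for s, _ in alignments]
        weights = np.array([p for _, p in alignments])
        weights = weights / weights.sum()  # re-normalize per target token
        # Weighted average over the aligned source embeddings.
        tgt_embeddings[tgt_id] = weights @ src_embeddings[src_ids]
    return tgt_embeddings
```

Target tokens with no aligned source tokens would remain zero-initialized here; in practice one might fall back to a random or mean-embedding initialization for those entries.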

Attendees will learn about the methodology behind trans-tokenizers, including token alignment using parallel corpora, embedding mapping, and practical applications of the approach. We will also discuss the integration of trans-tokenizers with models like LLaMA 3, along with the potential benefits and challenges involved. By leveraging trans-tokenization, the session will demonstrate how to efficiently train and fine-tune large language models (LLMs) for languages with limited resources, making advanced NLP tools more accessible to a broader community.
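To give a flavor of the token-alignment step, the following is a minimal sketch of turning word-aligned sentence pairs into a probabilistic target-to-source token mapping. It assumes tokenizers that expose a `tokenize()` method (as in Hugging Face tokenizers) and an `aligned_pairs` iterable of (source word, target word) tuples; both are illustrative assumptions, and real alignments would come from a word aligner run over a parallel corpus.

```python
from collections import Counter, defaultdict

def build_token_map(aligned_pairs, src_tokenizer, tgt_tokenizer):
    """Count co-occurrences of source/target subword tokens over
    word-aligned pairs and normalize them into probabilities."""
    counts = defaultdict(Counter)
    for src_word, tgt_word in aligned_pairs:
        for tgt_tok in tgt_tokenizer.tokenize(tgt_word):
            for src_tok in src_tokenizer.tokenize(src_word):
                counts[tgt_tok][src_tok] += 1

    token_map = {}
    for tgt_tok, src_counts in counts.items():
        total = sum(src_counts.values())
        # Probability of each source token given the target token.
        token_map[tgt_tok] = [(s, c / total) for s, c in src_counts.items()]
    return token_map
```

The resulting mapping (after converting tokens to ids) can feed the embedding-initialization step sketched earlier.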

Prerequisites:

Basic understanding of NLP; familiarity with tokenization methods (e.g., BPE, WordPiece), Transformer models (e.g., BERT, GPT, LLaMA), and PyTorch.

Speaker Info:

Suvrakamal Das is a Machine Learning Engineer at XRI Global (USA). He is a speaker at the SciPy 2024 conference and has previously worked at companies and research institutions including Woxsen University.

Speaker Links:

https://bit.ly/m/suvrakamal

Section: Artificial Intelligence and Machine Learning
Type: Poster
Target Audience: Intermediate