marin-mimi-bpe-8cb-16k Tokenizer

This tokenizer extends the stanford-crfm/marin-tokenizer with audio tokens (8-codebook Mimi tokens) and new special tokens.

Overview

Text tokens: All tokens from the base Marin tokenizer
Audio tokens: 16,384 tokens (8 codebooks × 2,048 codebook size) for Mimi codec
Special tokens: <|text_start|>, <|text_end|>, <|audio_start|>, <|audio_end|>
Vocab size: 144,644 total = 128,256 (original Marin tokenizer) + 16,384 (audio tokens) + 4 (new special tokens). Be aware that this is not a multiple of 128.
BPE merges: 0 (no merges; each audio token maps directly to one codebook entry). We also tried a merge list of size 128K for audio tokens, but we observed only about a 10% token reduction, so we stick with 0 merges.
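
The vocabulary arithmetic above can be checked in a few lines. The sketch below also shows one possible way to handle the "not a multiple of 128" caveat: rounding the embedding size up to the next multiple. That padding step is an assumption about how a training setup might deal with it, not something this tokenizer does, and `pad_to_multiple` is a hypothetical helper.

```python
# Vocabulary layout as listed in the overview.
BASE_VOCAB = 128_256        # original Marin tokenizer
NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 2_048
NEW_SPECIAL_TOKENS = 4      # <|text_start|>, <|text_end|>, <|audio_start|>, <|audio_end|>

audio_tokens = NUM_CODEBOOKS * CODEBOOK_SIZE
vocab_size = BASE_VOCAB + audio_tokens + NEW_SPECIAL_TOKENS


def pad_to_multiple(n: int, multiple: int = 128) -> int:
    """Round n up to the nearest multiple (hypothetical helper, not part of the tokenizer)."""
    return ((n + multiple - 1) // multiple) * multiple


padded_vocab_size = pad_to_multiple(vocab_size)

print(audio_tokens)               # 16384
print(vocab_size)                 # 144644
print(vocab_size % 128)           # 4, so not a multiple of 128
print(padded_vocab_size)          # 144768
```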

Usage

This tokenizer is meant to be used with raw (readable) text and Unicode strings of audio tokens (not human-readable). For example:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("potsawee/marin-mimi-bpe-8cb-16k-tokenizer")
interleaved_str = "<|begin_of_text|><|audio_start|>ﶮ𐉢𐢥𑅆𑷅𐄋𐨫𑊚𑢜﵄𐅔𐥛𑌳𑨸𐇉𐰆𑏐𑮱𐋒𐹴𑞺𑿘𐇉𐰆𑀫𑨸𐛈𐸤𑌹𑿘說𐐫𐶣𑌳𑷅𐅔𐱋𑓡𑷽覆𐜸𐰩𑗖𑭞勒𐎜𐰥𑕶𑵿ￄ𐁲𐷂𑇗𑡃ｖ𐎃𐭖𑛺𑯉𐕀𐠾𑒈𑴅𐏭𐮍𑗗𑻘﹥𐊾𐭫𑞀𑸠𐝺𐲧𑜃𑤀隷𐟓𐶭𑔙𑽝ﱭ𐁠𐼫𑙵𑧄簾𐎧𐺝𑀭𑹣𐉭𐪋𑛂𑱯ﴯ𐆠𐭮𑚖𑥟﹃𐘿𐸸𑇭𑱼﷿𐀈𐠌𑂅𑺵怒𐉲𐬊𑙐𑨗ﵳ𐇢𐾉𑛈𑤝ﵳ𐔵𐻅𑝳𑪨𐚭𐾙𑃉𑵌ﳫ𐗲𐫗𑚷𑱻酪𐑍𐨔𑌬𑤑ｻ𐒧𐯡𑊛𑭗句𐏏𐷇𑛂𑹓ﰉ𐝀𐷒𑞐𑰘𐓑𐭘𑔙𑰅ﵵ𐜞𐿪𑑂𑺇﹃𐊟𐰯𑇐𑹠ﻝ𐕤𐰙𑊔𑰡ﱠ𐄼𐾉𑄓𑰦陼𐔱𐤖𑜃𑹋ﾟ𐜷𐷟𑉻𑲮索𐊃𐷋𑐆𑳥ﰭ𐙯𐶆𑏶𑨏賂𐂗𐧯𑌐𑸘ﮅ𐇾𐵀𑔎𑫗；𐌿𐢲𑀴𑥞𐜸𐥹𑊃𑩽稜𐎆𐥰𑎴𑦎𐇶𐸣𑁎𑤱ﴂ𐞮𐧑𑘠𑴇ﾱ𐆤𐫌𑒬𑬟ﹽ𐒫𐴻𑝅𑱖籠𐏇𐸉𑕵𑤑聾𐛣𐧭𑋷𑷛稜𐓾𐸾𑑥𑻾ﻫ𐆫𐾧𑙑𑮿𐒅𐦖𑚔𑺪﮳𐂜𐮨𑍢𑱦９𐆨𐴷𑎭𑭩𐛲𐸯𑐨𑼼；𐞅𐦯𑈨𑰽９𐃻𐻋𑂷𑮥︘𐞢𐡝𑆤𑠏ﯛ𐎫𐺹𑋪𑴗︷𐚍𐸣𑋂𑹋韛𐌵𐮀𑉯𑮥ﲹ𐔫𐾷𑜴𑰫ﺕ𐒱𐩆𑇣𑶩﷘𐒱𐮽𑓦𑥵𐋹𐱢𑊥𑳲﮺𐆫𐿐𑒛𑯓ￜ𐀨𐤒𑟻𑫳ﾨ𐂑𐲚𑈈𑪭ｘ𐎷𐻛𑐛𑨶𐆧𐹮𑆮𑪸ﶓ𐏪𐣈𑘁𑵓變𐞐𐪑𑑟𑲽ﮎ𐖽𐪧𑕴𑢙﷑𐎶𐪽𑚿𑢅摒𐎲𐧞𑓖𑵱諸𐄋𐥛𑌹𑮜諸𐄋𐣣𑝏𑿘﵄𐇡𐰫𑌳𑸌ﶪ𐛈𐧞𑆬𑺁虜𐉸𐱞𑕪𑤓<|audio_end|><|text_start|>I've also been taught to understand and produce paralinguistic things like sighing, chuckling, or yawning<|text_end|><|end_of_text|>"
tokens = tokenizer(interleaved_str)
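
The example string follows an audio-then-text layout wrapped in special tokens. A minimal sketch of assembling such a string is shown below; `build_interleaved` is a hypothetical helper, and the ordering (one audio segment followed by one text segment) is taken only from the single example above, so other layouts may be equally valid.

```python
def build_interleaved(audio_chars: str, text: str) -> str:
    """Wrap one audio segment and one text segment in the special tokens.

    Hypothetical helper: the layout mirrors the single example in this card.
    """
    return (
        "<|begin_of_text|>"
        f"<|audio_start|>{audio_chars}<|audio_end|>"
        f"<|text_start|>{text}<|text_end|>"
        "<|end_of_text|>"
    )


# Two placeholder audio characters stand in for a real audio-token string.
example = build_interleaved("\ue000\ue800", "hello")
```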

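Note that each character in the Unicode string corresponds to one audio token (flattened from 8 codebooks). We apply a Unicode offset of 0xE000 so that the code points for audio tokens start at the beginning of the Private Use Area. With 16,384 audio tokens, the code points span U+E000 to U+11FFF; only the lower part of this range lies inside the Private Use Area (U+E000 to U+F8FF), and the remainder overlaps code points that Unicode already assigns to other characters.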
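
To make the mapping concrete, the sketch below converts a single (codebook index, code) pair to its character and back. It assumes the same layout as the conversion functions in the next section: codebook `i` occupies the `codebook_size` code points starting at `0xE000 + i * codebook_size`. The names `code_to_char` and `char_to_code` are illustrative only.

```python
from typing import Tuple

UNICODE_OFFSET = 0xE000
NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 2048


def code_to_char(codebook: int, code: int) -> str:
    """Map one (codebook, code) pair to its Unicode character."""
    if not (0 <= codebook < NUM_CODEBOOKS and 0 <= code < CODEBOOK_SIZE):
        raise ValueError("codebook or code out of range")
    return chr(UNICODE_OFFSET + codebook * CODEBOOK_SIZE + code)


def char_to_code(ch: str) -> Tuple[int, int]:
    """Inverse mapping: recover (codebook, code) from one character."""
    flat = ord(ch) - UNICODE_OFFSET
    if not (0 <= flat < NUM_CODEBOOKS * CODEBOOK_SIZE):
        raise ValueError("character is outside the audio-token range")
    return divmod(flat, CODEBOOK_SIZE)


first = code_to_char(0, 0)       # lowest audio-token code point
last = code_to_char(7, 2047)     # highest audio-token code point
print(hex(ord(first)), hex(ord(last)))   # 0xe000 0x11fff
print(char_to_code(code_to_char(3, 100)))  # (3, 100)
```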

Unicode String to Codes (and vice versa)

from typing import List, Optional, Union

import numpy as np
import torch

UNICODE_OFFSET = 0xE000

def chars_to_codes(
    chars: str, 
    num_codebooks: int,
    codebook_size: int,
    return_tensors: Optional[str] = None, 
    unicode_offset: int = UNICODE_OFFSET,
) -> Union[List[List[int]], np.ndarray, torch.Tensor]:
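    # One character per audio token: recover the raw code points.
    codes = np.array([ord(c) for c in chars])
    # Tokens are stored frame by frame; reshape to (num_codebooks, seq_length).
    codes = codes.reshape(-1, num_codebooks).T
    # Remove the Unicode offset and each codebook's own offset.
    for i in range(codes.shape[0]):
        codes[i] -= unicode_offset + i*codebook_size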
    if return_tensors is None:
        codes = codes.tolist()
    elif return_tensors == "pt":
        codes = torch.tensor(codes)
    return codes

def codes_to_chars(
    codes: Union[List[List[int]], np.ndarray, torch.Tensor], 
    codebook_size: int,
    copy_before_conversion: bool = True,
    unicode_offset: int = UNICODE_OFFSET,
) -> str:
    if isinstance(codes, list):
        codes = np.array(codes)
        copy_before_conversion = False
    elif isinstance(codes, torch.Tensor):
        codes = codes.cpu().numpy()
    if len(codes.shape) != 2:
        raise ValueError("codes must be a 2D array of shape (num_codebooks, seq_length).")
    if copy_before_conversion:
        codes = codes.copy()
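    # Shift each codebook into its own code-point range (modifies codes in place).
    for i in range(codes.shape[0]):
        codes[i] += unicode_offset + i*codebook_size
    # Flatten frame by frame: all codebooks of frame 0, then frame 1, and so on.
    codes = codes.T.reshape(-1)
    chars = "".join([chr(c) for c in codes])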
    return chars

Adapted from https://github.com/AbrahamSanders/codec-bpe
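
The flattening order used by these functions is frame-major: all codebooks of frame 0 come first, then all codebooks of frame 1, and so on. The dependency-free sketch below mirrors that logic on plain Python lists with a toy configuration (2 codebooks of size 4) so the ordering is easy to inspect. The `_py` functions are illustrative re-implementations, not part of the original code.

```python
from typing import List

UNICODE_OFFSET = 0xE000


def codes_to_chars_py(codes: List[List[int]], codebook_size: int) -> str:
    """codes has shape (num_codebooks, seq_length); output is frame-major."""
    num_codebooks = len(codes)
    seq_length = len(codes[0])
    return "".join(
        chr(UNICODE_OFFSET + cb * codebook_size + codes[cb][t])
        for t in range(seq_length)
        for cb in range(num_codebooks)
    )


def chars_to_codes_py(chars: str, num_codebooks: int, codebook_size: int) -> List[List[int]]:
    """Inverse of codes_to_chars_py; returns shape (num_codebooks, seq_length)."""
    if len(chars) % num_codebooks != 0:
        raise ValueError("string length must be a multiple of num_codebooks")
    seq_length = len(chars) // num_codebooks
    return [
        [
            ord(chars[t * num_codebooks + cb]) - UNICODE_OFFSET - cb * codebook_size
            for t in range(seq_length)
        ]
        for cb in range(num_codebooks)
    ]


# Toy example: 2 codebooks, codebook size 4, 3 frames.
toy_codes = [[0, 1, 2], [3, 2, 1]]
toy_chars = codes_to_chars_py(toy_codes, codebook_size=4)
flat_ids = [ord(c) - UNICODE_OFFSET for c in toy_chars]
print(flat_ids)  # [0, 7, 1, 6, 2, 5]
round_trip = chars_to_codes_py(toy_chars, num_codebooks=2, codebook_size=4)
```

In the toy output, the second codebook's codes appear shifted by 4 (its codebook offset) and interleaved with the first codebook's codes, one pair per frame.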

Codes to Audio Waveform

To get back the audio waveform from the codes, you can use the following code:

import torch
from transformers import MimiModel

model = MimiModel.from_pretrained("kyutai/mimi")
# audio_str is the Unicode string of audio tokens, i.e., the characters between
# <|audio_start|> and <|audio_end|> (excluding the markers themselves)
codes = chars_to_codes(audio_str, num_codebooks=8, codebook_size=2048, return_tensors="pt").unsqueeze(0)
with torch.no_grad():
    audio_decoded = model.decode(codes).audio_values[0]
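
Since `chars_to_codes` converts every character it receives, the special-token markers have to be stripped before decoding. One possible way to pull the audio and text segments out of an interleaved string is sketched below with regular expressions; `extract_audio_strings` and `extract_text_strings` are hypothetical helpers, not part of the tokenizer or of `transformers`.

```python
import re
from typing import List

# Non-greedy matches so that multiple segments in one string are kept separate.
AUDIO_SPAN = re.compile(r"<\|audio_start\|>(.*?)<\|audio_end\|>", re.DOTALL)
TEXT_SPAN = re.compile(r"<\|text_start\|>(.*?)<\|text_end\|>", re.DOTALL)


def extract_audio_strings(interleaved: str) -> List[str]:
    """Return the audio-token strings found between the audio markers."""
    return AUDIO_SPAN.findall(interleaved)


def extract_text_strings(interleaved: str) -> List[str]:
    """Return the text segments found between the text markers."""
    return TEXT_SPAN.findall(interleaved)


# Placeholder string: two audio characters stand in for real audio tokens.
sample = (
    "<|begin_of_text|><|audio_start|>\ue000\ue801<|audio_end|>"
    "<|text_start|>hi<|text_end|><|end_of_text|>"
)
audio_segments = extract_audio_strings(sample)
text_segments = extract_text_strings(sample)
```

Each element of `audio_segments` can then be passed as `audio_str` to `chars_to_codes`, provided its length is a multiple of the number of codebooks.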
