marin-mimi-bpe-8cb-16k Tokenizer

This tokenizer extends the stanford-crfm/marin-tokenizer with audio tokens (8-codebook Mimi tokens) and new special tokens.

Overview

  • Text tokens: All tokens from the base Marin tokenizer
  • Audio tokens: 16,384 tokens (8 codebooks × 2,048 codebook size) for Mimi codec
  • Special tokens: <|text_start|>, <|text_end|>, <|audio_start|>, <|audio_end|>
  • Vocab size: 144,644 total = 128,256 (original Marin tokenizer) + 16,384 (audio tokens) + 4 (new special tokens). Be aware that this is not a multiple of 128.
  • BPE merges: 0 (no merges; each audio code maps directly to one token). We also tried a merge list of size 128K for audio tokens, but observed only about a 10% token reduction, so we kept 0 merges.
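The vocabulary arithmetic above can be checked with a few lines of Python. The sketch below uses only the numbers listed in the overview. The `round_up` helper and the idea of padding the embedding table up to the next multiple of 128 are assumptions added for illustration, not something this tokenizer does for you.

```python
# Vocabulary bookkeeping, using only the numbers listed in the overview.
BASE_VOCAB = 128_256   # original Marin tokenizer
NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 2_048
NUM_SPECIAL = 4        # <|text_start|>, <|text_end|>, <|audio_start|>, <|audio_end|>

audio_tokens = NUM_CODEBOOKS * CODEBOOK_SIZE
vocab_size = BASE_VOCAB + audio_tokens + NUM_SPECIAL


def round_up(n: int, multiple: int) -> int:
    """Hypothetical helper: smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


# Assumption: a model may want its embedding table padded to a multiple of 128.
padded_vocab_size = round_up(vocab_size, 128)

print(audio_tokens)                    # 16384
print(vocab_size)                      # 144644
print(vocab_size % 128)                # 4
print(padded_vocab_size)               # 144768
print(padded_vocab_size - vocab_size)  # 124
```

If the embedding table is padded this way, the 124 extra rows would simply be unused token IDs.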

Usage

This tokenizer is meant to be used on strings that mix raw (readable) text with Unicode strings of audio tokens (not readable). For example:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("potsawee/marin-mimi-bpe-8cb-16k-tokenizer")
interleaved_str = "<|begin_of_text|><|audio_start|>๎™๎ณ๏˜ก๏ถฎ๐‰ข๐ขฅ๐‘…†๐‘ท…๎ฟ๎ณ๏Ÿ ๏ขƒ๐„‹๐จซ๐‘Šš๐‘ขœ๎˜๎ณ๏Ÿ ๏ต„๐…”๐ฅ›๐‘Œณ๐‘จธ๎Œ๎ณ๏•ธ๏ขƒ๐‡‰๐ฐ†๐‘๐‘ฎฑ๎Œ๎ณ๏•ธ๏กจ๐‹’๐นด๐‘žบ๐‘ฟ˜๎Œ๎ณ๏“˜๏ขƒ๐‡‰๐ฐ†๐‘€ซ๐‘จธ๎Œ๎ณ๏…๏กจ๐›ˆ๐ธค๐‘Œน๐‘ฟ˜๎†‹๎ณ๏˜ก๏ฆก๐ซ๐ถฃ๐‘Œณ๐‘ท…๎†‹๎ณ๏Ÿ ๏กจ๐…”๐ฑ‹๐‘“ก๐‘ทฝ๎“ด๎ฏก๏’ฆ๏ชท๐œธ๐ฐฉ๐‘—–๐‘ญž๎๎ป•๏Ÿ–๏ฅ’๐Žœ๐ฐฅ๐‘•ถ๐‘ตฟ๎ฝ๎ฉธ๏„ฆ๏ฟ„๐ฒ๐ท‚๐‘‡—๐‘กƒ๎™ฒ๎ก™๏†ฑ๏ฝ–๐Žƒ๐ญ–๐‘›บ๐‘ฏ‰๎ธ๎ฌผ๏ƒ“๏กž๐•€๐ พ๐‘’ˆ๐‘ด…๎›ฃ๎ค‚๏”‚๏ขฌ๐ญ๐ฎ๐‘——๐‘ป˜๎Šณ๎บ†๏‡˜๏นฅ๐Šพ๐ญซ๐‘ž€๐‘ธ ๎ข๎ง›๏™ค๏ฃˆ๐บ๐ฒง๐‘œƒ๐‘ค€๎ƒš๎ฅต๏’ฆ๏จฏ๐Ÿ“๐ถญ๐‘”™๐‘ฝ๎‚ญ๎พ‡๏…๏ฑญ๐ ๐ผซ๐‘™ต๐‘ง„๎‚ป๎ท‘๏˜พ๏ฆฆ๐Žง๐บ๐‘€ญ๐‘นฃ๎ˆฃ๎ข…๏—น๏ ง๐‰ญ๐ช‹๐‘›‚๐‘ฑฏ๎›š๎ซ‚๏˜†๏ดฏ๐† ๐ญฎ๐‘š–๐‘ฅŸ๎„ฆ๎งš๏‡ณ๏นƒ๐˜ฟ๐ธธ๐‘‡ญ๐‘ฑผ๎€ท๎ปพ๏Šด๏ทฟ๐€ˆ๐ Œ๐‘‚…๐‘บต๎Œˆ๎ฟซ๏–Š๏ฅ ๐‰ฒ๐ฌŠ๐‘™๐‘จ—๎Œˆ๎ฅช๏–„๏ตณ๐‡ข๐พ‰๐‘›ˆ๐‘ค๎„ˆ๎ฌช๏†•๏ตณ๐”ต๐ป…๐‘ณ๐‘ชจ๎Ÿ๎ช๏…„๏ฃท๐šญ๐พ™๐‘ƒ‰๐‘ตŒ๎„ฌ๎ปจ๏—ฒ๏ณซ๐—ฒ๐ซ—๐‘šท๐‘ฑป๎„น๎ด’๏‹†๏ค™๐‘๐จ”๐‘Œฌ๐‘ค‘๎Ÿ๎นƒ๏‚ƒ๏ฝป๐’ง๐ฏก๐‘Š›๐‘ญ—๎——๎ ผ๏›ƒ๏ค†๐๐ท‡๐‘›‚๐‘น“๎“บ๎ฒข๏Œ‹๏ฐ‰๐€๐ท’๐‘ž๐‘ฐ˜๎…›๎พˆ๏“๏ก‰๐“‘๐ญ˜๐‘”™๐‘ฐ…๎„น๎บฉ๏“ฑ๏ตต๐œž๐ฟช๐‘‘‚๐‘บ‡๎œฏ๎ฌ›๏Žพ๏นƒ๐ŠŸ๐ฐฏ๐‘‡๐‘น ๎Ÿด๎ช†๏—ต๏ป๐•ค๐ฐ™๐‘Š”๐‘ฐก๎žŒ๎ฉฃ๏จ๏ฑ ๐„ผ๐พ‰๐‘„“๐‘ฐฆ๎„Š๎ฆŠ๏–ญ๏ซ†๐”ฑ๐ค–๐‘œƒ๐‘น‹๎‰ฎ๎ ผ๏€‘๏พŸ๐œท๐ทŸ๐‘‰ป๐‘ฒฎ๎€๎ผฒ๏Ÿผ๏ฅช๐Šƒ๐ท‹๐‘†๐‘ณฅ๎Œž๎ฆด๏œข๏ฐญ๐™ฏ๐ถ†๐‘ถ๐‘จ๎•๎ฟญ๏€ป๏ฅˆ๐‚—๐งฏ๐‘Œ๐‘ธ˜๎Ÿ‘๎กฟ๏›œ๏ฎ…๐‡พ๐ต€๐‘”Ž๐‘ซ—๎•ฏ๎จ‘๏…๏ผ›๐Œฟ๐ขฒ๐‘€ด๐‘ฅž๎šค๎ š๏‚ฒ๏ฃง๐œธ๐ฅน๐‘Šƒ๐‘ฉฝ๎‚…๎ผฆ๏•˜๏ฅ–๐Ž†๐ฅฐ๐‘Žด๐‘ฆŽ๎††๎กŽ๏‹”๏กญ๐‡ถ๐ธฃ๐‘Ž๐‘คฑ๎–๎ขฝ๏•ฐ๏ด‚๐žฎ๐ง‘๐‘˜ ๐‘ด‡๎Šพ๎ ถ๏š™๏พฑ๐†ค๐ซŒ๐‘’ฌ๐‘ฌŸ๎˜๎ฅฃ๏€š๏นฝ๐’ซ๐ดป๐‘…๐‘ฑ–๎๎ญ’๏–ฅ๏ฅ„๐‡๐ธ‰๐‘•ต๐‘ค‘๎–ผ๎ท‡๏‰ ๏ฅ…๐›ฃ๐งญ๐‘‹ท๐‘ท›๎‡๎ฑ‚๏ด๏ฅ–๐“พ๐ธพ๐‘‘ฅ๐‘ปพ๎ง๎บ๏จ๏ปซ๐†ซ๐พง๐‘™‘๐‘ฎฟ๎™œ๎ญ‡๏›ง๏ ‹๐’…๐ฆ–๐‘š”๐‘บช๎›ฌ๎ต ๏ˆ—๏ฎณ๐‚œ๐ฎจ๐‘ข๐‘ฑฆ๎‚บ๎นš๏ˆ›๏ผ™๐†จ๐ดท๐‘Žญ๐‘ญฉ๎”‡๎ฐผ๏œด๏ฃ”๐›ฒ๐ธฏ๐‘จ๐‘ผผ๎˜ ๎ดŽ๏Š›๏ผ›๐ž…๐ฆฏ๐‘ˆจ๐‘ฐฝ๎“ฅ๎ธฑ๏ถ๏ผ™๐ƒป๐ป‹๐‘‚ท๐‘ฎฅ๎๎คค๏Ÿฅ๏ธ˜๐žข๐ก๐‘†ค๐‘ 
๎ƒ๎นฉ๏žŒ๏ฏ›๐Žซ๐บน๐‘‹ช๐‘ด—๎†ฟ๎ซ›๏•Ÿ๏ธท๐š๐ธฃ๐‘‹‚๐‘น‹๎„ณ๎ทก๏“—๏ซ‰๐Œต๐ฎ€๐‘‰ฏ๐‘ฎฅ๎“๎ตƒ๏‰Š๏ฒน๐”ซ๐พท๐‘œด๐‘ฐซ๎›ฌ๎กฒ๏ฌ๏บ•๐’ฑ๐ฉ†๐‘‡ฃ๐‘ถฉ๎‚บ๎ตซ๏Ž€๏ท˜๐’ฑ๐ฎฝ๐‘“ฆ๐‘ฅต๎›๎ธฒ๏‰ณ๏ก›๐‹น๐ฑข๐‘Šฅ๐‘ณฒ๎Žฐ๎ผผ๏Ÿ“๏ฎบ๐†ซ๐ฟ๐‘’›๐‘ฏ“๎˜ ๎ฏŽ๏ˆ›๏ฟœ๐€จ๐ค’๐‘Ÿป๐‘ซณ๎“๎ฉน๏‰ฝ๏พจ๐‚‘๐ฒš๐‘ˆˆ๐‘ชญ๎šŒ๎ฒธ๏žˆ๏ฝ˜๐Žท๐ป›๐‘›๐‘จถ๎˜ฝ๎ฉ“๏–ณ๏ขก๐†ง๐นฎ๐‘†ฎ๐‘ชธ๎“ป๎คณ๏œ๏ถ“๐ช๐ฃˆ๐‘˜๐‘ต“๎•๎พพ๏ˆ™๏ซ€๐ž๐ช‘๐‘‘Ÿ๐‘ฒฝ๎š๎ š๏ต๏ฎŽ๐–ฝ๐ชง๐‘•ด๐‘ข™๎ƒข๎ฟฏ๏†จ๏ท‘๐Žถ๐ชฝ๐‘šฟ๐‘ข…๎ง๎ก›๏—ž๏ช๐Žฒ๐งž๐‘“–๐‘ตฑ๎‹™๎ณ๏—ž๏จข๐„‹๐ฅ›๐‘Œน๐‘ฎœ๎‹™๎ก›๏• ๏จข๐„‹๐ฃฃ๐‘๐‘ฟ˜๎†Œ๎ณ๏Œ๏ต„๐‡ก๐ฐซ๐‘Œณ๐‘ธŒ๎”•๎บค๏…๏ถช๐›ˆ๐งž๐‘†ฌ๐‘บ๎†€๎ฐ๏’‡๏คถ๐‰ธ๐ฑž๐‘•ช๐‘ค“<|audio_end|><|text_start|>I've also been taught to understand and produce paralinguistic things like sighing, chuckling, or yawning<|text_end|><|end_of_text|>"
tokens = tokenizer(interleaved_str)

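Note that each character in the Unicode string corresponds to one audio token (flattened from 8 codebooks). We apply a Unicode offset of 0xE000 so that the audio characters start at the beginning of the Private Use Area, i.e., at code points not assigned to any standard character. With 16,384 audio tokens, the characters span U+E000 to U+11FFF, which extends past the end of the Basic Multilingual Plane's Private Use Area (U+F8FF).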
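As a minimal sketch of this mapping for a single token, the code point is the offset plus the codebook index times the codebook size plus the code. The helper names `code_to_char` and `char_to_code` are hypothetical and exist only for this illustration.

```python
from typing import Tuple

UNICODE_OFFSET = 0xE000
CODEBOOK_SIZE = 2048


def code_to_char(codebook: int, code: int) -> str:
    """Map one (codebook index, code) pair to its Unicode character."""
    return chr(UNICODE_OFFSET + codebook * CODEBOOK_SIZE + code)


def char_to_code(ch: str) -> Tuple[int, int]:
    """Inverse mapping: recover (codebook index, code) from a character."""
    return divmod(ord(ch) - UNICODE_OFFSET, CODEBOOK_SIZE)


print(hex(ord(code_to_char(0, 0))))        # 0xe000
print(hex(ord(code_to_char(7, 2047))))     # 0x11fff
print(char_to_code(code_to_char(3, 100)))  # (3, 100)
```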

Unicode String to Codes (and vice versa)

from typing import List, Optional, Union

import numpy as np
import torch

UNICODE_OFFSET = 0xE000

def chars_to_codes(
    chars: str, 
    num_codebooks: int,
    codebook_size: int,
    return_tensors: Optional[str] = None, 
    unicode_offset: int = UNICODE_OFFSET,
) -> Union[List[List[int]], np.ndarray, torch.Tensor]:
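    # One Unicode code point per audio token, flattened frame by frame.
    codes = np.array([ord(c) for c in chars])
    # (seq_length * num_codebooks,) -> (num_codebooks, seq_length)
    codes = codes.reshape(-1, num_codebooks).T
    # Undo the Unicode offset and the per-codebook shift.
    for i in range(codes.shape[0]):
        codes[i] -= unicode_offset + i*codebook_size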
    if return_tensors is None:
        codes = codes.tolist()
    elif return_tensors == "pt":
        codes = torch.tensor(codes)
    return codes

def codes_to_chars(
    codes: Union[List[List[int]], np.ndarray, torch.Tensor], 
    codebook_size: int,
    copy_before_conversion: bool = True,
    unicode_offset: int = UNICODE_OFFSET,
) -> str:
    if isinstance(codes, list):
        codes = np.array(codes)
        copy_before_conversion = False
    elif isinstance(codes, torch.Tensor):
        codes = codes.cpu().numpy()
    if len(codes.shape) != 2:
        raise ValueError("codes must be a 2D array of shape (num_codebooks, seq_length).")
    if copy_before_conversion:
        codes = codes.copy()
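    # Shift each codebook row into its own Unicode range.
    for i in range(codes.shape[0]):
        codes[i] += unicode_offset + i*codebook_size
    # (num_codebooks, seq_length) -> flat, frame-by-frame order
    codes = codes.T.reshape(-1)
    chars = "".join([chr(c) for c in codes])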
    return chars

Adapted from https://github.com/AbrahamSanders/codec-bpe
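To make the flattening order concrete, here is a dependency-free round-trip sketch. It is not the code above but a plain-Python restatement of the same layout, run on randomly generated toy codes (an assumption for illustration): characters are emitted frame by frame, with the 8 codebooks of each frame in order.

```python
import random

UNICODE_OFFSET = 0xE000
NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 2048

# Hypothetical toy input of shape (num_codebooks, seq_length).
random.seed(0)
seq_length = 5
codes = [
    [random.randrange(CODEBOOK_SIZE) for _ in range(seq_length)]
    for _ in range(NUM_CODEBOOKS)
]

# Encode: zip(*codes) yields one frame (8 codes) per time step.
chars = "".join(
    chr(UNICODE_OFFSET + k * CODEBOOK_SIZE + c)
    for frame in zip(*codes)
    for k, c in enumerate(frame)
)

# Decode: the position within each group of 8 gives the codebook index.
decoded = [[] for _ in range(NUM_CODEBOOKS)]
for pos, ch in enumerate(chars):
    k = pos % NUM_CODEBOOKS
    decoded[k].append(ord(ch) - UNICODE_OFFSET - k * CODEBOOK_SIZE)

print(len(chars))        # 40
print(decoded == codes)  # True
```

A consequence of this layout is that a valid audio string always has a length that is a multiple of the number of codebooks.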

Codes to Audio Waveform

To get back the audio waveform from the codes, you can use the following code:

import torch
from transformers import MimiModel
model = MimiModel.from_pretrained("kyutai/mimi")
# audio_str is the unicode string of audio tokens (i.e., the characters between <|audio_start|> and <|audio_end|>, excluding the markers)
codes = chars_to_codes(audio_str, num_codebooks=8, codebook_size=2048, return_tensors="pt").unsqueeze(0)
with torch.no_grad():
    audio_decoded = model.decode(codes).audio_values[0]
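One possible way to obtain `audio_str` from an interleaved string is a regular expression over the two audio markers. This is a sketch under the assumption that audio spans are always delimited by `<|audio_start|>` and `<|audio_end|>`. The `extract_audio_strings` helper and the toy input are hypothetical.

```python
import re
from typing import List

# Non-greedy match of everything between the two audio markers.
AUDIO_SPAN = re.compile(r"<\|audio_start\|>(.*?)<\|audio_end\|>", re.DOTALL)


def extract_audio_strings(interleaved: str) -> List[str]:
    """Return the audio character strings found in an interleaved string."""
    return AUDIO_SPAN.findall(interleaved)


# Toy input: a single frame, i.e. code 0 from each of the 8 codebooks.
frame = "".join(chr(0xE000 + k * 2048) for k in range(8))
toy = (
    "<|begin_of_text|><|audio_start|>" + frame + "<|audio_end|>"
    "<|text_start|>hi<|text_end|><|end_of_text|>"
)

spans = extract_audio_strings(toy)
print(len(spans))     # 1
print(len(spans[0]))  # 8
```

Each returned string can then be passed to `chars_to_codes` as `audio_str`.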