This tokenizer extends stanford-crfm/marin-tokenizer with audio tokens (8-codebook Mimi tokens) and new special tokens: <|text_start|>, <|text_end|>, <|audio_start|>, and <|audio_end|>.
The tokenizer is meant to be used on raw text (human-readable) interleaved with a Unicode string of audio tokens (not human-readable). For example,
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("potsawee/marin-mimi-bpe-8cb-16k-tokenizer")
interleaved_str = "<|begin_of_text|><|audio_start|>๎๎ณ๏ก๏ถฎ๐ข๐ขฅ๐
๐ท
๎ฟ๎ณ๏ ๏ข๐๐จซ๐๐ข๎๎ณ๏ ๏ต๐
๐ฅ๐ณ๐จธ๎๎ณ๏ธ๏ข๐๐ฐ๐๐ฎฑ๎๎ณ๏ธ๏กจ๐๐นด๐บ๐ฟ๎๎ณ๏๏ข๐๐ฐ๐ซ๐จธ๎๎ณ๏
๏กจ๐๐ธค๐น๐ฟ๎๎ณ๏ก๏ฆก๐ซ๐ถฃ๐ณ๐ท
๎๎ณ๏ ๏กจ๐
๐ฑ๐ก๐ทฝ๎ด๎ฏก๏ฆ๏ชท๐ธ๐ฐฉ๐๐ญ๎๎ป๏๏ฅ๐๐ฐฅ๐ถ๐ตฟ๎ฝ๎ฉธ๏ฆ๏ฟ๐ฒ๐ท๐๐ก๎ฒ๎ก๏ฑ๏ฝ๐๐ญ๐บ๐ฏ๎ธ๎ฌผ๏๏ก๐๐ พ๐๐ด
๎ฃ๎ค๏๏ขฌ๐ญ๐ฎ๐๐ป๎ณ๎บ๏๏นฅ๐พ๐ญซ๐๐ธ ๎ข๎ง๏ค๏ฃ๐บ๐ฒง๐๐ค๎๎ฅต๏ฆ๏จฏ๐๐ถญ๐๐ฝ๎ญ๎พ๏
๏ฑญ๐ ๐ผซ๐ต๐ง๎ป๎ท๏พ๏ฆฆ๐ง๐บ๐ญ๐นฃ๎ฃ๎ข
๏น๏ ง๐ญ๐ช๐๐ฑฏ๎๎ซ๏๏ดฏ๐ ๐ญฎ๐๐ฅ๎ฆ๎ง๏ณ๏น๐ฟ๐ธธ๐ญ๐ฑผ๎ท๎ปพ๏ด๏ทฟ๐๐ ๐
๐บต๎๎ฟซ๏๏ฅ ๐ฒ๐ฌ๐๐จ๎๎ฅช๏๏ตณ๐ข๐พ๐๐ค๎๎ฌช๏๏ตณ๐ต๐ป
๐ณ๐ชจ๎๎ช๏
๏ฃท๐ญ๐พ๐๐ต๎ฌ๎ปจ๏ฒ๏ณซ๐ฒ๐ซ๐ท๐ฑป๎น๎ด๏๏ค๐๐จ๐ฌ๐ค๎๎น๏๏ฝป๐ง๐ฏก๐๐ญ๎๎ ผ๏๏ค๐๐ท๐๐น๎บ๎ฒข๏๏ฐ๐๐ท๐๐ฐ๎
๎พ๏๏ก๐๐ญ๐๐ฐ
๎น๎บฉ๏ฑ๏ตต๐๐ฟช๐๐บ๎ฏ๎ฌ๏พ๏น๐๐ฐฏ๐๐น ๎ด๎ช๏ต๏ป๐ค๐ฐ๐๐ฐก๎๎ฉฃ๏จ๏ฑ ๐ผ๐พ๐๐ฐฆ๎๎ฆ๏ญ๏ซ๐ฑ๐ค๐๐น๎ฎ๎ ผ๏๏พ๐ท๐ท๐ป๐ฒฎ๎๎ผฒ๏ผ๏ฅช๐๐ท๐๐ณฅ๎๎ฆด๏ข๏ฐญ๐ฏ๐ถ๐ถ๐จ๎๎ฟญ๏ป๏ฅ๐๐งฏ๐๐ธ๎๎กฟ๏๏ฎ
๐พ๐ต๐๐ซ๎ฏ๎จ๏
๏ผ๐ฟ๐ขฒ๐ด๐ฅ๎ค๎ ๏ฒ๏ฃง๐ธ๐ฅน๐๐ฉฝ๎
๎ผฆ๏๏ฅ๐๐ฅฐ๐ด๐ฆ๎๎ก๏๏กญ๐ถ๐ธฃ๐๐คฑ๎๎ขฝ๏ฐ๏ด๐ฎ๐ง๐ ๐ด๎พ๎ ถ๏๏พฑ๐ค๐ซ๐ฌ๐ฌ๎๎ฅฃ๏๏นฝ๐ซ๐ดป๐
๐ฑ๎๎ญ๏ฅ๏ฅ๐๐ธ๐ต๐ค๎ผ๎ท๏ ๏ฅ
๐ฃ๐งญ๐ท๐ท๎๎ฑ๏ด๏ฅ๐พ๐ธพ๐ฅ๐ปพ๎ง๎บ๏จ๏ปซ๐ซ๐พง๐๐ฎฟ๎๎ญ๏ง๏ ๐
๐ฆ๐๐บช๎ฌ๎ต ๏๏ฎณ๐๐ฎจ๐ข๐ฑฆ๎บ๎น๏๏ผ๐จ๐ดท๐ญ๐ญฉ๎๎ฐผ๏ด๏ฃ๐ฒ๐ธฏ๐จ๐ผผ๎ ๎ด๏๏ผ๐
๐ฆฏ๐จ๐ฐฝ๎ฅ๎ธฑ๏ถ๏ผ๐ป๐ป๐ท๐ฎฅ๎๎คค๏ฅ๏ธ๐ข๐ก๐ค๐ ๎๎นฉ๏๏ฏ๐ซ๐บน๐ช๐ด๎ฟ๎ซ๏๏ธท๐๐ธฃ๐๐น๎ณ๎ทก๏๏ซ๐ต๐ฎ๐ฏ๐ฎฅ๎๎ต๏๏ฒน๐ซ๐พท๐ด๐ฐซ๎ฌ๎กฒ๏ฌ๏บ๐ฑ๐ฉ๐ฃ๐ถฉ๎บ๎ตซ๏๏ท๐ฑ๐ฎฝ๐ฆ๐ฅต๎๎ธฒ๏ณ๏ก๐น๐ฑข๐ฅ๐ณฒ๎ฐ๎ผผ๏๏ฎบ๐ซ๐ฟ๐๐ฏ๎ ๎ฏ๏๏ฟ๐จ๐ค๐ป๐ซณ๎๎ฉน๏ฝ๏พจ๐๐ฒ๐๐ชญ๎๎ฒธ๏๏ฝ๐ท๐ป๐๐จถ๎ฝ๎ฉ๏ณ๏ขก๐ง๐นฎ๐ฎ๐ชธ๎ป๎คณ๏๏ถ๐ช๐ฃ๐๐ต๎๎พพ๏๏ซ๐๐ช๐๐ฒฝ๎๎ ๏ต๏ฎ๐ฝ๐ชง๐ด๐ข๎ข๎ฟฏ๏จ๏ท๐ถ๐ชฝ๐ฟ๐ข
๎ง๎ก๏๏ช๐ฒ๐ง๐๐ตฑ๎๎ณ๏๏จข๐๐ฅ๐น๐ฎ๎๎ก๏ ๏จข๐๐ฃฃ๐๐ฟ๎๎ณ๏๏ต๐ก๐ฐซ๐ณ๐ธ๎๎บค๏
๏ถช๐๐ง๐ฌ๐บ๎๎ฐ๏๏คถ๐ธ๐ฑ๐ช๐ค<|audio_end|><|text_start|>I've also been taught to understand and produce paralinguistic things like sighing, chuckling, or yawning<|text_end|><|end_of_text|>"
tokens = tokenizer(interleaved_str)
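The output is a standard tokenizer encoding. As a quick sanity check (a sketch, not part of the original card), you can confirm that the added special tokens map to single IDs and that decoding reproduces the interleaved string:

print(len(tokens["input_ids"]))  # number of BPE tokens covering text, audio characters, and special tokens
print(tokenizer.convert_tokens_to_ids("<|audio_start|>"))  # the added special tokens are single tokens
# Decoding should give back interleaved_str (up to any tokenizer normalization).
print(tokenizer.decode(tokens["input_ids"]))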
Note that each character in the Unicode string corresponds to one audio token (flattened from 8 codebooks). We apply a Unicode offset of 0xE000 so that the code points used for audio tokens start in the Unicode Private Use Area (i.e., code points not assigned to any other character).
from typing import List, Optional, Union

import numpy as np
import torch

UNICODE_OFFSET = 0xE000

def chars_to_codes(
    chars: str,
    num_codebooks: int,
    codebook_size: int,
    return_tensors: Optional[str] = None,
    unicode_offset: int = UNICODE_OFFSET,
) -> Union[List[List[int]], np.ndarray, torch.Tensor]:
    # Map each character back to its Unicode code point.
    codes = np.array([ord(c) for c in chars])
    # Un-flatten: (seq_length * num_codebooks,) -> (num_codebooks, seq_length).
    codes = codes.reshape(-1, num_codebooks).T
    # Remove the Private Use Area offset plus the per-codebook offset.
    for i in range(codes.shape[0]):
        codes[i] -= unicode_offset + i * codebook_size
    if return_tensors is None:
        codes = codes.tolist()
    elif return_tensors == "pt":
        codes = torch.tensor(codes)
    return codes
def codes_to_chars(
    codes: Union[List[List[int]], np.ndarray, torch.Tensor],
    codebook_size: int,
    copy_before_conversion: bool = True,
    unicode_offset: int = UNICODE_OFFSET,
) -> str:
    if isinstance(codes, list):
        codes = np.array(codes)  # np.array already copies, so no extra copy is needed
        copy_before_conversion = False
    elif isinstance(codes, torch.Tensor):
        codes = codes.cpu().numpy()
    if len(codes.shape) != 2:
        raise ValueError("codes must be a 2D array of shape (num_codebooks, seq_length).")
    if copy_before_conversion:
        codes = codes.copy()  # avoid mutating the caller's array in place
    # Add the Private Use Area offset plus the per-codebook offset.
    for i in range(codes.shape[0]):
        codes[i] += unicode_offset + i * codebook_size
    # Flatten: (num_codebooks, seq_length) -> one character per (frame, codebook) pair, frame-major.
    codes = codes.T.reshape(-1)
    chars = "".join([chr(c) for c in codes])
    return chars
These helper functions are adapted from https://github.com/AbrahamSanders/codec-bpe
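For reference, here is a small round-trip sketch of these helpers (the numbers are illustrative, not taken from a real audio clip): codes are a (num_codebooks, seq_length) array of Mimi codebook indices, codes_to_chars turns them into a compact Unicode string, and chars_to_codes recovers them.

# Toy example: 8 codebooks, 3 frames of made-up codebook indices in [0, 2048).
toy_codes = np.random.randint(0, 2048, size=(8, 3))
audio_chars = codes_to_chars(toy_codes, codebook_size=2048)
recovered = chars_to_codes(audio_chars, num_codebooks=8, codebook_size=2048)
assert (np.array(recovered) == toy_codes).all()
print(len(audio_chars))  # 24 characters: one per (frame, codebook) pair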
To get back the audio waveform from the codes, you can use the following code:
import torch
from transformers import MimiModel

model = MimiModel.from_pretrained("kyutai/mimi")

# audio_str is the Unicode string of audio tokens, i.e., the characters between
# <|audio_start|> and <|audio_end|> (the special tokens themselves are excluded).
codes = chars_to_codes(audio_str, num_codebooks=8, codebook_size=2048, return_tensors="pt").unsqueeze(0)
with torch.no_grad():
    audio_decoded = model.decode(codes).audio_values[0]
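The snippet above assumes audio_str has already been pulled out of the interleaved string. A minimal sketch of that step, plus writing the decoded waveform to a WAV file, could look like the following; it assumes a single audio segment, and that Mimi produces mono audio at 24 kHz, with soundfile used here only as an example writer.

import soundfile as sf

# Extract the audio characters between the special tokens (assumes exactly one audio segment).
audio_str = interleaved_str.split("<|audio_start|>")[1].split("<|audio_end|>")[0]

# audio_decoded has shape (channels, num_samples); Mimi is mono, so take the first channel
# and save it at an assumed 24 kHz sampling rate.
sf.write("decoded.wav", audio_decoded[0].cpu().numpy(), samplerate=24000)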