This example demonstrates the Text To Speech feature. It uses a model from OuteAI.
If you have built llama.cpp with `-DLLAMA_CURL=ON` you can simply run the
following command and the required models will be downloaded automatically:
```console
$ build/bin/llama-tts --tts-oute-default -p "Hello world" && aplay output.wav
```
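The same quickstart can also be driven from a script; for example, a minimal Python wrapper (a sketch assuming the default `output.wav` output name and the Linux `aplay` player):

```python
# Run the quickstart from Python; assumes llama.cpp was built with
# -DLLAMA_CURL=ON and that aplay is available (Linux).
import subprocess

subprocess.run(
    ["build/bin/llama-tts", "--tts-oute-default", "-p", "Hello world"],
    check=True,
)
subprocess.run(["aplay", "output.wav"], check=True)
```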
For details about the models and how to convert them to the required format, see the following sections.
Check out or download the repository that contains the LLM model:
```console
$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/OuteAI/OuteTTS-0.2-500M
$ cd OuteTTS-0.2-500M && git lfs install && git lfs pull
$ popd
```
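If you prefer to fetch the files from Python instead of git, the `huggingface_hub` package offers an equivalent download (an optional alternative, not part of the steps below):

```python
# Optional alternative to the git clone above; requires
# `pip install huggingface_hub`.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="OuteAI/OuteTTS-0.2-500M",
    local_dir="models/OuteTTS-0.2-500M",
)
```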
Convert the model to .gguf format:
```console
(venv) python convert_hf_to_gguf.py models/OuteTTS-0.2-500M \
    --outfile models/outetts-0.2-0.5B-f16.gguf --outtype f16
```
The generated model will be `models/outetts-0.2-0.5B-f16.gguf`.
We can optionally quantize this to Q8_0 using the following command:
```console
$ build/bin/llama-quantize models/outetts-0.2-0.5B-f16.gguf \
    models/outetts-0.2-0.5B-q8_0.gguf q8_0
```
The quantized model will be `models/outetts-0.2-0.5B-q8_0.gguf`.
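As a quick sanity check, the converted file can be inspected with the `gguf` Python package from llama.cpp's `gguf-py` directory (a minimal sketch; which metadata fields appear depends on the model):

```python
# Minimal sanity check: open the quantized model with the gguf-py
# reader and list its metadata keys and tensor count.
from gguf import GGUFReader

reader = GGUFReader("models/outetts-0.2-0.5B-q8_0.gguf")
for name in reader.fields:
    print(name)
print(f"{len(reader.tensors)} tensors")
```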
Next we do something similar for the audio decoder. First download or check out the model for the voice decoder:
```console
$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/novateur/WavTokenizer-large-speech-75token
$ cd WavTokenizer-large-speech-75token && git lfs install && git lfs pull
$ popd
```
This model file is a PyTorch checkpoint (.ckpt) and we first need to convert it to Hugging Face format:
```console
(venv) python examples/tts/convert_pt_to_hf.py \
    models/WavTokenizer-large-speech-75token/wavtokenizer_large_speech_320_24k.ckpt
...
Model has been successfully converted and saved to models/WavTokenizer-large-speech-75token/model.safetensors
Metadata has been saved to models/WavTokenizer-large-speech-75token/index.json
Config has been saved to models/WavTokenizer-large-speech-75token/config.json
```
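Conceptually, the conversion does something along these lines (a simplified sketch, not the actual `convert_pt_to_hf.py`; the `state_dict` key is an assumption about the checkpoint layout, and the real script also writes the index and config files):

```python
# Simplified sketch of a .ckpt -> safetensors conversion. The real
# script also splits out the metadata (index.json) and config.json.
import torch
from safetensors.torch import save_file

ckpt = torch.load(
    "models/WavTokenizer-large-speech-75token/wavtokenizer_large_speech_320_24k.ckpt",
    map_location="cpu",
)
state_dict = ckpt.get("state_dict", ckpt)  # unwrap if the weights are nested
tensors = {
    k: v.contiguous()
    for k, v in state_dict.items()
    if isinstance(v, torch.Tensor)
}
save_file(tensors, "models/WavTokenizer-large-speech-75token/model.safetensors")
```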
Then we can convert the Hugging Face format to .gguf:
```console
(venv) python convert_hf_to_gguf.py models/WavTokenizer-large-speech-75token \
    --outfile models/wavtokenizer-large-75-f16.gguf --outtype f16
...
INFO:hf-to-gguf:Model successfully exported to models/wavtokenizer-large-75-f16.gguf
```
With both models generated, the LLM model and the voice decoder, we can run the example:
```console
$ build/bin/llama-tts -m ./models/outetts-0.2-0.5B-q8_0.gguf \
    -mv ./models/wavtokenizer-large-75-f16.gguf \
    -p "Hello world"
...
main: audio written to file 'output.wav'
```
The `output.wav` file will contain the audio of the prompt. This can be heard by playing the file with a media player. On Linux the following command will play the audio:
```console
$ aplay output.wav
```
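If you want to verify the result programmatically, the WAV header can be inspected with Python's standard library alone (a minimal sketch; the checkpoint name suggests the decoder outputs 24 kHz audio):

```python
# Print basic properties of the generated audio using only the
# standard library wave module.
import wave

with wave.open("output.wav", "rb") as f:
    frames = f.getnframes()
    rate = f.getframerate()
    print(f"{frames} frames at {rate} Hz -> {frames / rate:.2f} s of audio")
```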