This is the tool for quantizing the CLIP visual projector model. Quantization reduces the precision of the model's weights, which can significantly decrease the model size and improve inference speed, often with minimal impact on performance.
To quantize a CLIP visual projector model, use the following command:
```sh
./bin/llama-llava-clip-quantize-cli /path/to/ggml-model-f32.gguf /path/to/ggml-model-quantized.gguf <type>
```
After quantization, the visual projector can be used freely with the existing LLaVA CLI tools (LLaVA, Qwen2VL, etc.).
The arguments are:

- `/path/to/ggml-model-f32.gguf`: the path to the input model file, in FP32 or FP16 format.
- `/path/to/ggml-model-quantized.gguf`: the path where the quantized model will be saved.
- `<type>`: the quantization type to apply, given as an integer corresponding to one of the quantization types defined in the `ggml_type` enum.

The following quantization types are supported, based on the `ggml_type` definition:
- `2` - `q4_0`: 4-bit quantization (each block stores a scale).
- `3` - `q4_1`: 4-bit quantization (each block stores a scale and a minimum).
- `6` - `q5_0`: 5-bit quantization (each block stores a scale).
- `7` - `q5_1`: 5-bit quantization (each block stores a scale and a minimum).
- `8` - `q8_0`: 8-bit quantization (each block stores a scale).

For example, to quantize a model using the `q4_0` quantization type, run:
```sh
./bin/llama-llava-clip-quantize-cli /path/to/ggml-model-f32.gguf /path/to/ggml-model-quantized.gguf 2
```
This command will generate a quantized model at `/path/to/ggml-model-quantized.gguf` using the `q4_0` quantization method.
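As a sketch of the end-to-end workflow, the quantized projector is passed to the LLaVA CLI via `--mmproj` alongside the language model. The file names below are placeholders, and the exact binary name and flags may differ between llama.cpp builds, so treat this as an illustration rather than a guaranteed invocation:

```sh
# 1. Quantize the FP16/FP32 CLIP visual projector to q4_0 (type 2).
#    Input/output paths are placeholders; substitute your own files.
./bin/llama-llava-clip-quantize-cli \
    /path/to/mmproj-model-f16.gguf \
    /path/to/mmproj-model-q4_0.gguf 2

# 2. Run inference with the quantized projector.
#    -m: language model, --mmproj: visual projector, --image: input image
./bin/llama-llava-cli \
    -m /path/to/ggml-model-q4_0.gguf \
    --mmproj /path/to/mmproj-model-q4_0.gguf \
    --image /path/to/test-1.jpeg \
    -p "Describe this image."
```

Quantizing the projector only shrinks the vision side; the language model is quantized separately with the usual `llama-quantize` workflow.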