
Use `tokenizer.vocab_size()` instead of hardcoding 32000 in convert-pth-to-ggml.py (#142)

There are ways that special tokens or other new tokens could be added to the tokenizer; therefore it's probably best not to assume the vocabulary is only 32000 tokens.
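For context on the fix: the vocabulary size can be queried from the SentencePiece tokenizer itself rather than assumed. Below is a minimal sketch of that query, assuming the sentencepiece Python bindings and a tokenizer.model file alongside the checkpoint (the same kind of model file the conversion script loads); the path here is illustrative:

    import sentencepiece as spm

    # Load the SentencePiece model that ships with the checkpoint.
    tokenizer = spm.SentencePieceProcessor()
    tokenizer.load("tokenizer.model")  # illustrative path

    # The stock LLaMA tokenizer reports 32000, but a tokenizer extended
    # with additional special tokens would report a larger value, which
    # is why the conversion script should not hardcode the count.
    print(tokenizer.vocab_size())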
Ronsor · 2 years ago
Commit 956dfda8ad
1 changed file with 1 addition and 1 deletion

convert-pth-to-ggml.py · +1 −1

@@ -99,7 +99,7 @@ for p in range(n_parts):
     fout.write(struct.pack("i", ftype))
 
     # Is this correct??
-    for i in range(32000):
+    for i in range(tokenizer.vocab_size()):
         if tokenizer.is_unknown(i):
             # "<unk>" token (translated as ??)
             text = " \u2047 ".encode("utf-8")