jaime-m-p | 43248e5594 llama3 custom regex split (#6965) | 1 year ago
Georgi Gerganov | 92139b90af tests : add test-tokenizer-0.sh + fix some tokenizers (#7036) | 1 year ago
Georgi Gerganov | f4ab2a4147 llama : fix BPE pre-tokenization (#6920) | 1 year ago
Jared Van Bortel | 32c8486e1f wpm : portable unicode tolower (#6305) | 1 year ago
Georgi Gerganov | 83796e62bc llama : refactor unicode stuff (#5992) | 1 year ago
Douglas Hanley | 9600d59e01 unicode : switch to multimap based nfd_map (#5799) | 1 year ago
Douglas Hanley | 177628bfd8 llama : improve BERT tokenization (#5740) | 1 year ago
Georgi Gerganov | 67fd33132f unicode : reuse iterator (#5726) | 1 year ago
Georgi Gerganov | cf45252a7c tests : multi-thread the tokenizer tests (#5474) | 1 year ago
bobqianic | 6c5629d4d2 add `#include <string>` to unicode.h (#5051) | 2 years ago
goerch | ff5a3f0c09 Work on the BPE tokenizer (#3252) | 2 years ago