Commit History

Author            SHA1        Message                                                        Date
jaime-m-p         43248e5594  llama3 custom regex split (#6965)                              1 year ago
Georgi Gerganov   92139b90af  tests : add test-tokenizer-0.sh + fix some tokenizers (#7036)  1 year ago
Georgi Gerganov   f4ab2a4147  llama : fix BPE pre-tokenization (#6920)                       1 year ago
Jared Van Bortel  32c8486e1f  wpm : portable unicode tolower (#6305)                         1 year ago
Georgi Gerganov   83796e62bc  llama : refactor unicode stuff (#5992)                         1 year ago
Douglas Hanley    9600d59e01  unicode : switch to multimap based nfd_map (#5799)             1 year ago
Douglas Hanley    177628bfd8  llama : improve BERT tokenization (#5740)                      1 year ago
Georgi Gerganov   67fd33132f  unicode : reuse iterator (#5726)                               1 year ago
Georgi Gerganov   cf45252a7c  tests : multi-thread the tokenizer tests (#5474)               1 year ago
bobqianic         6c5629d4d2  add `#include <string>` to unicode.h (#5051)                   2 years ago
goerch            ff5a3f0c09  Work on the BPE tokenizer (#3252)                              2 years ago