python-tokenizers
Provides an implementation of today's most used tokenizers
Provides an implementation of today's most used tokenizers, with a focus on performance and versatility (a usage sketch follows below):

* Train new vocabularies and tokenize, using today's most used tokenizers.
* Extremely fast (both training and tokenization), thanks to the Rust implementation: tokenizing a GB of text takes less than 20 seconds on a server's CPU.
* Easy to use, but also extremely versatile.
* Designed for both research and production.
* Normalization comes with alignment tracking, so it is always possible to recover the part of the original sentence that corresponds to a given token.
* Handles all the pre-processing: truncation, padding, and adding the special tokens your model needs.
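A minimal sketch of the upstream Hugging Face `tokenizers` Python API that this package provides, touching the features listed above (training, truncation/padding, and offset tracking). The BPE model and trainer are just one common configuration, and `corpus.txt` is a placeholder for your own training file:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build and train a BPE tokenizer with the usual special tokens.
# `corpus.txt` is a placeholder for a local plain-text training file.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Pre-processing: truncate and pad every encoding to a fixed length.
tokenizer.enable_truncation(max_length=128)
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")

# Encode a sentence; offsets map each token back to the original text,
# which is the alignment tracking mentioned in the description.
encoding = tokenizer.encode("Hello, y'all! How are you?")
print(encoding.tokens)   # the produced tokens
print(encoding.offsets)  # (start, end) character spans in the input
```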
No official package available for openSUSE Leap 16.0

Distributions
openSUSE Tumbleweed
openSUSE Leap 16.0
openSUSE Leap 15.6
openSUSE Factory RISCV
SLFO 1.2
openSUSE Backports for SLE 15 SP7
openSUSE Backports for SLE 15 SP4