Manual


Advanced tokenization decisions, such as splitting off affixes or separating compound words, are currently beyond the scope of this project. For English, we follow the PTB (Penn Treebank) tokenization guidelines. For other languages, we use a simple generic tokenizer that splits only at whitespace and around punctuation characters.
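
For illustration, a minimal sketch of such a generic tokenizer is shown below. The function name and the exact punctuation handling are assumptions for this example, not the project's actual implementation:

    import re

    def generic_tokenize(text):
        # Hypothetical sketch: put spaces around every punctuation character
        # (anything that is neither a word character nor whitespace),
        # then split on whitespace.
        spaced = re.sub(r"([^\w\s])", r" \1 ", text)
        return spaced.split()

    # Example:
    # generic_tokenize("Hello, world!") -> ['Hello', ',', 'world', '!']

Note that this treats every punctuation character as a separate token, so contractions and abbreviations are split apart as well; that is the expected behavior of a simple fallback tokenizer without language-specific rules.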