You need to supply the tokenizer with a list of abbreviations, like so:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

# Abbreviations are given lowercase and without the trailing period.
punkt_param = PunktParameters()
punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
sentence_splitter = PunktSentenceTokenizer(punkt_param)

text = "is THAT what you mean, Mrs. Hussey?"
sentences = sentence_splitter.tokenize(text)
sentences is now:
['is THAT what you mean, Mrs. Hussey?']
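Update: this does not work if the last word of the sentence has an apostrophe or quotation mark attached to it (e.g. Hussey?’). A quick-and-dirty workaround is to insert a space before apostrophes and quotation marks that directly follow a sentence-ending symbol (. ! ?):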
text = text.replace('?"', '? "').replace('!"', '! "').replace('."', '. "')
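The replace chain above only covers the straight double quote. As a rough sketch, the same idea can be written as a single regular expression that also covers apostrophes and curly closing quotes. The helper name space_before_closing_quotes and the exact set of quote characters are assumptions made for illustration; they are not part of NLTK.

```python
import re

# Hypothetical helper, not part of NLTK: matches a sentence-ending mark
# (. ! ?) immediately followed by a quote character. The set of quote
# characters is an assumption; extend it as your text requires.
_QUOTE_AFTER_END = re.compile(r'([.!?])(["\'\u2019\u201d])')


def space_before_closing_quotes(text):
    """Insert a space between . ! ? and a directly following quote."""
    return _QUOTE_AFTER_END.sub(r'\1 \2', text)


sample = 'He asked, "is THAT what you mean, Mrs. Hussey?" She nodded.'
print(space_before_closing_quotes(sample))
```

The result can then be passed to sentence_splitter.tokenize as before. Note that this changes the text itself, so the extra spaces will also appear in the returned sentences.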