[Python] Word Frequency Counting

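Requirement: given an article, which words does it contain, and which words occur most often?

Word frequency in English text

English text: Hamlet, word frequency analysis

Counting English word frequencies takes two steps:

1. Denoise and normalize the text
2. Use a dictionary to represent the word frequencies

Code: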

#CalHamletV1.py
def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()   # normalize case so "The" and "the" count as one word
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")   # replace special characters with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1   # missing keys start at 0
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)   # most frequent first
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Output:

the        1138
and         965
to          754
of          669
you         550
i           542
a           542
my          514
hamlet      462
in          436
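The two steps above (normalize the text, then count with a dictionary) can also be sketched with the standard library alone. This is a minimal alternative, not the code used in this post: `count_words` is a hypothetical helper name, the sample sentence is made up, and `string.punctuation` covers all ASCII punctuation, including the apostrophe, which the hand-written character list above leaves out.

```python
import string
from collections import Counter

def count_words(text):
    """Return a Counter of the lowercase words in text (hypothetical helper)."""
    # Map every ASCII punctuation character to a space in a single pass
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return Counter(words)

sample = "To be, or not to be: that is the question."
counts = count_words(sample)
# most_common(n) returns the n most frequent (word, count) pairs
for word, count in counts.most_common(3):
    print("{0:<10}{1:>5}".format(word, count))
```

`Counter` is a `dict` subclass, so the rest of the script (for example `counts.items()`) would work unchanged if it were swapped in.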

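Word frequency in Chinese text

Chinese text: Romance of the Three Kingdoms (《三国演义》), character analysis

Counting Chinese word frequencies takes two steps:

1. Segment the Chinese text into words
2. Use a dictionary to represent the word frequencies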

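#CalThreeKingdomsV1.py
import jieba

txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)   # segment the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:    # skip single-character tokens
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
# The sorting step was truncated in the source; it is reconstructed here
# from CalHamletV1.py above.
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))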

Output:

曹操          953
孔明          836
将军          772
却说          656
玄德          585
关公          510
丞相          491
二人          469
不可          440
荆州          425
玄德曰         390
孔明曰         390
不能          384
如此          378
张飞          358

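The output clearly contains entries that are irrelevant or duplicated: words such as 将军 ("general") and 却说 ("now then") are not characters at all, while 玄德 and 玄德曰 refer to the same person.

Improved version

Counting Chinese word frequencies now takes three steps:

1. Segment the Chinese text into words
2. Use a dictionary to represent the word frequencies
3. Extend the program to solve the problems above

Irrelevant words go into the excludes set and are removed from the counts; different names for the same character are merged into a single name.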

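#CalThreeKingdomsV2.py
import jieba

excludes = {"将军","却说","荆州","二人","不可","不能","如此"}
txt = open("threekingdoms.txt", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    # merge the different names of one character into a single key
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    del counts[word]   # drop the irrelevant words
items = list(counts.items())
# The rest of this script was truncated in the source; the lines below are
# reconstructed from CalThreeKingdomsV1.py, including the loop bound of 15.
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))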
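The `elif` chain above can also be written as a lookup table, which makes it easier to add more aliases later. A minimal sketch, assuming the text has already been segmented into a list of words (for example by `jieba.lcut`); `ALIASES`, `EXCLUDES`, and `count_names` are hypothetical names introduced here, and the sample word list is made up for illustration.

```python
from collections import Counter

# Alias -> canonical name, mirroring the elif chain above
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}
EXCLUDES = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}

def count_names(words):
    """Count words, skipping noise and merging aliases (hypothetical helper)."""
    counts = Counter()
    for word in words:
        # Skip single characters and excluded words before counting
        if len(word) == 1 or word in EXCLUDES:
            continue
        # Fall back to the word itself when it has no alias
        counts[ALIASES.get(word, word)] += 1
    return counts

# Made-up sample standing in for the output of jieba.lcut
words = ["玄德", "曰", "却说", "玄德曰", "丞相", "曹操", "将军", "孔明"]
print(count_names(words).most_common())
```

Unlike the script above, this sketch filters excluded words while counting rather than deleting them afterwards. The result is the same as long as no alias maps onto an excluded word.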

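Word frequency in postgraduate entrance exam English

Applying word frequency counting to the English section of the postgraduate entrance exam (考研英语) lets us find the key words that occur most often.
Text link: https://pan.baidu.com/s/1Q6uVy-wWBpQ0VHvNI_DQxA Password: fw3r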

# CalHamletV1.py
def getText():
    # The middle of this function was truncated in the source; it is
    # reconstructed here from the Hamlet version above.
    txt = open("86_17_1_2.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")   # replace special characters with spaces
    return txt

pyTxt = getText()        # text with the ASCII punctuation removed
words = pyTxt.split()    # individual words
counts = {}              # dictionary of word -> count
excludes = {"the","a","of","to","and","in","b","c","d","is",
            "was","are","have","were","had","that","for","it",
            "on","be","as","with","by","not","their","they",
            "from","more","but","or","you","at","has","we","an",
            "this","can","which","will","your","one","he","his","all","people","should","than","points","there","i","what","about","new","if","”",
            "its","been","part","so","who","would","answer","some","our","may","most","do","when","1","text","section","2","many","time","into",
            "10","no","other","up","following","【答案】","only","out","each","much","them","such","world","these","sheet","life","how","because","3","even",
            "work","directions","use","could","now","first","make","years","way","20","those","over","also","best","two","well","15","us","write","4","5","being","social","read","like","according","just","take","paragraph","any","english","good","after","own","year","must","american","less","her","between","then","children","before","very","human","long","while","often","my","too",
            "40","four","research","author","questions","still","last","business","education","need","information","public","says","passage","reading","through","women","she","health","example","help","get","different","him","mark","might","off","job","30","writing","choose","words","economic","become","science","society","without","made","high","students","few","better","since","6","rather","however","great","where","culture","come",
            "both","three","same","government","old","find","number","means","study","put","8","change","does","today","think","future","school","yet","man","things","far","line","7","13","50","used","states","down","12","14","16","end","11","making","9","another","young","system","important","letter","17","chinese","every","see","s","test","word","century","language","little",
            "give","said","25","state","problems","sentence","food","translation","given","child","18","longer","question","back","don’t","19","against","always","answers","know","having","among","instead","comprehension","large","35","want","likely","keep","family","go","why","41","home","law","place","look","day","men","22","26","45","it’s","others","companies","countries","once","money","24","though",
            "27","29","31","say","national","ii","23","based","found","28","32","past","living","university","scientific","–","36","38","working","around","data","right","21","jobs","33","34","possible","feel","process","effect","growth","probably","seems","fact","below","37","39","history","technology","never","sentences","47","true","scientists","power","thought","during","48","early","parents",
            "something","market","times","46","certain","whether","000","did","enough","problem","least","federal","age","idea","learn","common","political","pay","view","going","attention","happiness","moral","show","live","until","52","49","ago","percent","stress","43","44","42","meaning","51","e","iii","u","60","anything","53","55","cultural","nothing","short","100","water","car","56","58","【解析】","54","59","57","v","。","63","64","65","61","62","66","70","75","f","【考点分析】","67","here","68","71","72","69","73","74","选项a","ourselves","teachers","helps","参考范文","gdp","yourself","gone","150"}

for word in words:
    if word not in excludes:
        counts[word] = counts.get(word, 0) + 1
# The sorting and first print were truncated in the source; they are
# reconstructed from the Hamlet version. The loop bound is not recoverable,
# so 10 is used as in that version.
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

x = len(counts)
print(x)   # number of distinct words left after filtering
r = 0
answer = input("Enter 1 to continue: ")
while answer.strip() == "1":
    r += 100
    # print the next 100 words, quoted, ready to paste into excludes;
    # slicing avoids an IndexError near the end of the list
    for word, count in items[r:r + 100]:
        print("\"{}\"".format(word), end=",")
    answer = input("Enter 1 to continue: ")
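Many entries in `excludes` above are plain numbers ("10", "20", ...) or single letters, which can be filtered by rule instead of being listed one by one. A hedged sketch of that idea: `STOPWORDS`, `is_noise`, and `count_keywords` are hypothetical names, and the tiny stop-word set and sample sentence are for illustration only.

```python
from collections import Counter

# A deliberately small stop-word set; a real one would be much longer
STOPWORDS = {"the", "a", "of", "to", "and", "in"}

def is_noise(word):
    """True for stop words, all-digit tokens, and single characters."""
    return word in STOPWORDS or word.isdigit() or len(word) == 1

def count_keywords(words):
    """Count only the words that are not noise (hypothetical helper)."""
    return Counter(w for w in words if not is_noise(w))

words = "the economy of 2017 and the culture of a b economy".split()
counts = count_keywords(words)
print(counts.most_common())
```

With rules like these, the hand-maintained set only needs to hold genuine stop words, at the cost of also dropping any single-character token that might have been meaningful.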

Original URL: https://www.outofmemory.cn/langs/1189805.html