Anti-Scraping: Scraping Maoyan Movie Information with Python 3.6

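Approach:

1. Page information

URL: http://maoyan.com/cinema/24311?poi=164257570

Inspecting the page shows that the ticket prices appear as garbled characters. The digits are drawn with a custom web font, so the underlying characters are not ordinary numerals.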

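Refresh the page, find the URL of the font behind the garbled characters, and download the .woff file. To do this, copy the font URL, then right-click and choose "Go to" to start the download. The downloaded file is baseprice.woff in the code. Refresh the page again and download the font URL the same way to get the font that has to be matched against the base font. This second file is maoprice.woff in the code.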
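The font URL can also be pulled out of the page source programmatically instead of being copied by hand. The following is a minimal sketch, assuming the page embeds the font in an `@font-face` rule of the form `url('//....woff') format('woff')`. The sample CSS, the host name in it, and the `find_woff_urls` helper are made up for illustration.

```python
import re

# Made-up fragment standing in for the page source; the real URL changes per request.
sample_page = """
@font-face {
  font-family: stonefont;
  src: url('//fonts.example.com/abc123.eot');
  src: url('//fonts.example.com/abc123.eot?#iefix') format('embedded-opentype'),
       url('//fonts.example.com/abc123.woff') format('woff');
}
"""

# Literal parentheses and the dot are escaped; [^']* keeps the match inside one url('...').
WOFF_URL = re.compile(r"url\('(//[^']*\.woff)'\)\s*format\('woff'\)")


def find_woff_urls(page_source):
    """Return absolute URLs for every protocol-relative .woff reference found."""
    return ["http:" + path for path in WOFF_URL.findall(page_source)]


print(find_woff_urls(sample_page))
```

Escaping the parentheses matters here: unescaped, they would be read as regex groups and the pattern would never match the literal `url('...')` text.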

Open the saved baseprice.woff file with the following online tool:

FontEditor: fontstore.baidu.com

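The glyph names and digits shown in FontEditor correspond to the baseUniCode and baseNumList lines in the code.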

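How the font parsing works: first download the font behind the garbled characters from the page (baseprice.woff). It can be converted to XML and opened in PyCharm to inspect its contents. Then refresh the page and download the font again (maoprice.woff). Comparing the two files shows how their glyphs correspond, and the code can be written from that correspondence.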

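The correspondence between the two fonts: the glyph names differ from one download to the next, but the glyph drawn for each digit is the same, so glyphs can be matched by comparing their outlines.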
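The matching idea can be sketched without any font library. In this minimal sketch, each font is modelled as a plain dict from glyph name to a tuple of outline points. The glyph names in the second font, all coordinates, and the `match_glyphs` helper are hypothetical stand-ins for what fontTools would return from the real files.

```python
# Hypothetical, simplified model: glyph name -> outline points.
# Real outlines come from the font's 'glyf' table; these values are made up.
base_font = {
    "uniF76E": ((0, 0), (10, 0), (10, 20)),  # assumed to draw "6"
    "uniEACB": ((0, 0), (5, 15), (10, 0)),   # assumed to draw "4"
}
base_digits = {"uniF76E": "6", "uniEACB": "4"}

# A second download: same outlines, different glyph names.
new_font = {
    "uniE111": ((0, 0), (5, 15), (10, 0)),
    "uniE222": ((0, 0), (10, 0), (10, 20)),
}


def match_glyphs(base, digits, new):
    """Map each glyph name in `new` to the digit whose base outline is identical."""
    outline_to_digit = {outline: digits[name] for name, outline in base.items()}
    return {
        name: outline_to_digit[outline]
        for name, outline in new.items()
        if outline in outline_to_digit
    }


mapping = match_glyphs(base_font, base_digits, new_font)
print(mapping)  # {'uniE111': '4', 'uniE222': '6'}
```

Indexing the base font by outline first means each new glyph is resolved with one dictionary lookup rather than a nested loop over every base glyph.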

2. Font-parsing code:

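from fontTools.ttLib import TTFont

# Raw string, so the backslashes in the Windows path are not read as escapes
baseFont = TTFont(r'C:\Users\nanafighting\Desktop\baseprice.woff')
maoyanFont = TTFont('maoprice.woff')
maoyan_unicode_List = maoyanFont.getGlyphOrder()
maoyan_num_List = []
# Characters in the base font and their glyph names, read off in FontEditor
baseNumList = ['.', '6', '4', '7', '5', '2', '8', '0', '1', '9', '3']
baseUniCode = ['x', 'uniF76E', 'uniEACB', 'uniE8D1', 'uniE737', 'uniE9B7',
               'uniF098', 'uniF4DC', 'uniF85E', 'uniE2F1', 'uniEE4E']
# Entry 0 of the glyph order is skipped; entries 1..11 are compared with the base font
for i in range(1, 12):
    maoyanGlyph = maoyanFont['glyf'][maoyan_unicode_List[i]]
    for j in range(11):
        baseGlyph = baseFont['glyf'][baseUniCode[j]]
        if maoyanGlyph == baseGlyph:
            maoyan_num_List.append(baseNumList[j])
            break
# Give entry 1 a uniXXXX name (U+0078 is 'x') so its hex suffix can be parsed below
maoyan_unicode_List[1] = 'uni0078'
# Turn each glyph name into the UTF-8 bytes of the character it encodes
utf8List = [chr(int(uni[3:], 16)).encode("utf-8") for uni in maoyan_unicode_List[1:]]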

3. An error-prone part of the code: string conversion

moviewish = mw[i].get_text().encode('utf-8')
# String conversion, method 1
# moviewish = str(moviewish, encoding='utf-8')
# moviewish = '%r' % moviewish
# moviewish = moviewish[1:-1]
# String conversion, method 2: iterating over bytes yields integers,
# so this joins the decimal value of every byte into one string
moviewish = ''.join('%s' % byte for byte in moviewish)
for k in range(len(utf8List)):
    # Convert each glyph's bytes the same way, then substitute the digit
    utf8List[k] = ''.join('%s' % byte for byte in utf8List[k])
    moviewish = moviewish.replace(utf8List[k], maoyan_num_List[k])

Full code:

import re

import requests
from bs4 import BeautifulSoup as bs
from fontTools.ttLib import TTFont


# Scrape Maoyan box-office data
class MaoyanSpider:
    # Initialise the request headers
    def __init__(self):
        self.headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip,deflate,br",
            "Accept-Language": "zh-CN,zh;q=0.8",
            "Cache-Control": "max-age=0",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
            "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36"
        }

    # Fetch the prices
    def getNote(self):
        url = 'http://maoyan.com/cinema/24311?poi=164257570'
        host = {'Host': 'maoyan.com', 'Referer': 'http://maoyan.com/news'}
        # Merge the two dicts
        headers = {**self.headers, **host}
        # dict(self.headers.items() + host.items()) raises a TypeError in Python 3
        # Fetch the page
        r = requests.get(url, headers=headers)
        # print(r.text)
        u = r.text
        # Find the woff font URL; the literal parentheses must be escaped
        cmp = re.compile(r"url\('(//[^']*\.woff)'\)\s*format\('woff'\)")
        rst = cmp.findall(u)
        ttf = requests.get("http:" + rst[0], stream=True)
        with open("maoprice.woff", "wb") as f:
            for chunk in ttf.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
        # Parse the font files
        # baseprice.woff is the font downloaded by hand from the page
        baseFont = TTFont(r'C:\Users\nanafighting\Desktop\baseprice.woff')
        maoyanFont = TTFont('maoprice.woff')
        maoyan_unicode_List = maoyanFont.getGlyphOrder()
        maoyan_num_List = []
        baseNumList = ['.', '6', '4', '7', '5', '2', '8', '0', '1', '9', '3']
        baseUniCode = ['x', 'uniF76E', 'uniEACB', 'uniE8D1', 'uniE737', 'uniE9B7',
                       'uniF098', 'uniF4DC', 'uniF85E', 'uniE2F1', 'uniEE4E']
        for i in range(1, 12):
            maoyanGlyph = maoyanFont['glyf'][maoyan_unicode_List[i]]
            for j in range(11):
                baseGlyph = baseFont['glyf'][baseUniCode[j]]
                if maoyanGlyph == baseGlyph:
                    maoyan_num_List.append(baseNumList[j])
                    break
        maoyan_unicode_List[1] = 'uni0078'
        utf8List = [chr(int(uni[3:], 16)).encode("utf-8") for uni in maoyan_unicode_List[1:]]
        # Parse the page content
        soup = bs(u, "html.parser")
        index = soup.find_all('div', {'class': 'show-list'})
        print('---------------Prices-----------------')
        for n in range(len(index)):
            # These lookups search the whole document, not only block n
            mn = soup.find_all('h3', {'class': 'movie-name'})
            ting = soup.find_all('span', {'class': 'hall'})
            mt = soup.find_all('span', {'class': 'begin-time'})
            mw = soup.find_all('span', {'class': 'stonefont'})
            for i in range(len(mn)):
                moviename = mn[i].get_text()
                film_ting = ting[i].get_text()
                movietime = mt[i].get_text()
                moviewish = mw[i].get_text().encode('utf-8')
                # String conversion: join the decimal value of every byte
                moviewish = ''.join('%s' % byte for byte in moviewish)
                for k in range(len(utf8List)):
                    # Convert each glyph's bytes the same way, then substitute the digit
                    utf8List[k] = ''.join('%s' % byte for byte in utf8List[k])
                    moviewish = moviewish.replace(utf8List[k], maoyan_num_List[k])
                print(moviename, film_ting, movietime, moviewish)


spider = MaoyanSpider()
spider.getNote()
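The byte-level string conversion in section 3 can be sidestepped by keeping everything as `str`. The following is a minimal sketch of that alternative, not the author's method. It assumes each obfuscated digit is a single private-use character whose code point is the hex suffix of its `uniXXXX` glyph name. The glyph names, the sample price text, and the `build_table` helper are hypothetical.

```python
# Hypothetical mapping from glyph names in the freshly downloaded font to digits,
# for example the result of comparing glyph outlines against the base font.
glyph_to_digit = {"uniE111": "4", "uniE222": "6", "uniE333": "0"}


def build_table(mapping):
    """Turn {'uniE111': '4'} into a str.translate table {0xE111: '4'}."""
    return {int(name[3:], 16): digit for name, digit in mapping.items()}


table = build_table(glyph_to_digit)

# Price text as it might come out of get_text(): private-use characters plus a real dot.
raw_price = "\ue111\ue222.\ue333"
print(raw_price.translate(table))  # 46.0
```

Because `str.translate` works on code points, no encoding to bytes is needed and characters outside the table, such as the dot, pass through unchanged.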

Output:
