How to get the page number of a bookmark



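As @theta pointed out, "split a pdf based on outline" has the code needed to extract page numbers. If you find that complicated, I copied the part of that code that maps page IDs to page numbers and turned it into a function. Here is a working example that prints the page number of bookmark o[0]: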

    from PyPDF2 import PdfFileReader

    def _setup_page_id_to_num(pdf, pages=None, _result=None, _num_pages=None):
        if _result is None:
            _result = {}
        if pages is None:
            _num_pages = []
            pages = pdf.trailer["/Root"].getObject()["/Pages"].getObject()
        t = pages["/Type"]
        if t == "/Pages":
            for page in pages["/Kids"]:
                _result[page.idnum] = len(_num_pages)
                _setup_page_id_to_num(pdf, page.getObject(), _result, _num_pages)
        elif t == "/Page":
            _num_pages.append(1)
        return _result

    # main
    f = open('document.pdf','rb')
    p = PdfFileReader(f)
    # map page ids to page numbers
    pg_id_num_map = _setup_page_id_to_num(p)
    o = p.getOutlines()
    pg_num = pg_id_num_map[o[0].page.idnum] + 1
    print(pg_num)
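To see what the mapping function builds without opening a PDF, here is a minimal sketch of the same depth-first traversal over stand-in objects. `FakeRef` and the hand-built tree are hypothetical and only imitate the `idnum` attribute and `getObject()` method used in the answer; they are not part of PyPDF2.

```python
class FakeRef:
    """Hypothetical stand-in for an indirect reference (illustration only)."""

    def __init__(self, idnum, node):
        self.idnum = idnum
        self._node = node

    def getObject(self):
        return self._node


def page_id_to_num(node, result=None, pages_seen=None):
    """Walk the page tree depth-first and map object id -> zero-based page index."""
    if result is None:
        result = {}
    if pages_seen is None:
        pages_seen = []
    if node["/Type"] == "/Pages":
        for kid in node["/Kids"]:
            # The count of leaf pages seen so far is the index of the next page.
            result[kid.idnum] = len(pages_seen)
            page_id_to_num(kid.getObject(), result, pages_seen)
    elif node["/Type"] == "/Page":
        pages_seen.append(1)
    return result


# A tree with three pages; the last two sit under a nested /Pages node.
leaf = {"/Type": "/Page"}
nested = {"/Type": "/Pages", "/Kids": [FakeRef(12, leaf), FakeRef(13, leaf)]}
root = {"/Type": "/Pages", "/Kids": [FakeRef(10, leaf), FakeRef(11, nested)]}

mapping = page_id_to_num(root)
print(mapping)
```

The indices are zero-based, which is why the answer adds 1 before printing. Intermediate `/Pages` nodes (id 11 here) also receive an entry, pointing at the first page beneath them.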

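@theta this may be too late for you, but it might help others :) By the way, this is my first post on Stack Overflow, so please excuse me if I have not followed the usual format.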

To extend this further: if you want the exact position of the bookmark on the page, this will make your job easier:

    from PyPDF2 import PdfFileReader
    import PyPDF2 as pyPdf

    def _setup_page_id_to_num(pdf, pages=None, _result=None, _num_pages=None):
        if _result is None:
            _result = {}
        if pages is None:
            _num_pages = []
            pages = pdf.trailer["/Root"].getObject()["/Pages"].getObject()
        t = pages["/Type"]
        if t == "/Pages":
            for page in pages["/Kids"]:
                _result[page.idnum] = len(_num_pages)
                _setup_page_id_to_num(pdf, page.getObject(), _result, _num_pages)
        elif t == "/Page":
            _num_pages.append(1)
        return _result

    def outlines_pg_zoom_info(outlines, pg_id_num_map, result=None):
        if result is None:
            result = dict()
        if type(outlines) == list:
            for outline in outlines:
                result = outlines_pg_zoom_info(outline, pg_id_num_map, result)
        elif type(outlines) == pyPdf.pdf.Destination:
            title = outlines['/Title']
            result[title.split()[0]] = dict(title=outlines['/Title'], top=outlines['/Top'],
                                            left=outlines['/Left'], page=(pg_id_num_map[outlines.page.idnum]+1))
        return result

    # main
    pdf_name = 'document.pdf'
    f = open(pdf_name,'rb')
    pdf = PdfFileReader(f)
    # map page ids to page numbers
    pg_id_num_map = _setup_page_id_to_num(pdf)
    outlines = pdf.getOutlines()
    bookmarks_info = outlines_pg_zoom_info(outlines, pg_id_num_map)
    print(bookmarks_info)
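Once you have a dictionary of this shape, one thing you might do with it is list the bookmarks in reading order. The sketch below assumes every entry has a numeric `top`; the helper name `reading_order` and the sample values are made up for illustration and do not come from a real document.

```python
def reading_order(bookmarks_info):
    """Return bookmark entries sorted by page, then top-to-bottom on the page."""
    # PDF coordinates start at the bottom-left corner, so a larger 'top'
    # value is higher on the page; negate it to sort top-to-bottom.
    return sorted(bookmarks_info.values(), key=lambda b: (b["page"], -b["top"]))


# Made-up sample shaped like the dictionary printed by the code above.
sample = {
    "1.2": {"title": "1.2 Scope", "top": 310.0, "left": 72.0, "page": 3},
    "2.1": {"title": "2.1 Setup", "top": 720.0, "left": 72.0, "page": 5},
    "1.1": {"title": "1.1 Introduction", "top": 700.0, "left": 72.0, "page": 3},
}

ordered = [b["title"] for b in reading_order(sample)]
print(ordered)
```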

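Note: my bookmarks start with section numbers (e.g. 1.1 Introduction), and I am keying the bookmark info by section number. If your bookmarks are different, modify this part of the code: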

        elif type(outlines) == pyPdf.pdf.Destination:
            title = outlines['/Title']
            result[title.split()[0]] = dict(title=outlines['/Title'], top=outlines['/Top'],
                                            left=outlines['/Left'], page=(pg_id_num_map[outlines.page.idnum]+1))
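One way to make that part adjustable is to pass the key choice in as a function. This is only a sketch: `collect_bookmarks` and `key_func` are hypothetical names, plain dicts stand in for the Destination objects, and the page lookup is left out so the example runs without a PDF.

```python
def collect_bookmarks(outlines, key_func, result=None):
    """Flatten a nested outline list into {key_func(title): entry}."""
    if result is None:
        result = {}
    if isinstance(outlines, list):
        for item in outlines:
            collect_bookmarks(item, key_func, result)
    else:
        title = outlines["/Title"]
        # .get() tolerates entries that carry no /Top or /Left.
        result[key_func(title)] = {
            "title": title,
            "top": outlines.get("/Top"),
            "left": outlines.get("/Left"),
        }
    return result


# Stand-in outline: plain dicts in place of real bookmark objects.
outline = [
    {"/Title": "1.1 Introduction", "/Top": 700, "/Left": 72},
    [{"/Title": "Appendix", "/Top": 650, "/Left": 72}],
]

# Key by the first word (the section number), as in the answer.
by_section = collect_bookmarks(outline, key_func=lambda t: t.split()[0])
# Key by the full title, for bookmarks without section numbers.
by_title = collect_bookmarks(outline, key_func=lambda t: t)
print(sorted(by_section))
print(sorted(by_title))
```

Keying by the first word can collide when two titles start with the same word, since the later entry overwrites the earlier one; keying by the full title avoids that unless titles repeat.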



Original URL: https://www.outofmemory.cn/zaji/5674507.html
