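Exporting a Specific Section of PDF Documents to TXT

Requirement

Export the content of one specific section from more than 3,000 PDF documents as TXT files (the text in the PDFs is selectable and can be copied).

Solution

Exporting a PDF

I looked into how Python can work with PDF documents. Three libraries come up most often: PyPDF2, pdfminer, and pdfplumber.

PyPDF2 is a pure-Python PDF library. It can read document information (title, author, and so on), write, split, and merge PDF documents, and also add watermarks and encrypt or decrypt them.
pdfplumber is a module built on pdfminer.six. It processes a PDF page by page and can extract page text, tables, and so on.
pdfminer has a steeper learning curve, but for complicated cases you end up needing it anyway. Among the open-source modules it probably has the most complete PDF support.

Judging by the examples online, pdfminer is the most widely used, so I copied some of my earlier code and only renamed a few variables: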

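# Parse a PDF file and append its text to a .txt file
import os

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser


def parse(pdf_path):
    with open(r'C:\Users\Desktop\\' + pdf_path, 'rb') as pdf_file:  # open in binary read mode
        # Create a PDF parser from the file object
        pdf_parser = PDFParser(pdf_file)
        # Create the PDF document
        pdf_doc = PDFDocument(pdf_parser)
        # Skip documents that do not allow text extraction
        if pdf_doc.is_extractable:
            # Resource manager that holds shared resources
            pdf_rm = PDFResourceManager()
            # Layout parameters and the aggregator device
            pdf_lap = LAParams()
            pdf_pa = PDFPageAggregator(pdf_rm, laparams=pdf_lap)
            # Page interpreter
            interpreter = PDFPageInterpreter(pdf_rm, pdf_pa)
            # Process the document one page at a time
            for page in PDFPage.create_pages(pdf_doc):
                interpreter.process_page(page)
                # LTPage layout object for this page
                layout = pdf_pa.get_result()
                for x in layout:
                    if isinstance(x, LTTextBoxHorizontal):  # horizontal text box
                        # Append the text to "<pdf name>.txt" in the working directory
                        with open(os.path.basename(pdf_path) + '.txt', 'a', encoding='utf-8') as f:
                            results = x.get_text()
                            f.write(results)
                            f.write('\n')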

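Running it showed that it is slow: a single page takes a long time. So exporting everything and trimming afterwards is not an option. The target pages have to be located first and only those exported. The pages can be located either through the table of contents, or by scanning while exporting and stopping as soon as the required content has been exported. Finally, with more than 3,000 documents, starting over because the computer shut down halfway would be painful, so the export state also needs to be saved along the way.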
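The idea of saving the export state so that a run can resume can be sketched on its own, without any PDF library. This is a minimal sketch under stated assumptions: the helper names `load_state`, `save_state`, and `process_all` are hypothetical, and the atomic write through a temporary file is an extra precaution that the script in this post does not take.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state dict, or an empty dict if no state file exists."""
    if not os.path.exists(path):
        return {}
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)


def save_state(state, path):
    """Write the state to a temporary file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    with os.fdopen(fd, 'w', encoding='utf-8') as f:
        json.dump(state, f, ensure_ascii=False)
    os.replace(tmp_path, path)  # a crash mid-write leaves the old file intact


def process_all(file_names, state_path, export):
    """Call export(name) for every file not yet marked as exported.

    Returns the names processed in this run.
    """
    state = load_state(state_path)
    processed = []
    for name in file_names:
        if state.get(name, {}).get('is_export'):
            continue  # already exported in an earlier run
        export(name)
        state[name] = {'is_export': True}
        save_state(state, state_path)  # checkpoint after every file
        processed.append(name)
    return processed
```

Calling `process_all` a second time with the same state file skips everything that finished in the first run, which is the behaviour wanted after an unexpected shutdown.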

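Fortunately, every document has a table of contents, so we can parse it to find the target pages.

Getting the target pages from the outline

I searched Baidu for how to get specific pages of a PDF and how to read a PDF's table of contents in Python. The answers all used PyPDF2, so I looked into it. According to its official documentation, it can read the outline (bookmarks) and return each entry together with its page number.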

for index, file_path in enumerate(files_list):
    start_page_number = 0  # first page of the section
    is_get_page_number_range = False
    info = update_file_info(file_path=file_path)
    with open(file_path, 'rb') as pdf_file:  # open the PDF
        pdf = PdfFileReader(pdf_file)  # load the PDF
        if pdf.isEncrypted:
            pdf.decrypt('')  # decrypt with an empty password
        end_page_number = pdf.getNumPages()  # total number of pages
        info = update_file_info(info, page_count=end_page_number)  # store the page count
        pdf_directory = pdf.getOutlines()  # outline (bookmarks)
        is_have_start_page_number = False
        for destination in pdf_directory:
            if isinstance(destination, dict):  # nested lists (sub-entries) are skipped
                if is_have_start_page_number:
                    # The entry after the match marks the end of the section
                    end_page_number = pdf.getDestinationPageNumber(destination)
                    is_get_page_number_range = True
                    break
                title = destination.get('/Title')
                if key_word in str(title):
                    # Keyword found in the outline
                    start_page_number = pdf.getDestinationPageNumber(destination)
                    is_have_start_page_number = True
                    continue
    if is_get_page_number_range:
        info = update_file_info(info, start_page_number=start_page_number, end_page_number=end_page_number,
                                is_have_directory=True)
        res = "获取页码成功"  # "page range found"
    else:
        info = update_file_info(info, is_have_directory=False)
        res = "获取页码失败"  # "page range not found"
    print("扫描进度 : {:.2f}%, 文件 : {}".format(index / len(files_list) * 100, os.path.basename(file_path)), res, ':',
          '[', start_page_number, ',', end_page_number, ']', end=end)

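The two important functions are getOutlines() and getDestinationPageNumber(destination): the first returns the outline entries, and the second returns the page number that an outline entry points to.

That gives us the target page range. Documents whose table of contents cannot be clicked to jump in a PDF viewer (that is, documents without an outline) cannot be scanned this way and need a different approach.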
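The outline lookup can be illustrated without PyPDF2 by treating the outline as a nested list of dicts, which is the shape the loop above iterates over (dict entries, with sub-entries in nested lists). This is only a sketch: `walk_outline`, `find_section_range`, the `page_of` callback, and the sample outline with its `'page'` key are all hypothetical, and in real code `page_of` would wrap `pdf.getDestinationPageNumber`. Unlike the loop above, this version also returns a range when the matching section is the last outline entry.

```python
def walk_outline(outline, level=0):
    """Yield (level, entry) for every dict entry in a nested outline list."""
    for item in outline:
        if isinstance(item, list):
            yield from walk_outline(item, level + 1)
        elif isinstance(item, dict):
            yield level, item


def find_section_range(outline, key_word, page_of, page_count):
    """Return (start, end) pages for the first entry whose title contains key_word.

    end is the page of the next entry at the same or a shallower level,
    or page_count if the section is the last one. Returns None if no
    title contains key_word.
    """
    start = None
    start_level = None
    for level, entry in walk_outline(outline):
        if start is None:
            if key_word in str(entry.get('/Title')):
                start = page_of(entry)
                start_level = level
        elif level <= start_level:
            return start, page_of(entry)
    if start is None:
        return None
    return start, page_count


# Made-up outline for illustration; the 'page' key stands in for the
# page number a real library would resolve from the destination.
outline = [
    {'/Title': '第一节 重要提示', 'page': 1},
    {'/Title': '第四节 经营情况讨论与分析', 'page': 18},
    [{'/Title': '一、概述', 'page': 18}, {'/Title': '二、主营业务分析', 'page': 20}],
    {'/Title': '第五节 重要事项', 'page': 38},
]
section = find_section_range(outline, '经营情况讨论与分析', lambda e: e['page'], 200)
```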

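Exporting

I first tried to export the text with PyPDF2 itself, but the exported text was garbled. It appeared to be in some Unicode encoding, and I have not found a way to convert it. People online say the library is old and handles Chinese poorly, and that it is usually paired with pdfplumber. pdfplumber seemed to have OCR capability (unconfirmed; it is built on pdfminer.six, which reads embedded text rather than images). Installing it required an image library, and after a long time without getting that installed I gave up on pdfplumber. On the other hand, I do not know how to read the outline with pdfminer, so the only option left was to use the two libraries together.

First, PyPDF2 scans the outline, which is very fast, and the resulting information is saved to a JSON file. Then pdfminer extracts the text of the corresponding pages. For documents without a clickable outline, the pages are analyzed one by one: if the section is found, only the required part is saved; if it is never found, the TXT export of the whole document is saved.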

with open(path, 'rb') as pdf_file:  # open the PDF
    is_have_target_page = info.get('is_have_directory')
    start_page_number = 0
    end_page_number = 0
    page_count = info.get('page_count')
    if is_have_target_page:
        start_page_number = info.get('start_page_number')
        if start_page_number is None:
            start_page_number = 0
            is_have_target_page = False
        end_page_number = info.get('end_page_number')
        if end_page_number is None:
            is_have_target_page = False
            end_page_number = info.get('page_count')
    else:
        is_have_target_page = False
    pdf_parse = PDFParser(pdf_file)
    pdf_doc = PDFDocument(pdf_parse)
    if pdf_doc.is_extractable:
        pdf_rm = PDFResourceManager(caching=True)
        pdf_lap = LAParams()
        pdf_pa = PDFPageAggregator(pdf_rm, laparams=pdf_lap)
        pdf_pi = PDFPageInterpreter(pdf_rm, pdf_pa)
        if is_have_target_page:
            # The page range is known from the outline: read only those pages
            page_set = set()
            for i in range(start_page_number, end_page_number):
                page_set.add(i)
            pdf_page = PDFPage.get_pages(pdf_file, pagenos=page_set, password=b'', caching=True)
            print('读取文本->>>')
            for index, page in enumerate(pdf_page):
                print("部分 : 当前文档进度 : {}/{}".format(index, len(page_set)), end=end)
                pdf_pi.process_page(page)
                layout = pdf_pa.get_result()
                for x in layout:
                    if isinstance(x, LTTextBoxHorizontal):  # text box
                        text += x.get_text() + '\n'
        else:
            # No usable outline: scan page by page and look for the section headings
            pdf_page = PDFPage.create_pages(pdf_doc)
            print('读取文本->>>')
            is_find_start_page = False
            text_cache = ""
            for index, page in enumerate(pdf_page):
                print("扫描 : 当前文档进度 : {}/{}, 找到起始位置 : {}".format(index, page_count, is_find_start_page),
                      end=end)
                pdf_pi.process_page(page)
                layout = pdf_pa.get_result()
                page_text = ''
                for x in layout:
                    if isinstance(x, LTTextBoxHorizontal):  # text box
                        page_text += x.get_text() + '\n'
                text_cache += page_text
                if re.search(r'第.节\s*经营情况讨论与分析\s*一', page_text):  # start of the target section
                    text += page_text  # keep everything from this page on
                    is_find_start_page = True
                    info = update_file_info(info, start_page_number=index)
                    continue
                if is_find_start_page:
                    text += page_text
                    if re.search(r'第.节\s*.*\s*一', page_text):  # start of the next section
                        info = update_file_info(info, end_page_number=index)
                        break
            if text == '':
                text = text_cache  # section never found: fall back to the whole document
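The page-by-page fallback is easier to follow when separated from pdfminer. The sketch below applies the same two regular expressions to a plain list of page strings. The function name `collect_section` and the sample pages are made up for illustration, and real page text from a PDF will be messier than this.

```python
import re

# Same patterns as in the code above: a section heading followed by "一",
# the numbering of its first subsection.
START_RE = re.compile(r'第.节\s*经营情况讨论与分析\s*一')
NEXT_RE = re.compile(r'第.节\s*.*\s*一')


def collect_section(page_texts):
    """Return (text, start_index, end_index) for the target section.

    page_texts holds one string per page. If the start heading is never
    seen, the whole document is returned and both indices are None.
    """
    collected = []
    start = end = None
    for index, page_text in enumerate(page_texts):
        if start is None:
            if START_RE.search(page_text):
                start = index
                collected.append(page_text)
            continue
        collected.append(page_text)
        if NEXT_RE.search(page_text):  # the next section begins on this page
            end = index
            break
    if start is None:
        return ''.join(page_texts), None, None
    return ''.join(collected), start, end


# Made-up pages: a contents page, the target section, and what follows it.
pages = [
    '目录\n第四节 经营情况讨论与分析 ...... 13\n',
    '第四节 经营情况讨论与分析\n一、概述\n报告期内...\n',
    '二、主营业务分析\n',
    '第五节 重要事项\n一、公司普通股利润分配\n',
    '第六节 股份变动\n',
]
text, start, end = collect_section(pages)
```

Requiring "一" after the heading is what keeps the contents page from matching: there the title is followed by leader dots and a page number rather than by the first subsection.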

Saving

Saving is simple: create a new file and write the text into it.

def save_text_file(file_name, txt):
    """Save text as UTF-8 to the "output" directory under the current working
    directory, overwriting any existing file.

    :param file_name: file name
    :param txt: file content
    :return: None
    """
    if not file_name.endswith('.txt'):
        file_name += '.txt'  # add the extension
    file_path = os.path.join(os.getcwd(), 'output')
    if not os.path.exists(file_path):
        os.mkdir(file_path)  # create the directory
    with open(os.path.join(file_path, file_name), 'w', encoding='utf-8') as txt_file:
        txt_file.write(txt)  # write the file

Complete script

# coding:utf-8
# @Time : 2021/11/5 11:37
# @Author : minuy
# @File : pdf_to_txt.py
# @Version : v1.1 changed the search regex, added a file-name suffix, removed the date suffix,
#            fixed nothing being saved when the scan finds no match, fixed the first scanned page being lost
import os
import json
import re

# PdfFileReader and the camelCase methods used below belong to the PyPDF2 1.x API
from PyPDF2 import PdfFileReader
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBoxHorizontal
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

# Line terminator for progress output
end = '\n'


def dispose(root_path, name_suffix=None, is_recover=False, load_cache=True, cache_info_path='pdf_cache.json'):
    """Process the PDF data.

    :param name_suffix: suffix for the output file names
    :param root_path: root directory to process
    :param is_recover: whether to overwrite files that were already exported
    :param load_cache: whether to use the cache (no rescan)
    :param cache_info_path: where the cache is stored
    :return: None
    """
    if load_cache:
        if not os.path.exists(cache_info_path):
            load_cache = False
            print('没有找到缓存数据......')

    if load_cache:
        # Cached document info
        pdf_info = load_cache_info(cache_info_path)
    else:
        files = get_files_list(root_path)
        print('开始扫描文档.......')
        pdf_info = scan_pdf_directory(files, '经营情况讨论与分析')
        save_cache_info(pdf_info, cache_info_path)

    files = []
    for key, val in pdf_info.items():
        files.append(val.get('file_path'))

    if name_suffix is None:
        name_suffix = ""

    count = 0
    print('开始提取数据.......')
    for index, path in enumerate(files):
        print("处理进度 = {:.2f}%, 文件 = {}".format(index / len(pdf_info) * 100, os.path.basename(path)), end=end)
        info = pdf_info.get(str(index))
        if info.get('is_export') and (not is_recover):
            continue  # already exported and not overwriting: go to the next file
        text, info = parse_pdf(info)  # extract
        text_file_name = info.get('stock_code') + '_' + str(name_suffix)
        save_text_file(text_file_name, text)  # save
        info = update_file_info(info, text_length=len(text), is_export=True, output_length=len(text),
                                text_file_name=text_file_name)
        pdf_info.update({str(index): info})  # update the cached info
        save_cache_info(pdf_info, cache_info_path)
        count += 1
        if info.get('is_have_directory'):
            res = '有'
        else:
            res = '无'
        print('> 已保存文件，文件名：{}，长度：{}，目录：{}，本次运行处理文件个数：{}'.format(text_file_name, len(text), res, count))


def save_cache_info(pdf_info, info_path='pdf_cache.json'):
    """Save the processing info.

    :param pdf_info: processing info
    :param info_path: where to store it
    :return: None
    """
    with open(info_path, 'w') as f:
        json_str = json.dumps(pdf_info)
        f.write(json_str)


def load_cache_info(info_path):
    """Load the processing-info cache file.

    :param info_path: location of the cache file
    :return: cached PDF info
    """
    with open(info_path, 'r') as f:
        json_str = f.read()
        pdf_info_cache = json.loads(json_str)
    return pdf_info_cache


def parse_pdf(info: dict):
    """Parse one PDF document.

    :param info: document info
    :return: text content, document info (stock code, date, start page, end page)
    """
    path = info.get('file_path')
    if path is None:
        raise ValueError('不存在文件路径')
    file = os.path.basename(path)  # file name
    stock_code = re.search(r'\d{6}', file).group(0)  # stock code from the file name
    file_date = re.search(r'\d{4}-\d{1,2}-\d{1,2}', file).group(0)  # date from the file name
    info = update_file_info(info, stock_code=stock_code, date=file_date)  # update the info
    text = ''  # text buffer
    with open(path, 'rb') as pdf_file:  # open the PDF
        is_have_target_page = info.get('is_have_directory')
        start_page_number = 0
        end_page_number = 0
        page_count = info.get('page_count')
        if is_have_target_page:
            start_page_number = info.get('start_page_number')
            if start_page_number is None:
                start_page_number = 0
                is_have_target_page = False
            end_page_number = info.get('end_page_number')
            if end_page_number is None:
                is_have_target_page = False
                end_page_number = info.get('page_count')
        else:
            is_have_target_page = False
        pdf_parse = PDFParser(pdf_file)
        pdf_doc = PDFDocument(pdf_parse)
        if pdf_doc.is_extractable:
            pdf_rm = PDFResourceManager(caching=True)
            pdf_lap = LAParams()
            pdf_pa = PDFPageAggregator(pdf_rm, laparams=pdf_lap)
            pdf_pi = PDFPageInterpreter(pdf_rm, pdf_pa)
            if is_have_target_page:
                # The page range is known from the outline: read only those pages
                page_set = set()
                for i in range(start_page_number, end_page_number):
                    page_set.add(i)
                pdf_page = PDFPage.get_pages(pdf_file, pagenos=page_set, password=b'', caching=True)
                print('读取文本->>>')
                for index, page in enumerate(pdf_page):
                    print("部分 : 当前文档进度 : {}/{}".format(index, len(page_set)), end=end)
                    pdf_pi.process_page(page)
                    layout = pdf_pa.get_result()
                    for x in layout:
                        if isinstance(x, LTTextBoxHorizontal):  # text box
                            text += x.get_text() + '\n'
            else:
                # No usable outline: scan page by page and look for the section headings
                pdf_page = PDFPage.create_pages(pdf_doc)
                print('读取文本->>>')
                is_find_start_page = False
                text_cache = ""
                for index, page in enumerate(pdf_page):
                    print("扫描 : 当前文档进度 : {}/{}, 找到起始位置 : {}".format(index, page_count, is_find_start_page),
                          end=end)
                    pdf_pi.process_page(page)
                    layout = pdf_pa.get_result()
                    page_text = ''
                    for x in layout:
                        if isinstance(x, LTTextBoxHorizontal):  # text box
                            page_text += x.get_text() + '\n'
                    text_cache += page_text
                    if re.search(r'第.节\s*经营情况讨论与分析\s*一', page_text):  # start of the target section
                        text += page_text  # keep everything from this page on
                        is_find_start_page = True
                        info = update_file_info(info, start_page_number=index)
                        continue
                    if is_find_start_page:
                        text += page_text
                        if re.search(r'第.节\s*.*\s*一', page_text):  # start of the next section
                            info = update_file_info(info, end_page_number=index)
                            break
                if text == '':
                    text = text_cache  # section never found: fall back to the whole document
    return text, info


def save_text_file(file_name, txt):
    """Save text as UTF-8 to the "output" directory under the current working
    directory, overwriting any existing file.

    :param file_name: file name
    :param txt: file content
    :return: None
    """
    if not file_name.endswith('.txt'):
        file_name += '.txt'  # add the extension
    file_path = os.path.join(os.getcwd(), 'output')
    if not os.path.exists(file_path):
        os.mkdir(file_path)  # create the directory
    with open(os.path.join(file_path, file_name), 'w', encoding='utf-8') as txt_file:
        txt_file.write(txt)  # write the file


def scan_pdf_directory(files_list, key_word):
    """Scan the outline of each PDF and record the page count, whether an
    outline entry matched and, if so, the start and end pages.
    key_word is matched against outline titles; without a match the range
    covers the whole document.

    :param files_list: files to scan
    :param key_word: outline keyword
    :return: dict with one entry per file, keyed by a unique ID
    """
    pdf_info_dict = {}
    for index, file_path in enumerate(files_list):
        start_page_number = 0  # first page of the section
        is_get_page_number_range = False
        info = update_file_info(file_path=file_path)
        with open(file_path, 'rb') as pdf_file:  # open the PDF
            pdf = PdfFileReader(pdf_file)  # load the PDF
            if pdf.isEncrypted:
                pdf.decrypt('')  # decrypt with an empty password
            end_page_number = pdf.getNumPages()  # total number of pages
            info = update_file_info(info, page_count=end_page_number)  # store the page count
            pdf_directory = pdf.getOutlines()  # outline (bookmarks)
            is_have_start_page_number = False
            for destination in pdf_directory:
                if isinstance(destination, dict):  # nested lists (sub-entries) are skipped
                    if is_have_start_page_number:
                        # The entry after the match marks the end of the section
                        end_page_number = pdf.getDestinationPageNumber(destination)
                        is_get_page_number_range = True
                        break
                    title = destination.get('/Title')
                    if key_word in str(title):
                        # Keyword found in the outline
                        start_page_number = pdf.getDestinationPageNumber(destination)
                        is_have_start_page_number = True
                        continue
        if is_get_page_number_range:
            info = update_file_info(info, start_page_number=start_page_number, end_page_number=end_page_number,
                                    is_have_directory=True)
            res = "获取页码成功"  # "page range found"
        else:
            info = update_file_info(info, is_have_directory=False)
            res = "获取页码失败"  # "page range not found"
        print("扫描进度 : {:.2f}%, 文件 : {}".format(index / len(files_list) * 100, os.path.basename(file_path)), res, ':',
              '[', start_page_number, ',', end_page_number, ']', end=end)
        pdf_info_dict.update({str(index): info})
    return pdf_info_dict


def update_file_info(info=None, file_path=None, start_page_number=None, end_page_number=None, page_count=None,
                     output_length=None, is_have_directory=None, is_export=None, stock_code=None, date=None,
                     text_file_name=None, text_length=None):
    """Update the info dict; a new dict is created when info is None.

    :param text_length: length of the exported text
    :param page_count: total number of pages
    :param stock_code: stock code
    :param date: date
    :param text_file_name: name of the exported text file
    :param info: dict to update
    :param file_path: file path
    :param start_page_number: start page
    :param end_page_number: end page
    :param output_length: output length
    :param is_have_directory: whether an outline entry was found
    :param is_export: whether the file has been exported
    :return: the updated info
    """
    if info is None:
        info = {
            'file_path': None,
            'start_page_number': None,
            'end_page_number': None,
            'output_length': None,
            'is_have_directory': None,
            'is_export': None,
            'stock_code': None,
            'date': None,
            'text_file_name': None,
            'page_count': None,
            'text_length': None
        }
    if not isinstance(info, dict):
        raise ValueError("传入的值info必须是空或者是字典！")

    updates = {
        'file_path': file_path,
        'start_page_number': start_page_number,
        'end_page_number': end_page_number,
        'output_length': output_length,
        'is_have_directory': is_have_directory,
        'is_export': is_export,
        'stock_code': stock_code,
        'date': date,
        'text_file_name': text_file_name,
        'page_count': page_count,
        'text_length': text_length,
    }
    for key, value in updates.items():
        # Compare with None so that 0 and False are stored too
        # (a plain truth test would silently drop start_page_number=0)
        if value is not None:
            info[key] = value
    return info


def get_files_list(path):
    """Collect the paths of all PDF files under path, including subdirectories.

    :param path: root path to search
    :return: list of PDF file paths
    """
    files_list = []
    for root, dirs, files in os.walk(path):  # walk the directory tree
        for file in files:  # every file
            file_path = os.path.join(root, file)  # build the path
            if file_path.endswith(".pdf"):  # PDF files only
                files_list.append(file_path)
    return files_list


if __name__ == '__main__':
    # root directory to scan, file-name suffix, overwrite, use cached info
    dispose(r'D:\Project\pdf_ouput', 2019, True, False)

Output

D:\Project\pdf_ouput\venv\Scripts\python.exe D:/Project/pdf_ouput/pdf_to_txt.py
开始扫描文档.......
扫描进度 : 0.00%, 文件 : 000045深纺织A：深纺织A2019年年度报告_2020-03-14.pdf 获取页码失败 : [ 0 , 182 ]
扫描进度 : 33.33%, 文件 : 002030达安基因：达安基因2019年年度报告_2020-04-30.pdf 获取页码成功 : [ 18 , 38 ]
扫描进度 : 66.67%, 文件 : 102030达安基因：达安基因2019年年度报告_2020-04-21.pdf 获取页码失败 : [ 0 , 283 ]
开始提取数据.......
处理进度 = 0.00%, 文件 = 000045深纺织A：深纺织A2019年年度报告_2020-03-14.pdf
读取文本->>>
扫描 : 当前文档进度 : 0/182, 找到起始位置 : False
扫描 : 当前文档进度 : 1/182, 找到起始位置 : False
扫描 : 当前文档进度 : 2/182, 找到起始位置 : False
扫描 : 当前文档进度 : 3/182, 找到起始位置 : False
扫描 : 当前文档进度 : 4/182, 找到起始位置 : False
扫描 : 当前文档进度 : 5/182, 找到起始位置 : False
扫描 : 当前文档进度 : 6/182, 找到起始位置 : False
扫描 : 当前文档进度 : 7/182, 找到起始位置 : False
扫描 : 当前文档进度 : 8/182, 找到起始位置 : False
扫描 : 当前文档进度 : 9/182, 找到起始位置 : False
扫描 : 当前文档进度 : 10/182, 找到起始位置 : False
扫描 : 当前文档进度 : 11/182, 找到起始位置 : False
扫描 : 当前文档进度 : 12/182, 找到起始位置 : False
扫描 : 当前文档进度 : 13/182, 找到起始位置 : True
扫描 : 当前文档进度 : 14/182, 找到起始位置 : True
扫描 : 当前文档进度 : 15/182, 找到起始位置 : True
扫描 : 当前文档进度 : 16/182, 找到起始位置 : True
扫描 : 当前文档进度 : 17/182, 找到起始位置 : True
扫描 : 当前文档进度 : 18/182, 找到起始位置 : True
扫描 : 当前文档进度 : 19/182, 找到起始位置 : True
扫描 : 当前文档进度 : 20/182, 找到起始位置 : True
扫描 : 当前文档进度 : 21/182, 找到起始位置 : True
扫描 : 当前文档进度 : 22/182, 找到起始位置 : True
扫描 : 当前文档进度 : 23/182, 找到起始位置 : True
扫描 : 当前文档进度 : 24/182, 找到起始位置 : True
扫描 : 当前文档进度 : 25/182, 找到起始位置 : True
扫描 : 当前文档进度 : 26/182, 找到起始位置 : True
扫描 : 当前文档进度 : 27/182, 找到起始位置 : True
扫描 : 当前文档进度 : 28/182, 找到起始位置 : True
> 已保存文件，文件名：000045_2019，长度：21424，目录：无，本次运行处理文件个数：1
处理进度 = 33.33%, 文件 = 002030达安基因：达安基因2019年年度报告_2020-04-30.pdf
读取文本->>>
部分 : 当前文档进度 : 0/20
部分 : 当前文档进度 : 1/20
部分 : 当前文档进度 : 2/20
部分 : 当前文档进度 : 3/20
部分 : 当前文档进度 : 4/20
部分 : 当前文档进度 : 5/20
部分 : 当前文档进度 : 6/20
部分 : 当前文档进度 : 7/20
部分 : 当前文档进度 : 8/20
部分 : 当前文档进度 : 9/20
部分 : 当前文档进度 : 10/20
部分 : 当前文档进度 : 11/20
部分 : 当前文档进度 : 12/20
部分 : 当前文档进度 : 13/20
部分 : 当前文档进度 : 14/20
部分 : 当前文档进度 : 15/20
部分 : 当前文档进度 : 16/20
部分 : 当前文档进度 : 17/20
部分 : 当前文档进度 : 18/20
部分 : 当前文档进度 : 19/20
> 已保存文件，文件名：002030_2019，长度：17705，目录：有，本次运行处理文件个数：2
处理进度 = 66.67%, 文件 = 102030达安基因：达安基因2019年年度报告_2020-04-21.pdf
读取文本->>>
扫描 : 当前文档进度 : 0/283, 找到起始位置 : False
扫描 : 当前文档进度 : 1/283, 找到起始位置 : False
扫描 : 当前文档进度 : 2/283, 找到起始位置 : False
扫描 : 当前文档进度 : 3/283, 找到起始位置 : False
扫描 : 当前文档进度 : 4/283, 找到起始位置 : False
扫描 : 当前文档进度 : 5/283, 找到起始位置 : False
扫描 : 当前文档进度 : 6/283, 找到起始位置 : False
扫描 : 当前文档进度 : 7/283, 找到起始位置 : False
扫描 : 当前文档进度 : 8/283, 找到起始位置 : False
扫描 : 当前文档进度 : 9/283, 找到起始位置 : False
扫描 : 当前文档进度 : 10/283, 找到起始位置 : False
扫描 : 当前文档进度 : 11/283, 找到起始位置 : False
扫描 : 当前文档进度 : 12/283, 找到起始位置 : False
扫描 : 当前文档进度 : 13/283, 找到起始位置 : False
扫描 : 当前文档进度 : 14/283, 找到起始位置 : False
扫描 : 当前文档进度 : 15/283, 找到起始位置 : False
扫描 : 当前文档进度 : 16/283, 找到起始位置 : False
扫描 : 当前文档进度 : 17/283, 找到起始位置 : False
扫描 : 当前文档进度 : 18/283, 找到起始位置 : True
扫描 : 当前文档进度 : 19/283, 找到起始位置 : True
扫描 : 当前文档进度 : 20/283, 找到起始位置 : True
扫描 : 当前文档进度 : 21/283, 找到起始位置 : True
扫描 : 当前文档进度 : 22/283, 找到起始位置 : True
扫描 : 当前文档进度 : 23/283, 找到起始位置 : True
扫描 : 当前文档进度 : 24/283, 找到起始位置 : True
扫描 : 当前文档进度 : 25/283, 找到起始位置 : True
扫描 : 当前文档进度 : 26/283, 找到起始位置 : True
扫描 : 当前文档进度 : 27/283, 找到起始位置 : True
扫描 : 当前文档进度 : 28/283, 找到起始位置 : True
扫描 : 当前文档进度 : 29/283, 找到起始位置 : True
扫描 : 当前文档进度 : 30/283, 找到起始位置 : True
扫描 : 当前文档进度 : 31/283, 找到起始位置 : True
扫描 : 当前文档进度 : 32/283, 找到起始位置 : True
扫描 : 当前文档进度 : 33/283, 找到起始位置 : True
扫描 : 当前文档进度 : 34/283, 找到起始位置 : True
扫描 : 当前文档进度 : 35/283, 找到起始位置 : True
扫描 : 当前文档进度 : 36/283, 找到起始位置 : True
扫描 : 当前文档进度 : 37/283, 找到起始位置 : True
> 已保存文件，文件名：102030_2019，长度：18753，目录：无，本次运行处理文件个数：3

Process finished with exit code 0

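Done!

The speed is clearly better. Still, the fallback should not simply scan page by page from the start. It should first read the table-of-contents pages at the front of the document and take the target pages from there. I will see whether the need comes up again and improve it then.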
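A rough sketch of that improvement: parse the text of the printed contents page and read the page numbers from it. The line format assumed here (a "第X节" heading, a title, leader dots, then a page number) is a guess at what such reports look like, and `TOC_LINE_RE`, `parse_toc`, `section_pages`, and the sample text are all hypothetical.

```python
import re

# Assumed line shape: "第四节 经营情况讨论与分析 ........ 13"
TOC_LINE_RE = re.compile(
    r'^[ \t]*第.{1,3}节[ \t]*(?P<title>.*?)[ \t.…·]+(?P<page>\d+)[ \t]*$',
    re.MULTILINE,
)


def parse_toc(toc_text):
    """Return a list of (title, printed_page) tuples in document order."""
    return [(m.group('title'), int(m.group('page')))
            for m in TOC_LINE_RE.finditer(toc_text)]


def section_pages(toc_text, key_word):
    """Return (start_page, next_section_page) for the first matching title.

    next_section_page is None when the match is the last entry.
    Returns None when no title contains key_word.
    """
    entries = parse_toc(toc_text)
    for i, (title, page) in enumerate(entries):
        if key_word in title:
            next_page = entries[i + 1][1] if i + 1 < len(entries) else None
            return page, next_page
    return None


# Made-up contents page for illustration.
toc = (
    '目录\n'
    '第一节 重要提示、目录和释义 ........ 2\n'
    '第四节 经营情况讨论与分析 ........ 13\n'
    '第五节 重要事项 ........ 29\n'
)
pages = section_pages(toc, '经营情况讨论与分析')
```

The numbers found this way are the page numbers printed in the document, which may differ from the zero-based page indices pdfminer expects, so an offset would still have to be worked out for each document.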

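Summary

Python can export PDF documents to TXT, HTML, tables, XML, images, and more.

PyPDF2 is mainly used here to read the outline; it also splits and merges documents. The main functions used: getNumPages() returns the total number of pages, getOutlines() returns the outline, and getDestinationPageNumber(destination) returns the page number an outline entry points to.

pdfminer is very powerful, although so far I only know how to export text with it. The main functions: PDFPage.create_pages(pdf_doc) iterates over all pages, and PDFPage.get_pages(pdf_file, pagenos=page_set) iterates over only the pages in the given set.

pdfplumber supposedly can recognize text in images, but I did not verify this.

The rest (collecting all PDF files under a root directory, reading, saving, and updating the configuration) comes down to Python basics and program logic. Regular expressions are a handy tool too.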

