从需求文档到智能化测试基于-PaddleOCR-的图片信息处理与自动化提取

2025-03-14 约 2080 字预计阅读 5 分钟

https://bing.ee123.net/img/rand?artid=146246715

从需求文档到智能化测试：基于 PaddleOCR 的图片信息处理与自动化提取

在软件测试的实际工作中，需求文档是测试工程师的重要工具。然而，随着需求文档的复杂化，文档中的图片（如业务流程图、UI 示意图和测试用例说明图）不仅承载了大量关键信息，也是测试用例设计的重要依据。如何从需求文档中快速、准确地提取图片内容，甚至将其转化为结构化文本进行分析，是提升测试效率的关键。本文将通过结合 Python-Docx 和 PaddleOCR ，实现一套自动化提取需求文档图片并进行文字识别的解决方案。测试工程师可以利用本文提供的工具，快速从指定标题下的图片中提取内容并识别文字，大幅减少人工操作的时间成本。

1 测试工程师的痛点：为何要自动提取图片并识别文字？

在需求分析阶段，测试工程师通常需要从需求文档中提取图片，并根据图片内容设计测试用例。这一过程中存在以下常见痛点：

图片信息分散 图片可能分布在文档的多个章节中，手动查找费时费力。
手动识别效率低下 即使找到图片，人工识别和整理图片内容容易出现遗漏或理解偏差。
文档标题样式多样化 不同文档的标题样式可能不一致，难以通过固定规则准确提取。
图片文字识别复杂 图片中的文字可能包含中英文混合内容、复杂表格或特殊字体，普通工具识别效果有限。基于这些痛点，我们提出了一个解决方案：通过 Python-Docx 提取指定标题下的图片，并结合 PaddleOCR 对图片内容进行文字识别，最终将识别结果输出，供测试工程师高效使用。

2 技术方案：从图片提取到文字识别

我们的技术方案分为以下几个步骤：

提取指定标题下的图片 利用 python-docx 从需求文档中提取指定标题下的图片，支持模糊匹配标题，兼容多种标题样式。
使用 PaddleOCR 进行文字识别 对提取的图片进行 OCR 处理，识别图片中的文字内容，支持中英文混合及表格处理。
输出结构化识别结果 将识别到的内容按标题和图片分组输出，便于测试工程师直接使用。以下是完整的代码实现。

3 代码实现：自动提取图片并进行文字识别

3.1 获取指定标题下的图片

我们通过 python-docx 提取文档中指定标题下的图片，并保存到本地： import os import cv2 import numpy as np import xml.etree.cElementTree as ET from docx import Document def is_heading_enhanced(paragraph): """ 判断段落是否为标题（增强版）。包含对样式名称、字体大小和加粗特征的检查。 """ heading_features = { ‘keywords’: [‘标题’, ‘heading’, ‘header’, ‘h1’, ‘h2’, ‘h3’, ‘chapter’], ‘font_size’: (14, 72) } style_match = any( kw in (paragraph.style.name or “”).lower() for kw in heading_features[‘keywords’] ) if style_match: return True try: font = paragraph.style.font or paragraph.document.styles[‘Normal’].font effective_size = font.size.pt if font.size else 12 return heading_features[‘font_size’][0] <= effective_size <= heading_features[‘font_size’][1] except Exception as e: print(f"格式检查失败: {str(e)}") return False def get_target_pic(docx_file, target_title): """ 从指定 docx 文件中提取目标标题下的图片。 """ doc = Document(docx_file) target_title_list = [title.strip() for title in target_title.split(",") if title.strip()] xml_elements = [] found_start = False for par in doc.paragraphs: par_text = par.text.strip() if par_text and any(title in par_text for title in target_title_list) and is_heading_enhanced(par): found_start = True continue elif is_heading_enhanced(par): found_start = False if found_start: xml_elements.append(par.element.xml) rId_list = [] def get_img_rId(root_element, target_tag, target_attribute, out_list): for child in root_element: if child.tag in target_tag and target_attribute in child.attrib: out_list.append(child.attrib[target_attribute]) else: get_img_rId(child, target_tag, target_attribute, out_list) for element in xml_elements: root = ET.fromstring(element) target_tags = [ ‘{urn:schemas-microsoft-com:vml}imagedata’, ‘{http://schemas.openxmlformats.org/drawingml/2006/main}blip’ ] target_attributes = [ ‘{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed’, ‘{http://schemas.openxmlformats.org/officeDocument/2006/relationships}id’ ] for target_attribute in target_attributes: get_img_rId(root, target_tags, target_attribute, rId_list) images = [] output_dir = os.path.dirname(docx_file) for idx, rId in enumerate(rId_list): try: img_part = doc.part.related_parts[rId] img_binary = img_part.blob img = cv2.imdecode(np.frombuffer(img_binary, np.uint8), cv2.IMREAD_COLOR) if img is None: continue img_name = os.path.join(output_dir, f"extracted_image{idx + 1}.jpg") cv2.imwrite(img_name, img) images.append(img_name) except Exception as e: print(f"图片提取失败，rId={rId}: {e}") return images if images else None

3.2 使用 PaddleOCR 识别提取的图片

我们使用 PaddleOCR 对提取的图片进行文字识别： from paddleocr import PaddleOCR

初始化 PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang=“ch”) # 支持方向分类和中英文混合识别 def perform_ocr_with_paddle(images): """ 使用 PaddleOCR 对图片进行文字识别。 """ results = [] for image_path in images: try: img = cv2.imread(image_path) ocr_result = ocr.ocr(img, cls=True) text_lines = [line[1][0] for line in ocr_result[0]] results.append((image_path, “\n”.join(text_lines))) except Exception as e: results.append((image_path, f"OCR 识别失败: {e}")) return results

3.3 组合主流程

if name == ‘main’:

用户输入的标题和文档路径

target_title = ‘业务流程,测试用例’ file_path = ‘需求文档.docx’

提取图片

print(“正在提取图片…”) image_paths = get_target_pic(file_path, target_title) if not image_paths: print(“未找到目标标题下的图片”) exit() print(f"提取到 {len(image_paths)} 张图片：{image_paths}")

OCR 识别

print("\n正在进行 OCR 识别…") ocr_results = perform_ocr_with_paddle(image_paths)

输出结果

for image_path, text in ocr_results: print(f"\n图片路径：{image_path}") print(f"识别内容：\n{text}")

4 示例运行结果

用户输入

目标标题：业务流程,测试用例文档路径：xxx.docx

输出结果

提取到 1 张图片： [‘D:\python_project\with_playwright\0312\img0.jpg’] 提取到 1 张图片：[‘D:\python_project\with_playwright\0312\img0.jpg’] 正在进行 OCR 识别… 图片路径：D:\python_project\with_playwright\0312\img0.jpg 识别内容：专员创建活动发送二维码邀请填写信息提交生成客户信息扫码核销签到码拍摄照片上传 T+7录入跟踪信息跟踪90天加保信息

5 测试工程师的收益

通过本文提供的工具，测试工程师可以：

快速提取图片信息 ：自动提取指定标题下的所有图片，避免手动查找。
精准识别图片文字 ：结合 PaddleOCR，支持中英文混合内容的高效识别。
提升测试效率 ：将图片内容转化为结构化文本，直接为测试用例设计提供依据。

6 总结

本文结合 Python-Docx 和 PaddleOCR ，实现了从需求文档中提取图片并进行文字识别的完整流程。通过这一工具，测试工程师可以显著提升需求分析的效率，同时减少人工操作的错误率。未来，我们还可以进一步优化识别算法，支持复杂表格和图形的解析，为测试自动化提供更多可能性。

目录

从需求文档到智能化测试基于-PaddleOCR-的图片信息处理与自动化提取

从需求文档到智能化测试：基于 PaddleOCR 的图片信息处理与自动化提取

1 测试工程师的痛点：为何要自动提取图片并识别文字？

2 技术方案：从图片提取到文字识别

3 代码实现：自动提取图片并进行文字识别

3.1 获取指定标题下的图片

3.2 使用 PaddleOCR 识别提取的图片

初始化 PaddleOCR

3.3 组合主流程

用户输入的标题和文档路径

提取图片

OCR 识别

输出结果

4 示例运行结果

用户输入

输出结果

5 测试工程师的收益

6 总结