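Objective Handwritten text line extraction is an important basic step in document image processing. In unconstrained handwritten text images, text lines are skewed, curved, crossing, or touching to varying degrees. Traditional geometric segmentation or clustering methods often cannot guarantee accurate segmentation of text line edges. To address these problems, a handwritten text line extraction method based on a joint text line regression-clustering framework is proposed. Method First, a bank of anisotropic Gaussian filters is applied to analyze the image at multiple scales and orientations; the smearing effect is used to detect ridge structures and extract the main body area of each text line, which is then skeletonized to obtain the text line regression model. Next, a superpixel representation is built with connected components as the basic image units. To cluster the superpixels, a pixel-superpixel-text line associative hierarchical random field model is established, and energy function optimization is used to cluster the superpixels and label the text line each one belongs to. On this basis, all character blocks that touch across lines are detected, and a regression-line-based k-means clustering algorithm, guided by the regression model, clusters the pixels of the touching characters, which segments them and labels the text line each pixel belongs to. Finally, text line label switches are used to control the display of text line pixels and to extract them selectively, so geometric segmentation is no longer needed. Result In text line extraction tests on the HIT-MW offline handwritten Chinese document dataset, the detection rate (DR) is 99.83% and the recognition accuracy (RA) is 99.92%. Conclusion Experiments show that, compared with traditional methods such as piece-wise projection analysis, minimum spanning tree clustering, and seam carving, the proposed text line regression-clustering joint analysis framework improves the controllability and segmentation accuracy of text line edges. It extracts handwritten text lines efficiently while avoiding interference from adjacent text lines to the greatest extent, and it achieves high accuracy and robustness.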
Combination of regression and clustering for handwritten text line extraction
Zhu Jianfei, Ying Zilu, Chen Pengfei (School of Information Engineering, Wuyi University)
Objective Handwritten text line extraction is a fundamental step in document image processing. In unconstrained handwritten documents, text lines are skewed, curved, crossing, or touching to varying degrees because of free page layout and writing style. Traditional geometric segmentation or clustering methods cannot guarantee accurate classification of the pixels between adjacent text lines. To address these problems, this paper proposes a handwritten text line extraction method based on a joint text line regression-clustering framework. Method First, a bank of anisotropic Gaussian filters analyzes the handwritten document image at multiple scales and orientations. The main body area (MBA) of each text line is extracted by detecting the ridge structures produced by the smearing effect of the filters, and the text line regression model is obtained by skeletonizing the MBA. Next, a superpixel representation is built with connected components as the basic image units. To classify and cluster the superpixels, an approach based on associative hierarchical random fields is presented: a higher-order energy model is established over a hierarchical Pixel-Connected Component-Text Line network, and minimizing its energy function yields the text line label of each connected component. Based on these instance labels, touching characters, that is, character blocks that join adjacent text lines yet carry a single label, are detected. Their pixels are then re-clustered with a k-means algorithm under the constraint of the text line regression model. With instance labels for all text lines, individual lines can be displayed or extracted by switching their labels on and off. Geometric segmentation of the document image is therefore no longer needed, and a bounding box can be used to extract each text line directly.
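The re-clustering of touching-character pixels under the regression model can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each regression line is available as polynomial coefficients for y = f(x), measures distance vertically, and lets each "centroid" be the regression curve plus a learned vertical offset. The function name `regression_guided_kmeans` and all parameters are hypothetical.

```python
import numpy as np

def regression_guided_kmeans(pixels, line_coeffs, n_iter=10):
    """Split the pixels of one touching block between candidate text lines.

    pixels      : (N, 2) sequence of (x, y) coordinates.
    line_coeffs : one coefficient list per regression line, in np.polyval
                  order (assumed representation, not taken from the paper).
    Returns an (N,) integer array of indices into line_coeffs.
    """
    pixels = np.asarray(pixels, dtype=float)
    x, y = pixels[:, 0], pixels[:, 1]
    # Height of every regression line at every pixel column: shape (K, N).
    base = np.stack([np.polyval(c, x) for c in line_coeffs])
    offsets = np.zeros(len(line_coeffs))  # per-line vertical shift
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest shifted regression line (vertical distance).
        dist = np.abs(y[None, :] - (base + offsets[:, None]))
        new_labels = dist.argmin(axis=0)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        # Update step: re-centre each line on the pixels it currently owns.
        for k in range(len(line_coeffs)):
            owned = labels == k
            if owned.any():
                offsets[k] = np.mean(y[owned] - base[k, owned])
    return labels
```

For two horizontal regression lines at y = 10 and y = 30, pixels near each line receive that line's index. Because the regression curves seed the clusters, the number of clusters and their initial positions come from the regression model rather than from random initialization.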
Result Experiments on the document-level HIT-MW offline handwritten Chinese dataset achieved an overall detection rate (DR) of 99.83% and a recognition accuracy (RA) of 99.92%, which is state-of-the-art performance for Chinese handwritten text line extraction. Conclusion The experimental results show that the proposed text line regression-clustering joint framework improves pixel-level segmentation accuracy and makes text line edges more controllable than traditional algorithms such as piece-wise projection, minimum spanning tree-based clustering, and seam carving. The proposed system extracts Chinese handwritten text lines efficiently, with high accuracy and robustness, while excluding interference from adjacent text lines to the greatest extent.
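The label-switch extraction that keeps adjacent lines from interfering can be sketched as follows. This is an illustrative example under stated assumptions, not the paper's code: it assumes a binarized page `binary` (foreground > 0) and a per-pixel `label_map` holding the text line index assigned by the clustering stages (0 for background); `extract_line` is a hypothetical helper.

```python
import numpy as np

def extract_line(binary, label_map, line_id):
    """Crop one text line by its bounding box with all other labels off."""
    binary = np.asarray(binary)
    label_map = np.asarray(label_map)
    # Switch on only the foreground pixels that carry the requested label.
    mask = (label_map == line_id) & (binary > 0)
    if not mask.any():
        raise ValueError(f"no foreground pixels carry label {line_id}")
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    r0, r1 = rows[0], rows[-1] + 1
    c0, c1 = cols[0], cols[-1] + 1
    # Pixels of other lines inside the bounding box are zeroed out.
    return np.where(mask, binary, 0)[r0:r1, c0:c1]
```

Because selection is done per pixel rather than by a cutting path, strokes of a neighbouring line that intrude into the bounding box are suppressed instead of being cropped along with the target line.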