Wan Yuan, Chen Xiaoli, Zhang Jinghui, Ou Zhuoling (College of Science, Wuhan University of Technology, Wuhan 430070, Hubei, China)
Objective Feature dimensionality reduction is an active research topic in machine learning. Existing low-rank sparsity preserving projection methods ignore the information loss between the original data space and the reduced low-dimensional space, and they cannot effectively handle the case of a small amount of labeled data together with a large amount of unlabeled data. To address these two problems, a semi-supervised feature selection method based on low-rank sparse graph embedding (LRSE) is proposed. Method LRSE consists of two steps. First, the low-rank sparse representations of the labeled and unlabeled data are learned separately, making full use of both. Second, the objective function jointly accounts for the information difference between the data before and after dimensionality reduction and for the preservation of structural information during the reduction: minimizing an information loss function retains as much of the useful information in the data as possible, while embedding the low-rank sparse graph, which captures both the global structure and the intrinsic geometric structure of the data, into the low-dimensional space preserves the structural information of the original data space, so that more discriminative features can be selected. Result The proposed method is tested on six public datasets; KNN classification on the reduced data is used to measure classification accuracy, and the method is compared experimentally with existing dimensionality reduction algorithms. The classification accuracy improves in all cases, and the proposed method achieves the highest accuracy on five of the six datasets: it outperforms the second-best algorithm, robust unsupervised feature selection (RUFS), by 11.19% on the Wine dataset and by 0.57% on the Breast dataset, and outperforms the second-best algorithm, multi-cluster feature selection (MCFS), by 1% on Orlraws10P, by 1.07% on Coil20, and by 2.5% on Orl64. Conclusion The proposed semi-supervised feature selection algorithm based on low-rank sparse graph embedding allows the reduced data to retain the information of the original data to the greatest possible extent and effectively handles the case of few labeled samples and many unlabeled samples. Experimental results show that the proposed method classifies better than existing algorithms. However, because the method assumes that all features lie on linear manifolds, it is applicable only to data on linear manifolds.
Semi-supervised feature selection based on low-rank sparse graph embedding
Wan Yuan,Chen Xiaoli,Zhang Jinghui,Ou Zhuoling(College of Science, Wuhan University of Technology, Wuhan 430070, China)
Objective With the widespread use of high-dimensional data, dimensionality reduction has become an important research direction in both machine learning and data mining. High-dimensional data require additional storage space, incur high computational complexity, and are time consuming to process. Given that many redundant features have minimal effect on data analysis, the sparsity preserving projections (SPP) method was proposed; it maintains the sparse reconstruction relations of the data by minimizing an objective function with an l1-norm regularization term. Thus, the approach can capture the intrinsic geometric structure of the data without any parameter setting. However, traditional SPP neglects the potential global structure because it computes the sparse representation of each sample separately; consequently, SPP is not robust to noise. Another dimensionality reduction method, low-rank sparse preserving projection, preserves both the global structure and the local linear structure of the data by constructing a low-rank sparse graph. However, this method ignores the loss of information between the original high-dimensional data and the reduced low-dimensional data. Furthermore, it fails to address the situation in which only a small number of labeled samples and a large number of unlabeled samples are available. To solve these two problems, a semi-supervised feature selection method based on low-rank sparse graph embedding (LRSE) is proposed in this paper. Method The proposed LRSE consists of two steps. The first is to learn the low-rank sparse representation of the original data from a small amount of labeled data and a large amount of unlabeled data. The second is to consider, in a unified model, both the information difference between the high- and low-dimensional data and the preservation of structural information; by minimizing the information loss function, the useful information in the data is preserved as much as possible.
Furthermore, the structural information in the original data space is preserved by embedding the low-rank sparse graph, which encodes the global structure and the intrinsic geometric structure of the data, into the low-dimensional space. Therefore, the method can select more discriminative features. The solution process of the objective function is presented in detail in Section 4, where the alternating optimization method is used to convert the non-convex problem into two convex subproblems, which are then solved with the Lagrange multiplier method. Result To validate the performance of the proposed algorithm, we conduct five experiments on six public datasets. The first two experiments perform dimensionality reduction on the Wine dataset and the face dataset Orl64, and the obtained features are visualized. Figures 1 and 2 show that the features selected by the proposed method have strong discriminative ability and carry little redundant information. The next two experiments perform feature selection on all six datasets, and KNN classification on the reduced data is used to verify the effectiveness of the proposed method. Table 2 shows that, compared with four other feature selection methods and the baseline, the average classification accuracy (ACC) of LRSE is the highest on five of the datasets, the exception being WarpPIE10P. In particular, the ACC increases by 11.19% on the Wine dataset. The classification accuracy is also better than that of the original high-dimensional data in most cases; a reasonable explanation is that the method eliminates redundant features during dimensionality reduction, which also corroborates the conclusions of experiments 1 and 2. The last experiment is a parameter sensitivity analysis, which shows that the classification accuracy is most stable on the majority of datasets when α = 1.
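To make the first step concrete, the low-rank sparse representation can be sketched with alternating optimization (ADMM), in the spirit of the solution process described above. The exact objective and update rules of Section 4 are not given in this abstract, so the formulation below — self-expression with combined l1-norm and nuclear-norm penalties, and the weights `alpha`, `beta`, `mu` — is an illustrative assumption, not the paper's formulation:

```python
import numpy as np

def soft_threshold(A, tau):
    # entrywise shrinkage: proximal operator of tau * ||.||_1
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def svd_threshold(A, tau):
    # singular value shrinkage: proximal operator of tau * ||.||_*
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def low_rank_sparse_representation(X, alpha=0.1, beta=0.1, mu=1.0, n_iter=200):
    """ADMM sketch for  min_W 0.5||X - XW||_F^2 + alpha||W||_1 + beta||W||_*
    via the splitting W = J (sparse copy) and W = S (low-rank copy).
    X has shape (d, n): one sample per column, self-expressed as X ~ XW."""
    n = X.shape[1]
    G = X.T @ X                                  # Gram matrix, (n, n)
    J = np.zeros((n, n)); S = np.zeros((n, n))   # auxiliary variables
    Y1 = np.zeros((n, n)); Y2 = np.zeros((n, n)) # dual variables
    A = G + 2.0 * mu * np.eye(n)                 # fixed LHS of the W-update
    for _ in range(n_iter):
        # W-update: quadratic subproblem with a closed-form solution
        W = np.linalg.solve(A, G + mu * (J + S) - Y1 - Y2)
        # J-update: prox of the l1 term keeps the graph sparse
        J = soft_threshold(W + Y1 / mu, alpha / mu)
        # S-update: prox of the nuclear-norm term keeps the graph low rank
        S = svd_threshold(W + Y2 / mu, beta / mu)
        # dual ascent on the consensus constraints W = J and W = S
        Y1 += mu * (W - J)
        Y2 += mu * (W - S)
    return W
```

The symmetrized weights (|W| + |Wᵀ|)/2 would then serve as the low-rank sparse graph to be embedded in the low-dimensional space.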
Conclusion This study proposes a semi-supervised feature selection method based on low-rank sparse graph embedding; the method learns the low-rank sparse representation of labeled and unlabeled data and embeds it in the low-dimensional space, so that the information contained in the original data is preserved as much as possible. Moreover, the information loss caused by dimensionality reduction is minimized by adding the reconstruction error of the data to the objective function. The combination of these two strategies improves the effectiveness of the proposed dimensionality reduction method. A series of comparative experiments with several existing methods on multiple datasets shows that the proposed method is more effective than existing dimensionality reduction methods. Finally, because the method assumes that all features lie on linear manifolds, it is not suitable for all types of data; future research will focus on extending its applicability to a wider range of data via the kernel trick.
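The evaluation protocol from the Result section — rank the features, keep the best ones, and classify the reduced data with KNN — can be sketched as below. Since the abstract does not give LRSE's feature-scoring rule, a simple variance score stands in for it here, and the synthetic data, `n_selected`, and the 5-nearest-neighbor setting are illustrative choices rather than the paper's experimental configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def evaluate_selection(X, y, scores, n_selected, k=5, seed=0):
    """Keep the n_selected highest-scoring features, then report KNN accuracy."""
    top = np.argsort(scores)[::-1][:n_selected]   # indices of the best features
    X_reduced = X[:, top]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_reduced, y, test_size=0.3, random_state=seed)
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

# synthetic stand-in data; per-feature variance stands in for the LRSE score
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           random_state=0)
acc = evaluate_selection(X, y, scores=X.var(axis=0), n_selected=20)
```

Sweeping `n_selected` and comparing `acc` against the accuracy on the full feature set reproduces the kind of baseline comparison reported in Table 2.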