A PCA-random forest pipeline for high-resolution SNP-based cultivar discrimination in Leymus chinensis
| Authors: Kong CF, Hu SY, Tian L, Chen SY* |
| Impact factor: 4.8 |
| Journal: BMC Plant Biology |
| Year: 2026 |
| Volume: 26 Issue: 1 Page: 464 |
Background
Leymus chinensis (sheepgrass) is a key perennial forage grass for grassland restoration in northern China, but its complex genome and high genetic diversity hinder precise cultivar identification using traditional morphological methods. Recent advances in SNP-based molecular markers provide efficient and reliable tools for Distinctness, Uniformity and Stability (DUS) testing and cultivar-rights protection in this species.
Results
Using a custom-designed sheepgrass 50K whole-genome liquid-phase SNP array, we genotyped 223 individuals from 11 accessions (nine cultivars, one breeding line, and one wild accession), sampling 15–30 individuals per accession. After stringent quality control, 159,262 high-confidence SNPs were retained, more than 60% of which were located in gene-associated regions. Population structure and phylogenetic analyses revealed clear genetic differentiation among most accessions, although some cultivars showed substantial genomic overlap. Principal component analysis (PCA) alone clearly separated only two cultivars, indicating limited resolution for differentiating all populations. To improve identification power, we integrated PCA with Random Forest (RF) classification and established a core panel of 575 SNPs. The resulting assignment model achieved a mean correct-classification rate of 74.36% across the 11 sheepgrass populations; three cultivars (breed 11, breed 7, and breed 8) exceeded 80% accuracy, with the highest reaching 88.56%. Resolution was lower for the breeding line and the wild accession, reflecting the high heterozygosity and complex ancestry of these two groups.
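As an illustration of how a mean correct-classification rate across populations can be read, the sketch below computes a per-accession rate from true and assigned labels and then averages those rates with equal weight per accession. This is an assumption about the metric, not a description of the authors' calculation: the abstract does not state whether the reported mean is an unweighted average over accessions or a pooled rate over individuals. The function names and the toy labels `A`, `B`, and `C` are hypothetical and are not data from the study.

```python
from collections import Counter


def per_accession_rates(true_labels, predicted_labels):
    """Return {accession: fraction of its individuals assigned back to it}."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    totals = Counter(true_labels)
    # Count only individuals whose assigned accession matches the true one.
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {acc: correct[acc] / n for acc, n in totals.items()}


def mean_rate(rates):
    """Unweighted mean over accessions (each accession counts once)."""
    return sum(rates.values()) / len(rates)


# Toy data (hypothetical, not from the study): three accessions of unequal size.
true_labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 2
predicted_labels = [
    "A", "A", "A", "B",  # accession A: 3 of 4 correct
    "B", "B", "B", "B",  # accession B: 4 of 4 correct
    "C", "A",            # accession C: 1 of 2 correct
]

rates = per_accession_rates(true_labels, predicted_labels)
overall = mean_rate(rates)
# Accessions whose rate exceeds the 80% threshold mentioned in the text.
above_80 = sorted(acc for acc, r in rates.items() if r > 0.80)

print(rates)
print(round(overall, 4))
print(above_80)
```

In this toy case the unweighted mean over accessions is 0.75, whereas the pooled rate over all ten individuals is 0.80. The two can differ whenever accessions are sampled unevenly, as with the 15–30 individuals per accession reported here.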
Conclusions
This high-throughput, cost-effective SNP assay enables identification of sheepgrass cultivars. By integrating PCA and Random Forest, a core set of 575 SNPs was established that achieved a mean correct-classification rate of 74.36% across 11 populations and more than 80% for three cultivars. This strategy is also broadly applicable to cultivar identification in other species.