Updated April 28, 2025:
Objective 1. Continue to develop and enhance genomics-assisted breeding resources and tools to facilitate routine application in public breeding programs.
Objective 1 can be divided into five sub-objectives to report specific, significant progress during these past six months.
Sub-obj 1: Continue to genotype with genome-wide and trait-targeted markers all new breeding lines entered in the Northern Uniform Soybean Tests
As reported in the last progress report, 511 new breeding lines submitted to regional trials were grown. Tissue was collected and DNA was extracted from the tissue. We sent DNA of the 511 breeding lines to David Hyten at UNL for genotyping via skim sequencing. Previously we sent samples to a private service vendor, but because of dramatic price increases, we decided to work with co-PI Hyten, who could provide very similar data for 70% of the cost. This delayed data delivery, but we are confident the data will be in hand by May. We have now genotyped over 4000 advanced breeding lines entered into the public regional trials, creating an impressive resource that helps current and future soybean breeders and geneticists connect genotype to phenotype and develop genomics-assisted breeding resources.
Data from the 2024 NUST trials were collected this past fall. We formatted the data and sent them to Rex Nelson at SoyBase, where they will be uploaded soon.
A manuscript on this work has been submitted to the scientific journal Crop Science. It was recently accepted pending revision. We are currently editing the manuscript for final acceptance.
1) Wartha, C.A., B. Campbell, V. Ramasubramanian, L. Nice, ….19 authors….A.J. Lorenz*. 2025. Genomic analysis and predictive modeling in the Northern Uniform Soybean Tests. Crop Science (Accepted pending revision).
Sub-obj 2: Enable individual public breeding programs to test and use genomic prediction
Originally, this project explicitly funded the integration of genomic prediction into the public soybean breeding pipelines to expedite yield improvement. Because of budget cuts, the funding for this part of the project was removed. Nevertheless, this project has continued to instigate and enable several public programs to start using genomic prediction routinely. Below are some highlights from individual program reports that are part of the SOYGEN initiative.
University of Nebraska
Aiming to generate new recombinant populations with high yield and resistance to biotic stress, the UNL soybean breeding program conducted a genomic selection analysis following the 2024 field trials, utilizing phenotypic datasets from multiple years and locations to train the prediction model. This dataset was formed by UNL lines that belonged to elite populations designed to carry resistance alleles to the Soybean Cyst Nematode (Heterodera glycines) for the rhg-1a//rhg-1a, Rhg-2//Rhg-2, and Rhg-4//Rhg-4 genes. These lines have been evaluated in field trials since 2022. In addition, the Northern Uniform Soybean Tests (NUST) yield datasets from 2012, 2018, 2019, and 2020 were added to train the model. Lines from both datasets have been extensively tested in maturity 2 and maturity 3 locations in Nebraska and surrounding states. UNL lines were genotyped with molecular inversion probes (MIPs), and the NUST lines were genotyped with the 6K SNP chip. Genotypes were imputed and filtered accordingly. For the analyses, yield values were adjusted for the experimental design model, and the best linear unbiased estimates (BLUEs) were used as input for the genomic selection analyses. These analyses accounted for genotype-by-environment interaction (GEI), considering that complex, non-linear interactions between lines and environments regularly occur in the soybean breeding context. The GS4PB R Shiny App (previously known as SOYGEN2 R Shiny App) and its code were used to run the analyses.
Seven lines were selected by the breeder following the analyses. These lines were highly ranked based on their GEBV from the genomic prediction analyses. In addition, they contain at least one allele of interest for the three mentioned genes, and some of these lines are homozygous for two loci of significant interest (rhg-1a//rhg-1a, Rhg-4//Rhg-4).
These seven lines were selected as parents for eight new combinations in the UNL Winter Nursery Crossing Block project, conducted in Puerto Rico between January and April 2025. It is expected that F1 plants of these populations will be planted and genotyped in Nebraska in June 2025. Going forward, these seven new populations will be advanced, and their selected progenies will be tested in multi-environment field trials to identify superior lines for yield and resistance to Soybean Cyst Nematode. Additionally, using SOYGEN2 yield datasets in both regular and sparse genomic selection designs will enable the UNL Soybean Breeding Program to efficiently select superior lines.
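As a rough illustration of how genomic estimated breeding values (GEBVs) like those used to rank the UNL lines can be computed, the sketch below fits a ridge-regression marker-effect model (often called RR-BLUP) to line BLUEs and then scores candidate lines. It is a minimal single-environment sketch and not the GS4PB implementation, which also models genotype-by-environment interaction. The marker coding (-1, 0, 1), the shrinkage value `lam`, the function names, and the toy data are all assumptions made for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ridge_marker_effects(markers, blues, lam):
    """Estimate marker effects u = (Z'Z + lam*I)^-1 Z'(y - mean(y)).

    With lam > 0 the matrix is positive definite, so the solve cannot hit a zero pivot.
    """
    n, p = len(markers), len(markers[0])
    mean_y = sum(blues) / n
    yc = [v - mean_y for v in blues]
    ztz = [[sum(markers[i][j] * markers[i][k] for i in range(n))
            + (lam if j == k else 0.0) for k in range(p)] for j in range(p)]
    zty = [sum(markers[i][j] * yc[i] for i in range(n)) for j in range(p)]
    return mean_y, solve(ztz, zty)


def gebv(markers, effects):
    """GEBV of each line = sum over markers of genotype code times marker effect."""
    return [sum(z * u for z, u in zip(row, effects)) for row in markers]


# Toy training set: 4 lines x 2 markers with yield BLUEs (all values made up).
train_markers = [[1, -1], [1, 1], [-1, -1], [-1, 1]]
train_blues = [52.0, 50.0, 42.0, 40.0]
mean_yield, effects = ridge_marker_effects(train_markers, train_blues, lam=1.0)

# Hypothetical candidate lines, ranked from highest to lowest GEBV.
candidates = {"A": [1, -1], "B": [-1, 1], "C": [1, 1]}
scores = dict(zip(candidates, gebv(list(candidates.values()), effects)))
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In practice the ranked list would then be filtered on the trait-targeted markers, for example keeping only lines that carry the SCN resistance alleles of interest.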
University of Missouri
Andrew Scaboo’s lab is analyzing the data collected in the SOYGEN2 genomic selection experiment. This experiment tested genomic prediction versus phenotypic selection versus random selection at the University of Minnesota, North Dakota State University, University of Illinois, and University of Missouri. The selection treatments we applied in the original experiment were not as successful as we had hoped. Currently, we are analyzing the data to learn why the genomic selection treatment was not as successful as anticipated, and how we can better understand and utilize it in the future. Because this multi-institutional dataset is large and complex, we are first developing the analysis framework and treatments using the Missouri data only. Several initial analyses were described in the last progress report. During this last reporting period, we extensively evaluated the effect of genotype imputation. Figure 1 shows that the genotype imputation methods implemented in the “GS4PB” application improve prediction accuracy overall. This indicates that these methods will be powerful approaches for improving the cost effectiveness of genomic prediction for driving genetic gain in yield.
Figure 1. (See attached document)
University of Minnesota
As described in the last report, the UMN soybean breeding program has refined its GS pipeline and tested it extensively on the UMN Preliminary Yield Trials (PYT) data. PYT 2023 progeny population lines were assayed using a 1K low-density (LD) genotyping assay, and parents of PYT23 lines from the crossing block were assayed using a low-pass sequencing platform to generate high-density (HD) variant data. The 50K SoySNP Chip subset from the HD data set was used as the parental reference panel to impute the 1K LD set to the 50K HD set (~30K SNPs after QC). We used this imputed data to make predictions using genomic prediction models that include GxE interaction effects. In the summer of 2024 we planted a trial including lines selected using genomic prediction and phenotypic selection. The trial was successfully planted at six Minnesota locations. Every location yielded good data at harvest.
During the past reporting period, we analyzed the data to compare the phenotypic selection versus the genomic selections. For each treatment, we selected both high and low yielding lines. Selecting low yielding lines is important as it tests our ability to identify the poorly yielding lines. This ensures we are not devoting precious field phenotyping resources towards lines that are predicted to have low yield.
Sub-obj 3: Development of a genomic prediction R-Shiny app for easy implementation of GS for breeders.
The first version of this application is now complete. We have submitted an article for peer review to the journal Plant Genome.
1) Ramasubramanian, V., C. Wartha, L. Singh, P. Vitale, S. Ru, and A.J. Lorenz*. 2025. GS4PB: An R/Shiny Application to Facilitate a Genomic Selection Pipeline for Plant Breeding. Plant Genome (Submitted).
This manuscript describes the development of the application, the components, how it can be used to execute genomic selection for plant breeding, and how it can be accessed. This application is freely available to the public, and will enable plant breeders to implement genomic selection.
Objective 2. Develop and test methods for predicting cultivar performance in future target environments through genomics-assisted breeding models, phenomics, and environmental characterization.
Three main tasks are associated with this objective.
1. Conduct a large, multi-institutional, multi-environment trial to create a “genotype-by-environment interaction” dataset for soybean that soybean researchers can use to identify methods to improve genomic prediction methodology for yield.
2. Define environmental variables driving genotype-by-environment interactions that can be used to enhance prediction models.
3. Quantify biomass non-destructively so that crop growth can be evaluated in relation to certain environmental stressors.
Progress on each task:
1. For this objective, we are conducting a multi-environment, multi-institutional coordinated performance trial of 1200 diverse breeding lines. Each breeding line will be phenotyped for several agronomic and phenological traits, and each will be genotyped using low pass resequencing technologies. Detailed environmental data for each growing location in each year will be collected and analyzed. The ultimate goal is to better predict the interactions between the environment and genotype. If successful, we will leverage genomic data, phenotype data, and environmental data to predict how new breeding lines may perform in environments that a producer is most likely to encounter.
Last summer these breeding lines were successfully grown and phenotyped for yield at 21 locations. Data has been delivered to the University of Minnesota in a centralized database. Plans were laid for scanning all 12,400 samples with NIR to measure protein and oil. We worked with Perkin-Elmer to develop a protocol to standardize instruments across universities. Standardization samples were collected across universities to represent diversity in germplasm and growing conditions. From UMN, samples were sent to collaborators for scanning on their instruments to create the standardization file. We anticipate that the samples will be scanned for protein and oil content after planting, when time allows.
Another major activity was the design and packaging of the 2025 yield trials. All seeds have been delivered to 2025 field locations, and 90% of the packaging is complete. Fields have been designed and sent to cooperators. All is on schedule for a successful 2025 planting.
Genotype data from the Hyten lab was collected and returned on April 11. The genotype data includes 15 million molecular markers (SNPs) imputed from skim sequencing data, with a 50K subset available. The SNPs were mapped to the latest Williams82 genome version, Wm82a6v1. Analysis of these data will proceed over the coming months.
2. We completed the development of the Seasonal Characterization Engine (SCE). The SCE is a specialized tool integrated within the R environment, designed to streamline the analysis of trial data using the Agricultural Production Systems Simulator (APSIM) model. Users begin by uploading trial data specifying key parameters such as location, latitude, longitude, and maturity. Predefined input files, such as the soybean seed composition test, facilitate accurate data setup.
After data upload, users select a crop model template (e.g., soybean or maize) and specify maturity handling methods. Users then choose appropriate weather and soil databases, considering geographic coverage to avoid analysis errors.
The APSIM model generates environmental variables specific to different crop growing stages, providing a detailed characterization of conditions throughout the growing season. Upon initiating the analysis, the SCE provides real-time process updates via the R console.
Analytical results are available through multiple visualization tools:
1. Results Viewer: Offers box plot visualizations of selected variables, downloadable individually or collectively.
2. Heat Maps: Enables detailed inspection of seasonal and environmental covariates by growth stage, comparing specific parameters within consistent genetic maturity groups.
3. Trial Similarities: Presents correlation heat maps illustrating seasonal profile similarities among trial sites, complemented by dendrograms to clarify site relationships. Outputs are readily exportable for further analysis.
4. Thermal Time and Precipitation: Summarizes accumulated thermal time and precipitation over growing seasons, offering comparative analyses between and within sites across multiple years. These insights support understanding climate variability and trends.
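To make the last summary concrete, the sketch below accumulates thermal time and precipitation from daily weather records. It uses the simple daily-average growing-degree-day formula, which may differ from how APSIM and the SCE compute thermal time internally. The base temperature of 10 °C, the function name, and the weather values are assumptions chosen for illustration.

```python
def accumulate_season(daily_weather, base_temp_c=10.0):
    """Return (accumulated thermal time in degree-days, accumulated precipitation in mm).

    daily_weather: iterable of (tmax_c, tmin_c, precip_mm) tuples, one per day.
    Daily thermal time = max(0, (tmax + tmin) / 2 - base_temp_c).
    """
    thermal_time = 0.0
    precipitation = 0.0
    for tmax_c, tmin_c, precip_mm in daily_weather:
        mean_temp = (tmax_c + tmin_c) / 2.0
        thermal_time += max(0.0, mean_temp - base_temp_c)
        precipitation += precip_mm
    return thermal_time, precipitation


# Three made-up days: two warm days and one day below the base temperature.
season = [(25.0, 15.0, 0.0), (30.0, 20.0, 12.5), (12.0, 4.0, 3.0)]
totals = accumulate_season(season)
print(totals)  # (25.0, 15.5)
```

Running the same function over each site and year gives the accumulated totals that can then be compared between and within sites.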
3. During the 2024 season, the Rainey Lab at Purdue University focused on phenotyping soybean crops using high-resolution UAS imagery from 10 SOYGEN sites to extract canopy traits. A data management system has been set up at Purdue to help SOYGEN collaborators with secure storage of UAS image data, efficient data sharing, and access to processed plot-level outputs. To date, key canopy traits, including spectral, textural, and structural features, have been extracted from 6,320 two-row plots covering about 1,200 genotypes across V3 to R5 growth stages. The processed plot-level data includes labeled and curated RGB image clips, binary masks, and 3D point cloud data. This growing dataset provides valuable resources for soybean breeders, supporting the development of advanced AI/ML algorithms and trait derivation.
For biomass prediction, we established ‘Calibration Plots’ — 8-row plots planted adjacent to the SOYGEN plots to enable nondestructive biomass sampling. A representative subset of 36 lines from the SOYGEN3 GEI panel, with three replications, was used in this experiment. Ground biomass sampling was performed on rows 2, 3, and 4, while rows 6 and 7 were used for UAS-based data collection and harvest yield measurements. UAS flights were conducted within one day before or after each ground sampling event, and from these flights, we extracted several key drone-derived metrics, including canopy coverage, canopy height, green leaf index, and an array of structural and textural features.
In addition to the calibration dataset, we incorporated our previous ‘Public Biomass’ experiments (2021–2023), which utilized public soybean breeding germplasm from the north-central U.S. region.
Biomass prediction was approached using two modeling strategies:
1. A simplified model using two robust drone-derived predictors — canopy coverage and canopy height — both known to exhibit strong correlations with biomass.
2. A comprehensive model leveraging a broader suite of structural and textural features, capturing more detailed aspects of plant architecture and canopy structure.
Machine learning models developed with these approaches achieved high performance, with R² values reaching 0.89 and RMSE as low as 68.70, demonstrating strong predictive accuracy. Our current focus is on refining these models by minimizing time-dependent biases, enabling reliable biomass predictions on any given day regardless of growth stage or sampling date.
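For readers less familiar with the two metrics quoted above, the sketch below shows one common way to compute them from observed and predicted biomass values. The report does not state which definition of R² was used; this sketch assumes the coefficient of determination, 1 - SSE/SST. RMSE is expressed in the same units as the observations. The function names and the biomass values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


# Made-up ground-truth and model-predicted biomass values for four plots.
observed_biomass = [100.0, 200.0, 300.0, 400.0]
predicted_biomass = [110.0, 190.0, 310.0, 390.0]

model_rmse = rmse(observed_biomass, predicted_biomass)
model_r2 = r_squared(observed_biomass, predicted_biomass)
print(round(model_rmse, 2), round(model_r2, 3))  # 10.0 0.992
```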
Structural features derived from 3D models, like canopy volume, and the top and side canopy geometry, are also good indicators for biomass. We anticipate that the marker data will facilitate the development of a biomass prediction model. We expanded the ground truth biomass dataset in 2024, and plan to do so again in 2025.
Preliminary analysis has been conducted on yield estimation using image-derived features. The results show promising potential for predicting yield, with an R² value around 0.45. To improve prediction accuracy, efforts will focus on expanding the dataset and incorporating additional data such as weather variables, soil characteristics, and genetic marker information. The integration of expanded data sources and advanced modeling techniques is anticipated to significantly strengthen the robustness and predictive power of the models.
Status of 2024 season UAS data collection and processing
See attached report.
Objective 3. Discover structural variants (SVs) and test whether modelling structural variants improves genomic predictions for yield and seed composition.
Recent improvements in our SV calling pipeline have significantly enhanced the accuracy and resolution of SV detection. By integrating five distinct SV callers, we are now able to identify a greater number and diversity of structural variants with higher confidence. These improvements were implemented using the Williams 82 version 6 (a6) reference genome, a more complete and accurate assembly with no scaffolds compared to its predecessors. GWAS was conducted using both the SVs and SNPs identified from the a6 reference genome, yielding highly significant associations between structural variants and key agronomic traits. These traits include, but are not limited to, yield, stress resistance, and plant architecture. The use of a refined SV dataset has reduced background noise and increased the power to detect true associations. Gene Ontology (GO) analysis of significant GWAS markers located within gene models revealed that many of the associated SVs and SNPs influence genes known to control trait development and physiological processes. These functional annotations provide valuable biological insights into how structural variation contributes to phenotypic diversity in soybeans.
Comparative analyses demonstrate that the use of the Williams 82 a6 reference genome produces more meaningful GWAS results than the earlier version 4 (a4). GWAS performed with a4, which contains more unresolved scaffolds, tends to yield less consistent and potentially spurious associations. This underscores the importance of reference genome quality in structural variation-based GWAS.
Our findings highlight the critical role of high-quality reference genomes and comprehensive SV calling pipelines in conducting accurate and biologically relevant GWAS. The integration of multiple SV callers, coupled with the Williams 82 a6 reference, has led to improved detection of trait-associated structural variants. These results demonstrate that unresolved genomic regions can compromise GWAS outcomes, potentially leading to random or misleading associations.
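The report does not describe how calls from the five SV callers are combined. One common approach, shown here purely as an assumed sketch, is to cluster calls of the same type on the same chromosome by reciprocal overlap and keep clusters supported by a minimum number of distinct callers. The `SV` record, the caller names, the 50% overlap threshold, and the two-caller minimum are all hypothetical choices, not the pipeline's actual settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    chrom: str
    start: int
    end: int
    svtype: str  # e.g. "DEL", "DUP", "INV"
    caller: str


def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions; 0.0 if chromosome or type differ.

    Zero-length calls (such as insertions coded with start == end) never overlap
    here and would need separate, breakpoint-distance-based handling.
    """
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def consensus(calls, min_callers=2, min_overlap=0.5):
    """Greedily cluster calls against each cluster's first member, then keep
    clusters supported by at least min_callers distinct callers."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.svtype, c.start)):
        for cluster in clusters:
            if reciprocal_overlap(cluster[0], call) >= min_overlap:
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return [c for c in clusters if len({sv.caller for sv in c}) >= min_callers]


# Made-up calls: two callers agree on one deletion; the other calls are unsupported.
calls = [
    SV("Gm01", 1000, 2000, "DEL", "callerA"),
    SV("Gm01", 1050, 2010, "DEL", "callerB"),
    SV("Gm01", 1900, 5000, "DEL", "callerC"),
    SV("Gm02", 300, 800, "DUP", "callerA"),
]
supported = consensus(calls)
print(len(supported))  # 1
```

Raising `min_callers` trades sensitivity for confidence, which is one way a multi-caller pipeline can reduce spurious SVs before GWAS.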