2025
SOYGEN3: Building capacity to increase soybean genetic gain for yield through combining genomics-assisted breeding with characterization of future environments (year 3 of 3)
Category:
Sustainable Production
Keywords:
(none assigned)
Lead Principal Investigator:
Aaron Lorenz, University of Minnesota
Co-Principal Investigators:
Asheesh Singh, Iowa State University
William Schapaugh, Kansas State University
Dechun Wang, Michigan State University
Carrie Miranda, North Dakota State University
Katy M Rainey, Purdue University
Leah McHale, The Ohio State University
Eliana Monteverde Dominguez, University of Illinois
Matthew Hudson, University of Illinois at Urbana-Champaign
Nicolas Frederico Martin, University of Illinois at Urbana-Champaign
Andrew Scaboo, University of Missouri
George Graef, University of Nebraska
David Hyten, University of Nebraska at Lincoln
Rex Nelson, USDA/ARS-Iowa State University
+12 More
Project Code:
59010
Contributing Organization (Checkoff):
Institution Funded:
Brief Project Summary:
SOYGEN3 aims to enhance soybean breeding by integrating genomics, phenomics, and environmental modeling. In its final year, the project focuses on genomic selection tools, predictive models for future environments, and structural variant analysis. Key achievements include genotyping 4,000+ breeding lines, launching multi-location yield trials, and identifying 470,000+ structural variants. Expected outcomes include improved genomic tools, better yield stability, and superior soybean germplasm. The project enhances breeding efficiency, ensuring continued genetic gain and economic benefits for U.S. soybean producers. Leveraging existing funding, SOYGEN3 advances public breeding programs with cutting-edge genomic selection and environmental characterization strategies.
Unique Keywords:
#advanced methods in plant breeding, #cultivar-by-environment interactions, #genomic prediction, #yield
Information And Results
Project Summary

SOYGEN3: Building Capacity to Increase Soybean Genetic Gain for Future Environments

Project Overview

SOYGEN3 is a three-year initiative aimed at enhancing soybean genetic gain by integrating genomics-assisted breeding with environmental characterization. This initiative, in its third and final year, is designed to address the challenges of genotype-by-environment interactions, improve yield stability, and develop predictive models for future environments. The project involves a collaboration of multiple universities and institutions across the North Central region.

Project Justification and Rationale
Soybean is a critical crop with high global demand driven by its use in food, feed, and renewable fuel production. Since the 1940s, scientific breeding efforts have significantly improved yield, expanded production regions, and developed varieties with defensive traits. However, genotype-by-environment interactions complicate breeding efforts, requiring broader field testing across diverse environmental conditions. The SOYGEN initiative seeks to address these challenges by leveraging genomics, phenomics, and environmental data to enhance predictive breeding methodologies.

Key Objectives
1. Enhancing Genomics-Assisted Breeding:
o Develop and implement genomic selection tools in public breeding programs.
o Utilize genome-wide markers for genotyping advanced breeding lines.
o Integrate low-pass sequencing technology to generate cost-effective genomic data.
o Establish user-friendly software applications for genomic selection.

2. Predicting Cultivar Performance in Future Environments:
o Conduct multi-environment trials with 1,200 diverse breeding lines.
o Characterize environmental conditions and model genotype-by-environment interactions.
o Utilize UAV imagery to assess canopy development and growth rates.
o Develop predictive models connecting genotype, phenotype, and environmental data.

3. Structural Variant Analysis for Genomic Prediction:
o Sequence 41 SoyNAM founder lines to identify structural variants.
o Evaluate their influence on seed yield, composition, and adaptability.
o Improve genomic prediction models by incorporating structural variant data.

Progress to Date
Significant advancements have been made over the first two years, including:
• Genotyping and Data Management:
o Over 4,000 breeding lines genotyped with genome-wide markers.
o Development of a public genomic selection application integrated with SoyBase.
• Yield Trials and Environmental Modeling:
o Multi-location yield trials initiated with 1,200 elite lines.
o Collection of environmental data to refine predictive models.
• Structural Variant Discovery:
o Identification of over 470,000 structural variants using advanced sequencing tools.
o Initiation of pangenome sequencing for key soybean lines.

Expected Outcomes and Deliverables
• Publicly available genomic selection tools for soybean breeding programs.
• New knowledge on genotype-by-environment interactions and improved predictive models.
• Identification of structural variants impacting yield and seed composition.
• Development of superior soybean germplasm adapted to future environmental conditions.

Economic Impact
This project supports U.S. soybean competitiveness by ensuring continued genetic gain, improving yield stability, and equipping future breeders with cutting-edge genomic tools. The integration of advanced genomic and environmental modeling approaches will enhance breeding efficiency and profitability for soybean producers.

Budget Considerations
The project leverages existing breeding infrastructure and funding sources from multiple institutions. The final year will focus on optimizing resource use to complete genomic analysis, conduct large-scale trials, and refine predictive models for future breeding applications.

Project Objectives

1. Continue to develop and enhance genomics-assisted breeding resources and tools to facilitate routine application in public breeding programs.
2. Develop and test methods for predicting cultivar performance in future target environments through genomics-assisted breeding models, phenomics, and environment characterization.
3. Discover structural variants and test whether modelling structural variants improves genomic predictions for yield and seed composition.

Project Deliverables

The following will be delivered upon completion of this three-year project:
1. Publicly available resources and tools for soybean breeders to implement cost-effective genomic prediction in their programs.
2. Publicly available knowledge on genetic control of genotype-by-environment interaction in soybean, and improved models for prediction of breeding line performance in new environments. Knowledge will be made available through open-access publications, presentations at scientific meetings, and presentations to the seed industry.
3. Identification of important structural variants that control seed yield and composition, and publication of knowledge on any benefit of explicitly modeling structural variants for predicting breeding line performance.
4. Enhanced germplasm and superior varieties, better adapted to future environmental conditions, developed through adoption of genomics-assisted breeding techniques.

Progress Of Work

Updated April 28, 2025:
Objective 1. Continue to develop and enhance genomics-assisted breeding resources and tools to facilitate routine application in public breeding programs.

Objective 1 can be divided into five sub-objectives to report specific, significant progress during these past six months.
Sub-obj 1: Continue to genotype with genome-wide and trait-targeted markers all new breeding lines entered in the Northern Uniform Soybean Tests

As reported in the last progress report, 511 new breeding lines submitted to regional trials were grown. Tissue was collected and DNA was extracted. We sent DNA of the 511 breeding lines to David Hyten at UNL for genotyping via skim sequencing. Previously, we sent samples to a private service vendor, but because of dramatic price increases we decided to work with co-PI Hyten, who could provide very similar data for 70% of the cost. This change delayed data delivery, but we are confident the data will be in hand by May. We have now genotyped over 4,000 advanced breeding lines entered into the public regional trials, creating a resource that helps current and future soybean breeders and geneticists connect genotype to phenotype and develop genomics-assisted breeding resources.

Data from the 2024 NUST trials were collected this past fall. We formatted the data and sent them to Rex Nelson at SoyBase, where they will be uploaded soon.

A manuscript on this work has been submitted to the scientific journal Crop Science. It was recently accepted pending revision. We are currently editing the manuscript for final acceptance.

1) Wartha, C.A., B. Campbell, V. Ramasubramanian, L. Nice, ….19 authors….A.J. Lorenz*. 2025. Genomic analysis and predictive modeling in the Northern Uniform Soybean Tests. Crop Science (Accepted pending revision).

Sub-obj 2: Enable individual public breeding programs to test and use genomic prediction
Originally, this project explicitly funded the integration of genomic prediction into the public soybean breeding pipelines to expedite yield improvement. Because of budget cuts, the funding for this part of the project was removed. Nevertheless, this project has continued to instigate and enable several public programs to start using genomic prediction routinely. Below are some highlights from individual program reports that are part of the SOYGEN initiative.

University of Nebraska
Aiming to generate new recombinant populations with high yield and resistance to biotic stress, the UNL soybean breeding program conducted a genomic selection analysis following the 2024 field trials, using phenotypic datasets from multiple years and locations to train the prediction model. The dataset comprised UNL lines from elite populations designed to carry resistance alleles to soybean cyst nematode (Heterodera glycines) at the rhg-1a//rhg-1a, Rhg-2//Rhg-2, and Rhg-4//Rhg-4 genes. These lines have been evaluated in field trials since 2022. In addition, the Northern Uniform Soybean Tests (NUST) yield datasets from 2012, 2018, 2019, and 2020 were added to train the model. Lines from both datasets have been extensively tested in maturity group 2 and maturity group 3 locations in Nebraska and surrounding states. UNL lines were genotyped with molecular inversion probes (MIPs), and the NUST lines were genotyped with a 6K SNP chip. Genotypes were imputed and filtered accordingly. For the analyses, yield values were adjusted for the experimental design, and the resulting best linear unbiased estimates (BLUEs) were used as input for the genomic selection analyses. These analyses accounted for genotype-by-environment interaction (GEI), considering that complex, non-linear interactions between lines and environments regularly occur in soybean breeding. The GS4PB R Shiny App (previously known as the SOYGEN2 R Shiny App) and its code were used to run the analyses.
Seven lines were selected by the breeder following the analyses. These lines ranked highly for genomic estimated breeding value (GEBV) in the genomic prediction analyses. In addition, they contain at least one allele of interest for the three genes mentioned above, and some of these lines are homozygous at two loci of significant interest (rhg-1a//rhg-1a, Rhg-4//Rhg-4).
These seven lines were selected as parents for eight new combinations in the UNL Winter Nursery Crossing Block project, conducted in Puerto Rico between January and April 2025. F1 plants of these populations are expected to be planted and genotyped in Nebraska in June 2025. Going forward, the new populations will be advanced, and their selected progenies will be tested in multi-environment field trials to identify lines that are superior for yield and resistance to soybean cyst nematode. Additionally, using SOYGEN2 yield datasets in both regular and sparse genomic selection designs will enable the UNL Soybean Breeding Program to efficiently select superior lines.
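
To make the ranking step above concrete, here is a minimal sketch of how GEBVs can be computed from marker genotypes and BLUEs by ridge regression. It is not the GS4PB code and it ignores genotype-by-environment interaction. The function names, the toy genotypes and yields, the candidate labels, and the shrinkage value `lam` are all assumptions made for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_marker_effects(genotypes, blues, lam):
    """Estimate marker effects by ridge regression on centered genotypes and BLUEs."""
    n, p = len(genotypes), len(genotypes[0])
    col_means = [sum(g[j] for g in genotypes) / n for j in range(p)]
    z = [[g[j] - col_means[j] for j in range(p)] for g in genotypes]
    y_mean = sum(blues) / n
    y = [v - y_mean for v in blues]
    # Normal equations: (Z'Z + lam * I) beta = Z'y
    ztz = [[sum(z[i][a] * z[i][b] for i in range(n)) + (lam if a == b else 0.0)
            for b in range(p)] for a in range(p)]
    zty = [sum(z[i][a] * y[i] for i in range(n)) for a in range(p)]
    return solve(ztz, zty), col_means, y_mean


def gebv(genotype, effects, col_means, y_mean):
    """Predicted value of one line: training mean plus the sum of its marker effects."""
    return y_mean + sum((g - m) * e for g, m, e in zip(genotype, col_means, effects))


# Toy training set (made-up values): rows are lines, columns are markers coded 0 or 2.
train_genotypes = [[2, 0, 2], [2, 2, 0], [0, 0, 2], [0, 2, 0], [2, 2, 2], [0, 0, 0]]
train_blues = [62.0, 60.0, 50.0, 48.0, 63.0, 47.0]
effects, col_means, y_mean = ridge_marker_effects(train_genotypes, train_blues, lam=1.0)

# Hypothetical untested candidates, ranked from highest to lowest GEBV.
candidates = {"cand_A": [2, 2, 2], "cand_B": [2, 0, 0],
              "cand_C": [0, 2, 2], "cand_D": [0, 0, 0]}
scores = {name: gebv(g, effects, col_means, y_mean) for name, g in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In practice the shrinkage parameter would be estimated from the data (for example through variance components) rather than fixed by hand as it is here.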

University of Missouri
Andrew Scaboo’s lab is examining the data collected in the SOYGEN2 genomic selection experiment. This experiment compared genomic prediction, phenotypic selection, and random selection at the University of Minnesota, North Dakota State University, University of Illinois, and University of Missouri. The selection treatments we applied in the original experiment were not as successful as we had hoped. We are now analyzing the data to learn why the genomic selection treatment underperformed, and how we can better understand and use it in the future. Because this multi-institutional dataset is large and complex, we are first developing the analysis framework and treatments using the Missouri data only. Several initial analyses were described in the last progress report. During this last reporting period, we extensively evaluated the effect of genotype imputation. Figure 1 shows that the genotype imputation methods implemented in the “GS4PB” application improve prediction accuracy overall. This indicates that these methods can improve the cost effectiveness of genomic prediction for driving genetic gain in yield.

Figure 1. (See attached document)

University of Minnesota
As described in the last report, the UMN soybean breeding program has refined its GS pipeline and tested it extensively on the UMN Preliminary Yield Trials (PYT) data. PYT 2023 progeny lines were assayed using a 1K low-density (LD) genotyping assay, and the parents of the PYT23 lines from the crossing block were assayed using a low-pass sequencing platform to generate high-density (HD) variant data. The 50K SoySNP Chip subset of the HD data set was used as the parental reference panel to impute the 1K LD set to the 50K HD set (~30K SNPs after QC). We used the imputed data to make predictions with genomic prediction models that include GxE interaction effects. In the summer of 2024, we planted a trial including lines selected using genomic prediction and phenotypic selection. The trial was successfully planted at six Minnesota locations, and good data were collected at harvest from every location.
During the past reporting period, we analyzed the data to compare phenotypic selection with genomic selection. For each treatment, we selected both high- and low-yielding lines. Selecting low-yielding lines is important because it tests our ability to identify poorly yielding lines, which ensures we are not devoting precious field phenotyping resources to lines that are predicted to have low yield.
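
One simple way to summarize such a comparison is the difference in mean observed yield between the lines a method selected as "high" and the lines it selected as "low"; a larger contrast suggests the method separates good and poor lines more reliably. The sketch below assumes a hypothetical record layout (`method`, `group`, `yield`), and the yield values are arbitrary placeholders, not trial results.

```python
from statistics import mean


def high_low_contrast(records, method):
    """Mean observed yield of the lines selected as 'high' minus those selected as 'low'."""
    high = [r["yield"] for r in records if r["method"] == method and r["group"] == "high"]
    low = [r["yield"] for r in records if r["method"] == method and r["group"] == "low"]
    return mean(high) - mean(low)


# Placeholder records; the field names and values are assumptions for illustration only.
records = [
    {"method": "genomic", "group": "high", "yield": 60.0},
    {"method": "genomic", "group": "high", "yield": 62.0},
    {"method": "genomic", "group": "low", "yield": 50.0},
    {"method": "genomic", "group": "low", "yield": 52.0},
    {"method": "phenotypic", "group": "high", "yield": 59.0},
    {"method": "phenotypic", "group": "high", "yield": 61.0},
    {"method": "phenotypic", "group": "low", "yield": 53.0},
    {"method": "phenotypic", "group": "low", "yield": 55.0},
]

contrasts = {m: high_low_contrast(records, m) for m in ("genomic", "phenotypic")}
print(contrasts)
```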

Sub-obj 3: Development of a genomic prediction R-Shiny app for easy implementation of GS for breeders.
The first version of this application is now complete. We have submitted an article describing it to the peer-reviewed journal Plant Genome.

1) Ramasubramanian, V., C. Wartha, L. Singh, P. Vitale, S. Ru, and A.J. Lorenz*. 2025. GS4PB: An R/Shiny Application to Facilitate a Genomic Selection Pipeline for Plant Breeding. Plant Genome (Submitted).

This manuscript describes the development of the application, the components, how it can be used to execute genomic selection for plant breeding, and how it can be accessed. This application is freely available to the public, and will enable plant breeders to implement genomic selection.

Objective 2. Develop and test methods for predicting cultivar performance in future target environments through genomics-assisted breeding models, phenomics, and environmental characterization.

Three main tasks are associated with this objective.
1. Conduct a large, multi-institutional, multi-environment trial to create a “genotype-by-environment interaction” dataset for soybean that soybean researchers can use to identify methods to improve genomic prediction methodology for yield.
2. Define environmental variables driving genotype-by-environment interactions that can be used to enhance prediction models.
3. Quantify biomass non-destructively so that crop growth can be evaluated in relation to certain environmental stressors.

Progress on each task:

1. For this objective, we are conducting a multi-environment, multi-institutional coordinated performance trial of 1200 diverse breeding lines. Each breeding line will be phenotyped for several agronomic and phenological traits, and each will be genotyped using low pass resequencing technologies. Detailed environmental data for each growing location in each year will be collected and analyzed. The ultimate goal is to better predict the interactions between the environment and genotype. If successful, we will leverage genomic data, phenotype data, and environmental data to predict how new breeding lines may perform in environments that a producer is most likely to encounter.
Last summer, these breeding lines were successfully grown and phenotyped for yield at 21 locations. Data have been delivered to the University of Minnesota in a centralized database. Plans were laid for scanning all 12,400 samples with NIR to measure protein and oil. We worked with Perkin-Elmer to develop a protocol to standardize instruments across universities. Standardization samples were collected across universities to represent diversity in germplasm and growing conditions. From UMN, samples were sent to collaborators for scanning on their instruments to create the standardization file. We anticipate that the samples will be scanned for protein and oil content after planting, when time allows.

Another major activity was the design and packaging of the 2025 yield trials. All seeds have been delivered to 2025 field locations, and 90% of the packaging is complete. Field designs have been completed and sent to cooperators. All is on schedule for a successful 2025 planting.

Genotype data were returned from the Hyten lab on April 11. The genotype data include 15 million molecular markers (SNPs) imputed from skim sequencing data, with a 50K subset available. The SNPs were mapped to the latest Williams 82 genome version, Wm82a6v1. Analysis of these data will proceed over the coming months.

2. We completed the development of the Seasonal Characterization Engine (SCE). The SCE is a specialized tool integrated within the R environment, designed to streamline the analysis of trial data using the Agricultural Production Systems Simulator (APSIM) model. Users begin by uploading trial data specifying key parameters such as location, latitude, longitude, and maturity. Predefined input files, such as the soybean seed composition test, facilitate accurate data setup.

After data upload, users select a crop model template (e.g., soybean or maize) and specify maturity handling methods. Users then choose appropriate weather and soil databases, considering geographic coverage to avoid analysis errors.

The APSIM model generates environmental variables specific to different crop growing stages, providing a detailed characterization of conditions throughout the growing season. Upon initiating the analysis, the SCE provides real-time process updates via the R console.

Analytical results are available through multiple visualization tools:
1. Results Viewer: Offers box plot visualizations of selected variables, downloadable individually or collectively.
2. Heat Maps: Enables detailed inspection of seasonal and environmental covariates by growth stage, comparing specific parameters within consistent genetic maturity groups.
3. Trial Similarities: Presents correlation heat maps illustrating seasonal profile similarities among trial sites, complemented by dendrograms to clarify site relationships. Outputs are readily exportable for further analysis.
4. Thermal Time and Precipitation: Summarizes accumulated thermal time and precipitation over growing seasons, offering comparative analyses between and within sites across multiple years. These insights support understanding climate variability and trends.
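
Two of the summaries listed above can be sketched in a few lines: accumulated thermal time and a correlation-based similarity between trial sites. This is an illustration only, not the SCE or APSIM code. APSIM computes thermal time with its own crop-specific method, whereas the sketch uses the simple daily-average formula with an assumed base temperature of 10 °C, and the site names and values are hypothetical.

```python
from math import sqrt


def thermal_time(tmin, tmax, t_base=10.0):
    """Accumulated degree-days from daily min/max temperatures (simple averaging method)."""
    return sum(max(0.0, (lo + hi) / 2.0 - t_base) for lo, hi in zip(tmin, tmax))


def pearson(x, y):
    """Pearson correlation between two equal-length seasonal profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def similarity_matrix(profiles):
    """Pairwise correlations among sites; the values a correlation heat map would display."""
    sites = sorted(profiles)
    return {(a, b): pearson(profiles[a], profiles[b]) for a in sites for b in sites}


# Two days of made-up temperatures (degrees C).
degree_days = thermal_time(tmin=[10.0, 12.0], tmax=[20.0, 24.0])

# Hypothetical seasonal covariate profiles (one value per growth stage) for three sites.
profiles = {
    "site_1": [1.0, 2.0, 3.0],
    "site_2": [2.0, 4.0, 6.0],
    "site_3": [3.0, 2.0, 1.0],
}
similarities = similarity_matrix(profiles)
print(degree_days, similarities[("site_1", "site_2")], similarities[("site_1", "site_3")])
```

A dendrogram like the one described above would then be built by hierarchical clustering on a distance derived from these correlations, for example one minus the correlation.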

3. During the 2024 season, the Rainey Lab at Purdue University focused on phenotyping soybean using high-resolution UAS imagery from 10 SOYGEN sites to extract canopy traits. A data management system has been set up at Purdue to give SOYGEN collaborators secure storage of UAS image data, efficient data sharing, and access to processed plot-level outputs. To date, key canopy traits, including spectral, textural, and structural features, have been extracted from 6,320 two-row plots covering about 1,200 genotypes across the V3 to R5 growth stages. The processed plot-level data include labeled and curated RGB image clips, binary masks, and 3D point cloud data. This growing dataset provides a valuable resource for soybean breeders, supporting the development of advanced AI/ML algorithms and trait derivation.

For biomass prediction, we established ‘Calibration Plots’ — 8-row plots planted adjacent to the SOYGEN plots to enable nondestructive biomass sampling. A representative subset of 36 lines from the SOYGEN3 GEI panel, with three replications, was used in this experiment. Ground biomass sampling was performed on rows 2, 3, and 4, while rows 6 and 7 were used for UAS-based data collection and harvest yield measurements. UAS flights were conducted within one day before or after each ground sampling event, and from these flights, we extracted several key drone-derived metrics, including canopy coverage, canopy height, green leaf index, and an array of structural and textural features.
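
As an illustration of two of the drone-derived metrics named above, the sketch below computes canopy coverage from a plot-level binary mask and the green leaf index (GLI) from RGB values, using the commonly cited definition GLI = (2G - R - B) / (2G + R + B). This is not the Purdue processing pipeline; the function names and the tiny mask are assumptions.

```python
def canopy_coverage(mask):
    """Fraction of plot pixels classified as canopy in a binary mask (1 = canopy)."""
    total = sum(len(row) for row in mask)
    canopy = sum(1 for row in mask for px in row if px)
    return canopy / total


def green_leaf_index(r, g, b):
    """GLI for one pixel or for plot-mean band values; returns 0.0 for an all-black pixel."""
    denom = 2 * g + r + b
    return 0.0 if denom == 0 else (2 * g - r - b) / denom


# Made-up 2 x 4 mask: 3 of 8 pixels are canopy.
mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
coverage = canopy_coverage(mask)
gli = green_leaf_index(r=50, g=100, b=50)
print(coverage, gli)
```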

In addition to the calibration dataset, we incorporated our previous ‘Public Biomass’ experiments (2021–2023), which utilized public soybean breeding germplasm from the north-central U.S. region.

Biomass prediction was approached using two modeling strategies:
1. A simplified model using two robust drone-derived predictors — canopy coverage and canopy height — both known to exhibit strong correlations with biomass.
2. A comprehensive model leveraging a broader suite of structural and textural features, capturing more detailed aspects of plant architecture and canopy structure.

Machine learning models developed with these approaches achieved high performance, with R² values reaching 0.89 and RMSE as low as 68.70, demonstrating strong predictive accuracy. Our current focus is on refining these models by minimizing time-dependent biases, enabling reliable biomass predictions on any given day regardless of growth stage or sampling date.
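
For readers less familiar with the two accuracy metrics quoted above, the following sketch shows one common way to compute them. It assumes R² is defined as one minus the ratio of residual to total sum of squares (some reports use the squared correlation instead). RMSE carries the units of the observed biomass, which are not restated here. The observed and predicted values are made up.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root mean squared error, in the same units as the observations."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up biomass values for three plots.
observed = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 310.0]
print(rmse(observed, predicted), r_squared(observed, predicted))
```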

Structural features derived from 3D models, like canopy volume, and the top and side canopy geometry, are also good indicators for biomass. We anticipate that the marker data will facilitate the development of a biomass prediction model. We expanded the ground truth biomass dataset in 2024, and plan to do so again in 2025.
Preliminary analysis has been conducted on yield estimation using image-derived features. The results show promising potential for predicting yield, with an R² value around 0.45. To improve prediction accuracy, efforts will focus on expanding the dataset and incorporating additional data such as weather variables, soil characteristics, and genetic marker information. The integration of expanded data sources and advanced modeling techniques is anticipated to significantly strengthen the robustness and predictive power of the models.

Status of 2024 season UAS data collection and processing
See attached report.

Objective 3. Discover structural variants (SVs) and test whether modelling structural variants improves genomic predictions for yield and seed composition.

Recent improvements in our SV calling pipeline have significantly enhanced the accuracy and resolution of SV detection. By integrating five distinct SV callers, we are now able to identify a greater number and diversity of structural variants with higher confidence. These improvements were implemented using the Williams 82 version 6 (a6) reference genome, a more complete and accurate assembly than its predecessors, with no unresolved scaffolds. GWAS was conducted using both the SVs and SNPs identified from the a6 reference genome, yielding highly significant associations between structural variants and key agronomic traits. These traits include, but are not limited to, yield, stress resistance, and plant architecture. The use of a refined SV dataset has reduced background noise and increased the power to detect true associations. Gene Ontology (GO) analysis of significant GWAS markers located within gene models revealed that many of the associated SVs and SNPs influence genes known to control trait development and physiological processes. These functional annotations provide valuable biological insights into how structural variation contributes to phenotypic diversity in soybean.
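
This report does not describe how calls from the five SV callers are combined. One common approach, shown here purely as an assumed illustration, is to cluster calls of the same type that share a minimum reciprocal overlap and to keep only clusters supported by at least a set number of distinct callers. The caller labels, thresholds, and coordinates below are hypothetical.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions between intervals a and b (0.0 if disjoint)."""
    overlap = min(a["end"], b["end"]) - max(a["start"], b["start"])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a["end"] - a["start"]), overlap / (b["end"] - b["start"]))


def consensus_svs(calls, min_callers=2, min_overlap=0.5):
    """Greedily cluster same-type calls and keep clusters seen by enough distinct callers."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c["chrom"], c["type"], c["start"])):
        for cluster in clusters:
            rep = cluster[0]  # the first call in a cluster acts as its representative
            if (rep["chrom"] == call["chrom"] and rep["type"] == call["type"]
                    and reciprocal_overlap(rep, call) >= min_overlap):
                cluster.append(call)
                break
        else:
            clusters.append([call])
    kept = []
    for cluster in clusters:
        callers = sorted({c["caller"] for c in cluster})
        if len(callers) >= min_callers:
            kept.append({
                "chrom": cluster[0]["chrom"],
                "type": cluster[0]["type"],
                "start": min(c["start"] for c in cluster),
                "end": max(c["end"] for c in cluster),
                "callers": callers,
            })
    return kept


# Hypothetical calls: two callers agree on one deletion; the other calls are unsupported.
calls = [
    {"chrom": "chr1", "start": 1000, "end": 2000, "type": "DEL", "caller": "caller_A"},
    {"chrom": "chr1", "start": 1050, "end": 2050, "type": "DEL", "caller": "caller_B"},
    {"chrom": "chr1", "start": 5000, "end": 5200, "type": "DEL", "caller": "caller_C"},
    {"chrom": "chr2", "start": 1000, "end": 2000, "type": "DUP", "caller": "caller_A"},
]
consensus = consensus_svs(calls)
print(consensus)
```

Raising `min_callers` trades sensitivity for confidence: fewer SVs are retained, but each is supported by more independent evidence.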

Comparative analyses demonstrate that the use of the Williams 82 a6 reference genome produces more meaningful GWAS results than the earlier version 4 (a4). GWAS performed with a4, which contains more unresolved scaffolds, tends to yield less consistent and potentially spurious associations. This underscores the importance of reference genome quality in structural variation-based GWAS.

Our findings highlight the critical role of high-quality reference genomes and comprehensive SV calling pipelines in conducting accurate and biologically relevant GWAS. The integration of multiple SV callers, coupled with the Williams 82 a6 reference, has led to improved detection of trait-associated structural variants. These results demonstrate that unresolved genomic regions can compromise GWAS outcomes, potentially leading to random or misleading associations.


Final Project Results

Benefit To Soybean Farmers

Soybean breeding has a large impact on the efficiency and profitability of agriculture through the development of high-yielding new varieties with critical defensive traits and enhanced seed composition. Ensuring that such programs (both private and public) are using state-of-the-art technologies to drive genetic gain in the face of changing environments and narrowing genetic diversity will contribute to the continual development and release of ever better varieties. Additionally, these efforts help to educate future agricultural scientists and soybean breeders who are well prepared to enter the seed industry and develop impactful future products for farmers, keeping the North Central region competitive in soybean production.
