Project Details: SOYGEN3: Building capacity to increase soybean genetic gain for yield through combining genomics-assisted breeding with characterization of future environments (year 3 of 3) (2025)

Contributor/Checkoff:

North Central Soybean Research Program

Category:

Sustainable Production

Keywords:

(none assigned)

Parent Project:

Increasing the rate of genetic gain for yield in soybean breeding programs

Lead Principal Investigator:

Aaron Lorenz, University of Minnesota

Co-Principal Investigators:

Asheesh Singh, Iowa State University
William Schapaugh, Kansas State University
Dechun Wang, Michigan State University
Carrie Miranda, North Dakota State University
Katy M Rainey, Purdue University
Leah McHale, The Ohio State University
Eliana Monteverde Dominguez, University of Illinois
Matthew Hudson, University of Illinois at Urbana-Champaign
Nicolas Frederico Martin, University of Illinois at Urbana-Champaign
Andrew Scaboo, University of Missouri
George Graef, University of Nebraska
David Hyten, University of Nebraska at Lincoln
Rex Nelson, USDA/ARS-Iowa State University

Project Code:

59010

Contributing Organization (Checkoff):

North Central Soybean Research Program

$715,000

Institution Funded:

University of Minnesota

$715,000

Brief Project Summary:

SOYGEN3 aims to enhance soybean breeding by integrating genomics, phenomics, and environmental modeling. In its final year, the project focuses on genomic selection tools, predictive models for future environments, and structural variant analysis. Key achievements include genotyping 4,000+ breeding lines, launching multi-location yield trials, and identifying 470,000+ structural variants. Expected outcomes include improved genomic tools, better yield stability, and superior soybean germplasm. The project enhances breeding efficiency, ensuring continued genetic gain and economic benefits for U.S. soybean producers. Leveraging existing funding, SOYGEN3 advances public breeding programs with cutting-edge genomic selection and environmental characterization strategies.

Unique Keywords:
#advanced methods in plant breeding, #cultivar-by-environment interactions, #genomic prediction, #yield

Information And Results

Project Summary

SOYGEN3: Building Capacity to Increase Soybean Genetic Gain for Future Environments

Project Overview

SOYGEN3 is a three-year initiative aimed at enhancing soybean genetic gain by integrating genomics-assisted breeding with environmental characterization. This initiative, in its third and final year, is designed to address the challenges of genotype-by-environment interactions, improve yield stability, and develop predictive models for future environments. The project involves a collaboration of multiple universities and institutions across the North Central region.

Project Justification and Rationale
Soybean is a critical crop with high global demand driven by its use in food, feed, and renewable fuel production. Since the 1940s, scientific breeding efforts have significantly improved yield, expanded production regions, and developed varieties with defensive traits. However, genotype-by-environment interactions complicate breeding efforts, requiring broader field testing across diverse environmental conditions. The SOYGEN initiative seeks to address these challenges by leveraging genomics, phenomics, and environmental data to enhance predictive breeding methodologies.

Key Objectives
1. Enhancing Genomics-Assisted Breeding:
o Develop and implement genomic selection tools in public breeding programs.
o Utilize genome-wide markers for genotyping advanced breeding lines.
o Integrate low-pass sequencing technology to generate cost-effective genomic data.
o Establish user-friendly software applications for genomic selection.

2. Predicting Cultivar Performance in Future Environments:
o Conduct multi-environment trials with 1,200 diverse breeding lines.
o Characterize environmental conditions and model genotype-by-environment interactions.
o Utilize UAV imagery to assess canopy development and growth rates.
o Develop predictive models connecting genotype, phenotype, and environmental data.

3. Structural Variant Analysis for Genomic Prediction:
o Sequence 41 SoyNAM founder lines to identify structural variants.
o Evaluate their influence on seed yield, composition, and adaptability.
o Improve genomic prediction models by incorporating structural variant data.

Progress to Date
Significant advancements have been made over the first two years, including:
• Genotyping and Data Management:
o Over 4,000 breeding lines genotyped with genome-wide markers.
o Development of a public genomic selection application integrated with SoyBase.
• Yield Trials and Environmental Modeling:
o Multi-location yield trials initiated with 1,200 elite lines.
o Collection of environmental data to refine predictive models.
• Structural Variant Discovery:
o Identification of over 470,000 structural variants using advanced sequencing tools.
o Initiation of pangenome sequencing for key soybean lines.

Expected Outcomes and Deliverables
• Publicly available genomic selection tools for soybean breeding programs.
• New knowledge on genotype-by-environment interactions and improved predictive models.
• Identification of structural variants impacting yield and seed composition.
• Development of superior soybean germplasm adapted to future environmental conditions.

Economic Impact
This project supports U.S. soybean competitiveness by ensuring continued genetic gain, improving yield stability, and equipping future breeders with cutting-edge genomic tools. The integration of advanced genomic and environmental modeling approaches will enhance breeding efficiency and profitability for soybean producers.

Budget Considerations
The project leverages existing breeding infrastructure and funding sources from multiple institutions. The final year will focus on optimizing resource use to complete genomic analysis, conduct large-scale trials, and refine predictive models for future breeding applications.

Project Objectives

1. Continue to develop and enhance genomics-assisted breeding resources and tools to facilitate routine application in public breeding programs.
2. Develop and test methods for predicting cultivar performance in future target environments through genomics-assisted breeding models, phenomics, and environment characterization.
3. Discover structural variants and test whether modelling structural variants improves genomic predictions for yield and seed composition.

Project Deliverables

The following will be delivered upon completion of this three-year project:
1. Publicly available resources and tools for soybean breeders to implement cost-effective genomic prediction in their programs.
2. Publicly available knowledge on genetic control of genotype-by-environment interaction in soybean, and improved models for prediction of breeding line performance in new environments. Knowledge will be made available through open-access publications, presentations at scientific meetings, and presentations to the seed industry.
3. Identification of important structural variants that control seed yield and composition, and publication of knowledge on any benefit into explicitly modeling structural variants for predicting breeding line performance.
4. Enhanced germplasm and superior varieties developed through adoptions of genomics-assisted breeding techniques better adapted to future environmental conditions.

Progress Of Work

Updated April 28, 2025:
Objective 1. Continue to develop and enhance genomics-assisted breeding resources and tools to facilitate routine application in public breeding programs.

Objective 1 can be divided into five sub-objectives to report specific, significant progress during these past six months.
Sub-obj 1: Continue to genotype with genome-wide and trait-targeted markers all new breeding lines entered in the Northern Uniform Soybean Tests

As reported in the last progress report, 511 new breeding lines submitted to regional trials were grown. Tissue was collected and DNA was extracted. We sent DNA of the 511 breeding lines to David Hyten at UNL for genotyping via skim sequencing. Previously we sent samples to a private service vendor, but because of dramatic price increases, we decided to work with co-PI Hyten, who could provide very similar data for 70% of the cost. This delayed data delivery, but we are confident the data will be in hand by May. We have now genotyped over 4,000 advanced breeding lines entered into the public regional trials, creating a resource that helps current and future soybean breeders and geneticists connect genotype to phenotype and develop genomics-assisted breeding resources.

Data from the 2024 NUST trials was collected this past fall. We formatted the data and sent it to Rex Nelson at SoyBase, where it will be uploaded soon.

A manuscript on this work has been submitted to the scientific journal Crop Science. It was recently accepted pending revision. We are currently editing the manuscript for final acceptance.

1) Wartha, C.A., B. Campbell, V. Ramasubramanian, L. Nice, ….19 authors….A.J. Lorenz*. 2025. Genomic analysis and predictive modeling in the Northern Uniform Soybean Tests. Crop Science (Accepted pending revision).

Sub-obj 2: Enable individual public breeding programs to test and use genomic prediction
Originally, this project explicitly funded the integration of genomic prediction into the public soybean breeding pipelines to expedite yield improvement. Because of budget cuts, the funding for this part of the project was removed. Nevertheless, this project has continued to instigate and enable several public programs to start using genomic prediction routinely. Below are some highlights from individual program reports that are part of the SOYGEN initiative.

University of Nebraska
Aiming to generate new recombinant populations with high yield and resistance to biotic stress, the UNL soybean breeding program conducted a genomic selection analysis following the 2024 field trials, using phenotypic datasets from multiple years and locations to train the prediction model. The first dataset consisted of UNL lines from elite populations designed to carry resistance alleles to the soybean cyst nematode (Heterodera glycines) at the rhg-1a//rhg-1a, Rhg-2//Rhg-2, and Rhg-4//Rhg-4 genes. These lines have been evaluated in field trials since 2022. In addition, the Northern Uniform Soybean Tests (NUST) yield datasets from 2012, 2018, 2019, and 2020 were added to train the model. Lines from both datasets have been extensively tested in maturity group 2 and maturity group 3 locations in Nebraska and surrounding states. UNL lines were genotyped with molecular inversion probes (MIP), and the NUST lines were genotyped with the 6K SNP chip. Genotypes were imputed and filtered accordingly. For the analyses, yield values were adjusted for the experimental design, and the best linear unbiased estimates (BLUEs) were used as input for the genomic selection analyses. These analyses accounted for genotype-by-environment interaction (GEI), considering that complex, non-linear interactions between lines and environments regularly occur in the soybean breeding context. The GS4PB R Shiny App (previously known as the SOYGEN2 R Shiny App) and its code were used to run the analyses.
Seven lines were selected by the breeder following the analyses. These lines were highly ranked based on their GEBV from the genomic prediction analyses. In addition, they contain at least one allele of interest for the three mentioned genes, and some of these lines are homozygous for two loci of significant interest (rhg-1a//rhg-1a, Rhg-4//Rhg-4).
These seven lines were selected as parents for eight new combinations in the UNL Winter Nursery Crossing Block project, conducted in Puerto Rico between January and April 2025. F1 plants of these populations are expected to be planted and genotyped in Nebraska in June 2025. Going forward, the new populations will be advanced, and their selected progenies will be tested in multi-environment field trials to identify superior lines for yield and resistance to soybean cyst nematode. Additionally, using SOYGEN2 yield datasets in both regular and sparse genomic selection designs will enable the UNL Soybean Breeding Program to efficiently select superior lines.
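
To illustrate the general idea of ranking candidate lines by GEBV from adjusted yield values, here is a minimal sketch of ridge-regression marker prediction. It is not the GS4PB code and does not model GEI. The function name `ridge_gebv`, the fixed `shrinkage` penalty, and all data are hypothetical and simulated for illustration only.

```python
import numpy as np

def ridge_gebv(train_markers, train_blues, candidate_markers, shrinkage=1.0):
    """Predict GEBVs for candidate lines with ridge regression on markers.

    `shrinkage` is a hypothetical fixed penalty; a real analysis would
    typically estimate it from variance components.
    """
    X = np.asarray(train_markers, dtype=float)
    y = np.asarray(train_blues, dtype=float)
    marker_means = X.mean(axis=0)
    Xc = X - marker_means                      # center marker scores
    mu = y.mean()                              # overall mean of the BLUEs
    # Solve (Xc'Xc + lambda * I) beta = Xc'(y - mu) for marker effects
    lhs = Xc.T @ Xc + shrinkage * np.eye(X.shape[1])
    beta = np.linalg.solve(lhs, Xc.T @ (y - mu))
    Zc = np.asarray(candidate_markers, dtype=float) - marker_means
    return mu + Zc @ beta

# Simulated data (assumption): markers coded 0/1/2, yield on an arbitrary scale
rng = np.random.default_rng(0)
n_train, n_cand, n_markers = 200, 50, 300
true_effects = rng.normal(0, 0.1, n_markers)
train_m = rng.integers(0, 3, size=(n_train, n_markers))
cand_m = rng.integers(0, 3, size=(n_cand, n_markers))
train_y = 60 + train_m @ true_effects + rng.normal(0, 0.5, n_train)
cand_true = 60 + cand_m @ true_effects

gebv = ridge_gebv(train_m, train_y, cand_m, shrinkage=10.0)
top7 = np.argsort(gebv)[::-1][:7]              # indices of the seven highest GEBVs
accuracy = np.corrcoef(gebv, cand_true)[0, 1]  # correlation with simulated truth
```

In a breeding context, the top-ranked indices would be cross-checked against other criteria, such as the resistance alleles carried by each line, before parents are chosen.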

University of Missouri
Andrew Scaboo’s lab is analyzing the data collected in the SOYGEN2 genomic selection experiment. This experiment compared genomic prediction, phenotypic selection, and random selection at the University of Minnesota, North Dakota State University, University of Illinois, and University of Missouri. The selection treatments applied in the original experiment were not as successful as we had hoped. We are currently analyzing the data to learn why the genomic selection treatment underperformed and how we can better understand and use it in the future. Because this multi-institutional dataset is large and complex, we are first developing the analysis framework and treatments using the Missouri data only. Several initial analyses were described in the last progress report. During this reporting period, we extensively evaluated the effect of genotype imputation. Figure 1 shows that the genotype imputation methods implemented in the “GS4PB” application improve prediction accuracy overall. This indicates that these methods can improve the cost effectiveness of genomic prediction for driving genetic gain in yield.

Figure 1. (See attached document)

University of Minnesota
As described in the last report, the UMN soybean breeding program has refined its GS pipeline and tested it extensively on the UMN Preliminary Yield Trials (PYT) data. PYT 2023 progeny lines were assayed using a 1K low-density (LD) genotyping assay, and the parents of the PYT23 lines from the crossing block were assayed using a low-pass sequencing platform to generate high-density (HD) variant data. The 50K SoySNP Chip subset of the HD data set was used as the parental reference panel to impute the 1K LD set to the 50K HD set (~30K SNPs after QC). We used these imputed data to make predictions using genomic prediction models that include GxE interaction effects. In the summer of 2024 we planted a trial including lines selected using genomic prediction and phenotypic selection. The trial was successfully planted at six Minnesota locations, and every location yielded good data at harvest.
During the past reporting period, we analyzed the data to compare the phenotypic selection versus the genomic selections. For each treatment, we selected both high and low yielding lines. Selecting low yielding lines is important as it tests our ability to identify the poorly yielding lines. This ensures we are not devoting precious field phenotyping resources towards lines that are predicted to have low yield.

Sub-obj 3: Development of a genomic prediction R-Shiny app for easy implementation of GS for breeders.
The first version of this application is now complete. We have submitted an article describing it to the peer-reviewed journal Plant Genome.

1) Ramasubramanian, V., C. Wartha, L. Singh, P. Vitale, S. Ru, and A.J. Lorenz*. 2025. GS4PB: An R/Shiny Application to Facilitate a Genomic Selection Pipeline for Plant Breeding. Plant Genome (Submitted).

This manuscript describes the development of the application, the components, how it can be used to execute genomic selection for plant breeding, and how it can be accessed. This application is freely available to the public, and will enable plant breeders to implement genomic selection.

Objective 2. Develop and test methods for predicting cultivar performance in future target environments through genomics-assisted breeding models, phenomics, and environmental characterization.

Three main tasks are associated with this objective.
1. Conduct a large, multi-institutional, multi-environment trial to create a “genotype-by-environment interaction” dataset for soybean that soybean researchers can use to identify methods to improve genomic prediction methodology for yield.
2. Define environmental variables driving genotype-by-environment interactions that can be used to enhance prediction models.
3. Quantify biomass non-destructively so that crop growth can be evaluated in relation to certain environmental stressors.

Progress on each task:

1. For this objective, we are conducting a multi-environment, multi-institutional coordinated performance trial of 1200 diverse breeding lines. Each breeding line will be phenotyped for several agronomic and phenological traits, and each will be genotyped using low pass resequencing technologies. Detailed environmental data for each growing location in each year will be collected and analyzed. The ultimate goal is to better predict the interactions between the environment and genotype. If successful, we will leverage genomic data, phenotype data, and environmental data to predict how new breeding lines may perform in environments that a producer is most likely to encounter.
Last summer these breeding lines were successfully grown and phenotyped for yield at 21 locations. Data has been delivered to the University of Minnesota in a centralized database. Plans were laid for scanning all 12,400 samples with NIR to measure protein and oil. We worked with Perkin-Elmer to develop a protocol to standardize instruments across universities. Standardization samples were collected across universities to represent diversity in germplasm and growing conditions. From UMN, samples were sent to collaborators for scanning on their instruments to create the standardization file. We anticipate that the samples will be scanned for protein and oil content after planting, when time allows.

Another major activity was the design and packaging of the 2025 yield trials. All seeds have been delivered to 2025 field locations, and 90% of the packaging is complete. Field layouts have been designed and sent to cooperators. All is on schedule for a successful 2025 planting.

Genotype data from the Hyten lab was collected and returned on April 11. The genotype data includes 15 million molecular markers (SNPs) imputed from skim sequencing data, with a 50K subset available. The SNPs were mapped to the latest Williams82 genome version, Wm82a6v1. Analysis of these data will proceed over the coming months.

2. We completed the development of the Seasonal Characterization Engine (SCE). The SCE is a specialized tool integrated within the R environment, designed to streamline the analysis of trial data using the Agricultural Production Systems Simulator (APSIM) model. Users begin by uploading trial data specifying key parameters such as location, latitude, longitude, and maturity. Predefined input files, such as the soybean seed composition test, facilitate accurate data setup.

After data upload, users select a crop model template (e.g., soybean or maize) and specify maturity handling methods. Users then choose appropriate weather and soil databases, considering geographic coverage to avoid analysis errors.

The APSIM model generates environmental variables specific to different crop growing stages, providing a detailed characterization of conditions throughout the growing season. Upon initiating the analysis, the SCE provides real-time process updates via the R console.

Analytical results are available through multiple visualization tools:
1. Results Viewer: Offers box plot visualizations of selected variables, downloadable individually or collectively.
2. Heat Maps: Enables detailed inspection of seasonal and environmental covariates by growth stage, comparing specific parameters within consistent genetic maturity groups.
3. Trial Similarities: Presents correlation heat maps illustrating seasonal profile similarities among trial sites, complemented by dendrograms to clarify site relationships. Outputs are readily exportable for further analysis.
4. Thermal Time and Precipitation: Summarizes accumulated thermal time and precipitation over growing seasons, offering comparative analyses between and within sites across multiple years. These insights support understanding climate variability and trends.
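
As a rough illustration of the kind of summary described under Thermal Time and Precipitation, the sketch below accumulates degree-days and rainfall over a few days of weather. It uses the simple daily-average method with an assumed base temperature of 10 °C. This is not how APSIM or the SCE computes thermal time, and the function names and weather values are hypothetical.

```python
def daily_thermal_time(tmin_c, tmax_c, base_c=10.0):
    """Degree-days for one day using the simple averaging method.

    base_c = 10.0 is an assumed base temperature, not a value from the SCE.
    """
    return max(0.0, (tmin_c + tmax_c) / 2.0 - base_c)

def accumulate_season(daily_records, base_c=10.0):
    """Return running totals of (thermal time, precipitation).

    daily_records: iterable of (tmin_c, tmax_c, precip_mm) tuples, one per day.
    """
    thermal, precip = 0.0, 0.0
    cumulative = []
    for tmin_c, tmax_c, precip_mm in daily_records:
        thermal += daily_thermal_time(tmin_c, tmax_c, base_c)
        precip += precip_mm
        cumulative.append((thermal, precip))
    return cumulative

# Four made-up days of weather: (tmin °C, tmax °C, precipitation mm)
season = [(8.0, 18.0, 0.0), (12.0, 24.0, 5.0), (15.0, 29.0, 0.0), (5.0, 11.0, 12.5)]
totals = accumulate_season(season)
print(totals[-1])  # (23.0, 17.5)
```

Running totals like these can be compared between sites or years by aligning them on days after planting or on growth stage.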

3. During the 2024 season, the Rainey Lab at Purdue University focused on phenotyping soybean crops using high-resolution UAS imagery from 10 SOYGEN sites to extract canopy traits. A data management system has been set up at Purdue to help SOYGEN collaborators with secure storage of UAS image data, efficient data sharing, and access to processed plot-level outputs. To date, key canopy traits, including spectral, textural, and structural features, have been extracted from 6,320 two-row plots covering about 1,200 genotypes across V3 to R5 growth stages. The processed plot-level data include labeled and curated RGB image clips, binary masks, and 3D point cloud data. This growing dataset provides a valuable resource for soybean breeders, supporting the development of advanced AI/ML algorithms and trait derivation.

For biomass prediction, we established ‘Calibration Plots’: 8-row plots planted adjacent to the SOYGEN plots to calibrate nondestructive biomass estimation. A representative subset of 36 lines from the SOYGEN3 GEI panel, with three replications, was used in this experiment. Ground biomass sampling was performed on rows 2, 3, and 4, while rows 6 and 7 were used for UAS-based data collection and harvest yield measurements. UAS flights were conducted within one day before or after each ground sampling event, and from these flights we extracted several key drone-derived metrics, including canopy coverage, canopy height, green leaf index, and an array of structural and textural features.
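
Two of the drone-derived metrics named above can be sketched on a toy RGB plot clip. The green leaf index is computed per pixel as GLI = (2G - R - B) / (2G + R + B), and canopy coverage is taken as the fraction of pixels flagged as canopy in a binary mask. The 0.1 threshold used to build the mask and the pixel values are assumptions for illustration, not the project's processing pipeline.

```python
import numpy as np

def green_leaf_index(rgb):
    """Per-pixel GLI = (2G - R - B) / (2G + R + B) for an H x W x 3 array."""
    rgb = np.asarray(rgb, dtype=float)   # cast first to avoid uint8 overflow
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    denom = 2 * g + r + b
    gli = np.zeros_like(denom)
    # Leave GLI at 0 where the pixel is pure black (denominator of zero)
    np.divide(2 * g - r - b, denom, out=gli, where=denom != 0)
    return gli

def canopy_coverage(mask):
    """Fraction of pixels in a binary mask that are flagged as canopy."""
    return float(np.asarray(mask, dtype=bool).mean())

# Toy 2 x 2 plot clip: two green pixels, one soil-like pixel, one black pixel
plot_rgb = np.array([[[40, 120, 30], [130, 110, 90]],
                     [[35, 140, 45], [0, 0, 0]]], dtype=np.uint8)
gli = green_leaf_index(plot_rgb)
mask = gli > 0.1                         # hypothetical vegetation threshold
coverage = canopy_coverage(mask)
print(coverage)  # 0.5
```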

In addition to the calibration dataset, we incorporated our previous ‘Public Biomass’ experiments (2021–2023), which utilized public soybean breeding germplasm from the north-central U.S. region.

Biomass prediction was approached using two modeling strategies:
1. A simplified model using two robust drone-derived predictors — canopy coverage and canopy height — both known to exhibit strong correlations with biomass.
2. A comprehensive model leveraging a broader suite of structural and textural features, capturing more detailed aspects of plant architecture and canopy structure.

Machine learning models developed with these approaches achieved high performance, with R² values reaching 0.89 and RMSE as low as 68.70, demonstrating strong predictive accuracy. Our current focus is on refining these models by minimizing time-dependent biases, enabling reliable biomass predictions on any given day regardless of growth stage or sampling date.
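
For readers less familiar with the two metrics reported here, the sketch below shows how R² and RMSE are computed on held-out plots. It fits an ordinary least-squares model on canopy coverage and canopy height, which stands in for the simplified two-predictor strategy only. The actual models were machine learning models, and all values below are simulated, with hypothetical function and variable names.

```python
import numpy as np

def fit_linear(features, target):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    X = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef

def predict_linear(coef, features):
    X = np.column_stack([np.ones(len(features)), features])
    return X @ coef

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(observed, predicted):
    """Root mean squared error, in the same units as the observations."""
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))

# Simulated plots (assumption): coverage as a fraction, height in metres
rng = np.random.default_rng(1)
coverage = rng.uniform(0.2, 1.0, 120)
height = rng.uniform(0.1, 1.0, 120)
biomass = 50 + 300 * coverage + 400 * height + rng.normal(0, 40, 120)
features = np.column_stack([coverage, height])

train_idx, test_idx = slice(0, 90), slice(90, 120)   # simple hold-out split
coef = fit_linear(features[train_idx], biomass[train_idx])
pred = predict_linear(coef, features[test_idx])
print(r_squared(biomass[test_idx], pred), rmse(biomass[test_idx], pred))
```

Because RMSE carries the units of the response, it is only interpretable alongside the biomass units and the range of observed values.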

Structural features derived from 3D models, like canopy volume, and the top and side canopy geometry, are also good indicators for biomass. We anticipate that the marker data will facilitate the development of a biomass prediction model. We expanded the ground truth biomass dataset in 2024, and plan to do so again in 2025.
Preliminary analysis has been conducted on yield estimation using image-derived features. The results show promising potential for predicting yield, with an R² value around 0.45. To improve prediction accuracy, efforts will focus on expanding the dataset and incorporating additional data such as weather variables, soil characteristics, and genetic marker information. The integration of expanded data sources and advanced modeling techniques is anticipated to significantly strengthen the robustness and predictive power of the models.

Status of 2024 season UAS data collection and processing
See attached report.

Objective 3. Discover structural variants (SVs) and test whether modelling structural variants improves genomic predictions for yield and seed composition.

Recent improvements in our SV calling pipeline have significantly enhanced the accuracy and resolution of SV detection. By integrating five distinct SV callers, we can now identify a greater number and diversity of structural variants with higher confidence. These improvements were implemented using the Williams 82 version 6 (a6) reference genome, which is more complete and accurate than its predecessors and contains no unresolved scaffolds. GWAS was conducted using both the SVs and SNPs identified from the a6 reference genome, yielding highly significant associations between structural variants and key agronomic traits. These traits include, but are not limited to, yield, stress resistance, and plant architecture. The use of a refined SV dataset has reduced background noise and increased the power to detect true associations. Gene Ontology (GO) analysis of significant GWAS markers located within gene models revealed that many of the associated SVs and SNPs influence genes known to control trait development and physiological processes. These functional annotations provide valuable biological insights into how structural variation contributes to phenotypic diversity in soybeans.
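
The report does not describe how calls from the five SV callers are combined. One common approach, sketched here purely as an assumption, is to group calls of the same type that share a minimum reciprocal overlap and keep groups supported by more than one caller. The 50% overlap threshold, the two-caller minimum, the greedy grouping, and the caller names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SV:
    chrom: str
    start: int
    end: int
    svtype: str
    caller: str

def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions; 0 if chromosome or type differ."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))

def consensus_calls(calls, min_overlap=0.5, min_callers=2):
    """Greedily group calls against each group's first member, then keep
    groups supported by at least `min_callers` distinct callers."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.start, c.end)):
        for cluster in clusters:
            if reciprocal_overlap(cluster[0], call) >= min_overlap:
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return [c for c in clusters if len({x.caller for x in c}) >= min_callers]

# Made-up calls: three callers agree on one deletion; two calls are unsupported
calls = [
    SV("Gm01", 1000, 2000, "DEL", "caller_a"),
    SV("Gm01", 1050, 2100, "DEL", "caller_b"),
    SV("Gm01", 1100, 1900, "DEL", "caller_c"),
    SV("Gm05", 500, 900, "DUP", "caller_a"),
    SV("Gm01", 1000, 2000, "INV", "caller_d"),
]
kept = consensus_calls(calls)
print(len(kept))  # 1
```

Requiring agreement among callers trades some sensitivity for fewer false positives, which is one way a merged call set can reduce background noise in downstream GWAS.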

Comparative analyses demonstrate that the use of the Williams 82 a6 reference genome produces more meaningful GWAS results than the earlier version 4 (a4). GWAS performed with a4, which contains more unresolved scaffolds, tends to yield less consistent and potentially spurious associations. This underscores the importance of reference genome quality in structural variation-based GWAS.

Our findings highlight the critical role of high-quality reference genomes and comprehensive SV calling pipelines in conducting accurate and biologically relevant GWAS. The integration of multiple SV callers, coupled with the Williams 82 a6 reference, has led to improved detection of trait-associated structural variants. These results demonstrate that unresolved genomic regions can compromise GWAS outcomes, potentially leading to random or misleading associations.

Updated November 21, 2025:
Objective 1. Continue to develop and enhance genomics-assisted breeding resources and tools to facilitate routine application in public breeding programs.

Objective 1 can be broken down into five sub-objectives for which we can report specific and significant progress during these past six months.
Sub-obj 1: Continue to genotype with genome-wide and trait-targeted markers all new breeding lines entered in the Northern Uniform Soybean Tests
Progress: 267 new breeding lines entered into the Northern Uniform Regional trials were grown in the field and sampled for DNA. DNA extraction has been conducted and we are in the process of shipping samples for genotyping. These will contribute to the large genotype dataset we have amassed as part of this project.

Data from the 2024 NUST trials has been uploaded to SoyBase.
A manuscript on this work, which we have pursued for the last several years, has been published in Crop Science.

Wartha, C. A., Campbell, B. W., Ramasubramanian, V., Nice, L., Brock, A., Cai, G., Eskandari, M. M., Graef, G., Hudson, M. E., Hyten, D., Mahan, A. L., Martin, N. F., McHale, L., Miranda, C., Dominguez, E. M., Nelson, R., Rainey, K., Rajcan, I., Scaboo, A., … Lorenz, A. J. (2025). Genomic analysis and predictive modeling in the Northern Uniform Soybean Tests. Crop Science, 65(5), e70138.

Sub-obj 2: Enable individual public breeding programs to test and use genomic prediction
Originally, this project explicitly funded the integration of genomic prediction into the public soybean breeding pipelines to expedite yield improvement. Because of budget cuts, the funding for this part of the project was removed. Nevertheless, this project has continued to instigate and enable several public programs to start using genomic prediction routinely.
There are no activities to report for this period beyond those described in the last report.

Sub-obj 3: Development of a genomic prediction R-Shiny app for easy implementation of GS for breeders.
The first version of this application is now complete. The journal article reporting and describing this tool has been accepted for publication.

1) Ramasubramanian, V., C. Wartha, L. Singh, P. Vitale, S. Ru, and A.J. Lorenz*. 2025. GS4PB: An R/Shiny Application to Facilitate a Genomic Selection Pipeline for Plant Breeding. Plant Genome (In press).
This manuscript describes the development of the application, the components, how it can be used to execute genomic selection for plant breeding, and how it can be accessed. This application is freely available to the public, and will enable plant breeders to implement genomic selection.

In addition to the tool description, we also reported on how using this tool resulted in better selections than phenotypic selection. When making phenotypic selections in 2024 and validating them in 2025, we found that several lines that would have been culled using phenotypic selection were actually among the best-performing lines in 2025. Genomic selection did not make this mistake and culled lines correctly. This test helps validate the tool we developed and should help convince users to adopt it. We are designing additional features for the tool to help soybean breeders implement genomic selection and increase the effectiveness of their breeding programs.

Sub-obj 4: Adopting and advancing BreedBase for storage of information for soybean genomic prediction.
This database is currently functioning as needed for this project, so there is nothing new to report here. We have been uploading all genotypic data used for this project to this database.

Sub-obj 5: Connect target and training populations using imputation that leverages pedigree relationships and enhance this capacity by inclusion of this method in the software application.
As mentioned in the last report, this sub-objective has been completed. We have implemented these methods in our software application GS4PB.

Objective 2. Develop and test methods for predicting cultivar performance in future target environments through genomics-assisted breeding models, phenomics, and environmental characterization.
Three main tasks are associated with this objective.
1. Conduct a large, multi-institutional, multi-environment trial to create a “genotype-by-environment interaction” dataset for soybean that current and future soybean researchers can use to identify methods to improve genomic prediction methodology for yield.
2. Define environmental variables driving genotype-by-environment interactions that can be used to enhance prediction models.
3. Quantify biomass non-destructively so that crop growth can be evaluated in relation to certain environmental stressors.
Progress on each task is reported in order below:
1. Conduct a large, multi-institutional, multi-environment trial to create a “genotype-by-environment interaction” dataset for soybean that current and future soybean researchers can use to identify methods to improve genomic prediction methodology for yield.
The major activity this past reporting period was growing another season of the GxE trials described in the last report. All seeds were packaged, shipped, and planted at each of the 21 locations. Data templates were distributed, and we are receiving data back from cooperators this fall. As far as we know, we only lost two of the 21 locations, and we anticipate good data from 19 locations, totalling nearly 40 environments of data for this project, creating a powerful dataset for us to explore GxE genomic prediction modeling.

2. Define environmental variables driving genotype-by-environment interactions that can be used to enhance prediction models.
As described in the last report, this objective has been completed. We are working to use this tool for GxE modeling in the next version of SOYGEN.

3. Quantify biomass non-destructively so that crop growth can be evaluated in relation to certain environmental stressors.

During the 2024 and 2025 seasons, the Rainey Lab at Purdue University has focused on high-resolution UAS-based phenotyping of soybean plots across 10 SOYGEN sites to extract canopy-level traits. A centralized data management system was developed at Purdue to support SOYGEN collaborators with secure UAS data storage, efficient data sharing, and streamlined access to processed plot-level outputs.
Beginning in 2024, key canopy traits, including textural and structural features, were extracted from 6,320 two-row plots representing approximately 1,200 genotypes, at growth stages V3 through R5. Processing for the 2025 season is currently underway. The resulting dataset includes labeled and curated RGB image clips, binary masks, and 3D point cloud data. This expanding resource provides a valuable foundation for soybean breeders, supporting the development of advanced AI/ML algorithms and improved trait derivation.

Biomass Calibration Experiments:
To enable non-destructive biomass estimation, we established dedicated Calibration Plots adjacent to the SOYGEN trials:
Calibration 1 (2024–2025)
• Included 36 representative lines from the SOYGEN3 GEI panel, with three replications.
• Ground biomass measurements were collected from subsections of rows 2, 3, and 4.
• Rows 6 and 7 were used for UAS imaging and harvest yield.
• Stand counts of the sampled area were recorded (2025).
• UAS flights occurred within one day before or after each ground sampling event.
Calibration 2 (2025)
• Included 20 representative lines, also with three replications.
• Ground biomass sampling was conducted on the entirety of rows 1–5.
• Rows 6 and 7 were harvested for yield.
• Stand counts of the rows were recorded.
• UAS imaging covered both sampled rows and yield rows.
• Flights were conducted within one day before ground sampling.

The Calibration 2 design was particularly effective in reducing time-dependent biases, enabling reliable biomass predictions across growth stages and sampling dates.
For both experiments, we extracted several key drone-derived metrics, including canopy coverage, canopy height, green leaf index (GLI), and a range of textural and structural features.
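As an illustration of how two of these metrics can be derived from an RGB plot clip, the sketch below computes a per-pixel GLI using the commonly cited definition, GLI = (2G - R - B) / (2G + R + B), and a canopy coverage fraction from it. This is a minimal sketch and not the Rainey Lab pipeline: the function names, the GLI threshold of 0.05, and the toy pixel values are all assumptions made for the example.

```python
import numpy as np

def green_leaf_index(rgb):
    """Per-pixel GLI = (2G - R - B) / (2G + R + B) for an H x W x 3 array."""
    rgb = np.asarray(rgb, dtype=float)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    num = 2.0 * g - r - b
    den = 2.0 * g + r + b
    # Fully black pixels have a zero denominator; GLI is set to 0 there.
    return np.divide(num, den, out=np.zeros_like(num), where=den != 0)

def canopy_coverage(rgb, gli_threshold=0.05):
    """Fraction of pixels classed as canopy (GLI above an assumed threshold)."""
    mask = green_leaf_index(rgb) > gli_threshold
    return float(mask.mean())

# Toy 2 x 2 clip (hypothetical values): two green pixels,
# one soil-coloured pixel, one black pixel.
clip = np.array([[[30, 200, 40], [40, 180, 30]],
                 [[120, 100, 80], [0, 0, 0]]], dtype=np.uint8)
coverage = canopy_coverage(clip)
```

In practice the threshold would need to be tuned to lighting and growth stage, and canopy height would come from the 3D point cloud rather than from the RGB clip.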
Biomass Prediction Modeling:

Two modeling strategies were implemented:

1. Simplified Model
Utilized three robust predictors known to correlate strongly with biomass: canopy coverage, canopy height, and GLI.

2. Comprehensive Model
Incorporated an expanded set of structural and textural features to capture finer details of plant architecture.
Machine-learning models built using these approaches demonstrated strong predictive performance, achieving R² up to 0.89 with RMSE as low as 68.70.
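To make the two reported metrics concrete, the sketch below fits a three-predictor model and scores it with R² and RMSE on held-out plots. It is only a sketch under stated assumptions: ordinary least squares stands in for the machine-learning models, which the report does not name, and the data are synthetic numbers generated for the example, not SOYGEN measurements.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return coef[0] + np.asarray(X, dtype=float) @ coef[1:]

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the response."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic data (assumption): columns stand for canopy coverage,
# canopy height, and GLI, each scaled to the range 0-1.
rng = np.random.default_rng(0)
X = rng.uniform(size=(60, 3))
y = 100 + 300 * X[:, 0] + 200 * X[:, 1] + 150 * X[:, 2] + rng.normal(0, 20, 60)

# Fit on the first 40 plots, evaluate on the remaining 20.
coef = fit_linear(X[:40], y[:40])
pred = predict_linear(coef, X[40:])
r2_holdout = r_squared(y[40:], pred)
rmse_holdout = rmse(y[40:], pred)
```

Scoring on plots that were not used for fitting matters here, because in-sample R² tends to overstate how well a model will predict biomass at a new site or sampling date.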

Application to Collaborator Sites and Yield Estimation
The enhanced biomass prediction model has been applied to multiple collaborator sites, where predicted biomass showed strong correlations with harvested yield, reaching up to 0.82. We are also investigating yield estimation directly from predicted biomass.
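For readers unfamiliar with the statistic, the sketch below computes a correlation coefficient between predicted biomass and harvested yield. It assumes the Pearson correlation, which the report does not specify, and the plot-level values are invented for the example.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical plot-level values, for illustration only.
predicted_biomass = [410.0, 520.0, 480.0, 610.0, 550.0]
harvested_yield = [3.1, 3.9, 3.4, 4.5, 4.2]
r = pearson_r(predicted_biomass, harvested_yield)
```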
In parallel, we are integrating genomic data to perform GWAS for identifying both stage-specific and season-wide significant SNPs, with additional downstream genomic analyses underway.

Objective 3. Discover structural variants (SVs) and test whether modelling structural variants improves genomic predictions for yield and seed composition.
The Hudson laboratory is mostly concerned with the use of soybean genome data to improve how efficiently and effectively we can breed new soybean varieties. The soybean reference genome is the master copy of Williams 82 generated by the Department of Energy (DOE) Joint Genome Institute (JGI). To check how much the quality of this reference affects our results, we ran a number of tests comparing two versions, a4 and a6. The older a4 version was the best available until recently, while a6 is the newest and most complete. Because our project has uncovered extensive structural variation in SoyNAM, we want to determine how much that variation affects important traits in the plants, how much of the trait variation we can explain with measured genetic variation (heritability), and how accurately we can find the genes underlying the traits for use in genomic selection and molecular breeding.
We compared how each version affected three things:
1. Phylogenetic trees (family trees for the SoyNAM population) that show how similar the lines in the population are at the DNA level. Do the structural variants and genome versions affect our analysis of how closely related the plants are?

2. Principal Components (PCs) and kinship matrices, which are also tools that describe how closely related different plants are. However, we use these tools directly to control our genome-wide association studies (GWAS), and they affect our predictions of which genes are likely to control the important traits.

3. GWAS, which find links between genetic variation and traits such as yield or disease resistance. The genetic variation can be SNPs (small single-letter differences in DNA), SVs (large rearrangements or insertions/deletions), or both. GWAS require accurate data on kinship between the individuals in the population.
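The PCs and kinship matrices in item 2 can be sketched as follows. This example assumes a VanRaden-style genomic relationship matrix, which is one common construction and not necessarily the one used in this project, and it assumes markers coded 0/1/2 with at least one polymorphic marker. The toy SNP and SV matrices are invented; stacking them shows one way the "SNPs, SVs, or both" marker sets could be compared.

```python
import numpy as np

def vanraden_kinship(M):
    """Kinship from a lines x markers matrix coded 0/1/2 (VanRaden-style)."""
    M = np.asarray(M, dtype=float)
    p = M.mean(axis=0) / 2.0            # allele frequency per marker
    Z = M - 2.0 * p                     # centre each marker column
    denom = 2.0 * np.sum(p * (1.0 - p))
    return Z @ Z.T / denom

def principal_components(K, n_pcs=2):
    """Leading PCs from the eigen-decomposition of a kinship matrix."""
    vals, vecs = np.linalg.eigh(K)
    order = np.argsort(vals)[::-1][:n_pcs]
    # Scale eigenvectors by the square root of their (non-negative) eigenvalues.
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0.0, None))

# Toy data (hypothetical): 4 inbred lines, 4 SNPs, 2 SVs coded 0 or 2.
snp = np.array([[0, 2, 2, 0],
                [0, 2, 0, 0],
                [2, 0, 2, 2],
                [2, 0, 0, 2]])
sv = np.array([[0, 2],
               [0, 0],
               [2, 2],
               [2, 0]])

K_snp = vanraden_kinship(snp)
K_sv = vanraden_kinship(sv)
K_both = vanraden_kinship(np.hstack([snp, sv]))
pcs = principal_components(K_both, n_pcs=2)
```

Because each marker set yields its own kinship matrix and PCs, a change in the marker calls (for example, from a different reference genome version) propagates directly into the covariates used to control a GWAS.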
When we built phylogenetic trees using each of the two genome versions, we used just the SNPs, just the SVs, and both together.
For a4, the trees made from SVs and from SNPs didn’t match very well, showing that gaps and errors in the older genome caused unstable results.
For a6, the trees based on SVs looked stable and consistent when compared with a4, while trees based on small changes (SNPs) still varied more. This means that SVs may be more likely to capture real biological relationships that are not easily thrown off by small assembly errors.
Next, we tested every possible combination of:
• The version of the genome (a4 or a6),
• The type of genetic differences (SNPs, SVs, or both), and
• The relationship data (PCs and kinship matrices).
We judged accuracy by checking how well each GWAS result matched previously confirmed genetic regions, QTLs (quantitative trait loci), which are known to affect traits in the same soybean population.
The best matches came from using the newer a6 genome for everything—its SNPs, SVs, and combined data—along with relationship data also based on a6. These tests found more of the known true QTLs, showing that the higher-quality a6 genome gives cleaner, more realistic results. Using older a4 data (the best available for all previous analyses of SoyNAM) introduced small mistakes because of its missing or mis-assembled parts.
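The accuracy check described above, matching GWAS results against previously confirmed QTLs, could be scored along the following lines. This is a sketch under assumptions: the scoring rule (fraction of known QTL intervals containing at least one significant hit), the optional window, and all chromosome names and positions are hypothetical and are not taken from the project's analysis.

```python
def qtl_recovery(hits, known_qtls, window=0):
    """Fraction of known QTL intervals that contain at least one GWAS hit.

    hits: iterable of (chromosome, position) for significant markers.
    known_qtls: iterable of (chromosome, start, end) intervals.
    window: extra distance allowed on either side of each interval.
    """
    hits = list(hits)
    known_qtls = list(known_qtls)
    if not known_qtls:
        raise ValueError("known_qtls must not be empty")
    recovered = 0
    for chrom, start, end in known_qtls:
        if any(c == chrom and start - window <= pos <= end + window
               for c, pos in hits):
            recovered += 1
    return recovered / len(known_qtls)

# Hypothetical intervals and hits, for illustration only.
known = [("Gm01", 1000, 2000), ("Gm05", 5000, 6000), ("Gm20", 100, 900)]
run_a = [("Gm01", 1500), ("Gm07", 300)]
run_b = [("Gm01", 1500), ("Gm05", 6100), ("Gm20", 500)]

score_a = qtl_recovery(run_a, known)
score_b = qtl_recovery(run_b, known, window=200)
```

Running the same score over every combination of genome version, marker type, and relationship data gives a single comparable number per combination.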
By fine-tuning how we use the a6 genome and how we include SVs, we have improved the heritability (the portion of trait differences explained by the genetic variation we can measure) and sensitivity of our soybean studies for several traits. The heritability is directly related to our ability to breed for enhanced traits, as well as to build genomic selection methods.
Our second project looks at how unbalanced structural changes in DNA—such as extra or missing pieces—affect the overall size of the genome. This helps double-check our SV results and the overall pangenome.
To study this, we have built a new way to estimate genome size (how much total inherited DNA is in the nucleus of each soybean line) that isn’t easily thrown off by sequencing errors.
We use a k-mer–based genome size estimation (GSz) method. A k-mer is a short DNA sequence that helps estimate total DNA content by counting how often certain patterns appear. Using this method, we found much more variation in genome size among the SoyNAM parent plants than we expected.
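The basic idea behind k-mer-based estimation can be sketched with the classic textbook estimator: genome size is approximately the total number of counted k-mers divided by the depth at the peak of the k-mer histogram. The code below illustrates only that general idea and is not the GSz method developed in this project. The function names, the `min_depth` cutoff for discarding likely error k-mers, and the simulated reads are assumptions, and the sketch ignores reverse complements, heterozygosity, and repeats.

```python
import random
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer (forward strand only) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def estimate_genome_size(kmer_counts, min_depth=2):
    """Classic estimate: total solid k-mers / peak depth of the histogram."""
    histogram = Counter(kmer_counts.values())   # depth -> distinct k-mers
    # Drop low-depth k-mers, which are dominated by sequencing errors.
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers / peak_depth

# Simulated example (assumption): a random 200 bp "genome" read six times,
# plus one short read carrying a single-base error.
rng = random.Random(0)
K = 15
genome = "".join(rng.choice("ACGT") for _ in range(200))
reads = [genome] * 6
fragment = list(genome[50:79])
fragment[14] = "A" if fragment[14] != "A" else "C"
reads.append("".join(fragment))

estimate = estimate_genome_size(count_kmers(reads, K))
```

The estimate lands close to the simulated genome length because the error k-mers sit at depth 1 and are filtered out. With real data, errors and repeated regions blur the histogram, which is the weakness the text attributes to older estimation methods.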
Genome size is known to influence many basic cell traits across eukaryotes. This means the differences in GSz among SoyNAM parents may represent a new layer of population structure, that is, genetic organization within the population. Accounting for population structure is very important both for accurately locating the loci that control traits and for genomic trait prediction. Early tests suggest that genome size differences are linked with important plant traits like yield, seed protein and oil levels, and fiber content. We are cautious with these findings because they are new, and older estimation methods can be thrown off by sequencing errors or repeated DNA regions. To reduce these problems, we developed a new GSz method that is much less sensitive to sequencing mistakes and that properly accounts for repeats. First results in a model species look promising, and we plan to apply this improved method to the SoyNAM lines soon.
If this approach works, it will add a second and independent way to describe how populations differ genetically, helping us build better models of population structure and improving genetic studies that depend on those models.
Although these two projects started separately, they point to the same big idea: different large-scale DNA processes, shaped by different evolutionary forces, each describe part of how populations are structured. Ignoring any one of these signals could lead to wrong estimates of genetic relationships and errors in later analyses. Our work has already improved our estimation of relationships and thus the heritability estimates of important traits. Genome size differences also seem to be linked to seed oil and protein levels, meaning that variation in genome size might help predict seed quality in future breeding work.
Previously, as part of this project, we obtained substantial in-kind support: DOE JGI is internally funding the sequencing of a soybean pangenome of several hundred lines, including all of the SoyNAM, from samples we are supplying. Each of these lines will have its own reference genome, which should be of similar quality to the a6 version of Williams 82. Nothing like this pangenome has been created before, in industry or academia, and our results above from the a4 vs a6 comparison indicate that the pangenome will have a very large impact on our ability to improve soybean varieties, and on the speed with which we can do it. This year we obtained the first dozen or so genome assemblies for this pangenome, and so far the quality of these assemblies looks to be similar to that of a6. We expect to complete at least three hundred genomes, at a cost of several million dollars of DOE internal funding, as part of this project.

Final Project Results

Benefit To Soybean Farmers

Soybean breeding has a large impact on the efficiency and profitability of agriculture through the development of high-yielding new varieties with critical defensive traits and enhanced seed composition. Ensuring that such programs (both private and public) are using state-of-the-art technologies to drive genetic gain in the face of changing environments and narrowing genetic diversity will contribute to the continual development and release of ever better varieties. Additionally, these efforts help to educate future agricultural scientists and soybean breeders who are best prepared to enter the seed industry and develop impactful future products for farmers, keeping the North Central region competitive in soybean production.

The United Soybean Research Retention policy will display final reports with the project once completed, but working files will be purged after three years and financial information after seven years. All pertinent information is in the final report. If you want more information, please contact the project lead at your state soybean organization or the principal investigator listed on the project.