Updated June 8, 2022:
SOYGEN 2: Increasing soybean genetic gain for yield and seed composition by developing tools, know-how and community among public breeders in the north central US
Objective 1: Elevating collaborative field trials
1a. Development of a database to store, query, and distribute data from collaborative field trials
Database tables and draft query user interfaces have been created. Phenotypic data from collaborative trials from 1989 to the present have been loaded into the data tables and are accessible to project participants. Environmental data will be available through an interface to the DayMet meteorological API. Beta testing of the interface by project participants continues. Additionally, a soybean-specific BreedBase installation has been created, allowing users to upload and share data from their own breeding programs and to leverage the Uniform Trial data (currently being populated) to gain accuracy in yield predictions.
1b. Updating the Uniform Soybean Trials
We collected 6K genotype data on all 2020 UT lines. The 2020 SCN UT lines will be planted in the field along with all 2021 UT and SCN UT lines for tissue collection and genotyping. All materials from the 2021 UT and 2020 SCN UT were sampled, and DNA isolation will commence shortly.
By discussion and agreement of UT collaborators, data submission forms have been updated for current and future field trial sites to include GPS coordinates. This will allow weather data to be linked to phenotypic data.
Weather datasets were collected for the site-years corresponding to NUST field trials by linking the geographic coordinates of the field trials with the DAYMET weather data. This information, along with field trial phenotypic information, will be used to compare the year-to-year similarity among trial sites.
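As an illustration of how site-years might be compared once weather summaries are in hand, the following is a minimal sketch only. The site names, the two weather variables, and all values are hypothetical, and the standardized Euclidean distance used here is an assumed similarity measure, not necessarily the one the project will adopt.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical seasonal summaries per site-year (made-up values):
# (mean temperature in degrees C, total precipitation in mm)
site_years = {
    ("SiteA", 2019): (21.0, 450.0),
    ("SiteA", 2020): (22.0, 380.0),
    ("SiteB", 2019): (19.0, 520.0),
}

def standardize(table):
    """Scale each weather variable to mean 0 and standard deviation 1."""
    keys = list(table)
    columns = list(zip(*(table[k] for k in keys)))
    scaled_columns = []
    for column in columns:
        mu, sd = mean(column), pstdev(column)
        scaled_columns.append([(v - mu) / sd if sd > 0 else 0.0 for v in column])
    return {k: tuple(col[i] for col in scaled_columns) for i, k in enumerate(keys)}

def distance(a, b):
    """Euclidean distance between two standardized weather profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

scaled = standardize(site_years)
keys = sorted(scaled)

# Smaller distance = more similar growing conditions under this assumed measure.
pairwise = {}
for i, first in enumerate(keys):
    for second in keys[i + 1:]:
        pairwise[(first, second)] = distance(scaled[first], scaled[second])

for pair, d in sorted(pairwise.items(), key=lambda item: item[1]):
    print(pair, round(d, 2))
```

Standardizing first keeps a variable measured in large units (precipitation in mm) from dominating one measured in small units (temperature).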
Objective 2: Development of a genomic breeding facilitation suite
2a. Genotyping methods
We have received 17,259 DNA samples to run with the 1k SNP set. A total of 14,739 have been genotyped. We have surpassed the project goal of 10,000 genotyped samples.
2b. Imputation methods
Imputation of progeny with low-pass sequencing has been tested on a small scale. Scripts were completed and are being tested in the Lorenz laboratory. We have been working to improve their accuracy and iterating new versions to make the scripts more useful in different use cases. Scripts have been distributed to project participants upon request. See below for details.
2c. Genomic Prediction Facilitation Suite
During this past reporting period we were able to install a genome-wide marker database called GIGWA (https://gigwa.southgreen.fr/gigwa/). We have deposited our current genome-wide marker data into this, including all the genotype data collected on the UT as part of this project. A workflow of software tools and scripts was initiated to seamlessly combine data held in this database with phenotypic data and genomic prediction models to ease the use of genomic selection in a practical breeding context. A few steps still need to be developed, such as low-to-high marker density imputation and training population optimization. A tutorial of the workflow was provided to project participants.
On a related front, co-PI Nelson has begun the adoption of a platform called BreedBase (breedbase.org). This will be available to public breeders for depositing the phenotypic and genotypic data from individual breeding programs as well as collaborative datasets, such as the Uniform Trials. It will facilitate the use of genome-wide marker data for breeding. BreedBase has been installed at SoyBase, and project participants are uploading data for beta testing. Individualized training from BreedBase personnel has been provided and remains available on request.
Objective 3: Evaluation of soybean breeding methods that increase gain
3a. Advanced spatial analysis.
Preliminary yield prediction models have been run on single-location progeny rows from 2019 using elastic net, ridge regression and lasso. Preliminary results show an RMSE of 7 bu/acre and an R² of 0.69. Models have shown relative maturity and pedigree information to have the largest effect on yield. Soil parameters and canopy area have also shown some significance. Soil data are extracted using fine-scale soil maps generated in collaboration with soil scientist Dr. Miller and his postdoc Dr. Khaledian. With these soil maps we get soil nutrient data (N, P, K, Ca, Mg, CEC, NO3, OM) as well as soil texture data on a 3 m x 3 m scale. Further machine learning models and selection criteria are being developed with Dr. Sarkar and his graduate student Luis Riera.
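For readers less familiar with the two fit statistics quoted above, the sketch below shows how RMSE and R² are commonly computed from observed and predicted yields. The yield values are made up for illustration, and R² is taken here as the coefficient of determination, which is an assumption; the report does not state which definition was used.

```python
from math import sqrt

def rmse(observed, predicted):
    """Root mean squared error, in the same units as the trait (e.g. bu/acre)."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical progeny-row yields (bu/acre) and model predictions.
observed = [50.0, 60.0, 70.0, 80.0]
predicted = [55.0, 55.0, 75.0, 75.0]

print(rmse(observed, predicted))       # every error is 5 bu/acre, so RMSE is 5.0
print(r_squared(observed, predicted))  # 1 - 100/500 = 0.8
```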
In collaboration with statistician Dr. Dutta and his graduate student, Dongjin Li, we have prepared a tutorial using the statgenSTA R package. This tutorial includes videos and an HTML notebook showing the steps from data preparation through fitting and running models to outlier analysis. The statgenSTA package allows users to fit traditional non-spatial models, as well as spatial models, by including row and column information as well as replications. Users can use the lme4, SpATS or asreml packages as the engine for fitting the data. This tutorial will be shared with the breeding community prior to the fall season. We used the SpATS engine, which uses a penalized spline for spatial correction. This allows for a more dynamic spatial correction than the traditional moving means correction. We also used this tool in our spatial adjustments for 2020 yield trials and compared it with the traditional moving means method that we have used in the past.
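To make the contrast with the spline approach concrete, here is a deliberately simplified sketch of a moving means adjustment along a single row of plots. The function name, the window size, the yields, and the choice to subtract the full local deviation (rather than a fitted covariate coefficient) are all assumptions for illustration; this is not the project's actual procedure and is written in Python rather than R.

```python
def moving_mean_adjust(yields, half_window=1):
    """Adjust each plot by how far its neighbors' mean departs from the grand mean.

    Neighbors are the plots within `half_window` positions on either side,
    excluding the plot itself. A plot with no neighbors is left unadjusted.
    """
    n = len(yields)
    grand_mean = sum(yields) / n
    adjusted = []
    for i, y in enumerate(yields):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        neighbors = [yields[j] for j in range(lo, hi) if j != i]
        local_mean = sum(neighbors) / len(neighbors) if neighbors else grand_mean
        adjusted.append(y - (local_mean - grand_mean))
    return adjusted

# Hypothetical row of plots with a fertility gradient from left to right.
row_yields = [40.0, 42.0, 44.0, 46.0, 48.0, 50.0]
adjusted = moving_mean_adjust(row_yields)
print(adjusted)
```

A fixed window like this treats every neighborhood the same way, which is the limitation a penalized spline surface over rows and columns is meant to relax.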
Code and full tutorials for the selection process were shared with the entire research group.
3b. Development of breeding program specific genomic prediction models
More than 14,000 advanced lines have been genotyped (see Objective 2) and are being used for the development of training models and implementation of genomic selection within individual breeding programs.
3c. Genomic plus secondary trait selection at the progeny row stage
For each of four breeding programs, nearly 2500 progeny rows were grown, phenotypically evaluated and genotyped in 2021. In addition, canopy coverage data were extracted from UAV images. Selections were made based on phenotypic data alone (primarily yield), genotypic data alone (genomic selection using a model developed from UT data), phenotypic data plus canopy coverage, genotypic data plus canopy coverage, and random selections. These selection methods are being tested in 2022.
3d. Exploration of genomic prediction to reduce unfavorable correlations between seed yield and protein
We used genomic prediction to predict the mean, variance, yield-protein correlation, and superior progeny mean of all possible crosses among 2019 and 2020 UT lines. We made this information available to all SOYGEN2 breeders for their consideration when planning 2021 crosses. F1 progeny from crosses between 5 breeder-selected high-yield, high-protein lines and 5 model-selected high-yield, high-protein lines were harvested from 8 breeding programs and increased to F2.
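One common way to turn a predicted cross mean and variance into a superior progeny mean is to add the selection intensity times the genetic standard deviation. The sketch below assumes normally distributed progeny values and a 10% selected fraction; both are assumptions, as are the function names and the example crosses, since the report does not specify how the quantity was calculated.

```python
from statistics import NormalDist

def selection_intensity(selected_fraction):
    """Mean of the top fraction of a standard normal distribution."""
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - selected_fraction)
    return std_normal.pdf(threshold) / selected_fraction

def superior_progeny_mean(cross_mean, genetic_variance, selected_fraction=0.10):
    """Expected mean of the best progeny: mean + intensity * genetic std. dev."""
    intensity = selection_intensity(selected_fraction)
    return cross_mean + intensity * genetic_variance ** 0.5

# Hypothetical crosses: (predicted mean yield in bu/acre, predicted genetic variance).
crosses = {"cross_1": (60.0, 4.0), "cross_2": (59.0, 16.0)}

for name, (mu, var) in crosses.items():
    print(name, round(superior_progeny_mean(mu, var), 2))
```

In this made-up example the cross with the lower mean but larger variance has the higher superior progeny mean, which is why the variance is predicted alongside the mean.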
3e. Rapid cycling
Three cycles of genomic selection were completed on schedule, with coordination between UNL (genotyping from the Hyten lab, model predictions from the Lorenz lab) and KSU (Schapaugh conducting crossing and advancements). About 100-400 F1s were created each cycle, and about 40 F1s were selected based on genotypic data. Following each cycle of selection (up to 3), lines have been inbred. F4 generations have been grown for Cycles 0, 1, and 3, with Cycle 2 being one generation behind due to miscommunication with the winter nursery. Testing will occur in 2022.
3f. Evaluation of putative “yield” alleles
Because we were unable to obtain a material transfer agreement (MTA) from the USDA for many of the cultivars used in the pedigrees of these lines, we were only able to complete a single cross combination: LG09-8165 x LG11-5120. Markers were developed for four putative yield loci segregating between these parents, and F3 families have been planted.
Objective 4: Characterization and use of the USDA Soybean Germplasm Collection, a foundation for future success
As we did for yield and agronomic traits previously, we conducted a genome-wide association analysis for each of the seed composition traits using the multi-year, multi-location phenotype data collected as part of this project, along with the existing 50K genotype data from the collection. The association analyses were conducted by sampling group (CLU, RAN, SSD) and over all lines together.
This SOYGEN (Science Optimized Yield Gains across Environments) project leverages and builds upon ongoing and previously funded work to increase soybean genetic gain for yield and seed composition by developing tools, know-how and community among public breeders in the north central US. In support of these goals we are adding to the availability and utility of public soybean data: adding genotypic data for tens of thousands of breeding lines and cultivars, much of which is attached to high-quality, geo-referenced field data. We are testing and learning which breeding methods can be improved with these data and how to do so in a cost-effective manner. Ultimately, this will lead to faster development of improved (yield and seed quality) soybean cultivars, which will provide farmers with increased production and increase the competitiveness of US soybean in the global market.