Progress Summary: With the advance of high-throughput technologies and growing recognition of their impact on soybean research and product development, researchers worldwide have invested heavily and generated massive amounts of genomic, genetic, and phenotypic data. These data provide unprecedented power and opportunities to address many issues facing the US soybean industry. Unfortunately, analyzing and using these data is a major challenge for most soybean researchers, because it requires a great deal of tedious work, powerful computational infrastructure, new technical skills and research methodologies, and new tools to organize, analyze, and mine the data.
Over the past decade, we have devoted significant effort to developing large-scale data analysis and data mining technologies and new uses of such large-scale datasets for soybean improvement. Applying these technologies and datasets, we have discovered causative genes and alleles underlying quantitative trait loci (QTLs) for traits important to US soybean farmers. The objectives of the project are to consolidate and analyze the data generated by researchers worldwide into user-friendly data resources, and to integrate the multidisciplinary data-driven approaches we developed over the past decade into a robust big-data-driven technology platform for the US soybean community.
The project is progressing well and is ahead of all FY2023 milestones (see Detailed Progress). We applied our big-data analysis pipeline to consolidate and analyze genome sequencing data for ~12,000 soybean accessions and ~8,000 transcriptome sequencing datasets available to the public. From these we generated several datasets of DNA variants and of expression of all genes under different treatments. Using a set of new data-mining strategies, we identified a large set of QTLs, putative causative genes and alleles, and functional markers, and inferred genetic and gene networks for soybean genetic improvement. We devoted significant effort to deploying data sources and data technologies to the soybean community and to providing technical assistance in using the data. We made eight oral and poster presentations at the Cellular and Molecular Biology of the Soybean conference, the Plant and Animal Genome Conference, and the Soybean Breeder Workshop to promote the use of the data resources and technologies. We also provided technical assistance and consultation, along with the existing data-driven technologies, to US companies such as Impossible Foods, universities, and research institutes for a variety of uses. We are finishing multiple publications that will make our new data sources and technologies available to the soybean community. Our earlier publication of 1,500 soybean genomes and the dataset released in SoyBase and Ag Data Commons have been accessed over 2,000 times. As large-scale data become more widely used in soybean research, we expect use of the new data sources and data-driven approaches from this project to increase significantly, with a strong impact on US soybean agriculture and industry.