LGRPV2

I. Introduction

What is LGRPV2?

Fabaceae (commonly known as legumes), one of the most diverse families of angiosperms, plays an important role in the maintenance of terrestrial ecosystems and the development of human civilization. In recent years, legume genomes have accumulated rapidly. Here, we developed an updated version of the Legume Genomics Research Platform (LGRPv2: http://www.fabaceae.cgrpoee.top), enabling in-depth exploration of legume genomics. LGRPv2 contains 105 published genomes of 56 legumes together with our newly deciphered Tamarindus indica genome, and provides JBrowse for browsing the 56 chromosome-level legume genomes (56-LGs). By analyzing the 56-LGs and the outgroup grape (Vitis vinifera) genome, we annotated the biological functions and duplication types of 18,050,760 protein-coding genes, generated 10,092 synteny dotplots and 82,416,492 blocks from pairwise genome comparisons, identified 589,512 paralogs and 218,284 orthologs, constructed hierarchical multigenome alignments, and inferred the ancestral genomes and chromosome rearrangements of legumes. Moreover, we identified 59,127 synteny-based orthologroups, 81,343,237 transposable elements, 1,304 m6As, 176,927 regulatory factors, and 633,110 important trait genes in the 56-LGs. Based on these large-scale results, we implemented a series of user-friendly query, analysis, and visualization tools and interfaces in LGRPv2 to facilitate the exploration of legume genomics. Notably, the newly developed DotView, SynView, DecoBrowse, and AncVisual tools provide four gateways for using the current synteny data and ancestral genomes to reveal paleogenome reshuffling and its consequences during polyploidizations and (re)diploidizations in legumes. We also systematically integrated the species encyclopedia, multi-omics data, ecological resources, cultivation techniques, relevant literature, and links to external databases, and developed corresponding query interfaces for these resources. To support the mining of new data, we developed a ‘one-stop’ comparative genomics toolbox containing 57 window-operated bioinformatics tools, 29 of which were newly developed by us. In addition, we provide interactive statistical charts, user manuals, community pages, and submission ports for the data and tool resources in LGRPv2. In short, LGRPv2 is a comprehensive database with genomic synteny resources as its central gateway, and it could become an important community resource for the exploration of legume genomics.

 

II. Datasets and Workflow

Data sources

LGRPv2 processed the genome data of 65 leguminous plant species to produce processed CDS, PEP, and GFF3 files for each species. The software and scripts used in the processing are available on GitHub (https://github.com/zijian-yu/LGRPV2). Detailed references for each plant species are listed in the supplementary tables of the article.

 

Data analysis pipelines

Identification of polyploidy events. To identify polyploidy events, we first performed genome-wide BLASTP searches (E-value < 1e-5, score > 100) within and between the studied genomes using BLAST (Altschul et al., 1990). Then, using CollinearScan (Wang et al., 2006), the best 10 BLASTP matches were selected to infer collinear gene regions (blocks) within or between genomes. The maximum gap was set to 50 intervening genes, and large gene families with more than 50 members were removed from the blocks. The median synonymous nucleotide substitution rate (Ks) of the collinear genes was then used to determine the degree of divergence of each identified block. We calculated Ks values between collinear gene pairs using the BioPerl statistical module and the Nei-Gojobori method (Nei & Gojobori, 1986). We further plotted homologous gene pairs as dot plots based on their genomic locations, using differently colored dots to indicate whether an anchor gene pair was the best BLAST hit within or between genomes. We then identified orthologous and paralogous genomic regions within and between genomes based on the resulting homology dot plots. Between genomes, a region was identified as orthologous if the median Ks of the gene pairs in that collinear region was approximately equal to the Ks peak associated with species divergence; within genomes, a region was identified as paralogous if the median Ks of the gene pairs in that collinear region was approximately equal to the Ks peak associated with a particular polyploidization event. Finally, the history of whole-genome duplication (WGD) can be inferred from the ratio of syntenic depths within and between genomes.
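The hit filtering and the median-Ks test described above can be sketched in a few lines of Python. This is only an illustration, not the platform's own script: the tuple layout of the hits, the function names, the removal of self-hits, the use of the bit score for the "score > 100" cutoff, and the tolerance used for "approximately equal" are all assumptions.

```python
from collections import defaultdict
from statistics import median


def filter_blast_hits(rows, max_evalue=1e-5, min_score=100.0, top_n=10):
    """Keep the top_n best-scoring hits per query that pass both thresholds.

    rows: iterable of (query, subject, evalue, score) tuples, for example
    parsed from BLAST tabular output (this layout is an assumption).
    """
    per_query = defaultdict(list)
    for query, subject, evalue, score in rows:
        # Dropping self-hits is an assumption made for this sketch.
        if query != subject and evalue < max_evalue and score > min_score:
            per_query[query].append((subject, evalue, score))
    # Sort each query's hits by descending score and keep the best top_n.
    return {
        query: sorted(hits, key=lambda hit: -hit[2])[:top_n]
        for query, hits in per_query.items()
    }


def block_matches_peak(block_ks, peak_ks, tolerance=0.1):
    """True if the block's median Ks lies within `tolerance` of a Ks peak.

    The tolerance is a hypothetical stand-in for "approximately equal".
    """
    return abs(median(block_ks) - peak_ks) <= tolerance
```

A block between two genomes whose median Ks matches the species-divergence peak would be treated as orthologous; a block within one genome whose median Ks matches a polyploidization peak would be treated as paralogous.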

Identification of event-related genes and dating of key evolutionary events. Plant genomes evolve at different rates (Cui et al., 2006; Wang et al., 2011), making it difficult to determine the timing of key events in their evolutionary history. Here, we constructed a correction algorithm for redetermining key evolutionary events in monocotyledons. First, based on the orthologous and paralogous regions identified within and between genomes, we isolated the sets of orthologs and paralogs resulting from species divergence and polyploidy events. Second, we determined the evolutionary rates of key evolutionary events in monocotyledons by performing kernel function analysis of the Ks values between these orthologs and paralogs. Finally, we performed several rounds of Ks correction for the evolutionary rates of these events according to different correction bases. The first round of correction aligned the Ks distribution peaks of the divergence events between monocotyledons and grape so that they had the same value. After the first round of correction, there was still a large divergence between the τ and σ events produced by homologous plants. Therefore, similar to the first round, we performed several more rounds of Ks correction based on the τ and σ events. Details of the correction process can be found in our previous articles (Wang et al., 2017; Wang, J et al., 2018; Wang, J et al., 2019b; Wang et al., 2022), and the computational script of the correction algorithm is stored on GitHub (https://github.com/wangjiaqi206/corrected-evolutionary-dating).
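The general idea of locating a Ks peak with a kernel function and rescaling Ks values so that two peaks coincide can be sketched as follows. This is a simplified illustration and not the script in the repository cited above: the Gaussian kernel, the bandwidth, the grid, and the single multiplicative correction factor are all assumptions.

```python
import math


def kde_peak(ks_values, bandwidth=0.05, grid_step=0.01, ks_max=3.0):
    """Return the Ks value at which a Gaussian kernel density estimate peaks.

    The density is left unnormalized because only the peak location is needed.
    """
    if not ks_values:
        raise ValueError("ks_values must not be empty")
    best_x, best_density = 0.0, -1.0
    steps = int(round(ks_max / grid_step))
    for i in range(steps + 1):
        x = i * grid_step
        density = sum(
            math.exp(-0.5 * ((x - ks) / bandwidth) ** 2) for ks in ks_values
        )
        if density > best_density:
            best_x, best_density = x, density
    return best_x


def correction_factor(observed_peak, target_peak):
    """Multiplicative factor that moves the observed peak onto the target peak."""
    if observed_peak <= 0:
        raise ValueError("observed_peak must be positive")
    return target_peak / observed_peak


def correct_ks(ks_values, factor):
    """Rescale every Ks value by the correction factor."""
    return [ks * factor for ks in ks_values]
```

Under these assumptions, a lineage whose divergence peak sits at Ks 0.5 while the reference peak sits at 0.6 would receive a factor of 1.2, and later rounds would repeat the same step using other shared events as the basis.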

Comparison of genome fractionation. By comparing the rates of gene retention and loss, we can characterize the degree of divergence between subgenomes produced by different polyploidization events. The gene loss rate was calculated by dividing the number of collinear genes lost in the study species by the total number of genes per chromosome in the reference genome. The gene retention rate was calculated by dividing the number of the most conserved collinear genes (orthologs retained in both reference genomes) in the study species by the number of relatively conserved collinear genes (orthologs retained only in the main reference genome). In addition, the degree of divergence between event-produced subgenomes can be inferred with a statistical method we developed previously, the polyploidy index (P-index) (Wang, J et al., 2019a). In this study, the P-index among the subgenomes of the Acorus tatarinowii, Vanilla planifolia, Asparagus officinalis, A. setaceus, and Zingiber officinale genomes was calculated using V. vinifera, E. guineensis, and M. acuminata as references, with the sliding window set to 95 and with subgenome divergence values that were too similar or too different (parameter < 0.08 and > 0.8) disregarded. Previous studies have shown that a P-index of ~0.3 can be used as a threshold to distinguish auto- from allopolyploidy (Wang, J et al., 2019a). The reason is that known and previously inferred allopolyploids, including Brassica napus, Zea mays, Gossypium hirsutum, and Brassica oleracea, always have a P-index > 0.3 (Schnable et al., 2011; Chalhoub et al., 2014; Li et al., 2014; Wang, M et al., 2015; Renny-Byfield et al., 2017), whereas the inferred autopolyploids Glycine max, Populus trichocarpa, and Actinidia chinensis (Murat et al., 2017; Wang et al., 2017; Wang, JP et al., 2018) often have a P-index < 0.3.
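As a small worked illustration of the gene loss (deletion) rate and of the P-index threshold, consider the sketch below. The function names and the "undetermined" label for a value exactly at the threshold are assumptions; the P-index itself is not computed here, only compared with the ~0.3 cutoff quoted above.

```python
def gene_loss_rate(lost_collinear_genes, reference_genes_on_chromosome):
    """Collinear genes lost in the study species, divided by the number of
    genes on the corresponding reference chromosome."""
    if reference_genes_on_chromosome <= 0:
        raise ValueError("reference gene count must be positive")
    return lost_collinear_genes / reference_genes_on_chromosome


def classify_polyploidy(p_index, threshold=0.3):
    """Label a polyploidization event by comparing its P-index with the threshold."""
    if p_index > threshold:
        return "allopolyploid"
    if p_index < threshold:
        return "autopolyploid"
    return "undetermined"  # exactly at the threshold: left open in this sketch
```

For example, 30 lost collinear genes against 120 genes on the reference chromosome give a loss rate of 0.25.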

The pipeline for inferring ancestral karyotypes and evolution. The inference of ancestral genome structure and paleogenome remodelling trajectories is divided into 7 main steps.
1) Genome-wide comparison of the species involved using BLAST (Altschul et al., 1990) to confirm conserved homologous genes between and within genomes.
2) Collinearity analysis: the homology information obtained from BLASTP is passed to CollinearScan (Wang et al., 2006) or MCScanX (Wang et al., 2012) to identify synteny blocks.
3) Identification of orthologs and paralogs associated with speciation and polyploidy by inter- and intra-genomic comparisons.
4) Identification of conserved ancestral regions (CARs) by combining dotplots and gene collinearity between genomes.
5) Identification of ancient chromosomal rearrangements in conjunction with species trees. For example, if CARs 1 and 2 are adjacent in both study species A and B, it is reasonable to assume that CARs 1 and 2 were fused in the ancestor of A and B. If CARs 1 and 2 are not adjacent in species B, the ancestral structure of A and B is difficult to determine. A reference species R then needs to be introduced: if CARs 1 and 2 are also adjacent in R, the ancestral structure of A and B is still CAR1-CAR2. The inference of ancestral chromosome rearrangements also needs to consider the effects of duplication, and we have modelled the possible scenarios.
6) Bottom-up inference of the ancestral karyotype and its composition for the study species by identifying and collating all the CAR rearrangements.
7) After the ancestral genome is determined, identification of the fusion patterns and rearrangement trajectories of paleochromosomes by comparing the CARs in the dotplot between the modern and ancestral genomes. For example, if the two chromosomes in the study species that correspond to the same ancestral chromosome differ structurally, for instance by a translocation, the change should have occurred after the WGD; conversely, if they share the change, such as an end-to-end joining (EEJ) fusion or a nested chromosome fusion (NCF), it should have occurred before the WGD.
The actual process of inferring ancestral genomes and paleochromosome remodelling trajectories can be more complex and requires careful and lengthy verification and validation.
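The adjacency reasoning in step 5 (species A, species B, and reference species R) can be written as a small decision rule. The sketch below is a simplified illustration under stated assumptions: each genome is represented as a list of chromosomes, each chromosome as an ordered list of CAR labels; the rule is applied symmetrically to A and B; and the effects of duplication mentioned in the text are ignored. The function names are hypothetical.

```python
def adjacent(genome, car_x, car_y):
    """True if car_x and car_y are neighbours on any chromosome of the genome.

    genome: list of chromosomes, each an ordered list of CAR labels.
    """
    for chromosome in genome:
        for left, right in zip(chromosome, chromosome[1:]):
            if {left, right} == {car_x, car_y}:
                return True
    return False


def ancestral_adjacency(species_a, species_b, reference, car_x, car_y):
    """Infer whether car_x and car_y were joined in the ancestor of A and B.

    Returns True when the adjacency is supported and None when the available
    evidence leaves the ancestral state undetermined.
    """
    in_a = adjacent(species_a, car_x, car_y)
    in_b = adjacent(species_b, car_x, car_y)
    if in_a and in_b:
        return True  # adjacent in both study species
    if (in_a or in_b) and adjacent(reference, car_x, car_y):
        return True  # adjacent in one study species and in the reference R
    return None


# Hypothetical CAR orders for illustration only.
species_a = [["CAR1", "CAR2", "CAR3"]]
species_b = [["CAR1"], ["CAR2", "CAR3"]]
reference_r = [["CAR2", "CAR1"]]
print(ancestral_adjacency(species_a, species_b, reference_r, "CAR1", "CAR2"))
```

Collecting such calls for every pair of CARs gives the set of supported ancestral adjacencies from which a karyotype can be assembled bottom-up.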

Gene family analysis pipeline. Gene families can be easily identified in IPAP, which offers three sequence matching modes: Blast, Diamond, and Blast match. These modes match target sequences against known protein sequences to select the desired gene families. A structural domain identification function is also available, allowing the structural domains of target sequences to be predicted with the Pfam database. After the gene family sequences are identified, researchers can perform multiple sequence alignment and then construct phylogenetic trees. Codon and CpG island prediction can also be performed in IPAP, and the non-synonymous (Ka) and synonymous (Ks) substitution rates can be calculated. In addition, researchers can predict and map motifs and gene structures. Together, these functions cover the common needs of gene family analysis.

 

III. Browse

Community and collection of resources

Items                               | Brief Introduction                                      | Records
------------------------------------|---------------------------------------------------------|----------
Pair-wise dotplots                  | Homologous structure dotplots related to Fabaceae       | 2,028
P. vulgaris hierarchical alignments | Pvu hierarchical alignment gene pairs                   | 27,996
V. vinifera hierarchical alignments | V. vinifera hierarchical alignment gene pairs           | 28,164
Event-related genes                 | Information on event-related gene pairs                 | 70,673
Functional genes                    | Function-related gene family information                | 50,178
Transcription factor                | Transcription factor gene family information            | 50,056
ncRNA                               | Includes rRNA, tRNA, snRNA, snoRNA, and microRNA        | *
Transposable elements               | Details of transposable elements                        | *
Orthogroups                         | Gene family information for orthologous genes           | 38,690
Gene Function                       | Information on functionally annotated genes             | 8,470,419
Pathway                             | Detailed information on Fabaceae-related pathways       | 912,552
Domain                              | Information on domains identified using Pfam libraries  | 106,111
Fabaceae Community page             | An online community for plant karyotype research        | -

 

IV. FAQ

A. What information does the Leguminosae platform Database provide on plant karyotype evolution?

We built a user-friendly, web-based comparative and functional genomics platform for polyploid and paleogenomic evolutionary analysis in the Fabaceae (LGRPv2, http://www.fabaceae.cgrpoee.top/). We established 45 tools for polyploidy and collinearity analysis pipelines, which we named IPAP (Integrated Polyploidy Analysis Pipeline). We then selected a representative set of 25 chromosome-level legume genomes for systematic bioinformatics analysis. The analysis results are stored in LGRPv2, which helps researchers easily query, compare, and download these genomic resources and bioinformatics results. For example, the platform stores 2,028 = (25 + 1) * (25 + 1) * 3 homologous structure dot plots. The hierarchical homologous gene tables with P. vulgaris and V. vinifera as references contain 27,996 and 28,164 rows of homologous gene pairs, respectively. Based on these tables, 70,673 event-related genes and genomic hybrid gene pairs were obtained. Paleogenomic karyotypes of Fabaceae were inferred, and their evolutionary trajectories are displayed as animations. MCScanX was used for gene collinearity analysis, and the results of the 676 (26 * 26) MCScanX runs can be explored with SynVisio, an interactive tool for multi-scale genome visualization. The platform also holds 106,111 structural domain records, 38,690 orthogroups, 8,470,419 gene function annotation results, and 912,552 pathway annotation records for Fabaceae genes. Using PfamScan and HMM models downloaded from the Pfam database, we identified functionally important genes, for example growth hormone genes, anthocyanin genes, flowering-related genes, resistance genes, nitrogen fixation-related genes, oil synthesis genes, and m6A genes. The cmscan software was used to identify ncRNAs. RepeatModeler 2.0.3, RepeatMasker, and DeepTE were used for transposable element (TE) prediction. OrthoFinder (v2.0.9) was used to identify orthologous genes among the 25 genomes.

Functional and pathway annotations of genes were produced with InterProScan (v5.51) and GhostKOALA, a tool provided by the Kyoto Encyclopedia of Genes and Genomes (http://www.genome.jp/kegg/), respectively. In addition, we collected 135 genomic resources from 58 legume species (Table S1).
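The dot plot and MCScanX counts quoted in this answer follow from simple arithmetic over the 25 legume genomes plus one outgroup, which can be checked with the short sketch below. The function name is hypothetical, and the source does not say what the factor of 3 stands for, so it appears here only as an unexplained "plots per genome pair" parameter.

```python
def pairwise_counts(n_legumes=25, n_outgroups=1, plots_per_pair=3):
    """Return (ordered genome pairs, total dot plots) for an all-vs-all comparison.

    Pairs are ordered and include each genome against itself, which is what
    (25 + 1) * (25 + 1) implies.
    """
    genomes = n_legumes + n_outgroups
    pairs = genomes * genomes
    return pairs, pairs * plots_per_pair


pairs, dotplots = pairwise_counts()
print(pairs, dotplots)  # 676 2028
```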

 

B. How can data be downloaded from the Leguminosae platform Database?

All data in the Legume Platform database can be downloaded from the corresponding resource pages, including genome data, pangenomic data, transcriptome data, Fabaceae pathways, JBrowse, etc.

 

C. How to contact us?

If you encounter any problems or find any bugs while using the Leguminosae platform Database, please email [email protected] or [email protected], open a pull request in the Fabaceae Community, or contact us by:

Address: 21 Bohai Road, Caofeidian, Tangshan 063210, Hebei, China

Tel: +86-0315-8805607

 

V. Citation

Data files contained in the Leguminosae platform Database are free of all copyright restrictions and made fully and freely available for non-commercial use. Users of the data should cite the following articles:

・Fabaceae genomes bioinformatics platform: A comprehensive database of Fabaceae genomes

・Multi-dimensional reshuffling of ancestral genome during post-polyploid diploidization shaped family Fabaceae