Whole-genome analyses
This page provides an introduction and links to the results of some of our genome-wide analyses of SAS, PAS, expression levels and CDS start reassignments (Siegel et al. 2010 Nucl. Acids Res. 38:4946-57). These data were mapped to genome version 4. Along with SAS and PAS analyses from the Tschudi and Ochsenreiter labs (Nilsson et al. 2010 PLoS Pathogens 6:e1001037 and Kolev et al. 2010 PLoS Pathogens 6:e1001090), they have been incorporated into the genome annotation at GeneDB and EuPathDB (TriTrypDB), where curated and raw-mapped SAS and PAS data can be viewed as Genome Browser tracks, together with displays of genome-wide histone modification (Siegel et al. 2009 Genes Dev 23:1063-76).

Splice site (SAS) predictions

I hope these data are useful for some of you. All data are predictions, but I am confident that most of them are valid. In a few cases where we predicted long N-terminal extensions and long UTRs, there could be an additional CDS, unrecognized by TrypDB, between the predicted splice site and the downstream gene to which it was assigned. Where the mRNA abundance was low, only the predominant SAS will have been identified and there may be undetected alternative SAS. The listings indicate the life-cycle and strain sources, and give the UTR and Splice Signal regions, etc.

An Excel file contains comprehensive data on 10,857 potential SAS for 6,959 genes for which we had at least 2 hits and unique matches, etc, as described in (manuscript submitted).

Some SAS data predict different CDS start sites than those assigned in TrypDB

When assigning splice sites I became aware that an incorrect ATG had been predicted for many genes in TrypDB. These results have been separated into 4 classes and have been attached to the corresponding genes as user comments in TriTrypDB. An Excel file contains this subset of the SAS predictions and a pdf file contains a summary of proposed alternative ATGs.

Class 1. An SAS downstream of the TrypDB-predicted start codon indicates that a subsequent in-frame ATG must form the N-terminus of the protein. This situation could not have been predicted from the sequence alone.

Class 2. Alternatively spliced mRNAs, some consistent with the originally predicted ATG and others only with a further downstream ATG, would generate alternative proteins differing at their N-termini.

Class 3. Some genes in TrypDB have in-frame ATG codons upstream of the assigned start codon. From the SAS data, I was able to determine the presumably correct upstream ATG codon for these.

Class 4. The existence of alternatively spliced mRNAs predicts the existence of two alternative forms of the encoded protein, because a longer UTR contained an in-frame ATG upstream of the originally predicted CDS start. There might be more of these and they could be interesting (there could be copies of proteins with and without a mitochondrial import sequence, for example).
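As a rough illustration of how the four classes relate SAS positions to candidate start codons, here is a minimal Python sketch. The function name `classify_start`, its arguments and the simplified coordinate model (positions on the coding strand, relative to the TrypDB-assigned ATG at 0, with the caller supplying only in-frame ATGs) are assumptions made for illustration. This is not the procedure that was used to generate the files.

```python
from bisect import bisect_left

def classify_start(sas_positions, inframe_atgs):
    """Return a hypothetical class label for one gene.

    sas_positions: SAS coordinates relative to the assigned ATG (0);
                   negative values lie upstream of it.
    inframe_atgs:  coordinates of in-frame ATG codons, same convention.
                   The assigned ATG (0) is added automatically.
    """
    atgs = sorted(set(inframe_atgs) | {0})
    starts = set()
    for sas in sas_positions:
        # First in-frame ATG at or downstream of this splice site,
        # i.e. the most upstream candidate start present in that mRNA.
        i = bisect_left(atgs, sas)
        if i < len(atgs):
            starts.add(atgs[i])

    has_up = any(s < 0 for s in starts)
    has_assigned = 0 in starts
    has_down = any(s > 0 for s in starts)

    if has_down and not has_up:
        return "class 2" if has_assigned else "class 1"
    if has_up and not has_down:
        return "class 4" if has_assigned else "class 3"
    if has_assigned and not has_up and not has_down:
        return "assigned start supported"
    return "unclassified"
```

For example, under these assumptions a gene whose only SAS lies at +30 and whose next in-frame ATG lies at +90 would be labelled class 1, because no mRNA contains the assigned ATG.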

The entire TrypDB-derived set of genes for which we did not have SAS data was screened for the existence of upstream in-frame ATGs. These are listed in the pdf summary file with the caveat that we cannot be sure if these ATGs are within the mRNA because we do not have splice-site data for them.
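A screen of this kind can be sketched as follows. This is a minimal example, assuming the input is the coding-strand sequence immediately upstream of the assigned ATG; the function name `upstream_inframe_atgs` and the choice to stop at the first in-frame stop codon are illustrative assumptions, not a description of the screen that was actually run.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def upstream_inframe_atgs(upstream_seq):
    """List in-frame ATGs upstream of the assigned start codon.

    upstream_seq: coding-strand sequence ending immediately before the
                  assigned ATG.
    Returns offsets relative to the assigned ATG (so -9 means the ATG
    begins 9 nt upstream), nearest first.
    """
    seq = upstream_seq.upper()
    hits = []
    # Walk codon by codon away from the assigned start, staying in frame.
    for end in range(len(seq), 2, -3):
        codon = seq[end - 3:end]
        if codon in STOP_CODONS:
            break  # an in-frame stop closes the upstream reading frame
        if codon == "ATG":
            hits.append(end - 3 - len(seq))
    return hits
```

As the caveat above says, an ATG found this way is only a candidate: without splice-site data there is no way to tell from the sequence whether it lies within the mRNA.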

Polyadenylation Site (PAS) predictions

An Excel file contains comprehensive data on 10,863 potential PAS for 5,948 genes identified by RNA sequence tags that terminated in at least 8 A residues and had 15-60 nt of non-A sequence. Many of the possible PAS in this data set are single hits, although all are unique hits in the sense that the number of hits does not exceed the number of reiterated genes, where the genes are reiterated. You can sort and select the data according to abundance, UTR length, etc. You might find some of these predictions useful for particular genes, but they need to be confirmed if they are important for you.
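The tag-selection criteria above can be illustrated with a short Python sketch. It assumes that "non-A sequence" means everything preceding the terminal run of A residues; the function name `split_poly_a_tag` and its parameters are hypothetical and only restate the thresholds given in the text.

```python
def split_poly_a_tag(read, min_tail=8, min_body=15, max_body=60):
    """Return the part of a sequence tag before its terminal A run,
    or None if the tag does not meet the thresholds.

    A tag qualifies if it ends in at least `min_tail` A residues and the
    sequence before that run is `min_body` to `max_body` nt long.
    """
    read = read.upper()
    body = read.rstrip("A")            # strip the terminal run of A residues
    tail_len = len(read) - len(body)
    if tail_len >= min_tail and min_body <= len(body) <= max_body:
        return body
    return None
```

The returned body is the piece that would then be mapped to the genome to place the candidate PAS.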

 
