After completing this chapter you should be able to...
- Describe biodiversity as the equilibrium of naturally fluctuating rates of extinction and speciation.
- Summarize potential causes of mass extinctions and the associated biodiversity loss.
- Compare the present rate of extinction to the background extinction rate.
- Describe the causes and consequences of the sixth mass extinction.
- Describe how the loss of biodiversity is measured.
- Distinguish among the Red List categories.
- Name, define, and provide examples of the five major threats to biodiversity.
- Provide examples of the successes and failures of biological control in regulating invasive species.
Five mass extinctions with losses of more than 50 percent of extant species are observable in the fossil record. The sixth mass extinction is currently in progress with present-day extinction rates much greater than the background extinction rate.
The core threats to biodiversity are human population growth and unsustainable resource use. These drive the five major threats: habitat loss, overexploitation, pollution, invasive species, and climate change. Habitat loss occurs through deforestation, damming of rivers, and other activities. Water and air pollution introduce toxic substances into the environment that harm plants and animals. Overexploitation is a threat particularly to aquatic species, but the poaching of terrestrial animals and overcollection of plants and fungi also put species at risk. Invasive species have been the cause of a number of extinctions and are especially damaging to islands and lakes. Climate change is forcing range changes that can lead to extinction. It is also disrupting adaptations to the timing of resource availability, which negatively affects species in seasonal environments. Climate change will also raise sea levels, eliminating some islands and reducing the area of all others.
3.2.9: Review - Biology
Instructor: Professor Junping Shi
Chapter 1: Derivation of reaction-diffusion equations (18 pages)
Chapter 2: Diffusion equation on a bounded domain (22 pages)
Chapter 3: Diffusion with point source
Chapter 4: Nonlinear scalar reaction-diffusion equations
Chapter 5: Reaction diffusion systems
1/19 Introduction Lecture 1
1/24 Review of calculus, review of basic population models Lecture 2
1/26 Nondimensionalization, derivation of reaction-diffusion models Lecture 3
2/9 Brownian motion and diffusion (Paul Dunlap)
2/16 Fisher's 1937 paper (Ryan Carpenter)
2/21 Nonlinear diffusion equation (Daniel Grady)
2/23 Skellam's 1953 paper (Patrick Lucey)
2/28 An approximate solution for Fisher equation in high dimension (Fumie Hirata)
2/28 Exact traveling wave solution for diffusion equation with Allee effect (Tina Little)
3/21 Evolution and dispersal of genes, Fisher equation Lecture 4 (powerpoint)
3/30 Stability in dynamical systems and Turing instability Lecture 5
4/4 Turing instability and bifurcation Lecture 6 Turing Patterns in animal coats
4/6 Chemotaxis and slime mold model Lecture 7
4/11 A reaction-advection-diffusion equation from chaotic chemical mixing Lecture 8
4/13 Periodic solutions of systems Lecture 9
([S] is Shi's lecture notes, [B] is Britton's book)
Homework 1 (due 1/31, Tuesday): [B] page 15 (1.6, 1.8), [S] Chapter 1 (1,2,3,4,10)
Homework 2 (due 2/8, Wednesday): [B] page 153 (5.5,5.7) [S] Chapter 1 (8,9,13) Chapter 2 (2,4,5)
Homework 3 (due 2/15, Wednesday): [S] Chapter 2 (7(you can use result of 6), 12,14)
Homework 4 (due 2/22, Wednesday): [B] page 158 (5.9, 5.11) and the following problems:
(1) Find an exact traveling wave of equation u_t=D u_
(Hint: use Maple, and the form of the solution is u(x,t)=v(x-ct), v(z)=1/(1+exp(az))^{2/m})
(2) (optional, I don't know the answer or if an answer exists)
(Hint: use Maple, and maybe also the same form as above, but I don't have solution for this one)
Homework 5 (due 3/29 Wednesday)
(1) [S] Chapter 4 (5) You do not need to write the actual Maple code, but write the "pseudo-code" that reflects the Robin boundary conditions.
(2) [S] Chapter 4 (11) (Hint: use Maple to solve for u_2)
Homework 1: answer of Homework 1 (prob 1-3), and solve differential equations
Fourier series of a solution of diffusion equation: Demonstrate the smoothing effect of diffusion
Differential equations: Demonstrate how to solve initial value problem, boundary value problem of ODE, and PDE
Boundary conditions and smoothing effect: Demonstrate effect of different boundary conditions, and smoothing effect of diffusion
Chemical problem: show how to solve the chemical mixing problem
Robin boundary condition: calculation of Robin boundary eigenvalues, and critical patch size
Diffusive Malthus model: show the effect of different growth rate on the fate of population which lives in a bounded region
Patterns of eigenfunction in 2-d: spatial patterns of eigenfunctions of Laplacian on a square
Diffusion with a point source: simulation of the fundamental solutions in 1-d and 2-d
Diffusion with a continuous source: simulation of solution of diffusion equation on a half line with fixed value at x=0
Fuel spill problem: solve the fuel spill problem in Section 3.3
Muskrat dispersal: use data fitting function to match the muskrat population growth
Traveling wave of Fisher equation: show an exact traveling wave solution of Fisher equation
Traveling wave of generalized Fisher equation: calculate an exact traveling wave solution of generalized Fisher equation
Self-similar solution of diffusion equation: calculate the self-similar solutions of linear and nonlinear diffusion equations
Gypsy-moth problem: calculate the invasion speed of gypsy moth in the eastern United States
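As a small illustration of the traveling-wave entries in the list above, the following Python sketch checks numerically that a profile of the form v(z) = 1/(1+exp(az))^2 satisfies the traveling-wave equation of the Fisher equation. It assumes the nondimensional form u_t = u_xx + u(1-u) (D = 1), which is not spelled out on this page, and uses the classical values a = 1/sqrt(6) and c = 5/sqrt(6). The function names are illustrative and are not taken from the course programs.

```python
import math

A = 1.0 / math.sqrt(6.0)   # steepness a in v(z) = 1/(1 + exp(a z))^2
C = 5.0 / math.sqrt(6.0)   # wave speed c of the exact solution

def v(z):
    """Candidate wave profile, with u(x, t) = v(x - c t)."""
    return 1.0 / (1.0 + math.exp(A * z)) ** 2

def residual(z, h=1e-3):
    """Residual of v'' + c v' + v (1 - v) = 0, using central differences."""
    d1 = (v(z + h) - v(z - h)) / (2.0 * h)
    d2 = (v(z + h) - 2.0 * v(z) + v(z - h)) / (h * h)
    return d2 + C * d1 + v(z) * (1.0 - v(z))

# Largest residual on a grid of z values in [-20, 20].
max_residual = max(abs(residual(k / 10.0)) for k in range(-200, 201))
print("max |residual| =", max_residual)
```

The residual should be close to zero (limited only by the finite-difference error), and the profile connects the state u = 1 far behind the front to u = 0 far ahead of it.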
Programs by Marcus Garvie (Florida State University)
Maple and Matlab Programs
3-d graphing: Demonstrate Maple commands for 3-d graphing
Difference equations for gene evolution prog1 prog2
Matlab programs simulating R-D equations and systems:
Reference Books in Mathematical Biology
- Mathematical Biology, Vol. 1: An Introduction. By James Dickson Murray, Springer-Verlag, New York, (2002).
- Mathematical Biology, Vol. 2: Spatial Models and Biomedical Applications. By James Dickson Murray, Springer-Verlag, New York, (2002).
- Mathematical Models in Biology. By Leah Edelstein-Keshet, McGraw-Hill, Boston, (1988). SIAM, (2005).
- Elements of Mathematical Ecology. By Mark Kot, Cambridge University Press, (2001).
- Diffusion and Ecological Problems: Modern Perspectives. By Akira Okubo, Simon A. Levin, Springer-Verlag, New York, (2001).
- Quantitative Analysis of Movement: Measuring and Modeling Population Redistribution in Animals and Plants. By Peter Turchin, Sinauer Associates, Inc, (1998).
Modeling of Biological Systems, A Workshop at the National Science Foundation in 1996
Why Is Mathematical Biology So Hard? Michael C. Reed, Notices of American Mathematical Society, March, 2004.
Mathematical Challenges in Spatial Ecology Claudia Neuhauser, Notices of American Mathematical Society, Dec. 2001.
We Got Rhythm: Dynamical Systems of the Nervous System Nancy Kopell, Notices of American Mathematical Society, Jan. 2000.
Getting Started in Mathematical Biology Frank Hoppensteadt, Notices of American Mathematical Society, Sept. 1995.
Some Advice to Young Mathematical Biologists Kenneth Lange, (from internet), date unknown.
How the leopard gets its spots? James Murray, Scientific American, 258(3): 80-87, 1988.
2. Manuscript length
Research articles and Tools and Resources articles: the total length of the article should not exceed 8000 words, including the main text and figure legends, but not the title page, abstract, materials and methods section or reference list (reference citations in the text do count towards the word limit). The total number of display items (Figures and Tables) must not exceed eight (8).
Short reports: the total length of the article should not exceed 3000 words, including the main text and figure legends, but not the title page, abstract, materials and methods section or references (reference citations in the text do count towards the word limit). The total number of display items (Figures and Tables) must not exceed four (4).
Note that final word limits will depend on the paper submitted and are at the discretion of the Editors.
Results and discussion
In the subsequent sections, we demonstrate EVM as an automated gene structure annotation tool using rice and human genome sequences and related evidence. First, using the rice genome, we develop the concepts that underlie the algorithm of EVM as a tool that incorporates weighted evidence into consensus gene structure predictions. We then turn our attention to the human genome, in which we examine the role of EVM in concert with PASA to annotate protein-coding genes and alternatively spliced isoforms automatically. In each scenario, we include comparisons with alternative annotation methods.
Evaluation of ab initio gene prediction in rice
The prediction accuracy for each of the three programs Fgenesh, GlimmerHMM, and GeneMark.hmm was evaluated using a set of 1,058 cDNA-verified reference gene structures. All three were nearly equivalent in both their exon prediction accuracy (about 78% exon sensitivity [eSn] and 72% to 79% exon specificity [eSp]) and complete gene prediction accuracy (22% to 25% gene sensitivity [gSn] and 15% to 21% gene specificity [gSp]; Figure 1). The breakdown of prediction accuracy by each of the four exon types indicates that all gene predictors excel at predicting internal exons correctly (about 85% eSn) while predicting initial, terminal, and single exons less accurately (44% to 68% eSn; Figure 2).
Rice ab initio gene prediction accuracies. Gene prediction accuracies are shown for GeneMark.hmm, Fgenesh, and GlimmerHMM ab initio gene predictions based on an evaluation of 1,058 cDNA-verified reference rice gene structures. The accuracy of EVidenceModeler (EVM) consensus predictions from combining all three ab initio predictions using equal weightings (weight = 1 for each) is also provided.
Ab initio prediction sensitivity by exon type. Individual ab initio exon prediction sensitivities based on comparisons with 1,058 reference rice gene structures are shown for each of the four exon types: initial, internal, terminal, and single. Results are additionally shown for EVidenceModeler (EVM) consensus predictions where the ab initio predictions were combined using equal weights.
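The sensitivity and specificity figures quoted above can be illustrated with a short Python sketch. It assumes the usual gene-finding convention, which this excerpt does not define explicitly: sensitivity is the fraction of reference features predicted exactly, and specificity is the fraction of predicted features that are correct. The coordinates and the `sn_sp` helper are made up for illustration.

```python
def sn_sp(predicted, reference):
    """Return (sensitivity, specificity) for two collections of features.

    Sensitivity: fraction of reference features that were predicted exactly.
    Specificity: fraction of predicted features that are correct
    (assumed gene-finding convention).
    """
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    sn = true_pos / len(reference) if reference else 0.0
    sp = true_pos / len(predicted) if predicted else 0.0
    return sn, sp

# Toy data: an exon is a (start, end) pair; a gene is a tuple of exons.
ref_genes = [((100, 200), (300, 400)), ((1000, 1200),)]
pred_genes = [((100, 200), (300, 400)), ((1000, 1150),), ((2000, 2100),)]

ref_exons = {exon for gene in ref_genes for exon in gene}
pred_exons = {exon for gene in pred_genes for exon in gene}

exon_sn, exon_sp = sn_sp(pred_exons, ref_exons)   # exon level (eSn, eSp)
gene_sn, gene_sp = sn_sp(pred_genes, ref_genes)   # complete-gene level (gSn, gSp)
print(exon_sn, exon_sp, gene_sn, gene_sp)
```

A gene counts as correct only when every exon matches, which is why complete-gene accuracy is much lower than exon accuracy in this toy case, as it is in the results above.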
Although each gene predictor exhibits a similar level of accuracy, they differ greatly in the individual gene structures they each predict correctly. The Venn diagrams provided in Figure 3 reveal the variability among genes and exons predicted correctly by the three programs. Although each program predicts up to 25% of the reference genes perfectly, only about a quarter of these (6.2%) were identified by all three programs simultaneously. It is also notable that more than half (54%) of the cDNA-verified genes are not predicted correctly by any of the gene predictors evaluated. At the individual exon level, there is much more agreement among predictions, with 60.5% of the exons correctly predicted by all three programs. Only 7.1% of exons are not predicted correctly by any of the three programs. The Venn diagrams indicate much greater overall consistency among internal exon predictions, correlated with the inherently high internal exon prediction accuracy, as compared with the greater variability and decreased prediction accuracy among other exon types. A relatively higher proportion of the single (22.1%), initial (14.4%), and terminal (13.9%) exon types found in our reference genes are completely absent from the set of predicted exons.
Venn diagrams contrasting correctly predicted rice gene structure components by ab initio gene finders. Percentages are shown for the fraction of 1,058 cDNA-verified rice genes and gene structure components that were predicted correctly by each ab initio gene predictor. The cDNA-verified gene structure components consist of 7,438 total exons: 86 single, 5,408 internal, 972 initial, and 972 terminal.
Consensus ab initio exon prediction accuracy
Although there is considerable disagreement among exon calls between the various gene predictors, when multiple programs call exons identically they tend more frequently to be correct. Figure 4 shows that by restricting the analysis to only those exons that are predicted identically by two programs, exon prediction specificity jumps to 94% correct, regardless of the two programs chosen. Exon prediction specificity improves to 97% if we consider only those exons that are predicted identically by all three programs. Note that although the specificity improves to near-perfect accuracy, the prediction sensitivity drops from 78% to 60%. Although we cannot rely on shared exons to predict all genes correctly, we can in this circumstance trust those that are shared with greater confidence. EVM uses this increased specificity provided by consensus agreement among evidence for gene structure components and reports these specific components as part of larger complete gene structures; at the same time, EVM uses other lines of evidence to retain a high level of sensitivity.
Exon prediction accuracy limited to consensus complete exon calls. Exon sensitivity (eSn) and exon specificity (eSp) were determined by comparing ab initio predicted exons. Exons were restricted to those perfectly agreed upon by either two or three different gene predictors. Only those predicted exons found within 500 base pairs flanking the 1,058 reference gene structures were considered for the specificity calculations.
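The trade-off described above, where specificity rises and sensitivity falls as agreement between predictors is required, can be sketched in Python. The predictor names and coordinates are invented. The sketch counts exons called by at least `min_support` programs, whereas the analysis above examines specific pairs of programs and restricts predictions to the 500 base pair flanks, which is not modeled here.

```python
from collections import Counter

def consensus_exons(predictions, min_support):
    """Exons predicted identically by at least `min_support` programs."""
    counts = Counter(
        exon for exons in predictions.values() for exon in set(exons)
    )
    return {exon for exon, n in counts.items() if n >= min_support}

# Hypothetical exon calls, as (start, end) pairs, from three predictors.
predictions = {
    "progA": [(100, 200), (300, 400), (500, 650)],
    "progB": [(100, 200), (300, 400), (500, 600)],
    "progC": [(100, 200), (310, 400), (500, 600)],
}
reference = {(100, 200), (300, 400), (500, 600), (700, 800)}

results = {}
for k in (1, 2, 3):
    called = consensus_exons(predictions, k)
    correct = called & reference
    sensitivity = len(correct) / len(reference)
    specificity = len(correct) / len(called)
    results[k] = (sensitivity, specificity)
    print(k, sensitivity, specificity)
```

In this toy data, requiring two agreeing programs removes the incorrect calls, while requiring all three also discards correct exons and lowers sensitivity.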
Consensus gene prediction by EVM
Unlike conventional ab initio gene predictors that use only the composition of the genome sequence, EVM constructs gene structures by combining evidence derived from secondary sources, including multiple ab initio gene predictors and various forms of sequence homologies. In brief, EVM decomposes multiple gene predictions and spliced protein and transcript alignments into a set of nonredundant gene structure components: exons and introns. Each exon and intron is scored based on the weight (associated numerical value) and abundance of the supporting evidence; genomic regions corresponding to predicted intergenic locations are also scored accordingly. The exons and introns are used to form a graph, and the highest scoring path through the graph is used to create a set of gene structures and corresponding intergenic regions (Figure 5; see Materials and methods, below, for complete details). Because of the scoring system employed by EVM, gene structures with minor differences, such as small variations at intron boundaries, can yield vastly different scores. For example, a cDNA-supported intron that is only three nucleotides offset from an ab initio predicted intron could be scored extraordinarily high as compared with the predicted intron, although they differ only slightly in content. Likewise, an intron that is fully supported by multiple spliced protein alignments will be scored higher than an alternate intron of similar length yielded by only a single similarly weighted protein alignment. In this way, EVM uses the abundance and weight of the various evidence to score gene structure components appropriately to promote their selection within the resulting weighted consensus genome annotation.
Consensus Gene Structure Prediction by EVM. The main aspects of the EVidenceModeler (EVM) weighted consensus prediction generating algorithm are depicted here, exemplified with a 7 kilobase region of the rice genome. The top view illustrates a genome browser-style view, showing the ab initio gene predictions GlimmerHMM, Fgenesh, and GeneMark.hmm, AAT-gap2 spliced alignments of other plant expressed sequence tags (ESTs), Program to Assemble Spliced Alignments (PASA) assemblies of rice EST and full-length cDNA (FL-cDNA) alignments, AAT-nap spliced alignments of nonrice proteins, and GeneWise protein homology-based predictions. Top strand and bottom strand evidence are separated by the sequence ticker. Evidence is dismantled into candidate introns and exons; candidate exons are shown in the context of the six possible reading frames at the figure bottom. A coding, intron, and intergenic score vector are shown; feature-specific scores (see Materials and methods) were added to corresponding vectors here for illustration purposes only, and note that all introns have feature-specific scores. The selection of exons, introns, and intergenic regions that define the highest scoring path is shown by the connections between exon features within the six-frame feature partition. This highest scoring path yields two complete gene structures, shown as an EVM tier at top, corresponding to the known rice genes (left) LOC_Os03g15860 (peroxisomal membrane carrier protein) and (right) LOC_Os03g15870 (50S ribosomal protein L4, chloroplast precursor).
To demonstrate the simplest application of EVM, we combine only the three ab initio gene predictions and weight each prediction type equally. Figures 1 and 2 display the results in comparison with the ab initio prediction accuracies; we demonstrate that, by incorporating shared exons and introns into consensus gene structures, complete gene prediction accuracy is improved by at least 10%. Exon prediction accuracy is increased by about 6%, and exon prediction accuracies for each exon type are mostly improved, with the exception of the initial exon type, for which GeneMark.hmm alone is slightly superior.
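The idea of scoring components by the weight and abundance of their supporting evidence, then choosing a highest scoring path, can be sketched in Python. This is a deliberately reduced illustration and not EVM's actual algorithm: it scores only exons, ignores introns, reading frames, and intergenic scores, and replaces the graph search with a simple dynamic program over non-overlapping features. The evidence names, weights, and coordinates are all hypothetical.

```python
from collections import defaultdict

def score_features(evidence, weights):
    """Sum the weight of every evidence source supporting each exact feature."""
    scores = defaultdict(float)
    for source, features in evidence.items():
        for feature in features:
            scores[feature] += weights[source]
    return dict(scores)

def best_chain(scores):
    """Highest-scoring chain of non-overlapping (start, end) features."""
    feats = sorted(scores, key=lambda f: f[1])
    best = []  # best[i] = (total score, chain) for the best chain ending at feats[i]
    for i, feat in enumerate(feats):
        top_score, top_chain = 0.0, []
        for j in range(i):
            # feats[j] may precede feat only if it ends before feat starts.
            if feats[j][1] < feat[0] and best[j][0] > top_score:
                top_score, top_chain = best[j]
        best.append((top_score + scores[feat], top_chain + [feat]))
    return max(best, key=lambda b: b[0]) if best else (0.0, [])

# Hypothetical weights: transcript evidence is trusted far more than predictions.
weights = {"predictorA": 1.0, "predictorB": 1.0, "cdna": 10.0}
evidence = {
    "predictorA": [(100, 200), (300, 400)],
    "predictorB": [(100, 200), (303, 400)],
    "cdna": [(303, 400)],
}

scores = score_features(evidence, weights)
total, chain = best_chain(scores)
print(total, chain)
```

In this toy case the exon starting at 303 differs from the competing call by only three nucleotides, yet its cDNA support gives it a much larger score, so it is the one selected in the final chain.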
Consensus gene prediction accuracy using varied evidence types and associated weights
A gene structure consensus as computed by EVM is based on the types of evidence available and their corresponding weight values. In the example above, each evidence type provided in the form of ab initio gene predictions was weighted identically. In the case where each prediction type is equivalent in accuracy, this may be sufficient, but when some evidence types are more accurate, higher weights applied to that evidence are expected to drive the consensus toward higher prediction accuracy. Figure 6 illustrates the impact of varied weight combinations and sources of evidence on exon and complete gene structure prediction sensitivity. In the first set (iterations 1 to 10), only the three ab initio gene predictions are combined using random weightings. Prediction accuracy ranges from 22% to 38% gSn and 77% to 84% eSn. In the second set (iterations 11 to 20), sequence homologies are additionally included in the form of spliced protein alignments (using nap of AAT), spliced alignments of ESTs derived from other plants (using gap2 of AAT), and GeneWise protein-homology-based gene predictions. There, complete prediction accuracy ranges from 44% to 62% gSn and 88% to 92% eSn. In the third and final set (iterations 21 to 30), PASA alignment assemblies derived from rice transcript alignments were included, a subset of which defines the correct gene structure. In the presence of our best evidence and randomly set weights, prediction accuracy ranges from 75% to 96% gSn and 95% to 99% eSn.
Response of EVM prediction accuracy to varied evidence types and weights. Iterations (30) of randomly weighted evidence types were evaluated by EVidenceModeler (EVM). Iterations 1 to 10 included only the ab initio predictors GlimmerHMM, Fgenesh, and GeneMark.hmm. Iterations 11 to 20 additionally included AAT-nap alignments of nonrice proteins, GeneWise predictions based on nonrice protein homologies, and AAT-gap2 alignments of other plant expressed sequence tags. Iterations 21 to 30 included Program to Assemble Spliced Alignments (PASA) alignment assemblies and corresponding supplement of PASA long-open reading frame (ORF)-based terminal exons. Exon and complete gene prediction sensitivity values resulting from EVM using the corresponding weight combinations are plotted below.
Although this represents just a minute number of possible random weight combinations, it demonstrates the effect of the weight settings and the inclusion of different evidence types on our consensus prediction accuracy. By including evidence based on sequence homology, our prediction accuracy improves greatly, doubling to tripling complete gene prediction accuracy of ab initio programs alone or in combination. Also, very different weight settings can still lead to similar levels of performance, particularly in the presence of sequence homology data.
EVM consensus prediction accuracy using trained evidence weights
Given the variability in consensus gene prediction accuracy observed using different combinations of weight values, finding the single combination of weights that provides the best consensus prediction accuracy is an important goal. Searching all possible weight combinations to find the single best scoring combination is not tractable, given the computational effort needed to explore such a vast search space. To estimate a set of high scoring weights, we employed a set of heuristics that use random weight combinations followed by gradient ascent (see Materials and methods, below). For the purpose of choosing high performing weights and evaluating their accuracy, we selected 1,000 of our cDNA-verified gene structures and used half for estimating weights and the other half for evaluating accuracy using these weights (henceforth termed 'trained weights'). In both the training and evaluation process, accuracy statistics were limited to each reference gene and flanking 500 base pairs (bp). However, EVM was applied to regions of the rice genome including the 30 kilobase (kb) region flanking each reference gene, to emulate gene prediction by EVM in a larger genomic context.
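The general shape of such a heuristic, random weight combinations followed by local improvement, can be sketched in Python. The sketch makes several assumptions: `toy_accuracy` is a stand-in objective invented for illustration (a real run would score EVM predictions against the training genes), and the local step is a simple coordinate-wise hill climb rather than the gradient ascent procedure described in Materials and methods.

```python
import random

def toy_accuracy(weights):
    """Hypothetical stand-in objective: peaks when weights match a hidden target."""
    target = (0.2, 0.5, 1.0)
    return 1.0 - sum((w - t) ** 2 for w, t in zip(weights, target))

def hill_climb(weights, objective, step=0.05, max_rounds=200):
    """Nudge one weight at a time, keeping any change that improves the objective."""
    weights = list(weights)
    best = objective(weights)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(weights)):
            for delta in (step, -step):
                trial = weights[:]
                trial[i] = min(1.0, max(0.0, trial[i] + delta))
                score = objective(trial)
                if score > best:
                    weights, best, improved = trial, score, True
        if not improved:
            break
    return weights, best

def train_weights(objective, n_types=3, n_random=20, seed=0):
    """Pick the best of several random weight vectors, then refine it locally."""
    rng = random.Random(seed)
    starts = [[rng.random() for _ in range(n_types)] for _ in range(n_random)]
    start = max(starts, key=objective)
    return hill_climb(start, objective)

trained, score = train_weights(toy_accuracy)
print(trained, score)
```

Because the starting points are random, different seeds can end at different weight vectors with similar scores.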
Because the training of EVM is not deterministic, and each attempt at training can result in a different set of high-scoring weights, we performed the process of training and evaluating EVM on the rice datasets three times separately. The trained weight values computed by each training process are provided in Additional data file 2 (Table S1), and the consensus gene prediction accuracy yielded during each evaluation is provided in Additional data file 2 (Table S2). The average gene prediction accuracy is provided in Figure 7. On this set of 500 reference genes, the average exon and complete gene prediction accuracies for the ab initio predictors are similar to those computed earlier for the larger complete set of 1,058 cDNA-verified genes. EVM applied to the ab initio predictions alone using optimized weights yielded 38% gSn and 34% gSp, approximately 10% better than the best corresponding ab initio accuracy. By including the additional evidence types in the form of protein or EST homologies independently, complete gene prediction sensitivity increases to 49% to 56% gSn and 44% to 50% gSp. Using all evidence minus the PASA data, complete gene sensitivity reaches 62% gSn and 56% gSp. Note that each gain in sensitivity is accompanied by a gain in specificity, indicating overall improvements in gene prediction accuracy.
Rice consensus gene prediction accuracy using optimized evidence weights. Gene prediction accuracy for EVidenceModeler (EVM) was calculated at the nucleotide, exon, and complete gene level using trained weights and specific sets of evidence, applied to 500 of the reference rice gene structures. The evidence evaluated is described as follows: EVM:GF includes ab initio gene predictions (GF) alone; EVM:GF+gap2 includes GF plus the AAT-gap2 alignments of other plant expressed sequence tags (gap2); EVM:GF+nap includes GF plus AAT-nap alignments of nonrice proteins (nap); EVM:GF+GeneWise includes GF plus the GeneWise predictions based on nonrice protein homologies (GeneWise); EVM:ALL(-PASA) includes GF, nap, gap2, and GeneWise; EVM:ALL(+PASA) additionally includes the Program to Assemble Spliced Alignments (PASA) alignment assemblies and PASA long-open reading frame (ORF)-based terminal exon supplement. Sn, sensitivity; Sp, specificity.
Intuitive versus trained weights
Although we can computationally address the problem of finding a set of weights that yield optimal performance, it is clear from our analysis of randomly selected weights that there could be numerous weight combinations that provide reasonable accuracy. In general, we find that combinations of assigned weightings in the following form provide adequate consensus prediction accuracy:
(ab initio predictions) ≤ (protein alignments, EST alignments) < (GeneWise) < (PASA)
Using such a weight combination (gene predictions = 0.3, proteins and other plant ESTs = 1, GeneWise = 5, PASA = 10), we find that our consensus exon and complete gene prediction accuracy is quite comparable, with our intuitive weights providing performance levels that in most cases are just slightly lower than those of our trained weights (Additional data file 1 [Figure S1]). In each case, accuracy measurements with intuitive weight settings were within 3% of the results from trained weights. The ability to tune EVM's evidence weights intuitively provides a flexibility that is not as easily afforded by current software systems based on a strict probabilistic framework.
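The intuitive weighting above can be written down as a small Python mapping together with a check of the stated ordering. The numeric values are those quoted in the text; the dictionary keys and the helper function are hypothetical labels chosen for this sketch and do not reproduce EVM's configuration format.

```python
# Weights quoted in the text; the key names are illustrative only.
intuitive_weights = {
    "ab_initio": 0.3,
    "protein_alignment": 1.0,
    "est_alignment": 1.0,
    "genewise": 5.0,
    "pasa": 10.0,
}

def follows_intuitive_order(weights):
    """Check (ab initio) <= (protein, EST alignments) < (GeneWise) < (PASA)."""
    alignments = (weights["protein_alignment"], weights["est_alignment"])
    return (
        weights["ab_initio"] <= min(alignments)
        and max(alignments) < weights["genewise"] < weights["pasa"]
    )

print(follows_intuitive_order(intuitive_weights))
```

A check like this can be used to screen hand-chosen weights before running a more expensive evaluation.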
EVM versus alternative annotation tools: Glean and JIGSAW
The accuracy of EVM was compared with that of competing combiner-type automated annotation tools using both Glean and JIGSAW. The publicly available Glean and JIGSAW software distributions were downloaded and run using default parameter settings. We trained JIGSAW using datasets identical to those provided to EVM, using the 500 reference genes and associated evidence for training and the separate 500 genes and evidence for evaluation. Glean's unsupervised training is tightly coupled to the prediction algorithm, and so Glean was executed on the entire set of 1,000 genes and associated evidence, with the proper half used for evaluation purposes. Exon and complete gene prediction accuracies are shown in Figure 8. Each evidence combiner demonstrates substantial improvements in accuracy in the presence of sequence homology evidence. EVM fares well in this combiner showdown, and in most cases it provides the greatest prediction accuracy of the three tools analyzed.
EVM's accuracy compared with Glean and JIGSAW. Both JIGSAW and Glean were trained and evaluated on the rice genome data, and accuracies were compared with those of EVidenceModeler (EVM). The trained weights utilized by EVM are provided in Additional data file 2 (Table S3). PASA, Program to Assemble Spliced Alignments; Sn, sensitivity; Sp, specificity.
The prediction accuracy between JIGSAW and EVM is strikingly similar for two of the evidence combining scenarios examined: combining gene predictions with other plant EST alignments (gap2), and when all alignment data are included minus the rice PASA evidence (all). We further examined the latter case, in which both JIGSAW and EVM predicted more than 60% of the complete genes accurately, to determine the similarity of their gene predictions. Of the 500 reference genes tested, there are 310 predictions generated identically between EVM and JIGSAW, of which 260 were correct. Therefore, although their prediction accuracies can be strikingly similar, overall the gene structures predicted are quite different.
A strength of EVM is its ability to utilize heavily trusted forms of evidence, such as gene structures inferred from alignments of cognate FL-cDNAs and ESTs. Each of the three programs was trained in the presence of cDNA-supported gene structures as provided by PASA (long open reading frame [ORF] structures within PASA alignment assemblies), a subset of which defines a correct gene structure (see Materials and methods, below). All three tools demonstrated the greatest prediction accuracy in the presence of PASA evidence. Although each tool is effectively provided with evidence containing all complete introns and exons that define the correct gene structure, only EVM is found to be capable of nearly perfect prediction accuracy. Of the 500 evaluated reference genes, EVM predicted only six incorrectly when supplied with PASA evidence along with the competing evidence types (ab initio predictions, and protein and other plant EST alignments). These six incorrect predictions involved three cases in which neighboring genes were merged into single predictions, two cases in which improper gene termini were chosen, and a single case that was confounded by a large degenerate retrotransposon insertion within an intron of a gene, an element that was not masked and excluded from the gene prediction effort.
Comparison with manual annotation
It is expected and reassuring that EVM provides nearly perfect complete gene accuracy in the presence of high quality and reliable complete gene structure data, as provided in the form of the PASA alignment assemblies. The importance of such ESTs and FL-cDNAs for gene structure annotation is well known [42–45], and software such as PASA can annotate gene structures based solely on these data in the absence of pre-existing gene annotations or ab initio gene predictions. A greater challenge is to achieve maximal consensus gene prediction accuracy in the absence of these data, which is the typical scenario with newly sequenced genomes that lack extensive EST or FL-cDNA sequences as companion resources. In such cases we must rely on the accuracy of ab initio gene predictors and homologies to sequences from other organisms, and it is here that, in lieu of an equivalent automated annotation method, we expect to have the greatest gains from expert scientists directly evaluating and modeling complete gene structures based on these sources of evidence.
In our application of EVM thus far, the relevant set of input evidence is that which contains the ab initio gene predictions, protein alignments, GeneWise predictions based on protein homology, and the alignments to ESTs derived from other plants (Figure 7; entry 'EVM:ALL(-PASA)', read as EVM with all evidence minus PASA evidence). Using trained weights, EVM correctly predicted 92% of the known exons and 62% of the 500 cDNA-verified genes, on average. If the subset of the native cDNA data that defines the correct gene structure is not supplied as evidence, and if components of such known gene structures are not available as candidate introns and exons, then EVM will be unable to predict the gene correctly. In an effort to establish the upper limit of gene prediction accuracy in the absence of cDNA evidence, we propose use of the accuracy of manual annotation on the same dataset. The accuracy of human annotation has never been adequately measured, although it is widely assumed that human annotation is the 'gold standard' for genome projects. For our study, a set of human annotators was asked to evaluate these data in the absence of cognate rice cDNA alignments, and was instructed to model a gene structure manually that best reflected the available evidence. In the absence of the rice cDNAs, manual annotation accuracy resulted in 96% eSn and 96% eSp, and 81% gSn and 81% gSp (Figure 7). In light of these statistics, we consider the accuracy provided by EVM on the identical dataset to be demonstrably effective as an automated annotation system, and approaching the better accuracy obtained through manual curation efforts, particularly when compared with the accuracy of individual ab initio gene predictors on the same dataset.
Application of EVM and PASA to the ENCODE regions of the human genome
The ENCyclopedia of DNA Elements (ENCODE) project was initiated shortly after the sequencing of the human genome with the aim of identifying all functional elements, including all protein-coding genes, in the human genome sequence. The pilot phase of the project focused on only 1% (about 30 megabases spread across 44 regions) of the genome, termed the ENCODE regions. The GENCODE (encyclopedia of genes and gene variants) consortium was formed to provide high quality manual annotation and experimental verification of protein-coding genes in these regions. The human ENCODE Genome Annotation Assessment Project (EGASP) was established to evaluate the accuracy of automated genome annotation methods by comparing automated annotations of the ENCODE regions with the GENCODE annotations. Participants in the EGASP competition were allowed access to 13 ENCODE regions along with their corresponding GENCODE annotations, which could be used for training purposes. Groups submitted their automated annotations for the remaining 31 regions, after which time the corresponding GENCODE annotations were released and the automated annotation methods were evaluated based on a rigorous comparison with the GENCODE annotations.
The sequences, gene predictions, and annotations involved in EGASP additionally serve as a resource for evaluating current and future annotation methods. Similarly to our application of EVM to the rice genome using cDNA-verified gene structures for training and evaluation purposes, we applied EVM to the ENCODE regions using the GENCODE annotations for training and evaluation purposes, analogous to the original EGASP competition. Evidence used by EVM included the evidence tracks provided by University of California at Santa Cruz: TWINSCAN, SGP2, GENEID, GENSCAN, CCDSGene, KNOWNGene, ENSEMBL (ENSGene), and MGCGene. Additional evidence generated in our study included AAT alignments of nonhuman proteins, GeneWise predictions based on the nonhuman protein homologies, AAT nucleotide alignments of select animal gene indices, and PASA alignment assemblies generated from GMAP alignments of human ESTs and FL-cDNAs. The GlimmerHMM predictions used by EVM were those generated as part of the EGASP competition, and were obtained separately.
There are several notable differences between the training and evaluation of EVM on the ENCODE regions as compared with the earlier application to rice. The cDNA-verified rice genes used for training and evaluation were restricted to a single splicing isoform. In addition, each gene was complete, containing the protein-coding region from start to stop codon. The GENCODE protein-coding annotations, in contrast, include alternative splicing isoforms and several partial gene structures. Accuracy measurements computed for rice genes included each cDNA-verified gene and the flanking 500 bases, whereas accuracy measurements on the ENCODE regions included these sequence regions in their entirety and all corresponding protein-coding gene annotations.
EVM was trained on the 13 ENCODE training regions and then evaluated on the remaining 31 regions. Training and evaluation were performed under two independent trials. The trained weights and corresponding accuracy values are provided in Additional data file 2 (Tables S4 and S5). Our initial analysis of EVM on this dataset utilized the ab initio gene predictions, and the EST and protein homologies, similar to our earlier analysis with rice. The average gene prediction accuracy for the source predictions and for EVM with varied additional evidence is illustrated in Figure 9. The ab initio gene predictions used as evidence by EVM individually predict genes with accuracies mostly less than 20% gSn; the best individual performer was TWINSCAN, with 22% gSn and 20% gSp. By combining these predictions alone, EVM improves complete gene prediction accuracy to 31% gSn and 27% gSp, which is significantly better performance than any of the individual ab initio predictors. By including spliced alignments to dog, pig, mouse, or rat assembled EST databases, gene prediction accuracy further improves to 38% to 45% gSn and 34% to 40% gSp. EST alignments from the more distantly related chicken yield a slight improvement over using the predictions alone, but not to the extent of the mammalian alignments. Alignments to the more distantly related sea squirt and frog gene indexes offer little to no improvement in prediction accuracy. Overall, the improvements in EVM prediction accuracy afforded by alignments to the nonhuman gene indexes correlate well with their phylogenetic distance from human, with mouse and rat being found most useful. By including human EST and FL-cDNA alignments in the form of PASA alignment assemblies along with the ab initio predictions, gene prediction sensitivity improves to 63%.
Protein homologies included with ab initio predictions, in the form of AAT (nap) alignments or GeneWise predictions, also demonstrated an improvement in gene prediction accuracy, with 36% to 56% gSn and 30% to 44% gSp as compared with the 31% gSn and 27% gSp from combining the predictions alone.
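The gSn and gSp figures quoted above can be read as ratios of exactly matching gene structures. The following is a minimal sketch in Java, assuming the convention common in gene-prediction benchmarks (sensitivity = TP / (TP + FN), specificity = TP / (TP + FP)); the text itself does not spell out these formulas, and the class name, method names, and string encoding of gene structures are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class PredictionAccuracy {

    /** Number of reference structures reproduced exactly by the prediction (true positives). */
    static int truePositives(Set<String> reference, Set<String> predicted) {
        Set<String> shared = new HashSet<>(reference);
        shared.retainAll(predicted);
        return shared.size();
    }

    /** Assumed definition: TP / (TP + FN), the fraction of reference structures recovered. */
    static double sensitivity(Set<String> reference, Set<String> predicted) {
        return reference.isEmpty() ? 0.0
                : (double) truePositives(reference, predicted) / reference.size();
    }

    /** Assumed definition: TP / (TP + FP), the fraction of predictions that are correct. */
    static double specificity(Set<String> reference, Set<String> predicted) {
        return predicted.isEmpty() ? 0.0
                : (double) truePositives(reference, predicted) / predicted.size();
    }

    public static void main(String[] args) {
        // Each string stands for one complete gene structure (illustrative labels only).
        Set<String> reference = new HashSet<>(Arrays.asList("g1", "g2", "g3", "g4"));
        Set<String> predicted = new HashSet<>(Arrays.asList("g1", "g2", "g5"));
        System.out.println("gSn = " + sensitivity(reference, predicted));
        System.out.println("gSp = " + specificity(reference, predicted));
    }
}
```

The same functions apply at the exon level (eSn, eSp) if each string encodes one exon instead of one whole gene. How alternative isoforms and partial genes are counted in the published evaluation is not covered by this sketch.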
Human consensus gene prediction accuracy by EVM. The consensus gene prediction accuracy by EVidenceModeler (EVM) is shown based on trained evidence weights and the corresponding combination of evidence, as applied to the GENCODE test regions of the human genome. The accuracies for the input gene predictions obtained from the ENCODE Genome Annotation Assessment Project (EGASP) dataset are provided for reference, including GENSCAN, TWINSCAN, GlimmerHMM, GeneMark.hmm on the repeat-masked genome, GeneID, and SGPgene. EVM-GF corresponds to EVM applied to these gene prediction tiers alone (GF), and serves as the baseline evidence for the subsequent entries. EVM-GeneWise includes GeneWise predictions based on nonhuman protein homologies; EVM-nap includes AAT-nap spliced alignments of nonhuman proteins; the EVM:gap2_* series includes AAT-gap2 alignments of corresponding transcripts from the Dana Farber Gene Indices (CINGI, Ciona intestinalis [sea squirt]; XGI, Xenopus tropicalis [frog]; GGGI, Gallus gallus [chicken]; DOGGI, Canis familiaris [dog]; SSGI, Sus scrofa [pig]; RGI, rat; MGI, mouse); EVM-alignAsm includes Program to Assemble Spliced Alignments (PASA) alignment assemblies and the corresponding terminal exon supplement; and EVM:All includes all evidence described (GF, gap2, nap, GeneWise, and PASA). Sn, sensitivity; Sp, specificity.
Post-EVM application of PASA to annotate alternatively spliced isoforms
EVM is not designed to model alternative splicing isoforms directly. This is, however, a primary function of our companion annotation tool PASA, which contributes to the automated annotation of gene structures in several ways. PASA, like EVM, is made freely available as open source from the PASA website. Above, PASA alignment assemblies were used as one source of gene structure components by EVM. Alternatively, PASA can generate complete gene structures based on full-length alignment assemblies (alignment assemblies containing at least one FL-cDNA) by locating the longest ORF within each alignment assembly, and can thereby annotate gene structures and alternatively spliced isoforms restricted to the transcriptome. A third application of PASA is the retroactive processing of a set of pre-existing gene structure annotations, whereby alignment assemblies are used to add untranslated region annotations, modify exons, correctly split or merge predicted gene structures, and model alternative splicing isoforms.
To demonstrate the effect of applying PASA as a postprocess to integrate transcript data into an existing set of gene structure annotations (which we refer to as 'PASAu', for PASA updates), we applied PASA separately to the ab initio predictions, the various University of California at Santa Cruz gene prediction tracks (which we refer to as 'other predictions'), and to the EVM-generated datasets that either utilized or excluded the other predictions. The change in prediction accuracy as a result of applying PASA's annotation updates is illustrated in Additional data file 1 (Figure S2). PASAu can yield relatively large improvements (increases from 23% to 33% in gSn and from 7% to 32% in gSp) to the accuracy of the various ab initio predictions by incorporating transcript alignment assembly-based updates. PASAu-resulting changes to the accuracies of the other original predictions were more variable, mostly involving small increases in transcript sensitivity and larger decreases in transcript specificity: more GENCODE transcripts were predicted correctly, but additional PASA-based transcripts not represented in the GENCODE dataset were also identified. The EVM gene sets were affected similarly.
The small change in gSn and gSp resulting from applying the annotation update functions of PASA to the EVM predictions is not surprising, given that the PASA alignment assemblies were included here as inputs during the generation of the consensus gene structures by EVM. The most notable consequence of the PASA updates was the modeling of alternative splicing isoforms. Although the number of genes annotated as alternatively spliced was variable across the different annotation gene sets, the ratio of transcripts per alternatively spliced gene was fairly uniform, and largely consistent with the prevalence of alternatively spliced genes described in the GENCODE annotations (Figure 10). The variability in the number of alternatively spliced genes results from PASAu's stringent validation tests, which forgo automated gene structure updates in favor of targeted manual evaluation in those cases in which the tentative gene structure updates or candidate splicing isoforms vary greatly from the originally annotated gene structures.
Addition of alternatively spliced isoforms using PASAu. By applying Program to Assemble Spliced Alignments (PASA) to the various annotation datasets, PASA can automatically annotate alternative splicing isoforms. The number of alternatively spliced genes and the number of transcripts per alternatively spliced gene are shown, including the pre-PASAu and post-PASAu values. Only the EnsEMBL dataset includes models for alternatively spliced isoforms before the application of PASA. Dotted lines indicate the corresponding values based on the GENCODE reference annotation dataset: 147 alternatively spliced genes and 3.42 transcripts per alternatively spliced gene. Transcript isoforms alternatively spliced only in untranslated regions were ignored. Here, EVM:All(+OP) refers to the inclusion of the EVM:All evidence plus the 'other predictions' from ENCODE Genome Annotation Assessment Project (EGASP), including EnsEMBL, ENSgene, KnownGene, and CCDSgene, used by EVidenceModeler (EVM) as the OTHER_PREDICTION evidence class (Table 1).
The gene prediction accuracy of EVM, PASA alone, and PASA applied as a postprocess to update EVM predictions is provided along with the accuracies of methods evaluated as part of the EGASP competition in Figure 11. PASA, when used in isolation to annotate gene structures automatically based on transcript alignments alone, yields an impressive 60% gSn and 87% gSp; these values reflect the abundance and utility of the human ESTs and FL-cDNAs available. EVM, at its greatest accuracy across the various surveys of the EGASP dataset presented, yielded prediction accuracies of between 63% and 76% gSn and between 47% and 54% gSp.
EVM and PASA automated annotation accuracies compared to alternatives. The gene prediction accuracies of both EVidenceModeler (EVM) and Program to Assemble Spliced Alignments (PASA) are shown in the context of the other methods evaluated as part of the ENCODE Genome Annotation Assessment Project (EGASP) competition. Although PASA alone performs quite well, the benefits from applying PASA as a postprocess to the EVM consensus predictions are not immediately apparent, except in the enumeration of alternatively spliced isoforms as shown in Figure 10. PASA and EVM are shown to perform similarly to the best performing methods in the EGASP competition.
Although it is useful to compare the accuracies of these various tools based on their ability to recreate the GENCODE annotation for the ENCODE regions, direct comparisons among the methods based on these data are informative but not strictly valid. In the case of ab initio gene prediction tools that require only the genome sequence as input, direct comparisons between the results of the gene predictors are fully justified, because the inputs are identical. The focus of EGASP was to examine the accuracy of diverse automated annotation methods and not necessarily to perform head-to-head comparisons between each method. Therefore, groups were allowed to use any evidence available to them to assist in their annotation efforts, and so, for example, the additional evidence used by JIGSAW was not exactly the same as the inputs utilized by Exogean, or by EVM as described here. The analogous experiments we conducted in rice were more tightly controlled, given that each software tool was trained and executed using identical inputs. Even so, although alternative methods examined as part of the EGASP competition are shown to exceed EVM's accuracy, even if only slightly, EVM does fare well as an automated annotation system, especially when it is compared with the individual ab initio predictions.
Results and Discussion
130,000 long terminal repeat (LTR) retrotransposons in the 48 recently sequenced bird genomes and obtained 2,118 presence/absence patterns of insertions that occurred within the neoavian radiation and are distributed genome-wide (S1 Table, S1 Fig, S1 Data). These RE markers were obtained after visual inspection under strict criteria for coding of character states at orthologous RE loci (see Materials and Methods), because we aimed to minimize the two sources of potential homoplasy: independent RE insertion and precise excision.
Two Thousand RE Markers with Minimal Homoplasy
Homoplasy via independent RE insertion requires the retrotransposition of the same RE subtype into precisely the same genomic location, in the same orientation, and featuring an identical target site duplication. In addition to these factors that make independent insertions very rare, the LTR retrotransposons studied here have a low copy number (e.g., 3,138 copies in the zebra finch genome), were active only for a short time period around the neoavian radiation, and show no target site preference among thousands of reconstructed ancestral target sequences of inserted elements (S2 Fig). We therefore propose that the probability of homoplasy caused by independent insertions among our RE markers is extremely low. Homoplasy via precise excision is the deletion of the RE insertion and one copy of the duplicated target site, but not a single bp more or less than that. These requirements make the occurrence of precise excisions very rare, and we therefore visually inspected all of our markers for precise boundaries of presence/absence states and coded imprecise or poorly aligned boundaries as missing data. Altogether, we suggest that our 2,118 RE markers contain negligible homoplasy, and conflicts are instead due to ILS-derived hemiplasy.
To verify that incongruences constitute ILS-derived hemiplasy, Hormozdiari et al. proposed to test for topological consistency between each RE marker and a sequence tree derived from its flanking nucleotides. However, we note that failure of this test for some of their RE markers does not equal homoplasy of RE markers. Alternative and more plausible causes for inconsistencies are homoplasy or tree reconstruction uncertainties in the flanking sequence trees and the fact that recombination may cause different topologies between adjacent loci. Unfortunately, single-locus sequence trees of Neoaves have an average topological distance of 63% for introns and 66% for ultraconserved elements (UCEs) from the main Jarvis et al. tree. This means that the average nonexonic locus fails to congruently resolve most of the neoavian internodes. We note that it is therefore not possible to independently verify hemiplasy in neoavian RE markers by comparison to flanking sequence trees. Nevertheless, if homoplasy were prevalent in our RE markers, we would expect to see an equal distribution of RE incongruences across all of the sampled clades of Neoaves. While we find dozens of presence/absence markers with incongruences affecting the short branches within the neoavian radiation (S1 Table; e.g., the core landbirds and core waterbirds clades), there is not a single RE incongruence in our presence/absence matrix (S1 Table) affecting well-accepted internal relationships within postradiation taxa, such as passerines, parrots, eagles, penguins, the woodpecker/bee-eater clade, the hummingbird/swift clade, and the flamingo/grebe clade. Such an imbalance of RE incongruences strongly implies that homoplasy is indeed negligible among our 2,118 RE markers.
Genome-Scale RE-Based Phylogeny of Neoaves
We analyzed the RE presence/absence matrix using Felsenstein’s polymorphism parsimony and obtained a single most parsimonious RE (MPRE) tree, whose branches are supported by a total of 1,373 conflict-free insertion events across the neoavian radiation (Fig 1B). The topology is very similar to previous phylogenomic estimates using mostly noncoding nucleotide data [4,10,17–21], including relationships previously strongly supported in whole-genome sequence analyses (Fig 1A), such as the sunbittern/tropicbird, bustard/turaco, and mesite/sandgrouse clades. Of these three groups, only the sunbittern/tropicbird clade was previously recovered in some multilocus analyses [19–21].
(A) The main whole-genome sequence tree from Jarvis et al. mapped with our 2,118 retrotransposon markers (745 incongruent markers; tree length = 5,579; consistency index = 0.40; retention index = 0.64). (B) The same markers mapped on the single MPRE tree (S2 Data) resulting from analysis of their 2,118 presence/absence patterns (720 incongruent markers; tree length = 5,377; consistency index = 0.41; retention index = 0.66) under Felsenstein’s polymorphism parsimony. Black branches indicate topological concordances between the MPRE tree and the main Jarvis et al. tree, and discordances are limited to the deepest neoavian internodes (grey dashed branches) and the conflicting position of the mousebird (grey branches). The number of ILS-free, conflict-free insertion events (blue bold numbers) was identified for each internode, and numbers within doughnut plots indicate counts of ILS-affected RE insertion events leading to the persistence of insertion polymorphisms across two (green), three (orange), or more (red parts of doughnut plots) speciation events. (C–E) Schematic illustration of the different genealogical fates of segregating presence (colored lines) or absence (black lines) alleles following RE insertion (colored circles) in an exemplary five-taxon species tree. We show one respective example for the different degrees of gene tree–species tree conflict that can be caused by incomplete lineage sorting (ILS) across two (C), three (D), or more than three (E) successive speciation events. Incongruence of RE presence/absence patterns (dashed boxes) is illustrated with REs as colored ovals, target site duplications as white squares, and orthologous genomic flanks as black lines. The bird paintings were generated by Jon Fjeldså (used with permission).
There are three reasonable responses here:
Bearing that in mind, this answer becomes 'complicated'.
Basic performance improvements for current code:
I dislike any recursive function that relies on external (outside-the-method) values. In your case, the sumInStack is external. This makes the target hard to 'see'.
Additionally, if we do sort the data, there are some benefits we can have, and a way to restructure the recursion to make it do less work (since we can guarantee that all values after a point have certain properties):
Consider the method (assuming sorted data):
You would call this function with:
So, that covers 'can the code be improved?' and 'will sorting help?'
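A minimal sketch of how sorting lets the recursion stop early, with the running state passed as parameters instead of held in an external field. It assumes strictly positive values and that the goal is to list every subset summing to a target; the class and method names are illustrative and are not the code from the question or this answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SortedSubsetSum {

    /** Returns every subset of data (as a list of values) whose sum equals target. */
    static List<List<Integer>> findSubsets(int[] data, int target) {
        int[] sorted = data.clone();
        Arrays.sort(sorted); // ascending order enables the early exit below
        List<List<Integer>> results = new ArrayList<>();
        search(sorted, 0, target, new ArrayList<>(), results);
        return results;
    }

    // All state (start index, remaining amount, current choice) travels as arguments.
    private static void search(int[] sorted, int from, int remaining,
                               List<Integer> chosen, List<List<Integer>> results) {
        if (remaining == 0) {
            results.add(new ArrayList<>(chosen));
            return; // values are assumed positive, so adding more would overshoot
        }
        for (int i = from; i < sorted.length; i++) {
            if (sorted[i] > remaining) {
                break; // sorted data: every later value is too large as well
            }
            chosen.add(sorted[i]);
            search(sorted, i + 1, remaining - sorted[i], chosen, results);
            chosen.remove(chosen.size() - 1); // undo the choice before trying the next value
        }
    }

    public static void main(String[] args) {
        // prints [[2, 3, 4], [4, 5]]
        System.out.println(findSubsets(new int[] {3, 34, 4, 12, 5, 2}, 9));
    }
}
```

The `break` is where sorting pays off: once one value exceeds the remaining amount, no later value needs to be examined at that level.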
As for the 'unrolled' (no recursion) version of the system, it can be done. It would require three int arrays:
The sums and indices act like a stack, and the depth is how deep the stack is (again, assume sorted data):
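One way the recursion-free version might look, as a sketch only. It assumes strictly positive values and a positive target, and it guesses that the three int arrays are the sorted data, the per-level indices, and the per-level running sums; all names are hypothetical rather than taken from the original answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class IterativeSubsetSum {

    /** Lists every subset of data whose sum equals target, without recursion. */
    static List<int[]> findSubsets(int[] data, int target) {
        int[] sorted = data.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        int[] indices = new int[n + 1]; // indices[d]: candidate (or chosen) position at depth d
        int[] sums = new int[n + 1];    // sums[d]: total of the values chosen at depths below d
        List<int[]> results = new ArrayList<>();
        int depth = 0;                  // how deep the explicit stack currently is
        while (depth >= 0) {
            int i = indices[depth];
            if (i < n && sums[depth] + sorted[i] <= target) {
                int sum = sums[depth] + sorted[i];
                if (sum == target) {
                    int[] subset = new int[depth + 1];
                    for (int d = 0; d <= depth; d++) {
                        subset[d] = sorted[indices[d]];
                    }
                    results.add(subset);
                    indices[depth]++;   // try the next candidate at this depth
                } else {
                    sums[depth + 1] = sum;      // push: keep sorted[i] and go one level deeper
                    indices[depth + 1] = i + 1;
                    depth++;
                }
            } else {
                depth--;                // pop: this level is exhausted or overshoots
                if (depth >= 0) {
                    indices[depth]++;
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        for (int[] subset : findSubsets(new int[] {3, 34, 4, 12, 5, 2}, 9)) {
            System.out.println(Arrays.toString(subset));
        }
    }
}
```

Each push corresponds to a recursive call and each pop to a return, so the search order matches a depth-first recursion over the sorted data.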
Another way to do problems like this, which involve investigating properties of all subsets (that is, members of the "power set"), is to think of the main set as a list of cells, and each cell as a binary digit position. A member of the power set can therefore be described by a binary number, such that the subset contains only those elements of the set corresponding to a 1 in the binary value.
By doing that, you can generate the power set just by counting. Of course this gets a little complicated when the original set has more values in it than can be comfortably dealt with by the native integer type in a given programming language, but Java has BigInteger. (Enumerating a power set for any purpose is going to be a little painful for original sets that big anyway.)
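The counting idea can be sketched as follows, assuming the set is small enough for a `long` mask and that the property of interest is "sums to a target". The class name, method name, and size limit are illustrative choices, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;

final class PowerSetByCounting {

    /** Enumerates all 2^n subsets by counting a bit mask and keeps those that sum to target. */
    static List<List<Integer>> subsetsWithSum(int[] set, int target) {
        int n = set.length;
        if (n > 30) {
            // Arbitrary guard for this sketch: 2^n subsets grows too fast to enumerate.
            throw new IllegalArgumentException("set too large to enumerate by counting");
        }
        List<List<Integer>> results = new ArrayList<>();
        for (long mask = 0; mask < (1L << n); mask++) {
            long sum = 0;
            List<Integer> subset = new ArrayList<>();
            for (int bit = 0; bit < n; bit++) {
                if ((mask & (1L << bit)) != 0) { // a 1 bit means element 'bit' is in the subset
                    sum += set[bit];
                    subset.add(set[bit]);
                }
            }
            if (sum == target) {
                results.add(subset);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // prints [[1, 2], [3]]
        System.out.println(subsetsWithSum(new int[] {1, 2, 3}, 3));
    }
}
```

For sets wider than a native integer, the same loop could count with `BigInteger` and read bits with `BigInteger.testBit`, at the cost of much slower iteration.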
I have not fully worked it out, but the best algorithm here is probably dynamic programming. Basically, I would order the values and at each one keep all possible sums, considering earlier sums.
Note that there is some efficiency above because some combinations are repeated many times. For example, at item 3, the output value 3 can be obtained from either (1 * 3_from_previous_sum + 0 * 3) or (0 * 3_from_previous_sum + 1 * 3). The further you go, the more such redundant values happen.
I have not worked out whether this would clearly be more efficient than a brute-force search, but I am pretty sure it would. Dynamic programming should increase the memory requirement of the algorithm, but decrease the compute time.
The example table I made would be useful to answer whether a given sum can be attained or not, but not to give all combinations that can produce a sum, if it exists. To answer that second question, the table would have to be modified to also associate with each output sum value all the combinations which can produce it.
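A minimal sketch of the "keep all possible sums" table, assuming non-negative values and answering only the yes/no question of whether a sum can be attained. The names are hypothetical, and this is one common way to write the table rather than the exact layout described above.

```java
final class ReachableSums {

    /** Returns true if some subset of values adds up to exactly target. */
    static boolean canReach(int[] values, int target) {
        if (target < 0) {
            return false; // non-negative values can never produce a negative sum
        }
        boolean[] reachable = new boolean[target + 1];
        reachable[0] = true; // the empty subset sums to 0
        for (int v : values) {
            if (v < 0) {
                throw new IllegalArgumentException("sketch assumes non-negative values");
            }
            // Walk downwards so each value is used at most once per subset.
            for (int s = target; s >= v; s--) {
                if (reachable[s - v]) {
                    reachable[s] = true;
                }
            }
        }
        return reachable[target];
    }

    public static void main(String[] args) {
        int[] values = {3, 34, 4, 12, 5, 2};
        System.out.println(canReach(values, 9));  // true: 4 + 5
        System.out.println(canReach(values, 30)); // false
    }
}
```

The table costs memory proportional to the target and time proportional to the number of values times the target, which matches the trade-off described above. Recovering the actual combinations would require storing, for each reachable sum, how it was reached.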
Dividing Exponents in Mixed Bases
When you do algebra with exponents, the equation often contains different bases. For example, you might encounter $x^2 y^3 \div x^3 y^2$. You can combine exponents only when they share the same base, so you work with the x parts and the y parts separately:
$$\frac{x^2 y^3}{x^3 y^2} = x^{2-3}\, y^{3-2} = x^{-1} y^{1}$$
In reality, $y^1$ is just y, but it’s shown here for clarity. Note that it’s possible to have negative exponents as well as positive ones. In this case,
$$x^{-1} y^{1} = \frac{y}{x}$$
You can’t simplify the expressions more than this, so this is all you need to do.
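The same steps can be written once for any exponents. In this sketch the letters a, b, c, and d are placeholder exponents that do not appear in the example, and x and y are assumed to be nonzero. The last line checks the example numerically with x = 2 and y = 3.

```latex
\frac{x^{a} y^{b}}{x^{c} y^{d}} = x^{a-c}\, y^{b-d}
\qquad\text{so with } a=2,\ b=3,\ c=3,\ d=2:\qquad
\frac{x^{2} y^{3}}{x^{3} y^{2}} = x^{-1} y^{1} = \frac{y}{x}

\text{Check with } x=2,\ y=3:\qquad
\frac{2^{2}\cdot 3^{3}}{2^{3}\cdot 3^{2}} = \frac{4\cdot 27}{8\cdot 9} = \frac{108}{72} = \frac{3}{2} = \frac{y}{x}
```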