So far all I have found is this: https://www.biostars.org/p/132527/
My guess is that some alleles, for whatever reason, bind to things like transcription factors 'moreso' (for want of better word - affinity maybe) than others. Or maybe some alleles are more likely to get methylated, for example.
I feel my understanding is very much bare bones. Is anyone able to go through it a little more comprehensively? Thanks
Allele-specific oligonucleotide hybridization, or dot blotting, is a method for testing known mutations. In this method, DNA products generated by polymerase chain reaction (PCR) are dotted onto duplicate membranes: one membrane is hybridized with an oligonucleotide corresponding to the wild-type sequence, and the other with an oligonucleotide corresponding to the mutant sequence. Normal individuals show hybridization only with the normal probe, homozygous mutant individuals show hybridization only with the mutant probe, and heterozygous individuals hybridize with both probes (wild-type and mutant). 92, 93
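The interpretation rule for the two membranes can be written out as a small lookup. This is only a minimal sketch: the function name `call_genotype` and the boolean inputs are hypothetical, and a real assay would score signal intensity against controls rather than a plain yes/no.

```python
def call_genotype(wild_type_signal: bool, mutant_signal: bool) -> str:
    """Interpret one sample's dot-blot results from the two membranes.

    wild_type_signal: True if the sample hybridized with the wild-type probe.
    mutant_signal:    True if the sample hybridized with the mutant probe.
    """
    if wild_type_signal and mutant_signal:
        return "heterozygous"
    if wild_type_signal:
        return "homozygous wild-type"
    if mutant_signal:
        return "homozygous mutant"
    # Neither probe bound: the assay failed or the PCR product is missing.
    return "no call"


if __name__ == "__main__":
    for wt, mut in [(True, False), (False, True), (True, True), (False, False)]:
        print(wt, mut, "->", call_genotype(wt, mut))
```

The "no call" branch is an assumption added for completeness; the text above only describes the three informative outcomes.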
Restriction digest is a method that uses several restriction enzymes to detect variations from the expected digestion patterns of sequences.
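As an illustration of how a sequence variant changes a digestion pattern, the sketch below cuts a made-up linear sequence at every EcoRI site (GAATTC, cut after the first G) and compares fragment lengths. The sequences and the helper name `digest` are hypothetical, and only a single enzyme on one strand is modelled.

```python
def digest(sequence: str, site: str, cut_offset: int) -> list[int]:
    """Return fragment lengths after cutting a linear sequence at every site.

    cut_offset is the position of the cut within the recognition site
    (1 for EcoRI, which cuts G^AATTC).
    """
    fragments = []
    previous_cut = 0
    position = sequence.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(cut - previous_cut)
        previous_cut = cut
        position = sequence.find(site, position + 1)
    fragments.append(len(sequence) - previous_cut)
    return fragments


# Hypothetical 38-nt reference with one EcoRI site, and a variant that destroys it.
reference = "ATGC" * 3 + "GAATTC" + "ATGC" * 5
variant = reference.replace("GAATTC", "GAGTTC")

if __name__ == "__main__":
    print("reference fragments:", digest(reference, "GAATTC", 1))
    print("variant fragments:  ", digest(variant, "GAATTC", 1))
```

The reference is cut into two fragments while the variant stays intact, which is the kind of deviation from the expected pattern that the method detects.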
Microarrays use nucleotide sequences placed on a solid support or chip. The test sample is amplified by PCR, hybridized to the array, and analyzed by computer. 92, 100
Chapter 2 - Allele-Specific Mutation Detection
This chapter provides a description of three relatively simple polymerase chain reaction (PCR)-based techniques that are applied to detect known nucleotide variations in DNA. Allele-specific mutation detection is feasible and quite straightforward using all three methodologies: amplification refractory mutation system (PCR-ARMS), allele-specific oligonucleotide probes (PCR-ASO), and competitive oligonucleotide priming (COP). All methods have a comparably high degree of accuracy and specificity. Although PCR-ARMS and PCR-ASO enjoy broad applicability in low-throughput laboratories, COP's potential has not been fully exploited yet, presumably because the detection systems are somewhat elaborate and need special equipment. A simulation study is conducted where it is observed that although conventional allele-specific amplification has somewhat higher inherent specificity than competitive oligopriming reaction, it actually may be easier to optimize the latter ones to offer greater reproducibility and tolerance to alterations in target amounts without any significant loss of specificity. Allele-specific mutation detection sets the standards for the development of high-throughput systems, such as microarrays for the purpose of large-scale single nucleotide polymorphism (SNP) genotyping that concern genome scans and DNA diagnostics of genetic, acquired, and infectious diseases. A methodology that resembles PCR-ASO is employed to screen for HBB mutations in an array format. Together with low-cost reagents and short processing times, a potential use for this technology in screening programs for large populations is indicated.
In 2007 a study found only one-fifth of transcription across the human genome is associated with protein-coding genes, indicating at least four times more long non-coding than coding RNA sequences. However, it is large-scale complementary DNA (cDNA) sequencing projects such as FANTOM (Functional Annotation of Mammalian cDNA) that reveal the complexity of this transcription. The FANTOM3 project identified ~35,000 non-coding transcripts from ~10,000 distinct loci that bear many signatures of mRNAs, including 5’ capping, splicing, and poly-adenylation, but have little or no open reading frame (ORF). While the abundance of long ncRNAs was unanticipated, this number represents a conservative lower estimate, since it omitted many singleton transcripts and non-polyadenylated transcripts (tiling array data shows more than 40% of transcripts are non-polyadenylated). However, unambiguously identifying ncRNAs within these cDNA libraries is challenging since it can be difficult to distinguish protein-coding transcripts from non-coding transcripts. Multiple studies have suggested that testis and neural tissues express the greatest amount of long non-coding RNAs of any tissue type. Using FANTOM5, 27,919 long ncRNAs have been identified in various human sources.
Quantitatively, lncRNAs demonstrate ~10-fold lower abundance than mRNAs in a population of cells, which is explained by higher cell-to-cell variation of expression levels of lncRNA genes in the individual cells, when compared to protein-coding genes. In general, the majority (~78%) of lncRNAs are characterized as tissue-specific, as opposed to only ~19% of mRNAs. In addition to higher tissue specificity, lncRNAs are characterized by higher developmental stage specificity and cell subtype specificity in heterogeneous tissues, such as human neocortex. In 2018, a comprehensive integration of lncRNAs from existing databases, published literature and novel RNA assemblies based on RNA-seq data analysis revealed that there are 270,044 lncRNA transcripts in humans.
In comparison to mammals, relatively few studies have focussed on the prevalence of lncRNAs in plants. However, an extensive study considering 37 higher plant species and six algae identified ~200,000 non-coding transcripts using an in-silico approach, and also established the associated Green Non-Coding Database (GreeNC), a repository of plant lncRNAs.
In 2005 the landscape of the mammalian genome was described as numerous 'foci' of transcription that are separated by long stretches of intergenic space. While long ncRNAs are located and transcribed within the intergenic stretches, the majority are transcribed as complex, interlaced networks of overlapping sense and antisense transcripts that often include protein-coding genes, giving rise to a complex hierarchy of overlapping isoforms. Genomic sequences within these transcriptional foci are often shared within a number of different coding and non-coding transcripts in the sense and antisense directions. For example, 3012 out of 8961 cDNAs previously annotated as truncated coding sequences within FANTOM2 were later designated as genuine ncRNA variants of protein-coding cDNAs. While the abundance and conservation of these interleaved arrangements suggest they have biological relevance, the complexity of these foci frustrates easy evaluation.
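To make the idea of overlapping sense and antisense transcripts concrete, the sketch below lists pairs of transcripts that overlap on opposite strands. It is only an illustration: the `Transcript` record, the coordinates and the names are hypothetical, and all transcripts are assumed to lie on the same chromosome.

```python
from typing import NamedTuple


class Transcript(NamedTuple):
    name: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    strand: str  # "+" or "-"


def antisense_pairs(transcripts: list[Transcript]) -> list[tuple[str, str]]:
    """Return name pairs of transcripts that overlap on opposite strands."""
    pairs = []
    for i, a in enumerate(transcripts):
        for b in transcripts[i + 1:]:
            overlaps = a.start < b.end and b.start < a.end
            if overlaps and a.strand != b.strand:
                pairs.append((a.name, b.name))
    return pairs


# Hypothetical transcriptional focus: one coding gene and two non-coding transcripts.
focus = [
    Transcript("coding_gene", 1000, 5000, "+"),
    Transcript("antisense_ncRNA", 4500, 7000, "-"),
    Transcript("intergenic_ncRNA", 9000, 9800, "-"),
]

if __name__ == "__main__":
    print(antisense_pairs(focus))
```

The pairwise comparison is quadratic in the number of transcripts, which is fine for a single locus; genome-wide annotation sets would normally use an interval index instead.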
The GENCODE consortium has collated and analysed a comprehensive set of human lncRNA annotations and their genomic organisation, modifications, cellular locations and tissue expression profiles.  Their analysis indicates human lncRNAs show a bias toward two-exon transcripts. 
|Name||Species||Web Server||Repository||Input File||Main Model/Algorithm||Training-set||Year Published||Reference|
|LGC||Plant/Animal||LGC|| ||FASTA/BED/GTF||Relationship between ORF length and GC content||NO||2019|||
|COME||Plant/Human/Mouse/Fly/Worm||COME||COME||GTF||Balanced Random Forest||YES||2017|||
|lncRScan-SVM||Human||NA|| ||FASTA/BED/GTF/GFF||Support Vector Machine||YES||2015|||
|CNCI||Plant/Animal||NA|| ||FASTA/GTF||Support Vector Machine||NO||2013|||
|PLEK||Vertebrate||NA||PLEK||FASTA||Support Vector Machine||NO||2014|||
|PhyloCSF||Vertebrate/Fly/Mosquito/Yeast/Worm||NA|| ||FASTA||Phylogenetic Codon Model||YES||2011|||
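Several of the tools in the table work from simple sequence features. As a rough illustration of the two features named in the LGC row, the sketch below computes the longest open reading frame and the GC content of a transcript. It is not the LGC algorithm itself: the function names are hypothetical, only the three forward frames are scanned, and how the two features would be combined into a coding/non-coding call is left out.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_orf_length(seq: str) -> int:
    """Length in nucleotides of the longest ATG-to-stop ORF, stop codon included.

    Only the three forward reading frames are scanned; ORFs that run off the
    end of the sequence without a stop codon are ignored.
    """
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


if __name__ == "__main__":
    transcript = "GGATGAAACCCTAGGG"  # made-up example sequence
    print("longest ORF (nt):", longest_orf_length(transcript))
    print("GC content:", gc_content(transcript))
```

A transcript with a short longest ORF relative to its length is more likely to be non-coding, but any threshold would be an assumption beyond what the table states.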
There has been considerable debate about whether lncRNAs have been misannotated and do in fact encode proteins. Several lncRNAs have been found to in fact encode for peptides with biologically significant function.    Ribosome profiling studies have suggested that anywhere from 40% to 90% of annotated lncRNAs are in fact translated,   although there is disagreement about the correct method for analyzing ribosome profiling data.  Additionally, it is thought that many of the peptides produced by lncRNAs may be highly unstable and without biological function. 
Initial studies into lncRNA conservation noted that as a class, they were enriched for conserved sequence elements,  depleted in substitution and insertion/deletion rates  and depleted in rare frequency variants,  indicative of purifying selection maintaining lncRNA function. However, further investigations into vertebrate lncRNAs revealed that while lncRNAs are conserved in sequence, they are not conserved in transcription.    In other words, even when the sequence of a human lncRNA is conserved in another vertebrate species, there is often no transcription of a lncRNA in the orthologous genomic region. Some argue that these observations suggest non-functionality of the majority of lncRNAs,    while others argue that they may be indicative of rapid species-specific adaptive selection. 
While the turnover of lncRNA transcription is much higher than initially expected, it is important to note that hundreds of lncRNAs are still conserved at the sequence level. There have been several attempts to delineate the different categories of selection signatures seen amongst lncRNAs, including: lncRNAs with strong sequence conservation across the entire length of the gene, lncRNAs in which only a portion of the transcript (e.g. 5′ end, splice sites) is conserved, and lncRNAs that are transcribed from syntenic regions of the genome but have no recognizable sequence similarity. Additionally, there have been attempts to identify conserved secondary structures in lncRNAs, though these studies have so far yielded conflicting results.
Large-scale sequencing of cDNA libraries and, more recently, transcriptomic sequencing by next generation sequencing indicate that long noncoding RNAs number in the order of tens of thousands in mammals. However, despite accumulating evidence suggesting that the majority of these are likely to be functional, only a relatively small proportion has been demonstrated to be biologically relevant. As of January 2016, 294 lncRNAs have been functionally annotated in LncRNAdb (a database of literature-described lncRNAs), with the majority of these (183 lncRNAs) being described in humans. As of June 2018, a total of 1867 human lncRNAs with experimental evidence have been community-curated in LncRNAWiki (a wiki-based, publicly editable and open-content platform for community curation of human lncRNAs) with respect to their functional mechanisms and disease associations; these can also be accessed in LncBook. According to the literature-based curation of functional mechanisms, lncRNAs are extensively reported to be involved in transcriptional regulation. A further large-scale sequencing study provides evidence that many transcripts thought to be lncRNAs may, in fact, be translated into proteins.
In the regulation of gene transcription
In gene-specific transcription
In eukaryotes, RNA transcription is a tightly regulated process. NcRNAs can target different aspects of this process, targeting transcriptional activators or repressors, different components of the transcription reaction including RNA polymerase (RNAP) II, and even the DNA duplex to regulate gene transcription and expression. In combination, these ncRNAs may comprise a regulatory network that, together with transcription factors, finely controls gene expression in complex eukaryotes.
NcRNAs modulate the function of transcription factors by several different mechanisms, including functioning themselves as co-regulators, modifying transcription factor activity, or regulating the association and activity of co-regulators. For example, the ncRNA Evf-2 functions as a co-activator for the homeobox transcription factor Dlx2, which plays important roles in forebrain development and neurogenesis. Sonic hedgehog induces transcription of Evf-2 from an ultra-conserved element located between the Dlx5 and Dlx6 genes during forebrain development. Evf-2 then recruits the Dlx2 transcription factor to the same ultra-conserved element, whereby Dlx2 subsequently induces expression of Dlx5. The existence of other similar ultra- or highly conserved elements within the mammalian genome that are both transcribed and fulfil enhancer functions suggests Evf-2 may be illustrative of a generalised mechanism that tightly regulates important developmental genes with complex expression patterns during vertebrate growth. Indeed, the transcription and expression of similar non-coding ultraconserved elements was shown to be abnormal in human leukaemia and to contribute to apoptosis in colon cancer cells, suggesting their involvement in tumorigenesis.
Local ncRNAs can also recruit transcriptional programmes to regulate adjacent protein-coding gene expression. For example, divergent lncRNAs, which are transcribed in the opposite direction to nearby protein-coding genes and comprise a significant proportion (~20%) of total lncRNAs in mammalian genomes, possibly regulate the transcription of nearby essential developmental regulatory genes in pluripotent cells.
The RNA binding protein TLS binds and inhibits the CREB binding protein and p300 histone acetyltransferase activities on a repressed gene target, cyclin D1. The recruitment of TLS to the promoter of cyclin D1 is directed by long ncRNAs expressed at low levels and tethered to 5’ regulatory regions in response to DNA damage signals. Moreover, these local ncRNAs act cooperatively as ligands to modulate the activities of TLS. In the broad sense, this mechanism allows the cell to harness RNA-binding proteins, which make up one of the largest classes within the mammalian proteome, and integrate their function in transcriptional programs. Nascent long ncRNAs have been shown to increase the activity of CREB binding protein, which in turn increases the transcription of that ncRNA. A recent study found that a lncRNA in the antisense direction of the Apolipoprotein A1 (APOA1) gene regulates the transcription of APOA1 through epigenetic modifications.
Recent evidence has raised the possibility that transcription of genes that escape from X-inactivation might be mediated by expression of long non-coding RNA within the escaping chromosomal domains. 
Regulating basal transcription machinery
NcRNAs also target general transcription factors required for the RNAP II transcription of all genes. These general factors include components of the initiation complex that assemble on promoters or are involved in transcription elongation. A ncRNA transcribed from an upstream minor promoter of the dihydrofolate reductase (DHFR) gene forms a stable RNA-DNA triplex within the major promoter of DHFR to prevent the binding of the transcriptional co-factor TFIIB. This novel mechanism of regulating gene expression may in fact represent a widespread method of controlling promoter usage, given that thousands of such triplexes exist in eukaryotic chromosomes. The U1 ncRNA can induce transcription initiation by specifically binding to and stimulating TFIIH to phosphorylate the C-terminal domain of RNAP II. In contrast, the ncRNA 7SK is able to repress transcription elongation by, in combination with HEXIM1/2, forming an inactive complex that prevents the PTEFb general transcription factor from phosphorylating the C-terminal domain of RNAP II, thereby repressing global elongation under stressful conditions. These examples, which bypass specific modes of regulation at individual promoters to mediate changes directly at the level of initiation and elongation transcriptional machinery, provide a means of quickly effecting global changes in gene expression.
The ability to quickly mediate global changes is also apparent in the rapid expression of non-coding repetitive sequences. The short interspersed nuclear (SINE) Alu elements in humans and analogous B1 and B2 elements in mice have succeeded in becoming the most abundant mobile elements within the genomes, comprising ~10% of the human genome and ~6% of the mouse genome, respectively. These elements are transcribed as ncRNAs by RNAP III in response to environmental stresses such as heat shock, where they then bind to RNAP II with high affinity and prevent the formation of active pre-initiation complexes. This allows for the broad and rapid repression of gene expression in response to stress.
A dissection of the functional sequences within Alu RNA transcripts has drafted a modular structure analogous to the organization of domains in protein transcription factors.  The Alu RNA contains two ‘arms’, each of which may bind one RNAP II molecule, as well as two regulatory domains that are responsible for RNAP II transcriptional repression in vitro.  These two loosely structured domains may even be concatenated to other ncRNAs such as B1 elements to impart their repressive role.  The abundance and distribution of Alu elements and similar repetitive elements throughout the mammalian genome may be partly due to these functional domains being co-opted into other long ncRNAs during evolution, with the presence of functional repeat sequence domains being a common characteristic of several known long ncRNAs including Kcnq1ot1, Xlsirt and Xist.    
In addition to heat shock, the expression of SINE elements (including Alu, B1, and B2 RNAs) increases during cellular stress such as viral infection and in some cancer cells, where they may similarly regulate global changes to gene expression. The ability of Alu and B2 RNA to bind directly to RNAP II provides a broad mechanism to repress transcription. Nevertheless, there are specific exceptions to this global response where Alu or B2 RNAs are not found at activated promoters of genes undergoing induction, such as the heat shock genes. This additional hierarchy of regulation that exempts individual genes from the generalised repression also involves a long ncRNA, heat shock RNA-1 (HSR-1). It was argued that HSR-1 is present in mammalian cells in an inactive state, but upon stress is activated to induce the expression of heat shock genes. The authors found that this activation involves a conformational alteration to the structure of HSR-1 in response to rising temperatures, thereby permitting its interaction with the transcriptional activator HSF-1, which subsequently undergoes trimerisation and induces the expression of heat shock genes. In the broad sense, these examples illustrate a regulatory circuit nested within ncRNAs whereby Alu or B2 RNAs repress general gene expression, while other ncRNAs activate the expression of specific genes.
Transcribed by RNA polymerase III
Many of the ncRNAs that interact with general transcription factors or RNAP II itself (including 7SK, Alu and B1 and B2 RNAs) are transcribed by RNAP III,  thereby uncoupling the expression of these ncRNAs from the RNAP II transcriptional reaction they regulate. RNAP III also transcribes a number of additional novel ncRNAs, such as BC2, BC200 and some microRNAs and snoRNAs, in addition to the highly expressed infrastructural ‘housekeeping’ ncRNA genes such as tRNAs, 5S rRNAs and snRNAs.  The existence of an RNAP III-dependent ncRNA transcriptome that regulates its RNAP II-dependent counterpart was supported by a recent study that described a novel set of ncRNAs transcribed by RNAP III with sequence homology to protein-coding genes. This prompted the authors to posit a ‘cogene/gene’ functional regulatory network,  showing that one of these ncRNAs, 21A, regulates the expression of its antisense partner gene, CENP-F in trans.
In post-transcriptional regulation
In addition to regulating transcription, ncRNAs also control various aspects of post-transcriptional mRNA processing. Similar to small regulatory RNAs such as microRNAs and snoRNAs, these functions often involve complementary base pairing with the target mRNA. The formation of RNA duplexes between complementary ncRNA and mRNA may mask key elements within the mRNA required to bind trans-acting factors, potentially affecting any step in post-transcriptional gene expression including pre-mRNA processing and splicing, transport, translation, and degradation. 
In splicing
The splicing of mRNA can induce its translation and functionally diversify the repertoire of proteins it encodes. The Zeb2 mRNA, which has a particularly long 5’UTR, requires the retention of a 5’UTR intron that contains an internal ribosome entry site for efficient translation. However, retention of the intron is dependent on the expression of an antisense transcript that complements the intronic 5’ splice site. Therefore, the ectopic expression of the antisense transcript represses splicing and induces translation of the Zeb2 mRNA during mesenchymal development. Likewise, the expression of an overlapping antisense Rev-ErbAα2 transcript controls the alternative splicing of the thyroid hormone receptor ErbAα2 mRNA to form two antagonistic isoforms.
In translation
NcRNA may also apply additional regulatory pressures during translation, a property particularly exploited in neurons where the dendritic or axonal translation of mRNA in response to synaptic activity contributes to changes in synaptic plasticity and the remodelling of neuronal networks. The RNAP III transcribed BC1 and BC200 ncRNAs, that previously derived from tRNAs, are expressed in the mouse and human central nervous system, respectively.   BC1 expression is induced in response to synaptic activity and synaptogenesis and is specifically targeted to dendrites in neurons.  Sequence complementarity between BC1 and regions of various neuron-specific mRNAs also suggest a role for BC1 in targeted translational repression.  Indeed, it was recently shown that BC1 is associated with translational repression in dendrites to control the efficiency of dopamine D2 receptor-mediated transmission in the striatum  and BC1 RNA-deleted mice exhibit behavioural changes with reduced exploration and increased anxiety. 
In siRNA-directed gene regulation
In addition to masking key elements within single-stranded RNA, the formation of double-stranded RNA duplexes can also provide a substrate for the generation of endogenous siRNAs (endo-siRNAs) in Drosophila and mouse oocytes. The annealing of complementary sequences, such as antisense or repetitive regions between transcripts, forms an RNA duplex that may be processed by Dicer-2 into endo-siRNAs. Also, long ncRNAs that form extended intramolecular hairpins may be processed into siRNAs, compellingly illustrated by the esi-1 and esi-2 transcripts. Endo-siRNAs generated from these transcripts seem particularly useful in suppressing the spread of mobile transposon elements within the genome in the germline. However, the generation of endo-siRNAs from antisense transcripts or pseudogenes may also silence the expression of their functional counterparts via RISC effector complexes, acting as an important node that integrates various modes of long and short RNA regulation, as exemplified by Xist and Tsix (see the section on X-chromosome inactivation).
In epigenetic regulation
Epigenetic modifications, including histone and DNA methylation, histone acetylation and sumoylation, affect many aspects of chromosomal biology, primarily including regulation of large numbers of genes by remodeling broad chromatin domains.   While it has been known for some time that RNA is an integral component of chromatin,   it is only recently that we are beginning to appreciate the means by which RNA is involved in pathways of chromatin modification.    For example, Oplr16 epigenetically induces the activation of stem cell core factors by coordinating intrachromosomal looping and recruitment of DNA demethylase TET2. 
In Drosophila, long ncRNAs induce the expression of the homeotic gene, Ubx, by recruiting and directing the chromatin modifying functions of the trithorax protein Ash1 to Hox regulatory elements.  Similar models have been proposed in mammals, where strong epigenetic mechanisms are thought to underlie the embryonic expression profiles of the Hox genes that persist throughout human development.   Indeed, the human Hox genes are associated with hundreds of ncRNAs that are sequentially expressed along both the spatial and temporal axes of human development and define chromatin domains of differential histone methylation and RNA polymerase accessibility.  One ncRNA, termed HOTAIR, that originates from the HOXC locus represses transcription across 40 kb of the HOXD locus by altering chromatin trimethylation state. HOTAIR is thought to achieve this by directing the action of Polycomb chromatin remodeling complexes in trans to govern the cells' epigenetic state and subsequent gene expression. Components of the Polycomb complex, including Suz12, EZH2 and EED, contain RNA binding domains that may potentially bind HOTAIR and probably other similar ncRNAs.   This example nicely illustrates a broader theme whereby ncRNAs recruit the function of a generic suite of chromatin modifying proteins to specific genomic loci, underscoring the complexity of recently published genomic maps.  Indeed, the prevalence of long ncRNAs associated with protein coding genes may contribute to localised patterns of chromatin modifications that regulate gene expression during development. For example, the majority of protein-coding genes have antisense partners, including many tumour suppressor genes that are frequently silenced by epigenetic mechanisms in cancer.  A recent study observed an inverse expression profile of the p15 gene and an antisense ncRNA in leukaemia.  
A detailed analysis showed the p15 antisense ncRNA (CDKN2BAS) was able to induce changes to heterochromatin and DNA methylation status of p15 by an unknown mechanism, thereby regulating p15 expression.  Therefore, misexpression of the associated antisense ncRNAs may subsequently silence the tumour suppressor gene contributing towards cancer.
Many emergent themes of ncRNA-directed chromatin modification were first apparent within the phenomenon of imprinting, whereby only one allele of a gene is expressed from either the maternal or the paternal chromosome. In general, imprinted genes are clustered together on chromosomes, suggesting the imprinting mechanism acts upon local chromosome domains rather than individual genes. These clusters are also often associated with long ncRNAs whose expression is correlated with the repression of the linked protein-coding gene on the same allele. Indeed, detailed analysis has revealed a crucial role for the ncRNAs Kcnq1ot1 and Igf2r/Air in directing imprinting.
Almost all the genes at the Kcnq1 locus are maternally expressed, except the paternally expressed antisense ncRNA Kcnq1ot1. Transgenic mice with truncated Kcnq1ot1 fail to silence the adjacent genes, suggesting that Kcnq1ot1 is crucial to the imprinting of genes on the paternal chromosome. It appears that Kcnq1ot1 is able to direct the trimethylation of lysine 9 (H3K9me3) and lysine 27 (H3K27me3) of histone 3 to an imprinting centre that overlaps the Kcnq1ot1 promoter and actually resides within a Kcnq1 sense exon. Similar to HOTAIR (see above), Eed-Ezh2 Polycomb complexes are recruited to the Kcnq1 locus on the paternal chromosome, possibly by Kcnq1ot1, where they may mediate gene silencing through repressive histone methylation. A differentially methylated imprinting centre also overlaps the promoter of a long antisense ncRNA, Air, that is responsible for the silencing of neighbouring genes at the Igf2r locus on the paternal chromosome. The presence of allele-specific histone methylation at the Igf2r locus suggests Air also mediates silencing via chromatin modification.
Xist and X-chromosome inactivation
The inactivation of an X-chromosome in female placental mammals is directed by one of the earliest and best characterized long ncRNAs, Xist. The expression of Xist from the future inactive X-chromosome, and its subsequent coating of the inactive X-chromosome, occurs during early embryonic stem cell differentiation. Xist expression is followed by irreversible layers of chromatin modifications that include the loss of the histone (H3K9) acetylation and H3K4 methylation that are associated with active chromatin, and the induction of repressive chromatin modifications including H4 hypoacetylation, H3K27 trimethylation, H3K9 hypermethylation and H4K20 monomethylation as well as H2AK119 monoubiquitylation. These modifications coincide with the transcriptional silencing of the X-linked genes. Xist RNA also localises the histone variant macroH2A to the inactive X-chromosome. There are additional ncRNAs that are also present at the Xist locus, including an antisense transcript, Tsix, which is expressed from the future active chromosome and is able to repress Xist expression by the generation of endogenous siRNA. Together these ncRNAs ensure that only one X-chromosome is active in female mammals.
Telomeric non-coding RNAs
Telomeres form the terminal region of mammalian chromosomes and are essential for stability and aging and play central roles in diseases such as cancer.  Telomeres have been long considered transcriptionally inert DNA-protein complexes until it was shown in the late 2000s that telomeric repeats may be transcribed as telomeric RNAs (TelRNAs)  or telomeric repeat-containing RNAs.  These ncRNAs are heterogeneous in length, transcribed from several sub-telomeric loci and physically localise to telomeres. Their association with chromatin, which suggests an involvement in regulating telomere specific heterochromatin modifications, is repressed by SMG proteins that protect chromosome ends from telomere loss.  In addition, TelRNAs block telomerase activity in vitro and may therefore regulate telomerase activity.  Although early, these studies suggest an involvement for telomeric ncRNAs in various aspects of telomere biology.
In regulation of DNA replication timing and chromosome stability
Asynchronously replicating autosomal RNAs (ASARs) are very long (~200kb) non-coding RNAs that are non-spliced, non-polyadenylated, and are required for normal DNA replication timing and chromosome stability. Deletion of any one of the genetic loci containing ASAR6, ASAR15, or ASAR6-141 results in the same phenotype of delayed replication timing and delayed mitotic condensation (DRT/DMC) of the entire chromosome. DRT/DMC results in chromosomal segregation errors that lead to increased frequency of secondary rearrangements and an unstable chromosome. Similar to Xist, ASARs show random monoallelic expression and exist in asynchronous DNA replication domains. Although the mechanism of ASAR function is still under investigation, it is hypothesized that they work via similar mechanisms as the Xist lncRNA, but on smaller autosomal domains resulting in allele specific changes in gene expression.
Recent recognition that long ncRNAs function in various aspects of cell biology has focused increasing attention on their potential to contribute towards disease etiology. More than 80% of experimentally studied lncRNAs (1502 of the 1867 lncRNAs in LncBook) have been reported to be associated with 462 diseases and 28 MeSH disease terms, and 97,998 lncRNAs are potentially associated with diseases based on multi-omics evidence. A handful of studies have implicated long ncRNAs in a variety of disease states and support an involvement and co-operation in neurological disease and cancer.
The first published report of an alteration in lncRNA abundance in aging and human neurological disease was provided by Lukiw et al. in a study using short post-mortem interval Alzheimer's disease and non-Alzheimer's dementia (NAD) tissues. This early work was based on the prior identification of a primate brain-specific cytoplasmic transcript of the Alu repeat family by Watson and Sutcliffe in 1987, known as BC200 (brain, cytoplasmic, 200 nucleotide).
While many association studies have identified unusual expression of long ncRNAs in disease states, there is little understanding of their role in causing disease. Expression analyses that compare tumor cells and normal cells have revealed changes in the expression of ncRNAs in several forms of cancer. For example, in prostate tumours, PCGEM1 (one of two overexpressed ncRNAs) is correlated with increased proliferation and colony formation, suggesting an involvement in regulating cell growth. MALAT1 (also known as NEAT2) was originally identified as an abundantly expressed ncRNA that is upregulated during metastasis of early-stage non-small cell lung cancer, and its overexpression is an early prognostic marker for poor patient survival rates. More recently, the highly conserved mouse homologue of MALAT1 was found to be highly expressed in hepatocellular carcinoma. Intronic antisense ncRNAs with expression correlated to the degree of tumor differentiation in prostate cancer samples have also been reported. Despite a number of long ncRNAs having aberrant expression in cancer, their function and potential role in tumorigenesis is relatively unknown. For example, the ncRNAs HIS-1 and BIC have been implicated in cancer development and growth control, but their function in normal cells is unknown. In addition to cancer, ncRNAs also exhibit aberrant expression in other disease states. Overexpression of PRINS is associated with psoriasis susceptibility, with PRINS expression being elevated in the uninvolved epidermis of psoriatic patients compared with both psoriatic lesions and healthy epidermis.
Genome-wide profiling revealed that many transcribed non-coding ultraconserved regions exhibit distinct profiles in various human cancer states.  An analysis of chronic lymphocytic leukaemia, colorectal carcinoma and hepatocellular carcinoma found that all three cancers exhibited aberrant expression profiles for ultraconserved ncRNAs relative to normal cells. Further analysis of one ultraconserved ncRNA suggested it behaved like an oncogene by mitigating apoptosis and subsequently expanding the number of malignant cells in colorectal cancers.  Many of these transcribed ultraconserved sites that exhibit distinct signatures in cancer are found at fragile sites and genomic regions associated with cancer. It seems likely that the aberrant expression of these ultraconserved ncRNAs within malignant processes results from important functions they fulfil in normal human development.
Recently, a number of association studies have mapped single nucleotide polymorphisms (SNPs) associated with disease states to long ncRNAs. For example, SNPs that identified a susceptibility locus for myocardial infarction mapped to a long ncRNA, MIAT (myocardial infarction associated transcript). Likewise, genome-wide association studies identified a region associated with coronary artery disease that encompassed a long ncRNA, ANRIL. ANRIL is expressed in tissues and cell types affected by atherosclerosis, and its altered expression is associated with a high-risk haplotype for coronary artery disease.
The complexity of the transcriptome, and our evolving understanding of its structure may inform a reinterpretation of the functional basis for many natural polymorphisms associated with disease states. Many SNPs associated with certain disease conditions are found within non-coding regions and the complex networks of non-coding transcription within these regions make it particularly difficult to elucidate the functional effects of polymorphisms. For example, a SNP both within the truncated form of ZFAT and the promoter of an antisense transcript increases the expression of ZFAT not through increasing the mRNA stability, but rather by repressing the expression of the antisense transcript. 
The ability of long ncRNAs to regulate associated protein-coding genes may contribute to disease if misexpression of a long ncRNA deregulates a protein-coding gene with clinical significance. Similarly, an antisense long ncRNA that regulates the expression of the sense BACE1 gene, which encodes a crucial enzyme in Alzheimer's disease etiology, exhibits elevated expression in several regions of the brain in individuals with Alzheimer's disease. Alteration of the expression of ncRNAs may also mediate changes at an epigenetic level to affect gene expression and contribute to disease etiology. For example, the induction of an antisense transcript by a genetic mutation led to DNA methylation and silencing of sense genes, causing β-thalassemia in a patient.
Does PEN3 Transport Multiple Structurally Unrelated Substrates?
This study, together with previous work, implicates plasma membrane-resident PEN3/PDR8/ABCG36 in responses to various abiotic and biotic stresses, indicating that the protein may export multiple substrates, including (1) Cd2+ in leaf protoplasts (Kim et al., 2007); (2) IBA in root tips (Strader and Bartel, 2009); (3) unknown substrate(s) under salt and drought stress conditions (Kim et al., 2010); (4) IG-derived products of the PEN2 pathway required for preinvasive defense to nonadapted powdery mildews (Stein et al., 2006; Bednarek et al., 2009); (5) unknown substrate(s) linked to SA hyperaccumulation, leaf chlorosis, and cell death in response to host-adapted G. orontii (Fig. 4) and other leaf pathogens (Kobae et al., 2006; Stein et al., 2006); and (6) substrates limiting growth of the necrotrophic ascomycete pathogen P. cucumerina (Stein et al., 2006; Sanchez-Vallet et al., 2010). PDR-type ABC transporters can have multiple substrates, a property reflected in the name of the protein family (e.g. PDR5 in yeast [Saccharomyces cerevisiae] confers resistance to a large set of functionally and structurally unrelated exogenous antifungal and anticancer drugs; Jungwirth and Kuchler, 2006; Rea, 2007; Kang et al., 2011). For this reason, the existence of multiple in planta PEN3 substrates is not unexpected. Alternatively, PEN3-mediated efflux of one common stress-inducible compound into the apoplast is conceivable for plant adaptation to diverse abiotic and biotic stresses.
Here, we have characterized an unusual pen3 allele, pen3-5, which was isolated in the context of a mutant screen aimed at identifying postinvasive defense components to powdery mildews in a pen2-1 null mutant background. Both PEN2 and PEN3 are required for effective preinvasive defense to nonadapted powdery mildews (Lipka et al., 2005; Stein et al., 2006), and their gene expression is coregulated (Humphry et al., 2010), but PEN3 alone has an additional function in limiting postinvasive growth of these pathogens (Fig. 1; Supplemental Fig. S1), strongly suggesting that different PEN3 substrates become engaged in pre- and postinvasive defense against nonadapted powdery mildew fungi.
4OGlcI3F Is a Product of IG Metabolism and Directly Linked to PEN3 Substrate(s) in Preinvasive Defense to Nonadapted Powdery Mildews
All known pen3 alleles, including pen3-5 described in this study, fully impair preinvasive defense to nonadapted powdery mildews (Fig. 3, C and D). This indicates that all of the resulting single-amino acid changes in the ABC transporter (Fig. 2) affect the PEN3-mediated transport of product(s) generated by the PEN2 myrosinase metabolic pathway (Bednarek et al., 2009). Our analyses identified 4OGlcI3F as a Trp-derived compound whose biosynthesis is stimulated upon powdery mildew inoculation and is dependent on PEN2 and CYP81F2 activity (Fig. 6), indicating that 4OGlcI3F is one of the end products of pathogen-stimulated IG metabolism (Bednarek et al., 2009). Moreover, this metabolite accumulated to significantly higher levels in leaves of all tested pen3 alleles compared with wild-type plants, suggesting that 4OGlcI3F could be a molecule directly transported to the apoplast by PEN3. However, because the end product of the PEN2 metabolite pathway with a function in preinvasive defense is thought to exert antimicrobial activity against eukaryotic fungal pathogens (Bednarek et al., 2009), such a compound might also be toxic for the eukaryotic host. For this reason, it is likely that the nontranslocated bioactive molecule does not accumulate inside pen3 plant cells but instead is metabolized to a less active derivative(s). This hypothesis is supported by the presence of a Glc residue in the identified structure of 4OGlcI3F, because glucosylation is considered a prominent detoxification strategy for plant metabolites (Morant et al., 2008). Thus, we propose that it is not 4OGlcI3F itself that is directly transported by PEN3 but rather its precursor(s).
In accordance with published results, the PEN2 metabolic pathway requires the activity of CYP81F2 P450 monooxygenase that converts indol-3-ylmethyl glucosinolate ( I3G) to 4-OH- I3G ( Bednarek et al., 2009 Pfalz et al., 2009). This IG can be further converted by respective O-methyltransferases ( IGMTs) to 4MI3G, which hyperaccumulates in pen2 mutants upon pathogen inoculation ( Bednarek et al., 2009 Pfalz et al., 2011). However, it is presently unknown whether IGMTs are required for preinvasive defense responses to nonadapted powdery mildews. In the simplest biosynthetic route for 4OGlcI3F that possesses GlcO and not a methoxyl group, 4-OH- I3G would be converted by PEN2 without any IGMT contribution ( Fig. 7A). However, it is still possible that the IGMTs are involved in 4OGlcI3F formation and that one of the biosynthetic intermediates is demethylated by a putative O-demethyltransferase ( Hagel and Facchini, 2010).
Models for biosynthetic pathways of PEN3 substrates and deduced allele-specific PEN3 transport activities in pen3 plants. A, PEN3 activities in the efflux of different Trp-derived metabolites upon attack by pathogenic fungi, including bioactive products of the CYP81F2/PEN2 pathway (4OH-X) against nonadapted powdery mildews and unknown metabolites (Y) for resistance against P. cucumerina. Dashed lines indicate putative steps and enzymatic components. DMT, O-demethyltransferase; 4OHI3G, 4-hydroxy-indol-3-ylmethyl glucosinolate; UGT, UDP-glucosyltransferase. B, Allele-specific uncoupling of a subset of PEN3 functions. Wild-type PEN3 transports IG-derived products (yellow triangles), IBA (green squares), and unknown substrates linked to SA hyperaccumulation (blue circles). Differential hyperaccumulation of these compounds in the indicated genotypes is shown. *, Substitutions in mutant PEN3 variants.
Notably, IBA, another potential PEN3 substrate, is also a Trp-derived compound that comprises the indole core in its structure ( Strader and Bartel, 2009 Ruzicka et al., 2010). For this reason, it is possible that the indole core serves as a primary structural motif recognized by PEN3 for transport into the apoplast. However, given the fact that indole-3-acetic acid, which is also an indolic molecule, is not a PEN3 substrate, it is likely that the side chains attached to the core ring structure act as the substrate recognition/interaction sites.
Accumulation of PEN3 Substrate(s) in pen3 Plants Triggers Redox Imbalance and SA Biosynthesis in Response to Host-Adapted Powdery Mildews
Here, we have reported a pen3 allele-specific uncoupling of a subset of PEN3 functions: pen3-5 and pen3-6 single-amino acid substitution mutants retain wild type-like susceptibility to host-adapted G. orontii and do not hyperaccumulate SA upon pathogen challenge (Figs. 4 and 7B). In contrast, an edr infection phenotype and leaf chlorosis were seen on pen3-1, pen3-2, and pen3-4 plants (Fig. 4, A and B). Notably, this edr infection phenotype is independent of the PEN2 pathway (Supplemental Fig. S6), implying that PEN3 substrates other than IG-derived metabolites are transported into the apoplast during colonization with host-adapted powdery mildews. The edr phenotype in the presence of pen3-1, pen3-2, and pen3-4 alleles is tightly correlated with a pathogen-inducible hyperaccumulation of hydrogen peroxide and SA in leaf tissue (Fig. 4, C and D; Supplemental Fig. S2). The accumulation of hydrogen peroxide is usually closely interconnected with SA biosynthesis, and the accumulation of both molecules forms a self-amplifying feedback loop (Vlot et al., 2009). The regulation of this loop is linked to the glutathione-ascorbate cycle. For instance, the ascorbate-deficient vitamin C mutants constitutively hyperaccumulate both SA and hydrogen peroxide (Mukherjee et al., 2010). In addition, a study using Arabidopsis mutants catalase-deficient2 and glutathione-deficient cadmium-sensitive2 indicates that the cellular glutathione redox status is a key player linking intracellular hydrogen peroxide with the activation of the SA pathway (Han et al., 2013). Collectively, this strongly suggests that, upon colonization with host-adapted powdery mildews on pen3-1, pen3-2, and pen3-4 plants, a PEN3 substrate(s), which is directly or indirectly linked to plant redox balance, hyperaccumulates inside host cells, thereby triggering hydrogen peroxide and SA overaccumulation. This, in turn, can explain the observed edr infection phenotype to G. orontii and G. cichoracearum on these plants. Taken together, this indicates the existence of a PEN3-dependent regulatory mechanism for intracellular redox balance in wild-type plants.
Single-Amino Acid Substitutions in TMDs Uncouple a Subset of PEN3-Mediated Stress Responses
What is the molecular mechanism underlying the allele-specific uncoupling of PEN3 functions? Similar to pen3-5 and pen3-6, pen3-1 and pen3-2 alleles encode single-amino acid mutants of the ABC transporter (Fig. 2), but hydrogen peroxide and SA hyperaccumulation in response to G. orontii is only seen in the presence of the latter two alleles. pen3-1 and pen3-2 alleles affect invariant residues in the N- or C-terminal NBDs of PEN3 and share with the pen3-4 null mutant a dysfunction of all tested plant responses (Fig. 7B). The NBDs of ABC transporters are highly sequence conserved among plants and fungi, including the Walker A and Walker B core motifs and the ABC signature (Prasad and Goffeau, 2012). During substrate transport across membranes, the two NBDs undergo conformational changes with nucleotide-dependent NBD dimerization and nucleotide hydrolysis-dependent dissociation for substrate translocation (Jungwirth and Kuchler, 2006). The amino acid substituted in pen3-1 is located in the ABC signature of the N-terminal NBD (G354D) next to a predicted S/T phosphorylation site, T353 (Blom et al., 1999). In pen3-2, the point mutation (G915S) affects one of the key residues in the consensus Walker A motif (GxxGxGKS/T; x represents any amino acid) of the C-terminal NBD (Stein et al., 2006). Those mutations likely impair ATP binding and/or ATP hydrolysis, which are needed for transport activity of PDR-type transporters. Notably, PEN3 steady-state levels are clearly reduced in pen3-1 and pen3-2 (Fig. 2D), indicating that either PEN3 transport activity is linked to PEN3 accumulation/turnover in the plasma membrane or the amino acid substitutions cause misfolding of PEN3, which is recognized in the endoplasmic reticulum and eliminated by the plant endoplasmic reticulum-associated degradation pathway for integral membrane proteins (Müller et al., 2005; Lu et al., 2009; Saijo et al., 2009; Tintor and Saijo, 2014).
Plasma membrane-resident PEN3 focally accumulates underneath attempted fungal entry sites, and this process is triggered by the perception of MAMPs, such as fungus-derived chitin or bacterium-derived flg22 (Stein et al., 2006; Xin et al., 2013). Similarly, extracellular POWDERY MILDEW RESISTANT4 (PMR4)/GLUCAN SYNTHASE-LIKE5 (GSL5)-mediated callose deposition occurs underneath attempted fungal entry sites (Jacobs et al., 2003). Extracellular callose deposition mediated by PMR4/GSL5 can also be induced by flg22 treatment in a process that needs PEN3 (Clay et al., 2009). Preinvasive resistance to the nonadapted powdery mildews and flg22-triggered callose deposition are both compromised in pen3-1 and pen3-2 plants but not in the presence of pen3-5 and pen3-6 alleles (Fig. 3, C and D; Supplemental Fig. S3), suggesting that PEN3 focal accumulation might only be defective in the former two mutant alleles. The two missense alleles pen3-5 and pen3-6 encode PEN3 single-amino acid substitutions, but these substitutions reside in TMDs and do not affect PEN3 steady-state levels (Fig. 2D). Although TMDs in general form an α-helix as the basal structure, their amino acid sequences are poorly conserved among yeast PDR transporters (Prasad and Goffeau, 2012). It has been proposed that substrate specificity of PDR transporters is determined by TMDs, which are essential for substrate selection, recognition, and translocation (Jungwirth and Kuchler, 2006; Lamping et al., 2010). Random and site-directed mutagenesis together with biochemical studies on yeast PDR5 and Candida albicans CANDIDA DRUG RESISTANCE1 transporters have revealed the existence of multiple substrate-binding sites in different TMDs and that some substrates can associate with more than one site (Jungwirth and Kuchler, 2006; Tanabe et al., 2011). For example, the S1260F substitution in PDR5 affects the transport only of a subset of substrates (Ernst et al., 2010).
Thus, the observed uncoupling of a subset of PEN3-mediated responses in pen3-5 and pen3-6 plants could be explained by impaired substrate selectivity and/or binding of a subset of PEN3 substrates ( Fig. 7B). Remarkably, all pen3 mutations, except the unique pen3-5 allele, result in enhanced root growth sensitivity to exogenous IBA treatment ( Fig. 5). This assigns a critical role of the corresponding L704 residue in the fourth transmembrane span for substrate selectivity/binding of 4OGlcI3F precursor(s) but not for the efflux of IBA or compounds linked to intracellular redox balance ( Fig. 7B).
A closely related paralog of PEN3, PDR9/ABCG37/POLAR AUXIN TRANSPORT INHIBITOR-SENSITIVE1 (PIS1), has been proposed to act redundantly with PEN3 at outermost root plasma membranes in IBA transport. Plants carrying pdr9-2 ( T-DNA insertion) or pis1-1 (point mutation) are hypersensitive to IBA, and this phenotype is aggravated in the pdr9-2 pen3-4 double mutants ( Ruzicka et al., 2010). We tested whether PDR9 can function together with PEN3 in preinvasive defense against nonadapted powdery mildews E. pisi and B. graminis. We found indistinguishable entry rates on pdr9-2 and pis1-1 mutants compared with the wild type ( Supplemental Fig. S7 ). This strongly suggests that PDR9 does not act together with PEN3 in preinvasive defense. Thus, only PEN3 transports the active product of the PEN2 metabolic pathway in leaves. In this context, the amino acid exchanged in pen3-5, L704F, is a V in PDR9 (in Columbia-0 [ Col-0] Fig. 2C), which is consistent with the assumption that PEN3 L704 is critical for the efflux of 4OGlcI3F precursor(s) but not IBA. This notion is supported by the lack of natural allelic variation at the respective PEN3 and PDR9 residues among 260 examined Arabidopsis accessions ( Long et al., 2013 http://1001genomes.org).
Evidence for Diversified Trp-Derived Antimicrobial Compounds Exported by PEN3
Previous work suggested that IG biosynthesis together with PEN2-mediated metabolism of these compounds are required for flg22-induced callose deposition. Consequently, pen2-1 and pen3-1 plants were found to be defective in callose production upon application of this MAMP ( Clay et al., 2009). We have shown here that flg22-induced callose deposition is diminished in pen3-1, pen3-2, and pen3-4 plants but retained in pen3-5 and pen3-6 mutants ( Supplemental Fig. S3 ). This indicates that the PEN3 substrate required for flg22-induced callose deposition is distinct from the PEN2-generated molecule(s) critical for preinvasion resistance. Alternatively, it is possible that PEN3 substrates in both defense responses are the same if the residual activity of PEN3-5 is sufficient to mediate callose deposition but insufficient for preinvasive resistance. Previous work showed greater growth of the P. cucumerina ascomycete fungus on leaves of pen3 compared with pen2 plants ( Stein et al., 2006 Sanchez-Vallet et al., 2010), indicating that PEN3, apart from product(s) of the PEN2 pathway, exports another molecule critical for Arabidopsis defense to this pathogen. The same studies also indicated that Arabidopsis defense to host-adapted and nonadapted P. cucumerina strains is mediated primarily by glutathione and Trp-derived secondary metabolites. This was concluded from P. cucumerina supersusceptibility infection phenotypes on glutathione-deficient phytoalexin-deficient2 and double cyp79B2 cyp79B3 knockout plants ( Fig. 7A Sanchez-Vallet et al., 2010). The latter mutant carries mutations in two P450 monooxygenases mediating the early step in the biosynthesis of Arabidopsis Trp-derived secondary metabolites ( Zhao et al., 2002 Glawischnig et al., 2004 Böttcher et al., 2009). For this reason, it is likely that, during Arabidopsis defense to P. cucumerina, PEN3 translocates to the apoplast molecules that are linked to glutathione or Trp-derived metabolite(s). 
Because glutathione takes part in intracellular redox balance (see above), it is conceivable that the edr phenotype to host-adapted powdery mildews involves the same compound class, which confers supersusceptibility to P. cucumerina on pen3 plants. However, the pen3-5 allele clearly uncouples the contrasting infection phenotypes to these two pathogens (Fig. 4; Supplemental Fig. S4), indicating that eds to P. cucumerina can occur without redox imbalance. Nonetheless, because PEN3 is implicated in the efflux of IBA and indole-type end product(s) of the PEN2 pathway, it is very likely that this transporter is also capable of accepting as a substrate another structurally related Trp-derived molecule(s) that limits P. cucumerina growth (Sanchez-Vallet et al., 2010). Together, this illustrates the chemical diversity and quantitative mode of action of small molecules exported by PEN3 in extracellular defense to leaf pathogens.
We observed pronounced methylation differences between alleles spanning whole CpG islands in non-imprinted loci of human autosomes, and demonstrated a correlation with allelic gene expression level in one case. We further confirmed that the methylation differences between alleles were strongly correlated with genetic polymorphisms, which can further influence the inter-individual variability of DNA methylation. Our results suggest that ASM can affect many genes in the human genome and that genetic differences must be taken into account in future comparative DNA methylation studies.
Results and discussion
MBASED: meta-analysis based allele-specific expression detection
First, we give an overview of our method, MBASED, with detailed descriptions provided in Materials and methods and in Supplementary methods in Additional file 1. Given RNA read counts supporting reference and alternative alleles at individual SNVs within a unit of expression, MBASED provides an estimate of ASE and a corresponding P-value. A unit of expression can be a gene, a transcript isoform, an exon, or an individual SNV: MBASED is agnostic with respect to the nature of the unit provided by the user. In this work, we choose the gene as a unit of ASE, which we define as the union of all exons forming individual transcript isoforms.
For a given gene, MBASED provides a framework for aggregation of SNV-level information into a single measure of ASE. The meta-analytic approach adopted by MBASED relies on specification of gene haplotypes, which may be unknown for many data sets. In one-sample ASE analysis, when true haplotypes are unknown, MBASED uses RNA read counts at individual SNVs within a gene to phase SNVs into two haplotypes. We adopt a pseudo-phasing approach that assigns the allele with the larger read count at each SNV to the 'major' haplotype, with the implicit assumption that ASE is consistent in one direction along the length of the gene. This procedure is not intended to faithfully reconstruct the true underlying haplotypes in all cases, but we expect it to do so for genes showing sufficiently strong ASE. We quantify the allelic imbalance within a sample as the major allele (haplotype) frequency (MAF) of the gene. ASE detection then becomes a problem of identifying genes with MAF >0.5. Phased counts from the 'major' haplotype are transformed into normally distributed scores, and scores from individual SNVs are combined into a single gene-level score using a meta-analytic approach. This score is then used to obtain an estimate of underlying allelic imbalance. The meta-analytic statistical inference requires the correct specification of gene haplotypes in order to assign proper statistical significance to the observed ASE. Consequently, the pseudo-phasing procedure employed by MBASED in cases of unknown true haplotypes leads to anti-conservative nominal P-values (Materials and methods). We address this problem by employing internal simulations to adjust the reported significance levels. For genes showing strong ASE, we expect our phasing procedure to result in an accurate estimate of MAF, while internal simulations will eliminate most of the allelically balanced genes that may exhibit strong nominal significance due to pseudo-phasing.
The basic principles of MBASED in absence of phasing information are illustrated in Figure 1.
Overview of MBASED algorithm (one-sample analysis). The two-sample approach is similar and is described in the text. (A) When true haplotypes are unknown, MBASED pseudo-phases SNVs within a gene by creating a major haplotype out of the alleles with larger RNA read counts at each SNV. A meta-analytic approach is then used to aggregate ASE information across individual SNVs to produce a meta-analysis estimate of major haplotype frequency (MAF), T_FT. (B) Keeping total read counts at each SNV constant, we simulate reference allele counts from a null distribution with an underlying haplotype frequency ratio of 1:1, and then pseudo-phase the alleles into haplotypes based on simulated read counts. Repeating this process 10^6 times, we obtain an estimate of the null distribution of T_FT and assign a final ASE P-value, p_{g,ASE}, to gene g as the observed fraction of simulated estimates that are as extreme as or more extreme than T_FT.
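The pseudo-phasing and internal-simulation steps described above can be sketched roughly as follows. This is a simplified illustration and not the MBASED implementation: the function names are hypothetical, a pooled major-haplotype fraction stands in for the meta-analytic estimate, and far fewer simulations are run than the number quoted in the caption.

```python
import random

def pseudo_phase(ref_counts, alt_counts):
    """Put the allele with the larger read count at each SNV on the 'major' haplotype."""
    major = [max(r, a) for r, a in zip(ref_counts, alt_counts)]
    total = [r + a for r, a in zip(ref_counts, alt_counts)]
    return major, total

def pooled_maf(major, total):
    """Assumed stand-in for the meta-analytic estimate: pooled major-haplotype fraction."""
    return sum(major) / sum(total)

def simulated_ase_p_value(ref_counts, alt_counts, n_sim=2000, seed=0):
    """Estimate MAF after pseudo-phasing and a simulation-based P-value.

    Null model: at each SNV the total coverage is kept fixed and reference
    counts are drawn with probability 0.5 (a 1:1 haplotype ratio). Each
    simulated gene is pseudo-phased in the same way as the observed one.
    """
    rng = random.Random(seed)
    major, total = pseudo_phase(ref_counts, alt_counts)
    observed = pooled_maf(major, total)
    as_extreme = 0
    for _ in range(n_sim):
        sim_ref = [sum(rng.random() < 0.5 for _ in range(n)) for n in total]
        sim_alt = [n - r for n, r in zip(total, sim_ref)]
        sim_major, _ = pseudo_phase(sim_ref, sim_alt)
        if pooled_maf(sim_major, total) >= observed:
            as_extreme += 1
    return observed, as_extreme / n_sim

# Two SNVs whose major alleles sit on opposite sides (ref at the first, alt at the second).
maf, p = simulated_ase_p_value([30, 8], [10, 32])
```

Because the simulated genes are pseudo-phased too, a perfectly balanced gene gets a P-value of 1 rather than a spuriously small one, which is the point of the adjustment.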
In two-sample ASE analysis, the goal is to detect differential allelic imbalance between paired samples from the same individual. MBASED treats this problem in an asymmetric way, by designating one of the two samples as the sample of interest, for example, the tumor sample in a tumor versus normal comparison. If true haplotypes are unknown, then for any gene that exhibits tumor-specific ASE, only the tumor read counts are informative for separating haplotypes into 'major' and 'minor'. In such cases, MBASED phases alleles at individual SNVs into two haplotypes based exclusively on the sample of interest (tumor, in this case). If normal-specific ASE is under study, for example, when investigating loss of imprinting, then the normal sample can be designated as the sample of interest. Differences between 'major' allele frequencies at individual SNVs in the two samples are used as measures of between-sample ASE. SNV-level scores are combined into a gene-level score using meta-analysis, analogous to the one-sample approach. This composite score provides an estimate of gene-level MAF difference between the samples. As in our one-sample approach, internal simulations are used to assign statistical significance to the observed allelic imbalance in cases of unknown true haplotypes.
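The asymmetric phasing in the two-sample setting can be illustrated with a minimal sketch. The function name and the use of raw frequency differences are assumptions made for illustration; the transformation to scores and the meta-analytic combination are left out.

```python
def between_sample_differences(interest_ref, interest_alt, other_ref, other_alt):
    """Per-SNV difference in 'major' allele frequency between two paired samples.

    The major allele at each SNV is chosen using the sample of interest only
    (for example, the tumor); the same allele is then followed in the other sample.
    """
    diffs = []
    for i_ref, i_alt, o_ref, o_alt in zip(interest_ref, interest_alt, other_ref, other_alt):
        if i_ref >= i_alt:          # reference allele is 'major' in the sample of interest
            i_major, o_major = i_ref, o_ref
        else:                       # alternative allele is 'major' in the sample of interest
            i_major, o_major = i_alt, o_alt
        diffs.append(i_major / (i_ref + i_alt) - o_major / (o_ref + o_alt))
    return diffs

# Tumor shows imbalance at both SNVs while the matched normal is balanced.
diffs = between_sample_differences([45, 5], [5, 45], [25, 25], [25, 25])
```

Swapping the argument order designates the normal sample as the sample of interest, as would be done when looking for normal-specific ASE.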
We adopt the approach of DerSimonian and Laird in establishing a meta-analysis framework for combining information across SNVs. This approach views the true unobserved treatment effect (in our case, ASE) at each observational unit of a common phenomenon (SNVs of a gene) as a random variable with a common mean. The estimate of that mean is obtained by combining information across individual units and represents a measure of the global effect (gene-level ASE). Within this framework, MBASED also reports the P-value corresponding to the constancy of the treatment effect statistic, Q, for multi-SNV genes in both one-sample and two-sample analyses. Q measures the observed extent of inter-SNV variability of ASE within a single gene (heterogeneity). Small reported P-values indicate genes with individual SNVs showing significantly inconsistent estimates of ASE. Such patterns can arise due to differences in ASE between various transcript isoforms of a gene, and therefore MBASED provides metrics for assessing the extent of isoform-specificity of the observed gene-level ASE.
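As a rough illustration of the random-effects framework, the sketch below applies the standard DerSimonian-Laird estimator to per-SNV effect estimates and their variances. It is a generic textbook version under assumed inputs, not the MBASED code; the function name is hypothetical, and converting Q to a P-value (a chi-square test with k - 1 degrees of freedom) is omitted.

```python
import math

def dersimonian_laird(effects, variances):
    """Combine per-SNV effect estimates with a random-effects model.

    Returns the combined estimate, its standard error, the heterogeneity
    statistic Q, and the estimated between-SNV variance tau^2.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]                         # inverse-variance weights
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Method-of-moments estimate of between-unit variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 and c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]           # random-effects weights
    combined = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return combined, se, q, tau2

# Two SNVs with discordant estimates: Q is large and tau^2 is positive.
combined, se, q, tau2 = dersimonian_laird([0.9, 0.5], [0.01, 0.01])
```

A large Q relative to k - 1 signals that SNVs within the gene disagree, which is the situation the text associates with isoform-specific ASE.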
Situations where one allele is favored in the read count data, even in the absence of underlying ASE, have been reported in RNA-Seq data, due, for instance, to enrichment protocols, technological artifacts or the choice of a short read aligner. We refer to such cases as instances of pre-existing allelic bias. When supplied with the values of probabilities of observing each allele at individual SNVs under conditions of no ASE, MBASED can incorporate such pre-existing biases into its estimates of ASE (Supplementary methods in Additional file 1), and we further provide functionality to estimate these probabilities from the data set itself. Our algorithm is implemented in the R package MBASED. Further details are found in Materials and methods, Supplementary methods in Additional file 1, and the package vignette.
Robust allele-specific expression detection by MBASED
To demonstrate the performance of MBASED in the absence of phasing information, we analyzed multiple sets of simulated data, in which artificially introduced allele-specific expression patterns were assigned to different genes at various allele preferences and expression levels. We selected a pair of matched tumor-normal samples from our panel (HCC individual 2) and recorded all of the detected exonic heterozygous SNVs in both samples, retaining information about the total RNA coverage of each SNV, while discarding the observed reference and alternative allele counts in each sample. We chose a real data set as the basis for our simulations to ensure that the simulated data sets had realistic distributions of both the number of heterozygous SNVs per gene and the read coverage per SNV.
We assessed the performance of the one-sample MBASED algorithm using the tumor sample. Briefly, we divided all tested genes in the sample into 25 strata based on 5 levels of each of the 2 covariates of interest: the number of SNVs in a gene (1, 2, 3, 4, or 5+) and the average coverage of SNVs in a gene (10 to 20, 20 to 30, 30 to 40, 40 to 50, or 50+ reads/SNV). The stratification was done to ensure that we tested MBASED across a variety of settings. Within each stratum, we randomly assigned a specified fraction f of the genes (for example, 25%) to be ASE true positives (TPs), and the rest of the genes were assigned to be ASE true negatives (TNs). We then simulated read counts for SNVs in ASE TN genes from the null distribution (MAF = 0.5), while for SNVs in ASE TP genes the counts were simulated from a signal distribution, where we varied signal strength (MAF) from 0.7 to 0.9 (Materials and methods). We then ran MBASED on each simulated data set and declared any gene with a Benjamini-Hochberg (BH) adjusted P-value ≤0.05 to exhibit ASE. We performed 100 simulations for each combination of simulation settings, and Figure 2 illustrates average (over simulations) MBASED performance for f = 25%.
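For readers unfamiliar with the Benjamini-Hochberg adjustment used as the calling threshold, a minimal sketch of the standard step-up procedure follows. The function name and the example P-values are illustrative only and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted P-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices sorted by ascending P-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity of adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical gene-level P-values; genes with adjusted values <= 0.05 would be called ASE.
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
called = [a <= 0.05 for a in adj]
```

Only the first of the four hypothetical genes passes the 5% threshold here, even though three of the raw P-values are below 0.05.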
Performance of MBASED on simulated data in one-sample analysis. Genes were broken into 25 strata, based on number of SNVs in a gene and average number of reads per SNV. Within each stratum, 25% of genes were randomly chosen to exhibit ASE. For each SNV in a true positive ASE gene, one allele was randomly assigned to the major haplotype and the corresponding read counts were simulated as described in Materials and methods. MBASED was run on the simulated data and genes with Benjamini-Hochberg adjusted P-values ≤0.05 (false discovery rate (FDR) control of 5%) were declared ASE. The average (across 100 simulations) true positive rate (TPR) and FDR within each stratum and for each level of ASE signal (MAF used for ASE true positive genes) are shown. The overall TPR and FDR levels are obtained by giving each stratum weight proportional to the fraction of genes in that stratum (that is, these values are heavily weighted towards genes with few SNVs and low coverage, common in our data), and the average values are given in panel titles, along with their estimated standard errors (SE). MBASED performs very well at higher coverage levels and higher ASE extent.
We found that the true positive rate (TPR) increased with read coverage and underlying ASE strength (MAF), as well as with the number of SNVs in a gene. We controlled the overall false discovery rate (FDR) at the nominal level of 5%, indicating that the P-value adjustment was effective. MBASED performed well even in low information settings. For example, >90% of ASE TP genes with 2 SNVs and 20 to 30 reads/SNV were recovered in simulations with MAF = 0.8. In analyzing real data, we required that a gene exhibit an estimated MAF ≥0.7 in addition to passing the statistical significance cutoff in order to be declared as exhibiting ASE (Materials and methods). As expected, this additional effect size cutoff reduced the TPR drastically for underlying ASE strength MAF = 0.7 (overall TPR fell from 55% to 37%), but had no appreciable effect on the TPR for higher values of MAF (data not shown).
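The evaluation described above (Benjamini-Hochberg adjustment, then TPR and FDR against the known simulated labels) can be sketched as follows. This is a minimal illustration, not the MBASED code; the function names and the input layout (one P-value and one true/false label per gene) are assumptions made for the example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def tpr_and_fdr(adjusted, is_true_positive, alpha=0.05):
    """Compute the true positive rate and false discovery rate at cutoff alpha."""
    called = [p <= alpha for p in adjusted]
    tp = sum(c and t for c, t in zip(called, is_true_positive))
    fp = sum(c and not t for c, t in zip(called, is_true_positive))
    n_pos = sum(is_true_positive)
    tpr = tp / n_pos if n_pos else 0.0
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    return tpr, fdr
```

In a stratified simulation, the same two functions would be applied per stratum and the results averaged over repeated runs.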
Similarly, we performed simulations in the two-sample setting (Figure 3; Materials and methods). We observed the dependence of the TPR on read coverage and the number of SNVs per gene similar to one-sample simulations, although for a given combination of simulation settings the two-sample method had somewhat lower power.
Figure 3. Performance of MBASED on simulated data in two-sample analysis. Simulations were performed similar to the one-sample case, as described in Materials and methods. MBASED was run on the simulated data and genes with Benjamini-Hochberg adjusted P-values ≤0.05 (false discovery rate (FDR) control of 5%) were declared ASE. The average (across 100 simulations) true positive rate (TPR) and FDR within each stratum and for each level of ASE signal (MAF used for ASE true positive genes) are shown. The overall TPR and FDR levels are obtained by giving each stratum weight proportional to the fraction of genes in that stratum (that is, these values are heavily weighted towards genes with few SNVs and low coverage, common in our data), and the average values are given in panel titles, along with their estimated standard errors (SE). MBASED performs well at higher coverage levels and higher ASE extent, but its power is limited for the low-coverage, low-signal scenarios.
MBASED employs a beta-binomial distribution to model read count data (Materials and methods), which accounts for the extra-binomial variability (overdispersion) often observed in allelic counts in RNA-Seq data sets. We used levels of overdispersion similar to those observed in real data (Materials and methods) while performing simulations, and note that MBASED performance improves as the amount of overdispersion decreases and the separation between signal and noise distributions of the test statistic increases (Figures S1 to S4 in Additional file 1).
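To make the role of overdispersion concrete, the sketch below draws major-allele read counts from a beta-binomial distribution using only the Python standard library. The mean/dispersion parameterization (`mu`, `rho`) is a common convention and an assumption here; the source does not state which parameterization or dispersion values MBASED uses.

```python
import random


def simulate_major_allele_count(n_reads, mu, rho, rng):
    """Draw one beta-binomial count: mean fraction mu, dispersion rho (0 = binomial)."""
    if rho > 0:
        # Convert (mu, rho) to the shape parameters of the underlying beta distribution.
        a = mu * (1 - rho) / rho
        b = (1 - mu) * (1 - rho) / rho
        p = rng.betavariate(a, b)
    else:
        p = mu
    # Given p, the count is binomial: a sum of n_reads Bernoulli draws.
    return sum(rng.random() < p for _ in range(n_reads))


def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


if __name__ == "__main__":
    rng = random.Random(0)
    plain = [simulate_major_allele_count(30, 0.5, 0.0, rng) for _ in range(2000)]
    over = [simulate_major_allele_count(30, 0.5, 0.1, rng) for _ in range(2000)]
    print("binomial variance:", sample_variance(plain))
    print("beta-binomial variance:", sample_variance(over))
```

With the same coverage and the same mean, the overdispersed counts spread much more widely, which is why null and signal distributions overlap more as dispersion grows.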
We also tested the performance of MBASED in the setting of pre-existing allelic bias, by assuming that at each SNV under conditions of no ASE the probability of observing reference allele count, P ref, was >0.5 (global reference bias). We found the results to be very close to those observed in the no-bias simulations (Figures S5 and S6 in Additional file 1). We conclude that the MBASED method is robust in detecting ASE genes in samples with unknown true haplotypes, with detection power increasing with observed gene coverage and the number of detected heterozygous SNVs in a gene.
We further assessed the performance of MBASED in a situation where the true underlying haplotypes are known. We obtained previously published lymphoblastoid cell line RNA-Seq data and a list of phased genomic variants for (non-cancer) individual NA12878, genotyped together with both parents as part of the 1000 Genomes Project. We pre-processed the data analogously to other samples in our panel (Materials and methods) and applied MBASED both with ('phased') and without ('non-phased') specifying the true haplotypes. Overall, we tested 2,560 genes for ASE, including 1,104 (40%) with more than one heterozygous locus. Using the cutoffs of 0.7 on estimated MAF and 0.05 on adjusted P-value, MBASED found 110 genes to show ASE in the 'phased' setting and 115 genes in the 'non-phased' setting, of which 108 were in common, indicating a high degree of consistency (Figure S7 in Additional file 1; Additional file 2). The small number of observed discrepancies was due to higher power of MBASED to detect ASE in general, and isoform-specific ASE in particular, when true haplotypes are known. A detailed discussion of the observed differences between running MBASED with and without prior knowledge of true haplotypes is provided in the Supplementary discussion in Additional file 1. We further note that running MBASED without supplying the true haplotypes resulted in correct haplotype reconstruction of 40/47 (85%) ASE genes with multiple SNVs. Further investigation revealed that of seven instances where haplotype reconstruction failed, six were likely due to alignment artifacts (Supplementary discussion in Additional file 1).
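One simple way to picture haplotype reconstruction from read counts alone is to place the better-covered allele at each SNV on the same 'major' haplotype. The sketch below does exactly that; it is a naive stand-in written for illustration, not necessarily the rule MBASED applies, and the function name and input layout are assumptions.

```python
def pseudo_phase(snv_counts):
    """Assign the higher-count allele at each SNV to the major haplotype.

    snv_counts: list of (ref_reads, alt_reads) tuples for one gene.
    Returns (list of 'ref'/'alt' labels for the major haplotype, pooled MAF).
    """
    major_alleles = ["ref" if ref >= alt else "alt" for ref, alt in snv_counts]
    major_reads = sum(max(ref, alt) for ref, alt in snv_counts)
    total_reads = sum(ref + alt for ref, alt in snv_counts)
    return major_alleles, major_reads / total_reads


if __name__ == "__main__":
    alleles, maf = pseudo_phase([(30, 10), (8, 32), (25, 15)])
    print(alleles, maf)
```

Because the larger count is always chosen, the pooled MAF from this shortcut is biased upward even when no ASE is present, so any significance test built on it has to account for that bias.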
Finally, we compared the performance of MBASED with that of Skelly et al., which is to our knowledge the only currently published ASE detection method that allows for variable ASE within a gene. Since the method of Skelly et al. requires that the true haplotypes be known, we used NA12878 RNA-Seq data for this comparison and supplied the true haplotypes to MBASED (Materials and methods). The method of Skelly et al. identified 103 ASE genes (posterior P(ASE) >0.95, posterior median MAF ≥0.7), compared to 110 identified by MBASED, including 94 that were common to both methods (Additional file 2). Of the nine genes identified as ASE by the method of Skelly et al. only, all have estimated MBASED MAF ≥0.8, but fall short of the significance cutoff due to low read coverage (10 to 12 reads/SNV, MBASED ASE P-values 0.05 to 0.17). Of the 16 genes identified as ASE by MBASED only, 15 show posterior P(ASE) >0.95 according to the method of Skelly et al., with posterior median MAF values of 0.58 to 0.7. The lower MAF estimates of Skelly et al. are due to its no-ASE prior imposed on the data. A detailed discussion of the observed differences between MBASED and the method of Skelly et al. is provided in the Supplementary discussion in Additional file 1. We conclude that the two methods produce qualitatively and quantitatively similar results on this data set. We note, however, that MBASED can perform in situations when the true haplotypes are unknown, a major advantage over the method of Skelly et al. In addition, MBASED allows for the effects of pre-existing allelic bias and disambiguates the technical and biological contributions to overdispersion in the data (Materials and methods), while the method of Skelly et al. combines the two.
Cancer samples exhibit high levels of allele-specific expression
We applied the MBASED method to a panel consisting of 18 lung cancer cell lines, 3 non-small cell lung cancer (NSCLC) tumor tissue samples, 4 hepatocellular carcinoma (HCC) tumor tissue samples, and 7 matched normal samples for the tumor tissues (Table S1 in Additional file 3) for a total of 25 cancer (21 lung and 4 liver) and 7 normal samples. None of the samples in the panel had known haplotypes. One-sample MBASED analysis was performed for each of the 32 samples and two-sample analysis was performed for tumor/normal and normal/tumor comparisons of 7 paired samples. Within each sample (or a pair of samples for two-sample analysis) only the genes containing informative heterozygous SNVs were tested for ASE (Materials and methods). Any gene with a BH adjusted MBASED P-value ≤0.05 and estimated MAF ≥0.7 was declared as exhibiting ASE in one-sample analysis. Similarly, any gene with a BH adjusted MBASED P-value ≤0.05 and estimated MAF difference ≥0.2 was declared as exhibiting sample-of-interest-specific ASE in two-sample analysis. This assignment provided one way of determining a set of genes in which to further characterize ASE in downstream analysis. All genes with adjusted inter-SNV ASE variability P-value ≤0.05 were flagged as possibly subject to isoform-specific ASE effects. Further details of the analysis pipeline are provided in Materials and methods, and the full results of MBASED application to the samples in our panel are available in Additional files 4 and 5. We note that the power of MBASED to detect mild levels of ASE is limited in the low coverage setting (right panels of Figures 2 and 3), common in our data, and the ASE levels reported here likely underestimate the true extent of ASE in samples under study.
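The two calling rules stated above reduce to a pair of threshold checks. The sketch below encodes them directly; the function and parameter names are illustrative, and only the cutoff values (0.05, 0.7, 0.2) come from the text.

```python
def is_ase_one_sample(adjusted_p, maf, alpha=0.05, min_maf=0.7):
    """One-sample rule: BH-adjusted P-value <= alpha and estimated MAF >= min_maf."""
    return adjusted_p <= alpha and maf >= min_maf


def is_ase_two_sample(adjusted_p, maf_difference, alpha=0.05, min_difference=0.2):
    """Two-sample rule: BH-adjusted P-value <= alpha and MAF difference >= min_difference."""
    return adjusted_p <= alpha and maf_difference >= min_difference


if __name__ == "__main__":
    # A gene with strong imbalance but weak statistical support is not called.
    print(is_ase_one_sample(adjusted_p=0.10, maf=0.85))
    # A gene passing both cutoffs is called.
    print(is_ase_one_sample(adjusted_p=0.01, maf=0.75))
```

Requiring both conditions keeps statistically significant but small imbalances, as well as large but poorly supported ones, out of the ASE set.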
We found evidence for extensive ASE in the majority of cancer samples in the panel (Figure 4A,B). One-sample analysis revealed 9 to 26% of all tested genes in 7 tumor samples as showing ASE, considerably higher than the 0.5 to 2% ASE rate observed in 7 matched normal samples. The extent of ASE in lung cancer cell lines was highly variable (1 to 32%) and was correlated with the sample RNA-Seq coverage levels (data not shown). In contrast, no such correlation was observed for tissue samples, which had higher RNA sequencing depth (Table S1 in Additional file 3).
Figure 4. Summary of ASE extent in the studied samples. (A) Percentage of tested genes declared ASE in one-sample analysis of lung cancer cell lines. (B) Percentage of tested genes declared ASE in one-sample analysis of tissue samples (both tumor and normal). Considerably more genes are found to show ASE in tumor than in normal samples. (C) Tumor-specificity of ASE found in one-sample analysis of tissue samples. All genes found to be ASE in one-sample analysis are broken down into genes that are (a) not tested in a two-sample comparison (due to lack of common sufficiently covered SNVs, gray), (b) found to show ASE in two-sample comparison (red), and (c) not found to show ASE in two-sample comparison (black).
Of genes that exhibited ASE in the one-sample analysis of tumors and that also were tested for ASE in the two-sample analysis, 48 to 77% showed tumor-specific ASE (Figure 4C). By comparison, a much smaller fraction of genes showing ASE in one-sample analysis of normal samples were found to show normal-specific ASE (3 to 32%), despite higher RNA-Seq coverage of the normal sample in five out of seven sample pairs (Table S1 in Additional file 3). This indicates that while most of ASE observed in normal samples is retained in the tumor, a large fraction of the ASE observed in the tumors has developed during the tumorigenesis process.
Across our 32 samples, we found that in one-sample MBASED analysis 22 out of 2,080 ASE genes with multiple heterozygous SNVs showed evidence of isoform-specific ASE. We note that the significance test based on heterogeneity statistic Q has lower power in the settings of low read coverage and few SNVs, common in our data, and we likely underestimated the extent of isoform-specific ASE. Since 20 of these genes were found in the liver samples (7 in the normal, 13 in the tumor), there might be more isoform-specific ASE occurring in the liver, although none of these genes exhibited liver-specific expression. Alternatively, since liver samples had the highest RNA-Seq coverage in our data set, it is possible that our detection of isoform-specific ASE in the other samples was hindered by their lower sequencing depth. In the two-sample MBASED analysis, 16 out of 701 ASE genes with multiple heterozygous SNVs showed evidence of isoform-specific ASE, including 12 in the liver samples (11 in the normal, 1 in the tumor). The biological significance of the observed instances of isoform-specific ASE is unclear and is further complicated by the observation that 10 out of 22 genes with one-sample isoform-specific ASE and 5 out of 16 genes with two-sample isoform-specific ASE were represented by only one transcript isoform. This observation may be due to incompleteness of the current set of gene models or to the variance of SNV-level measures of ASE in those genes exceeding what is allowed by our statistical model.
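A heterogeneity statistic of this kind can be illustrated with a generic inverse-variance Cochran's Q computed over per-SNV major-allele fractions. The sketch below is an assumption-laden illustration: the pseudocount, the binomial variance estimate, and the function name are choices made here, and the source does not specify the transformation or variance model MBASED uses for its Q.

```python
def heterogeneity_q(snvs):
    """Cochran-style Q across SNVs of one gene.

    snvs: list of (major_reads, total_reads) tuples.
    Returns (Q, degrees_of_freedom); large Q relative to a chi-squared
    distribution with that many degrees of freedom suggests that the
    SNVs do not share a single major-allele fraction.
    """
    estimates, weights = [], []
    for major, total in snvs:
        # A small pseudocount keeps the variance estimate away from zero.
        p = (major + 0.5) / (total + 1.0)
        variance = p * (1 - p) / total
        estimates.append(p)
        weights.append(1.0 / variance)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    return q, len(snvs) - 1


if __name__ == "__main__":
    print(heterogeneity_q([(30, 40), (30, 40)]))  # consistent SNVs
    print(heterogeneity_q([(36, 40), (20, 40)]))  # discordant SNVs
```

The weights grow with coverage, which shows why the test has little power when reads per SNV are few: low coverage means large variances, small weights, and a small Q even for discordant fractions.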
Overall, the normal samples exhibited limited extent of ASE, using our chosen cutoffs, while the cancer samples showed much higher ASE rates, with isoform-specific ASE playing a limited role, if any.
Allele-specific expression in cancer is linked to large-scale genomic changes
We assessed the copy number (CN) state and allelic imbalance (AI) at the DNA level for all cancer samples (Materials and methods). We found that a large fraction of observed ASE in cancer samples could be attributed to underlying changes in genomic composition. This observation has previously been reported in a single sample of oral cancer. The profiles of these changes appeared to be markedly different between cell lines and tissue samples, with cell lines showing more genomic CN gains and AI (both in CN-gained and CN-neutral regions), but fewer CN losses than tumor tissue samples (Figure 5A). However, these observed differences might be due to different platforms used for CN and AI assessments of tissue samples and cell lines (Materials and methods). Genomic AI and CN changes accounted for >65% of ASE-exhibiting genes in all 18 cell lines (including 17 cell lines with >83%), and >55% of ASE-exhibiting genes in 6 out of 7 tumor tissue samples (Figure 5B), showing cancer ASE to be a phenomenon mainly driven by large-scale DNA alterations. The single exception among the tumor tissues came from NSCLC individual 2, which exhibited ASE in 17% of tested genes (similar to other cancer tissue samples; Figure 4B), but had 87% of these ASE-exhibiting genes fall outside of regions of CN alteration or detected AI. This suggests that alternative mechanisms for upregulation of ASE may exist in cancer (for example, allele-specific silencing through DNA methylation) and may be at play here. However, we cannot rule out the alternative possibility that the CN calling algorithm did not perform well on this sample. In 5 out of 7 tumor tissue samples, 6 to 25% of ASE-exhibiting genes fell into regions of CN loss, indicating that the detected ASE in those genes might be due to normal contamination or tumor heterogeneity, as no heterozygous variants should be detected in such regions in the absence of admixture.
Figure 5. Summary of genomic state of genes showing ASE in cancer samples. (A) Proportion of autosomal genome falling into different categories of underlying copy number (CN) and allelic imbalance (AI) states. Cell lines show more CN gains and fewer CN losses than tissue samples. (B) Proportion of ASE autosomal genes falling into different categories of underlying CN and AI states. CN gain (cell lines) and loss (tissues) regions are enriched for ASE genes. The large extent of ASE genes in regions of CN loss in tissues is most likely explained by heterozygosity detection due to normal tissue admixture. The vast majority of ASE genes in tumor tissue samples from NSCLC individual 2 (second from left) fall into CN-neutral, no-AI regions, but the sample exhibits ASE levels comparable to the other two NSCLC patients (compare Figure 4B).
Recurrent instances of cancer-specific ASE point to regions of recurrent genomic alterations
We identified instances of cancer-specific ASE based on the MBASED two-sample analysis of tumor-normal pairs. A selection of such genes is shown in Figure 6A. Generally, genes with recurrent cancer-specific ASE tended to cluster together when found on the same chromosome. For example, chromosome 12 genes ETNK1, GOLT1B and ITPR2 are located in close proximity to KRAS and we found the KRAS-containing segment of chromosome 12 to be lost in two samples and gained in another sample, while an additional two samples exhibited allelic imbalance of the entire chromosome 12 (Figure S8 in Additional file 1). As the result, all four genes (ETNK1, GOLT1B, KRAS, and ITPR2) were found to show recurrent cancer-specific ASE. In another example, all five chromosome 17 genes showing recurrent cancer-specific ASE were located on a portion of the chromosome with lower CN than the rest of chromosome 17 in all 7 tumor tissue samples (Figure S9 in Additional file 1). This frequently lost genomic segment also contained the known tumor suppressor gene TP53, consistent with the recurrent CN loss. In this instance, it is likely that normal admixture gave rise to detected heterozygous variants in these tumor samples, and that we would not detect any ASE under conditions of high tumor purity. In the case of chromosome 8, a segment was commonly present in a lower CN than the rest of the chromosome, but we were unable to definitively associate it with a known oncogenic driver. Finally, in some cases (for example, chromosomes 14 and 16) most of the chromosomes showed AI in multiple samples, giving rise to recurrent ASE. Genes with recurrent ASE in cancer cell lines also showed enrichment for certain chromosomes (Figure 6B). However, it was difficult to associate these recurrent events with common genomic aberrations, due to a considerably richer and more complicated pattern of CN alterations in cell lines (Figure 5A).
Figure 6. Selected instances of recurrent cancer-specific ASE. For both panels, columns are samples and rows are genes. Gene status in each sample is plotted. Note that the non-ASE category (blue) may include genes that fall just short of our ASE cutoffs, as well as genes where low coverage reduces our power to detect ASE. Genes in the non-tested category (gray) lack informative SNVs for ASE assessment in that sample. (A) Selected genes that show tumor-specific ASE (black dot) in multiple cancer tissue samples. Genes were chosen based on whether ASE was gained in tumor tissue samples relative to normal tissue samples, according to two-sample MBASED analysis. We require that the gain of ASE occurs in at least 3/7 tumor samples. Genes are grouped by chromosome (rightmost column) and ordered top-to-bottom in order of chromosomal location. Note that gene RNF167 shows ASE in both normal and tumor samples in HCC individual 4; however, different haplotypes are overexpressed in the two samples. (B) Selected genes that show ASE in multiple lung cancer cell lines. Genes were chosen based on whether ASE was detected based on one-sample MBASED analysis. We require that the ASE occurs in at least 4/18 cell lines and does not occur in any of the 7 normal samples (to exclude, for example, imprinted genes). Genes are grouped by chromosome (rightmost column) and ordered top-to-bottom in order of chromosomal location.
Based on these and similar observations, we conclude that the instances of recurrent ASE in our cancer samples were often driven by recurrent modifications of the underlying genomic CN state, affecting known driver genes in some cases.
Selective overexpression of mutant alleles in cancer samples
We further investigated the interaction between ASE and mutations in cancer. The ability of MBASED to perform ASE detection without prior knowledge of haplotype phasing allowed us to assess ASE of mutation-containing genes based on information from both SNPs and mutations. We then used MBASED haplotype calls to assign a mutation to the 'major' or 'minor' haplotype. From the 25 cancer samples, we identified 691 non-synonymous somatic (or putative somatic in the case of cell lines; Materials and methods) variants that are potentially 'functional', that is, variants that were classified as 'deleterious' by SIFT or 'damaging' by PolyPhen or were predicted to result in translation stop gain or loss by Variant Effect Predictor (Table S2 in Additional file 3). Of these variants, 291 presented the mutant allele as major, including 41 that fell into ASE-exhibiting genes. Overexpressed functional mutant alleles included a number of known or suspected contributors to oncogenesis (Table 1).
We observed five instances of functional mutations that alter codon 12 of KRAS, a known oncogenesis-driving event. In three out of five cases, the mutant allele was significantly over-represented, while in another instance (NSCLC cell line H2009) the over-representation was borderline significant (MAF = 0.66, BH adjusted P-value = 0.1). This suggests the selective pressure to produce a large number of constitutively activated forms of KRAS. We also observed the over-representation of known activating mutation L858R in the kinase domain of EGFR (as well as a novel mutation in the same domain in the same individual), and a mutation in the FAT domain of mTOR, a major regulator of cell-signaling pathways. The FAT domain is a binding site for the mTOR inhibitor DEPTOR, suggesting that this mutation might also be constitutively activating. The potential instances of overexpressed inactivating mutant alleles include a mutation in the transactivation domain of the tumor suppressor EAF2, and a mutation in the ring-finger motif of gene RAD18, which is involved in post-replication DNA damage repair.
Out of 41 instances of functional mutations with the mutant allele in the major haplotype of ASE-exhibiting genes, five fell outside of regions of genomic copy number change and/or allelic imbalance (Table 1, last 5 rows), including mutations in cancer-related genes MYH9, TIMP1, and FAS. However, it is unclear what the exact consequences of these mutations were for protein functionality, and what advantage to tumorigenesis, if any, was conferred by overexpression of the mutant allele.
The overexpression of mutant alleles that might confer some advantage to tumor cells was not universal. We found a small number of examples of functional mutant alleles that were expected to contribute to tumor phenotype but were not overexpressed. For example, cell line H441 contained a mutation in codon 12 of KRAS, but unlike the other four instances of this mutation (Table 1), this mutant allele was under-represented relative to the wild-type allele. We also found two instances of known activating mutations in residue 61 of NRAS, with no evidence of ASE in one case and strong evidence for overexpression of the wild-type allele in the other.
In summary, we found multiple examples of ASE where the overexpressed allele was either a known or suspected activating mutation in an oncogene or an inactivating mutation in a tumor suppressor gene. In almost all such cases the observed ASE arose from underlying DNA CN alterations. We observed some instances where the mutations expected to contribute to the cancer phenotype were underexpressed. It is possible that in such cases the mutation was crucial to early oncogenic processes, but that at later stages of tumor evolution the dependence of the cells on the mutant form of the protein was reduced.
Tumor samples show loss of ASE in approximately 15% of normal ASE genes and elevated ASE on chromosome X
We observed 89 instances of monoallelic expression (ASE with estimated MAF >0.9) in 74 genes across 7 normal samples in our panel. In 36 of these instances (corresponding to 28 genes) the matched tumor sample contained at least one common informative SNV in that gene, enabling us to test these genes for tumor-normal allelic imbalance using the two-sample MBASED approach. We found that in 5 out of 36 tested cases (13.9%), the observed monoallelic ASE (MAE) was specific to a normal sample and was lost in a tumor (Figure S10A in Additional file 1), but there were no instances of recurrent loss of MAE in tumor tissue samples. One example was gene ABP1 in HCC individual 4 (Figure S10B in Additional file 1). A previously described translocation event adjacent to ABP1 in this sample might be a contributing factor to the observed loss of MAE. We also found one instance of an MAE pattern reversed between normal and tumor, in the BCL2L10 gene in HCC individual 4 (Figure S10C in Additional file 1). BCL2L10 encodes a pro-apoptotic factor and has been implicated in 5-azacytidine resistance in acute myeloid leukemia and myelodysplastic syndrome patients. The two alleles differ by a pair of SNVs in a 3' UTR, but it is not clear if the observed switch of ASE pattern was due to differential functional efficiency of the two alleles or if one of the alleles was more oncogenic.
Extending this analysis to all instances of ASE in normal samples, including non-monoallelic, we found 161 cases of ASE in normal samples that could be tested for tumor-normal allelic imbalance. In 30 (18.6%) cases, the observed ASE was normal-specific, with 16 (53.3%) such instances not attributable to underlying CN alterations, including four out of five loss-of-MAE cases. The extended analysis also did not reveal any genes with recurrent normal-specific ASE.
Loss of imprinting has been previously reported in cancer. We cross-referenced a list of known imprinted genes against the list of genes with loss of ASE in tumor samples, and found no instances of loss of imprinting (Figure S10A in Additional file 1). In general, we found that out of 55 known imprinted genes, only 7 could be tested for ASE in 3 or more normal samples. We found that two of these seven genes (FAM50B and NDN) showed monoallelic expression in all tested instances, while the other five genes (GNAS, IGF2, NAA60, SLC22A18, and SLC22A3) did not show any evidence of ASE. These observed patterns could be due to the previously reported tissue-specificity of imprinting.
In addition to imprinting, another known source of ASE is chromosome X inactivation in female cells. We found that all but one of our nine female samples showed much higher rates of ASE on chromosome X than in the rest of the genome (Figure S11 in Additional file 1; Fisher exact test P-value <0.02 for chromosome X versus autosomal ASE rate comparison for all eight samples). The sole exception was a female cell line, H2009, that suffered a loss of a copy of chromosome X and exhibited no ASE on that chromosome. We found that the rates of ASE in chromosome X genes in the two normal female tissue samples were low (<8%), consistent with the existence of several clonal lines in each sample, with different copies of the chromosome inactivated in different clones. On the other hand, all female cancer samples (after excluding H2009) showed high rates of ASE on chromosome X (54 to 100% of tested genes). In some cases, including both female tumor tissue samples, this elevated rate could be attributed to underlying CN alterations. However, two of the cell lines did not show any CN changes or AI on chromosome X (data not shown), suggesting that a monoclonal expansion took place in these samples, giving rise to a cell mixture in which one copy of chromosome X was preferentially silenced.
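The chromosome X versus autosome comparison is a test on a 2×2 table of ASE and non-ASE gene counts. A minimal standard-library sketch of a one-sided Fisher exact test follows; the source does not say whether a one-sided or two-sided test was used, so the one-sided choice, the function name, and the example counts are assumptions.

```python
from math import comb


def fisher_exact_greater(x_ase, x_not_ase, auto_ase, auto_not_ase):
    """One-sided Fisher exact P-value for enrichment of ASE on chromosome X.

    Returns the hypergeometric probability of seeing at least x_ase ASE genes
    among the chromosome X genes, given the table margins.
    """
    n_x = x_ase + x_not_ase              # tested genes on chromosome X
    n_ase = x_ase + auto_ase             # ASE genes genome-wide
    n_total = n_x + auto_ase + auto_not_ase
    denominator = comb(n_total, n_x)
    upper = min(n_x, n_ase)
    tail = sum(
        comb(n_ase, k) * comb(n_total - n_ase, n_x - k)
        for k in range(x_ase, upper + 1)
    )
    return tail / denominator


if __name__ == "__main__":
    # Hypothetical counts: 12 of 20 chrX genes show ASE versus 50 of 2,000 autosomal genes.
    print(fisher_exact_greater(12, 8, 50, 1950))
```

Exact integer arithmetic via `math.comb` avoids the rounding issues of factorial-based formulas for tables of this size.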
Overall, we observed a moderate extent of loss of normal ASE in tumors, with approximately 15% of normal ASE genes being normal-specific. We did not find any instances of recurrent loss of ASE and we also did not detect any instances of loss of imprinting. The observed loss of ASE did not appear to be driven by the underlying CN alterations, although the exact mechanism and biological significance of this process remain unclear. On the other hand, we observed elevated rates of ASE on chromosome X in cancer samples, occasionally accompanied by underlying genomic allelic imbalance. These latter cases might be due to the previously described high extent of chromosome X inactivation following monoclonal expansion. Our analysis was limited by a small sample size and low sequencing coverage. Larger-scale studies are needed to investigate these issues further.
Ask TaqMan Episode 13 — How TaqMan Works
TaqMan genotyping assays (TaqMan® SNP Genotyping Assays and TaqMan® Drug Metabolism Genotyping Assays) consist of pre-optimized PCR primer pairs and two probes for allelic discrimination:
- A pair of unlabeled primers
- Two TaqMan probes, one with a FAM dye label and one with a VIC dye label on the 5′ end, each with a minor groove binder (MGB) and a nonfluorescent quencher (NFQ) on the 3′ end.
TaqMan genotyping assays are used to amplify and detect specific alleles in genomic DNA (gDNA). The TaqMan SNP Genotyping Assay process runs as follows.
- Genomic DNA is introduced into a reaction mixture consisting of TaqMan® Genotyping Master Mix, forward and reverse primers and two TaqMan® MGB Probes.
- Each TaqMan MGB Probe anneals specifically to a complementary sequence, if present, between the forward and reverse primer sites. When the probe is intact, the proximity of the quencher dye to the reporter dye suppresses the reporter fluorescence.
- The exonuclease activity of AmpliTaq Gold® DNA Polymerase cleaves only probes hybridized to the target. Cleavage separates the reporter dye from the quencher dye, increasing fluorescence by the reporter. The increase in fluorescence occurs only if the amplified target sequence is complementary to the probe. Thus, the fluorescence signal generated by PCR amplification indicates which alleles are in the sample.
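The last step above implies a simple decision rule: which reporter dyes fluoresce tells you which alleles are present. The sketch below encodes that rule for a single well; the threshold value, the signal scale, and the function name are hypothetical, and real instruments typically cluster many samples rather than thresholding one well at a time.

```python
def call_genotype(vic_signal, fam_signal, threshold=1.0):
    """Call a genotype from the two reporter signals of one well.

    vic_signal, fam_signal: end-point fluorescence on an arbitrary scale.
    threshold: hypothetical cutoff above which a reporter counts as 'on'.
    """
    vic_on = vic_signal >= threshold
    fam_on = fam_signal >= threshold
    if vic_on and fam_on:
        return "heterozygous"
    if vic_on:
        return "homozygous (VIC-probe allele)"
    if fam_on:
        return "homozygous (FAM-probe allele)"
    # Neither probe was cleaved: failed amplification or no template.
    return "undetermined"


if __name__ == "__main__":
    for vic, fam in [(2.4, 0.1), (0.2, 2.1), (1.8, 1.9), (0.1, 0.1)]:
        print(vic, fam, call_genotype(vic, fam))
```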
TaqMan® Copy Number Assays are run together with a TaqMan® Copy Number Reference Assay in a duplex real-time Polymerase Chain Reaction (PCR). The Copy Number Assay detects the target gene or genomic sequence of interest and the Reference Assay detects a sequence that is known to be present in two copies in a diploid genome.
a. A TaqMan® Copy Number Assay, a TaqMan® Copy Number Reference Assay, TaqMan® Genotyping Master Mix, and a gDNA sample are mixed together in a single well or tube.
b. The gDNA template is denatured and each set of assay primers anneals to its specific target sequences. Each TaqMan® probe anneals specifically to its complementary sequence between forward and reverse primer binding sites. When each oligonucleotide probe is intact, the proximity of the quencher dye to the reporter dye causes the reporter dye signal to be quenched.
c. During each round of PCR, the target and reference sequences are simultaneously amplified by AmpliTaq Gold® DNA Polymerase. This enzyme has a 5′ nuclease activity that cleaves probes that are hybridized to each amplicon sequence. When an oligonucleotide probe is cleaved by this 5′ nuclease activity, the quencher is separated from the reporter dye, increasing the fluorescence of the reporter. Accumulation of PCR products can be detected in real time by monitoring the increase in fluorescence of each reporter dye at each PCR cycle.
This method of relative quantitation is used to determine the relative copy number of the target of interest in a gDNA sample, normalized to the known copy number of the reference sequence.
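Relative quantitation of this kind is commonly computed with the comparative Ct (ΔΔCt) approach. The sketch below assumes that approach, roughly 100% amplification efficiency for both assays, and a calibrator sample known to carry a given number of target copies; none of these details are stated in the text above, and the function and parameter names are illustrative.

```python
def estimate_copy_number(ct_target, ct_reference,
                         calibrator_ct_target, calibrator_ct_reference,
                         calibrator_copies=2):
    """Estimate target copy number by the comparative Ct method.

    Delta Ct is target Ct minus reference Ct within one sample; the
    calibrator is a sample assumed to carry calibrator_copies of the target.
    """
    delta_ct_sample = ct_target - ct_reference
    delta_ct_calibrator = calibrator_ct_target - calibrator_ct_reference
    delta_delta_ct = delta_ct_sample - delta_ct_calibrator
    # Each PCR cycle doubles the product, so one cycle later means half the template.
    return calibrator_copies * 2 ** (-delta_delta_ct)


if __name__ == "__main__":
    # The target crosses threshold one cycle later than in the calibrator.
    print(estimate_copy_number(26.0, 25.0, 25.0, 25.0))
```

Normalizing to the reference assay within each well cancels differences in input DNA amount, which is why the two assays are run in duplex.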
In the current study we present four novel findings. First, we show that within 22 established mouse imprinted clusters, ZFP57 specifically binds known or predicted ICR elements in each of them, but does not interact with secondary somatically methylated DMRs or germline DMRs that are not imprinting controlling. Second, we show that ZFP57 binding can be efficiently used to predict novel imprinted genes, identifying Fkbp6 as a gene whose expression is imprinted and paternal allele-specific in placenta. Third, we show that many ZFP57 targets lie outside imprinted clusters, have biallelic binding, are hypermethylated and are enriched in repressive H3K9me3 histone marks, consistent with KAP1 co-repressor recruitment; members of this class include some ERVs where the DNA binding factor targeting the repressive state had not previously been identified. Finally, we show that genetic and epigenetic variation can specify strain-specific monoallelic ZFP57 binding, which is often associated with biased allelic expression of adjacent genes.
2. Building a global chromatin accessibility hypothesis
2.1. The Accessibility Problem: The Difficulty of Reading Genomic DNA
Ultimately, the distinction of one bit of chromatin from another depends on the local DNA sequence. Wrapping the DNA into nucleosomes makes the DNA sequence difficult to access, and if it cannot be accessed, one nucleosome cannot be distinguished from another. Histone modifications help to restore a functional differentiation. The presence of nucleosomes and nature of the histone modifications they carry produce chromatin states that determine whether, when, or how the sequence is read by DNA-binding proteins. Unless DNA-binding proteins have access to the DNA sequence, the apparatus that binds to chromatin and acts on it would not know where to go and, most importantly, the genes and other sequence information could not be read.
In the second part of this essay, I want to consider the question of the accessibility of genomic DNA, apart from cooperative binding or the remodeling that follows the binding of pioneer factors (elaborated in Sec. 2.3). This is a problem that underlies many of the questions concerning histone modifications, including their multiplicities and dynamics. I propose an argument that begins to integrate the mass of information about histone modifications from the vantage point of DNA accessibility.
Textbook accounts explain the wrapping of the DNA into nucleosomes as a means of packaging eukaryotic genomes into the nucleus. The wrapping of 147 bp of DNA into a nucleosome clearly reduces the volume occupied, although more by reducing the degrees of freedom than by actually decreasing the space occupied by the double helix. DNA can be compacted much more than it generally is in a eukaryotic nucleus, as shown by the packing of a bacteriophage genome into a capsid or of the human genome in a sperm nucleus. The difference is that the bacteriophage DNA is not available except by releasing the entire content of the capsid, and neither is much, or all, of the sperm genome. The eukaryotic genome in a somatic cell, instead, is available or specifically addressable in part (i.e., euchromatin), and the heterochromatic complement could be said to be accessible at least part of the time. What makes the difference, of course, is that the eukaryotic genome is partitioned into myriads of small packets (i.e., nucleosomes). These can be further packaged hierarchically into higher-order structures that, potentially, can be folded or unfolded locally. In other words, an arbitrary fraction of the genome down to single nucleosomes can, in principle, be opened up individually, allowing access to its DNA content. To do this, however, a nucleosome has to be differentiated from its neighbors or its DNA content has to be searchable at least intermittently.
2.2. Chromatin as a Response to DNA Concentration in the Nucleus
Although space is limiting in the nucleus and compaction of at least part of the genomic DNA is an imperative, a far more important tension exists: a tension between the need for controlled genomic accessibility to transcription and the need to reduce the accessibility to most of the genomic DNA. Packaging of a large genome in the nucleus is a necessity, but perhaps far more important is the imperative to hide most of the DNA so that it is not easily accessible to the protein machinery that needs to act on it. All mechanisms that distinguish one genomic site from another, except those based on bulk physical or topological properties, must be able to recognize specific nucleotide sequence motifs, short enough to be readable and bindable by a single protein with sufficient discrimination to minimize the noise resulting from inappropriate binding. The ratio between signal and noise (i.e., specific versus nonspecific binding) is critical for carrying out any kind of gene regulation. This discrimination is lost if the concentration of nonspecific DNA is so great that any DNA-binding protein would spend its time bound inappropriately to the wrong sequences. The key to specificity in the nucleus is therefore intrinsically tied to the need to mask most of the genomic DNA so as to make it unavailable for binding to the regulatory and transcriptional machinery. However, the differentiation of chromatin and specific activities of the genetic information depend ultimately on the local DNA sequence. Packaging into nucleosomes not only prevents access to the DNA sequence, but, by itself, it would reduce the genome to an undifferentiated collection of more or less structurally identical nucleosomes.
Viewed in this light, it is clear then that a large part of eukaryotic gene regulation must consist of ways to (1) remove or remodel nucleosomes so as to make the underlying DNA available for DNA-binding proteins; (2) do this in a sequence-specific way or at least produce sequence-specific results; and (3) develop a way to mark nucleosomes or nucleosomal domains so as to restore some specificity for the action of regulatory proteins (e.g., acetylating histone tails to make certain nucleosomes easier to displace or remodel).
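The signal-to-noise argument of this section can be made concrete with a toy equilibrium calculation. This is a minimal sketch, not a model taken from the essay: it assumes simple one-to-one binding, nonspecific sites in large excess over protein, and entirely illustrative concentrations and dissociation constants. The function and parameter names are hypothetical.

```python
def specific_site_occupancy(protein_total, kd_specific,
                            nonspecific_sites, kd_nonspecific):
    """Fractional occupancy of one specific site by a DNA-binding protein.

    Toy assumptions (not from the text): simple equilibrium binding,
    nonspecific sites far in excess of protein, and all quantities in
    the same arbitrary concentration units.
    """
    # Accessible nonspecific DNA titrates away the free protein
    free_protein = protein_total / (1.0 + nonspecific_sites / kd_nonspecific)
    # Standard single-site binding isotherm for the specific site
    return free_protein / (free_protein + kd_specific)


# Illustrative numbers only: masking 90% of nonspecific DNA (for example,
# by wrapping it into nucleosomes) raises occupancy of the specific site.
unmasked = specific_site_occupancy(1.0, 0.001, 1e6, 1e3)
masked = specific_site_occupancy(1.0, 0.001, 1e5, 1e3)
print(f"all DNA accessible: {unmasked:.2f}")
print(f"90% of DNA masked:  {masked:.2f}")
```

The sketch only shows the direction of the effect: with the protein level held fixed, reducing the pool of accessible nonspecific DNA leaves more free protein for the specific site.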
2.3. Nucleosome Density
It is well known that the density of nucleosomes is important for retaining regulatory specificity. If insufficient histones are produced, nucleosomal density is reduced. In Drosophila, this causes loss of heterochromatic silencing and suppression of position-effect variegation (Moore et al. 1983). In yeast and mammalian cells, histone insufficiency causes derepression of many conditionally expressed genes (Han and Grunstein 1988; Lenfant et al. 1996; Wyrick et al. 1999; Celona et al. 2011; Gossett and Lieb 2012). Interestingly, a reduced nucleosome density changes the occupancy (the frequency with which a position is occupied) rather than the distribution of nucleosomes. This is because certain DNA sequences favor the formation of nucleosomes by more easily wrapping around the histone core than others. In addition, the more DNA becomes accessible, the more DNA-binding proteins can bind to their preferred sequences and compete with nucleosome formation.
As shown long ago, initially by Drosophila histone gene deletion experiments, nucleosome density is a necessary precondition for heterochromatic silencing. If the nucleosome density is too low, the DNA becomes too accessible to DNA-binding proteins, in particular, RNA polymerase, and increasing evidence shows that indiscriminate access results in indiscriminate transcription. Transcriptional activity is associated with many other nucleosome-modifying activities, particularly histone acetylation, which prevents the establishment of the heterochromatic state and promotes further accessibility. A similar effect is obtained in the presence of a normal histone gene complement if the concentration of a gene activator is increased (Ahmad and Henikoff 2001). These effects show that without nucleosomes the ability to repress transcription is lost and the ability of nucleosomes to prevent access is in competition with the concentration-dependent binding of the transcriptional machinery. What is lost is not only the ability to repress, but also the control of transcriptional activation in that the requirement for activators to produce transcription is at least partly absolved if RNA polymerase no longer needs the help of various remodeling activities to access the DNA sequence. Furthermore, although the control of access is a major component of the repressive activities of heterochromatin, in normal cells there are windows of opportunity to access even the heterochromatic DNA sequence, and a sufficiently high concentration of a DNA-binding activator can exploit these to bind to its sequence and produce local derepression. In many cases, then, a major limiting factor in the control of transcription is access of RNA polymerase to DNA. 
Many genes, particularly in higher eukaryotes, have developed ways to ensure that RNA polymerase is preloaded, often transcriptionally initiated, but arrested (paused polymerase) and ready to respond to transcriptional signals that allow it to elongate. In most cases, this requires the access of DNA-binding proteins that configure the nucleosomes around the promoter site. Here too, however, the pausing is dependent on the need of additional factors to overcome nucleosomal obstacles to elongation.
When the DNA is fully occupied by nucleosomes, or at least when histone levels do not limit the nucleosome density, most of the DNA sequence is not directly accessible to DNA-binding proteins. It has been shown that some transcription factors are better able than others to bind to nucleosomal DNA sequences, at least when the binding sites are close to one edge of the nucleosome (Zaret and Carroll 2011). Such “pioneer” factors can gain a toehold by binding to the DNA entering the nucleosome even in a compacted chromatin structure, evict linker histone H1, and invoke nucleosome remodeling machines to unravel the DNA and expose it for the binding of other enhancer binding factors in a multistage process (Li et al. 2010).
2.4. Roaming Activities
Special features may allow certain sequence-specific binding proteins to find their binding sites, possibly by profiting opportunistically from transient opening of the chromatin structure. In general, however, access to DNA requires the help of remodeling machines. Thus, to allow access to nucleosomal DNA without prior sequence information, we would need to hypothesize roaming activities that survey the genomic chromatin and periodically turn it over, so to speak, temporarily opening access to the underlying sequence. At the same time, to prevent such access and ensure regulated opening, we might expect an opposing activity. Are there any known chromatin marks or chromatin activities that might support this hypothesis?
Two features have been identified that are characteristic of sites in which DNA has to be maintained in an accessible state: One is the binding of the CBP histone acetylase or its close relative p300, whether or not accompanied by steady-state enrichment in histone H3K27 acetylation. The other is the chromatin mark H3K4me1, whose role in accessibility is not well understood. These features are characteristically found at enhancer sites, in which CBP is thought to be recruited by most enhancer-binding factors (Heintzman et al. 2007; Xi et al. 2007; Visel et al. 2009). They are also found at promoters and wherever DNA-binding proteins find access to the genomic DNA. These sites have also been found to be hot spots of active nucleosome turnover, detectable by the deposition of nucleosomes containing the histone variant H3.3, often together with the histone variant H2A.Z. This combination of variants is less stable than normal, and easier to remodel or turn over (Jin et al. 2009). The fact that acetylation is not always detected at enhancer sites, despite the presence of CBP, suggests that the acetylated nucleosomes are those that have been displaced to create the nucleosome-free region, which is occupied by DNA-binding proteins.
CBP is often associated with a histone-remodeling activity (e.g., Drosophila Brahma, ortholog of human SNF2L2) and UTX, one of two known histone H3K27 demethylases, but the only one found in Drosophila (Tie et al. 2012). UTX is an essential component of the Trithorax-related (TRR) H3K4 methyltransferase (or MLL3 and MLL4 in mammals), which is the source of H3K4me1 and sometimes H3K4me2, found at enhancers, promoters, and other protein-bound DNA-binding sites (Herz et al. 2012). We might wonder what a H3K27 demethylase might be doing at these sites, but there is a strong connection: CBP is responsible for H3K27 acetylation, and this activity is blocked by the simple presence of preexisting H3K27 methylation, without need to recruit repressive complexes of any kind.
2.5. Ubiquitous H3K27 Methylation
H3K27 methylation is, in fact, ubiquitous in the genome. It is produced by the Polycomb repressive complex 2 (PRC2), whose methyltransferase subunit in Drosophila is E(z) (ortholog of mammalian Ezh1 and Ezh2). PRC2 is responsible for mono-, di-, and trimethylated H3K27. The trimethylated state is the one that has received the most attention because it is the one associated with Polycomb-repressed genes. However, its most abundant product is not H3K27me3, which in somatic cells constitutes some 5% of total histone H3, but H3K27me2, which is found in a staggering 50% of all H3 (Peters et al. 2003; Ebert et al. 2004; Jung et al. 2010; Voigt et al. 2012). Dimethylation is therefore the major activity of PRC2 in flies as in man. Kinetic studies (McCabe et al. 2012) show, in fact, that although PRC2 monomethylates and dimethylates rapidly, trimethylation is enzymatically more difficult and probably occurs in vivo primarily where PRC2 is stably bound. H3K27me2 is also the most abundant and broadly distributed type of histone modification, found everywhere except in regions that are enriched in H3K27me3 or undergoing transcriptional activity. The reason for the former is obvious. The reason for the latter (i.e., its depletion in transcribed regions) is most likely twofold: (1) nucleosomes are less densely distributed in such regions because of increased instability and turnover, and (2) transcriptionally active regions are targeted by the UTX H3K27 demethylase, producing H3K27me1 and H3K27me0. Very similar results have been recently reported for mouse embryonic stem cells (Ferrari et al. 2014). In these cells, H3K27me3, me2, and me1 constitute, respectively, 7%, 70%, and 4% of total H3, whereas H3K27ac is 2% and unmodified H3K27 is 16%. H3K27me2 is confined to transcriptionally inactive regions, except those that have stably bound PRC2, whereas H3K27me1 is found only in transcriptionally active regions.
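The mouse embryonic stem cell percentages quoted above can be tallied in a few lines to show how strongly dimethylation dominates. The sketch uses only the figures given in the text; the variable names are hypothetical, and reading the missing 1% as rounding is an assumption rather than something the text states.

```python
# Percent of total histone H3 in each H3K27 state, as quoted in the text
# for mouse embryonic stem cells (Ferrari et al. 2014).
h3k27_percent = {"me3": 7, "me2": 70, "me1": 4, "ac": 2, "unmodified": 16}

# Sum over all listed states; any shortfall from 100 is presumably
# rounding (an assumption, not stated in the text).
total_percent = sum(h3k27_percent.values())

# Share of H3 carrying any H3K27 methylation, and the me2 share of that.
methylated_percent = sum(h3k27_percent[s] for s in ("me1", "me2", "me3"))
me2_share_of_methylated = h3k27_percent["me2"] / methylated_percent

print(f"listed states sum to {total_percent}% of H3")
print(f"methylated H3K27: {methylated_percent}% of H3")
print(f"me2 share of methylated H3K27: {me2_share_of_methylated:.2f}")
```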
Most likely then, H3K27me2 is actively removed by demethylation, with H3K27me1 as an intermediate to complete demethylation. Not surprisingly, transcriptionally active regions are also enriched in UTX, the only known H3K27 demethylase in Drosophila. Removal of H3K27 methylation in active regions is required for H3K27 acetylation, which is generally found in the 5′ region of active transcription units and enhancers. But this leaves the twin questions: Why is there H3K27 methylation in the first place? And why is H3K27 acetylation specifically needed?
Unlike H3K27me3, which is found mainly at genomic sites that can stably recruit the PRC2 complex, the activity that produces H3K27me2 must target the whole genome. Although it might be associated with the replication fork, it is most likely explained by the transient interaction of free PRC2 with nucleosomes by a hit-and-run mechanism. It is not a completely random mechanism, however. The methylation activity of PRC2 is modulated by several inputs from the surrounding chromatin. One of these is dependent on a hydrophobic pocket in the PRC2 subunit extra sex combs (ESC)/Eed, which binds methylated H3K27 (Margueron et al. 2009). When this binding occurs, it effects a conformational shift in the catalytic subunit E(z) that greatly stimulates its methylation activity. Although H3K27me3 binds more strongly, H3K27me2 also binds to the aromatic pocket. Therefore, the presence of H3K27me2 or H3K27me3 in surrounding nucleosomes promotes the methylation of newly deposited nucleosomes. Mutations in the ESC hydrophobic pocket drastically reduce the global level of both H3K27me3 and H3K27me2. Other mechanisms that modulate PRC2 activity probably contribute, although they have not been tested in vivo. Thus, the nucleosome density surrounding a target nucleosome appears also to stimulate methylation activity (Yuan et al. 2012), whereas the presence of H3K4me3 or H3K36me2/me3 on the target nucleosome reduces the methylation activity of PRC2 (Schmitges et al. 2011; Yuan et al. 2011). As a consequence, regions that already contain H3K27 methylation are better targets for PRC2, whereas regions that have a lower nucleosome density or nucleosomes that bear H3K4me3 or H3K36me2/me3, all marks of transcriptional activity, are poor targets.
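The qualitative rules described above (stimulation by methylated neighbors, inhibition by marks of transcriptional activity) can be illustrated with a toy mean-field simulation. This is a minimal sketch under invented assumptions, not a published model: the rates, the boost and inhibition factors, and the function name are hypothetical, demethylation and nucleosome turnover are ignored, and the array must contain at least two nucleosomes.

```python
def simulate_h3k27_methylation(initial, active_mark, steps,
                               base_rate=0.02, neighbor_boost=4.0,
                               active_inhibition=0.1):
    """Toy mean-field model of hit-and-run H3K27 methylation by PRC2.

    initial     : expected methylation level (0..1) of each nucleosome
    active_mark : True where a nucleosome carries H3K4me3 or H3K36me2/me3
    All rate parameters are illustrative assumptions, not measured values.
    """
    levels = list(initial)
    n = len(levels)
    for _ in range(steps):
        updated = []
        for i in range(n):
            # Methylated neighbors stimulate PRC2 (ESC/Eed pocket feedback)
            neighbors = [levels[j] for j in (i - 1, i + 1) if 0 <= j < n]
            neighbor_mean = sum(neighbors) / len(neighbors)
            rate = base_rate * (1.0 + neighbor_boost * neighbor_mean)
            # Marks of transcriptional activity make a nucleosome a poor target
            if active_mark[i]:
                rate *= active_inhibition
            rate = min(rate, 1.0)
            # The unmethylated fraction is converted at the per-step rate
            updated.append(levels[i] + (1.0 - levels[i]) * rate)
        levels = updated
    return levels


# Ten nucleosomes: three already methylated, two unmarked and unmethylated,
# five unmethylated and carrying marks of recent transcription.
initial = [1.0, 1.0, 1.0] + [0.0] * 7
active = [False] * 5 + [True] * 5
levels = simulate_h3k27_methylation(initial, active, steps=20)
print([round(x, 2) for x in levels])
```

In this sketch the unmarked nucleosomes next to the methylated block regain methylation much faster than the recently transcribed ones, which is the sense in which the absence of the mark can persist as a record of prior activity.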
The discovery of these several devices in PRC2 and other methyltransferases has made it clear that feedback and feed-forward mechanisms can be incorporated in chromatin-modifying machines to both self-renew a chromatin mark and avoid regions marked with certain other histone modifications. It is worth pointing out here that these mechanisms not only help to maintain Polycomb repression from one cell cycle to the next through the maintenance of H3K27me3, but, by modulating the deposition of H3K27me2, they also provide a memory of transcriptional activity. Regions that have been recently transcribed have lower nucleosome density and are enriched in the H3K4me3 and H3K36me2/me3 marks that favor renewed transcriptional activity and, at the same time, inhibit H3K27 methylation. In other words, in the case of a globally distributed histone mark such as H3K27me2, the absence of the mark is itself a mark that carries information of previous transcriptional activity (Fig. 1).
Figure 1. Transcriptional memory of the chromatin state. The schematic drawing illustrates some key changes in the chromatin marks associated with a chromatin region that has recently been transcribed or becomes stably repressed by Polycomb mechanisms. A region that has not been recently transcribed is marked by heavy H3K27me2. A region recently transcribed has lost H3K27me2, but instead gained H3K27ac and H3K4me3 marks in the promoter-proximal part and H3K36me3 (which, in turn, recruits deacetylating complexes) to control the excessive access allowed by the loss of H3K27me2. Regions that can recruit stable binding of Polycomb complexes PRC1 and PRC2 acquire H3K27me3. For simplicity, other histone marks are not shown.
Figure 2. A model for the control of DNA accessibility in chromatin. (A) The model proposes that antagonistic roaming activities transiently interact with genomic chromatin: one, caused by PRC2, deposits the H3K27me2 mark. Another removes this methylation mark and remodels nucleosomes, allowing transient access to the DNA sequence. These activities are attributed to UTX, CBP, and BRAHMA. (B) A DNA-binding factor A binds to its cognate binding motif in the DNA, transiently made accessible, and recruits stable binding of CBP together with a remodeling activity (BRAHMA) and the TRR/MLL3,4 complex containing UTX. These activities remove H3K27 methylation, depositing instead the H3K27ac and H3K4me1 marks. (C) The remodeling activity provides stable access to the DNA, leading to the binding of additional factors B and C to an enhancer region (or other regulatory element on the DNA). The region of DNA made accessible can also be opportunistically targeted by RNA polymerase, which may produce short transcripts from both DNA strands.
PRC2, therefore, provides a global mechanism to mark chromatin according to its recent usage. But what does the H3K27me2 mark do and how is it interpreted? We have become accustomed to thinking of histone modifications as marks that are “read” by chromatin proteins that possess appropriate binding domains. This is possible, but unlikely for H3K27me2. The “reader” approach is suitable for marks that distinguish a region from the rest of the chromatin. A global mark such as H3K27me2 would bind the “reader” virtually everywhere. A more economical interpretation is that, rather than being “read,” the presence of the mark provides both the reading and response at the same time: H3K27 methylation preempts the lysine so that it cannot be acetylated. H3K27ac is a mark associated with the 5' region of active genes. As we have seen, it is also potentially present at all sites containing CBP, that is, sites such as enhancers that involve the access of DNA-binding proteins to DNA. In principle, monomethylation of H3K27 would serve the same purpose. H3K27me1 has been often considered to be associated, not with repression, but with transcriptional activity. H3K27me1 is found in transcriptionally active genes most likely because these are sites in which H3K27me2 is demethylated by UTX. The monomethylated state is most likely a stage in demethylation or remethylation.
2.6. The Accessibility Hypothesis
A hypothesis that would integrate these various findings is that access to the DNA content of nucleosomes is a major limiting factor in controlling all sequence-specific activity in the genome. Control of the access to DNA is therefore a key regulatory principle. According to the hypothesis (Fig. 2), access is provided intermittently by a roaming nucleosome remodeling activity that “turns over” nucleosomes, temporarily making their DNA accessible. For some reason that is not clear at this stage, this activity involves H3K27 acetylation, perhaps transiently, and is blocked by preventing H3K27 acetylation. In general, this acetylation is constantly removed by roaming histone deacetylases, as has been shown in yeast (Vogelauer et al. 2000). The remodeling that provides accessibility is counteracted by a roaming PRC2 activity that dimethylates H3K27 genome-wide; this PRC2-mediated H3K27 dimethylation thereby preempts this position and blocks its acetylation. The H3K27 dimethylation can be removed by the UTX demethylase, whose main activity is therefore to remove the block to acetylation and allow more stable access to the DNA. This is needed at sites such as enhancers, promoters, PREs, and others in which multiple DNA-binding proteins need to see the nucleotide sequence. The binding of these factors recruits stable CBP and is associated with remodeling activity whose longer-term presence displaces nucleosomes, producing a nucleosome-depleted region. Such regions are typically hypersensitive to DNaseI treatment and found associated with enhancers, promoters, and PREs, sites whose edges are also enriched for histone H3.3 relative to their surroundings. This, as analyzed by Mito et al. (2007), is caused by nucleosome replacement, which means that nucleosomes found there are not formed as part of the replicative process, but are due to continuous turnover.
2.7. Transcription of Nucleosome-Free Regions and the RNAi Response
The accessibility hypothesis requires that sites of nucleosomal depletion or remodeling, or any region that is not densely populated by nucleosomes, will have a high probability of binding RNA polymerase on an opportunistic basis and producing some transcriptional products. The amount and length of such transcripts is likely to be very variable and dependent on the sequence, the vicinity to some enhancer-like activity, and probably many other factors. There should be little strand specificity for these transcriptional starts, which, therefore, are likely to result in production of RNA from both strands and thus would be targeted by RNAi mechanisms.
Promoters, which have a short nucleosome-depleted region, are well known to produce short transcripts from both strands within a region of a few hundred nucleotides surrounding the transcription start called TSSa-RNAs (Seila et al. 2008; Affymetrix/Cold Spring Harbor Laboratory ENCODE Transcriptome Project 2009; Taft et al. 2009). Promoter regions have evolved ways to minimize the production of RNA from both strands by selecting a high frequency of polyadenylation signals such that productive elongation occurs predominantly in the direction downstream from the gene TSS (Almada et al. 2013; Ntini et al. 2013). Enhancers are also the source of transcripts, the so-called enhancer RNAs or eRNAs, from both strands (De Santa et al. 2010; Kim et al. 2010; Ørom and Shiekhattar 2013). DNA damage sites, where nucleosomes are removed for a considerable length surrounding a double-strand break, also produce transcripts from both strands. These RNAs are now known to be processed by Dicer and Drosha and required for the binding of ATM, a kinase that phosphorylates the histone variant H2AX to initiate the formation of DNA damage repair foci (Francia et al. 2012). RNAs produced from all such nucleosome-depleted regions are by-products of the nucleosome remodeling processes occurring at these sites. They need not have particular function, but it should not be surprising to find that they have acquired a function at certain sites.
Regions that are partially depleted of nucleosomes or become too easily accessible to RNA polymerase are prone to initiate transcription, which is not strand-specific. In the general case, therefore, most such accessible regions would produce RNA transcripts from both strands. One possible consequence of bidirectional transcription is the recruitment of the RNAi machinery. The bidirectional transcripts produced from DNA damage sites clearly recruit components of the RNAi machinery (Francia et al. 2012). RNAi proteins such as Dicer2 and AGO2 are associated with active promoters that produce small bidirectional RNAs (Cernilogar et al. 2011). It has been claimed that the RNAi protein, AGO2, associates with a variety of sites that are expected to be depleted of nucleosomes, including CTCF-binding sites, promoters, and PREs (Moshkovich et al. 2011). It is not clear what the function of AGO2 might be in these cases, but its loss leads to a decrease in insulator activity or Polycomb repression (Grimaud et al. 2006; Lei and Corces 2006).
RNAi mechanisms are often thought to be protectors of genome integrity against attacks by viruses or proliferating transposons. In the nucleus, they result in the recruitment of histone H3K9 methylation, the binding of heterochromatin proteins such as HP1 and histone deacetylases, and the stabilization of nucleosomes, in essence, the opposite of the process that opened up the chromatin and produced the bidirectional transcripts at enhancers, promoters, etc. The connection between DNA accessibility and the RNAi response is, I suggest, not accidental. Regions that are partially depleted of nucleosomes or become too easily accessible to RNA polymerase are prone to initiate transcription, which is not strand-specific. In the general case, therefore, most such accessible regions would produce RNA transcripts from both strands. If these RNAs recruit the RNAi response, this response is endemic and inseparable from the fundamental necessity of gaining access to the genomic DNA. It could be argued, therefore, that the RNAi response might be, in its basic form, a way to recruit proteins that stabilize nucleosomes (HP1, linker histone, histone deacetylases), restore nucleosomal occupation, or restore nucleosomal stability to sites that, for whatever reason, might have transiently become open. The fact that the RNAi response has become a valuable protection against invading genetic elements would not be incompatible with the even more basic function of keeping the genomic DNA covered and ensuring that transiently opened regions do not get out of hand.
2.8. PRC2 and Heterochromatin
RNAi mechanisms are thought to be important for the establishment of heterochromatin. This has been worked out in detail for the fission yeast Schizosaccharomyces pombe, but many aspects of this relationship apply to Drosophila and mammalian heterochromatin formation. The arguments presented above help to understand why E(z) has been found to play a role in the efficient establishment of heterochromatin and, in fact, is known as a suppressor of position-effect variegation in Drosophila (Laible et al. 1997). This role has been a puzzle for many years because there is no specific presence of E(z) or H3K27me3 in heterochromatin. This role is better understood in terms of accessibility. In Drosophila, the early embryonic stages are a time of extremely rapid and synchronous nuclear divisions. These slow down by the 14th cycle (3 h postfertilization), but the chromatin produced must now be the target of a massive H3K27 methylation effort. This is accomplished thanks to correspondingly massive amounts of PRC2 components that are deposited in the egg during oogenesis. By the time nuclear proliferation slows down and heterochromatin first becomes detectable, global dimethylation of H3K27 must be in place. At this stage, it is important to suppress H3K27 acetylation, remodeling, and adventitious transcriptional activity to allow the RNAi and other mechanisms to initiate and maintain heterochromatin. Access to DNA is never completely prevented even in heterochromatin, as shown by the fact that strong activators can prevent heterochromatic silencing of a reporter gene (Ahmad and Henikoff 2001), but the absence of H3K27me2 would certainly result in a level of access to activators and RNA polymerase that would interfere with the establishment of heterochromatic silencing.
2.9. Effects of Loss of PRC2
If H3K27 methylation plays such a global genomic role, loss of PRC2 function would surely have major consequences, increased pervasive transcription among them. Unfortunately, it has not been possible yet to separate the global H3K27 dimethylation function from the Polycomb-related and more specific H3K27 trimethylation. Loss of PRC2 is an early embryonic lethal both in mammals and Drosophila, and it produces embryos with classical homeotic derepression phenotypes (Struhl and Brower 1982). Loss of Polycomb repression of Hox genes and many other developmentally important genes would certainly be sufficient to account for lethality. In addition, it would make it difficult to determine whether any other effects should be attributed to indirect consequences of derepression or to loss of H3K27 dimethylation. Nevertheless, loss of PRC2 activity is not cell lethal. Mammalian embryonic stem cells with knockouts of Ezh2 or Eed are viable, although unable to differentiate. In mouse embryonic stem cells lacking PRC2 function, H3K27 acetylation appears at new sites together with H3K4me1, forming a signature typical of poised enhancer regions (Ferrari et al. 2014). This suggests that many normally silent regions become accessible and transcriptionally active. Activation of new transcription sites not normally associated with H3K27me3 was also observed. An increase in the accessibility of chromatin to RNA polymerase was also observed in Drosophila embryos lacking maternal and zygotic ESC, an essential component of PRC2 (Chopra et al. 2011). Promoters of thousands of genes became occupied by RNA polymerase II, whether or not they were transcriptionally activated.
Mutations affecting EZH2 and PRC2 activity are associated with a variety of aggressive cancers but, strangely, both hyperactivity and loss of activity appear to be oncogenic. The interpretation has generally been that these effects are mediated by the hyperrepression or derepression of Polycomb target genes, and this is undoubtedly true, at least in part. For example, genes that block cell cycle progression, such as INK4A/B, are regulated by Polycomb mechanisms and hyperrepression would remove brakes to cell proliferation. Because much evidence supports a cancer-promoting role of PRC2 activity, the discovery that loss of PRC2 function can also promote cancers, such as myeloid leukemia, has been puzzling (Hock 2012; Simon et al. 2012; Tamagawa et al. 2013).
A particularly interesting case is that of the recently characterized mutations converting K27 to methionine in histone H3 or H3.3 genes. This mutation was found associated with a particularly malignant glioblastoma, in which it has a dominant effect totally out of proportion with the relatively small fraction of total histone H3 that is produced by the mutated histone H3 gene copy (Chan et al. 2013; Lewis et al. 2013). The methionine at position 27 mimics, in part, the K27 methylation, but lacks the moderating positive charge that the amine nitrogen retains even when trimethylated. As a consequence, the EZH2 catalytic domain binds, but does not readily release, the H3K27M peptide. Although this has not been directly shown, one consequence might be that the PRC2 complex becomes effectively sequestered and unavailable, causing H3K27 to become undermethylated genome-wide. Attention has been focused on the partial loss of H3K27me3 and consequent derepression of Polycomb target genes. I suggest a somewhat different interpretation. Genome-wide H3K27 dimethylation would be even more powerfully affected because it is strongly dependent on a hit-and-run mechanism and, therefore, on the pool of free PRC2. The loss of H3K27 methylation would be expected to derepress a large number of genes whose silencing depends on the inability of activators and RNA polymerase II to access the promoter. More important, perhaps, is that transcription may start anywhere, including within gene bodies, producing partial proteins that would have unexpected neomorphic effects.
These observations support the global accessibility hypothesis, at least in part, but are a long way from providing substantial proof. They suggest, nevertheless, that the abundance and ubiquitousness of H3K27 dimethylation are not without a significance that could help to understand the way in which different chromatin modifications jostle with one another, the interplay that has provided the raw material from which evolution has shaped the chromatin landscape and its functions.