The Trouble with Transposable Elements: Data Analysis

Transposable elements, or so-called “jumping genes,” are highly repetitive sequences that make up around half of the human genome. This vast ocean of transposable elements has countless roles in human development, health, and disease, and an ever-increasing number of discoveries are being made thanks to the affordability of the omics technologies used to investigate their diverse biological functions.

But shining a spotlight on transposable elements with omics data requires researchers to carefully consider the numerous challenges of analyzing millions of hard-to-place repetitive sequences that are often poorly annotated.

In this article, we discuss the potential problems researchers face when analyzing transposable elements in omics data and highlight how Nexco Analytics’ TEnex, our licensed methodology for transposable element analysis, brings solutions.

An Overwhelming Ocean of Transposable Elements

The human genome is a mosaic of genes and transposable elements. While we only have around 20,000 protein-coding genes, we have approximately five million individual transposable elements scattered throughout our DNA (1).

The sheer number of transposable elements makes their analysis by omics approaches computationally intensive and complex. So much so that for next-generation sequencing (NGS)-based techniques, such as bulk and single-cell RNA-seq, ChIP-seq, or ATAC-seq, most data analysis strategies simply discard or “mask” transposable elements.

This means crucial transposable element-derived biomarkers or novel contributors to health or disease pathology may be missed.

It’s All in The Name. Problems With Transposable Element Annotation.

With the vast number of transposable elements comes another problem. Their annotation.

Thanks to their deep evolutionary origins and continual diversification, transposable elements come in a bewildering variety of shapes and sizes. Their complexity has led to various annotation efforts based on phylogenetic relationships or sequence similarities between different elements (2). However, these annotations are often fragmented with conflicting nomenclature depending on the transposable element class, family, or subfamily.

This can cause transposable elements to be assigned names that aren’t actually phylogenetically correct, leading to misleading results with potential implications for future research.

To combat the challenges in transposable element annotation, we at Nexco Analytics have drawn on our extensive expertise in transposable element analysis and collaborated with world-leading scientists such as Prof. Didier Trono, who leads the Laboratory of Virology and Genetics at EPFL. The result is a highly accurate, curated database containing comprehensive transposable element annotations, genomic locations, and sequences. Our database has been fundamental to over 30 high-impact publications.

Transposable Elements Are Hard to Place

Alongside causing difficulties in their annotation, the repetitive nature of transposable element-derived sequences makes it challenging to accurately map short reads generated by NGS to specific transposable element locations (3).

This is especially true of younger, human-specific transposable elements such as the SVA family. Their sequences have had less time to diverge, so individual copies remain very similar to one another.

At Nexco Analytics, we use two methods for mapping NGS reads to get the best possible resolution when assessing transposable element activity, whether that is their expression or their possible role as enhancers.

- Multi-mapping

Multi-mapping gives researchers an overview of the total number of reads mapping to members of an entire transposable element family or subfamily.

This broad overview is useful when sequencing reads map ambiguously at multiple locations in the reference genome, as is typical for younger transposable elements like the SVA family, but reliable positional information about the active transposable element is lost. Some algorithms do try to assign positional information to multi-mapped reads, but this should be used with extreme care.

This approach is also helpful for shallower sequencing data common with single-cell RNA-seq technologies like 10X to establish how an entire transposable element family or subfamily behaves in a particular cell type (4).

- Unique mapping

When the sequence of a read differs sufficiently from every other location in the reference genome, it can be mapped unambiguously to a single location. Such a uniquely mapped read provides accurate positional information about where it originated. This could be any location in the genome, such as a genic exon or a transposable element.

This type of mapping is crucial to understanding how the position of transposable elements may influence gene expression via their roles as enhancers or alternative promoters (5).

If unique mapping and positional information are required, we highly recommend using the longest reads possible (e.g., 150 bp paired-end reads) to maximize the likelihood of reads being uniquely mappable to specific genomic locations.

Although this unique mapping approach is especially challenging for single-cell RNA-seq data, at Nexco Analytics, we have developed TEnex, a robust unique mapping pipeline vital for the discoveries made in an increasing number of peer-reviewed studies.
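The difference between the two mapping modes described above can be sketched in a few lines of Python. This is a toy illustration only and not the TEnex pipeline: the input dictionary, the locus identifiers, and the `count_reads` helper are hypothetical, and splitting a multi-mapped read evenly across its alignments (a weight of 1/n) is one common simplification assumed here for illustration.

```python
from collections import defaultdict

# Hypothetical toy input: read ID -> list of (locus_id, subfamily) alignments.
# In practice these would come from an aligner's output; names are made up.
alignments = {
    "read1": [("SVA_D_chr1_1000", "SVA_D")],                       # unique
    "read2": [("SVA_D_chr1_1000", "SVA_D"),
              ("SVA_D_chr5_200", "SVA_D")],                        # multi-mapped
    "read3": [("L1HS_chr2_500", "L1HS"),
              ("L1PA2_chr7_900", "L1PA2")],                        # multi-mapped
    "read4": [("L1HS_chr2_500", "L1HS")],                          # unique
}


def count_reads(alignments):
    """Return (per-locus counts from unique reads, per-subfamily counts from all reads)."""
    locus_counts = defaultdict(int)
    subfamily_counts = defaultdict(float)
    for hits in alignments.values():
        # Unique mapping: only reads with a single alignment keep positional information.
        if len(hits) == 1:
            locus_counts[hits[0][0]] += 1
        # Multi-mapping view: every read contributes a total weight of 1,
        # split evenly across its alignments (an assumption of this sketch).
        weight = 1.0 / len(hits)
        for _locus, subfamily in hits:
            subfamily_counts[subfamily] += weight
    return dict(locus_counts), dict(subfamily_counts)


locus_counts, subfamily_counts = count_reads(alignments)
print(locus_counts)
print(subfamily_counts)
```

In this sketch, the per-locus table only ever sees reads with a single alignment, while the per-subfamily table uses every read but carries no reliable positional information, which mirrors the trade-off between the two approaches.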

Can Long-Read Sequencing Solve These Problems?

Long-read sequencing produces reads that are thousands to hundreds of thousands of base pairs in length (6). This increased length of reads allows researchers to map repetitive regions of the genome more accurately than traditional short-read sequencing, thanks to the increased likelihood that each individual read will have a unique sequence.
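To make the link between read length and mappability concrete, here is a minimal Python sketch. It builds a hypothetical toy "genome" containing two copies of a 300 bp repeat that differ by a single base, then measures what fraction of error-free reads of a given length occur exactly once. The sequences, lengths, and the `unique_fraction` helper are all assumptions made for illustration; real data also involves sequencing errors and far more repeat copies.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the toy genome is reproducible


def rand_seq(n):
    """Random DNA sequence of length n (toy data, not a real genome)."""
    return "".join(rng.choice("ACGT") for _ in range(n))


def unique_fraction(genome, read_len):
    """Fraction of error-free reads of length read_len whose sequence occurs once."""
    reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = Counter(reads)
    return sum(1 for r in reads if counts[r] == 1) / len(reads)


# Two copies of a 300 bp repeat that differ by one substitution at position 150,
# mimicking a young element whose copies have barely diverged.
repeat = rand_seq(300)
pos = 150
alt = "A" if repeat[pos] != "A" else "C"
copy2 = repeat[:pos] + alt + repeat[pos + 1:]
genome = rand_seq(500) + repeat + rand_seq(500) + copy2 + rand_seq(500)

for read_len in (50, 100, 200):
    print(read_len, round(unique_fraction(genome, read_len), 3))
```

In this toy setup, a read is only unique if it spans the single diverged base or reaches into the flanking sequence, so the uniquely placeable fraction rises with read length.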

This powerful technology improves the mapping of repetitive sequences, but at the expense of lower coverage (for now). As most transposable elements are expressed at relatively low levels compared to genes, long-read sequencing technologies are primarily suitable for more highly expressed elements and may miss those with lower expression.
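A back-of-the-envelope calculation shows why lower coverage matters for lowly expressed elements. The sketch below assumes read sampling is roughly Poisson, so the chance of seeing at least one read from a transcript is 1 - exp(-N·f), where N is the total number of reads and f is the fraction of reads that transcript contributes. The read counts and the fraction used here are hypothetical round numbers, not measurements from any platform.

```python
import math


def detection_probability(total_reads, transcript_fraction):
    """P(at least one read) under a Poisson sampling assumption."""
    expected_reads = total_reads * transcript_fraction
    # -expm1(-x) equals 1 - exp(-x), computed accurately for small x.
    return -math.expm1(-expected_reads)


# Hypothetical scenario: an element contributing 1 in 10 million reads.
fraction = 1e-7
for label, n_reads in (("long-read run", 1_000_000), ("short-read run", 50_000_000)):
    print(label, round(detection_probability(n_reads, fraction), 3))
```

Under these assumed numbers, the run with fewer reads is much less likely to capture the element at all, which is one reason combining both technologies can be attractive.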

Combining long-read and short-read sequencing approaches ultimately provides a more holistic view of transposable element functions than either technology in isolation.

Discoveries Powered by Advanced Analytics

Nexco Analytics has over ten years of experience in advanced transposable element data analysis. In addition to our transposable element database, we have used our expertise to develop a suite of peer-reviewed pipelines called TEnex, optimized to analyze data generated with long- and short-read sequencing approaches from single cells, bulk cell populations, or tissues.

Our pipeline covers the entire data analysis workflow, from pre-processing and quality control to advanced, bespoke, and statistically relevant insights for comparative studies and biomarker discovery.

We’re here to help you overcome the challenges with your next transposable element analysis, so why not discuss your requirements with our team? Sit back and let us do the rest.

References

1. Cordaux R, Batzer MA. The impact of retrotransposons on human genome evolution. Nature Reviews Genetics. 2009 Oct;10(10):691–703.

2. Storer J, Hubley R, Rosen J, Wheeler TJ, Smit AF. The Dfam community resource of transposable element families, sequence models, and genome annotations. Mobile DNA. 2021 Dec;12:1–4.

3. Lanciano S, Cristofari G. Measuring and interpreting transposable element expression. Nature Reviews Genetics. 2020 Dec;21(12):721–36.

4. Pontis J, Pulver C, Playfoot CJ, Planet E, Grun D, Offner S, Duc J, Manfrin A, Lutolf MP, Trono D. Primate-specific transposable elements shape transcriptional networks during human development. Nature Communications. 2022 Nov 23;13(1):7178.

5. Playfoot CJ, Duc J, Sheppard S, Dind S, Coudray A, Planet E, Trono D. Transposable elements and their KZFP controllers are drivers of transcriptional innovation in the developing human brain. Genome Research. 2021 Sep 1;31(9):1531–45.

6. Kovaka S, Ou S, Jenike KM, Schatz MC. Approaching complete genomes, transcriptomes and epi-omes with accurate long-read sequencing. Nature Methods. 2023 Jan;20(1):12–6.

Monday, Apr 15, 2024, 2:16 PM

