
Single molecule real-time (SMRT) sequencing comes of age: applications and utilities for medical diagnostics

Simon Ardui, Adam Ameur, Joris R Vermeesch, Matthew S Hestand


Nucleic Acids Research, Volume 46, Issue 5, 16 March 2018, Pages 2159–2168, https://doi.org/10.1093/nar/gky066

Published: 01 February 2018


Abstract

Short-read massively parallel sequencing has become established as a standard diagnostic tool in clinical settings. However, short-read technologies have inherent limitations such as GC bias, difficulty aligning to repetitive regions, trouble distinguishing paralogs, and difficulty resolving allele phase. Long-read single-molecule sequencers address these limitations. In addition to improved accuracy, long-read technologies enable the detection of epigenetic modifications in native DNA. The first widely adopted commercial long-read platform was the RS system, based on PacBio's single molecule real-time (SMRT) sequencing technology, which has since evolved into the RSII and Sequel systems. Here we summarize how SMRT sequencing is reshaping constitutional genetics, reproductive health, cancer diagnostics, microbial analysis and viral research.

Issue Section:

SURVEY AND SUMMARY

INTRODUCTION

Modern medical genomics research and diagnostics rely heavily on DNA sequencing. Applications span the entire lifespan and include prenatal diagnostics, newborn screening, rare disease diagnosis, hereditary cancer identification, pharmacogenetic testing, and predisposition testing for numerous diseases. Even future generations can be addressed through carrier screening and pre-implantation genetic diagnosis.

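Sequencing technologies can be divided into three generations: first, second and third generation sequencing (3). Although early first-generation techniques led to major discoveries (1,2), the great revolution in sequencing came with the invention of the chain-termination, or dideoxynucleotide, method that we now know as Sanger sequencing (1,2). Advances in chemistry and the shift from gel to capillary electrophoresis led to the current Sanger sequencers, which provide low-throughput, high-quality reads of up to approximately 1 kb. Sanger sequencing is still often regarded as the gold standard and is widely used for testing Mendelian disorders (6) and for targeted validation of high-throughput sequencing results.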

In the first decade of the 21st century, multiple novel DNA sequencing methods emerged (6). Compared with first-generation platforms, these second-generation technologies offer significantly shorter read lengths (up to several hundred base pairs) but much higher throughput (up to billions of reads per run). Short-read platforms that rely on optical detection include Illumina's bridge amplification and sequencing-by-synthesis instruments (e.g. HiSeq and MiSeq), Roche 454 pyrosequencers, and Applied Biosystems' sequencing by oligonucleotide ligation and detection (SOLiD) platforms. Ion Torrent sequencers, in contrast, detect nucleotide incorporation through the pH change caused by hydrogen ions released during polymerization rather than through light signals. While these short-read platforms enable rapid identification of causative mutations in disease genes, exomes, or entire human genomes in both research and clinical settings (7), they share common limitations. Short read lengths hinder the assignment of reads to complex genomic regions, the phasing of variants and the resolution of repetitive sequences, and they leave gaps and ambiguous regions in de novo assemblies (8–11). Furthermore, amplification steps during library preparation or sequencing introduce chimeric reads, alter repeat sizes, and cause underrepresentation of GC-rich or GC-poor regions. Together, these limitations constrain diagnostic variant detection.

Third-generation sequencing is generally characterized by single molecule approaches, which are fundamentally distinct from the clonal, amplification-based second-generation methods. Helicos pioneered the first commercial application of single molecule sequencing using fluorescence detection and sequencing by synthesis. While this early approach avoided the biases associated with GC-rich or GC-poor regions, it still yielded short reads, typically around 35 bp (14). More recent advances include the SMRT technology developed by Pacific Biosciences (PacBio), which offers exceptionally long read lengths exceeding 20 kb (15), and the nanopore technology from Oxford Nanopore Technologies, which provides even longer reads. These long-read platforms enable sequencing through repetitive elements, direct phasing of variants, and even direct detection of epigenetic modifications (17,18). Sequencing runs typically take only a few hours, making these platforms well suited to diagnostic applications. Although low-cost nanopore technologies (reviewed in (18–20)) are gaining traction and may become viable options in the future, SMRT sequencing is at present the more mature technology for diagnostic purposes. This review focuses on the implementation of SMRT technology in human genetic diagnostics.

SMRT SEQUENCING TECHNOLOGY AND TERMINOLOGY

Before SMRT sequencing, a library must be prepared from double-stranded DNA input material (Figure 1A). This typically requires five or more micrograms of DNA, which can limit some applications. Library preparation consists simply of ligating hairpin adapters onto the DNA molecules, creating a circular construct known as a SMRTbell (Figure 1B) (21). Next, a primer and polymerase are annealed to the adapter, and the library is loaded onto a SMRT Cell containing nanoscale observation chambers called Zero Mode Waveguides (ZMWs): 150,000 on the RSII system and up to one million on the newer Sequel platform. The polymerase-bound SMRTbells are loaded into the ZMWs (Figure 1C). Ideally, each ZMW holds exactly one SMRTbell, which maximizes throughput and read lengths. In practice, optimal loading places a single SMRTbell in roughly one-third to one-half of the ZMWs on a SMRT Cell. Consequently, a SMRT Cell generally produces approximately 55,000 reads on the RSII system and 365,000 reads on the Sequel system (Table 1). The actual sequencing occurs within each ZMW, whose small diameter confines light detection to the smallest available volume (22). Within each ZMW, the polymerase incorporates fluorescently labeled nucleotides that emit signals captured in real time by a camera (Figure 1C). These signals are converted into long sequences termed continuous long reads (CLR), linear reads, or polymerase reads. For short insert libraries, the circular template means that the CLR covers the insert sequence multiple times. Each pass over the original strand is referred to as a subread. All subreads from the same molecule can be combined into a highly accurate consensus sequence termed either a circular consensus sequence (CCS) or a read-of-insert (ROI) (Figure 1F–H, left panel). While these terms are often used interchangeably in practice, they differ by definition: a CCS requires two full sequencing passes of the insert, whereas an ROI can be generated even from a partial pass.
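The loading figures quoted above can be related to a simple occupancy model. The sketch below assumes, as an illustration that the source does not state, that SMRTbells settle into ZMWs independently, so that the number of molecules per ZMW is roughly Poisson distributed. Under that assumption the fraction of ZMWs holding exactly one molecule is λ·exp(−λ), which peaks at about 37% when the mean λ is one molecule per ZMW. The function names and the default λ are hypothetical.

```python
import math


def single_occupancy_fraction(mean_molecules_per_zmw: float) -> float:
    """Poisson probability that a ZMW holds exactly one molecule."""
    lam = mean_molecules_per_zmw
    return lam * math.exp(-lam)


def expected_productive_zmws(n_zmws: int, mean_molecules_per_zmw: float = 1.0) -> float:
    """Expected number of singly loaded ZMWs on a SMRT Cell (assumed Poisson loading)."""
    return n_zmws * single_occupancy_fraction(mean_molecules_per_zmw)


if __name__ == "__main__":
    # ZMW counts taken from the text: 150,000 (RSII) and one million (Sequel).
    for platform, n_zmws in (("RSII", 150_000), ("Sequel", 1_000_000)):
        print(platform, round(expected_productive_zmws(n_zmws)))
```

With the best-case mean of one molecule per ZMW, this model gives values close to the approximately 55,000 and 365,000 reads per SMRT Cell quoted in the text, which is consistent with the stated one-third to one-half loading range. Real loading also depends on library size distribution and loading method, which this sketch ignores.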

Figure 1.


Overview of SMRT Sequencing Technology. Sequencing starts with preparing a library from double stranded DNA (A) to which hairpin adapters are ligated (B). This library is thereafter loaded onto a SMRT Cell made up of nanoscale observation chambers (Zero Mode Waveguides (ZMWs)). The DNA molecules in the library will be pulled to the bottom of the ZMW where the polymerase will incorporate fluorescently labelled nucleotides (C). Note that not all ZMWs will contain a DNA molecule because the library is loaded by diffusion. The fluorescence emitted by the nucleotides is recorded by a camera in real-time. Hence, not only the fluorescence color can be registered, but also the time between nucleotide incorporation which is called the interpulse duration (IPD) (D, right panel). When a sequencing polymerase encounters nucleotides on the DNA strand containing an (epigenetic) modification, for example a 6-methyl adenosine modification (E, left panel), then the IPD will be delayed (E, right panel) compared to non-methylated DNA (D, right panel). Due to the circular structure of the library, a short insert will be covered multiple times by the continuous long read (CLR). Each pass of the original DNA molecule is termed a subread, which can be combined into one highly accurate consensus sequence termed a circular consensus sequence (CCS) or reads-of-insert (ROI) (F–H, left panel). Though SMRT sequencing always uses a circular template, long insert libraries typically only have a single pass and hence generate a linear sequence with single pass error rates (black nucleotides) (F and G, right panel). Afterwards, overlapping single passes can be combined into one consensus sequence of high quality (H, right panel). Overall, CCS reads have the advantage of being very accurate while single passes stand out for their long read lengths (>20 kb).

Table 1.

Comparison of PacBio sequencing platforms with two current industry standards

Platform                        Read length        Number of reads   Error rate   Run time
PacBio RSII (per SMRT Cell)     Average 10–16 kb   ∼55 000           13–15%       0.5–6 hours
PacBio Sequel (per SMRT Cell)   Average 10–14 kb   ∼365 000          13–15%       0.5–10 hours
Illumina HiSeq 4000             2 × 150 bp         5 billion         ∼0.1%        <1–3.5 days
Illumina MiSeq                  2 × 300 bp         25 million        ∼0.1%        4–55 hours

Numbers are derived from personal experience and from the websites of the respective companies (www.pacb.com and www.illumina.com), queried on 14 November 2017.

Because nucleotide incorporation is detected in real time, the progress of the polymerase along the DNA strand is monitored continuously during sequencing (23). The interval between successive nucleotide incorporations is called the interpulse duration (IPD), and it varies with epigenetic modifications on the DNA (Figure 1D and E). Because the polymerase does not hold a single nucleotide but engages approximately twelve nucleotides at once, a modification on one nucleotide can also influence the incorporation kinetics at neighboring positions. This produces a characteristic 'fingerprint'; fingerprints have been identified for several modifications, including 6-mA, 4-mC and (Tet-converted) 5-mC.
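The kinetic signal described above is commonly summarized as an IPD ratio, a term the article uses in later sections. The following is a minimal sketch of that idea, assuming per-position IPD measurements are already available for a native sample and for an unmodified control; the data layout, function names and the threshold of 2.0 are illustrative assumptions, not part of any PacBio tool.

```python
from statistics import mean


def ipd_ratios(native_ipds, control_ipds):
    """Per-position ratio of mean native IPD to mean control IPD.

    Both arguments are lists with one entry per template position; each entry
    is a list of IPD values (seconds) observed at that position.
    """
    if len(native_ipds) != len(control_ipds):
        raise ValueError("native and control must cover the same positions")
    return [mean(n) / mean(c) for n, c in zip(native_ipds, control_ipds)]


def flag_candidate_positions(ratios, threshold=2.0):
    """Indices whose IPD ratio meets an (assumed) threshold for slowed kinetics."""
    return [i for i, ratio in enumerate(ratios) if ratio >= threshold]


if __name__ == "__main__":
    native = [[0.25, 0.25], [1.0, 1.0], [0.5, 0.5]]
    control = [[0.25, 0.25], [0.25, 0.25], [0.5, 0.5]]
    ratios = ipd_ratios(native, control)
    print(ratios, flag_candidate_positions(ratios))
```

Because a modification can affect neighboring positions as well, a practical caller would look at the pattern of ratios across a window rather than at single positions in isolation.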

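Besides yielding fewer but longer reads (Table 1), PacBio data differ from short-read sequencing data in several respects. Reads do not have a fixed length; instead, read lengths follow a distribution that depends on how long each polymerase remains active. Because no amplification is needed during library preparation or sequencing, GC bias is virtually absent. Unlike second-generation platforms, raw PacBio reads contain more errors, predominantly insertions rather than mismatches, at a rate of about 13–15% (Table 1), although these errors are randomly distributed within each read (25,26). This randomness makes it possible, by sequencing the same molecule multiple times (CCS reads) or by combining different CLRs...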
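Why randomly distributed errors can be averaged out is easiest to see with a deliberately simple model. The sketch below assumes each pass is wrong at a given position independently with the same probability, and that the consensus is a plain majority vote that fails only when more than half of the passes are wrong. This is an illustrative upper-level model and not the algorithm used by the PacBio software; the function name is hypothetical.

```python
from math import comb


def majority_error(per_pass_error: float, passes: int) -> float:
    """Probability that a majority vote over independent passes is wrong.

    Counts the binomial probability that more than half of the passes carry
    an error at the position. An odd number of passes avoids ties.
    """
    if passes < 1 or passes % 2 == 0:
        raise ValueError("use a positive, odd number of passes")
    p = per_pass_error
    needed = passes // 2 + 1
    return sum(
        comb(passes, k) * p**k * (1 - p) ** (passes - k)
        for k in range(needed, passes + 1)
    )


if __name__ == "__main__":
    # 0.15 corresponds to the upper end of the raw error rate in Table 1.
    for n in (1, 3, 5, 9, 15):
        print(n, majority_error(0.15, n))
```

The model is pessimistic in one respect, since independent errors at the same position rarely agree with each other, and optimistic in another, since it ignores alignment ambiguity around insertions and deletions. It is meant only to show the direction of the effect.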

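In addition, diffusion loading can in some cases preferentially load shorter molecules (Figure 1G and H). This loading bias can be reduced by magnetic bead loading, by applying an electric field that forces charged molecules to the bottom of the ZMW, by size selection to remove shorter molecules, and by adding polyethylene glycol during loading to increase the loading density of large molecules (27).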

Because these reads are fundamentally different from short reads, bioinformatics analysis requires existing tools to be adapted or new methods to be created, for example for alignment (26-28) and assembly (199-205). Various PacBio-specific tools are available within the PacBio SMRT analysis suite. These include utilities for demultiplexing samples, generating long single-molecule reads, analyzing long amplicons, performing de novo genome assembly and analyzing epigenetic profiles. They can be run from the command line or through the graphical user interface of the PacBio SMRT Portal.

CONSTITUTIONAL

Tandem repeat disorders

Tandem repeats cause more than 40 neurological, neurodegenerative or neuromuscular diseases when mutated (40). Unfortunately, sequencing those DNA elements is difficult with short-read platforms because the reads are too short to span most tandem repeats. The first tandem repeat studied by SMRT sequencing was the FMR1 CGG repeat (41). Healthy individuals carry around 30 CGG units, which are mostly interrupted by one or two AGG units. An expansion of the repeat to more than 200 units causes Fragile X Syndrome (FXS), which is one of the most frequent causes of inherited intellectual disability and autism. Loomis et al. (41) showed they could sequence through a long full mutation allele of 750 units, which equals about 2 kb of 100% GC and repetitive content. Interestingly, expansions to full mutations only occur upon maternal transmission, whereby the risk directly correlates with increasing repeat size and fewer AGG interruptions (42). SMRT sequencing can be used to determine the repeat size and to detect the number of interrupting AGG units (43). A main advantage of this approach is the unambiguous separation of the two CGG repeats on the different X chromosomes of females, thereby outperforming all other (PCR) approaches. The information generated by SMRT sequencing is then used clinically for improved genetic counselling of women weighing the risk of having a child with FXS (43–45). Another example of tackling a tandem repeat by SMRT sequencing is the ATTCT repeat embedded in intron 9 of the Spinocerebellar ataxia type 10 gene (SCA10) (10). For the first time, the full length of an expanded ATTCT repeat was completely sequenced using SMRT technology. The repeat was reconstructed by assembly and both known and novel interruptions were detected (10). The presence of those interruptions influences the phenotype of SCA10 patients, and hence knowing the exact repeat structure allows for better genotype-phenotype correlations. It will be interesting to use SMRT sequencing in the near future for other tandem repeats with interruptions, such as those in Myotonic Dystrophy (46) and Friedreich's Ataxia (47), to increase our knowledge of tandem repeat configuration and its influence on the stability of the repeat and the phenotype of an individual.
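Once a read spans the whole FMR1 repeat, sizing the repeat and locating its AGG interruptions is a matter of walking along the sequence in steps of three. The sketch below assumes the repeat region has already been extracted from an error-free consensus read and trimmed to start at the first unit; the function name and the returned fields are hypothetical.

```python
def summarize_cgg_repeat(repeat_seq: str) -> dict:
    """Count repeat units and report AGG interruptions in a CGG repeat tract."""
    seq = repeat_seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("repeat length is not a multiple of 3")
    units = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    unexpected = sorted(set(units) - {"CGG", "AGG"})
    if unexpected:
        raise ValueError(f"unexpected units in repeat: {unexpected}")

    # 1-based unit positions of the AGG interruptions.
    agg_positions = [i + 1 for i, unit in enumerate(units) if unit == "AGG"]

    # Longest run of CGG units that is not broken by an AGG.
    longest = current = 0
    for unit in units:
        current = current + 1 if unit == "CGG" else 0
        longest = max(longest, current)

    return {
        "total_units": len(units),
        "length_bp": len(seq),
        "agg_positions": agg_positions,
        "longest_pure_cgg_run": longest,
    }


if __name__ == "__main__":
    # A made-up allele with 30 units and two AGG interruptions.
    allele = "CGG" * 10 + "AGG" + "CGG" * 9 + "AGG" + "CGG" * 9
    print(summarize_cgg_repeat(allele))
```

The same arithmetic explains the figure quoted for the full mutation allele: 750 units of three bases each amount to 2250 bp, roughly 2 kb.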

The applications described above rely on PCR, but amplification-free enrichment techniques are now being developed. Amplification-based methods are error-prone, particularly for tandem repeats (41), and they also erase epigenetic marks (48). This hinders comprehensive genetic and epigenetic analysis of tandem repeats. Two distinct approaches are currently in development. The first, introduced by Pham et al. (48), combines type IIS restriction enzyme digestion with specialized hairpin adapters designed to match the overhangs of the targeted digestion products. These adapters anneal specifically at the digestion sites, after which a 'capture-hook' technique is used to isolate the sequences of interest. The second approach pairs SMRT sequencing with CRISPR/Cas9. A capture adapter is introduced at CRISPR/Cas9 cleavage sites, and the target molecules are then selectively enriched by magnetic bead capture of the adapter and sequenced. Thanks to the throughput of SMRT sequencing, multiple tandem repeat targets, such as the FMR1 CGG, C9ORF72 GGGGCC, HTT CAG and SCA10 ATTCT repeats, can be enriched and sequenced in a single run (bioRxiv, https://doi.org/10.1101/203919).

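Both methods have been applied to the FMR1 CGG repeat and confirmed, for the first time in human cell lines, that genuine biological variation exists in this region (48). These methods not only avoid amplification bias but also capture native DNA directly, which enables direct detection of epigenetic features. In the future, the technology could be used as a diagnostic screen to detect full mutations and to assess the methylation status of the FMR1 CGG repeat region (49–51). Both factors influence the FXS phenotype. Traditionally this is done by Southern blotting, but that method is time-consuming, laborious and not very accurate; a faster and more direct method such as SMRT sequencing would therefore greatly improve the diagnostic efficiency for FMR1 and other repeat disorders (49–52). PacBio's enrichment technique has also been used to study patients with an expanded SCA10 ATTCT repeat (53). SMRT sequencing revealed a complete absence of interruptions, a finding that may be related to the patients' parkinsonian symptoms.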

Polymorphic regions

Genotyping the human leukocyte antigen (HLA) region, or the human major histocompatibility complex (MHC), is crucial for diagnosing autoimmune disorders and for selecting donors in organ and stem cell transplantation. Genes in the region can be highly polymorphic, HLA-B being the most variable with >2000 alleles already annotated in 2012 (54). The high variability in sequence makes this region exceptionally difficult to map with short reads (54). HLA can be divided into three molecule classes and regions, termed class I, II and III, though the first two are primarily studied. Amplicons of ∼400–900 bp have been used with 454 sequencing to target specific exons of class I genes (55,56). However, considering these genes are ∼3 kb in length, entire alleles, as opposed to exons, can be sequenced in a single PacBio read. Class II genes can exceed 10 kb, making them more difficult, but still possible. Full length class I HLA alleles have been targeted in humans with hybrid PacBio-Illumina approaches (57) and PacBio-only approaches (58,59). Many large HLA typing labs, such as the Anthony Nolan Research Institute (58,59), are utilizing or developing SMRT sequencing pipelines of their own or using commercial kits, such as those offered by GenDx (Utrecht, The Netherlands), to now target class I as well as many class II genes. This is rapidly expanding the number of known HLA alleles (57) and is becoming a gold standard for organ transplant genotyping and blood stem cell transplantation.

Similarly complex regions can be analyzed with the same approaches. The KIR region, whose genes encode proteins that recognize HLA molecules, has recently been studied with SMRT sequencing to determine haplotypes without imputation for the first time (60).

Pseudogene discrimination

The high sequence similarity between pseudogenes and their functional homologs makes it particularly difficult to tell the two apart with short-read sequencing. Long reads that span the entire gene region can be used to identify unique regions and/or to phase variants, providing a reliable way to distinguish pseudogenes from true genes. For diagnostic purposes, it is common to target specific loci or regions of interest, which is a cost-effective way to work within the limited throughput of current SMRT sequencing platforms. The simplest way to enrich for specific loci is multiplex long-range PCR (up to 10 kb). To differentiate between samples, barcodes can be incorporated during PCR via direct primer design (61,62), through a nested PCR strategy (57,61,63,64), or by including barcodes in the hairpin adapters during library preparation (Pacific Biosciences Product Note: www.pacb.com/wp-content/uploads/2015/09/ProductNote-Barcoded-Adapters-Barcoded-Universal-Primers.pdf). Consequently, for multiplexed long-range amplicon assays, only a single library preparation is needed after the barcoded amplicons have been pooled. This eliminates the fragmentation and multiple library preparations typically required for short-read platforms. The result is a rapid and cost-efficient library preparation that can be sequenced within a few hours, which facilitates the diagnosis of complex gene loci.
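After pooled sequencing, reads have to be assigned back to their samples by barcode. The sketch below illustrates the principle only: it assumes the barcode sits at the very start of each read and tolerates a small number of substitutions, counted as Hamming distance. Real long reads also contain insertions and deletions, so production demultiplexers use alignment-based matching instead. The barcode sequences, sample names and function names here are invented for the example.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, barcodes, max_mismatches=1):
    """Assign reads to samples by their leading barcode.

    reads: iterable of read sequences.
    barcodes: dict mapping sample name -> barcode sequence.
    Returns a dict of sample name -> list of reads with the barcode trimmed,
    plus an "unassigned" bin for reads matching no barcode or several.
    """
    bins = {name: [] for name in barcodes}
    bins["unassigned"] = []
    for read in reads:
        hits = [
            name
            for name, barcode in barcodes.items()
            if len(read) >= len(barcode)
            and hamming(read[: len(barcode)], barcode) <= max_mismatches
        ]
        if len(hits) == 1:
            name = hits[0]
            bins[name].append(read[len(barcodes[name]):])
        else:
            bins["unassigned"].append(read)
    return bins


if __name__ == "__main__":
    barcodes = {"sample1": "ACGTAC", "sample2": "TGCATG"}
    reads = ["ACGTACGGGG", "TGCTTGAAAA", "CCCCCCCCCC"]
    print(demultiplex(reads, barcodes))
```

Reads that match more than one barcode are deliberately left unassigned, since a wrong sample assignment is worse for diagnostics than a discarded read.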

One such application uses barcoded amplicons of 6 to 8 kb, including potentially nested products, to target CYP2D6. This gene has homologous paralogs and copy number variants, which makes reliable genotyping with short-read platforms challenging. Following SMRT sequencing, variants are identified either by aligning reads to a reference or with 'Long Amplicon Analysis' (LAA), which is part of the SMRT analysis suite. LAA stands out because it enables reference-free analysis and phased allele determination. The workflow begins with demultiplexing the reads when necessary, followed by overlap detection, clustering to identify distinct amplicons, phasing of these clusters to separate alleles, and finally consensus calling with Quiver (34). LAA must be tuned with respect to the number of reads used for clustering: too many reads can produce false alleles and long processing times, whereas too few can cause allelic dropout. Once assembled, alleles are compared with one another or against a reference genome for functional annotation. This approach not only expands the range of CYP2D6 variants that can be targeted but also enables analysis of the entire locus, including flanking regions and introns. Such detailed genotyping improves the prediction of metabolizer phenotype and supports personalized medicine. Similar long-range PCR assays have been used with PacBio technology to analyze other genes, such as PKD1 (for diagnosing autosomal-dominant polycystic kidney disease) and IKBKG (for detecting primary immunodeficiency in patients with life-threatening bacterial infections) (Table 2) (61,64,65).

Table 2.

Applications of human SMRT sequencing and clinical utility

Target      Disease                                                                     Ref.
Tandem repeat sequencing
  FMR1      Fragile X Syndrome                                                          (43), a
  HTT       Huntington's Disease                                                        a
  C9orf72   Amyotrophic Lateral Sclerosis (ALS)                                         a
  SCA10     Spinocerebellar ataxia type 10, Parkinson's disease                         (10,53), a
Highly polymorphic regions
  HLA       Autoimmune disorders & transplantation                                      (57–59)
  KIR       Autoimmune diseases & transplantation                                       (60)
Pseudogene discrimination
  CYP2D6    Drug metabolism                                                             (61,63)
  PKD1      Autosomal-dominant polycystic kidney disease                                (64)
  IKBKG     Primary immunodeficiency diseases                                           (65)
Cancer
  BCR-ABL1  Chronic Myeloid Leukemia (CML)                                              (69)
  TP53      Myelodysplastic Syndromes (MDS) and Acute Myeloblastic Leukemia (AML)       (70)
Reproductive genomics
  TCOF1     Treacher Collins syndrome                                                   (67)
  PTPN11    Noonan syndrome                                                             (67)

a bioRxiv, https://doi.org/10.1101/203919.

REPRODUCTIVE GENOMICS

Reproductive genomic medicine and the associated counseling, including pre-implantation genetic diagnosis (PGD), depend heavily on the ability to haplotype, or phase, alleles in embryos, patients and parents. Long-read technologies enable direct phasing of targeted amplicons, which can be used to identify the parent of origin of alleles in embryos or patients (66,67). In one family with a child affected by Treacher Collins syndrome, SMRT amplicon sequencing was used to confirm paternal transmission of a TCOF1 variant linked to splicing defects in the gene and potential disease causation (67). For de novo mutations arising from germline mosaicism, the frequency of the harmful allele provides insight into the recurrence risk for future offspring. In a couple with multiple miscarriages and suspected fetal Noonan syndrome, SMRT amplicon sequencing identified a disease-causing PTPN11 variant in 37% of the father's sperm (67). Droplet digital PCR showed no evidence of the variant in the father's blood but confirmed its presence at a frequency of 40% in his sperm (67). This approach thus allowed the recurrence risk for subsequent pregnancies to be estimated. Array-based whole-genome single-cell haplotyping is already in clinical use for embryo selection prior to implantation; however, phasing still requires additional family members (68). We anticipate significant advances in PGD once long-read whole-genome sequencing is integrated for direct phasing, eliminating the need to analyze additional family members.
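A mosaic allele fraction such as the one reported for sperm is an estimate from a finite number of reads, so it carries sampling uncertainty. The sketch below shows one common way to express that uncertainty, the Wilson score interval for a binomial proportion. It assumes each read is an independent draw from the pool of sperm DNA, which PCR amplification may violate, and the read counts used are hypothetical because the source reports only percentages.

```python
from math import sqrt


def wilson_interval(variant_reads: int, total_reads: int, z: float = 1.96):
    """Point estimate and Wilson score interval for a variant allele fraction.

    z = 1.96 corresponds to an approximate 95% interval.
    Returns (fraction, lower, upper).
    """
    if total_reads <= 0 or not 0 <= variant_reads <= total_reads:
        raise ValueError("need 0 <= variant_reads <= total_reads and total_reads > 0")
    n = total_reads
    p = variant_reads / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, centre - half_width, centre + half_width


if __name__ == "__main__":
    # Hypothetical depths giving the same 37% point estimate.
    for variant, total in ((37, 100), (370, 1000)):
        print(variant, total, wilson_interval(variant, total))
```

The interval narrows as read depth grows, which is one reason deep amplicon sequencing is attractive when a recurrence risk estimate depends on the measured fraction.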

CANCER

During the treatment of cancer patients, it is crucial to monitor low-frequency mutations that can confer a proliferative advantage on malignant cells. Chronic myeloid leukemia (CML), a blood cancer, results from a translocation between chromosomes 9 and 22 that produces the BCR-ABL1 fusion protein. CML patients are typically treated with tyrosine kinase inhibitors (TKIs) that inhibit BCR-ABL1, but drug-resistance mutations can arise under this therapy. It is therefore essential to examine the mutation landscape of BCR-ABL1 in CML patients who respond poorly to TKI therapy. In a study by Cavelier et al. (69), an amplicon of approximately 1.5 kb was generated by PCR from BCR-ABL1 cDNA. SMRT sequencing of this amplicon detected TKI resistance mutations at frequencies as low as 1%, well below the 15–20% detection limit of Sanger sequencing. In addition, the method allows coexisting mutations to be phased, offering new insight into the clonal distribution of resistance mutations in BCR-ABL1, and it identified multiple splice isoforms. Beyond BCR-ABL1, several other cancer genes are suitable targets for clinical SMRT sequencing (Table 2). In a study of loss-of-function mutations in the tumor suppressor TP53, SMRT sequencing revealed that tumors from acute myeloblastic leukemia (AML) and myelodysplastic syndrome (MDS) patients harbor multiple TP53 mutations distributed across different alleles (70). Future studies could use detailed information about the subclonal heterogeneity of TP53 to guide treatment of these patients. Furthermore, detecting minor variants is also relevant to somatic mosaicism unrelated to cancer, such as the GJB2 mutations associated with the repair of skin lesions in keratitis-ichthyosis-deafness syndrome patients that were identified by Gudmundsson et al. (71).
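Co-phasing works because each long read reports all mutations present on one molecule, so reads can simply be grouped by the combination of mutations they carry. The sketch below assumes variant calling per read has already been done and that each read is represented by the set of mutation labels observed on it; the labels mutA and mutB, the read counts and the function name are placeholders for illustration.

```python
from collections import Counter


def haplotype_frequencies(reads):
    """Fraction of reads carrying each combination of mutations.

    reads: iterable of sets, each holding the mutation labels seen on one read
    (an empty set means no mutation was observed on that molecule).
    Returns a dict mapping a sorted tuple of labels -> fraction of reads.
    """
    reads = list(reads)
    if not reads:
        return {}
    counts = Counter(frozenset(read) for read in reads)
    total = len(reads)
    return {tuple(sorted(combo)): n / total for combo, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical sample: two mutations, mostly on separate molecules.
    reads = [set()] * 90 + [{"mutA"}] * 6 + [{"mutB"}] * 3 + [{"mutA", "mutB"}]
    for combo, fraction in sorted(haplotype_frequencies(reads).items()):
        print(combo, fraction)
```

In this made-up example most mutant molecules carry only one of the two mutations, which would point to separate clones rather than a single clone carrying both. Short reads that cover only one site at a time cannot make that distinction.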

Whole genome and transcriptome sequencing, discussed in later sections, are currently used mainly for research, but they are poised to move into diagnostics in the near future. Whole genome and transcriptome SMRT sequencing have already been applied to breast cancer cell models, identifying unique gene fusion events involving the well-known oncogene Her2 (Case Study: www.pacb.com/wp-content/uploads/Case-Study-Scientists-deconstruct-cancer-complexity-through-genome-and-transcriptome-analysis.pdf). Whole transcriptome sequencing of prostate cell models has likewise uncovered novel RLN1 and RLN2 gene fusions associated with prostate cancer (72). SMRT sequencing also provides better insight into the structure of cancer gene transcripts. For instance, Kohli et al. showed that AR-V9 contains a cryptic exon that had previously been assumed to be exclusive to AR-V7 (73). AR-V7 has long been regarded as a biomarker for drug resistance in prostate cancer; however, that view rests on knockdown experiments that in fact targeted both isoforms. Consequently, AR-V9 has emerged as a promising predictive biomarker for resistance.

Global epigenetic changes are also a hallmark of cancer. Single molecule real-time bisulfite sequencing (SMRT-BS) enables quantitative and highly multiplexed detection of methylation in 1.5–2 kb amplicons (74,75). This is an improvement over previous technologies, which could only target typical bisulfite PCR sizes (∼300–500 bp), and it potentially enables ∼91% of CpG islands in the human genome to be evaluated (75). To date this has been applied to multiple cancer cell lines, including those from acute myeloid leukemia, chronic myeloid leukemia, anaplastic large cell lymphoma, plasma cell leukemia, Burkitt lymphoma, B-cell lymphoma and multiple myeloma (75). Expanding to genome-wide diagnostics, when whole genome SMRT sequencing is performed on non-amplified material it is theoretically possible to determine epigenetic status across all nucleotides based on IPD ratios. We therefore envision that in the near future cancer genomes, transcriptomes and epigenomes will commonly be characterized at unprecedented resolution.
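Bisulfite-based quantification rests on a simple read-out: after bisulfite conversion and PCR, an unmethylated cytosine is read as T while a methylated cytosine is still read as C. The sketch below computes the methylation level per CpG as the fraction of reads showing C at that position. It assumes the reads come from the converted top strand and are already aligned to the unconverted reference without insertions or deletions, which real SMRT-BS pipelines handle through alignment; the function name and example sequences are invented.

```python
def cpg_methylation(reference: str, converted_reads):
    """Methylation level at each CpG of an amplicon from bisulfite reads.

    reference: unconverted amplicon sequence (top strand).
    converted_reads: reads aligned position by position to the reference.
    Returns a dict mapping 0-based CpG position -> fraction of reads with C.
    """
    cpg_sites = [
        i for i in range(len(reference) - 1) if reference[i:i + 2] == "CG"
    ]
    levels = {}
    for site in cpg_sites:
        # Only C (methylated) and T (unmethylated) calls are informative.
        calls = [
            read[site]
            for read in converted_reads
            if len(read) > site and read[site] in "CT"
        ]
        if calls:
            levels[site] = calls.count("C") / len(calls)
    return levels


if __name__ == "__main__":
    reference = "ACGTTCGA"  # CpG sites at positions 1 and 5
    reads = ["ACGTTTGA", "ACGTTTGA", "ATGTTCGA", "ACGTTTGA"]
    print(cpg_methylation(reference, reads))
```

Because a long amplicon places many CpG sites on the same read, the per-read pattern across sites is also available, not just the per-site average computed here.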

VIRAL AND MICROBIAL MEDICAL SEQUENCING

In infectious disease, SMRT sequencing has been used to analyse influenza viruses (76), hepatitis B viruses (HBV) (77), hepatitis C viruses (HCV) (77,78) and human immunodeficiency viruses (HIV) (79,80) (Table 3). HCV and HIV are RNA viruses with genomes of approximately 9 kb, while HBV is a circular DNA virus with a genome of about 3 kb. These viruses are suitable subjects for SMRT sequencing, since the entire viral genome can easily be contained in a single read. For example, Bull et al. (77) developed an assay in which the resulting reads covered nearly the entire sequence for all six major HCV genotypes. In addition to determining the genome sequence of the infecting virus, it is also possible to monitor mutations that develop as a result of drug treatment. For HCV, resistance associated variants (RAVs) in the NS5A gene occurring at a frequency of <0.5% were successfully identified in samples from patients undergoing treatment with direct acting antiviral drugs (DAAs) (78). By full-length sequencing of transcripts from the HIV-1 provirus, a 9700 bp molecule that encodes nine major proteins via alternative splicing, Ocwieja et al. (80) detected at least 109 different spliced RNAs, including two that encode new proteins. The fact that this relatively small study could generate so much novel information about HIV-1, a virus that has already been studied in great detail, demonstrates the advantage of full-length RNA sequencing for studying the distribution of splicing isoforms in specific genes. Results from these types of experiments could open up novel therapeutic opportunities in infectious disease.

Table 3.

Medically relevant microbial SMRT sequencing

Target/disease                                                 Ref.
Hepatitis B/C virus                                            (77,78)
HIV                                                            (79,80)
Influenza viruses                                              (76)
Tuberculosis bacteria                                          (85)
E. coli / Hemolytic–Uremic Syndrome                            (86)
Salmonella enterica subsp. enterica serovar / gastroenteritis  (87)
Leishmania                                                     (88)
Leptospira interrogans / leptospirosis                         (90)
Helicobacter pylori strains / gastrointestinal diseases        (91)

For bacteria, a single SMRT Cell often provides enough data to de novo assemble Escherichia coli-sized genomes into single contigs. HGAP is the most widely used assembler and works by taking a selection of the longest reads and error correcting them with all reads, followed by Celera assembly (81,82), and finalized by polishing with all reads aligned to the final assembly (34). These long reads and new algorithms enable PacBio assemblies to be more complete and accurate than those from second-generation sequencing methods (83,84). Clinically relevant bacterial assemblies include a strain of the tuberculosis bacterium Mycobacterium tuberculosis (85), the E. coli strain that caused a Hemolytic–Uremic Syndrome outbreak in Germany in 2011 (86), and strains of Salmonella enterica subsp. enterica serovar that cause gastroenteritis in humans (87) (Table 3). PacBio sequencing and HGAP have also been used to assemble genomes of pathogenic single-cell eukaryotes that are more complex than a single chromosome, such as a new reference genome for Leishmania (88), a protozoan parasite that kills >30 000 people each year.
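The first HGAP step described above, taking a selection of the longest reads, can be sketched as choosing reads from longest to shortest until they add up to a chosen depth over the genome. The sketch below illustrates only that selection rule; the target of 30-fold seed coverage is an assumed default for the example and is not taken from the source, and the function name is hypothetical.

```python
def select_seed_reads(read_lengths, genome_size, seed_coverage=30.0):
    """Pick the longest reads until they reach a target coverage of the genome.

    read_lengths: iterable of read lengths in bp.
    genome_size: estimated genome size in bp.
    seed_coverage: desired fold coverage from seed reads (assumed value).
    Returns the selected lengths, longest first; the last one is the cutoff.
    """
    target_bases = genome_size * seed_coverage
    seeds, total = [], 0
    for length in sorted(read_lengths, reverse=True):
        if total >= target_bases:
            break
        seeds.append(length)
        total += length
    return seeds


if __name__ == "__main__":
    # Toy numbers: a 1 bp "genome" keeps the arithmetic easy to follow.
    seeds = select_seed_reads([2, 10, 6, 4, 8], genome_size=1, seed_coverage=20)
    print(seeds)
```

The remaining, shorter reads are not discarded in the workflow described in the text; they are used to error correct the selected seed reads before assembly.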

Though long reads enable superb microbial assemblies, what truly differentiates SMRT sequencing from second-generation machines is its capacity to assess epigenetic modifications directly. DNA methylation is widespread in bacterial genomes (89), making these organisms natural candidates for epigenetic analysis by SMRT sequencing. IPD ratios can be computed by comparing a case with a control sample or with an in silico reference, and modifications are called from the known signatures for 6-mA, 4-mC and (Tet-converted) 5-mC. This method has been used to differentiate virulent from avirulent Leptospira interrogans (90), a pathogen causing leptospirosis in humans. Differences between the strains were minimal, but the avirulent strains showed higher methylation levels (90). The approach has also helped identify virulence factor genotype-dependent motifs across eight distinct H. pylori strains, which are linked to gastrointestinal diseases (91). The simplicity of sequencing, assembling and calling nucleotide-level variation for a complete genome from a single SMRT Cell highlights the transformative potential of this technology in microbiology.

FUTURE: WHOLE TRANSCRIPTOME AND GENOME SEQUENCING

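Traditionally, RNA is converted to cDNA and then fragmented for short-read sequencing (RNA-seq). Assembling the exons detected by RNA-seq into individual transcripts is extremely difficult and error-prone. SMRT sequencing removes the need for fragmentation by sequencing cDNAs from the 5' end through the poly-A tail, an approach named Iso-Seq. Compared with second-generation sequencing platforms, this approach is highly efficient (94). Although Iso-Seq excels at resolving transcript structure, its lower throughput has limited its use for gene expression analysis; as costs fall and throughput rises, this unbiased method for detecting both gene expression and transcripts is expected to become routine (94). Analogous to the phasing of genetic variants, Iso-Seq can also help determine which transcript isoforms carry a single nucleotide variant at a given locus (94). Iso-Seq is thus an important, highly accurate tool for studying transcriptome structure whose wider use is currently limited by throughput, and it and related methods should play an important role in more fields as sequencing technology advances.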

Whole genome sequencing (WGS) has become a widely used method to study variation in the human genome, and several hundred thousand human genomes have been sequenced with short reads during the last few years. However, the nature of these reads permits only assemblies with relatively small contigs, and alignments provide only limited information on variation other than SNPs and small insertions/deletions. SMRT sequencing is greatly expanding the utility of WGS, permitting substantially greater assembly completeness (93,95) (BioRxiv: https://doi.org/10.1101/067447), even nearing reference genome contig sizes, and enabling diploid-aware assemblies with algorithms like FALCON-unzip (37). These PacBio WGS studies also reveal a vast repertoire of variation missed by short-read WGS. Low-coverage (4–8×) sequencing was recently used to characterize structural variation in chromothripsis-like chromosomes (96) and to identify a pathogenic heterozygous 2184 bp deletion in a patient presenting with Carney complex that could not be identified by short-read sequencing (97). Higher-coverage sequencing (∼60×) of two haploid genomes has also been used to identify a vast array of structural variations (461 553, from 2 bp to 28 kb in length), >89% of which were missed in the analysis of data from the 1000 Genomes Project (98). From this study, Huddleston et al. (98) estimate a 5× increase in the discovery of indels >7 bp and of additional SVs <1 kb, which in total base pairs account for a majority of the difference between genomes. Another remarkable finding from individual human de novo assemblies is that there appear to be several megabases of novel sequence, i.e. sequence that is absent from the current (GRCh38) version of the human reference. For example, Shi et al. (93) reported 12.8 Mb of novel sequence in their de novo assembled individual genome, which corresponds to over 0.4% of the entire human genome of ∼3 Gb. At this point, it is not known whether this novel sequence is common to all human individuals (and thereby missing from GRCh38) or whether it mainly represents sequence variation found only in specific individuals or population groups. Overall, these WGS studies demonstrate that long-read sequencing can identify a substantial amount of variation missed by short-read platforms, including variants relevant to clinical diagnosis.

CONCLUSIONS

The persistent misconception that SMRT sequencing is too error-prone to be practical for diagnostic use is being challenged by mounting evidence of its distinct advantages over short-read sequencing technologies. SMRT sequencing is bridging the gap between complex genetic analysis and accessible diagnostics, offering capabilities such as determining the length of repetitive elements, insertions or deletions (indels), and even epigenetic marks within a single test at base-pair resolution. Although long-read sequencing is already considered the benchmark for certain applications, such as HLA genotyping in organ transplantation, concerns about its widespread implementation persist because of high costs and limited technical expertise. However, ongoing price reductions, growing customer adoption, and emerging single-molecule technologies are expected to drive innovation in this field. Just as second-generation sequencing platforms moved beyond Sanger sequencing and ushered in a revolution in genomic medicine, third-generation single-molecule sequencing systems are poised to become the next major breakthrough in genetic diagnostics.
