Biostatistics and Biometrics Open Access Journal

A Brief History of Sequence Logos

Kushal K Dey*

Department of Statistics, University of Chicago, USA

Submission: February 12, 2018; Published: April 25, 2018

*Corresponding author: Kushal K Dey, Department of Statistics, University of Chicago, 5747 S Ellis Ave, Chicago, IL 60637, USA. Tel: 312-7090680; Email: kkdey@uchicago.edu

How to cite this article: Kushal K Dey. A Brief History of Sequence Logos. Biostat Biometrics Open Acc J. 2018; 6(3): 555690. DOI: 10.19080/BBOAJ.2018.06.555690

Abstract

For nearly three decades, sequence logo plots have served as the standard tool for graphical representation of aligned DNA, RNA and protein sequences. Over the years, a large number of packages and web applications have been developed for generating these logo plots and using them to identify conserved patterns in sequences called motifs. Also, over time, we have seen a considerable upgrade in the look, flexibility of data handling and the overall scope of these plots in biological applications and beyond. Here I attempt to review some popular tools for generating sequence logos, with a focus on how these plots have evolved over time since their origin and how I view the future for these plots.

Keywords: Graphical representation; Sequence logo plots; Standard tool; Motifs; Biological applications; Flexibility of data; DNA sequence data; RNA sequence data; Visualize succinctly; Predictive power; Initial attempts; Widespread; Stylistic configurations; Multiple sequence alignment; Python library; Interdependencies; PLogo; Depletion of symbols; Alphanumeric strings; Visualizes pairwise; Oligonucleotide

Introduction

The seeds of sequence logos were planted in the early 1980s, when researchers, equipped with large amounts of DNA and RNA sequence data, developed a keen interest in understanding how base (A, C, G and T/U) composition at different positions of aligned sequences, together with the compositional patterns at neighboring positions, contributes to structural and functional properties. For example, a protein or other macromolecule tends to bind at a site in the DNA that has a characteristic pattern. Initial attempts to determine this pattern focused on building a consensus sequence from many aligned sequences; however, this approach was criticized for its lack of predictive power [1,2]. In 1986, Schneider et al. [3] proposed measures to estimate the amount of information at each site of the sequence. An obvious next step was to visualize succinctly this flow of information along the sequence, together with the proportional contribution of the different bases to the information content at each position. This resulted in the origin of sequence logo plots in 1990 in a seminal work by Schneider & Stephens [4].

Despite their origin in 1990, the current widespread use of sequence logos is largely the result of software packages (in R/Python) and web applications developed since the mid to late 2000s, which have made it extremely simple to generate these logos. A user currently has access to a plethora of open source tools for generating sequence logos for DNA, RNA and protein amino acid sequences (Figure 1), using different model assumptions and stylistic configurations for the plots. These tools also make it easy to identify conserved patterns in the sequences, called motifs, and to perform downstream motif based comparisons and predictions. In the next section, I review the modeling frameworks and functionalities of some of these tools [5].

Discussion

Modelling Framework

Consider a set of n aligned sequences, with each element of a sequence drawn from a set of J symbols. Suppose each aligned sequence is L symbols long. Then the information content at each position l = 1, 2, ..., L is defined as

\[
IC(l) = \log_2 J + \sum_{j=1}^{J} p_{lj} \log_2 p_{lj} - e \tag{1}
\]

where p_lj is the relative frequency of symbol j at position l of the aligned sequences and e is a correction term adjusting for the number of samples. The value of J equals 4 for DNA (RNA) sequences, corresponding to the 4 bases A, C, G, T/U, and equals 20 for protein sequences, corresponding to the 20 amino acids. The height of the stack of symbols at position l equals the information content IC(l), and the proportional height of each symbol j in the stack equals p_lj. If only the positional weights p_lj are available instead of the sequence data, then the correction term e is removed from the expression of IC(l) in (1). When prior probabilities q_j of the symbols are known, IC(l) is replaced by KL(l; q), given below, for determining stack heights:

\[
KL(l; q) = \sum_{j=1}^{J} p_{lj} \log_2 \frac{p_{lj}}{q_j} \tag{2}
\]
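As a minimal sketch of the computation described above, the following Python code derives positional frequencies from a toy alignment and computes stack heights under two assumptions: that the information content takes the standard form log2(J) plus the sum of p log2(p) minus a correction e, and that the prior-adjusted height is the Kullback-Leibler divergence of the positional frequencies from the prior q. All function names and the toy sequences are hypothetical and do not come from any of the packages reviewed here. The helper small_sample_correction uses the commonly quoted approximation (J - 1) / (2 ln(2) n), which is an assumption on my part since the text does not specify the form of e.

```python
import math
from collections import Counter


def position_frequencies(sequences, alphabet):
    """Return one dict per alignment column mapping symbol -> relative frequency."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must all have the same aligned length")
    freqs = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        freqs.append({sym: counts[sym] / len(sequences) for sym in alphabet})
    return freqs


def small_sample_correction(num_symbols, num_sequences):
    """Assumed form of e: a common approximation, not specified in the text."""
    return (num_symbols - 1) / (2 * math.log(2) * num_sequences)


def information_content(p, correction=0.0):
    """log2(J) + sum_j p_j log2(p_j) - e, treating 0 * log2(0) as 0."""
    entropy_term = sum(v * math.log2(v) for v in p.values() if v > 0)
    return math.log2(len(p)) + entropy_term - correction


def kl_height(p, q):
    """Stack height when prior probabilities q are known: sum_j p_j log2(p_j / q_j)."""
    return sum(v * math.log2(v / q[sym]) for sym, v in p.items() if v > 0)


def symbol_heights(p, stack_height):
    """Split the stack height among symbols in proportion to their frequency."""
    return {sym: v * stack_height for sym, v in p.items()}


# Toy alignment (hypothetical data): 4 DNA sequences of length 4.
seqs = ["ACGT", "ACGA", "ACTA", "AGTC"]
alphabet = "ACGT"
freqs = position_frequencies(seqs, alphabet)

# Information content per position, without the sample-size correction.
ic = [information_content(p) for p in freqs]

# With a uniform prior, the KL-based height matches the uncorrected IC.
uniform = {sym: 1 / len(alphabet) for sym in alphabet}
kl = [kl_height(p, uniform) for p in freqs]

# Per-symbol heights for the last column (A: 0.5, C: 0.25, T: 0.25).
heights_last = symbol_heights(freqs[3], ic[3])
```

In this toy alignment the first column is perfectly conserved and so reaches the maximum of 2 bits, while the last column, with frequencies 0.5, 0.25 and 0.25, carries 0.5 bits. Passing correction=small_sample_correction(4, len(seqs)) lowers every stack by the same amount; with so few sequences the corrected value can even become negative, which tools typically handle in their own way.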

Software review

The first breakthrough in terms of software for creating logo plots took place in 2004 with WebLogo [6], which was designed to generate logo plots from multiple sequence alignments of DNA, RNA and proteins following the same model as (1). See Figure 1c for an example protein logo plot. WebLogo can be run from the command line, as a web application or as part of a Python library. It also now has an R package interface called RWebLogo [7]. This software is currently in its 3rd version, is still very actively maintained and has already been cited more than 6000 times.

The package that introduced R users to logo plots was seqLogo [5]. Nearly 11 years old, this package has already been downloaded more than 100,000 times. seqLogo assumes the input to be the position weight probabilities of bases obtained from aligning DNA sequences and follows the model assumptions in (1) without the correction factor. See Figure 1a for an example seqLogo plot. The web application Seq2Logo [8] improved on seqLogo in its flexibility to handle sequence sets and position specific scoring matrices for DNA, RNA and protein sequences. Seq2Logo also allows for non-uniform prior probabilities for the symbols and proposes other alternatives besides (1) and (2) to determine stack heights.

Several packages have suggested improvements to the model in (1). The k-mer probability logo (kpLogo) uses positional interdependencies of aligned sequences to visualize conserved patterns (motifs) [9]. dagLogo can visualize conserved amino acid sequence patterns in groups defined by charge, hydrophobicity, etc. [10]. The web servers pLogo and iceLogo use probability based methods to determine the stack heights [11,12]. The R package Logolas uses median adjustment of log positional weights to highlight both enrichment and depletion of symbols at each position in the logo plot [13].
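To make the idea of median adjustment concrete, here is a deliberately simplified Python sketch. It takes the positional weights at a single position, converts them to log2 scale and subtracts the median, so that symbols above the median receive positive (enriched) scores and symbols below it receive negative (depleted) scores. This is an illustration only, not the Logolas implementation or its API: the function name, the pseudocount used to avoid taking the log of zero, and the example weights are all assumptions of mine.

```python
import math
from statistics import median


def median_adjusted_scores(p, pseudocount=1e-3):
    """Center log2 positional weights at their median (illustrative sketch only).

    p maps each symbol to its positional weight at one position.
    The pseudocount (an assumed value) keeps log2 finite when a weight is 0.
    """
    total = sum(v + pseudocount for v in p.values())
    logs = {sym: math.log2((v + pseudocount) / total) for sym, v in p.items()}
    center = median(logs.values())
    return {sym: value - center for sym, value in logs.items()}


# One strongly enriched symbol: A scores positive, the rest sit at the median.
enriched = median_adjusted_scores({"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1})

# One strongly depleted symbol: T scores negative, the rest sit at the median.
depleted = median_adjusted_scores({"A": 0.32, "C": 0.32, "G": 0.32, "T": 0.04})
```

Unlike the information content in (1), which is non-negative before correction and only shows which symbols dominate a position, a median-centered score can be negative, which is what allows a plot to draw depleted symbols below the axis.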

Besides the modeling framework, in recent times we have also observed upgrades in the plotting styles, flexibility and scope of these sequence logos. Some packages, such as ggseqlogo and gglogo, have integrated ggplot2 graphics with logo plots, thereby generating publication ready logo plots (Figures 1b & 1d) [14,15]. While the initial packages were restricted to using either A, C, G, T or letters of the English alphabet as symbols in a logo plot, ggseqlogo has the flexibility to use numbers and Greek letters as symbols as well. The R package Logolas even allows one to plot alphanumeric strings as symbols, thereby extending the scope of logo plots beyond DNA, RNA and protein sequences to more general compositional data (Figures 1e & 1f).

One primary application of logo plots is in detecting motifs in a set of aligned sequences and comparing these motifs across many such sets. Several packages have been developed in this context. The motifStack R package uses a distance metric based on STAMP to calculate the distance between motifs and visualizes the alignment of multiple motifs on a tree or circle based on the similarity scores [16,17]. The R package DiffLogo visualizes pairwise differences in motifs corresponding to multiple sequence logos, with an intuitive visualization of the difference patterns [18]. The web application Two Sample Logo uses statistical tests to compare two aligned sets of sequences and visualize the differences [19]. The R package motifcounter can match a sequence against previously known motifs and then test for enrichment in the number of motif hits, based on match scores, relative to random DNA sequences [20].

Conclusion

First conceptualized nearly 30 years ago, sequence logo plots have now become the standard for visualizing patterns in aligned oligonucleotide and protein sequences. With the growing size of multiple sequence alignment data generated through wet lab experiments (HT-SELEX, ChIP-seq, ChIP-chip) from various model organisms (human, mouse, plants, fungi), the need for sequence logo plots to visualize and interpret motif patterns keeps increasing. In fact, several current databases that host multiple alignment related data for different transcription factors and proteins, for example JASPAR, ENCODE-motifs and HOCOMOCO, also display sequence logo plots alongside the data [21-25].

In the near future, I see the applications of sequence logos extending to the identification and visualization of DNA or protein structural motifs, besides the usual sequence motifs. While the broad underlying structure of sequence logo plots is expected to stand the test of time, there is ample scope to improve the modeling framework to highlight motifs in a better way. Also, with so many motifs reported by various studies for the same transcription factor, I see extensive work in the near future on grouping such motifs to build consensus sequence logos [26,27]. One thing is for certain: sequence logos are here to stay!