Example Candidate gene list:

To demonstrate a potential use case of the user candidate gene analysis, we gathered a set of fifty stress genes that were differentially expressed between the control and salt stress samples and used them to identify characteristics shared among salt stress genes.

Please read the explanations below for details on interpreting the tables and plots.

Table column interpretation:

1. kmer

The basic k-mer is the simplest way to represent DNA: each sequence is represented by the occurrence frequencies of k neighboring nucleic acids.
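As an illustration of the k-mer representation described above, the following minimal Python sketch counts overlapping k-mers in a sequence. The function name kmer_counts and the example sequence are hypothetical and are not taken from the tool's own pipeline, which may handle ambiguous bases or normalization differently.

```python
from itertools import product


def kmer_counts(sequence, k):
    """Count overlapping k-mers; returns a dict over all 4**k possible k-mers."""
    sequence = sequence.upper()
    # Start every possible k-mer at zero so the vector has a fixed length.
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # windows containing N or other symbols are skipped
            counts[kmer] += 1
    return counts


counts = kmer_counts("ACGTACGT", 2)
print(counts["AC"], counts["CG"], counts["GT"], counts["TA"])  # 2 2 2 1
```

Because every possible k-mer gets an entry, sequences of different lengths map to vectors of the same length (16 values for k = 2, 64 for k = 3).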

2. Autocorrelation

Autocorrelation, one of the multivariate modeling tools, transforms DNA sequences of different lengths into fixed-length vectors by measuring the correlation between any two properties. It yields two kinds of variables: autocorrelation (AC) between the same property, and cross-covariance (CC) between two different properties.

2.1. Dinucleotide-based auto covariance (DAC)

2.2. Dinucleotide-based cross covariance (DCC)

2.3. Dinucleotide-based auto-cross covariance (DACC)

2.4. Trinucleotide-based auto covariance (TAC)

2.5. Trinucleotide-based cross covariance (TCC)

2.6. Trinucleotide-based auto-cross covariance (TACC)
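To make the idea of auto covariance concrete, here is a minimal Python sketch of a dinucleotide-based auto covariance for a single property and a single lag. It follows the commonly used definition (the average product of mean-centered property values of dinucleotides that are `lag` positions apart). The property table TOY_PROPERTY contains made-up values for illustration only; real analyses use published physicochemical indices, and the tool's exact implementation may differ.

```python
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]
# Hypothetical property values (AA=0.0, AC=1.0, ..., TT=15.0); NOT real indices.
TOY_PROPERTY = {d: float(i) for i, d in enumerate(DINUCLEOTIDES)}


def dinucleotide_auto_covariance(sequence, prop, lag):
    """Auto covariance of one dinucleotide property at the given lag.

    Assumes the sequence contains only A, C, G and T.
    """
    sequence = sequence.upper()
    # Property value of each overlapping dinucleotide (L - 1 values).
    values = [prop[sequence[i:i + 2]] for i in range(len(sequence) - 1)]
    n = len(values)
    if lag < 1 or lag >= n:
        raise ValueError("lag must be between 1 and the number of dinucleotides - 1")
    mean = sum(values) / n
    # Products of centered values for dinucleotides `lag` positions apart.
    terms = [(values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)]
    return sum(terms) / len(terms)


print(dinucleotide_auto_covariance("ACAC", TOY_PROPERTY, lag=1))  # -2.0
```

Computing this value for each property and each lag up to a chosen maximum gives a vector whose length does not depend on the sequence length. The cross covariance variants use two different properties in the product instead of one.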


3. Pseudo nucleic acid composition

Pseudo nucleic acid composition (PseNAC) is a powerful approach to representing DNA sequences that considers both local sequence-order information and long-range (global) sequence-order effects.

3.1. pseudo dinucleotide composition (PseDNC)

3.2. pseudo k-tuple nucleotide composition (PseKNC)
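The sketch below illustrates how a pseudo dinucleotide composition vector can combine local and long-range information, following the commonly published PseDNC definition: 16 normalized dinucleotide frequencies (local) plus lam correlation factors (long-range), scaled by a weight. The property table TOY_PROPERTIES and the parameter values lam=2 and weight=0.05 are assumptions chosen for illustration; they are not the tool's settings, and real analyses use standardized physicochemical indices.

```python
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]
# Hypothetical table: two made-up property values per dinucleotide; NOT real indices.
TOY_PROPERTIES = {d: (i / 15.0, (i % 4) / 3.0) for i, d in enumerate(DINUCLEOTIDES)}


def pse_dnc(sequence, props, lam=2, weight=0.05):
    """Return a (16 + lam)-dimensional PseDNC-style vector.

    Assumes the sequence contains only A, C, G and T.
    """
    sequence = sequence.upper()
    dinucs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    n = len(dinucs)
    if lam < 1 or lam >= n:
        raise ValueError("lam must be between 1 and the number of dinucleotides - 1")
    # Local information: normalized dinucleotide frequencies.
    freqs = [dinucs.count(d) / n for d in DINUCLEOTIDES]
    # Long-range information: mean squared property difference between
    # dinucleotides that are j positions apart, for j = 1..lam.
    thetas = []
    for j in range(1, lam + 1):
        total = 0.0
        for i in range(n - j):
            a, b = props[dinucs[i]], props[dinucs[i + j]]
            total += sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
        thetas.append(total / (n - j))
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


vector = pse_dnc("ACGTACGTAC", TOY_PROPERTIES, lam=2, weight=0.05)
print(len(vector))  # 18
```

The first 16 entries carry the composition and the remaining lam entries carry the sequence-order correlation; the shared denominator makes the whole vector sum to 1.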


4. Labels

The labels are the classes or groups that the genes are mapped into. A label can act as either a target variable or a feature, depending on the problem the user wants to solve.

4.1 No Label

This selection enables users to view the properties of all genes without labeling them into different gene categories or annotations, so that they can examine the features of multiple genes and identify common patterns among them. Because it involves inspecting all the genes, this option works only with the "Submit for analysis" button.

4.2 Classical Genes

Classical genes can be defined as the most well-studied genes, known mainly for their visible mutant phenotypes (for example, liguleless3).

4.3 Pan-genome Genes

A gene in a given taxonomic group is either present in every individual (core), or absent in at least a single individual (dispensable).
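The core/dispensable definition above can be sketched in a few lines of Python. The presence/absence table, the gene names and the function classify_pan_genome are hypothetical and serve only to illustrate the rule; they do not reflect the tool's actual data.

```python
def classify_pan_genome(presence):
    """Map each gene to 'core' or 'dispensable'.

    presence: dict of gene name -> list of booleans, one per individual,
    where True means the gene is present in that individual.
    """
    return {
        gene: "core" if all(calls) else "dispensable"
        for gene, calls in presence.items()
    }


# Hypothetical presence/absence calls across three individuals.
example = {
    "geneA": [True, True, True],   # present in every individual
    "geneB": [True, False, True],  # absent in at least one individual
}
print(classify_pan_genome(example))
# {'geneA': 'core', 'geneB': 'dispensable'}
```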

4.4 Origin Genes

Gene duplication is an important evolutionary mechanism that provides new genetic material and thus opportunities for an organism to acquire new gene functions. Duplications have different origins, such as whole-genome duplications, tandem duplications, etc.


Graph interpretations:

At the top right corner of each plot/graph, there are options to download the plot, zoom in/zoom out, reset axes, autoscale, toggle spike lines, show closest data on hover, compare data on hover, box select, pan and lasso select. Users can also click specific legend entries to view data only for the selected entries. Details on the interactive plot options are available here:
Interactive graph features

1. Marginal Plot

The marginal plots are box plots showing the frequency distribution of the selected gene feature, with the candidate genes highlighted. This plot enables users to easily identify where their candidate genes lie among the other maize genes for the selected feature.

2. Count Dinucleic box plot

The count Dinucleic box plot describes the number of dinucleotides within a DNA sequence. The plot helps to visualize the counts of the occurrences of each possible di-mer for the chosen label sequences.

3. Count Tri nucleic box plot

The count Tri nucleic box plot describes the number of trinucleotides within a DNA sequence. The plot helps to visualize the counts of the occurrences of each possible tri-mer for the chosen label sequences.

4. Frequency Dinucleic box plot

The Frequency Dinucleic box plot describes the fraction of each dinucleotide type within a DNA sequence. The fractions are calculated as:
f(r,s) = N_rs / N
where N_rs is the number of dinucleotides formed by nucleotide type r followed by type s, and N is the length of the chosen labeled sequence.

5. Frequency Tri nucleic box plot

The Frequency Tri nucleic box plot describes the fraction of each trinucleotide type within a DNA sequence. The fractions are calculated as:
f(r,s,h) = N_rsh / N
where N_rsh is the number of trinucleotides formed by nucleotide types r, s and h in that order, and N is the length of the chosen labeled sequence.
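As a worked example of the fraction formulas above, the following minimal Python sketch counts overlapping occurrences of one dinucleotide or trinucleotide and divides by the sequence length N, as stated on this page. The function name kmer_fraction and the example sequence are hypothetical. Note that some definitions divide by the number of overlapping windows rather than by N, so values from other tools may differ slightly.

```python
def kmer_fraction(sequence, kmer):
    """Fraction of one k-mer type: overlapping occurrence count / sequence length."""
    sequence = sequence.upper()
    kmer = kmer.upper()
    k = len(kmer)
    count = sum(
        1 for i in range(len(sequence) - k + 1) if sequence[i:i + k] == kmer
    )
    return count / len(sequence)


seq = "ACGTACGT"  # N = 8
print(kmer_fraction(seq, "AC"))   # 2 occurrences / 8 = 0.25
print(kmer_fraction(seq, "ACG"))  # 2 occurrences / 8 = 0.25
```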

For further details on these features, please see the pipeline below:

All the features are generated using the R package rDNAse (Generating Various Numerical Representation Schemes of DNA Sequences) and custom Python scripts.