Example Candidate gene list:

To demonstrate a potential use case of the user candidate gene analysis, we gathered a set of fifty stress genes that were differentially expressed between the control and salt stress samples and used them to identify distinctive characteristics shared among salt stress genes.

See the explanations below for details on interpreting the tables and plots.

Table column interpretation:

1. Monopeptide (Amino Acid Composition, AAC)

The amino acid composition describes the fraction of each amino acid type within a protein sequence. The fractions of all 20 natural amino acids are calculated as:
f(r) = Nr / N,   r = 1, 2, …, 20
where Nr is the number of amino acids of type r and N is the length of the sequence.
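
As a minimal sketch of the formula above, the fractions can be computed in a few lines of plain Python. This is not the protr implementation; the function name amino_acid_composition and the example peptide are illustrative assumptions.

```python
# The 20 natural amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return f(r) = Nr / N for each of the 20 natural amino acids."""
    sequence = sequence.upper()
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must not be empty")
    return {aa: sequence.count(aa) / n for aa in AMINO_ACIDS}


# Hypothetical six-residue peptide, used only for illustration.
aac = amino_acid_composition("MKKLAA")
print(aac["K"], aac["A"], aac["W"])
```

The result always has 20 entries, one per amino acid type, so sequences of different lengths become directly comparable.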

2. Dipeptide Composition Descriptor (DC)

Dipeptide composition gives a 400-dimensional descriptor. The fractions of all 400 dipeptides are calculated as:
f(r,s) = Nrs / (N - 1),   r, s = 1, 2, …, 20
where Nrs is the number of dipeptides formed by amino acid type r followed by type s, and N is the length of the sequence.
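
The dipeptide fractions can be sketched the same way by sliding a window of width two along the sequence, which yields N - 1 windows. This is a plain Python illustration, not protr's code; the function name dipeptide_composition and the example peptide are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dipeptide_composition(sequence):
    """Return f(r, s) = Nrs / (N - 1) for all 400 ordered amino acid pairs."""
    sequence = sequence.upper()
    n = len(sequence)
    if n < 2:
        raise ValueError("sequence needs at least two residues")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    # A sequence of length N contains N - 1 overlapping dipeptides.
    for i in range(n - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # windows with non-standard letters are skipped
            counts[pair] += 1
    return {pair: c / (n - 1) for pair, c in counts.items()}


# Hypothetical peptide: its dipeptides are MK, KK, KL, LA, AA.
dc = dipeptide_composition("MKKLAA")
print(dc["KK"])
```

Note that the pairs are ordered, so "KL" and "LK" are counted separately.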

3. Tripeptide Composition Descriptor (TC)

Tripeptide composition gives an 8000-dimensional descriptor. The fractions of all 8000 tripeptides are calculated as:
f(r,s,t) = Nrst / (N - 2),   r, s, t = 1, 2, …, 20
where Nrst is the number of tripeptides formed by amino acid types r, s, and t in that order, and N is the length of the sequence.
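
Because most of the 8000 tripeptides never occur in a single protein, a sketch can store only the nonzero fractions. The version below is an illustrative assumption (not protr's code) and expects a sequence made of the 20 standard one-letter codes.

```python
from collections import Counter


def tripeptide_composition(sequence):
    """Return the nonzero f(r, s, t) = Nrst / (N - 2) values as a dict."""
    sequence = sequence.upper()
    n = len(sequence)
    if n < 3:
        raise ValueError("sequence needs at least three residues")
    # A sequence of length N contains N - 2 overlapping tripeptides.
    counts = Counter(sequence[i:i + 3] for i in range(n - 2))
    return {tri: c / (n - 2) for tri, c in counts.items()}


# Hypothetical peptide in which the tripeptide MKK occurs twice.
tc = tripeptide_composition("MKKLAAMKK")
print(tc["MKK"])
```

Tripeptides that are absent from the returned dict have a fraction of zero; tc.get("AAA", 0.0) recovers that value.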

4. Autocorrelation Descriptors

Autocorrelation descriptors are defined based on the distribution of amino acid properties along the sequence. The amino acid properties used here are various types of amino acid indices. The three types of autocorrelation descriptors are:

4.1. Normalized Moreau-Broto autocorrelation descriptors (Monreau)

4.2. Moran autocorrelation descriptors (Moran)

4.3. Geary autocorrelation descriptors (Geary)
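
To make the three descriptors concrete, the sketch below implements them as they are commonly defined in the literature, for a list p of per-residue property values and a lag d. It is a plain Python illustration rather than protr's code, it assumes the property values have already been looked up (and standardized, if desired) for each residue, and the input values are made up.

```python
def _check(p, lag):
    if not 1 <= lag < len(p):
        raise ValueError("lag must satisfy 1 <= lag < len(p)")


def moreau_broto(p, lag):
    """Normalized Moreau-Broto: mean of P_i * P_(i+d) over the N - d pairs."""
    _check(p, lag)
    n = len(p)
    return sum(p[i] * p[i + lag] for i in range(n - lag)) / (n - lag)


def moran(p, lag):
    """Moran: lagged covariance divided by the variance (denominator N)."""
    _check(p, lag)
    n = len(p)
    mean = sum(p) / n
    num = sum((p[i] - mean) * (p[i + lag] - mean) for i in range(n - lag)) / (n - lag)
    den = sum((x - mean) ** 2 for x in p) / n
    return num / den


def geary(p, lag):
    """Geary: mean squared lagged difference divided by twice the variance (denominator N - 1)."""
    _check(p, lag)
    n = len(p)
    mean = sum(p) / n
    num = sum((p[i] - p[i + lag]) ** 2 for i in range(n - lag)) / (2 * (n - lag))
    den = sum((x - mean) ** 2 for x in p) / (n - 1)
    return num / den


# Made-up property values that alternate along the sequence.
values = [1.0, -1.0, 1.0, -1.0]
print(moreau_broto(values, 1), moran(values, 1), geary(values, 1))
```

For this alternating input, neighbouring residues are perfectly anti-correlated, so the Moreau-Broto and Moran values at lag 1 are negative while the Geary value is above 1.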


5. Composition/Transition/Distribution

The amino acids are categorized into three classes according to a given attribute, and each amino acid is encoded by one of the indices 1, 2, or 3 according to the class it belongs to. The attributes used here include hydrophobicity, normalized van der Waals volume, polarity, and polarizability. Three types of descriptors, Composition (C), Transition (T), and Distribution (D), can be calculated for a given attribute:

5.1. Composition (CTDC)

5.2. Transition (CTDT)

5.3. Distribution (CTDD)
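
The encoding step and the first two descriptor types can be sketched as follows. The sketch assumes the commonly used three-class hydrophobicity grouping (polar RKEDQN, neutral GASTPHY, hydrophobic CLVIMFW), covers composition and transition only, and uses illustrative function names; it is not protr's code.

```python
# Assumed hydrophobicity grouping: 1 = polar, 2 = neutral, 3 = hydrophobic.
HYDROPHOBICITY_CLASSES = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}


def encode(sequence, classes=HYDROPHOBICITY_CLASSES):
    """Replace every residue by the index (1, 2 or 3) of its class."""
    lookup = {aa: label for label, members in classes.items() for aa in members}
    return "".join(lookup[aa] for aa in sequence.upper())


def composition(encoded):
    """Fraction of residues that fall into each class."""
    n = len(encoded)
    return {label: encoded.count(label) / n for label in "123"}


def transition(encoded):
    """Fraction of the N - 1 adjacent pairs that switch between two classes."""
    n = len(encoded)
    pairs = [encoded[i:i + 2] for i in range(n - 1)]
    result = {}
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        hits = sum(1 for p in pairs if p in (a + b, b + a))
        result[a + b] = hits / (n - 1)
    return result


# Hypothetical peptide; M, K, K, L, A, A encodes to "311322".
encoded = encode("MKKLAA")
print(encoded, composition(encoded), transition(encoded))
```

Transitions are counted in both directions, so the key "13" covers both 1-to-3 and 3-to-1 changes.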


6. Conjoint Triad Descriptors

Conjoint triad descriptors were proposed by Shen et al. (2007) to model protein-protein interactions based on a classification of amino acids. In this approach, each protein sequence is represented by a vector space consisting of descriptors of amino acids.
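
A minimal sketch of the idea, assuming the seven-class grouping and the (count - min) / max normalization described by Shen et al.: every residue is replaced by its class, and the occurrences of each of the 7 x 7 x 7 = 343 class triads are counted. The function name and example peptide are illustrative, and the code expects only the 20 standard one-letter codes.

```python
from itertools import product

# Assumed seven-class grouping of the 20 amino acids.
TRIAD_CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]


def conjoint_triad(sequence):
    """Return the 343 normalized triad frequencies, keyed by class triplet."""
    lookup = {
        aa: str(index)
        for index, members in enumerate(TRIAD_CLASSES, start=1)
        for aa in members
    }
    encoded = "".join(lookup[aa] for aa in sequence.upper())
    counts = {"".join(t): 0 for t in product("1234567", repeat=3)}
    # Count every window of three consecutive class labels.
    for i in range(len(encoded) - 2):
        counts[encoded[i:i + 3]] += 1
    lowest, highest = min(counts.values()), max(counts.values())
    if highest == 0:
        raise ValueError("sequence needs at least three residues")
    return {triad: (c - lowest) / highest for triad, c in counts.items()}


# Hypothetical peptide; it encodes to the class string "355211".
ct = conjoint_triad("MKKLAA")
print(ct["355"], ct["111"])
```

The normalization keeps all values between 0 and 1, which reduces the influence of sequence length on the descriptor.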

7. Quasi-sequence-order Descriptors

The quasi-sequence-order descriptors were proposed by Chou (2000). They are derived from the distance matrix between the 20 amino acids. The two types of descriptors are:

7.1. Sequence-order-coupling number (SOCN)

7.2. Quasi-sequence-order descriptors (QSO)
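
As a rough sketch of the first type, the d-th sequence-order-coupling number is usually written as the sum of squared distances between all residue pairs that are d positions apart. The distance table below is a made-up two-letter toy, not a real physicochemical distance matrix, and the function name is an assumption.

```python
def sequence_order_coupling(sequence, distance, lag):
    """Sum of squared distances between residues that are `lag` positions apart."""
    n = len(sequence)
    if not 1 <= lag < n:
        raise ValueError("lag must satisfy 1 <= lag < len(sequence)")
    return sum(
        distance[(sequence[i], sequence[i + lag])] ** 2 for i in range(n - lag)
    )


# Toy, made-up distances over a two-letter alphabet (hypothetical values).
toy_distance = {
    ("A", "A"): 0.0,
    ("K", "K"): 0.0,
    ("A", "K"): 0.5,
    ("K", "A"): 0.5,
}

tau_1 = sequence_order_coupling("AKKA", toy_distance, 1)
tau_3 = sequence_order_coupling("AKKA", toy_distance, 3)
print(tau_1, tau_3)
```

In practice the lookup would hold a full 20 x 20 matrix; the quasi-sequence-order descriptors then combine these coupling numbers with the amino acid composition.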


8. Pseudo-Amino Acid Composition (PseAAC)

This group of descriptors was proposed by Chou (2001). PseAAC descriptors are also known as the type 1 pseudo-amino acid composition. Let Ho1(i), Ho2(i), and Mo(i) (i = 1, 2, 3, …, 20) be the original hydrophobicity values, the original hydrophilicity values, and the original side chain masses of the 20 natural amino acids, respectively.
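
In Chou's formulation, as it is commonly described, each of these three sets of original values is first standardized to zero mean and unit standard deviation across the amino acids before it is used. A minimal sketch of that step follows; the function name is an assumption and the input is a made-up three-entry table rather than real property values.

```python
import math


def standardize(original):
    """Rescale property values to zero mean and unit (population) standard deviation."""
    values = list(original.values())
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if sd == 0:
        raise ValueError("all values are identical; cannot standardize")
    return {aa: (v - mean) / sd for aa, v in original.items()}


# Made-up values for three residues; a real table would have 20 entries.
toy_property = {"A": 1.0, "K": 2.0, "L": 3.0}
standardized = standardize(toy_property)
print(standardized)
```

Standardizing puts hydrophobicity, hydrophilicity, and side chain mass on the same scale, so no single property dominates because of its units.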


9. Amphiphilic Pseudo-Amino Acid Composition (APseAAC)

Amphiphilic Pseudo-Amino Acid Composition (APseAAC) was proposed in Chou (2001). APseAAC is also known as the type 2 pseudo-amino acid composition. The definitions of these quantities are similar to those of the PseAAC descriptors.


10. Labels

The labels are the classes or groups that the genes are mapped into. A label can act as either a target variable or a feature, depending on the problem the user wants to solve.

10.1 No Label

This selection lets users view the properties of all genes without labeling them with gene categories or annotations, so that they can examine the features of multiple genes and identify common patterns among them. Because it involves inspecting all the genes, this option works only with the "Submit for analysis" button.

10.2 Classical Genes

Classical genes are the most well-studied genes, known mainly for their visible mutant phenotypes (for example, liguleless3).

10.3 Pan-genome Genes

A gene in a given taxonomic group is either present in every individual (core), or absent in at least a single individual (dispensable).
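
The core/dispensable rule can be sketched from a presence/absence table. The gene IDs, the table layout, and the function name below are hypothetical and only illustrate the definition above.

```python
def classify_pan_genome(presence):
    """Label a gene 'core' if present in every individual, else 'dispensable'.

    `presence` maps a gene ID to a list of booleans, one per individual.
    """
    return {
        gene: "core" if all(flags) else "dispensable"
        for gene, flags in presence.items()
    }


# Hypothetical presence/absence calls across three individuals.
presence = {
    "geneA": [True, True, True],
    "geneB": [True, False, True],
}
labels = classify_pan_genome(presence)
print(labels)
```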

10.4 Origin Genes

Gene duplication is an important evolutionary mechanism that provides new genetic material and thus opportunities for an organism to acquire new gene functions. Duplications can have different origins, such as whole-genome duplications, tandem duplications, etc.


Graph interpretations:

In the top right corner of each plot, there are options to download the plot, zoom in and out, reset axes, autoscale, toggle spike lines, show closest data on hover, compare data on hover, box select, pan, and lasso select. Users can also select specific legend entries to view data only for those entries. Details on the interactive plot options are available here:
Interactive graph features

1. Marginal Plot

The marginal plots are box plots showing the frequency distribution of the selected gene features, with the candidate genes highlighted. This plot enables users to easily identify where their candidate genes lie among the other maize genes for the selected feature.

2. Frequency Dipeptide box plot

The Frequency Dipeptide box plot describes the fraction of each dipeptide type within a peptide sequence. The fractions are calculated as:
f(r,s) = Nrs / N
where Nrs is the number of dipeptides formed by amino acid type r followed by type s, and N is the length of the chosen labeled peptide sequence.

3. Frequency Tripeptide box plot

The Frequency Tripeptide box plot describes the fraction of each tripeptide type within a peptide sequence. The fractions are calculated as:
f(r,s,h) = Nrsh / N
where Nrsh is the number of tripeptides formed by amino acid types r, s, and h in that order, and N is the length of the chosen labeled peptide sequence.

For further details on these features, please see the pipeline below:

All features are generated using the R package protr, which provides various numerical representation schemes for protein sequences.