User Specified Candidate Genes Data Analysis



What is User Specified Candidate Genes Data Analysis?

The user candidate gene analysis section lets users run a comparative study of their genes of interest. Users can enter either a single gene or a group of candidate genes linked to a specific biological pathway or function, and compare them with a downsampled set of other maize genes.

The other maize genes are randomly undersampled to match the number of candidate genes. Samples are removed at random from the majority class (here, the non-candidate genes) until the class distribution is balanced.
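The downsampling step described above can be sketched as follows. This is a minimal illustration and not the application's actual implementation: the function name `downsample_background`, the gene IDs, and the fixed random seed are all hypothetical.

```python
import random

def downsample_background(candidate_genes, all_genes, seed=0):
    """Randomly pick as many non-candidate genes as there are candidates.

    Hypothetical helper for illustration only.
    """
    candidates = set(candidate_genes)
    # Background = every gene that is not a candidate (sorted so the
    # result is reproducible for a given seed).
    background = sorted(g for g in all_genes if g not in candidates)
    # Never ask for more genes than the background contains.
    n = min(len(candidates), len(background))
    rng = random.Random(seed)
    # Sample without replacement.
    return rng.sample(background, n)

# Made-up gene IDs, used only to show the call.
all_genes = [f"gene_{i:03d}" for i in range(100)]
candidates = ["gene_003", "gene_017", "gene_042"]
background_sample = downsample_background(candidates, all_genes)
```

With three candidate genes, `background_sample` holds three distinct genes drawn from the remaining 97, so the two groups are the same size.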

For more detailed information on using the "User Specified Candidate Gene Analysis" module, please refer to the flow diagram or watch the video tutorial linked below.





Data Standardization
Omics datasets are collected from disparate sources, so they span diverse ranges and scales and follow their own statistical distributions. Data standardization is therefore crucial for omics datasets. Outputs generated from non-standardized features are often skewed and dominated by outliers and anomalies. Exploratory analyses such as dendrograms, hierarchical heatmaps, hierarchical scatter plots, heatmaps, and PCA therefore require data preprocessing and normalization to balance out disproportionate weights across variables. Normalization transforms multiscale data to a common scale, which improves the stability and performance of the learning algorithm. The Maize Feature Store application normalizes numerical omics features with Z-score normalization, the most common normalization method: each feature is centered on its mean and scaled by its standard deviation, so that the normalized feature has a mean of 0 and a standard deviation of 1. Each feature is normalized as Z = (X - X') / S, where X, X', and S are the feature value, its mean, and its standard deviation, respectively.
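As a small worked example of the formula Z = (X - X') / S, the sketch below standardizes two features on very different scales. The feature values are invented, the function name `z_score` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say whether the population or the sample form is used.

```python
from statistics import mean, pstdev

def z_score(values):
    """Return (x - mean) / std for every value in one feature column."""
    mu = mean(values)
    # Assumption: population standard deviation (pstdev).
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no variation; map it to zeros
        # instead of dividing by zero.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]

# Invented values on very different scales.
gene_length = [1200.0, 3400.0, 800.0, 2600.0]
expression = [0.5, 12.0, 3.3, 7.2]

z_length = z_score(gene_length)
z_expr = z_score(expression)
```

After the transformation, both `z_length` and `z_expr` have a mean of 0 and a standard deviation of 1, so neither feature dominates a distance-based analysis merely because of its units.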
Data Imbalance
The class labels assigned to genes in a reference genome are frequently unevenly distributed, which biases analyses toward the majority class. For example, 72% of AGPv5 genes are marked as core in the maize pan-genome, and 28% are annotated as non-core (near-core, dispensable, and private genes). We therefore used random downsampling to address the unbalanced data and give users the option of a "Downsampled analysis". Random undersampling is a technique in which randomly chosen samples are deleted from the majority class. The process is repeated until the desired class distribution is reached: in our case, an equal number of genes in each category of a label.
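The balancing idea can be illustrated with a short sketch that trims every category down to the size of the smallest one. It is an illustration under stated assumptions, not the application's own code: the function name `undersample`, the gene IDs, and the toy 72/28 label split (chosen only to mirror the percentages quoted above) are hypothetical.

```python
import random
from collections import defaultdict

def undersample(labels, seed=0):
    """Balance a {gene_id: category} mapping by random undersampling.

    Every category is reduced to the size of the smallest category.
    Hypothetical helper for illustration only.
    """
    by_class = defaultdict(list)
    # Sort so the result is reproducible for a given seed.
    for gene, label in sorted(labels.items()):
        by_class[label].append(gene)
    # Target size = size of the minority category.
    n = min(len(genes) for genes in by_class.values())
    rng = random.Random(seed)
    balanced = {}
    for label, genes in by_class.items():
        # Keep n randomly chosen genes per category, drop the rest.
        for gene in rng.sample(genes, n):
            balanced[gene] = label
    return balanced

# Toy data: 72 "core" and 28 "non-core" genes.
labels = {f"g{i}": ("core" if i < 72 else "non-core") for i in range(100)}
balanced = undersample(labels)
```

Here the "core" category is cut from 72 genes to 28, while all 28 "non-core" genes are kept, giving a balanced set of 56 genes.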
Maize Feature Store has three tools
  • Data Tables
  • Data Visualization
  • Data Modeling
Questions we are trying to answer
  1. What is common among these genes in a given class?

  2. What are the relationships between gene phenotype and gene length, copy number, expression levels and patterns, epigenetic markers, cross-species conservation, and SNP densities?

  3. Apply machine learning methods to predict important biological classifications.

  4. Provide tools to discover whether genes within a class have distinct features, and use those features to construct within- and cross-species prediction models.