Frequently Asked Questions



The ratio of label categories, such as core to non-core genes or classical to other genes, is frequently uneven in each reference genome, which biases analyses toward the majority class. For example, in the maize reference genome version B73v5, 72% of the genes are marked as core and 28% are annotated as non-core (near-core, dispensable, and private genes). We therefore used random down-sampling to address the unbalanced data during exploratory analysis and provide users with the option of “Downsampled analysis”. Random down-sampling (also called random undersampling) is a technique in which randomly chosen samples are deleted from the majority class. The process can be repeated until the desired class distribution is reached, such as an equal number of genes in each category of the selected label. Note that the size of the downsampled data differs for each label selection (Classical/Pan-genome/Gene-Origin), because the size of the minority class differs for each label.
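As an illustration of random down-sampling, the following is a minimal Python sketch, not the code MFS itself runs. The function name `random_undersample`, the toy gene IDs, and the fixed seed are all hypothetical; the only detail taken from the text is the 72%/28% core to non-core split.

```python
import random
from collections import defaultdict


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class matches the minority class size."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility

    # Group the samples by their label.
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    # The smallest class sets the target size for every class.
    target = min(len(group) for group in by_label.values())

    balanced = []
    for label, group in by_label.items():
        # rng.sample draws without replacement, so no gene is duplicated.
        for sample in rng.sample(group, target):
            balanced.append((sample, label))

    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data mirroring the 72% / 28% split described above.
genes = [f"gene_{i}" for i in range(100)]
labels = ["core"] * 72 + ["non-core"] * 28
balanced = random_undersample(genes, labels)
```

Because the target size is the size of the minority class, the result depends on which label is selected, which is why the downsampled dataset has a different size for each label.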

Users can always download the selected subset of the data or the complete dataset via the “Download Source” or “Download All” option, respectively.

The "Modeling" module of MFS offers an "Advanced Model" form and a "Basic Model" form, both of which allow users to make predictions for their genes from a set of inputs. The "Advanced Model" uses all omics features (14,407 in total), while the "Basic Model" uses a subset of omics features (10,271 in total) comprising only gene structure, gene sequence, and protein sequence data. The "Basic Model" is therefore the more general of the two.

In each modeling form ("Advanced" or "Basic"), users enter the necessary information for their gene of interest to classify it as core or non-core. Prediction results are displayed at the bottom of the same form with a probability score between 0 and 1 (See Basic, Advanced). Although the "Advanced" model is more accurate and more efficient, the input features it requires are specific to maize or the associated maize inbred lines. The "Advanced" model may therefore only work on maize genes or on genes from species closely related to maize. In addition, generating some of these features is time-consuming and requires programming skills. To provide a versatile prediction platform for both novices and experts, we developed the "Basic" model, which relies only on gene and protein sequences and structural features. These genomic features are easily accessible or readily available in the form of GFF files. For users lacking the necessary sequence features, the "Basic" model (http://mfs.usda.iastate.edu/model_basic) also includes an input box that auto-fills the form from only the protein sequence and the coding sequence supplied by the user. As a result, the "Basic" model serves as a hassle-free prediction platform for a wide range of users.

Maize Feature Store (MFS) systematically integrates 14,407 gene-based features derived from the most recent maize multi-omics dataset (version 5 of the B73 reference genome, or B73v5). It thereby provides non-experts with a suite of methods, modeling modules, and datasets for finding meaningful patterns in maize omics data. Users can freely download the clean, integrated maize omics dataset to carry out their own analyses.

Details on the usage and interpretation of all plots and tables are available on each page of the MFS website, for example http://mfs.usda.iastate.edu/Structure.

In each modeling form ("Advanced" or "Basic"), users enter the necessary information for their gene of interest to classify it as core or non-core. Prediction results are displayed at the bottom of the same form with a probability score between 0 and 1 (See Basic, Advanced). A probability score below 0.5 (closer to 0) indicates that the gene is non-core, whereas a probability score above 0.5 (closer to 1) indicates that the gene is core.
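The interpretation rule can be written as a small Python sketch. The function name `label_from_probability` and the example scores are hypothetical. The text does not say how a score of exactly 0.5 is treated, so assigning it to core here is an assumption.

```python
def label_from_probability(score, threshold=0.5):
    """Map a model probability score to 'core' or 'non-core'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("probability score must lie between 0 and 1")
    # Assumption: a score exactly at the threshold is reported as core.
    return "core" if score >= threshold else "non-core"


# Hypothetical scores, as they might be read off the prediction form.
example_scores = [0.08, 0.47, 0.91]
example_labels = [label_from_probability(s) for s in example_scores]
```

Scores near 0.5 carry little confidence in either direction, so they are best read as borderline rather than as firm calls.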

At present, the web interface of the modeling module classifies only one gene at a time as core or non-core. Users who want to classify a set of genes at once need to download the model from GitHub and run it on their dataset with a Python script.
