PRESGENE SERVER

USER GUIDE

(Version: PRESGENE - v1.0.0)

The tutorial is divided into three sections:

A. Navigating through the PRESGENE server webpage

B. Understanding the Input Files required by the Server for the Prediction

C. Essential Gene Prediction using PRESGENE- Step-By-Step Guide

A. Navigating through the PRESGENE server webpage

1. Home Page

Click here to know more about the PRESGENE server

2. Machine Learning Strategies Implemented

The PRESGENE server offers two ML Strategies [1,2] for the prediction of essential genes in prokaryotes and eukaryotes.

➢ For information on the Supervised Learning based Strategy: Click on ML Strategy 1

➢ For information on the Semi-supervised Learning based Strategy: Click on ML Strategy 2

3. Services of the PRESGENE server

The server provides three channels for predicting essential genes.

Channel I: This lets the user test the pipelines on 14 sample model organisms, including both prokaryotes and eukaryotes. The user can vary the percentage of labeled data used for the prediction of the essential genes. The results produced for these model organisms can be used directly in the user's own study, for example for the prediction of drug targets or other applications.

Channel II: This lets the user predict essential genes for a new organism in four simple steps. To prepare the training dataset, the user needs to provide the name of the organism and five input files (discussed in Section B).

Channels I and II direct the user to the Dataset Preparation (Feature Matrix Calculation) tab, where the feature matrix is calculated from 289 biological features and essential genes are then predicted with the ML1 or ML2 strategy.

Channel III: This lets the user incorporate and test the influence of other biological features (apart from the existing 289). The user calculates these features and provides them to the server as a User Created Dataset with Feature Matrix. This matrix forms the training dataset of the pipeline. It should contain the features as columns and the reaction-gene combinations (samples) of the metabolic network as rows. The last column of the matrix should contain the gene essentiality information as the target variable: E (Essential), N (Non-Essential), or UD (Undefined). Channel III takes the user directly to the Training and Prediction tab of the ML pipeline.
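Before uploading a User Created Dataset, it can help to check the matrix locally. The sketch below is only an illustration and not part of the PRESGENE server. It assumes a comma-separated file with a header row, a sample identifier in the first column and numeric features in between; these layout details and the function name check_feature_matrix are assumptions, and only the E/N/UD labels in the last column come from this guide.

```python
import csv
import io

ALLOWED_LABELS = {"E", "N", "UD"}  # target labels named in this guide


def check_feature_matrix(csv_text):
    """Return a list of problems found in a Channel III style matrix.

    Assumed layout (hypothetical): a header row, then one row per
    reaction-gene sample; first column = sample ID, last column = label,
    every column in between = numeric feature.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if len(rows) < 2:
        return ["matrix needs a header row and at least one sample row"]
    problems = []
    width = len(rows[0])
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            problems.append(f"line {line_no}: expected {width} columns, got {len(row)}")
            continue
        if row[-1] not in ALLOWED_LABELS:
            problems.append(f"line {line_no}: label {row[-1]!r} is not E, N or UD")
        for value in row[1:-1]:
            try:
                float(value)
            except ValueError:
                problems.append(f"line {line_no}: non-numeric feature {value!r}")
    return problems


example = "sample,f1,f2,Class\nR1_g1,0.5,1,E\nR2_g2,0.1,3,UD\n"
print(check_feature_matrix(example))
```

An empty list means that no problem was found under these assumptions; the server may apply further checks of its own.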

The step-by-step tutorial for essential gene prediction using Channels I, II and III is provided in Section C.

Frequently Asked Questions

Go to the Frequently Asked Questions section on the Home Page for answers to common queries.

B. Understanding the Input Files required by the Server for the Prediction

A set of sample input files is provided in the Download Sample Organisms tab to help first-time users understand the file formats required for prediction through Channels I or II. The five input files for each organism are provided in .zip format.

Each input file is described below.

Input File 1: Metabolic Network File

This is a genome-scale reconstructed metabolic network (GSRMN) file (*.mat). The genome-scale reconstructed metabolic network contains information on metabolites, reactions, and genes. These networks are available in the literature and in the BiGG database [3].

Input File 2: Nucleotide sequence File

This is a multi-FASTA file containing the nucleotide sequences of all the protein-coding genes of your organism of interest.

Please make sure that each sequence starts with a header line that begins with “>” followed by the Gene ID. Users can use any type of Gene ID (e.g., RefSeq, NCBI, EMBL, GenBank, DDBJ, etc.) to construct their input files.

It is important that the same Gene ID is used across all the other input files, so that the features can be collected, calculated and assembled consistently into the final feature matrix.
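The Gene ID consistency described above can be checked before upload. The sketch below is a minimal illustration, not a PRESGENE tool. It assumes that the Gene ID is the first whitespace-delimited token after “>”; the function names fasta_gene_ids and missing_ids and the example IDs are hypothetical.

```python
def fasta_gene_ids(fasta_text):
    """Collect the Gene ID from each '>' header line of a multi-FASTA text.

    Assumption: the Gene ID is the first whitespace-delimited token
    that follows '>' on the header line.
    """
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            ids.append(tokens[0] if tokens else "")
    return ids


def missing_ids(reference_ids, other_ids):
    """Gene IDs present in the reference file but absent from another file."""
    return sorted(set(reference_ids) - set(other_ids))


nucleotide = ">b0001 example gene\nATGAAACGC\n>b0002\nATGCGA\n"
protein = ">b0001\nMKR\n"
print(missing_ids(fasta_gene_ids(nucleotide), fasta_gene_ids(protein)))
```

Any ID reported here appears in the nucleotide file but not in the protein file, which is the kind of mismatch to resolve before submitting a job.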

Input File 4: Protein sequence File

This is a multi-FASTA file of the proteins encoded by the coding nucleotide sequences (CDS) of the genes of your organism of interest. It contains the amino acid sequences in FASTA format, in the same way as the Nucleotide sequence file and the Ribosomal sequence file.

Input File 5: Gene Essentiality File

This file contains the known experimental information about the essentiality of the genes of the target organism, which is used for model training. It is the labeled dataset, in .csv format, with two columns: Gene and Class.

In the column ‘Class’, enter the essentiality information for each gene of the target organism. Note that a labeled dataset of at least 1% is required for model training before PRESGENE can predict essential genes.

✓ An essential gene is denoted as: E

✓ A non-essential gene is denoted as: N
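The 1% minimum can be checked locally before submission. The sketch below is an illustration only. It assumes that the percentage is counted over the genes of the organism and that the file has a header row with the columns Gene and Class; the function name labeled_fraction and the example values are hypothetical, and the server may count the labeled fraction differently.

```python
import csv
import io


def labeled_fraction(csv_text, total_genes):
    """Fraction of the organism's genes that carry an E or N label.

    Assumptions: the CSV has a header row with 'Gene' and 'Class'
    columns, and total_genes is the number of genes in the organism
    (for example, the number of records in the nucleotide FASTA file).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    labeled = {
        row["Gene"].strip()
        for row in reader
        if row["Class"].strip() in {"E", "N"}
    }
    return len(labeled) / total_genes


essentiality = "Gene,Class\nb0001,E\nb0002,N\nb0003,X\n"
fraction = labeled_fraction(essentiality, total_genes=200)
print(f"{fraction:.1%} of genes are labeled")
```

Rows whose Class is neither E nor N are ignored in this sketch, so a typo in the label column lowers the reported fraction instead of being counted silently.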

C. Essential Gene Prediction using PRESGENE- Step-By-Step Guide

The user can start the gene essentiality analysis by clicking on the ML Strategy tab.

The two broad steps in the prediction pipeline are:

1) Dataset preparation (feature matrix table) from the required sequence, flux and essentiality information of the query organism (sample or new organism)

2) Model training and prediction using the supervised or semi-supervised ML Strategies

PRESGENE provides three options for feature matrix calculation (dataset preparation) for gene essentiality prediction: Channel I, Channel II and Channel III.

Channel I: Can be used when the user wants to test or work with any of the sample organism datasets provided by PRESGENE.

Channel II: Can be used when the user wants to perform prediction on a new organism, and has the five required input files and a nominal amount (minimum 1%) of essentiality information for that organism.

Channel III: Can be used when the user already has a ready feature matrix (Master Dataset) and wants to run the prediction directly. See the Channel III section below for the format of a ready feature matrix file.

JOB ID: Each submission is assigned a unique JOB ID, which is an 8-digit number such as ‘00000143’. The user should note this JOB ID to track and retrieve the results, which remain available on the server for the next 15 days.
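For users who keep their own record of submissions, the sketch below shows one way to validate a noted JOB ID and estimate when its results expire. It is not part of the server. The 8-digit format and the 15-day period come from this guide; counting the 15 days from the submission date, and the names is_valid_job_id and results_expire_on, are assumptions.

```python
import re
from datetime import date, timedelta

JOB_ID_PATTERN = re.compile(r"[0-9]{8}")  # eight digits, as in '00000143'
RETENTION_DAYS = 15  # retention period stated in this guide


def is_valid_job_id(job_id):
    """True if job_id consists of exactly eight ASCII digits."""
    return JOB_ID_PATTERN.fullmatch(job_id) is not None


def results_expire_on(submitted):
    """Last retrieval date, assuming the 15 days run from submission."""
    return submitted + timedelta(days=RETENTION_DAYS)


print(is_valid_job_id("00000143"))
print(results_expire_on(date(2024, 1, 1)))
```

Storing the JOB ID as text rather than as a number preserves the leading zeros.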

New JOB

Clicking on New JOB terminates the ongoing job and assigns a new JOB ID.

Channel I: Running PRESGENE on Sample Organisms

STEP 1: Select the organism of your choice from the list of 14 available sample organisms. Then select the percentage of gene essentiality information available. The default value is 30%, and the user can change it as required.

STEP 2: The next step is Dataset (Feature Matrix) preparation. Click on the tab on the left side. The Dataset Preparation tab allows the user to choose the set of biological features to be considered for the gene essentiality prediction. However, it is recommended to use all 289 biological features for higher accuracy and better prediction of essential genes. The features are listed in Table 1.

Table 1. List of features and the software packages used for feature calculation

Topological analysis of reactions and flux-coupled sub-networks

Feature type: Reaction Network
Software packages: The COBRA Toolbox to generate the reaction network from the genome-scale metabolic network (.mat); "igraph" for network analysis [4]
Programming languages: MATLAB, R, Perl

    Feature name              Abbreviation
    Degree Centrality         TF_RN_DC
    Eigenvector Centrality    TF_RN_EC
    Eccentricity              TF_RN_ET
    Hub Score                 TF_RN_HS
    Authority Score           TF_RN_AS
    Page Rank                 TF_RN_PR
    Betweenness Centrality    TF_RN_BC
    Number of triangle        TF_RN_NT

Feature type: Flux Coupled Network
Software packages: F2C2 tool v0.95b (Flux Couple Analysis); "igraph" for network analysis [4]
Programming languages: MATLAB, R, Perl

    Feature name              Abbreviation
    Degree Centrality         TF_FC_DC
    Eigenvector Centrality    TF_FC_EC
    Eccentricity              TF_FC_ET
    Hub Score                 TF_FC_HS
    Authority Score           TF_FC_AS
    Page Rank                 TF_FC_PR
    Betweenness Centrality    TF_FC_BC
    Number of triangle        TF_FC_NT

Features derived from the coding nucleotide sequences

    Feature type                    Feature name                              Abbreviation  Software package                    Language
    Derived features                Nucleotide content                        NS_DF_NC      In-house Perl script                Perl
    Derived features                Effective Number of Codons                NS_DF_ENC     EMBOSS package version 6.6.0-1 [5]  Perl
    Derived features                Codon Adaptation Index                    NS_DF_CAI     EMBOSS package version 6.6.0-1 [5]  Perl
    Information-theoretic features  Mutual Information (MI)                   NS_ITF_MI     In-house Perl script                Perl
    Information-theoretic features  Conditional Mutual Information (CMI)      NS_ITF_CMI    In-house Perl script                Perl

Features derived from protein sequences

    Feature type                    Feature name                              Abbreviation  Software package                    Language
    Derived features                Frequencies of the twenty amino acids     PS_DF_FA      EMBOSS package version 6.6.0-1 [5]  Perl
    Derived features                Protein length                            PS_DF_PL      EMBOSS package version 6.6.0-1 [5]  Perl
    Derived features                Paralogy based features (Paralogy score)  PS_DF_PS      BLAST [version 2.2.26]              Perl
    Information-theoretic features  Fourier sine coefficient                  PS_ITF_FSC    In-house Perl script                Perl
    Information-theoretic features  Fourier cosine coefficient                PS_ITF_FCC    In-house Perl script                Perl
    Information-theoretic features  Average Kidera Factor                     PS_ITF_AKF    In-house Perl script                Perl
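To give a feel for the simplest sequence-derived features in Table 1 (nucleotide content, protein length and amino acid frequencies), the sketch below computes them for a single sequence. It is an illustration written in Python, not the in-house Perl scripts or EMBOSS programs that the server uses, and the exact definitions applied by the server may differ. All function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty standard amino acids


def nucleotide_content(cds):
    """Fraction of A, C, G and T in a coding nucleotide sequence."""
    cds = cds.upper()
    if not cds:
        raise ValueError("empty sequence")
    counts = Counter(cds)
    return {base: counts[base] / len(cds) for base in "ACGT"}


def protein_length(protein):
    """Number of residues in a protein sequence."""
    return len(protein)


def amino_acid_frequencies(protein):
    """Relative frequency of each of the twenty standard amino acids."""
    protein = protein.upper()
    if not protein:
        raise ValueError("empty sequence")
    counts = Counter(protein)
    return {aa: counts[aa] / len(protein) for aa in AMINO_ACIDS}


print(nucleotide_content("ATGC"))
print(protein_length("MKKA"), amino_acid_frequencies("MKKA")["K"])
```

Each function returns one block of columns for a single gene; in a feature matrix, such values would be placed side by side for every reaction-gene sample.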

For Channel II (a new organism), once the five input files are ready for the organism of interest, follow the steps below to carry out essential gene prediction.

• Upload the Genome-Scale Reconstructed Metabolic Network (GSRMN) file: click on Choose File, select the file, and then click on Upload GSRMN File.

• Next, enter the Gene ID Type and click on “I confirm the gene ID type”. This step is optional.

• Check the Gene ID list shown here to confirm that all the genes have been read correctly by the algorithm for further processing.

• Upload the other four required input files (Input Files 2-5 above). Move to each subsequent input file in turn and upload the corresponding category of file.

Once all the input files have been uploaded successfully, proceed to the next step in the pipeline, Dataset Preparation (Feature Matrix Calculation).

STEP 2, STEP 3 and STEP 4: Repeat as in Channel I.

Download Results

To retrieve previously calculated results, click on Download Result and enter the JOB ID. Results can be retrieved within 15 days of the prediction.

References

[1] Nandi S, Subramanian A, Sarkar RR. An integrative machine learning strategy for improved prediction of essential genes in Escherichia coli metabolism using flux-coupled features. Mol Biosyst. 2017;13: 1584-1596. doi:10.1039/C7MB00234C

[2] Nandi S, Ganguli P, Sarkar RR. Essential gene prediction using limited gene essentiality information-An integrative semi-supervised machine learning strategy. Mirjalili S, editor. PLoS One. 2020;15: e0242943. doi:10.1371/journal.pone.0242943

[3] Howe KL, Achuthan P, Allen J, Allen J, Alvarez-Jarreta J, Amode MR, et al. Ensembl 2021. Nucleic Acids Res. 2021;49: D884-D891. doi:10.1093/nar/gkaa942

[4] Csardi G, Nepusz T, others. The igraph software package for complex network research. InterJournal, Complex Syst. 2006;1695: 1-9. Available: https://igraph.org

[5] Rice P, Longden I, Bleasby A. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet. 2000;16: 276-277. doi:10.1016/S0168-9525(00)02024-2