1.    Introduction

1.1.          Bioinformatics

Bioinformatics is a sub-discipline at the intersection of biology and computer science concerned with the acquisition, storage, dissemination, and analysis of biological data, usually nucleotide and protein sequences. It is a computational field centered on analyzing the sequences of biological molecules. Bioinformatics generally deals with genomes, proteins, or RNA, and is mainly used to compare genes and protein sequences within an organism or between organisms, to examine the evolutionary relationships between organisms, and to study the patterns that occur across DNA and protein sequences in order to determine their functions.

A variety of computational tools are used for several purposes, such as identifying the functions of proteins and genes, predicting the three-dimensional shapes of proteins, and establishing evolutionary relationships.

1.2.          Bioinformatics software

Bioinformatics software programs and tools are designed to retrieve relevant data from biological or molecular biology databases in order to perform structural or sequence analysis.

1.2.1.     Classification of Bioinformatics Software

In bioinformatics, both customized and standard tools are available to meet the needs of specific tasks. They include:

  • visualization tools, which examine and extract information from proteomic database(s)
  • data-mining software, which extracts information from genomic database(s)

These tools can be classified into several categories, including:

  • protein analysis software
  • homology software
  • structure analysis software
  • sequence analysis software

Molecular modelling software (such as WHATIF and RasMol), structure prediction software (such as Threader), sequence analysis programs (including the Staden package and EMBOSS), and sequence search programs (such as BLAST) are used every day.

Sequence Analysis software:

These tools allow users to study a query sequence in more depth, for example by identifying mutations, performing evolutionary analysis, and detecting compositional biases, CpG islands, and hydropathy regions. These biological features serve as indicators that help to determine the particular function of the user’s sequence.
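Two of the features mentioned above, compositional bias and CpG islands, can be illustrated with a short sketch. The function names are hypothetical, and the default thresholds (a 200-base window, GC content of at least 50%, and an observed/expected CpG ratio of at least 0.6) are commonly cited criteria rather than the settings of any particular tool named in this text.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def cpg_windows(seq, window=200, step=50, min_gc=0.5, min_ratio=0.6):
    """Return start positions of windows that meet both thresholds.

    A sequence shorter than `window` is evaluated as one short window.
    """
    hits = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        if gc_content(chunk) >= min_gc and cpg_obs_exp(chunk) >= min_ratio:
            hits.append(start)
    return hits
```

Windows flagged in this way are only candidates; dedicated sequence analysis packages typically merge overlapping windows and apply further filters.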

Homology software:

Homologous sequences are sequences that descend from a common ancestor. The level of similarity between two sequences can be calculated, whereas homology itself is a true/false matter: two sequences either share an ancestor or they do not. Homology software can be used to identify similarities between new query sequences, whose function and structure are unknown, and database sequences of known function and structure.
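The distinction between a measurable similarity and a yes/no homology call can be made concrete with a minimal sketch. It assumes two sequences that have already been aligned, with "-" marking gaps, and computes percent identity over the columns where neither sequence has a gap (one of several conventions in use). The function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the gap-free columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0
```

The result is a similarity score only. Whether two sequences are homologous is an inference drawn from such scores, usually together with a statistical significance estimate.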

Protein Analysis software:

Protein analysis software allows users to compare their protein sequence against secondary protein database(s), which contain information on signatures, protein domains, and motifs. A significant hit against one of these pattern database(s) allows users to estimate the biochemical function of their sequence.
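As an illustration of matching a sequence against a pattern, the sketch below converts a small subset of the PROSITE pattern syntax into a regular expression and scans a protein sequence with it. The example pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern; the function names and the test sequence are hypothetical, and only single residues, x, [..], {..}, and simple repeat counts are handled.

```python
import re

# One pattern element: a residue, x, [allowed], or {forbidden},
# optionally followed by a repeat count such as (2) or (2,4).
ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def find_motif(sequence: str, pattern: str):
    """Return (1-based position, matched text) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]
```

For example, find_motif("MKNGSAANPSQ", "N-{P}-[ST]-{P}") reports a single hit, "NGSA" at position 3; the second asparagine is followed by a proline and is therefore excluded.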

Structure Analysis software:

These tools allow users to compare structures against database(s) of known structures. The function of a protein is a direct result of its structure rather than its sequence, and structural homologues normally share functions. Identifying the 2D or 3D structure of a protein is therefore important in the study of its function.
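A common way to quantify how closely two structures agree is the root-mean-square deviation (RMSD) between matched atoms. The sketch below is a minimal version that assumes the two coordinate lists are already superimposed and listed in the same atom order; real structure comparison tools also perform the superposition and the atom matching, which are not shown here. The function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

Lower values indicate more similar structures; identical coordinate sets give an RMSD of zero.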

2.    Statistical Analysis of Bioinformatics Data

2.1.          CheS-Mapper (Chemical Space Mapper)

Visualizing QSAR (quantitative structure-activity relationship) information in chemical datasets is an active field of research in cheminformatics (Ertl P, 2012) (Awale M, 2013) (Guilloux VL, 2012) (Skoda P, 2013). Several techniques are being developed that help to understand the relationship between chemical structures, their physicochemical properties, and their biological or toxic effects.

Visual validation with CheS-Mapper enables analysis of QSAR information and shows how this data is used in the QSAR model. It indicates whether the endpoint is modeled too generally or too specifically, and it highlights the common features of compounds that are not well separated. In addition, the researcher can use the tool to examine how the QSAR model performs when predicting activity cliffs.
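An activity cliff is usually understood as a pair of structurally similar compounds with a large difference in activity. The sketch below shows one simple way such pairs could be listed, using the Tanimoto coefficient on sets of structural features. It is not CheS-Mapper's own procedure; the function names, the data layout, and both thresholds are assumptions made for illustration.

```python
from itertools import combinations


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity of two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def activity_cliffs(compounds, min_similarity=0.8, min_activity_gap=2.0):
    """List pairs that are similar in structure but far apart in activity.

    `compounds` maps a compound name to (feature set, activity value).
    Both thresholds are illustrative and would be tuned per dataset.
    """
    cliffs = []
    for (name1, (feat1, act1)), (name2, (feat2, act2)) in combinations(compounds.items(), 2):
        similarity = tanimoto(feat1, feat2)
        gap = abs(act1 - act2)
        if similarity >= min_similarity and gap >= min_activity_gap:
            cliffs.append((name1, name2, similarity, gap))
    return cliffs
```

Pairs returned by such a check are natural candidates for closer inspection, since a model that relies on structural similarity tends to predict them poorly.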

Visually examining QSAR model validation results can help to understand both the model and its data, and may offer the following advantages:

  • Data curation: It is important to inspect compounds that are misclassified (classification) or associated with a high prediction error (regression). Examining the probable causes of wrong predictions may help to identify errors in the training data, such as miscalculated endpoint values. The researcher may also find that the mispredicted compound is an outlier or that additional training data are needed.
  • Model improvement: Another possible reason for poor model performance is an unsuitable choice of features, for example when the available features cannot distinguish active from inactive compounds. The selected model may also be too specific (overfitting) or too general (underfitting). In addition, visual validation can show the effect of different model parameters.
  • Mechanistic interpretation: Information can be extracted from well-defined groups of compounds. Compounds that have similar endpoint values and feature values may share the same mode of action, so visual validation can help the researcher arrive at a mechanistic interpretation. Validation and mechanistic interpretation of the model are among the requirements of the OECD guidelines for accepted QSAR models (OECD, 2004). Visual validation can therefore also help to improve the acceptance of QSAR models by regulatory authorities as a substitute for testing.
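The data curation step described above starts from a list of compounds the model got wrong. A minimal sketch of how such a list could be assembled from validation results is shown below. The record layout (dictionaries with "id", "actual", and "predicted" keys), the function names, and the error threshold are assumptions for illustration and do not reflect CheS-Mapper's internal data model.

```python
def misclassified(records):
    """IDs of compounds whose predicted class differs from the actual class."""
    return [r["id"] for r in records if r["actual"] != r["predicted"]]


def high_error(records, max_abs_error=1.0):
    """(ID, absolute error) pairs above the threshold, largest error first."""
    flagged = []
    for r in records:
        error = abs(r["actual"] - r["predicted"])
        if error > max_abs_error:
            flagged.append((r["id"], error))
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

The flagged compounds are then the ones worth inspecting for wrong endpoint values, outlying structures, or gaps in the training data.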

CheS-Mapper is a general, open-source, interactive tool for inspecting chemical datasets of small molecules. It maps compounds into a 3D visual space and was developed to help researchers investigate chemical compounds and their properties. Unlike existing methods, it combines a 3D viewer, dimensionality reduction, and clustering. Its distinguishing feature is that each compound is represented by its own 3D structure rather than being replaced by a dot or node. Unlike other open-source tools that are restricted to a particular operating system, depend on additional installations, or require a particular installation format, it is a standalone application that requires no installation and accepts a wide range of chemical formats.

The CheS-Mapper workflow consists of two parts:

  • data preprocessing
  • visualization

The data preprocessing part is referred to as Chemical Space Mapping. It is configured through a wizard that guides the user through the preprocessing steps to be carried out by the tool:

  • After the chemical dataset is loaded into the software, 3D structures are computed for any compounds for which they are not available.
  • Next, the user is guided to select the compound features to be used in the mapping process.
  • Based on this user-defined chemical or biological similarity, the compounds are then grouped into clusters and embedded into 3D space.
  • CheS-Mapper can calculate a variety of structural features and chemical descriptors, and many algorithms are available for clustering and embedding.
  • Finally, the compounds in each cluster can be aligned in 3D space according to their common substructures…..
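The grouping step of this workflow can be sketched in a few lines. The example below scales numeric descriptors to a common range and then applies a simple greedy "leader" clustering based on Euclidean distance. This is a stand-in chosen for brevity, not one of the algorithms CheS-Mapper actually offers; the function names, the distance threshold, and the two example descriptors (molecular weight and logP) are assumptions, and the 3D embedding and alignment steps are not shown.

```python
import math


def min_max_scale(rows):
    """Scale every descriptor column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]  # avoid division by zero
    return [[(v - lo) / span for v, lo, span in zip(row, lows, spans)] for row in rows]


def leader_clustering(rows, max_distance=0.5):
    """Assign each row to the first cluster leader within `max_distance`.

    A row that is not close to any existing leader starts a new cluster.
    Returns one cluster index per input row.
    """
    leaders = []
    labels = []
    for row in rows:
        for index, leader in enumerate(leaders):
            if math.dist(row, leader) <= max_distance:
                labels.append(index)
                break
        else:
            leaders.append(row)
            labels.append(len(leaders) - 1)
    return labels


# Hypothetical descriptors per compound: [molecular weight, logP]
descriptors = [[100.0, 1.0], [110.0, 1.2], [400.0, 5.0], [420.0, 5.5]]
labels = leader_clustering(min_max_scale(descriptors))
```

With these example values the two light, polar compounds fall into one cluster and the two heavy, lipophilic compounds into another. The result of a leader clustering depends on the input order, which is one reason dedicated tools offer more robust algorithms.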