The application of information technology to chemistry
Chemometrics
The application of statistical methods to chemical data in order to derive predictive models or descriptors
Chemoinformatics, chemometrics and chemical informatics are related but distinct fields of research
The term "chemoinformatics" was introduced in 1998 by Brown, who defined it as the combination of "all the information resources that a scientist needs to optimize the properties of a ligand to become a drug"
Chemoinformatics emphasizes computer-aided decision support and relevance to drug discovery, whereas chemical informatics lacks this specific drug discovery focus
Spectrum of chemoinformatics
Chemical data collection, analysis, and management
Data representation and communication
Molecular modeling and simulation
Structure-activity relationship analysis
Virtual screening and compound selection
Reaction informatics
As an interdisciplinary field, chemoinformatics involves computational scientists, chemists, and biologists
Chemoinformatics
The application of information technology to chemistry, with a specific focus on drug discovery
Chemical informatics
The application of information technology to chemistry, without a specific drug discovery focus
It is increasingly difficult to distinguish between chemoinformatics, chemical informatics, and chemometrics, particularly as far as method development is concerned
Spectrum of chemoinformatics
Chemical data collection, analysis, and management
Data representation and communication
Database design and organization
Chemical structure and property prediction (including drug-likeness)
Molecular similarity and diversity analysis
Compound or library design and optimization
Database mining
Compound classification and selection
Qualitative and quantitative structure-activity or structure-property relationships
Information theory applied to chemical problems
Statistical models and descriptors in chemistry
Prediction of in vivo compound characteristics
Chemoinformatics includes all concepts and methods designed to interface theoretical and experimental programs involving small molecules
The evolution of chemoinformatics as an independent discipline will depend largely on its ability to demonstrate a measurable impact on experimental chemistry programs, whether in pharmaceutical research or elsewhere
Hierarchy of bio- and chemoinformatics research
DNA sequence
Molecular composition
Connectivity (graph)
Molecular similarity
Chemotype
Structure
Interaction
Specific activity
Drug
Protein sequence
Sequence similarity
Family
Structure
Interaction
Function
Intervention
Many algorithms and computational techniques used in chemoinformatics are also widely applied in bioinformatics
Informatics research and development in the life sciences is expected to become much more global in the future
Computational descriptors of molecular structure, physical or chemical properties, or pharmacophores
Chemical space
An n-dimensional reference space into which molecular data sets are projected for analysis or design
Types of molecular descriptors
Physical properties
Atom and bond counts
Pharmacophore features
Charge descriptors
Connectivity and shape descriptors
There are no generally preferred descriptor spaces for chemoinformatics applications; reference spaces usually have to be generated for each specific application on a case-by-case basis
Similar Property Principle
Molecules having similar structures and properties should also exhibit similar activity
Similarity coefficients
Tanimoto coefficient
Dice coefficient
Cosine coefficient
Molecular similarity, dissimilarity, and diversity
Similar molecules can be identified by application of distance functions and analysis of nearest neighbors in chemical space
Dissimilar molecules can be identified by maximizing the distance between them in chemical space
Molecular diversity refers to the overall spread of a compound collection in chemical space
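The three statements above can be illustrated with a minimal sketch. It assumes Euclidean distance as the distance function and mean pairwise distance as one simple measure of spread; the compound names and the 2-dimensional descriptor values are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, space, k=1):
    """Return the names of the k compounds closest to `query`."""
    ranked = sorted(space, key=lambda name: euclidean(query, space[name]))
    return ranked[:k]


def mean_pairwise_distance(space):
    """Average inter-compound distance, used here as a simple diversity measure."""
    pairs = list(combinations(space.values(), 2))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)


# Hypothetical compounds placed in a 2-dimensional descriptor space
space = {
    "cpd_A": (0.0, 0.0),
    "cpd_B": (3.0, 4.0),
    "cpd_C": (6.0, 8.0),
}

neighbors = nearest_neighbors((1.0, 1.0), space, k=2)  # ['cpd_A', 'cpd_B']
spread = mean_pairwise_distance(space)                 # (5 + 10 + 5) / 3
```

Other distance functions or diversity measures could be substituted without changing the overall structure.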
Molecular similarity analysis
The hallmarks of molecular similarity assessment
1. Descriptor combinations expressed as bit strings (fingerprints)
2. Test molecule assigned characteristic bit pattern
3. Pair-wise molecular similarity quantified by overlap of bit strings using similarity metrics (coefficients)
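The three steps above can be sketched as follows. The feature list is a hypothetical stand-in for a real fingerprint design, and the two test molecules are invented; the sketch only shows how bit patterns are assigned and how their overlap is counted.

```python
# Hypothetical feature dictionary: each position in the bit string
# corresponds to one structural feature or descriptor.
FEATURES = ["aromatic_ring", "hydroxyl", "amine", "carbonyl", "halogen", "sulfur"]


def to_fingerprint(present_features):
    """Encode a set of feature names as a bit string over FEATURES."""
    return "".join("1" if f in present_features else "0" for f in FEATURES)


def count_overlap(fp_i, fp_j):
    """Return (n_i, n_j, n_ij): on-bits per fingerprint and on-bits in common."""
    n_i = fp_i.count("1")
    n_j = fp_j.count("1")
    n_ij = sum(1 for a, b in zip(fp_i, fp_j) if a == b == "1")
    return n_i, n_j, n_ij


# Two invented test molecules described by the features they contain
fp_a = to_fingerprint({"aromatic_ring", "hydroxyl", "carbonyl"})          # '110100'
fp_b = to_fingerprint({"aromatic_ring", "amine", "carbonyl", "halogen"})  # '101110'

overlap = count_overlap(fp_a, fp_b)  # (3, 4, 2)
```

The overlap counts are the inputs to a similarity coefficient such as Tanimoto.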
Similarity metrics (coefficients)
Shown in Table 1.4
ni and nj
Number of bits set on for molecules i and j, respectively
nij
Number of bits in common to both molecules
Similarity coefficient values
Range from zero (no overlap; no similarity) to one (complete overlap; identical or very similar molecules)
The most widely used metric in chemoinformatics is the Tanimoto coefficient
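As a worked example, the three coefficients named earlier can be computed from the bit counts defined above (written n_i, n_j, and n_ij in the code). The formulas are the standard binary forms; the two sets of on-bit positions are hypothetical, and the functions assume each fingerprint has at least one bit set.

```python
import math


def bit_counts(bits_i, bits_j):
    """Return (n_i, n_j, n_ij) for two sets of on-bit positions."""
    return len(bits_i), len(bits_j), len(bits_i & bits_j)


def tanimoto(n_i, n_j, n_ij):
    """Tanimoto coefficient: n_ij / (n_i + n_j - n_ij)."""
    return n_ij / (n_i + n_j - n_ij)


def dice(n_i, n_j, n_ij):
    """Dice coefficient: 2 * n_ij / (n_i + n_j)."""
    return 2 * n_ij / (n_i + n_j)


def cosine(n_i, n_j, n_ij):
    """Cosine coefficient: n_ij / sqrt(n_i * n_j)."""
    return n_ij / math.sqrt(n_i * n_j)


# Hypothetical fingerprints given as sets of on-bit positions
bits_i = {0, 1, 3}
bits_j = {0, 2, 3, 4}

counts = bit_counts(bits_i, bits_j)  # (3, 4, 2)
tc = tanimoto(*counts)               # 2 / 5 = 0.4
dc = dice(*counts)                   # 4 / 7, approx. 0.571
cc = cosine(*counts)                 # 2 / sqrt(12), approx. 0.577
```

For identical fingerprints all three coefficients equal one, and for fingerprints with no bits in common all three equal zero.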
Molecular similarity
Identified by application of distance functions and analysis of nearest neighbors in chemical space
Molecular diversity
Attempts to either select different compounds from a given population or evenly populate a given chemical space with candidate molecules
Diversity selection and design
Using distance functions to select compounds at least a pre-defined minimum distance away from others or maximize average inter-compound distances
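The minimum-distance variant can be sketched as a greedy filter. This is one possible reading under stated assumptions: Euclidean distance, a single pass in input order, and hypothetical compound names and coordinates. The result depends on the order in which compounds are visited.

```python
import math


def select_min_distance(compounds, min_dist):
    """Greedily keep compounds at least `min_dist` away from every kept compound."""
    selected = []
    for name, vec in compounds:
        # Accept the candidate only if it is far enough from all kept compounds
        if all(math.dist(vec, kept_vec) >= min_dist for _, kept_vec in selected):
            selected.append((name, vec))
    return [name for name, _ in selected]


# Hypothetical compounds in a 2-dimensional descriptor space
compounds = [
    ("c1", (0.0, 0.0)),
    ("c2", (0.5, 0.0)),  # too close to c1
    ("c3", (2.0, 0.0)),
    ("c4", (2.0, 0.4)),  # too close to c3
    ("c5", (0.0, 3.0)),
]

subset = select_min_distance(compounds, min_dist=1.0)  # ['c1', 'c3', 'c5']
```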
Diversity selection and design
1. Dividing descriptor axes into evenly spaced value intervals (binning) to produce n-dimensional subsections (cells) of chemical space
2. Selecting a representative compound from each populated cell or populating cells as evenly as possible with computed molecules
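The two steps above can be sketched as follows, assuming descriptor values lie within known axis ranges and that the first compound listed in a cell serves as its representative. The ranges, bin count, and compounds are hypothetical.

```python
from collections import defaultdict


def cell_index(vec, lows, highs, bins):
    """Map a descriptor vector to its cell: one bin index per axis."""
    idx = []
    for x, lo, hi in zip(vec, lows, highs):
        width = (hi - lo) / bins
        # Clamp so that a value equal to the upper bound falls in the last bin
        idx.append(min(int((x - lo) / width), bins - 1))
    return tuple(idx)


def partition(compounds, lows, highs, bins):
    """Group compound names by the cell of chemical space they occupy."""
    cells = defaultdict(list)
    for name, vec in compounds.items():
        cells[cell_index(vec, lows, highs, bins)].append(name)
    return dict(cells)


def pick_representatives(cells):
    """Take the first compound listed in each populated cell."""
    return sorted(members[0] for members in cells.values())


# Hypothetical 2-dimensional space, each axis split into 2 equal intervals
compounds = {
    "a": (1.0, 1.0),
    "b": (2.0, 3.0),
    "c": (7.0, 1.0),
    "d": (9.0, 9.0),
    "e": (10.0, 10.0),
}
cells = partition(compounds, lows=(0.0, 0.0), highs=(10.0, 10.0), bins=2)
representatives = pick_representatives(cells)  # ['a', 'c', 'd']
```

Cell (0, 1) stays empty in this example; in a design setting, empty cells indicate regions that could be populated with computed molecules.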
Molecular diversity is a global concept, while molecular similarity analysis explores pair-wise relationships
Dissimilarity
The inverse of molecular similarity, addressing which molecule in a collection is most dissimilar from a given compound or set of compounds
Dissimilarity-based compound selection
1. Initially selecting a seed compound, then calculating dissimilarity between the seed and all others and selecting the most dissimilar one
2. Repeating the process to obtain a subset of desired size
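The procedure above can be sketched as follows. Two choices here are assumptions rather than part of the description: dissimilarity is taken as one minus the Tanimoto coefficient, and on each repeat the next compound is the one whose closest already-selected compound is furthest away (a MaxMin-style criterion). The fingerprints are hypothetical sets of on-bit positions.

```python
def tanimoto_dissimilarity(bits_i, bits_j):
    """1 - Tanimoto similarity for two sets of on-bit positions."""
    common = len(bits_i & bits_j)
    union = len(bits_i) + len(bits_j) - common
    similarity = common / union if union else 1.0
    return 1.0 - similarity


def select_dissimilar(fingerprints, seed, size):
    """Build a subset of `size` compounds, starting from `seed`."""
    selected = [seed]
    candidates = set(fingerprints) - {seed}
    while candidates and len(selected) < size:
        # Assumed MaxMin criterion: score a candidate by its dissimilarity
        # to the most similar compound already selected.
        best = max(
            sorted(candidates),  # sorted for deterministic tie-breaking
            key=lambda c: min(
                tanimoto_dissimilarity(fingerprints[c], fingerprints[s])
                for s in selected
            ),
        )
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical fingerprints
fingerprints = {
    "seed": {0, 1, 2, 3},
    "near": {0, 1, 2, 4},
    "mid": {0, 1, 5, 6},
    "far": {7, 8, 9},
    "far2": {7, 8, 10},
}

subset = select_dissimilar(fingerprints, seed="seed", size=3)  # ['seed', 'far', 'mid']
```

In this example "far2" is as dissimilar from the seed as "far", but once "far" has been selected it is passed over in favor of "mid", because it resembles a compound already in the subset.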
High-dimensional chemistry spaces are often too complex for meaningful and interpretable analysis