Bioinformatics Glossary


Accession number (GenBank) - The accession number is the unique identifier assigned to the entire sequence record when the record is submitted to GenBank. A GenBank accession number is a combination of letters and digits, usually one letter followed by five digits (e.g., M12345) or two letters followed by six digits (e.g., AC123456). The accession number for a particular record will not change even if the author submits a request to change some of the information in the record. Note that an accession number is a unique identifier for a complete sequence record, while a sequence identifier, such as a Version, GI, or Protein ID, is an identification number assigned just to the sequence data. The NCBI Entrez system is searchable by accession number using the Accession [ACCN] search field.
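
The two formats described above can be checked with a short pattern match. This is a minimal sketch in Python: the function name is hypothetical, and it covers only the two formats named in this entry, so identifiers in other formats will be rejected.

```python
import re

# Only the two formats described in the entry: one letter + five digits,
# or two letters + six digits. Other accession formats are not covered.
ACCESSION_PATTERN = re.compile(r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6})")


def looks_like_accession(text: str) -> bool:
    """Return True if text matches one of the two formats described above."""
    return ACCESSION_PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("M12345", "AC123456", "M12345.1"):
        print(candidate, looks_like_accession(candidate))
```

The last candidate is rejected because it carries a version suffix, which makes it a Version identifier rather than a bare accession number.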

Algorithm - a fixed procedure embodied in a computer program. The Basic Local Alignment Search Tool or BLAST is a sequence comparison algorithm that NCBI uses to search sequence databases for optimal local alignments with a query sequence. FASTA is another type of algorithm used for database similarity searching.

Alpha helix - one of two types of protein secondary structure. An alpha helix is a tight helix that results from the hydrogen bonding of the carboxyl (CO) group of one amino acid to the amino (NH) group of another amino acid.

Amino acids – basic building blocks of proteins, 20 different types.

Base-pair – basic building block of double-stranded DNA. There are 4 different types of base. In DNA, usually A pairs with T and G pairs with C.

Bioinformatics – a discipline at the intersection of computer science, information technology, mathematics and biology. Bioinformatics encompasses the study of a broad range of biological data including gene maps, gene and protein sequences, protein structure/function relationships, genome organisation and gene expression profiles.

BLAST – BLAST (Basic Local Alignment Search Tool) is a set of programs designed to search a database of available sequences for those similar to a query. The BLAST programs may be used for either protein or DNA sequences, and the scores assigned have a well-defined statistical interpretation, making real matches easier to distinguish from random background hits.

cDNA or complementary DNA - DNA that is synthesized in the laboratory from a messenger RNA template.

CDS - The coding sequence or the portion of a nucleotide sequence that makes up the triplet codons that actually code for amino acids.

Chromosome – a discrete unit of the genome carrying many genes that has a specific morphology. Consists of DNA and proteins.

ClustalW – a general-purpose program for aligning multiple DNA or protein sequences.

Coding sequence – in double-stranded DNA, the sequence that dictates the protein produced.

Codon – three consecutive bases in a DNA (or RNA) sequence that together code for one amino acid.

Complementary strand – in double-stranded DNA, the strand that does not contain the “coding” sequence. It is “complementary” to the coding strand following the rules of base-pairing.
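
The base-pairing rules (A with T, G with C) make the complementary strand easy to derive. This is a minimal sketch, assuming an unambiguous DNA sequence containing only A, C, G and T; the function names are illustrative.

```python
# Watson-Crick pairing rules for DNA: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complementary strand written in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("ATGC"))          # TACG
    print(reverse_complement("ATGC"))  # GCAT
```

The reversal reflects the convention that both strands are written 5' to 3', while the two strands of a double helix run in opposite directions.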

Consensus sequence – idealised sequence in which each position represents the base/amino acid most often found when many sequences are compared.
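
As a rough illustration, a consensus can be built by taking the most frequent symbol in each column of a set of aligned sequences. This sketch assumes the sequences are already aligned to equal length; ties are broken by whichever symbol appears first in the column, which is an arbitrary choice made for this example.

```python
from collections import Counter
from typing import List


def consensus(aligned: List[str]) -> str:
    """Return the most common symbol at each position of equal-length sequences."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same length")
    # zip(*aligned) walks the alignment column by column.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*aligned))


if __name__ == "__main__":
    print(consensus(["ACGT", "ACGA", "TCGT"]))  # ACGT
```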

Conservation - when the substitution of one amino acid for another preserves the physicochemical properties of the original residue. For example, when a hydrophobic amino acid residue is replaced by another hydrophobic residue.

Conserved residue – at specific sites in sequence, non-identical bases/amino acids that belong to the same “class”. Can be grouped according to “biochemical functionality”, “chemical nature”, chemical charge or structural preferences.

Contig – a set of overlapping segments of DNA sequences. In DNA sequencing projects, a contig is a set of sequences (usually read from gels) that are related to one another by overlap of some part of their sequences. The readings can be summed to form a contiguous consensus sequence and the length of this sequence is the length of the contig.
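
The idea of joining reads by their overlap can be sketched as follows. This is a deliberately simplified illustration that assumes exact, error-free overlaps between the end of one read and the start of the next; real assemblers tolerate mismatches and handle many reads at once. The function name and the min_overlap parameter are hypothetical.

```python
from typing import Optional


def merge_overlap(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads if a suffix of left exactly equals a prefix of right.

    Returns the merged sequence, or None if no overlap of at least
    min_overlap bases exists.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest = min(len(left), len(right))
    # Try the longest possible overlap first.
    for size in range(longest, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None


if __name__ == "__main__":
    contig = merge_overlap("ACGTTG", "TTGCAA")
    print(contig, len(contig))  # ACGTTGCAA 9
```

The length of the merged sequence is the length of the contig, as described in the entry.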

DNA (deoxyribonucleic acid) – polymer of deoxyribonucleotides. Usually double-stranded, the specific sequence of bases in each strand encodes genetic information. Frequently used as hereditary material, the sequence must be passed on accurately from one generation to the next.

Domain - a discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function.

E value - the number of different alignments with a score equal to or better than S that can be expected to occur simply by chance. Also referred to as the expectation value.
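
The entry does not give a formula. For BLAST-style searches, the relation usually cited is the Karlin-Altschul form E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths and K and lambda depend on the scoring system. The sketch below assumes that form; the parameter values in the example are placeholders chosen for illustration, not values taken from any real scoring matrix.

```python
import math


def expected_hits(score: float, query_len: int, db_len: int,
                  k: float, lam: float) -> float:
    """E value under the Karlin-Altschul form: K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    # Placeholder parameters, for illustration only.
    for s in (20, 40, 60):
        print(s, expected_hits(s, query_len=250, db_len=1_000_000, k=0.1, lam=0.3))
```

The E value falls exponentially as the score rises, and grows in proportion to the size of the search space (m * n).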

Expressed sequence tag or EST - A short strand of DNA that is a part of a cDNA molecule and can act as an identifier of a gene. Used in locating and mapping genes.

Gap - A space introduced into an alignment to compensate for insertions or deletions in one sequence relative to another.

Gene – section of DNA which, in its entirety, codes for a functional RNA molecule.

Gene locus (pl. loci) - A gene's position on a chromosome or other chromosome marker; also, the DNA at that position. The use of locus is sometimes restricted to mean expressed DNA regions.

Gene name - Official name assigned to a gene. According to the Guidelines for Human Gene Nomenclature developed by the HUGO Gene Nomenclature Committee, it should be brief and describe the function of the gene.

Gene Ontology - A controlled vocabulary of terms relating to molecular function, biological process, or cellular components developed by the Gene Ontology Consortium. A controlled vocabulary allows scientists to use consistent terminology when describing the roles of genes and proteins in cells.

Gene symbol - Symbols for human genes are usually designated by scientists who discover the genes. The symbols are created using the Guidelines for Human Gene Nomenclature developed by the HUGO Gene Nomenclature Committee. Gene symbols usually consist of no more than six upper case letters or combination of uppercase letters and Arabic numbers. Gene symbols should start with the first letters of the gene name. For example, the gene symbol for insulin is "INS." A gene symbol must be submitted to HUGO for approval before it can be considered an official gene symbol.

Genome – the full complement of genetic information of an organism, present in every cell.

Genomics – bioinformatic studies of genomic DNA, which include genome mapping, gene sequencing and gene function. High-throughput technologies generate increasingly large amounts of data.

GI (GenBank) - A GI or "GenInfo Identifier" is a sequence identifier that can be assigned to a nucleotide sequence or protein translation. Each GI is a numeric value of one or more digits. The protein translation and the nucleotide sequence contained in the same record will each be assigned different GI numbers. Every time the sequence data for a particular record is changed, its version number increases and it receives a new GI. However, while each new version number is based upon the previous version number, a new GI for an altered sequence may be completely different from the previous GI. For example, in the GenBank record M12345, the original GI might be 7654321, but after a change in the sequence is submitted, the new GI for the changed sequence could be 10529376. Individuals can search for nucleotide sequences and protein translations by GI using the UID search field in the NCBI sequence databases.

Global alignment - when two nucleic acid or amino acid sequences are lined up along their entire length. See also local alignment.
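
One standard way to compute a global alignment is Needleman-Wunsch dynamic programming, which this glossary does not itself name. The sketch below returns only the best alignment score and assumes a simple linear scoring scheme; the match, mismatch and gap values are arbitrary illustrative defaults.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b (Needleman-Wunsch, score only)."""
    # prev[j] holds the best score for aligning the first i-1 symbols of a
    # with the first j symbols of b; row 0 is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap       # a[i-1] aligned against a gap
            gap_in_a = curr[j - 1] + gap   # b[j-1] aligned against a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "ACGT"))  # 4
    print(global_alignment_score("ACGT", "ACT"))   # 1
```

In the second call the sequences differ in length, so the best alignment needs one gap: three matches (+3) and one gap (-2).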

Homologues – genes/proteins that are related by descent from a common ancestor; they often have similar sequences or functions.

Homology - similarity in sequence that is based on descent from a common ancestor.

Identical residue – identical bases/amino acids at specific sites in sequence.

Identity - the extent to which two sequences are invariant, usually expressed as the percentage of aligned positions that are identical.
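
Identity is often reported as a percentage over an alignment. The sketch below assumes two already-aligned sequences of equal length, with "-" marking gaps, and counts only positions where neither sequence has a gap. Other conventions for the denominator exist, so this is one choice among several.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percentage of gap-free aligned positions at which a and b are identical."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # positions involving a gap are skipped
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    print(percent_identity("ACGT", "ACGA"))  # 75.0
```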

Ligand - A small molecule noncovalently bonded to a larger macromolecule.

Local alignment - the alignment of portions (rather than the entire sequence length) of two nucleic acid or amino acid sequences. See also global alignment.

Masking - the removal of repeated or low-complexity regions from a sequence so that they do not produce spurious matches when sequences are compared.

MIM number (also MIM#, OMIM number, or McKusick code) - The unique six-digit number assigned to each entry listed in the catalog of human genes and genetic disorders, Online Mendelian Inheritance in Man (OMIM). The first digit of a MIM number describes a gene's mode of inheritance as outlined in the table below:

Motif - A short, conserved pattern in a protein sequence or structure, often associated with a particular function. Unlike a domain, a motif is not assumed to fold independently of the rest of the protein. Some common types of motifs are made up of two or more alpha helices or beta sheets.

mRNA – RNA molecule that is a copy of a gene. It is used as the template in protein synthesis.

Open Reading Frame (ORF) – series of codons containing information to synthesise one protein.

Orthologous - homologous sequences in different species that result from a common ancestral gene during speciation. Orthologous genes may or may not have similar functions.

Paralogous - homologous sequences within a single species that are the result of gene duplication.

Phylogenetic tree – prediction of evolutionary relationships between sequences of interest. The length of each pair of branches is an indication of the evolutionary distance between sequence pairs.

Polypeptide chain - a chain of amino acids joined by peptide bonds. A polypeptide chain usually consists of 100 or fewer amino acids. A protein is made up of one or several polypeptide chains.

Primary structure - the amino acid sequence of a polypeptide chain. Of the four levels of protein structure, this is the most basic.

Protein (polypeptide) – polymer of amino acids. The specific sequence of amino acids causes the protein to favour specific conformations that are required for biological function.

Protein ID (GenBank) - The Protein ID is an identification number assigned to the amino acid sequence data included within a sequence record. This sequence identifier uses the accession.version format. Each Protein ID is made up of three letters followed by five digits, a period, and a version number. For example, in a sequence record M12345, the Protein ID for the sequence translation could be AAA35650.1. If the protein sequence data changes in any way (even by just one amino acid), the version number in the Protein ID will be increased by one, while the accession number base remains constant. For example, AAA12345.1 would become AAA12345.2. Each amino acid sequence change also results in the assignment of a new GI number to the altered protein translation.

Proteomics – recently developed research area that uses a range of bioinformatics approaches to analyse the expression (and function) of proteins within specific systems/cells/organisms. As for genomics, high-throughput technology generates large amounts of data.

Quaternary structure - the interconnection and arrangement of polypeptide chains within a protein. Only proteins with more than one polypeptide chain can have quaternary structure.

Query - the input sequence (in FASTA format or as bare sequence data) or sequence identifier with which all the sequences in a database are compared during a BLAST search.

Reading Frame – one of 3 possible ways of translating a (single-stranded) nucleotide sequence into codons.
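
The three reading frames of a single strand differ only in where the first codon starts. This is a minimal sketch, using hypothetical names, that splits a sequence into complete codons for a frame offset of 0, 1 or 2; leftover bases at the end are dropped.

```python
from typing import List


def codons_in_frame(seq: str, frame: int) -> List[str]:
    """Split seq into complete codons, starting at offset frame (0, 1 or 2)."""
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1 or 2")
    # Stop at len(seq) - 2 so that every slice is a full three-base codon.
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


if __name__ == "__main__":
    for frame in (0, 1, 2):
        print(frame, codons_in_frame("ATGCCGTA", frame))
```

For the example sequence ATGCCGTA, the three frames give ATG CCG, then TGC CGT, then GCC GTA.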

RNA (ribonucleic acid) – polymer of a specific sequence of ribonucleotides. Three main types (mRNA, rRNA, tRNA) are used in different functions within the cell.

Secondary structure - the folded, coiled, or twisted shape of a polypeptide that results from hydrogen bonding between parts of the molecule. There are two types of secondary structure: the alpha helix and the beta pleated sheet.

Sequence tagged site or STS - Short (200 to 500 base pairs) DNA sequence that has a single occurrence in the human genome and whose location and base sequence are known. Detectable by polymerase chain reaction, STSs are useful for localizing and orienting the mapping and sequence data reported from many different laboratories and serve as landmarks for developing physical maps of the human genome. Expressed sequence tags (ESTs) are STSs derived from cDNAs.

Similarity - how related one nucleotide or protein sequence is to another. The extent of similarity between two sequences is based on the percent of sequence identity and/or conservation.

Tertiary structure - the three-dimensional structure of a polypeptide chain that results from the way that the alpha helices and beta pleated sheets are folded and arranged.

Version (GenBank) - Similar to the Protein ID for protein sequences, the version is a nucleotide sequence identification number assigned to each GenBank sequence. The format for this sequence identifier is accession.version (e.g., M12345.1). Whenever the author of a particular sequence record changes the sequence data in any way (even if just a single nucleotide is altered), the version number will be increased by one, while the accession number base remains constant. For example, M12345.1 would become M12345.2. Each sequence change also results in the assignment of a new GI number. Whenever an individual searches an NCBI sequence database, only the most recent version of a record is retrieved.
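
The accession.version format can be split and incremented mechanically. The sketch below uses hypothetical helper names and only manipulates the identifier string; it does not query any NCBI service.

```python
from typing import Tuple


def split_version(identifier: str) -> Tuple[str, int]:
    """Split 'accession.version' into (accession, version number)."""
    accession, separator, version = identifier.rpartition(".")
    if not separator or not accession or not version.isdigit():
        raise ValueError(f"not in accession.version format: {identifier!r}")
    return accession, int(version)


def next_version(identifier: str) -> str:
    """Return the identifier after one sequence change: same accession, version + 1."""
    accession, version = split_version(identifier)
    return f"{accession}.{version + 1}"


if __name__ == "__main__":
    print(split_version("M12345.1"))   # ('M12345', 1)
    print(next_version("M12345.1"))    # M12345.2
```

The same helpers apply to Protein IDs, which use the same accession.version format.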