The emerging field of bioinformatics merges computer science and biology in an attempt to make sense of the torrent of data from human genomes and proteomes.
Sequencing the human genome has been one of the crowning achievements of modern biology, not only because of its sheer scale but also because of its potential impact on our understanding of human evolution, physiology, and disease. And yet, unraveling the sequence of bases was the “easy” part. Now comes the hard part of figuring out the meaning of this sequence of 3 billion A’s, G’s, C’s, and T’s.
The prospect of analysing such a vast torrent of data has given rise to a new discipline, called bioinformatics, which merges computer science and biology in an attempt to make sense of it all. For example, computer programmes that scan DNA for stretches that could code for amino acid sequences are used to estimate the number of protein-coding genes.
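To give a flavour of what such a programme does, here is a minimal sketch in Python that scans both strands of a DNA string for open reading frames, that is, runs of codons beginning with ATG and ending at a stop codon. The function names and the `min_codons` cut-off are illustrative assumptions, not taken from any real gene-finding tool, and real gene prediction in human DNA is far harder because genes are interrupted by non-coding introns.

```python
# Toy open-reading-frame (ORF) scanner. All names here are hypothetical.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna: str, min_codons: int = 30):
    """Return (strand, start, end, sequence) for each ORF found.

    An ORF here is an ATG followed, in the same reading frame, by at least
    `min_codons` codons (counting the ATG) before the first stop codon.
    Coordinates on the "-" strand refer to the reverse-complemented string.
    """
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None  # index of the ATG that opened the current ORF
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    n_codons = (i - start) // 3  # codons before the stop
                    if n_codons >= min_codons:
                        results.append((strand, start, i + 3, seq[start:i + 3]))
                    start = None
    return results


# A made-up 21-base example: one short ORF on the forward strand.
example = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"
print(find_orfs(example, min_codons=3))
```

Counting the stretches that pass such a filter gives only a crude first estimate; the cut-off length trades missed short genes against false positives.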
Such analyses suggest the presence of about 30,000 protein-coding genes in the human genome, half of which were previously unknown. The fascinating thing about this estimate is that it means humans may have only about twice as many genes as worms or flies! Computer analysis has also revealed that only about one to two per cent of the human genome actually codes for proteins.
While the remaining DNA contains some important regulatory elements as well as some genes that code for RNA products instead of proteins, most of it appears to consist of “junk” DNA with no apparent function.
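The figures quoted above can be combined in a quick back-of-the-envelope calculation. The sketch below uses only the article's round numbers (3 billion bases, one to two per cent coding, about 30,000 genes); the per-gene averages it prints are rough arithmetic, not measured values, and the function name is an illustrative assumption.

```python
# Rough arithmetic from the article's round numbers only.
GENOME_BASES = 3_000_000_000   # "3 billion A's, G's, C's, and T's"
GENE_COUNT = 30_000            # estimated protein-coding genes


def coding_estimate(coding_percent: int):
    """Return (coding bases, average coding bases per gene, average codons per gene)."""
    coding_bases = GENOME_BASES * coding_percent // 100
    bases_per_gene = coding_bases // GENE_COUNT
    codons_per_gene = bases_per_gene // 3  # three bases specify one amino acid
    return coding_bases, bases_per_gene, codons_per_gene


for percent in (1, 2):
    total, per_gene, codons = coding_estimate(percent)
    print(f"{percent}% coding: {total:,} bases, "
          f"about {per_gene:,} bases or {codons} amino acids per gene")
```

In other words, on these assumptions only some 30 to 60 million of the 3 billion bases spell out proteins.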
Because the function of most genes is to produce proteins, which are responsible for most cellular functions, scientists are now looking beyond the genome to study the proteome: the structure and properties of every protein produced by a genome. The complexity of an organism’s proteome is considerably greater than that of its genome. For example, the roughly 30,000 genes found in human cells are thought to produce somewhere between 200,000 and a million or more proteins. How can cells produce so many proteins from a much smaller number of genes?
In essence, this diversity reflects the fact that an individual gene can be “read” in multiple ways to produce multiple versions of its protein product. The resulting proteins are also subject to biochemical modifications that can significantly alter their structural and functional properties.
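A toy model shows how quickly these two effects multiply. The sketch below assumes that "reading a gene in multiple ways" means including or skipping optional coding segments (exons), as in alternative splicing, and that each modification site is simply either modified or not. Both are deliberate simplifications, and the exon names and function names are hypothetical.

```python
from itertools import product


def splice_isoforms(exons, optional):
    """Enumerate transcripts made by keeping or skipping each optional exon.

    `exons` is the ordered list of exon names; `optional` is the set of
    names that may be skipped. Every other exon is always kept.
    """
    optional_in_order = [e for e in exons if e in optional]
    isoforms = []
    for choices in product([True, False], repeat=len(optional_in_order)):
        keep = dict(zip(optional_in_order, choices))
        isoforms.append(tuple(e for e in exons if keep.get(e, True)))
    return isoforms


def count_protein_forms(n_optional_exons: int, n_modification_sites: int) -> int:
    """Upper bound on distinct forms if every choice were independent."""
    return 2 ** n_optional_exons * 2 ** n_modification_sites


# Hypothetical gene with four exons, two of which can be skipped.
for isoform in splice_isoforms(["E1", "E2", "E3", "E4"], {"E2", "E3"}):
    print("-".join(isoform))

print(count_protein_forms(2, 3))  # two optional exons, three modification sites
```

Even this small hypothetical gene yields four transcripts, and three on/off modification sites raise the ceiling to 32 distinct protein forms. Real cells use only a subset of the combinations, but the multiplication explains how tens of thousands of genes can give rise to hundreds of thousands of proteins.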
Identifying the vast number of proteins produced by a genome has been facilitated by mass spectrometry, a high-speed, extremely sensitive technique that utilises magnetic and electric fields to separate proteins or protein fragments based on differences in mass and charge. One application of mass spectrometry has been to identify the peptides derived from proteins that have been separated by gel electrophoresis and then digested with specific proteases, such as trypsin.
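The digestion step is predictable enough to simulate. The sketch below assumes the commonly used rule that trypsin cuts after lysine (K) or arginine (R) unless the next residue is proline (P), and it sums standard monoisotopic residue masses to predict the mass of each uncharged peptide. The function names, the example sequence, and the 0.01 Da tolerance are illustrative assumptions; real instruments report mass-to-charge ratios, and real search software also handles missed cleavages and modifications.

```python
# Monoisotopic residue masses in daltons (standard values, rounded to 5 places).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free ends


def tryptic_digest(protein: str):
    """Split a protein after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R
    return peptides


def peptide_mass(peptide: str) -> float:
    """Predicted monoisotopic mass of the neutral peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def match_peaks(observed_masses, peptides, tolerance=0.01):
    """Pair each observed mass with any predicted peptide within `tolerance` Da."""
    return [
        (mass, pep)
        for mass in observed_masses
        for pep in peptides
        if abs(peptide_mass(pep) - mass) <= tolerance
    ]


# Made-up sequence and made-up "observed" mass, for illustration only.
fragments = tryptic_digest("AKRPGK")
print(fragments)
print(match_peaks([217.143], fragments))
```

Running the same digest over every protein predicted from a genomic database yields the table of expected masses against which a measured spectrum can be searched.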
By comparing the resulting data to the predicted masses of peptides that would be produced by DNA sequences present in genomic databases, the proteins produced by newly discovered genes can be identified. Other techniques make it feasible to study the interactions and functional properties of the vast number of proteins found in a proteome.
For example, it is possible to immobilise thousands of different proteins (or other molecules that bind to specific proteins) as tiny spots on a piece of glass smaller than a microscope slide. The resulting protein microarrays can then be used to study a variety of protein properties, such as the ability of each individual spot to bind to other molecules added to the surrounding solution.