biology
Molecular Basis of Inheritance
A concise summary of the chapter covering key points, diagrams, and facts related to DNA, RNA, genetic material, replication, transcription, translation, genetic code, gene expression, Human Genome Project, and DNA fingerprinting.
The DNA
The nature of the genetic material was investigated for over a hundred years, culminating in the realisation that DNA (deoxyribonucleic acid) is the genetic material for the majority of organisms. Nucleic acids, like DNA and RNA, are polymers of nucleotides. While DNA primarily acts as the genetic material, RNA functions mostly as a messenger, and also as an adapter, structural, and sometimes catalytic molecule [2, 3].
DNA is a long polymer of deoxyribonucleotides. Its length is usually expressed as the number of nucleotides, or of nucleotide pairs (base pairs, bp), that it contains. For example:
- Bacteriophage φX174 has 5386 nucleotides.
- Bacteriophage lambda has 48502 base pairs (bp).
- Escherichia coli has 4.6 × 10^6 bp.
- The haploid human genome has 3.3 × 10^9 bp.
Structure of Polynucleotide Chain
A nucleotide has three components :
- A nitrogenous base
- A pentose sugar (ribose for RNA, deoxyribose for DNA)
- A phosphate group
There are two types of nitrogenous bases :
- Purines: Adenine (A) and Guanine (G)
- Pyrimidines: Cytosine (C), Uracil (U), and Thymine (T)
Key differences in bases between DNA and RNA:
- Cytosine is common to both DNA and RNA.
- Thymine is present in DNA.
- Uracil is present in RNA in place of Thymine.
A nitrogenous base links to the 1’ C of a pentose sugar via an N-glycosidic linkage to form a nucleoside (e.g., adenosine, deoxyadenosine). When a phosphate group links to the -OH on the 5’ C of a nucleoside via a phosphoester linkage, a nucleotide is formed. Two nucleotides are linked by a 3’-5’ phosphodiester linkage to form a dinucleotide, and more join in the same way to form a polynucleotide chain.
A polynucleotide chain has a 5’-end with a free phosphate moiety and a 3’-end with a free -OH group on the sugar. The backbone of the chain is formed by sugars and phosphates, and the nitrogenous bases, linked to the sugars, project from it.
In RNA, every nucleotide residue has an additional -OH group at the 2’-position in the ribose, and Uracil is found instead of Thymine.
Historical Discoveries:
- Friedrich Miescher (1869): First identified DNA as an acidic substance in the nucleus, naming it ‘Nuclein’.
- James Watson and Francis Crick (1953): Proposed the Double Helix model for DNA structure, based on X-ray diffraction data from Maurice Wilkins and Rosalind Franklin [8, 9].
- Erwin Chargaff’s observation was key: for double-stranded DNA, the ratios of Adenine to Thymine (A:T) and Guanine to Cytosine (G:C) are constant and equal to one. This implies base pairing.
Base pairing makes the two polynucleotide chains complementary to each other. This means if the sequence of one strand is known, the other can be predicted. This also explains how DNA acts as a template for replication, producing identical daughter DNA molecules and clarifying its genetic implications.
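The predictability of one strand from the other can be sketched in a few lines of Python. This is only an illustration: the function names and the example sequence are made up here, and strands are assumed to be written 5’→3’ and to run anti-parallel to each other.

```python
# Pairing rules: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complementary_strand(strand_5_to_3: str) -> str:
    """Return the partner strand, also written 5'->3'."""
    # Pair every base, reading the input backwards because the
    # two strands are assumed to be anti-parallel.
    return "".join(PAIRS[base] for base in reversed(strand_5_to_3))

def chargaff_ratios(strand_5_to_3: str) -> tuple:
    """A:T and G:C ratios counted over both strands of the duplex."""
    duplex = strand_5_to_3 + complementary_strand(strand_5_to_3)
    return (duplex.count("A") / duplex.count("T"),
            duplex.count("G") / duplex.count("C"))

print(complementary_strand("ATGCGT"))  # ACGCAT
print(chargaff_ratios("ATGCGT"))       # (1.0, 1.0)
```

Because every A on one strand faces a T on the other (and every G faces a C), both ratios come out as one for any duplex, which is Chargaff's observation.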
Salient Features of the Double-helix structure of DNA
The double helix structure has several key features:
- It consists of two polynucleotide chains. The backbone is made of sugar-phosphate, and the bases project inside.
- The two chains have anti-parallel polarity. If one chain has 5’→3’ polarity, the other has 3’→5’.
- Bases in the two strands are paired via hydrogen bonds (H-bonds), forming base pairs (bp).
- Adenine forms two H-bonds with Thymine.
- Guanine forms three H-bonds with Cytosine.
- A purine always pairs with a pyrimidine, maintaining a uniform distance between the two strands.
- The two chains are coiled in a right-handed fashion.
- The pitch of the helix is 3.4 nm, with roughly 10 bp in each turn.
- The distance between adjacent base pairs is therefore approximately 0.34 nm.
- The plane of one base pair stacks over the other, which, along with H-bonds, confers stability to the helical structure.
Central Dogma
Francis Crick proposed the Central Dogma in molecular biology, which states that genetic information flows from DNA → RNA → Protein. In some viruses, the flow of information is reversed, from RNA to DNA.
Packaging of DNA Helix
The length of DNA in a typical mammalian cell is approximately 2.2 metres (6.6 × 10^9 bp × 0.34 nm/bp), which is far greater than the dimension of a typical nucleus (approx. 10^-6 m). This long polymer must be efficiently packaged.
- In Prokaryotes (e.g., E. coli): They lack a defined nucleus. The negatively charged DNA is held with positively charged proteins in a region called the ‘nucleoid’. The DNA in the nucleoid is organized into large loops held by proteins.
- In Eukaryotes: The organization is more complex.
- There are positively charged, basic proteins called histones . Histones are rich in basic amino acid residues like lysine and arginine, which carry positive charges .
- Histones organize to form a unit of eight molecules called a histone octamer .
- The negatively charged DNA wraps around the positively charged histone octamer to form a structure called a nucleosome . A typical nucleosome contains 200 bp of DNA helix .
- Nucleosomes are the repeating units of chromatin, the thread-like stained material seen in the nucleus.
- When viewed under an electron microscope (EM), nucleosomes in chromatin appear as a ‘beads-on-string’ structure .
- The ‘beads-on-string’ structure in chromatin is further packaged to form chromatin fibers, which are then coiled and condensed at the metaphase stage of cell division to form chromosomes .
- Higher-level packaging requires additional proteins called Non-histone Chromosomal (NHC) proteins [18, 19].
- Euchromatin refers to loosely packed chromatin regions that stain light and are transcriptionally active .
- Heterochromatin refers to more densely packed chromatin regions that stain dark and are inactive .
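The 2.2 m figure quoted at the start of this section, and the scale of the packaging problem, follow from simple arithmetic. The sketch below uses only numbers given in these notes (0.34 nm per bp, 200 bp per nucleosome); the function names are illustrative, and the E. coli length and nucleosome count are derived here rather than quoted from the chapter.

```python
RISE_PER_BP_NM = 0.34    # distance between adjacent base pairs
BP_PER_NUCLEOSOME = 200  # DNA helix in a typical nucleosome

def dna_length_m(base_pairs: float) -> float:
    """Contour length, in metres, of a double helix with this many bp."""
    return base_pairs * RISE_PER_BP_NM * 1e-9

def nucleosome_count(base_pairs: float) -> float:
    """Rough number of nucleosomes needed to package this many bp."""
    return base_pairs / BP_PER_NUCLEOSOME

print(f"{dna_length_m(6.6e9):.2f} m")         # 2.24 m  (diploid mammalian cell)
print(f"{dna_length_m(4.6e6) * 1e3:.2f} mm")  # 1.56 mm (E. coli)
print(f"{nucleosome_count(6.6e9):.1e}")       # 3.3e+07 nucleosomes
```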
The Search for Genetic Material
While DNA was discovered around the same time as Mendel’s principles of inheritance, proving that DNA is the genetic material took much longer [19, 20]. By 1926, the search for genetic inheritance mechanisms reached the molecular level, narrowed down to chromosomes, but the specific molecule remained unknown .
Transforming Principle (Frederick Griffith, 1928)
Frederick Griffith experimented with Streptococcus pneumoniae, the bacterium responsible for pneumonia [20, 21].
- Some bacteria produce smooth (S) shiny colonies (due to a mucous polysaccharide coat) and are virulent .
- Others produce rough (R) colonies (lacking the coat) and are non-virulent .
- Mice infected with the S strain died, while those infected with the R strain did not [21, 22].
- He observed that heat-killed S strain bacteria did not kill mice .
- However, when he injected a mixture of heat-killed S and live R bacteria, the mice died, and he recovered living S bacteria from the dead mice .
- Griffith concluded that the R strain bacteria were somehow “transformed” by the heat-killed S strain. A ‘transforming principle’ from the heat-killed S strain enabled the R strain to synthesize a smooth polysaccharide coat and become virulent, suggesting the transfer of genetic material [22, 23]. However, the biochemical nature of this principle was not defined .
Biochemical Characterisation of Transforming Principle (Oswald Avery, Colin MacLeod, Maclyn McCarty, 1933-44)
Before their work, protein was widely believed to be the genetic material . They aimed to determine the biochemical nature of Griffith’s ‘transforming principle’ .
- They purified biochemicals (proteins, DNA, RNA) from heat-killed S cells .
- They found that DNA alone from S bacteria could transform live R cells into S cells .
- They used enzymes to digest specific molecules:
- Protein-digesting enzymes (proteases) and RNA-digesting enzymes (RNases) did not affect transformation, indicating the substance was not protein or RNA .
- Digestion with DNase inhibited transformation, strongly suggesting that DNA caused the transformation .
- They concluded that DNA is the hereditary material, though not all biologists were immediately convinced .
The Genetic Material is DNA (Alfred Hershey and Martha Chase, 1952)
This experiment provided unequivocal proof that DNA is the genetic material .
- They worked with bacteriophages (viruses that infect bacteria) .
- Bacteriophages attach to bacteria and inject their genetic material, making the bacterial cell manufacture more virus particles [25, 26]. Hershey and Chase wanted to know if the injected material was protein or DNA .
- They grew some viruses on a medium with radioactive phosphorus (32P). Since DNA contains phosphorus but protein does not, these viruses had radioactive DNA .
- They grew other viruses on a medium with radioactive sulfur (35S). Since protein contains sulfur but DNA does not, these viruses had radioactive protein .
- Experiment Steps:
- Radioactive phages were allowed to attach to E. coli bacteria .
- The viral coats were then removed by agitation in a blender .
- Virus particles were separated from bacteria by centrifugation .
- Results:
- Bacteria infected with viruses containing radioactive DNA (32P) became radioactive, indicating DNA entered the bacteria [27, 28].
- Bacteria infected with viruses containing radioactive proteins (35S) were not radioactive, indicating proteins did not enter .
- Conclusion: DNA is the genetic material that is passed from virus to bacteria .
Properties of Genetic Material (DNA versus RNA)
While DNA is the predominant genetic material, RNA also functions as genetic material in some viruses (e.g., Tobacco Mosaic virus) . A molecule acting as genetic material must fulfill specific criteria :
- Ability to generate its replica (Replication): Both DNA and RNA can direct their own duplication due to base pairing and complementarity [30, 31]. Proteins cannot fulfill this .
- Chemical and structural stability:
- RNA is less stable. The 2’-OH group in every nucleotide residue makes RNA labile and easily degradable . RNA can also be catalytic, making it reactive .
- DNA is chemically less reactive and structurally more stable . Its double-stranded nature and repair processes resist changes . The presence of Thymine instead of Uracil also adds stability .
- Scope for slow changes (mutation) for evolution: Both DNA and RNA can mutate [30, 34]. RNA, being unstable, mutates at a faster rate, explaining why viruses with RNA genomes evolve faster .
- Ability to express ‘Mendelian Characters’:
- RNA can directly code for protein synthesis, thus easily expressing characters .
- DNA is dependent on RNA for protein synthesis . The protein-synthesizing machinery has evolved around RNA .
Conclusion: Both can function as genetic material, but DNA is preferred for storage of genetic information due to its greater stability. RNA is better for transmission of genetic information .
RNA World
Evidence suggests that RNA was the first genetic material . Essential life processes like metabolism, translation, and splicing evolved around RNA . RNA acted as both a genetic material and a catalyst (ribozyme) for important biochemical reactions [33, 36]. However, RNA’s catalytic nature also made it reactive and unstable . Therefore, DNA evolved from RNA through chemical modifications that made it more stable, including its double-stranded structure and the evolution of repair mechanisms .
Replication
When proposing the double helix structure of DNA, Watson and Crick immediately suggested a scheme for DNA replication [33, 37]. This scheme, termed semiconservative DNA replication, stated :
- The two strands of DNA would separate and each act as a template for the synthesis of a new complementary strand.
- After replication, each new DNA molecule would consist of one parental strand and one newly synthesized strand.
The Experimental Proof (Matthew Meselson and Franklin Stahl, 1958)
This experiment provided definitive proof for semiconservative DNA replication, first in Escherichia coli and later in higher organisms .
- They grew E. coli for many generations in a medium containing 15NH4Cl as the only nitrogen source (15N is the heavy isotope of nitrogen), so 15N was incorporated into their DNA, making it “heavy” [38, 39]. This heavy DNA could be distinguished from normal DNA by centrifugation in a cesium chloride (CsCl) density gradient.
- They then transferred the cells to a medium with normal nitrogen (14NH4Cl) and took samples at definite time intervals [39, 40].
- Results:
- After one generation (20 minutes) in 14N medium, the extracted DNA had a hybrid or intermediate density . This showed that new DNA contained both heavy (parental) and light (newly synthesized) nitrogen.
- After another generation (40 minutes, II generation), the DNA was composed of equal amounts of hybrid DNA and ‘light’ DNA .
- This confirmed the semiconservative model .
- Similar experiments by Taylor and colleagues (1958) using radioactive thymidine on Vicia faba also proved semiconservative replication in chromosomes .
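The banding pattern in the Meselson and Stahl experiment can be reproduced with a small simulation. This is a minimal sketch under the semiconservative model only: each duplex is represented as a pair of strand labels, and the names used are illustrative.

```python
from collections import Counter

def replicate(molecules):
    """Semiconservative step: each old strand pairs with a new 14N strand."""
    daughters = []
    for strand_a, strand_b in molecules:
        daughters.append((strand_a, "14N"))
        daughters.append((strand_b, "14N"))
    return daughters

def density_class(duplex):
    """Classify a duplex by how many of its strands carry 15N."""
    return {2: "heavy", 1: "hybrid", 0: "light"}[duplex.count("15N")]

def band_fractions(generations):
    """Fraction of DNA in each density band after growth in 14N medium."""
    molecules = [("15N", "15N")]  # start with fully heavy DNA
    for _ in range(generations):
        molecules = replicate(molecules)
    counts = Counter(density_class(m) for m in molecules)
    return {band: n / len(molecules) for band, n in counts.items()}

print(band_fractions(1))  # {'hybrid': 1.0}
print(band_fractions(2))  # {'hybrid': 0.5, 'light': 0.5}
```

The output matches the observations listed above: all hybrid after one generation, and equal amounts of hybrid and light DNA after two.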
The Machinery and the Enzymes
DNA replication in living cells requires a set of catalysts (enzymes) .
- The main enzyme is DNA-dependent DNA polymerase, which uses a DNA template to catalyze the polymerization of deoxynucleotides .
- These enzymes are highly efficient; E. coli replicates its 4.6 × 10^6 bp genome in about 18 minutes, an average rate of approximately 2000 bp per second [42, 43].
- They also must be highly accurate to prevent mutations .
- Deoxyribonucleoside triphosphates serve a dual purpose: as substrates and as a source of energy for polymerization .
- Many additional enzymes are needed for accurate replication .
- Due to high energy requirements, the two DNA strands do not separate entirely. Replication occurs within a small opening of the DNA helix called a replication fork .
- DNA-dependent DNA polymerases can only catalyze polymerization in one direction: 5’→3’ .
- This leads to continuous replication on the template strand with 3’→5’ polarity .
- On the other template strand (5’→3’ polarity), replication is discontinuous, producing fragments that are later joined by the enzyme DNA ligase .
- Replication does not initiate randomly. There are definite regions called origin of replication where the process begins [45, 46].
- In eukaryotes, DNA replication occurs during the S-phase of the cell cycle, requiring tight coordination with cell division [46, 47].
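The polymerization rate quoted above for E. coli can be checked with a short calculation. Dividing the genome size by 18 minutes gives a little over 4000 bp per second for the cell as a whole. The figure of roughly 2000 bp per second is recovered if two replication forks are assumed to work at once, which is an assumption added here and not something these notes state.

```python
GENOME_BP = 4.6e6          # E. coli genome size, from the text
REPLICATION_TIME_MIN = 18  # replication time, from the text

def overall_rate_bp_per_s() -> float:
    """Whole-genome rate: total bp divided by total seconds."""
    return GENOME_BP / (REPLICATION_TIME_MIN * 60)

def per_fork_rate_bp_per_s(forks: int = 2) -> float:
    """Rate per fork; forks=2 is an assumption, not taken from the text."""
    return overall_rate_bp_per_s() / forks

print(round(overall_rate_bp_per_s()))   # 4259
print(round(per_fork_rate_bp_per_s()))  # 2130
```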
Transcription
Transcription is the process of copying genetic information from one strand of DNA into RNA .
- The principle of complementarity governs this process, except that in the RNA adenine pairs with uracil instead of thymine.
- Unlike replication, where the entire DNA is duplicated, in transcription, only a segment of DNA and only one of its strands is copied into RNA .
Why only one strand is copied during transcription:
- If both strands acted as templates, they would code for RNA molecules with different sequences, which would then lead to different protein sequences. This would complicate the genetic information transfer machinery, as one DNA segment would code for two different proteins .
- If two RNA molecules were produced simultaneously from complementary strands, they would also be complementary to each other and form a double-stranded RNA. This would prevent the RNA from being translated into protein, rendering the transcription process futile .
Transcription Unit
A transcription unit in DNA is defined by three regions :
- A Promoter
- The Structural gene
- A Terminator
Strand Definition Convention:
- Since RNA polymerase synthesizes RNA in the 5’→3’ direction, the DNA strand with 3’→5’ polarity acts as the template and is called the template strand.
- The other DNA strand, with 5’→3’ polarity, has a sequence identical to the RNA (except for thymine instead of uracil) and is displaced during transcription. This strand is called the coding strand. All reference points for defining a transcription unit are made with respect to the coding strand.
- The promoter is located towards the 5’-end (upstream) of the structural gene (with respect to the coding strand polarity). It is a DNA sequence that provides a binding site for RNA polymerase and defines which strand is the template and which is the coding strand.
- The terminator is located towards the 3’-end (downstream) of the coding strand and usually defines the end of transcription. Additional regulatory sequences may be present further upstream or downstream to the promoter.
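The relationship between template strand, coding strand, and RNA can be made concrete with a short sketch. The function names and the eight-base example are illustrative only; the template is assumed to be written 3’→5’ and the coding strand 5’→3’.

```python
# Complementarity during transcription: A in the template pairs with U in RNA.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe_from_template(template_3_to_5: str) -> str:
    """RNA (5'->3') made by pairing against a template written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5)

def transcribe_from_coding(coding_5_to_3: str) -> str:
    """Same RNA, read off the coding strand by swapping T for U."""
    return coding_5_to_3.replace("T", "U")

template = "ATGCATGC"  # 3'->5'
coding = "TACGTACG"    # 5'->3', complementary to the template

print(transcribe_from_template(template))  # UACGUACG
print(transcribe_from_coding(coding))      # UACGUACG
```

Both routes give the same RNA, which is why the coding strand, although it is not copied, is used as the reference for describing a transcription unit.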
Transcription Unit and the Gene
A gene is defined as the functional unit of inheritance . While genes are on DNA, their definition in terms of DNA sequence can be complex .
- DNA sequences coding for tRNA or rRNA molecules also define a gene .
- A cistron is defined as a segment of DNA coding for a polypeptide .
- Structural genes can be:
- Monocistronic: coding for a single polypeptide (mostly in eukaryotes) .
- Polycistronic: coding for multiple polypeptides (mostly in bacteria or prokaryotes) .
- In eukaryotes, monocistronic structural genes have interrupted coding sequences, meaning the genes are split .
- Exons: The coding sequences, or expressed sequences, that appear in mature or processed RNA .
- Introns: Intervening sequences that do not appear in mature or processed RNA .
- Regulatory sequences like promoters can also affect inheritance of a character and are sometimes loosely called regulatory genes, even though they don’t code for RNA or protein .
Types of RNA and the process of Transcription
In Bacteria: There are three major types of RNAs, all needed for protein synthesis :
- mRNA (messenger RNA): Provides the template for protein synthesis [55, 56].
- tRNA (transfer RNA): Brings amino acids and reads the genetic code.
- rRNA (ribosomal RNA): Plays structural and catalytic roles during translation.
- A single DNA-dependent RNA polymerase catalyzes the transcription of all types of RNA in bacteria.
- Process of Transcription in Bacteria:
- Initiation: RNA polymerase binds to the promoter .
- Elongation: It uses nucleoside triphosphates as substrates and polymerizes RNA in a template-dependent fashion (5’→3’), following complementarity rules. It also facilitates the opening of the DNA helix [56, 57]. Only a short stretch of RNA remains bound to the enzyme .
- Termination: Once the polymerase reaches the terminator region, the nascent RNA falls off, and so does the RNA polymerase .
- RNA polymerase itself is only capable of elongation. It transiently associates with an initiation-factor (σ) to start transcription and a termination-factor (ρ) to end it. These factors alter the enzyme’s specificity.
- In bacteria, transcription and translation can be coupled because mRNA does not require processing, and there is no separation between cytosol and nucleus [58, 59]. Translation can begin before mRNA is fully transcribed.
In Eukaryotes: Transcription is more complex with two additional features :
- Multiple RNA polymerases in the nucleus:
- RNA polymerase I: Transcribes rRNAs (28S, 18S, and 5.8S).
- RNA polymerase II: Transcribes the precursor of mRNA, called heterogeneous nuclear RNA (hnRNA).
- RNA polymerase III: Transcribes tRNA, 5S rRNA, and snRNAs (small nuclear RNAs).
- Post-transcriptional Processing of hnRNA:
- The primary transcripts (hnRNA) contain both exons and introns and are non-functional .
- They undergo splicing, where introns are removed, and exons are joined in a defined order .
- hnRNA also undergoes capping (addition of an unusual nucleotide, methyl guanosine triphosphate, to the 5’-end) and tailing (addition of 200-300 adenylate residues to the 3’-end in a template-independent manner) .
- The fully processed hnRNA, now called mRNA, is then transported out of the nucleus for translation .
- The split-gene arrangement (presence of introns) is thought to be an ancient feature, and splicing represents the dominance of the RNA-world .
Genetic Code
Transferring genetic information from a polymer of nucleotides to synthesize a polymer of amino acids (proteins) is complex because no direct complementarity exists between nucleotides and amino acids . However, evidence showed that changes in nucleic acids (genetic material) caused changes in amino acids in proteins . This led to the proposition of a genetic code .
- George Gamow, a physicist, theorized that since there are only 4 bases and 20 amino acids, the code must be a combination of bases. He proposed a triplet code (4^3 = 64 codons, more than enough for 20 amino acids) .
- Deciphering the Code:
- Har Gobind Khorana developed chemical methods to synthesize RNA molecules with defined base combinations .
- Marshall Nirenberg’s cell-free system for protein synthesis helped decipher the code .
- The Severo Ochoa enzyme (polynucleotide phosphorylase) also helped, by polymerizing RNA with defined sequences in a template-independent manner.
- Finally, a checker-board for the genetic code was prepared .
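Gamow's counting argument can be checked directly: with 4 bases, a code word of length n allows 4^n combinations, and the code needs at least 20. The snippet below is a small sketch with illustrative names.

```python
from itertools import product

BASES = "UCAG"
AMINO_ACIDS_NEEDED = 20

def codon_count(word_length: int) -> int:
    """Number of distinct code words of this length over four bases."""
    return len(BASES) ** word_length

def shortest_sufficient_length() -> int:
    """Smallest word length that can encode all 20 amino acids."""
    n = 1
    while codon_count(n) < AMINO_ACIDS_NEEDED:
        n += 1
    return n

for n in (1, 2, 3):
    print(n, codon_count(n))         # 1 4 / 2 16 / 3 64

print(shortest_sufficient_length())  # 3

# Enumerating the triplets explicitly gives the same count.
triplets = ["".join(t) for t in product(BASES, repeat=3)]
print(len(triplets))                 # 64
```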
Salient Features of Genetic Code:
- The codon is triplet: 61 codons code for amino acids, and 3 codons do not code for any amino acids, functioning as stop codons .
- Degenerate: Some amino acids are coded by more than one codon .
- Contiguous fashion: The code is read in mRNA in a continuous manner, with no punctuations .
- Nearly Universal: For example, UUU codes for Phenylalanine (phe) from bacteria to humans. Some exceptions exist in mitochondrial codons and some protozoans [66, 67].
- AUG has dual functions: It codes for Methionine (met) and also acts as an initiator codon .
- Stop/Terminator codons: UAA, UAG, UGA .
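Several of these features can be verified against the standard codon table. The table below is not given in these notes; it is the widely published standard genetic code, written compactly with one-letter amino acid symbols and `*` for a stop codon. Variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
# Degeneracy: how many codons specify each amino acid.
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

print(stop_codons)        # ['UAA', 'UAG', 'UGA']
print(len(sense_codons))  # 61
print(CODON_TABLE["UUU"], CODON_TABLE["AUG"])  # F M
print(degeneracy["L"], degeneracy["M"])        # 6 1
```

Leucine (L) is specified by six codons while methionine (M) has only AUG, which illustrates degeneracy and the special position of the initiator codon.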
Mutations and Genetic Code
The relationship between genes and DNA is well understood through mutation studies .
- Point mutations: A change of a single base pair in a gene. A classic example is sickle cell anemia, where a single base pair change in the beta globin chain gene results in glutamate changing to valine.
- Frameshift insertion or deletion mutations: Insertion or deletion of one or two bases changes the reading frame from the point of insertion or deletion [70, 71].
- Insertion or deletion of three bases, or a multiple of three, inserts or deletes one or more amino acids, but the reading frame remains unaltered from that point onwards.
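The effect of each kind of mutation on the reading frame can be seen by translating a short, made-up mRNA before and after the change. This sketch assumes the standard codon table (not reproduced in these notes) and reads codons from the first base; the example sequence and helper names are hypothetical. The GAG to GUG change shown is the commonly cited codon change behind the glutamate to valine substitution.

```python
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(mrna: str) -> str:
    """Read triplets from the first base until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

def insert_bases(mrna: str, position: int, bases: str) -> str:
    """Return the mRNA with extra bases inserted before the given index."""
    return mrna[:position] + bases + mrna[position:]

original = "AUGGCACUUGGGAAAUAA"
print(translate(original))                          # MALGK
print(translate(insert_bases(original, 3, "C")))    # MRTWEI (frame shifted)
print(translate(insert_bases(original, 3, "CCC")))  # MPALGK (one residue added)
print(translate("GAG"), translate("GUG"))           # E V (point mutation)
```

A single inserted base changes every amino acid downstream, whereas three inserted bases add one residue and leave the rest of the sequence intact.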
tRNA – the Adapter Molecule
Francis Crick postulated the existence of an adapter molecule to read the genetic code and link it to specific amino acids, as amino acids have no structural specificities to read the code directly .
- tRNA (transfer RNA), previously known as sRNA (soluble RNA), was later assigned this adapter role .
- Structure and Function:
- tRNA has an anticodon loop with bases complementary to the mRNA codon .
- It also has an amino acid acceptor end to which it binds specific amino acids .
- Each tRNA is specific for a particular amino acid .
- There is a specific initiator tRNA for initiation .
- No tRNAs exist for stop codons .
- The secondary structure of tRNA resembles a clover-leaf, but its actual compact structure is like an inverted L .
Translation
Translation is the process of polymerization of amino acids to form a polypeptide, where the order and sequence of amino acids are defined by the sequence of bases in the mRNA . Amino acids are joined by a peptide bond .
Process:
- Charging of tRNA (Aminoacylation): Amino acids are activated in the presence of ATP and then linked to their cognate tRNA . This step requires energy .
- Ribosome: The cellular factory responsible for synthesizing proteins is the ribosome . It consists of structural RNAs and about 80 different proteins .
- In its inactive state, a ribosome exists as two subunits: a large subunit and a small subunit .
- When the small subunit encounters an mRNA, translation begins .
- The large subunit has two sites for subsequent amino acids to bind, bringing them close for peptide bond formation .
- The ribosome also acts as a catalyst for peptide bond formation; in bacteria the catalyst is the 23S rRNA, which makes it a ribozyme (an RNA enzyme).
- A translational unit in mRNA is the sequence flanked by the start codon (AUG) and the stop codon, coding for a polypeptide .
- mRNA also has untranslated regions (UTR) at both the 5’-end (before the start codon) and the 3’-end (after the stop codon). These UTRs are not translated but are required for efficient translation [76, 77].
- Initiation: The ribosome binds to the mRNA at the start codon (AUG), which is recognized by the initiator tRNA .
- Elongation: Complexes of amino acid-linked tRNAs sequentially bind to the appropriate codon in mRNA by forming complementary base pairs with the tRNA anticodon. The ribosome moves from codon to codon, adding amino acids one by one to the polypeptide chain [77, 78].
- Termination: When a release factor binds to a stop codon, translation terminates, and the complete polypeptide is released from the ribosome .
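The idea of a translational unit flanked by untranslated regions can be sketched as a simple scan of the mRNA. This is a deliberate simplification: it takes the first AUG as the start codon and the first in-frame stop codon as the end, which is an assumption made for illustration. The function name and example sequence are hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def split_mrna(mrna: str):
    """Split an mRNA into (5' UTR, translational unit, 3' UTR).

    Returns None if no start codon or no in-frame stop codon is found.
    """
    start = mrna.find("AUG")
    if start == -1:
        return None
    # Step through the message one codon at a time from the start codon.
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            end = i + 3
            return mrna[:start], mrna[start:end], mrna[end:]
    return None

print(split_mrna("GGCAUGGCACUUUAAGCC"))
# ('GGC', 'AUGGCACUUUAA', 'GCC')
```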
Regulation of Gene Expression
Regulation of gene expression is a broad term: control can be exerted at various levels. Since gene expression ultimately results in polypeptide formation, it can be regulated at any of the stages leading to that product.
In Eukaryotes, regulation can be exerted at:
- Transcriptional level (formation of primary transcript)
- Processing level (regulation of splicing)
- Transport of mRNA from nucleus to cytoplasm
- Translational level
Genes are expressed to perform specific functions . For instance, E. coli synthesizes the enzyme beta-galactosidase to hydrolyze lactose into galactose and glucose for energy [79, 80]. If lactose is absent, the bacteria do not need this enzyme, so its synthesis is not required . Thus, metabolic, physiological, or environmental conditions regulate gene expression . The development and differentiation of organisms also result from the coordinated regulation of gene expression .
In Prokaryotes:
- The predominant site for control of gene expression is the rate of transcriptional initiation .
- The activity of RNA polymerase at a promoter is regulated by interaction with accessory proteins (activators or repressors), which affect its ability to recognize start sites .
- The accessibility of promoter regions in prokaryotic DNA is often regulated by proteins interacting with sequences called operators .
- The operator region is usually adjacent to the promoter elements in most operons and binds a repressor protein [81, 82]. Each operon has its specific operator and repressor .
The Lac operon
The elucidation of the lac operon by François Jacob and Jacques Monod was a landmark in understanding transcriptionally regulated systems.
- The lac operon (referring to lactose) is a polycistronic structural gene regulated by a common promoter and regulatory genes .
- Such arrangements are common in bacteria and are called operons (e.g., lac operon, trp operon) .
Components of the lac operon: [83, 84]
- i gene (inhibitor gene): A regulatory gene that codes for the repressor of the lac operon [83, 84].
- z gene: Codes for beta-galactosidase (β-gal), which hydrolyzes lactose into galactose and glucose .
- y gene: Codes for permease, which increases the cell’s permeability to β-galactosides, allowing lactose entry .
- a gene: Encodes transacetylase . All three gene products of the lac operon are required for lactose metabolism .
Lactose as an Inducer:
- Lactose is the substrate for beta-galactosidase and acts as an inducer, regulating the switching on and off of the operon .
- A very low level of lac operon expression is always present to allow lactose entry via permease [85, 86].
- Mechanism of Induction:
- The repressor protein is synthesized constitutively (all the time) from the i gene .
- The repressor protein binds to the operator region, preventing RNA polymerase from transcribing the operon .
- In the presence of an inducer like lactose (or allolactose), the repressor is inactivated by interacting with the inducer .
- This inactivation allows RNA polymerase to access the promoter, and transcription proceeds .
- This regulation can be seen as the regulation of enzyme synthesis by its substrate .
- Regulation of the lac operon by the repressor is referred to as negative regulation .
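The switching logic of negative regulation can be written as a small truth function. This is a minimal sketch with illustrative names: it models only the repressor and the inducer, and ignores the very low basal expression mentioned above.

```python
def lac_operon_transcribed(lactose_present: bool,
                           repressor_functional: bool = True) -> bool:
    """True if RNA polymerase can transcribe the z, y and a genes."""
    # The repressor is made all the time; the inducer inactivates it.
    repressor_active = repressor_functional and not lactose_present
    # An active repressor sits on the operator and blocks transcription.
    operator_blocked = repressor_active
    return not operator_blocked

print(lac_operon_transcribed(lactose_present=False))  # False (switched off)
print(lac_operon_transcribed(lactose_present=True))   # True  (induced)
# A hypothetical mutant with no working repressor is on regardless of lactose.
print(lac_operon_transcribed(False, repressor_functional=False))  # True
```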
Human Genome Project (HGP)
The HGP was launched in 1990 to determine the complete DNA sequence of the human genome, driven by the understanding that DNA sequence determines genetic information and differences in sequences make individuals unique [88, 89].
Why it was called a ‘mega project’:
- The human genome has approximately 3 × 10^9 base pairs (bp) .
- The estimated cost was about US $3 per bp, leading to a total estimated cost of approximately 9 billion US dollars .
- Storing the obtained sequences in typed form would require 3300 books, each with 1000 pages and 1000 letters per page, for a single human cell’s DNA .
- The enormous amount of data generated necessitated the use of high-speed computational devices for storage, retrieval, and analysis . HGP was closely associated with the rapid development of Bioinformatics .
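The figures in this list follow from straightforward multiplication and division, sketched below with illustrative function names. Note that the cost estimate uses the rounded genome size of 3 × 10^9 bp, while the figure of 3300 books corresponds to the haploid content of 3.3 × 10^9 bp given earlier in these notes.

```python
def sequencing_cost_usd(base_pairs: int, cost_per_bp: int = 3) -> int:
    """Total cost at a fixed price per base pair."""
    return base_pairs * cost_per_bp

def books_needed(letters: int, pages_per_book: int = 1000,
                 letters_per_page: int = 1000) -> int:
    """Books required to print the sequence, rounding up to whole books."""
    letters_per_book = pages_per_book * letters_per_page
    return -(-letters // letters_per_book)  # ceiling division

print(sequencing_cost_usd(3_000_000_000))  # 9000000000 (9 billion US dollars)
print(books_needed(3_300_000_000))         # 3300
```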
Goals of HGP: [91, 92]
- Identify all the approximately 20,000-25,000 genes in human DNA.
- Determine the sequences of the 3 billion chemical base pairs that make up human DNA.
- Store this information in databases.
- Improve tools for data analysis.
- Transfer related technologies to other sectors like industries.
- Address the ethical, legal, and social issues (ELSI) that might arise from the project.
The HGP was a 13-year project, coordinated by the U.S. Department of Energy and the National Institutes of Health, with major contributions from the Wellcome Trust (U.K.) and other countries. It was completed in 2003.
- Impact: Knowledge about DNA variations promises revolutionary ways to diagnose, treat, and prevent thousands of human disorders .
- Sequencing non-human model organisms (e.g., bacteria, yeast, C. elegans, Drosophila, rice, Arabidopsis) also aids in understanding their natural capabilities for applications in healthcare, agriculture, energy, and environmental remediation [93, 94].
Methodologies:
Two major approaches were used:
- Expressed Sequence Tags (ESTs): Focused on identifying all genes that are expressed as RNA .
- Sequence Annotation: A “blind” approach of simply sequencing the entire genome (including coding and non-coding sequences) and then assigning functions to different regions .
Sequencing process:
- Total DNA from a cell was isolated and converted into random fragments of smaller sizes due to technical limitations in sequencing long pieces of DNA .
- These fragments were cloned in suitable hosts (like bacteria and yeast) using specialized vectors (e.g., BACs - bacterial artificial chromosomes, and YACs - yeast artificial chromosomes) to amplify each fragment for easy sequencing [95, 96].
- The fragments were sequenced using automated DNA sequencers based on a method developed by Frederick Sanger .
- These sequences were then arranged based on overlapping regions, which required specialized computer-based programs because manual alignment was impossible [96, 97].
- The sequences were subsequently annotated and assigned to each chromosome. Chromosome 1 was the last to be completed in May 2006 .
- Genetic and physical maps were generated using information on polymorphism of restriction endonuclease recognition sites and repetitive DNA sequences (microsatellites) .
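Arranging fragments by their overlapping regions can be illustrated with a toy greedy merge. This is not the software used in the project, only a sketch of the idea under simple assumptions: error-free reads, all from the same strand, and a hypothetical minimum overlap of three bases.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments, min_len: int = 3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left that overlaps
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags

reads = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
print(greedy_assemble(reads))  # ['ATGCGTACGTTAGC']
```

Even with three short reads, every pair has to be compared in both orders, which hints at why manual alignment of millions of fragments was not feasible.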
Salient Features of Human Genome:
- The human genome contains 3164.7 million bp.
- The average gene consists of 3000 bases, but sizes vary, with the largest known human gene, dystrophin, at 2.4 million bases.
- The total number of genes is estimated at 30,000, much lower than previous estimates (80,000-140,000).
- Almost all (99.9%) nucleotide bases are exactly the same in all humans.
- Functions are unknown for over 50% of the discovered genes.
- Less than 2% of the genome codes for proteins.
- Repeated sequences make up a very large portion of the human genome. These stretches of DNA are repeated many times, sometimes hundreds to thousands of times. They are thought to have no direct coding functions but provide insights into chromosome structure, dynamics, and evolution.
- Chromosome 1 has the most genes (2968), and the Y chromosome has the fewest (231).
- Scientists have identified about 1.4 million locations of single-base DNA differences (SNPs – single nucleotide polymorphisms) in humans. This information is crucial for finding disease-associated sequences and tracing human history.
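To get a feel for the scale of these figures, the percentages above can be converted into base counts. The sketch below uses only the numbers quoted in this list; the variable names are illustrative, and the results are rough estimates rather than measured values.

```python
# Figures quoted in the list above.
genome_bp = 3_164_700_000      # 3164.7 million bp
snp_locations = 1_400_000      # about 1.4 million known SNP locations

# 99.9% of bases are identical in all humans, so about 0.1% can differ.
differing_bp = genome_bp // 1000

# Less than 2% of the genome codes for proteins (upper bound).
coding_bp_upper = genome_bp * 2 // 100

# Average spacing between known SNP locations, in base pairs.
bp_per_snp = genome_bp / snp_locations

print(f"Bases that may differ between two people: about {differing_bp:,}")
print(f"Protein-coding bases: fewer than {coding_bp_upper:,}")
print(f"Roughly one known SNP every {bp_per_snp:,.0f} bp")
```

So even a 0.1% difference corresponds to roughly three million bases, which is why the remaining variation is large enough to distinguish individuals.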
Applications and Future Challenges: [101, 102]
- A major future challenge is deriving meaningful knowledge from the vast DNA sequences to understand biological systems.
- The HGP enables a radically new approach to biological research. Instead of studying one or a few genes, researchers can now study all genes in a genome or how tens of thousands of genes and proteins work together in interconnected networks.
DNA Fingerprinting
Although 99.9% of base sequences are the same among humans, it is the differences in the remaining 0.1% that make every individual unique in their phenotypic appearance. DNA fingerprinting is a quick method to compare these DNA sequences between individuals, avoiding the daunting and expensive task of full genome sequencing.
The technique involves identifying differences in specific regions of DNA called repetitive DNA.
- Repetitive DNA: Small stretches of DNA that are repeated many times (hundreds to thousands of times) [99, 104].
- These sequences are separated from bulk genomic DNA during density gradient centrifugation, forming small peaks referred to as satellite DNA (while bulk DNA forms a major peak).
- Satellite DNA is classified into micro-satellites, mini-satellites, etc., based on base composition, segment length, and number of repetitive units.
- They typically do not code for any proteins but constitute a large portion of the human genome.
- They show a high degree of polymorphism, which forms the basis of DNA fingerprinting.
- Since the degree of polymorphism is the same in DNA from all tissues (blood, hair, skin, bone, saliva, sperm) of an individual, they are highly useful as an identification tool in forensic applications.
- As these polymorphisms are inheritable, DNA fingerprinting is also used for paternity testing in disputes.
DNA Polymorphism:
- Refers to variation at the genetic level, arising due to mutations.
- An inheritable mutation is considered a DNA polymorphism if more than one variant (allele) at a locus occurs in a human population with a frequency greater than 0.01 [107, 108].
- Polymorphisms are more likely to be observed in non-coding DNA sequences, as mutations in these regions may not immediately affect an individual’s reproductive ability. These mutations accumulate over generations, contributing to variability.
- Polymorphisms range from single nucleotide changes to very large-scale changes and play a vital role in evolution and speciation.
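The 0.01 frequency criterion can be made concrete with a short sketch. It assumes one reading of the definition above: a locus counts as polymorphic when at least two alleles each occur at a frequency greater than 0.01 in the sampled population. The function names and the sample data are hypothetical.

```python
from collections import Counter


def allele_frequencies(alleles: list[str]) -> dict[str, float]:
    """Fraction of the sampled chromosomes carrying each allele."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def is_polymorphic(alleles: list[str], threshold: float = 0.01) -> bool:
    """True if more than one allele exceeds the frequency threshold.

    Assumption: "more than one variant with frequency greater than 0.01"
    is read as at least two alleles each above the threshold.
    """
    freqs = allele_frequencies(alleles)
    common = [a for a, f in freqs.items() if f > threshold]
    return len(common) >= 2


# Hypothetical samples of 1000 chromosomes at two loci.
rare_variant_locus = ["A"] * 995 + ["G"] * 5    # G at 0.005
common_variant_locus = ["A"] * 970 + ["G"] * 30  # G at 0.03

print(is_polymorphic(rare_variant_locus))
print(is_polymorphic(common_variant_locus))
```

Under this reading, the first locus carries only a rare mutation, while the second meets the criterion for a polymorphism.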
Technique (initially developed by Alec Jeffreys):
- As a probe, Jeffreys used a satellite DNA called Variable Number of Tandem Repeats (VNTR), which shows a very high degree of polymorphism.
- The earlier technique involved Southern blot hybridisation using a radiolabelled VNTR probe [109, 110].
Steps of DNA Fingerprinting using Southern Blot:
- Isolation of DNA.
- Digestion of DNA by restriction endonucleases (enzymes that cut DNA at specific sites).
- Separation of DNA fragments by electrophoresis (based on size).
- Transferring (blotting) of separated DNA fragments from the gel to synthetic membranes (e.g., nitrocellulose or nylon).
- Hybridisation using a labelled VNTR probe. The probe binds to complementary VNTR sequences on the membrane.
- Detection of hybridised DNA fragments by autoradiography.
Key points about VNTRs and the resulting pattern:
- VNTRs are a class of mini-satellites where a small DNA sequence is arranged tandemly in many copy numbers.
- The copy number varies from chromosome to chromosome within an individual and shows a very high degree of polymorphism.
- VNTRs vary in size from 0.1 to 20 kb.
- After hybridisation, the autoradiogram produces many bands of differing sizes, which create a characteristic banding pattern unique to an individual’s DNA (except for monozygotic/identical twins) [111, 112].
- The technique’s sensitivity has been significantly increased by the use of Polymerase Chain Reaction (PCR), making it possible to perform DNA fingerprinting analysis from a single cell.
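Why different VNTR copy numbers give different bands, and why a child's bands can be traced to the parents, can be shown with a simplified model. In this sketch a fragment's size is taken as a fixed flanking length plus the repeat unit length times the copy number. The locus names, flank and repeat lengths, and genotypes are all hypothetical, and real autoradiograms are interpreted with far more care.

```python
def fragment_size(flank_bp: int, repeat_unit_bp: int, copies: int) -> int:
    """Size of a restriction fragment carrying a VNTR with `copies` repeats."""
    return flank_bp + repeat_unit_bp * copies


def banding_pattern(genotype: dict, loci: dict) -> list[int]:
    """Band sizes (bp) for a person, largest first.

    genotype maps locus -> (copies on one chromosome, copies on its homologue)
    loci maps locus -> (flank length in bp, repeat unit length in bp)
    """
    bands = set()
    for locus, (copies_a, copies_b) in genotype.items():
        flank, unit = loci[locus]
        bands.add(fragment_size(flank, unit, copies_a))
        bands.add(fragment_size(flank, unit, copies_b))
    return sorted(bands, reverse=True)


def consistent_with_parents(child, mother, father) -> bool:
    """True if every band of the child is also seen in the mother or the father."""
    return set(child) <= (set(mother) | set(father))


# Hypothetical loci and genotypes (copy numbers on the two homologues).
loci = {"locus1": (200, 30), "locus2": (500, 16)}
mother = banding_pattern({"locus1": (10, 14), "locus2": (20, 35)}, loci)
father = banding_pattern({"locus1": (8, 22), "locus2": (25, 40)}, loci)
child = banding_pattern({"locus1": (14, 8), "locus2": (20, 40)}, loci)
unrelated = banding_pattern({"locus1": (12, 30), "locus2": (18, 50)}, loci)

print(child, consistent_with_parents(child, mother, father))
print(unrelated, consistent_with_parents(unrelated, mother, father))
```

Because the child inherits one chromosome of each pair from each parent, every band in the child's pattern should match a band in one of the two parents, which is the reasoning behind paternity testing.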
Applications of DNA Fingerprinting:
- Forensic science (identifying suspects from crime scene samples)
- Paternity testing
- Determining population and genetic diversities
- Evolutionary biology