The cell life cycle can be split into four major parts:
Mitosis consists of:
Some organisms undergo binary fission, which is distinctly different from mitosis (which only occurs in eukaryotes, not bacteria) – specifically, in binary fission, DNA is just doubled and split, as there is no nucleus (usually) to worry about. Additionally, cells undergoing meiosis (producing gametes/sex cells) have an additional subset of stages that occurs during prophase. The process is a little more complicated:
To cover the basics, a genome is defined as all the DNA in a virus, organelle, or cell. The human genome is composed of two other genomes – the DNA (deoxyribonucleic acid) found in our mitochondria (mtDNA) and the DNA found in the nuclei of our cells (nDNA) – and is over three billion base pairs long, arranged in a double helix shape. A base pair is a pair of nitrogenous bases – a pyrimidine (thymine/uracil or cytosine) and a purine (adenine or guanine) – bonded together. Plants have three genomes (due to cpDNA in chloroplasts), and prokaryotes (bacteria) and viruses only have one genome (though prokaryotes may also have plasmids, rings of DNA floating independent of the DNA in their nucleoids).
Alongside other things, DNA is kept in chromosomes; humans usually have twenty-three homologous pairs of these (for a total of forty-six, or 2N = 46), making humans diploid organisms. In each pair, one chromosome is maternal, the other paternal. Some organisms or cells (such as human gametes) are haploid, as they have only one of each chromosome. Aneuploidy describes an abnormal number of chromosomes (more on that in Inheritance).
Chromosomes are packed incredibly tightly; in eukaryotes, DNA is wrapped around histone proteins via electrostatic attraction and (rather weak) hydrogen bonds. This is because the three-dimensional structure and amine groups of histone proteins (which are rich in lysine and arginine) give them a positive charge; DNA is negatively charged due to its phosphate groups. These histones are packaged together in octamers, and each octamer with the DNA wrapped around it is called a nucleosome. These then get folded into solenoids and arranged even more tightly as a scaffold loop along a protein. This collection of DNA, RNA, proteins, lipids, and polysaccharides is what makes up the chromatin that makes up a chromosome. The DNA in one cell alone could be stretched out to two meters; all your cells' DNA combined could stretch to the sun and back over sixty times, or be as long as twice the diameter of the Solar System.
The complete set of an individual's chromosomes is called a karyotype. To visualize it, a cytogeneticist may use FISH (fluorescent in situ hybridization) DNA/RNA probes that bind to a complementary region of nucleic acids. Euchromatin is loosely packed and does not stain easily, but heterochromatin is tighter and stains darker. Regions of euchromatin are typically regions that are frequently read and therefore need to be kept "readily accessible" in "chromosomal puffs". Topoisomerases are proteins that specifically control the winding of DNA for ease of transcription.
The centromere of a chromosome divides it into two parts. The shorter arm is the p-arm ("petite"), and the longer is the q-arm. The number of centromeres in a karyotype equals the number of chromosomes (a helpful tip to keep in mind when a chromosome develops sister chromatids during mitosis). When this centromere is in the very middle of a chromosome, it is metacentric; ones with a centromere closer to one end than another are submetacentric, acrocentric, and eventually telocentric (when the centromere is at the very end).
Genes are parts of DNA that code for specific proteins. These are transcribed by proteins to form mRNA, and are made up of introns and exons. Introns are removed before the mRNA is translated into proteins. Individuals that belong to the same species will have similar introns, but their genomes may vary when it comes to single nucleotide polymorphisms (SNPs), which are differences in a single base pair, and copy number variants (CNVs), which are differences in the number of times a particular region appears. An excessive number of repeats in the gene coding for huntingtin, for example, determines whether or not an individual may develop Huntington's. Genomes may also vary thanks to transposable elements, which are self-replicating mobile DNA sequences. We actually don't know for sure what their purpose is, and they are considered "genomic parasites" and "selfish/jumping" genes, but they alone make up an entire 44% of the human genome. As far as we know, only about 1% of the human genome actually codes for protein!
The first stage of DNA replication is initiation. The supercoiling of template DNA is unwound by gyrase, an enzyme that relies on ATP (as most parts of this process do). In prokaryotes, DnaA proteins then bind to a region near the ori (origin of replication) sequence, where they break the hydrogen bonds between strands and their base pairs through added tension. The loader DnaC adds a donut-shaped enzyme called DnaB helicase, which formally unwinds the DNA, traveling from the 5' end to the 3' end. This creates a replication fork. In eukaryotes, initiation instead begins at an ARS sequence, which is recognized by an ORC (Origin Recognition Complex) and preRC (prereplication complex) that "approves" the procedure.
The next stage of DNA replication is elongation. The primase enzymes bound to the DnaB helicase make short RNA primers on the leading strand to begin the process. DNA polymerase (DNA Pol III) is attached by a sliding clamp to start at this primer and travel towards the forming fork, creating a 5' to 3' strand using free dNTPs that have the correct nucleotide bases attached (as dATP, dTTP, dGTP, and dCTP). Along the lagging (5' to 3') strand, multiple short, discontinuous sections called Okazaki fragments form instead, and (after the primer is removed and filled in by DNA Pol I) are later connected by ligase. This process creates linear dsDNA.
DNA replication ends with termination. In prokaryotes, eventually, DNA Pol meets a Tus protein bound to a Ter sequence. In eukaryotes, the process will eventually reach the end of the chromosome, where a ribonucleoprotein (an enzyme composed of both protein and RNA) called telomerase takes over. This enzyme keeps the end of DNA intact by extending it with repetitive, noncoding sequences (telomeres). Shorter telomeres are correlated with aging, but some cell death is necessary, so while lengthening telomeres indefinitely could theoretically extend one's life expectancy, it'd also greatly increase one's risk of cancer.
Various factors – such as mistakes in replication, viral/bacterial infection, chemicals, and high-frequency radiation – may alter the structure of DNA, creating mutations. Usually, during replication, any mistakes detected will halt polymerase activity, and exonuclease activity will remove a few bases from the offending section before polymerization restarts, but some nonetheless get through. Slippage can occur, particularly at repetitive regions that may bind to themselves in loops, creating trinucleotide repeat disorders, like Huntington's. This phenomenon can also cause genetic anticipation when offspring inherit more and more repeats, so later generations will experience the related disorder more severely at a younger age. Other mistakes include deamination, when cytosine loses its amine group and turns into a uracil, or inversions, where entire sections of a chromosome are flipped into reverse order (but rarely alter phenotypes, aside from fertility if crossing over within a paracentric inversion leads to an abnormal number of centromeres). Some mutations, however, may be beneficial, and there are many repair pathways specialized in catching mistakes.
In mismatch repair, several nucleotides are removed from the offending region by an endonuclease. DNA Pol and ligase will return to fill in the gaps. Nucleotide excision repair is similar, but recognizes entire misshapen segments of DNA. In base excision repair, only one base is removed; this tends to occur after DNA replication. The direct repair pathway differs in that it targets alterations that were made to nucleotides, such as misplaced bonds, and has many repair enzymes built to recognize different forms of alterations. This leaves the phosphate backbone intact. Non-homologous end joining deals with double-stranded breaks; ends are blunted, then ligated together. This differs from homologous repair, where a portion of intact DNA from a homologous chromosome helps fill in the gap. This can lead to a loss of heterozygosity.
Over ten thousand diseases are caused by mutations in a single gene. Gene therapy and genome editing are fields focused on adding correct forms of a gene or altering the genome, respectively, in a healthcare context. Precision of delivery and editing is a major concern, as is the ethics of the treatment. CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) is part of an immune response in some bacterial cells and, paired with a Cas nuclease such as Cas9, is the most modern approach to this issue. The system "remembers" which viral DNA has entered the cell in order to prevent reinfection, and it can be engineered to target specific portions of DNA using a combination of tracrRNA and crRNA. This method is highly efficient and simple, but also overefficient, as it can cut portions of DNA that one may prefer not to target. Many different CRISPR pathways exist. RNAi (RNA interference) is an important related technology; these strands of RNA will bind to transcribed mRNA to prevent their translation.
Not all of the genome is transcribed into RNA, and not all of this RNA becomes protein-coding mRNA (which only makes up ~2% of RNAs!); there are tRNAs that transport amino acids to ribosomes, for example, or ncRNAs that code for nothing at all. Most RNA (~80%) is ribosomal RNA (rRNA), which makes up the ribosomes that build proteins. While the genome is all the genetic material in a cell, and the transcriptome is all the RNA, the proteome describes all the protein in a cell. Each set of three nucleotides in mRNA is called a codon, and each codon represents the next amino acid that must be used to construct a protein coded for by the mRNA. The template strand is read 3' to 5', and mRNA will be built 5' to 3', meaning it will resemble the coding strand, aside from the fact that uracil is used in place of thymine. Just like DNA replication, transcription (from DNA to mRNA) has three phases (initiation, elongation, and termination), but uses different enzymes. In prokaryotes, the sigma factor is a subunit of RNA polymerase that binds to the -10 and -35 sections of the promoter and bends the dsDNA to get a single-stranded -10 section. The RNA polymerase can initiate extension without a primer. UP sequences occur upstream of the -35/-10 promoters and may help regulate transcription. Promoter sequences will usually be similar in some way, and a sequence of the most common nucleotides at each position is called a consensus sequence. The process of transcription ends at a terminator sequence, which isn't necessarily the STOP codon of the protein coding sequence; this section tends to be palindromic so that it loops back around to form a stem/loop structure with itself. Transcription doesn't necessarily only focus on one gene at a time; operons are clusters of genes that are all transcribed on one polycistronic mRNA.
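The template-to-mRNA relationship above can be sketched in a few lines of Python. This is only a toy illustration: the function names and the example template strand are made up for this sketch, and it ignores promoters, terminators, and introns entirely.

```python
# Map each DNA base on the template strand to its complementary RNA base.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5: str) -> str:
    """Return the mRNA (5'->3') built from a template strand written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())

def codons(mrna: str) -> list[str]:
    """Split an mRNA into consecutive three-base codons, dropping leftover bases."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]

template = "TACCGGATT"       # hypothetical template strand, written 3'->5'
mrna = transcribe(template)  # matches the coding strand, with U in place of T
print(mrna, codons(mrna))
```

Note that the output reads like the coding strand (ATGGCCTAA) with every T swapped for a U, which is exactly the resemblance described above.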
Ribosomes are the enzymes responsible for translating mRNA into proteins. They contain a large and a small subunit. The former binds to incoming tRNAs and catalyzes the reaction, while the latter decodes and attaches to the mRNA: in prokaryotes via the 3' end of its rRNA at the Shine-Dalgarno (SD) box, placing the fMet initiator tRNA in the P site, and in eukaryotes by finding the start codon within the Kozak sequence. A charged tRNA with an anticodon complementary to the mRNA codon enters the ribosome's A site, moves to the P site to attach its amino acid to the growing chain as another tRNA enters, and is discharged from the ribosome via the E site.
This creates the primary structure of a protein – its amino acid sequence. Secondary structure occurs as hydrogen bonds form between the backbone groups in this sequence, folding it into either α-helices or β-pleated sheets. This then allows for even more bonds between R-groups/the peptide backbone that make the protein three-dimensional, forming its tertiary structure. Different polypeptides like this one can then form quaternary structures with each other.
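To make the idea of primary structure concrete, here is a minimal Python sketch of reading an mRNA codon by codon. It assumes the standard genetic code and one-letter amino acid abbreviations, starts at the first AUG it finds, and stops at the first stop codon; the function name and example sequence are invented for illustration, and real initiation (SD box, Kozak sequence) is not modeled.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order for each codon position.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna: str) -> str:
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("GGAUGGCCUAAUU"))  # Met-Ala, written with one-letter codes
```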
Alleles are different versions of a gene found at the same locus on homologous chromosomes. These make up an organism's genotype. Usually, there is a dominant allele that suppresses the expression of any recessive alleles. An organism must have a homozygous genotype (have two identical alleles) to express a recessive phenotype (physical traits), but a heterozygous organism will look identical to one with a homozygous dominant genotype, since it has at least one dominant allele. This means that it's impossible to be sure of an organism's genotype if it is expressing the dominant allele, at least not without analyzing the rest of its pedigree. Generally, recessive alleles code for a loss of function; to use a metaphor, as long as one foot is on the accelerator, it doesn't matter if the other one isn't.
There is also the case of incomplete dominance, where an organism heterozygous for two alleles at a particular locus expresses a phenotype that is a blend of both (like pink flowers occurring when a red-flowering plant and a white-flowering plant are bred together). This is distinct from codominance, when a heterozygote expresses both phenotypes simultaneously (like calico cats or roan cattle, which are speckled with both colors). Some traits are also sex-linked, attached to a sex chromosome (X or Y). Since XY individuals are heterogametic and only have one X chromosome to express, they will always express any X-linked allele they carry. XX individuals, however, need two copies to express a recessive X-linked trait (and therefore an affected father and a mother who is at least a carrier), unless the X-linked trait is dominant. It's worth noting, however, that homogametic XX individuals usually have one inactivated X chromosome in each cell that is referred to as a Barr body. This prevents XX individuals from expressing "double" the amount of X-linked traits.
Gametes (that is, haploid sex cells) are formed during meiosis. However, errors can occur in the process, like nondisjunction, when chromosomes fail to separate properly. This can cause aneuploidy, when there is an abnormal number of any chromosome. Polyploidy occurs when an individual has three or more complete sets of chromosomes rather than just two and is a form of euploidy. Polyploidy is viable in plants, but not humans. However, humans are very resilient to sex chromosome aneuploidies. Such cases are examples of intersex individuals (while those who aren't intersex are considered perisex, endosex, or dyadic).
Admixture occurs when two divergent or isolated genetic lineages from different gene pools mix. A gene pool is the set of all the genes in a population. For a gene with two alleles, allele frequency in a diploid gene pool can be represented with p + q = 1, where p is the frequency of one allele and q the frequency of the other. When a large, diploid, randomly mating population is at Hardy-Weinberg equilibrium (experiencing no mutation, migration, or selection), the genotype frequency can be represented with p² + 2pq + q² = 1.
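As a quick worked example of the Hardy-Weinberg equation, here is a small Python sketch. The function name and the allele frequency of 0.7 are arbitrary choices for illustration, and the genotype labels AA/Aa/aa simply stand for the p², 2pq, and q² terms.

```python
def hardy_weinberg(p: float) -> dict[str, float]:
    """Expected genotype frequencies for a two-allele locus at equilibrium."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency p must be between 0 and 1")
    q = 1.0 - p  # because p + q = 1
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

frequencies = hardy_weinberg(0.7)  # hypothetical allele frequency
for genotype, frequency in frequencies.items():
    print(genotype, round(frequency, 4))
```

With p = 0.7, roughly 49% of the population is expected to be homozygous for the first allele, 42% heterozygous, and 9% homozygous for the second, and the three terms add back up to 1.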
A pedigree chart illustrates inheritance of a gene and expression of a trait throughout an individual's ancestry. Male individuals are represented with squares, and female individuals are represented with circles. If an individual's sex is unknown – such as in the early stages of pregnancy, due to a lack of documentation, or in cases of intersexuality – they may be represented with a diamond. An individual that is partially colored is a known carrier for a trait, and one that is solid expresses that trait.
When two individuals that are dominant-recessive heterozygous at two loci have offspring, it helps to remember that their offspring will express phenotypes in a 9:3:3:1 ratio: out of every sixteen offspring, on average, nine express both dominant phenotypes, six (3+3) express one recessive and one dominant phenotype, and one expresses both recessive phenotypes.
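If you'd rather see where 9:3:3:1 comes from than memorize it, the sixteen boxes of the Punnett square can be enumerated directly. The sketch below assumes two unlinked loci with simple dominance (uppercase dominant over lowercase); the helper names and the AaBb genotype string format are made up for this example.

```python
from collections import Counter
from itertools import product

def gametes(genotype: str) -> list[str]:
    """All gametes from a two-locus genotype such as "AaBb"."""
    return [a + b for a, b in product(genotype[0:2], genotype[2:4])]

def phenotype(gamete1: str, gamete2: str) -> str:
    """Phenotype label; one dominant (uppercase) allele is enough to show it."""
    first = "A" if "A" in (gamete1[0], gamete2[0]) else "a"
    second = "B" if "B" in (gamete1[1], gamete2[1]) else "b"
    return first + second

# Cross AaBb x AaBb: every gamete from one parent meets every gamete from the other.
counts = Counter(
    phenotype(g1, g2) for g1, g2 in product(gametes("AaBb"), repeat=2)
)
print(counts)
```

Here "AB" means both dominant phenotypes, "Ab" and "aB" mean one of each, and "ab" means both recessive.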
Usually, one can predict the probability of two events occurring simultaneously by multiplying together the probabilities of them occurring independently first, but sometimes, in cases of unordered events, it's necessary to use binomial expansion. Say that a couple, both parents heterozygous for sickle cell anemia, plans to have three children. The probability of a child having sickled cells is b, and normal cells is a. (a + b)³, the probability of having three kids with either type of cell, can be expanded to form a³ + 3a²b + 3ab² + b³, and the term 3a²b represents the probability of having two unaffected children and one affected child in no particular order.
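To put numbers on the sickle cell example, here is a short Python sketch of a single term of the binomial expansion. It assumes the trait behaves as a simple recessive, so two heterozygous parents give b = 1/4 and a = 3/4; the function name is made up for illustration.

```python
from math import comb

def prob_k_affected(n_children: int, k_affected: int, b: float) -> float:
    """Probability of exactly k affected children out of n, in any order."""
    a = 1.0 - b  # probability that a child is unaffected
    return comb(n_children, k_affected) * a ** (n_children - k_affected) * b ** k_affected

# Assumed: both parents heterozygous for a recessive trait, so b = 1/4.
two_unaffected_one_affected = prob_k_affected(3, 1, 0.25)  # the 3a²b term
print(two_unaffected_one_affected)
```

The coefficient 3 comes from the three possible birth orders for the one affected child, which is exactly what `comb(3, 1)` counts.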
Complex traits do not behave in ways predicted by typical Mendelian inheritance. There are a lot of reasons why this may be.
Recessive lethality can lead to missing genotypes when a particular genotype is nonviable, and additive traits are determined by how many additive alleles are present; more alleles means more expression (seen in height and melanin production). Conditional traits are only expressed in the right environments (seen in the pigmentation of Siamese cats). Recent research has also discovered maternal effect genes, cases where a mother's genotype determines the phenotype of her offspring, and a few examples of these have already been found in humans. Organelle inheritance is also typically maternal (and remember, the human genome includes the DNA in mitochondria).
Epistasis occurs when the expression of one epistatic gene masks another, hypostatic gene. A good example in humans is baldness, which masks hair color. This is contrasted with pleiotropy, which describes when a gene affects multiple traits at once. There are many forms of epistasis: recessive (recessive alleles at one locus mask another), dominant (dominant alleles at one locus mask another; "inhibitor"), duplicate recessive (recessives at either locus mask the other; often seen in linear pathways), duplicate dominant (dominants at either locus mask the other; often due to functional redundancy), one dominant/one recessive (dominant allele at one locus masks the other, which masks the first if it is homozygous recessive), and non-additive (alleles at different loci do not contribute equally to the trait).
During meiosis (specifically prophase I), homologous chromosomes are paired tightly together on top of one another in synapsis. As a result, recombination may occur, and the two homologous chromosomes may swap sections of themselves. The frequency of recombination is often underestimated since two recombination events may occur next to each other and cancel each other out. Linked genes, which are found close to each other on the same chromosome, are rarely inherited independently. Completely linked genes are never found apart and have a recombination frequency (RF) of 0%. Partial linkage results in an RF below 50%; an RF of 50% signifies a total lack of linkage. RF may also be expressed in terms of centiMorgans or map units, where 1% RF = 1 cM = 1 m.u. You can calculate recombination frequency using the equation RF = [(# of recombinant progeny) ÷ (total progeny)] × 100.
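The RF equation and the linkage cutoffs above translate directly into code. This is a minimal sketch with made-up function names and made-up progeny counts, not data from any real cross.

```python
def recombination_frequency(recombinants: int, total_progeny: int) -> float:
    """RF as a percentage, which also reads as centiMorgans / map units."""
    if total_progeny <= 0 or not 0 <= recombinants <= total_progeny:
        raise ValueError("need 0 <= recombinants <= total_progeny and total > 0")
    return 100 * recombinants / total_progeny

def linkage(rf_percent: float) -> str:
    """Classify linkage using the cutoffs described in the text."""
    if rf_percent == 0:
        return "completely linked"
    if rf_percent < 50:
        return "partially linked"
    return "unlinked"

rf = recombination_frequency(120, 1000)  # hypothetical counts
print(rf, linkage(rf))
```

So 120 recombinant offspring out of 1000 would put the two genes about 12 cM apart, keeping in mind that double crossovers make this an underestimate for genes that are far apart.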
Phase is a way of expressing allele combinations from each parent in something that resembles a fraction. Ab/aB notation would signify the alleles A and b came from one parent, and a and B from the other. If the alleles an offspring inherits from its mother, for example, are the exact same combination as one of the mother's own chromosomes, they are considered in phase, parental, or nonrecombinant; no recombinant gametes were formed during meiosis. If they aren't, they are out of phase, recombinant, or nonparental. However, this cannot always be determined. It is also worth noting that phase does not signify linkage.
When a gene is expressed at a constant level and at all times, it exhibits constitutive expression, but this is not always efficient, so many genes are regulated. This can be done transcriptionally, altering the transcription rate; translationally, blocking or changing the translation rate and mRNA stability; and post-translationally, modifying proteins to alter their activity. The first, transcriptional, is usually best, though it is worth noting that transcription and translation are coupled in prokaryotes.
Prokaryotic genomes contain operons, or regions of genes with related functions that are regulated by one promoter together, transcribed into one mRNA. The lac operon contains an example of negative regulation. This operon allows the bacterium to use lactose as an alternative energy source when glucose is not available. When there is no lactose present, the genes are not transcribed. This is because the lac repressor protein binds to the operator, a region around the transcription start site, just after the promoter to which the RNA polymerase binds. This prevents the RNA polymerase from moving along and reading the genetic information downstream, unless lactose is present: allolactose, an isomer of lactose made from it inside the cell, binds to the repressor and makes it lose its ability to bind. This makes allolactose an inducer, a molecule that triggers gene expression.
Upstream of the promoter is the CAP binding site, a site of positive regulation. In positive regulation, transcription only occurs at a high level when an activator is bound to its binding site; little to no transcription occurs if the activator is missing. The activator protein that promotes gene expression here is CAP (catabolite activator protein). CAP helps the RNA polymerase bind to the promoter when glucose levels are low. Even without CAP, some low-level transcription of this operon occurs nonetheless. This is leaky expression. In turn, CAP is regulated by cAMP (cyclic AMP), which accumulates when glucose levels are low and binds to CAP so as to allow it to bind to DNA.
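The two layers of lac operon control (repressor and CAP) can be boiled down to a tiny truth table. The sketch below is a deliberately simplified yes/no model: it treats lactose and glucose as simply present or absent, treats repressed expression as fully off (ignoring leakiness), and uses made-up labels for the output levels.

```python
def lac_operon_expression(lactose_present: bool, glucose_present: bool) -> str:
    """Rough expression level of the lac operon in a simplified boolean model."""
    # Negative regulation: without lactose (allolactose), the repressor
    # stays bound to the operator and blocks RNA polymerase.
    repressor_bound = not lactose_present
    # Positive regulation: low glucose -> high cAMP -> CAP binds its site.
    cap_bound = not glucose_present

    if repressor_bound:
        return "off"
    if cap_bound:
        return "high"
    return "low"  # repressor released, but no CAP to help the polymerase

for lactose in (False, True):
    for glucose in (False, True):
        print(f"lactose={lactose}, glucose={glucose}:",
              lac_operon_expression(lactose, glucose))
```

The only combination that gives strong expression is lactose present and glucose absent, which is the situation where digesting lactose is actually worth the cell's effort.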
The trp operon encodes the enzymes that synthesize the amino acid known as tryptophan. When there is a lot of tryptophan present, the molecule binds to the repressor and keeps it active, preventing too much synthesis from occurring, which is another form of negative regulation. This makes tryptophan a co-repressor. This operon also exhibits attenuation: when tryptophan is plentiful, a terminator stem-loop structure forms in the leader mRNA and ends transcription early. When tryptophan is scarce, the ribosome stalls on the leader's tryptophan codons, an anti-attenuation structure forms in place of the terminator, and transcription continues anyway.
Eukaryotic gene regulation is more complicated, and very diverse, especially in multicellular organisms. It can occur in the nucleus or the cytoplasm during many stages, and there are always multiple layers to the process. There are multiple parts of the promoter region alone, with a TATA box (-25bp, or 25 bases upstream), CAAT box (-80bp), and GC box (-100bp to -300bp). Specific proteins – activators and repressors – will bind to certain DNA regions – enhancers and silencers – to raise or lower transcription relative to the basal level (the fundamental, minimal rate of transcription). Deliberately mutating these regions to see what they do is called "promoter bashing". Such mutations generally make transcription hard, though TATA box mutations are less impactful than you'd think; they only impact start site position.
Proteins will bind to DNA in many ways, including helix-turn-helix (the most simple/common) and helix-loop-helix motifs that attach to DNA, then turn around to do it again; zinc fingers and copper fists that use metal ions; and leucine zippers that twist around themselves.
Chromatin remodeling is one other form of regulation, where acetylation of histone tails (adding an acetyl group) makes chromatin less compact. Methylation, however, makes DNA more densely packed and lowers expression, and it primarily happens in 5'-CG-3'-rich sites, also called CpG islands. In addition, repressors may be recruited by insulators, and mRNA can be processed to alter how much of a protein is produced. Temperature can also be a factor in when a gene is expressed, and riboswitches block transcription by making terminator sequences, recruiting RNA cleavage factors, and more. RNAi (RNA interference) uses short RNAs complementary to a specific mRNA, which bind to it and cause it to be degraded.
This is all regulated by writer genes that introduce modifications to DNA and histones, readers that recognize the changes and recruit remodeling enzymes/transcription factors, and erasers that remove the writers' modifications.
Epigenetics creates heritable changes in gene expression, but doesn't alter the sequence itself – just how accessible the genes and their promoters are. Chromatin gets bound more tightly and condensed to hinder expression. Epigenetic changes may be triggered by the environment or programmed (such as in cell differentiation, a form of somatic epigenetics). These can have transgenerational effects, best seen in studies involving long-term famine.
Imprinted genes are expressed differently depending on which parent they came from. Counterintuitively, a maternally-imprinted allele will be heavily methylated on the maternal chromosome so that the paternal one is expressed instead.
As mentioned earlier, hydrogen bonds are pretty weak, and yet, this is the way two strands of DNA remain attached to each other. This actually means that the strands will separate at high temperatures. This is referred to as melting or denaturing. Normally, DNA absorbs light at 260 nm, but it can absorb even more once the bases become more accessible this way (a process known as hyperchromic shift). Proteins, by comparison, absorb at a slightly higher 280 nm. Using the hyperchromic shift, one can determine the melting temperature of DNA, which in turn can help estimate its composition. This is because G/C pairs (held together by three hydrogen bonds rather than two) require more energy to break, increasing the melting temperature required. A strand of DNA with a lot of guanine and cytosine will take more energy to denature than one with more adenine and thymine of the same length.
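To see how composition shifts the melting temperature, here is a rough Python sketch. The GC fraction is straightforward counting; the temperature estimate uses the Wallace rule (2 °C per A/T, 4 °C per G/C), which is not from these notes and is only a rule of thumb for very short oligonucleotides. The function names and example sequences are made up.

```python
def gc_content(sequence: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

def wallace_tm(sequence: str) -> int:
    """Very rough melting temperature (°C) via the Wallace rule.

    Assumption: only meaningful for short oligos; ignores salt and concentration.
    """
    sequence = sequence.upper()
    at = sequence.count("A") + sequence.count("T")
    gc = sequence.count("G") + sequence.count("C")
    return 2 * at + 4 * gc

for oligo in ("AATTAATT", "GGCCGGCC"):  # same length, different composition
    print(oligo, gc_content(oligo), wallace_tm(oligo))
```

The two example oligos are the same length, but the GC-only one comes out with double the estimated melting temperature, matching the trend described above.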
Salt has a similar effect, as it shields the negative charges of the nucleotide backbone, but it also decreases the DNA's polarity. Combined with an alcoholic environment, this will cause the DNA to simply come out of solution so it can be used for analysis. This technique can even be used to get DNA out of fruits at home, not that you can do much with it.
If you do this in a lab setting, the DNA can then be used in a polymerase chain reaction, or PCR. It's a way to make the minute amount of DNA extracted replicate in vitro to obtain a testable/detectable amount and confirm its existence. Specifically, one first obtains primers that correspond to (slightly upstream of) a region of DNA, a gene of interest. These primers are mostly GC and very short. A mixture of template DNA, primers, and dNTPs is heated so that the DNA denatures. The temperature is lowered slightly so the primers can anneal (bind) to these single strands, and then slightly warmed again so that a specialized-but-imprecise Taq polymerase that can handle high temperatures (derived from the thermophile Thermus aquaticus) then extends the DNA starting at the primers. This cycle is repeated 20-30 times to produce millions of copies of the template.
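The "millions of copies" figure follows from simple doubling, which a quick Python sketch can show. The function name and the `efficiency` parameter are hypothetical; an efficiency of 1.0 means every strand is copied in every cycle, which real reactions only approximate.

```python
def pcr_copies(initial_templates: int, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized copy number after a number of PCR cycles.

    efficiency = 1.0 is perfect doubling each cycle (an assumption).
    """
    return initial_templates * (1 + efficiency) ** cycles

for cycles in (20, 25, 30):
    print(cycles, f"{pcr_copies(1, cycles):,.0f}")
```

Starting from a single template, twenty perfect cycles already give about a million copies, and thirty give about a billion.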
You may remember PCR from the height of the COVID pandemic, as one way to test for the novel virus was to take a sample from a patient, convert any viral RNA in it to DNA (the coronavirus genome is RNA, so this is done with reverse transcriptase), and load it with primers corresponding to a sequence unique to the coronavirus genome. If there is coronavirus genetic material within the sample, PCR will multiply it so it can be detected. If the resulting mixture is basically the same as it was before PCR, that means there was no coronavirus genetic material within the sample to multiply, so the test is negative and the patient is presumably not infected.
Of course, for PCR, you need to know the sequence of DNA you're interested in. Fortunately, there are many ways to sequence DNA. One of the earliest of these is Sanger sequencing. Usually, when synthesizing DNA, a dNTP forms a phosphodiester bond with the previous nucleotide, but if a ddNTP (with an H where the 3' OH should be) is integrated, it will terminate the process then and there. By putting a mix of dNTPs and ddNTPs in with some template DNA and radioactive primers, one can obtain a mix of strands that are a wide variety of sizes. These are then put in wells within a gel, and an electric field pulls the negatively-charged DNA through the gel toward the positive electrode. Shorter strands will travel a greater distance because their smaller size lets them slip through the gel more easily. If one makes four solutions, each with a different radioactive ddNTP exclusive to that solution, and runs them through the same electrophoresis gel at the same time in four different wells, one can read off the precise sequence of the strand complementary to the template DNA.
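Reading a four-lane Sanger gel amounts to sorting fragments by length and noting which lane each one came from. Here is a minimal Python sketch of that idea; the lane data are invented, lengths are counted as bases added after the primer, and it assumes every length shows up in exactly one lane.

```python
def read_sanger_gel(lanes: dict[str, list[int]]) -> str:
    """Rebuild the synthesized strand (5'->3') from four ddNTP lanes.

    `lanes` maps each terminating base to the fragment lengths seen in its lane.
    """
    base_at_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    # Shortest fragments ran farthest, so read from shortest to longest.
    return "".join(base_at_length[length] for length in sorted(base_at_length))

# Hypothetical gel: e.g. the ddATP lane has fragments ending at positions 1 and 4.
gel = {"A": [1, 4], "C": [3], "G": [2, 6], "T": [5]}
print(read_sanger_gel(gel))
```

The result is the newly synthesized strand, so the template itself would be its complement.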
Comparable is the Southern blot test, which (after gel electrophoresis) transfers the DNA fragments from the gel onto a nitrocellulose filter, which is then labeled with radioactive DNA probes. While this was named initially for its inventor, Edwin Southern, similar techniques include Northern blot (probes for RNA, to measure expression), Western blot (using a fluorescing antibody to signal when a protein is being produced), Southwestern blot (DNA binding to protein), Northwestern (RNA binding to protein), and Far Western (interprotein interaction), because scientists love to run jokes into the ground.
However, Sanger sequencing is inefficient. Fortunately, there exist many next generation forms of sequencing. Thousands of automated Sanger processes can be replaced nowadays with a single machine, like those made by Illumina. Purified DNA is fragmented and loaded onto a flow cell of spaced-out nanowells. When primers and fluorescent nucleotides are washed onto the chip, the DNA glows, and a camera photographs the result. Based on the type of fluorescence, it can tell which nucleotide was added to which fragment and get an idea of its sequence. Microarrays work very similarly.