# Summary Class notes - Algorithms in Sequence Analysis

##### Course
- Algorithms in Sequence Analysis
- 2020 - 2021

• ## 1 Intro + Pairwise alignment

• What does sequence analysis exploit?
Patterns arising during evolution.
• Specifically: during divergent evolution, starting at a common origin, leading to discernible similarity (conservation patterns) in many cases.
• * Patterns are not uniformly tractable, since different selective pressures lead to very different degrees of conservation of features, due to functional constraints
• Name 4 requirements for evolution.
• Template structure providing stability (DNA)
• Copying mechanism (meiosis)
• "errors": mechanisms providing variation (mutations, insertions, deletions, crossing over, etc)
• Selection: some traits lead to greater fitness of one individual relative to another
• Evolution is a conservative process: the vast majority of mutations will not be selected. In other words, they will not persist, as they lead to worse performance or are lethal. This is called negative or purifying selection.
• Which 3 components of the human exposome did Wild describe?
• A general external environment including urban environment, climate factors, education, social capital and stress.
• A specific external environment with specific contaminants, radiation, infection, lifestyle factors, diet, physical activity etc.
• An internal environment including internal biological factors such as metabolic factors, hormones, gut microbiome, inflammation, oxidative stress.
• Homology means common ancestry. So homologous genes have a common ancestor. There are two forms of homology. Explain.
1. Orthology: orthologous genes are homologous genes in different species (genomes) relating to the speciation event.
2. Paralogy: Paralogous genes are homologous genes (repeats) within the same species (genome)
• What is meant by Burst after Duplication? (BAD)
After a gene duplication, the selection pressure is low on this region. Therefore, mutations and genetic drift can increase.

A related phenomenon is adaptive radiation: a process in which organisms diversify rapidly into a multitude of new forms, particularly when a change in the environment makes new resources available, creates new challenges and opens environmental niches.
• What is horizontal gene transfer? (xenology)
• Also called lateral gene transfer
• Occurs when an organism incorporates genetic material from another organism without being the offspring of that organism.
• (receiving material from ancestor = vertical transfer)
• Most thinking in genetics has focussed on vertical transfer, but awareness of horizontal transfer is increasing.
• Artificial horizontal gene transfer is a form of genetic engineering.
• What are transitions and transversions?
Two types of point mutation (base substitution).

We have purines and pyrimidines:
• A & G = purine
• C & T/U = pyrimidine

Transition:
• purine to purine
• pyrimidine to pyrimidine
Transversion:
• purine to pyrimidine
• pyrimidine to purine
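The rule can be written down as a small check: a substitution within one chemical class is a transition, a substitution across classes is a transversion. The sketch below is only an illustration; the function name `classify_substitution` is made up for these notes, and U is simply treated as T.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single-base change as 'transition', 'transversion' or 'identical'."""
    # Assumption for this sketch: RNA input is allowed by mapping U to T.
    ref = ref.upper().replace("U", "T")
    alt = alt.upper().replace("U", "T")
    for base in (ref, alt):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a nucleotide: {base!r}")
    if ref == alt:
        return "identical"
    # Same class (both purines or both pyrimidines) means transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

print(classify_substitution("A", "G"))  # purine to purine
print(classify_substitution("C", "T"))  # pyrimidine to pyrimidine
print(classify_substitution("A", "C"))  # purine to pyrimidine
```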
• What is a synonymous mutation? And a non-synonymous one?
• Synonymous: mutation that does not lead to an amino acid change
• Non-synonymous mutation: mutation that does lead to an amino acid change
• missense mutation: one aa replaced by other aa
• nonsense mutation: aa replaced by stop codon
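These categories can be checked mechanically by translating the codon before and after the mutation. The sketch below assumes the standard genetic code (DNA alphabet, bases ordered T, C, A, G); the function name `classify_point_mutation` and the extra "stop lost" label are illustrative and not part of the lecture material.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_point_mutation(codon: str, mutant: str) -> str:
    """Compare the translation of a codon before and after a substitution."""
    before = CODON_TABLE[codon.upper()]
    after = CODON_TABLE[mutant.upper()]
    if before == after:
        return "synonymous"
    if after == "*":
        return "nonsense"            # amino acid replaced by a stop codon
    if before == "*":
        return "stop lost"           # not covered in the notes; added for completeness
    return "missense"                # one amino acid replaced by another

print(classify_point_mutation("GAA", "GAG"))  # Glu -> Glu
print(classify_point_mutation("GAG", "GTG"))  # Glu -> Val
print(classify_point_mutation("TGG", "TGA"))  # Trp -> stop
```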
• Many proteins consist of repeats. Sometimes to gain function, sometimes leading to disease (eg single residue repeats). Name some features.
• Evolution reuses developed material.
• Multiple stoichiometric and spatially close combined structure-function relationships
• In proteins, repeats vary from a single aa to complete domains
• Many types of (near) identical repeats exist in genomes. Human genome > 50%.
• eg DNA transposons
• What is a transposon?
Also called a "jumping gene". One or multiple genes, flanked on both sides by short insertion sequences (IS, inverted with respect to one another), plus a gene for transposition. Insertion sequences contain no information.

The transposase enzyme binds to both IS and to the target location in the DNA, then cuts the DNA. The sticky ends are completed again. Some transposons require a specific site; others can insert anywhere.
• What is meant by the structure/function gap?
There are far more sequences than solved tertiary structures and functional annotations. This gap is growing, so there is a need to predict structure and function.
• If we want to find the function of a newly sequenced gene with a 'lazy approach' (only bioinformatics, no biological experiments), how would we do this?
• Find a set of protein sequences similar to the unknown sequence.
• Identify similarities and differences.
• For long protein sequences: first identify domains and then use corresponding subsequences.
• Note: homology is a binary (Boolean) property: yes or no. Similarity comes in degrees, as does the probability of homology. These are scalars.
• Name 3 things we look at for reconstructing evolutionary and functional relationships.
• Based on sequence
• identity (simplest)
• similarity
• Homology (ultimate goal)
• Other information such as 3D structure
• What did a study on 3D structure and protein evolution show?
The distance from the active site determines the rate of evolution. Close: slow evolution, Far: fast evolution
• What is a frame shift mutation?
An insertion or deletion leading to a different reading frame, shifting all codons. Often results in shortened protein. Often nonfunctional.
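A small illustration of the shifted reading frame, using a made-up 12-base sequence. The helper names `codons` and `insert_base` are hypothetical and only serve this example.

```python
def codons(dna: str) -> list[str]:
    """Split a sequence into complete codons, reading from position 0."""
    usable = len(dna) - len(dna) % 3      # ignore trailing incomplete codon
    return [dna[i:i + 3] for i in range(0, usable, 3)]

def insert_base(dna: str, position: int, base: str) -> str:
    """Return the sequence with one base inserted before `position`."""
    return dna[:position] + base + dna[position:]

original = "ATGGCAGTTTAA"                 # hypothetical sequence ending in stop codon TAA
shifted = insert_base(original, 3, "C")   # single-base insertion after the start codon

print(codons(original))  # ['ATG', 'GCA', 'GTT', 'TAA']
print(codons(shifted))   # ['ATG', 'CGC', 'AGT', 'TTA']
```

Every codon downstream of the insertion changes, and the original stop codon is no longer in frame.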
• What is an inversion?
In an inversion mutation, an entire section of DNA is reversed. This can range from a few bases to a large region of a chromosome involving multiple genes.
• What is a DNA expression mutation?
A mutation that does not change the protein itself but its expression, e.g. where a protein is made and how much of it is made. Can lead to proteins being made at the wrong time or in the wrong cell type, or to under/overproduction.
• What can you encounter when reconstructing "evolution" with sequences?
See slide.
• Structure is more conserved than sequence.
• Name conditions for aligning sequences.
• Sequences should be related through divergent evolution
• so they should be homologous
• and preferably orthologous:
• paralogous sequences can become too distant for correct alignment (think of BAD)
• Analogous sequences should not be aligned!
• Sometimes a short functional motif can be detected.
• Note that the sequence of the common ancestor is not available in most cases.
• What should an alignment scoring method do? How is the alignment score defined?
• Produce reasonable alignments
• Must assign scores to:
• substitutions (match/mismatch)
• DNA
• Proteins
• Gap penalties
• linear
• affine
• concave

Alignment score is defined as the summed score of all alignment columns.
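As a sketch of this definition, the function below sums column scores over an already gapped pairwise alignment. All parameter values (match, mismatch, gap costs) are arbitrary illustration values, not the ones used in the course; for proteins the match/mismatch term would come from an exchange matrix. Only linear and affine gap penalties are covered, not concave ones.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap_open=-2, gap_extend=-2):
    """Sum the scores of all columns of a gapped pairwise alignment ('-' = gap).

    gap_open == gap_extend gives a linear gap penalty; a more negative
    gap_open gives an affine penalty: open + (length - 1) * extend.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    prev_gap = None                       # row that had a gap in the previous column
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            gap_row = "a" if a == "-" else "b"
            # Continuing a gap in the same row costs the extension penalty.
            score += gap_extend if prev_gap == gap_row else gap_open
            prev_gap = gap_row
        else:
            score += match if a == b else mismatch
            prev_gap = None
    return score

print(alignment_score("GATTACA", "GA--ACA"))                             # linear: 5 - 4 = 1
print(alignment_score("GATTACA", "GA--ACA", gap_open=-4, gap_extend=-1)) # affine: 5 - 5 = 0
```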
• Explain the concept of combinatorial explosion and the solution we use.
• 1 gap in 1 seq: n+1 possibilities for alignment
• 2 gaps in 1 seq: (n + 1)n
• 3 gaps in 1 seq: (n + 1)n(n - 1)
• *check formula later

explodes!

Solution = dynamic programming:
• breaks the alignment problem up into smaller subproblems and solves them iteratively.
• Alignment is simulated as a Markov process. All sequence positions are seen as independent and identically distributed.
• Chances of sequence events are independent
• Therefore probabilities per aligned position are multiplied
• AA matrices contain log odds --> scores are summed instead of multiplied
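A minimal sketch of the dynamic programming idea for global alignment (Needleman-Wunsch style), assuming a simple match/mismatch score and a linear gap penalty. The scoring values are arbitrary illustration values; a real protein alignment would look up log-odds scores in an exchange matrix instead.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming with a linear gap penalty."""
    n, m = len(seq_a), len(seq_b)
    # score[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,   # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    # Traceback from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                row_a.append(seq_a[i - 1])
                row_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(seq_a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(seq_b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))

best, top, bottom = needleman_wunsch("GATTACA", "GAACA")
print(best)
print(top)
print(bottom)
```

The table has (n + 1)(m + 1) cells and each cell is filled from three neighbours, so the work grows with the product of the sequence lengths instead of with the number of possible alignments.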
• Name 2 alternative alignment methods (so not global, semi-global or local).
• De Novo sequencing
• tracks overlap between millions of short sequence reads coming from a sequencing experiment. If N is the number of reads, on the order of N^2 overlap comparisons are required.
• Reference based sequencing
• aligns short reads against reference genome

These algorithms are not based on evolutionary considerations per se, but match (near-)identical fragments.
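To make the quadratic cost of the de novo case concrete, here is a naive sketch that compares every ordered pair of reads for an exact suffix-prefix overlap. The reads, the function names and the minimum overlap of 3 are all made up for illustration; real assemblers use indexing to avoid the full all-against-all comparison.

```python
from itertools import permutations

def overlap_length(left, right, min_length=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for length in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0

def all_overlaps(reads, min_length=3):
    """Naive all-against-all comparison: N * (N - 1) ordered pairs."""
    overlaps = {}
    for left, right in permutations(reads, 2):
        length = overlap_length(left, right, min_length)
        if length > 0:
            overlaps[(left, right)] = length
    return overlaps

reads = ["ATGCGT", "CGTTAC", "TACGGA"]    # hypothetical reads
print(all_overlaps(reads))
# {('ATGCGT', 'CGTTAC'): 3, ('CGTTAC', 'TACGGA'): 3}
```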

What is the problem with very distant homologs?
They will probably fall into the twilight zone, where sequence similarity is too low to infer homology reliably.
Why are very short sequences not suitable for database searching?
Hits are selected based on a cut-off score. Very short sequences will not yield a score that is high enough, even if they are very similar.
How can redundancy be a problem for homology searching?
A database can contain very many sequences for a certain protein --> redundant hits obscure more distant hits (if you are looking for those).
Why can multi-domain proteins be tricky in searching for homology?
Imagine the protein consists of domains A and B, and B is common --> many matches are made with B, even though A might be the interesting one.

In iterative searching: imagine you detected many BC proteins in the first search --> the search may drift to CD.
What is the idea of iterative database searching?
• Homologous proteins are classified into families
• If a homologous relationship can only be inferred using structure and/or function (because sequence similarity is too low), they are classified into so-called super-families
• In real homology searches, many homologous relationships are missed due to the fact that sequence similarity is lowered by divergent evolution beyond recognition.
• The idea of iterative searching is building a more information-rich description of the query sequence using the hits from an earlier iteration, so the search becomes more sensitive
• This can be repeated until the search converges; i.e. no more hit sequences are found
In which three phases can you split databank searching?
1. Matching phase: the query sequence is compared by (partial) alignment with each databank sequence. Most programs pre-select hits. The comparison is usually heuristic and fast.
2. Scoring phase: if the matching phase is passed, the pair is realigned and scored.
3. Selection phase: based on a statistical criterion, sequences with a score above a user-defined threshold are returned as hits.
How to assess homology search methods?
• We need an annotated database, so we know which sequences belong to what homologous (super)families
• Examples of databases of homologous families are Pfam, HOMSTRAD or ASTRAL
• The idea is to take a protein sequence from a given homologous family, then run the search method, and then assess how well the method has carried out the search (i.e. recognised the family members)
• This should be repeated for many query sequences and then the overall performance can be measured
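The assessment procedure above can be sketched as follows. The notes do not name a specific performance measure; sensitivity and precision are used here as an assumption, and the function names and identifiers are hypothetical. The query itself is assumed to be excluded from the set of family members.

```python
def evaluate_search(hits, family_members):
    """Compare the hits of one search against the annotated family of the query."""
    hits, family_members = set(hits), set(family_members)
    true_positives = len(hits & family_members)
    return {
        "true_positives": true_positives,
        "false_positives": len(hits - family_members),
        "false_negatives": len(family_members - hits),
        # Fraction of family members that were found.
        "sensitivity": true_positives / len(family_members) if family_members else 0.0,
        # Fraction of returned hits that are real family members.
        "precision": true_positives / len(hits) if hits else 0.0,
    }

def mean_sensitivity(per_query_results):
    """Overall performance: average sensitivity over many query sequences."""
    values = [result["sensitivity"] for result in per_query_results]
    return sum(values) / len(values) if values else 0.0

# Hypothetical identifiers: the search found p1 and p2 but also the unrelated x9.
result = evaluate_search(hits={"p1", "p2", "x9"}, family_members={"p1", "p2", "p3", "p4"})
print(result["sensitivity"], result["precision"])
```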
Why do we formalise ( = represent as string) genome information?
• The cellular machinery is exceedingly complex
• The transformation of genomic information in the cell to text sequences (character strings) is a reduction in complexity
• This formalisation makes genomic information accessible and tractable
What do errors mean for homology searching?
• Database searching algorithms just need to decide if the alignment score is good enough for inferring homology
• Sometimes, alignments can be incorrect but the score can be close enough for the database searching method to correctly identify the DB sequence as a homolog (or not)
• However, for more distant hits a correct alignment becomes crucial, as the relative differences in alignment score become larger
What do errors mean for alignment?
• Alignments need to be able to match distantly related sequences and to skip anything from secondary structure elements to complete domains (i.e. by putting gaps opposite these motifs in the shorter sequence).
• Depending on the residue exchange matrix and gap penalties chosen, the algorithm might have difficulty with aligning distant homologs or inserting long gaps (for example when using high affine gap penalties), resulting in incorrect alignment.