Machine Learning in Bioinformatics (I529/B659) -- Spring 2019

RNA-seq & clustering algorithms

RNA-seq: full-length transcript sequencing protocols (e.g., Smart-seq2) vs. tag-based protocols (e.g., 10x Chromium); bulk RNA-seq vs. single-cell RNA-seq
Preprocessing of the sequencing data
Quantification
Read mapping -> read counts -> normalization (check out some slides about read mapping; the key is a good indexing technique)
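To make the "good indexing technique" point concrete, here is a minimal sketch of seed-and-verify mapping with a k-mer hash index. It is only an illustration: production aligners use far more compact indexes (e.g., FM-index or suffix arrays) and tolerate mismatches, and the names build_kmer_index and map_read, as well as the toy reference string, are hypothetical.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def map_read(read, reference, index, k):
    """Exact-match mapping: look up the read's first k-mer (the seed),
    then verify the full read at each candidate position."""
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits

# Toy reference (an assumption for illustration, not real genome sequence).
reference = "ACGTACGTTAGCCGATAGCTAGGCTA"
k = 4
index = build_kmer_index(reference, k)

# The seed "TAGC" occurs at two positions, but only one survives verification.
print(index["TAGC"])                                 # [8, 15]
print(map_read("TAGCCGAT", reference, index, k))     # [8]
```

The index is built once per reference, so each read costs one hash lookup plus a few string comparisons instead of a scan over the whole reference.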
"Sample-specific reads were aligned to the mouse reference genome (GRCm38.p3; Ensembl V.80) and genomic features determined using featureCounts." (paper)
Postprocessing
"Low-quality cells were filtered resulting in normalised data from 325 cells and 34,769 genes being passed onto downstream clustering analyses. " (paper)
Feature selection.
Paper: M3Drop: Dropout-based feature selection for scRNASeq
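The idea behind dropout-based feature selection can be sketched as follows: model the expected dropout rate of a gene as a Michaelis-Menten function of its mean expression S, P = 1 - S/(K + S), and keep the genes that drop out much more often than the curve predicts (these tend to be expressed in only a subset of cells). This is a simplified illustration, not the M3Drop implementation: K is fitted with a crude grid search, and a fixed residual margin stands in for a proper statistical test. All names and the toy matrix are assumptions.

```python
def gene_stats(matrix):
    """matrix: list of genes, each a list of normalized expression values
    across cells. Returns (mean expression, dropout rate) per gene."""
    stats = []
    for values in matrix:
        mean = sum(values) / len(values)
        dropout = sum(1 for v in values if v == 0) / len(values)
        stats.append((mean, dropout))
    return stats

def expected_dropout(mean, K):
    """Michaelis-Menten style curve: P = 1 - S / (K + S)."""
    return 1.0 - mean / (K + mean)

def fit_K(stats, candidates):
    """Pick the K (from a grid of candidates) with the smallest squared error."""
    def sse(K):
        return sum((d - expected_dropout(m, K)) ** 2 for m, d in stats)
    return min(candidates, key=sse)

def select_features(matrix, margin=0.2):
    """Indices of genes whose observed dropout rate exceeds the fitted curve
    by more than `margin` (a hypothetical cutoff)."""
    stats = gene_stats(matrix)
    K = fit_K(stats, [0.1 * i for i in range(1, 501)])
    return [i for i, (m, d) in enumerate(stats)
            if d - expected_dropout(m, K) > margin]

# Toy data: 4 genes x 4 cells. Gene 2 has the same mean as gene 0 but is
# expressed in a single cell, so its dropout rate is unusually high.
expr = [
    [4, 4, 4, 0],
    [2, 2, 0, 0],
    [12, 0, 0, 0],
    [5, 5, 5, 5],
]
print(select_features(expr))  # [2]
```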

Unsupervised ML algorithm: clustering. Clustering discovers the inherent groupings in data, for example grouping students by learning behavior, customers by purchasing behavior, or cells by expression profile, or identifying cancer subtypes from RNA-seq data.
See some old slides about clustering algorithms & applications in bioinformatics (key factors: what distance measure is used, and what principle is used to construct clusters)
SC3: consensus clustering of single-cell RNA-seq data (a user-friendly tool for unsupervised clustering, which achieves high accuracy and robustness by combining multiple clustering solutions through a consensus approach)
Semisoft clustering of single-cell data
Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data ("... of a dozen clustering methods")
Check out all the data from the paper
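The consensus idea mentioned for SC3 (combining multiple clustering solutions) can be illustrated with a co-association matrix: entry (i, j) is the fraction of solutions in which cells i and j land in the same cluster. In the sketch below the final grouping simply links cells that co-cluster in more than half of the solutions and takes connected components; this is a simplification chosen to keep the example short, not the procedure SC3 itself uses, and the input label vectors are made up.

```python
def consensus_matrix(solutions):
    """solutions: list of label lists, one per clustering run, all of length n.
    Returns an n x n matrix of co-clustering frequencies in [0, 1]."""
    n = len(solutions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in solutions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(solutions) for c in row] for row in counts]

def consensus_clusters(matrix, threshold=0.5):
    """Link cells whose co-clustering frequency exceeds `threshold` and
    return the connected components as cluster labels."""
    n = len(matrix)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Three clustering runs over 5 cells. Label names differ between runs, and
# the runs disagree about cell 2.
solutions = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
cm = consensus_matrix(solutions)
print(consensus_clusters(cm))  # [0, 0, 1, 1, 1]
```

Because the matrix only records whether two cells share a label, it does not matter that cluster labels are numbered differently in each run.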

"We defined six cell clusters within the LIN–HLA-DR+CD14– population using unsupervised analysis that did not rely on known markers of DCs. Briefly, we identified 595 genes exhibiting high variability across single cells, reduced the dimensionality of these data with principal components analysis (PCA), and identified five significant PCs using a previously described permutation test (6, 9). We used these PC loadings as input to t-distributed stochastic neighbor embedding (t-SNE) (10) for visualization, and clustered cells using a graph-based approach similar to one recently developed for mass cytometry data (6, 11). " (paper (Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors))
(this image is taken from here)
PCA
t-SNE
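For the PCA step, a minimal sketch of what the method computes: center the data, form the covariance matrix, and find its leading eigenvector (the first principal component), here by power iteration in plain Python. In practice one would use a linear algebra library and keep several components; the function names and the toy data are assumptions. t-SNE is not sketched because it realistically requires a library implementation.

```python
import math

def center(data):
    """Subtract the per-column (per-gene) mean from every row (cell)."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]

def covariance(centered):
    """Sample covariance matrix (d x d) of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def first_pc(cov, iterations=200):
    """Leading eigenvector of `cov` by power iteration. The uniform start
    vector is assumed not to be orthogonal to that eigenvector."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v

def project(centered, v):
    """Coordinate of each row along direction v (the PC scores)."""
    return [sum(r[j] * v[j] for j in range(len(v))) for r in centered]

# Toy data: 4 cells x 2 genes lying on the line y = 2x, so all variance
# is along the direction (1, 2) / sqrt(5).
data = [[1, 2], [2, 4], [3, 6], [4, 8]]
centered = center(data)
pc1 = first_pc(covariance(centered))
scores = project(centered, pc1)
print([round(x, 4) for x in pc1])  # [0.4472, 0.8944]
```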