Stochastics at Scale

Course Description

We will study a collection of interesting stochastic algorithms and models, that serve to illustrate the following idea:

Size is a critical consideration in the design of useful stochastic models and algorithms.

A formal way to express this is through the notion of scaling - given a model/algorithm for some system, how does it behave when we grow some aspect of the system. This course will try to build intuition behind the importance of scaling by presenting examples where understanding scaling is crucial for good system design.

Course Information

References

There is no required textbook for the course. Different topics will be covered from different sources, and notes and links to the relevant material will be periodically posted here. In particular, the course will cover selected topics from the following textbooks:

  • There are two good textbooks for the first unit of the course:
    1. Randomized Algorithms by Rajeev Motwani and Prabhakar Raghavan
    2. Probability and Computing by Michael Mitzenmacher and Eli UpfalThe Cornell library (see links above) has access to online versions of both books.
  • A good introduction for the material in the second unit is Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman and Jeff Ullman. An online version is freely available on the book’s webpage.
  • A good introduction for some of the material in the third unit is Networks, Crowds and Markets (Sections V, VI) by David Easley and Jon Kleinberg. For a more technical treatment, see _Epidemics and Rumours in Complex Networks _by Moez Draief and Laurent Massoulie. Online versions are again available.

Prerequisites

Knowledge of basic probability (at the level of ORIE 3500): in particular, random variables, conditional probability and expectation, common probability distributions and their properties (binomial, geometric, exponential, Poisson, Gaussian). The latter part of the course (after the prelim) will require knowledge of stochastic processes, in particular, Markov chains (at the level of ORIE 3510). There will be a recitation session covering the essentials, and students may be able to manage without the required background. Prior exposure to algorithms and graph theory will also be useful, but not essential. Send a mail to the instructor if you are concerned about having the appropriate prerequisites.

Lectures

  • Unit 1: Basic probability and randomized algorithms (Notes)

    • Lecture 1: Introduction, linearity of expectation, indicator random variables, coupon collector
    • Lecture 2: Birthday paradox, law of total probability and deferred decisions, verifying matrix multiplication (Freivald’s algorithm)
    • Recitation 1: More applications of linearity of expectation - Fixed points of random permutations (derangements), triangles in random graphs, k-length runs in DNA sequences
    • Lecture 3: Quicksort, introduction to graph cuts
    • Lecture 4: Simple randomized Max-cut, Min-cut (the CONTRACT algorithm)
    • Lecture 5: Analysis of CONTRACT, the Karger-Stein algorithm
  • Unit 2: Concentration inequalities

    • Lecture 6: Markov’s and Chebyshev’s inequalities, union bound, concentration for coupon collector
    • Lecture 7: Randomized selection
    • Lecture 8: The Chernoff bound, basic “sampling theorem”
    • Recitation 2: Completed analysis of randomized sampling, concentration for Quicksort
    • Lecture 9: Chernoff bound for Rademacher variables (random walk on a line), randomized set balancing
    • Lecture 10: Maximum loaded bin
  • Unit 3: Randomized algorithms for big data

    • Lecture 11: Hashing - introduction, chaining, fingerprinting, the Bloom filter
    • Lecture 12: Hashing (contd) - Bloom filter analysis, perfect hashing, the FKS hashing scheme
    • Lecture 13: Consistent Hashing, intro to distributed hash tables
      -Bloom filter
      -FKS scheme
      -Consistent hashing
    • Lecture 14: Locality Sensitive Hashing - introduction and similarity metrics, LSH for Hamming distance
    • Lecture 15: Locality Sensitive Hashing - Jaccard similarity, MinHash, probability amplification
    • Lecture 16: Locality Sensitive Hashing for L2 distance, the Johnson-Lindenstrauss Lemma
      -The chapter on similar items in the MMDS textbook has a good introduction to the topic - Section 3.6 onwards covers LSH. You can also see the slides from Jure Leskovec’s course - part 1, part 2.
      -Wikipedia articles: LSH; MinHash; JL Lemma.
      -For the most recent information on LSH, see Alex Andoni’s website on LSH and its applications.
    • Lecture 17: Streaming - Reservoir sampling, Morris’ approximate counting
    • Lecture 18: Streaming - Frequency estimation, heavy-hitters (the Count-Min sketch)
    • Lecture 19: Streaming - Moment estimation (the AMS sketch), distinct elements (Flajolet-Martin sketch)
      -The chapter on streaming in the MMDS textbook, and associated slides - part 1, part 2. These are a great introduction, although there is some mismatch between the topics covered.
      -Wikipedia articles: Streaming algorithms; Reservoir sampling; Morris counter; Count-Min sketch.
      -Website on the Count-Min sketch and its applications.
    • Lecture 20: F2 estimation via linear sketches. Introduced graph sketches
    • Lecture 21: All-pairs distance sketch (from the Das Sarma et al. paper)
  • Prelim exam (Solutions)

  • Unit 4: Large-scale stochastic models and phase transitions

    • Lecture 22: The Galton-Watson branching process
    • Lecture 23: Phase transition in the GW branching process
    • Lecture 24: DFS in the GW branching process. The giant component in the G(n,p) random graph
    • Lecture 25: Revision of discrete-time Markov Chains. The Gambler’s Ruin problem
    • Lecture 26: Revision of the Poisson process and continuous-time Markov Chains
    • Lecture 27: Rumor spreading in a clique (the SI epidemic model)
    • Lecture 28: The SIS epidemic: introduction
    • Lecture 29: The SIS epidemic: phase-transition on the clique
    • Lecture 30: The SIS epidemic: phase-transition on general graphs (via coupling and graph conductance)
    • Lecture 31-32: CTMC transition-rate matrix and Kolmogorov equations
    • Lecture 33: Diff-eqn approximations to MCs (Phase transition in the G(n,N) graph)
    • Lecture 34: Diff-eqn approximations to MCs (SIS epidemic threshold and network spectral radius)
    • Lecture 35: The power of two choices
    • Lecture 36: Oblivious routing on hypercubes - the Valiant scheme
    • Lecture 37: Review

Homeworks

Siddhartha Banerjee
Siddhartha Banerjee
Associate Professor

Sid Banerjee is an associate professor in the School of Operations Research at Cornell, working on topics at the intersection of data-driven decision-making and stochastic control, economics and computation, and large-scale network algorithms.