Messages

  • Aug. 8, 2013: Webpage created
  • Sept. 9, 2013: Project instruction posted
  • Sept. 11, 2013: Project proposal due on Sept. 23, 2013, 5pm EST (You are allowed to modify/change the project proposal after that, but shound be before Oct. 7, 2013).
  • Sept. 23, 2013: Data Collection Report (if you choose direction 2) due on Oct. 14, 2013, 5pm EST
  • Sept. 23, 2013: Homework 1 is out. Due on Oct. 21, 2013, 5pm EST
  • Oct. 3, 2013: Intermediate Report due on Oct. 28, 2013, 5pm EST
  • Oct. 3, 2013: Homework 2 is out. Due on Nov. 18, 2013, 5pm EST
  • Oct. 3, 2013: Final Report due on Dec. 2, 2013, 5pm EST
  • Oct. 3, 2013: Project presentations on Dec. 4, Dec. 9, Dec. 11, 2013, in class

Course summary

In this course we will talk about sublinear algorithms, which has its roots in the study of Big Data that occur more and more frequently in various applications, e.g., analyses of financial transactions, internet traffic, social networks, genome sequences, etc. Concretely, we will talk about:
1. Sublinear space algorithms. In particular, data stream algorithms, namely, algorithms that solve a problem by making one pass over the data set while using small memory. These algorithms are important in many application areas such as databases and networking, where data arrives at a high speed and there is no time and/or need to store it for offline processing.
2. Sublinear time algorithms, that is, algorithms that do not even read the whole input when outputting the answers.
3. Sublinear communication algorithms. The data is stored in multiple machines, who want to jointly compute functions defined on the union of the data sets via communication.
4. Random topics: Topics VOTED by students.
Participants are expected to have a basic background in algorithm design and probability.
The evaluation will be based on a set of theoretical exercises and individual project/presentation. The list of questions will be handed out in the middle of the course.
Detailed list of topics is available in the course plan below.

Lecturer

Qin Zhang
Email: qzhangcs@indiana.edu
Office hours: By email appointment

Time and place

9:30AM-10:45AM Mon, Wed.
Informatics East, Room 122.

Textbooks

Course schedule

(subject to adjustments as we go along).

 Week   Date   Section   Content   Literature   Slides   Comments 
  1   Aug. 26     0. Introduction   New models for Big Data   slides
  1   Aug. 28     Interesting problems     slides   Basic probability tools 
  2   Sep. 2   Labor day. No class
  2   Sep. 4     Basic probabilistic tools   slides   Read Chapter 3, 4 of [MU]
  3   Sep. 9   1. Sublinear in space     Distinct elements, FM sketch     [FM]   slides   Read Section 2 of this notes
  3   Sep. 11     Improvement on FM sketch   [BYJKST]     same   Read Section 3 of this notes
  4   Sep. 16   Heavy hitters, Space-Saving     [MAA]   same   Guest lecture by Prof. Ergun
  4   Sep. 18   Count-Min   [CM]   same   Guest lecture by Prof. Ergun
  5   Sep. 23   Linear sketch     same   Read this note
  5   Sep. 25   Count-sketch   [CCF]   same   Read Section 4 of this notes
  6   Sep. 30   Alternative for L2 point query   [GKMS]   same   Read this note
  6   Oct. 2   2. Sublinear in comm.   Connectivity   [AGM-1]   slides   Read [AGM-1] for details
  7   Oct. 7     L0 sampling   [JST]   same   Read [JST] for details
  7   Oct. 9     Min-cut, Bipartiteness, MST     [AGM-1], [AGM-2]     same   Read [AGM-1], [AGM-2] for details  
  8   Oct. 14     Sparsification   [AGM-2]   same   See this note
  8   Oct. 16   3. Sublinear in time   Average degree   [Fei]   slides   Read Section 3.1 in [CS]
  9   Oct. 21     Average degree (cont.)   [GR]   same   See this hand-written note
  by Ronitt Rubinfeld
  9   Oct. 23     Minimum spanning tree   [CRT]   same   See this note
  10   Oct. 28   4. Random topics
  VOTED by students
  Compressive sensing, intro     slides  
  10   Oct. 30     Matching pursuit,
  L1/L1 recovery
    same   Read this note
  11   Nov. 4     RIP, L2/L1 recovery     same   Read this note
  11   Nov. 6     Subspace embedding,
  Lp regression
  [WZ]   slides  
  12   Nov. 11     Sampling in distributed monitoring   [CMYZ]   slides  
  12   Nov. 13     Lower bounds     slides  
  13   Nov. 18     Lower bounds (cont.)     same  
  13   Nov. 20     AMS sampling and
     entropy in data streams
  [CCM]   slides  
  14   Nov. 25           Thanksgiving Break. No classes
  14   Nov. 27           Thanksgiving Break. No classes
  15   Dec. 2   5. Homework discussion         
  15   Dec. 4   6. Student presentations    Chen: [Space] Triangle-counting  
  Dawoud: [Communication]
        Distributed monitoring entropy 
  Firoz: [Space] Geometric problems  
     
  16   Dec. 9     Jazayeri: [Time] Average degree  
  Kanakamedala:
        [Space] Frequent items  
  Narayanan: [Time] Matching  
     
  16   Dec. 11     Taduri: [Space] Frequent items  
  Undrakonda: [Space]
       Hierarchical heavy hitters  
     

Literature

(will add more as we go along)

Grading

  • Assignments 42%

    There will be 2 homework assignments, each with about 3-5 questions. Assignments will be posted in the middle of the course. The answers should be typeset in LaTeX; here is a template to start with.

  • Projects 58%

    The project consists of three components: 1. Write a proposal. 2. Make a presentation. 3. Write a report. The proposal and report should be typeset in LaTeX.
    The specifics of the project will be very flexible. During the course many problems will be introduced in various computational models for Big Data. A few of them will be discussed in detail, and the rest will only be mentioned briefly. For the project, you can for example:
    1. Pick a problem that is only briefly mentioned in the class and make a survey of its state-of-art results.
    2. Pick some algorithms that are mentioned in the class, implement them and compare with other algorithms that you can think of. (Some datasets that you can use will be posted soon)
    3. Propose new algorithms for problems in models that are discussed in the course. You can either analyze them theoretically (that is, prove some bounds on space/time/communication), or implement them and compare with existing algorithms.
    The grade of the projects will depend on how difficult the task is (e.g., proposing good new algorithms will generally be more difficult than understanding/implementing existing ones), and how well it is done.
    See here for some detailed instructions.

Course policies

For assignments, students may discuss answers with anyone, including problem approaches and proofs. But all students must write their own proofs, and write-ups. The names of all people that you have talked to should be listed at the beginning of the first page. If a solution comes from existing papers/web/books, they must be properly cited, and you must write the solution in a way that demonstrates your understanding (simply copying the solution will be considered as plagiarism). All deadlines are firm. No late assignments will be accepted unless there are legitimate circumstances.
For projects, you may discuss your project with anyone as well, but if this contributes to your final product, they must be acknowledged. Any outside materials used must be referenced appropriately.
For more details, see Indiana University Code of Student Rights, Responsibilities, and Conduct.

Prerequisites

    One is expected to know basics on algorithm design and analysis as well as probability. E.g., have taken B403 ``Introduction to Algorithm Design and Analysis" or equivalent courses.