Efficient Algorithms for Querying Noisy Distributed/Streaming Datasets


PI : Qin Zhang
NSF CCF-1525024 (June 2015 - May 2018)


Abstract

This project aims to study the design of efficient query algorithms for noisy datasets in distributed and streaming applications. Noisy data is universal in today's world. Imprecise and varying references to the same real-world entities are ubiquitous in scientific and commercial databases. This noise poses significant obstructions to accurate data analytics. As an example of "noisy data," consider YouTube videos. YouTube tracks the views of individual videos. However, there are frequently many similar versions of the same event and answering a basic question such as "How many people viewed this event?" is challenging using current techniques. This project will provide new techniques and insights to combat the noisy nature of large datasets, and hence will enhance our ability to process the ever-increasing quantity of business and scientific data. The products of this project will be integrated into a trilogy of graduate and undergraduate courses on algorithms, databases, and data mining. The PI will disseminate research outcomes by giving talks at conferences/workshops, universities, industrial labs, as well as online media.

More technically, this project tries to answer the following question: can we run distributed and streaming algorithms directly on the noisy datasets, resolve the noise "on the fly", and retain communication and space efficiency compared with the noise-free setting? The PI plans to study statistical, relational and graph problems. This project has the potential to impact a wide range of active research areas in theoretical computer science, including distributed and streaming algorithms, group testing, compressed sensing, communication complexity, clustering, and locality sensitive hashing.


Publications

  1. Bias-Aware Sketches
    with J. Chen
    Proc. International Conference on Very Large Databases (VLDB 17), to appear. Munich, Germany, August-September 2017.

  2. Communication-Optimal Distributed Clustering
    with J. Chen, H. Sun and D. Woodruff
    Proc. Annual Conference on Neural Information Processing Systems (NIPS 16). Barcelona, Spain, December 2016.

  3. Edit Distance: Sketching, Streaming and Document Exchange (preliminary full version, 30 pages) [conf. talk]
    with D. Belazzougui
    Proc. IEEE Symposium on Foundations of Computer Science (FOCS 16). New Brunswick, NJ, U.S.A., October, 2016.

  4. Streaming Algorithms for Robust Distinct Elements
    with D. Chen
    Proc. ACM SIGMOD International Conference on Management of Data (SIGMOD 16). San Francisco, CA, U.S.A., June 2016.

  5. On Sketching Quadratic Forms (preliminary full version, 47 pages)
    with A. Andoni, J. Chen, R. Krauthgamer, B. Qin and D. P. Woodruff
    Proc. Innovations in Theoretical Computer Science (ITCS 16). Cambridge, MA, USA, January 2016.

  6. Lower Bounds for Number-in-Hand Multiparty Communication Complexity, Made Easy
    with J. M. Phillips, E. Verbin
    SIAM Journal of Computing (SICOMP), volume 45, issue 1, pages 174-196, February 2016 [Link].
  7. Communication-Efficient Computation on Distributed Noisy Datasets [conf. talk]
    Q. Zhang
    Proc. ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 15), pages 313-322. Portland, Oregon, U.S.A., June 2015.

Invited/Conference Talks

  • Efficient Algorithms for Streaming Datasets with Near-Duplicates
    At Theory and Applications of Hashing , Dagstuhl Seminar, Germany. May 2017.

  • Some New Questions in Communication Complexity
    At Communication Complexity and Applications, II , Banff, Alberta, Canada. March 2017.

  • The Communication Complexity of Distributed Set-Joins
    At Computer Science Colloquium, University of Houston, Houston, TX, USA. November 2016.

  • Edit Distance: Sketching, Streaming and Document Exchange
    At Aarhus University , Aarhus, Denmark. September 2016.
    And at FOCS 16, New Brunswick, NJ, U.S.A., October, 2016.

  • Lower Bound Techniques for Multiparty Communication Complexity
    At Nexus of Information and Computation Theories , Paris, France. February 2016.

  • Streaming Algorithms for Robust Distinct Elements
    At Workshop on Multi-dimensional Proximity Problems, University of Maryland, College Park, MD, USA. January 2016.

  • Communication-Efficient Computation on Distributed Noisy Datasets.
    At SPAA 15, Portland, OR, USA. June, 2015
    And at Sublinear AlgorithmsWorkshop, Johns Hopkins University, Baltimore, MD, USA. January 2016.

Educational and Other Development

A Trilogy of Courses: Code libraries for streaming algorithms:

Personnel