(Computer) Vision for Intelligent Robotics, Fall 2015

Course number: Info I590 / CS B659

Meets: Tuesday/Thursday 4:00-5:15pm

Location: Info 107

Website: http://homes.soic.indiana.edu/classes/fall2015/csci/b659-mryoo/ 

Instructor: Prof. Michael S. Ryoo

Email: mryoo "at" indiana.edu

Office: Informatics E259

Office hours: by appointment (send email)

Course description:

In this graduate seminar course, we will review and discuss state-of-the-art computer vision methodologies while also focusing on their applications to robots (i.e., robot perception). Specific topics will include object recognition, activity recognition, deep learning for videos, and first-person vision for wearable devices and robots. The objective of the course is to understand important problems in computer vision and intelligent robotics, discuss advantages and disadvantages of existing approaches, and identify open questions and future research directions.

Prerequisites:

Interest in computer vision; basic programming skills; ability to read and understand conference papers. This course will focus on video-based techniques and their robotics applications, extending the topics covered in other computer vision courses including B490/B659. Any previous experience in computer vision, machine learning, or robot vision will be a plus.

Please talk to me if you are unsure whether the course is a good match for your background.

(tentative) Schedule:

Each entry lists the date, topic, assigned paper(s) (numbers refer to the reference list below), and presenter(s).

8/25: Course introduction. Presenter: M. Ryoo
8/27: Research overview and general background. Presenter: M. Ryoo

1. Object recognition - understanding images

9/1: Image features, matching, and basic classification (invariant local features, bag-of-visual-words, spatial pyramid, ...). Papers [1,2]. Presenter: J. Firoz
9/3: Image features, matching, and basic classification (cont.). Papers [3,4]. Presenter: A. Seewald

9/8: Object detection (histograms of oriented gradients, deformable part models, ...). Papers [5,6]. Presenter: T. Alowaisheq
9/10: Object detection (cont.). Paper [7]. Presenter: C. Zhou

9/15: Deep learning (convolutional neural networks (CNNs), CNN-based segmentation, ...). Paper [8]. Presenter: Z. Chen
9/17: Deep learning (cont.). Paper [10]. Presenter: T. Wanyan

2. Activity recognition - understanding videos

9/22: Videos as frame sequences vs. space-time volumes (hidden Markov models (HMMs), space-time volumes, local XYT features, ...). Paper [12]. Presenter: C. Sanders
9/24: Videos as frame sequences vs. space-time volumes (cont.). Paper [16]. Presenter: C. Achgill
9/29: Videos as frame sequences vs. space-time volumes (cont.). Papers [15,47]. Presenters: C. Fan, A. Seewald

10/1: Deep learning for videos (CNNs for videos, recurrent neural networks (RNNs), ...). Paper [24]. Presenter: A. Piergiovanni
10/6: Deep learning for videos (cont.). Paper [27]. Presenter: D. Dhami

10/8: Hierarchical activity recognition (multi-layer HMMs, stochastic CFGs, logic-based methods, ...). Paper [21]. Presenter: S. Gupta
10/13: Hierarchical activity recognition (cont.). Paper [19]. Presenter: E. Hassan

3. Visual perception for robots

10/15: First-person object, action, and activity recognition. Paper [29]. Presenter: T. Alowaisheq
10/20: Object detection based first-person video understanding. Paper [30]. Presenter: Z. Chen
10/22: Ego-action recognition and video summarization. Papers [31,32]. Presenters: C. Achgill, S. Gupta
10/27: First-person interaction recognition. Paper [33]. Presenter: M. Ryoo

10/29: Learning “actionable” activity representations (robot behavior “learning from imitation”, syntactic approaches, ...). Paper [35]. Presenter: D. Dhami
11/3: Learning “actionable” activity representations (cont.). Papers [36,37]. Presenters: J. Lee, D. Sashikanth

11/5: Social cues - detecting human gaze orientations from first-person videos. Paper [38]. Presenter: T. Wanyan

5. Understanding surrounding environments

11/10: 3-D scene understanding - estimating 3-D scene geometry from images. Paper [39]. Presenter: E. Hassan
11/12: 3-D scene understanding (cont.). Paper [40]. Presenter: C. Zhou

11/17: Context - object and activity recognition using contextual information. Papers [42,43]. Presenters: A. Seewald, M. Singh
11/19: Context (cont.). Papers [44,45]. Presenter: A. Piergiovanni

11/19: Affordance - action possibilities with objects and scenes. Paper [46]. Presenter: D. Sashikanth

11/24, 11/26: No class - Thanksgiving

12/1: No class - final project preparation; individual project discussions
12/3: Final discussion and quiz

12/8, 12/10: Final project presentations

Course requirements and grading:

Paper/experiment presentations (30%): each student is expected to give 2-3 presentations over the course of the semester. For each presentation, a student may choose either (1) a paper presentation or (2) an experiment presentation (i.e., presenting the results obtained by running the method's code on existing datasets).

Paper review and class participation (20%): students are required to choose one paper per class and submit a short review of it before that class.

Final project (50%): each student will choose an individual research topic and conduct research on it. This can be as simple as implementing several previous methods and comparing them, or as ambitious as proposing new concepts and algorithms, implementing them, and evaluating them on public datasets to advance the state of the art.

References

  1. D. G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints. IJCV 2004.
  2. J. Sivic and A. Zisserman, Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003.
  3. S. Lazebnik, C. Schmid, and J. Ponce, Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR 2006.
  4. Y. Jia, C. Huang, and T. Darrell, Beyond Spatial Pyramids: Receptive Field Learning for Pooled Image Features. CVPR 2012.
  5. N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection. CVPR 2005.
  6. P. Gehler and S. Nowozin, On Feature Combination for Multiclass Object Classification, ICCV 2009.
  7. P. Felzenszwalb, D. McAllester, and D. Ramanan, A Discriminatively Trained, Multiscale, Deformable Part Model. CVPR 2008.
  8. A. Krizhevsky, I. Sutskever, and G. Hinton, Imagenet Classification with Deep Convolutional Neural Networks. NIPS 2012.
  9. C. Szegedy et al., Going Deeper with Convolutions. CVPR 2015.
  10. R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014.
  11. J. Long, E. Shelhamer, and T. Darrell, Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
  12. J. Yamato, J. Ohya, and K. Ishii, Recognizing Human Action in Time-Sequential Images Using Hidden Markov Model. CVPR 1992.
  13. N. Oliver, B. Rosario, and A. Pentland, A Bayesian Computer Vision System for Modeling Human Interactions. T PAMI 2000.
  14. A. Bobick and J. Davis, The Recognition of Human Movement Using Temporal Templates. T PAMI 2001.
  15. M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, Actions and Space-Time Shapes. ICCV 2005.
  16. I. Laptev, On Space-Time Interest Points. IJCV 2005.
  17. P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, Behavior Recognition via Sparse Spatio-Temporal Features. VS-PETS 2005.
  18. I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, Learning Realistic Human Actions from Movies. CVPR 2008.
  19. M. S. Ryoo and J. K. Aggarwal, Spatio-Temporal Relationship Match: Video Structure Comparison for Recognition of Complex Human Activities. ICCV 2009.
  20. H. Wang and C. Schmid, Action Recognition with Improved Trajectories. ICCV 2013.
  21. Y. Ivanov, and A. Bobick, Recognition of Visual Activities and Interactions by Stochastic Parsing. T PAMI 2000.
  22. J. M. Siskind, Grounding the Lexical Semantics of Verbs in Visual Perception using Force Dynamics and Event Logic. JAIR 2001.
  23. M. S. Ryoo and J. K. Aggarwal, Stochastic Representation and Recognition of High-level Group Activities, IJCV 2011.
  24. D. Tran et al., Learning Spatiotemporal Features with 3D Convolutional Networks. arXiv:1412.0767.
  25. A. Karpathy et al., Large-scale Video Classification with Convolutional Neural Networks. CVPR 2014.
  26. A. Graves, A. Mohamed, and G. Hinton, Speech Recognition with Deep Recurrent Neural Networks. ICASSP 2013.
  27. J. Ng et al., Beyond Short Snippets: Deep Networks for Video Classification. CVPR 2015.
  28. J. Donahue et al., Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. CVPR 2015.
  29. A. Fathi, A. Farhadi, and J. M. Rehg, Understanding Egocentric Activities. ICCV 2011.
  30. H. Pirsiavash and D. Ramanan, Detecting Activities of Daily Living in First-Person Camera Views. CVPR 2012.
  31. Y. J. Lee, J. Ghosh, and K. Grauman, Discovering Important People and Objects for Egocentric Video Summarization. CVPR 2012.
  32. K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto, Fast Unsupervised Ego-action Learning for First-Person Sports Videos. CVPR 2011.
  33. M. S. Ryoo and L. Matthies, First-Person Activity Recognition: What Are They Doing to Me? CVPR 2013.
  34. M. S. Ryoo et al., Robot-Centric Activity Prediction from First-Person Videos: What Will They Do to Me? HRI 2015.
  35. P. Das et al., A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching. CVPR 2013.
  36. K. Lee et al., A syntactic approach to robot imitation learning using probabilistic activity grammars. RAS 2013.
  37. Y. Yang et al., Robot Learning Manipulation Action Plans by “Watching” Unconstrained Videos from the World Wide Web. AAAI 2015.
  38. A. Fathi, J. Hodgins, and J. Rehg, Social Interactions: A First-Person Perspective. CVPR 2012.
  39. A. Saxena, M. Sun and A. Y. Ng, Make3D: Learning 3D Scene Structure from a Single Still Image. T PAMI 2009.
  40. A. Gupta, A. Efros, and M. Hebert, Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics. ECCV 2010.
  41. D. Lin, S. Fidler, and R. Urtasun, Holistic Scene Understanding for 3D Object Detection with RGBD cameras. ICCV 2013.
  42. D. Hoiem, A.A. Efros, and M. Hebert, Putting Objects in Perspective. CVPR 2006, IJCV 2008.
  43. Y. J. Lee and K. Grauman, Object-Graphs for Context-Aware Category Discovery, CVPR 2010.
  44. A. Gupta and L. Davis, Objects in Action: An Approach for Combining Action Understanding and Object Perception. CVPR 2007.
  45. M. Marszalek, I. Laptev, and C. Schmid, Actions in Context, CVPR 2009.
  46. H. S. Koppula and A. Saxena, Physically Grounded Spatio-Temporal Object Affordances. ECCV 2014.
  47. J. C. Niebles, H. Wang, and L. Fei-Fei, Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. IJCV 2008.