(Computer) Vision for Intelligent Robotics, Fall 2016
Course number: Info I590 / CS B659
Meets: Tuesday/Thursday 4:00-5:15pm
Location: Info 107
Instructor: Prof. Michael S. Ryoo
Email: mryoo "at" indiana.edu
Office: Informatics E259
Office hours: by appointment (send email)
***11/3 class will be replaced with the HRI seminar at Walnut Room, IMU***
In this graduate seminar course, we will review and discuss state-of-the-art computer vision methodologies while also examining their applications to robots (i.e., robot perception). Specific topics include object recognition, activity recognition, deep learning for both images and videos, and first-person vision for wearable devices and robots. The objectives of the course are to understand important problems in computer vision and intelligent robotics, discuss the advantages and disadvantages of existing approaches, and identify open questions and future research directions.
Prerequisites: interest in computer vision; basic programming skills; and the ability to read and understand conference papers. This course focuses on video-based techniques and their robotics applications, extending topics covered in other computer vision courses such as B490/B659. Any previous experience in computer vision, machine learning, or robot vision will be a plus.
Please talk to me if you are unsure whether the course is a good match for your background.
Course schedule:
Research overview and general background
1. Object recognition and Activity recognition
Image features, matching, and basic classification
Invariant local features, bag-of-visual-words, spatial pyramids, ...
Elli, Zaman
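As a toy illustration of the bag-of-visual-words representation listed above: local descriptors extracted from an image are quantized against a visual vocabulary (typically learned with k-means), and the image is summarized as a normalized histogram of visual-word counts. The sketch below uses random vectors in place of real SIFT descriptors and a pre-made vocabulary; all names and sizes are illustrative, not from any specific paper.

```python
import numpy as np

def bovw_histogram(descriptors, vocabulary):
    """Quantize local descriptors against a visual vocabulary and
    return an L1-normalized bag-of-visual-words histogram."""
    # Squared Euclidean distance from every descriptor to every visual word.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                    # nearest visual word per descriptor
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()                     # normalize so images are comparable

rng = np.random.default_rng(0)
vocab = rng.normal(size=(8, 16))                 # toy vocabulary: 8 words, 16-D
desc = rng.normal(size=(100, 16))                # stand-in for one image's descriptors
h = bovw_histogram(desc, vocab)
print(h.shape, round(h.sum(), 6))                # (8,) 1.0
```

A spatial pyramid extends this by computing such histograms over nested image subregions and concatenating them.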
Histograms of oriented gradients, deformable part models, graph-based segmentation, ...
Ryoo, Iyer
Action recognition from videos
Hidden Markov models (HMM), space-time volumes, local XYT features, …
Lee, Jiang
Kathawate, Wu
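For the HMM-based action recognition listed above, the core computation is the forward algorithm: each action class gets its own HMM, a video is reduced to a sequence of quantized per-frame observations, and the class whose model assigns the highest likelihood wins. A minimal sketch with made-up "walking"/"waving" models (all parameters are toy values, not from the cited papers):

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | model) for a discrete-emission HMM.
    pi: initial state probs (N,), A: transitions (N, N), B: emissions (N, M)."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate belief, weight by emission prob
        s = alpha.sum()
        loglik += np.log(s)             # accumulate scale factors in log space
        alpha /= s                      # rescale to prevent numerical underflow
    return loglik

pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2], [0.2, 0.8]])
walk = np.array([[0.9, 0.1], [0.7, 0.3]])   # toy "walking" emission model
wave = np.array([[0.1, 0.9], [0.3, 0.7]])   # toy "waving" emission model
obs = [0, 0, 1, 0, 0]                       # quantized per-frame observations
# Classify by picking the model with the higher log-likelihood.
print(forward_log_likelihood(obs, pi, A, walk) >
      forward_log_likelihood(obs, pi, A, wave))  # True
```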
Hierarchical activity recognition
Multi-layer HMMs, stochastic CFGs, logic-based methods, ...
Ryoo, Varamesh
2. Deep learning
Deep learning for images and objects
Convolutional neural networks (CNNs), CNN-segmentation, ...
Meda, Kotak
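The CNN building blocks behind this unit (convolution, nonlinearity, pooling) can be written in a few lines of numpy. This is a single toy channel, not a trained network or any particular architecture from the readings:

```python
import numpy as np

def conv2d(img, kernel):
    """Valid-mode 2-D convolution (really cross-correlation, as in most CNN code)."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    return out

def relu(x):
    return np.maximum(x, 0.0)            # elementwise nonlinearity

def max_pool(x, k=2):
    """Non-overlapping k x k max pooling (trailing rows/cols are dropped)."""
    H, W = x.shape
    x = x[:H - H % k, :W - W % k]
    return x.reshape(x.shape[0] // k, k, x.shape[1] // k, k).max(axis=(1, 3))

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))            # toy single-channel "image"
kern = rng.normal(size=(3, 3))           # one learned-filter stand-in
fmap = max_pool(relu(conv2d(img, kern))) # conv -> 6x6, pool -> 3x3
print(fmap.shape)                        # (3, 3)
```

Real CNNs stack many such filter banks over multi-channel inputs and learn the kernels by backpropagation.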
Deep learning for videos and events
CNNs for videos, recurrent neural networks (RNNs), ...
Shou, Maity
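A common pattern in the video papers of this unit (e.g., the LRCN-style models) is to extract a CNN feature per frame and fold the sequence into a single clip representation with a recurrent network. A toy vanilla-RNN sketch, with random matrices standing in for learned weights and CNN features:

```python
import numpy as np

def rnn_encode(frames, Wx, Wh, b):
    """Run a vanilla (Elman) RNN over per-frame feature vectors and
    return the final hidden state as a clip-level representation."""
    h = np.zeros(Wh.shape[0])
    for x in frames:
        h = np.tanh(Wx @ x + Wh @ h + b)    # fold each frame into the state
    return h

rng = np.random.default_rng(1)
T, D, H = 12, 32, 8                         # 12 frames, 32-D features, 8-D state
frames = rng.normal(size=(T, D))            # stand-in for per-frame CNN features
Wx = rng.normal(scale=0.1, size=(H, D))
Wh = rng.normal(scale=0.1, size=(H, H))
b = np.zeros(H)
clip_vec = rnn_encode(frames, Wx, Wh, b)
print(clip_vec.shape)                       # (8,)
```

A classifier layer on `clip_vec` would then predict the event label; LSTMs replace the tanh update with gated cells to cope with long sequences.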
More deep learning architectures/methods
Siamese neural networks, attention filters, region proposals, LSTMs, …
Tosi, Schlegel
Kotak, Spears
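The Siamese idea from this unit: two inputs pass through the same embedding function, and a contrastive loss pulls matching pairs together while pushing non-matching pairs at least a margin apart. A deliberately tiny sketch (one linear layer plus ReLU as the shared branch; all values are toy assumptions):

```python
import numpy as np

def embed(x, W):
    """Shared branch of a toy Siamese network: one linear layer + ReLU."""
    return np.maximum(W @ x, 0.0)

def contrastive_loss(xa, xb, same, W, margin=1.0):
    """Pull matching pairs together; push non-matching pairs `margin` apart."""
    d = np.linalg.norm(embed(xa, W) - embed(xb, W))
    return d ** 2 if same else max(margin - d, 0.0) ** 2

rng = np.random.default_rng(2)
W = rng.normal(scale=0.5, size=(4, 16))      # shared weights for both branches
a = rng.normal(size=16)
loss_same = contrastive_loss(a, a + 0.01 * rng.normal(size=16), True, W)
loss_diff = contrastive_loss(a, rng.normal(size=16), False, W)
print(loss_same >= 0.0, loss_diff >= 0.0)    # True True
```

Training adjusts `W` by gradient descent on these losses; the same weight-sharing trick underlies similarity learning for product design and patch matching.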
3. Visual perception for robots
First-person object, action, and activity recognition
Object detection based first-person video understanding
Ego-action recognition and video summarization
First-person interaction recognition
Wu, Tosi
Naha, Devadiga, Boggaram
Learning “actionable” activity representations
Robot “learning from imitation”, syntactic approaches, ...
Zhang, Meda, Schlegel
Social cues and affordances
Detecting human gaze orientations from first-person videos
Action possibilities with objects and scene
Shou, Elli
4. Understanding surrounding environments
3-D scene understanding
Estimating 3-D scene geometry from images
Zaman, Doosti, Varamesh
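A basic geometric operation underlying the 3-D scene understanding papers in this unit: given camera intrinsics and an estimated depth, a pixel can be back-projected into a 3-D point in the camera frame via the pinhole model, X = depth * K^{-1} [u, v, 1]^T. The intrinsics below are illustrative values, not from any dataset:

```python
import numpy as np

def backproject(u, v, depth, K):
    """Back-project pixel (u, v) at a known depth into a 3-D point
    in the camera frame, using the pinhole model X = depth * K^-1 [u, v, 1]^T."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return depth * ray

K = np.array([[500.0,   0.0, 320.0],   # fx, cx (toy intrinsics)
              [  0.0, 500.0, 240.0],   # fy, cy
              [  0.0,   0.0,   1.0]])
P = backproject(320.0, 240.0, 2.0, K)  # principal point, 2 m away
print(P)                               # [0. 0. 2.]
```

Monocular methods such as Make3D estimate the per-pixel depth itself; given depth, this mapping recovers the scene geometry.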
No class - Thanksgiving
Object and activity recognition using contextual information
Iyer, Maity, Shahivand
Final project presentations
Course requirements and grading:
Paper/experiment presentations (30%): each student is expected to give roughly two presentations during the course. For each, a student may choose either (1) a paper presentation or (2) an experiment presentation (i.e., presenting the results obtained by running the method's code on existing datasets).
Paper review and class participation (20%): students are required to choose one paper per class and submit a short review of it before the class.
Final project (50%): each student will choose an individual research topic and carry out the research. This can be as simple as implementing several previous methods and comparing them, or as ambitious as proposing new concepts and algorithms, implementing them, and evaluating them on public datasets to advance the state of the art.
Reading list:
- D. G. Lowe, Distinctive Image Features from Scale-Invariant Keypoints. IJCV 2004.
- J. Sivic and A. Zisserman, Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003.
- S. Lazebnik, C. Schmid, and J. Ponce, Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR 2006.
- Y. Jia, C. Huang, and T. Darrell, Beyond Spatial Pyramids: Receptive Field Learning for Pooled Image Features. CVPR 2012.
- N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection. CVPR 2005.
- P. Gehler and S. Nowozin, On Feature Combination for Multiclass Object Classification, ICCV 2009.
- P. Felzenszwalb, D. McAllester and D. Ramanan, A Discriminatively Trained, Multiscale, Deformable Part Model. CVPR 2008.
- A. Krizhevsky, I. Sutskever, and G. Hinton, Imagenet Classification with Deep Convolutional Neural Networks. NIPS 2012.
- C. Szegedy et al., Going Deeper with Convolutions. CVPR 2015.
- R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014.
- J. Long, E. Shelhamer, T. Darrell, Fully Convolutional Networks for Semantic Segmentation. CVPR 2015.
- J. Yamato, J. Ohya, and K. Ishii, Recognizing Human Action in Time-Sequential Images Using Hidden Markov Model. CVPR 1992.
- N. Oliver, B. Rosario, and A. Pentland, A Bayesian Computer Vision System for modeling human interactions. T PAMI 2000.
- A. Bobick and J. Davis, The Recognition of Human Movement Using Temporal Templates. T PAMI 2001.
- M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, Actions and Space-Time Shapes. ICCV 2005.
- I. Laptev, On Space-Time Interest Points. IJCV 2005.
- P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, Behavior Recognition via Sparse Spatio-Temporal Features. VS-PETS 2005.
- I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, Learning Realistic Human Actions from Movies. CVPR 2008.
- M. S. Ryoo and J. K. Aggarwal, Spatio-Temporal Relationship Match: Video Structure Comparison for Recognition of Complex Human Activities. ICCV 2009.
- H. Wang and C. Schmid, Action Recognition with Improved Trajectories. ICCV 2013.
- Y. Ivanov, and A. Bobick, Recognition of Visual Activities and Interactions by Stochastic Parsing. T PAMI 2000.
- J. M. Siskind, Grounding the Lexical Semantics of Verbs in Visual Perception using Force Dynamics and Event Logic. JAIR 2001.
- M. S. Ryoo and J. K. Aggarwal, Stochastic Representation and Recognition of High-level Group Activities, IJCV 2011.
- D. Tran et al., Learning Spatiotemporal Features with 3D Convolutional Networks. arXiv:1412.0767.
- A. Karpathy et al., Large-scale Video Classification with Convolutional Neural Networks. CVPR 2014.
- A. Graves, A. Mohamed, and G. Hinton, Speech Recognition with Deep Recurrent Neural Networks. ICASSP 2013.
- J. Ng et al., Beyond Short Snippets: Deep Networks for Video Classification. CVPR 2015.
- J. Donahue et al., Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. CVPR 2015.
- A. Fathi, A. Farhadi, and J. M. Rehg, Understanding Egocentric Activities. ICCV 2011.
- H. Pirsiavash and D. Ramanan, Detecting Activities of Daily Living in First-Person Camera Views. CVPR 2012.
- Y. J. Lee, J. Ghosh, and K. Grauman, Discovering Important People and Objects for Egocentric Video Summarization. CVPR 2012.
- K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto, Fast Unsupervised Ego-action Learning for First-Person Sports Videos. CVPR 2011.
- M. S. Ryoo and L. Matthies, First-Person Activity Recognition: What Are They Doing to Me? CVPR 2013.
- M. S. Ryoo et al., Robot-Centric Activity Prediction from First-Person Videos: What Will They Do to Me? HRI 2015.
- P. Das et al., A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching. CVPR 2013.
- K. Lee et al., A syntactic approach to robot imitation learning using probabilistic activity grammars. RAS 2013.
- Y. Yang et al., Robot Learning Manipulation Action Plans by “Watching” Unconstrained Videos from the World Wide Web. AAAI 2015.
- A. Fathi, J. Hodgins, and J. M. Rehg, Social Interactions: A First-Person Perspective. CVPR 2012.
- A. Saxena, M. Sun and A. Y. Ng, Make3D: Learning 3D Scene Structure from a Single Still Image. T PAMI 2009.
- A. Gupta, A. Efros, and M. Hebert, Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics. ECCV 2010.
- D. Lin, S. Fidler, and R. Urtasun, Holistic Scene Understanding for 3D Object Detection with RGBD cameras. ICCV 2013.
- D. Hoiem, A.A. Efros, and M. Hebert, Putting Objects in Perspective. CVPR 2006, IJCV 2008.
- Y. J. Lee and K. Grauman, Object-Graphs for Context-Aware Category Discovery, CVPR 2010.
- A. Gupta and L. Davis, Objects in Action: An Approach for Combining Action Understanding and Object Perception. CVPR 2007.
- M. Marszalek, I. Laptev, and C. Schmid, Actions in Context, CVPR 2009.
- H. S. Koppula and A. Saxena, Physically Grounded Spatio-Temporal Object Affordances, ECCV 2014.
- J. C. Niebles, H. Wang, and L. Fei-Fei, Unsupervised learning of human action categories using spatial-temporal words, IJCV 2008.
- K. Simonyan and A. Zisserman, Two-Stream Convolutional Networks for Action Recognition in Videos, NIPS 2014.
- S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei, End-to-end Learning of Action Detection from Frame Glimpses in Videos, CVPR 2016.
- S. Bell and K. Bala, Learning visual similarity for product design with convolutional neural networks, SIGGRAPH 2015.
- K. Gregor, I. Danihelka, A. Graves, D. Jimenez Rezende, and D. Wierstra, DRAW: A Recurrent Neural Network for Image Generation. arXiv:1502.04623.
- T. Lan, L. Sigal, G. Mori, Social Roles in Hierarchical Models for Human Activity Recognition, CVPR 2012.
- Y. Li, A. Fathi, and J. M. Rehg, Learning to Predict Gaze in Egocentric Video. ICCV 2013.
- M. Ibrahim et al., A Hierarchical Deep Temporal Model for Group Activity Recognition, CVPR 2016.
- S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS 2015.
- S. Levine, C. Finn, T. Darrell, and P. Abbeel, End-to-End Training of Deep Visuomotor Policies, Journal of Machine Learning Research, 2016.
- S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection, arXiv:1603.02199