Three-Dimensional Context from Linear Perspective for Video Surveillance Systems

Status: Available/Completed
Type: USRA/4080/408*…
Professor: Elder, James
Research Area:
Year: 2015

To provide visual surveillance over a large environment, many surveillance cameras are typically deployed at widely dispersed locations. Making sense of activities within the monitored space requires security personnel to map multiple events observed on two-dimensional security monitors to the three-dimensional scene under surveillance. The cognitive load entailed rises quickly as the number of cameras, the complexity of the scene, and the amount of traffic increase.

This problem can be addressed by automatically pre-mapping two-dimensional surveillance video data into three-dimensional coordinates. Rendering the data directly in three dimensions can potentially lighten the cognitive load of security personnel and make human activities more immediately interpretable.

Mapping surveillance video to three-dimensional coordinates requires construction of a virtual model of the three-dimensional scene. Such a model could be obtained by survey (e.g., using LIDAR), but the cost and time required for each site would severely limit deployment. Wide-baseline uncalibrated stereo methods are still under development and have potential utility, but they require careful sensor placement, and the difficulty of the correspondence problem limits their reliability.

This project will investigate a monocular method for inferring three-dimensional context for video surveillance. The method will make use of the fact that most urban scenes obey the so-called "Manhattan-world" assumption, viz., a large proportion of the major surfaces in the scene are rectangles aligned with a three-dimensional Cartesian grid (Coughlan & Yuille, 2003). This regularity provides strong linear perspective cues that can potentially be used to automatically infer three-dimensional models of the major surfaces in the scene (up to a scale factor). These models can then be used to construct a virtual environment in which to render models of human activities in the scene.
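To illustrate the kind of linear perspective cue involved, the following is a minimal sketch (not the project's method) of how a vanishing point can be computed from two image lines that are projections of parallel scene edges. It uses standard homogeneous coordinates, in which the line through two points and the intersection of two lines are both cross products. The function names `line_through` and `vanishing_point` are hypothetical and chosen only for this example.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def line_through(p, q):
    """Homogeneous line through two image points p = (x, y) and q = (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def vanishing_point(line_a, line_b, eps=1e-12):
    """Intersection of two homogeneous image lines.

    Returns (x, y), or None when the lines are parallel in the image,
    i.e. the vanishing point lies at infinity.
    """
    x, y, w = cross(line_a, line_b)
    if abs(w) < eps:
        return None
    return (x / w, y / w)


# Illustrative data: two image edges that converge, as the sides of a
# corridor or road would under perspective.
left_edge = line_through((0.0, 0.0), (2.5, 5.0))
right_edge = line_through((10.0, 0.0), (7.5, 5.0))
vp = vanishing_point(left_edge, right_edge)
```

In a Manhattan-world scene, up to three such vanishing points (one per grid axis) would be expected; a real system would estimate each from many noisy line segments rather than from a single pair.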

Although the Manhattan-world assumption provides powerful constraints, there are many technical challenges that must be overcome before a working prototype can be demonstrated. The prototype requires six stages of processing: 1) The major lines in each video frame are detected. 2) These lines are grouped into quadrilaterals projecting from the major surface rectangles of the scene. 3) The geometry of linear perspective and the Manhattan-world constraint are exploited to estimate the three-dimensional attitude of the rectangles from which these quadrilaterals project. 4) Trihedral junctions are used to infer three-dimensional surface contact and ordinal depth relationships between these surfaces. 5) The estimated surfaces are rendered in three dimensions. 6) Human activities are tracked and rendered within this virtual three-dimensional world.
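As a rough illustration of stage 3 only, the sketch below estimates the attitude of a scene rectangle from the four image corners of its projected quadrilateral. It assumes a pinhole camera with known focal length `f` (in pixels), square pixels, and known principal point `(cx, cy)`; these calibration values, and the function names, are assumptions made for the example and are not taken from the project description. Each pair of opposite sides yields a vanishing point, which is back-projected to a three-dimensional direction, and the surface normal is the cross product of the two directions.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    """Scale a 3-vector to unit length (degenerate zero vectors not handled)."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def homogeneous_line(p, q):
    """Homogeneous image line through points p = (x, y) and q = (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def direction_from_sides(side_a, side_b, f, cx, cy):
    """Unit 3D direction shared by two opposite sides of the quadrilateral."""
    # Vanishing point in homogeneous form; w == 0 means it lies at infinity.
    x, y, w = cross(homogeneous_line(*side_a), homogeneous_line(*side_b))
    # Back-project through the assumed intrinsics: d is proportional to K^-1 v.
    return normalize(((x - cx * w) / f, (y - cy * w) / f, w))


def rectangle_attitude(corners, f, cx=0.0, cy=0.0):
    """Edge directions and normal of a rectangle from its image corners.

    corners: four (x, y) image points in order around the quadrilateral.
    Returns (d1, d2, normal, orthogonality_error). Directions are defined
    only up to sign. The error is |d1 . d2|, which should be near zero if
    the quadrilateral really is the projection of a rectangle.
    """
    p0, p1, p2, p3 = corners
    d1 = direction_from_sides((p0, p1), (p3, p2), f, cx, cy)
    d2 = direction_from_sides((p1, p2), (p0, p3), f, cx, cy)
    normal = normalize(cross(d1, d2))
    orthogonality_error = abs(sum(a * b for a, b in zip(d1, d2)))
    return d1, d2, normal, orthogonality_error


# Illustrative data: image of a rectangle lying in the plane y = 1 of the
# camera frame, seen with f = 1 and the principal point at the origin.
ground_quad = [(-0.5, 0.5), (0.5, 0.5), (0.25, 0.25), (-0.25, 0.25)]
d1, d2, normal, err = rectangle_attitude(ground_quad, f=1.0)
```

For this synthetic quadrilateral the recovered normal is parallel to the camera's y axis, as expected for a ground-plane rectangle. With real detections the orthogonality error gives a simple check on whether a candidate quadrilateral is consistent with the rectangle hypothesis.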

The student will work closely with graduate students and postdoctoral fellows at York University, as well as researchers at other institutions involved in the project. The student will develop skills in using MATLAB, a very useful mathematical programming environment, and develop an understanding of basic topics in image processing and vision.

For more information on the laboratory: www.elderlab.yorku.ca

Requirement: Good facility with applied mathematics.