Smart/intelligent video surveillance technology plays a central role in emerging smart city systems. Recent advances in computer vision, such as deep learning, multi-modal analysis, and large-scale spatio-temporal analysis, have shown great potential for high-level understanding tasks in smart video surveillance applications. These include high-performance human/object detection and tracking, cross-camera human identification and re-identification, and action/activity/event detection. Such techniques require large-scale surveillance datasets both to model the various visual understanding tasks and to evaluate algorithmic performance. However, current benchmark datasets usually focus on only one or two tasks, and many existing datasets are constructed from self-sampled or internet-collected videos and images, most of which are not real surveillance data. These simulated datasets cannot reliably evaluate intelligent surveillance algorithms.