SHAD: Surveillance Human Action Dataset


As a subset of the BEST2016 dataset, the SHAD dataset was collected for human action recognition research using surveillance videos. SHAD contains about 300 video clips, all taken from in-use surveillance cameras. The clips were recorded by more than 20 cameras distributed around the SJTU-SEIEE buildings, and more than 40 people participated in the recorded actions.

Six types of actions are included in SHAD: walking, sitting, bending, squatting, falling, and bicycling, with about 50 video clips per category. Each action is performed by different people against cluttered backgrounds, which guarantees variation in performed actions, viewpoints, illumination, and so on. In some video clips, more than one person performs an action (e.g., two people walking while one person sits), which makes SHAD a challenging multi-action recognition task.


Fig. 1: Examples of six types of human actions in SHAD



To benefit dataset users, we also provide annotations for the dataset. Each video clip has frame-level annotations. In every annotated frame, a bounding box is drawn around each action performer, so the spatial layout of the people is available and can serve as the ground truth of spatial information. Each person in an annotated frame is assigned a unique human ID, and that ID is kept consistent for the same person throughout the whole clip, which makes it possible to recover each person's trajectory. To determine unambiguously whether an action is recognized, we also define the start and end frame numbers of each action within its clip, which specifies the ground truth of temporal information. In total, 87087 frames are annotated in SHAD.

Each ground-truth annotation file contains five fields, namely:

  • Frame number: The "framenum" item in each annotation file is the frame number within the video clip.
  • Size: The resolution (width and height) of the image frame.
  • Objectlist: The list of objects in the annotated frame.
  • Human ID: Each person in an annotated frame is assigned a human ID that is unique within the video clip.
  • Bounding box: Each pedestrian is labeled with a bounding box in the form (xmin, ymin, xmax, ymax).
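As an illustration of how these five fields fit together, the sketch below parses one annotated frame. The exact file format and tag names (e.g., `framenum`, `objectlist`, `id`) are assumptions for this example and may differ from the actual annotation files; adjust them to match the files in your download.

```python
import xml.etree.ElementTree as ET

# Hypothetical single-frame annotation; tag names are assumed, not official.
SAMPLE = """
<frame>
  <framenum>12</framenum>
  <size><width>704</width><height>576</height></size>
  <objectlist>
    <object>
      <id>1</id>
      <xmin>10</xmin><ymin>20</ymin><xmax>50</xmax><ymax>120</ymax>
    </object>
  </objectlist>
</frame>
"""

def parse_annotation(xml_string):
    """Parse one annotated frame into a dict of the five fields."""
    root = ET.fromstring(xml_string)
    frame = {
        "framenum": int(root.findtext("framenum")),
        "size": (int(root.findtext("size/width")),
                 int(root.findtext("size/height"))),
        "objects": [],
    }
    # Each entry in the object list carries a human ID and a bounding box.
    for obj in root.find("objectlist"):
        frame["objects"].append({
            "human_id": int(obj.findtext("id")),
            "bbox": tuple(int(obj.findtext(k))
                          for k in ("xmin", "ymin", "xmax", "ymax")),
        })
    return frame

frame = parse_annotation(SAMPLE)
```

Because the human ID stays constant across frames of a clip, collecting the parsed bounding boxes per ID over all annotated frames yields each person's trajectory.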



Each training set contains about 30 action video clips and is roughly 800 MB in size. Each testing set contains about 20 action clips and is roughly 500 MB. The annotation files are included in each set.



Bend-train Fall-train Squa-train Bicycle-train Sit-train Walk-train
Bend-test  Fall-test  Squa-test  Bicycle-test  Sit-test  Walk-test 



Usage Policy


  • No commercial reproduction, distribution, display or performance rights in this work are provided.
  • If you use this facility in any publication, we request that you kindly acknowledge this website and cite the following paper:
  • Chongyang Zhang, Bingbing Ni, Li Song, Guangtao Zhai, Xiaokang Yang, and Wenjun Zhang, "BEST: Benchmark and Evaluation of Surveillance Task," in the 13th Asian Conference on Computer Vision Workshop on Benchmark and Evaluation of Surveillance Task (BEST2016), Taipei, Taiwan ROC, November 20-24, 2016.

2016 © SJTU-BEST 沪交ICP备20160083

Supported by: Wei Cheng