Human pose estimation in video surveillance is harder than in classical scenarios. The difficulty stems from scenes that contain many people at small scale, which leads to numerous ambiguities and occlusions. Typically, one would train a network on suitable data to improve performance. However, such annotated data does not exist. This thesis builds upon earlier work and examines state-of-the-art techniques for tackling this problem, looking for further improvements.