This master's thesis addresses the detection of abnormal human behavior in surveillance videos of urban scenarios. In order to differentiate between normal and abnormal behavior, only sequences of human pose skeletons are considered; the RGB information of the video frames is not used. In this way, the model should be able to decide whether an abnormality is present independently of a person's appearance, such as clothing, gender, and skin color. Two different datasets are considered: the internal PolMAR dataset and the public ShanghaiTech campus dataset, both of which are suited for anomaly detection tasks. The ShanghaiTech campus dataset is suited for unsupervised training, and three different model types are compared on it: the MPED-RNN, which is based on gated recurrent units (GRUs); the Spatio-Temporal Graph Convolutional Autoencoder (ST-GCAE), which processes the pose skeletons with graph convolutions; and several Transformer network variants. The MPED-RNN was additionally extended with a memory module, a Bayesian Gaussian Mixture Model, and an Isolation Forest. Furthermore, it was investigated whether additionally processing the pose skeleton sequences of two neighboring people in a frame improves the classification performance of the ST-GCAE and the Transformer variants. The internal PolMAR dataset can be used for supervised training. For this dataset, the thesis proposes a new model, the Binary Autoencoder, which combines two ST-GCAEs and an MLP to classify behavior as normal or abnormal. The Binary Autoencoder can serve as a baseline model for future investigations and for comparison with other models on the PolMAR dataset.