Crowd counting is an important aspect of safety monitoring at mass events and can be used to initiate safety measures in time. State-of-the-art encoder-decoder architectures are able to estimate the number of people in a scene precisely. However, since most of the proposed methods are designed to operate solely on single-image features, we observe that estimated counts for aerial video sequences are inherently noisy, which in turn reduces the significance of the overall estimates. In this paper, we propose a simple temporal extension to these encoder-decoder architectures that incorporates local context from multiple frames into the estimation process. By applying the temporal extension to a state-of-the-art architecture and exploring multiple configuration settings, we find that the resulting estimates are more precise and smoother over time.