Human emotion recognition from videos using spatio-temporal and audio features