Movies and Series subtitles in R with subtools

11 octobre 2016 Francois Keck Commentaires 3 commentaires

Every time I download subtitles for a movie or a series episode, I cannot help thinking about all this text that could be analyzed with text mining methods. As my PhD came to its end in April and I started a postdoc in September, I could use some time of my looong summer break to work on a new R package, subtools which aims to provide a toolbox to read, manipulate and write subtitles files in R.

In this post I will present briefly the functions of subtools. For more details you can check the documentation.

Installation

The package is available on GitHub.
You can install it using the devtools package:

devtools::install_github("fkeck/subtools")

Read subtitles

A subtitles file can be read directly from R with the function read.subtitles. Currently, four formats of subtitles are supported: SubRip, (Advanced) Substation Alpha, MicroDVD and SubViewer. The parsers are probably not optimal but they seem to do the job as expected, at least with valid files. The package also provides some wrappers to import whole directories of series subtitles (see read.subtitles.season, read.subtitles.serie and read.subtitles.multiseries)

Manipulate subtitles

The subtools package stores imported subtitles as simple S3 objects of class Subtitles. They are lists with two main components : subtitles (IDs, timecodes and texts) and optional meta-data. Multiple Subtitles objects can be stored as a list of class MultiSubtitles. The subtools package provides functions to easily manipulate Subtitles and MultiSubtitles objects.

Basically, you can :

combine subtitles objects with combineSubs
extract parts of subtitles with [.Subtitles
clean subtitles content with cleanTags, cleanCaptions and cleanPatterns
reorganize subtitles as sentences with sentencify

Extract/convert/write subtitles

Although you can conduct statistical analyses on subtitles objects, the package subtools is not designed for text mining. The following functions allow you to convert subtitles to other classes/format to analyze them. You can:

extract text content as a simple character string with rawText
convert subtitles and meta-data to a virtual corpus with tmCorpus if you want to work with the standard text mining framework tm.
convert subtitles and meta-data to a data.frame with subDataFrame. If you want to use tidy data principles with tidytext and dplyr, you should probably start here.
finally, it’s also possible to write subtitles objects to a file with write.subtitles. Though, it is unclear to me if there is any sense in doing that 😀

Application: Game of Thrones subtitles wordcloud

To illustrate how subtools can be used to get started with a subtitles analysis project, I propose to create a wordcloud showing the most frequent words in the popular TV series Game of Thrones. We will use subtools to import the subtitles, the tm and SnowballC packages to pre-process the data and finally wordcloud to generate the cloud. I will not provide the data I use here, because subtitles files are in a grey zone concerning licensing. But no worries, it’s pretty easy to find subtitles on Internet. 😉

library(subtools)
library(tm)
library(SnowballC)
library(wordcloud)

Because the subtitles are correctly organized in directories, we import them in one command line using the function read.subtitles.serie. The nice thing is that this function will try to automatically extract basic meta-data (like series title, season number and episode number) from directories/files name.

a <- read.subtitles.serie(dir = "/subs/Game of Thrones/")

a is a MultiSubtitles object with 60 Subtitles elements (episodes). We can convert it directly to a tm corpus using tmCorpus. Note that meta-data are preserved.

c <- tmCorpus(a)

And then, we can prepare the data:

c <- tm_map(c, content_transformer(tolower))
c <- tm_map(c, removePunctuation)
c <- tm_map(c, removeNumbers)
c <- tm_map(c, removeWords, stopwords("english"))
c <- tm_map(c, stripWhitespace)

Compute a term-document matrix and aggregate counts by season:

TDM <- TermDocumentMatrix(c)
TDM <- as.matrix(TDM)
vec.season <- rep(1:6, each = 10)
TDM.season <- t(apply(TDM, 1, function(x) tapply(x, vec.season, sum)))
colnames(TDM.season) <- paste("S", 1:6)

And finally plot the cloud!

set.seed(100)
comparison.cloud(TDM.season, title.size = 1, max.words = 100, random.order = T)

Subtitles wordcloud of the six seasons of Game of Thrones.

Few words about this plot. Like every wordcloud, I think it’s a very simple and limited descriptive way to represent the information. However, I like it. The people who have watched the TV show will look at it and say « Oh of course! ». In one hundred word, the cloud is not revealing the scenario of GoT, but for each season I can see one or two critical events popping out (wedding, kill/joffrey, queen/shame, hodor/hold/door).
What I find funny here (perhaps interesting?), is that these very important and emotional moments are supported in the dialogs by the repetition of one or two keywords. I don’t know if this is exclusive to GoT, or if it’s a trick of my mind, or something else. And I will not make any hypothesis, I’m not a linguist. But this is the first idea which came to my mind and I wanted to write it down.

Now, there are plenty of text-mining ideas and hypotheses and methods that can be tested with movies and series subtitles. So have fun 🙂

3 réactions au sujet de « Movies and Series subtitles in R with subtools »

Andrew Clark dit :

16 octobre 2016 à 23 h 32 min

pardon my English

Is there any way to see which character speaks which line?

Répondre
1. franzk dit :
  
  17 octobre 2016 à 9 h 37 min
  
  Hi,
  Unfortunately, not really. A subtitles file is not a script. There are some extended formats like WebVTT which allow to record who is speaking but this is not implemented in subtools and this kind of data may be hard to find.
  
  Répondre
Andrew Clark dit :

21 octobre 2016 à 2 h 10 min

franzk,
Yes thought that would be the case, unfortunately. Thanks for quick response

Répondre

Piece of K