In a post published on Dynamic Ecology last week, Meghan Duffy presented a graph showing the number of authors of articles published in Ecology for the period 1956-2016. [SPOILER: it increases!]. She also noted that someone probably did something similar for the whole field of ecology but couldn’t find it. Neither could I.
I was thinking that it would take a lifetime to collect these data manually (even if I love spending my time in the dust of the lab library). But luckily I remembered that I kept a nice dataset of publication records for a good number of journals in ecology. A few years ago, my friend Ivan Mateus and I scraped these data to test some scientometric hypotheses. It took us several months to collect, so I’m glad I have the opportunity to use it again.
I was in a rush (at the airport on my way to the DNAqua-net kick-off conference), but the R code written 4 years ago to compile the data worked perfectly, and I could quickly come up with a nice graph. I made some technical choices here: 1/ I deleted some outliers (a few rare articles feature more than 100 authors!). 2/ I didn’t use a boxplot as Meghan did, because I had data for every year but also because the distribution becomes very skewed over time. 3/ I used a color scale to represent the number of papers, so points act like bins. I could have used hexagonal bins to get a continuous surface, but I liked it with points. It just requires manual tuning of the dot size.
The graph shows the number of authors per article for 285,951 articles published in 113 journals in ecology. It covers the period 1932-2012, but I suspect that data for many journals are missing before 1975, which would explain the sudden change in dot color.
In conclusion, I think this graph does a good job of showing both the increase in the number of papers published every year in our field and the increase in the number of authors per paper. I was expecting this trend but I was really surprised by its strength. And it looks like it’s not going to stop anytime soon.
Every time I download subtitles for a movie or a series episode, I cannot help thinking about all this text that could be analyzed with text mining methods. As my PhD came to an end in April and I started a postdoc in September, I could use some of my looong summer break to work on a new R package, subtools, which aims to provide a toolbox to read, manipulate and write subtitle files in R.
In this post I will briefly present the functions of subtools. For more details you can check the documentation.
The package is available on GitHub.
You can install it using the devtools package:
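A minimal install sketch (the GitHub repository path is an assumption):

```r
# Install subtools from GitHub (repository path assumed here)
install.packages("devtools")
devtools::install_github("fkeck/subtools")
```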
A subtitles file can be read directly from R with the function read.subtitles. Currently, four subtitle formats are supported: SubRip, (Advanced) SubStation Alpha, MicroDVD and SubViewer. The parsers are probably not optimal but they seem to do the job as expected, at least with valid files. The package also provides some wrappers to import whole directories of series subtitles (see read.subtitles.season, read.subtitles.serie and read.subtitles.multiseries).
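For example, reading a single file could look like this (the file name is hypothetical and the exact value of the format argument is an assumption):

```r
library(subtools)

# Read a SubRip (.srt) file into a Subtitles object
s <- read.subtitles("episode_01.srt", format = "subrip")
s
```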
The subtools package stores imported subtitles as simple S3 objects of class Subtitles. They are lists with two main components: subtitles (IDs, timecodes and texts) and optional meta-data. Multiple Subtitles objects can be stored as a list of class MultiSubtitles. The subtools package provides functions to easily manipulate Subtitles and MultiSubtitles objects.
Basically, you can:
combine subtitles objects with combineSubs
extract parts of subtitles with [.Subtitles
clean subtitles content with cleanTags, cleanCaptions and cleanPatterns
reorganize subtitles as sentences with sentencify
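Put together, a typical cleaning workflow might look like this (file names are hypothetical, and the exact argument lists are assumptions based on the function names above):

```r
library(subtools)

s1 <- read.subtitles("episode_01.srt")
s2 <- read.subtitles("episode_02.srt")

s <- combineSubs(s1, s2)   # merge two Subtitles objects
s <- cleanTags(s)          # strip formatting tags like <i>...</i>
s <- sentencify(s)         # reorganize the text as full sentences
```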
Although you can conduct statistical analyses on subtitles objects, the package subtools is not designed for text mining. The following functions allow you to convert subtitles to other classes/formats for analysis. You can:
extract text content as a simple character string with rawText
convert subtitles and meta-data to a virtual corpus with tmCorpus if you want to work with the standard text mining framework tm.
convert subtitles and meta-data to a data.frame with subDataFrame. If you want to use tidy data principles with tidytext and dplyr, you should probably start here.
finally, it’s also possible to write subtitles objects to a file with write.subtitles, though it is unclear to me whether there is any point in doing that 😀
Application: Game of Thrones subtitles wordcloud
To illustrate how subtools can be used to get started with a subtitles analysis project, I propose to create a wordcloud showing the most frequent words in the popular TV series Game of Thrones. We will use subtools to import the subtitles, the tm and SnowballC packages to pre-process the data, and finally wordcloud to generate the cloud. I will not provide the data I use here, because subtitle files are in a grey zone concerning licensing. But no worries, it’s pretty easy to find subtitles on the Internet. 😉
Because the subtitles are correctly organized in directories, we can import them with one command using the function read.subtitles.serie. The nice thing is that this function will try to automatically extract basic meta-data (like series title, season number and episode number) from the directory/file names.
a <- read.subtitles.serie(dir = "/subs/Game of Thrones/")
a is a MultiSubtitles object with 60 Subtitles elements (episodes). We can convert it directly to a tm corpus using tmCorpus. Note that meta-data are preserved.
c <- tmCorpus(a)
And then, we can prepare the data:
c <- tm_map(c, content_transformer(tolower))
c <- tm_map(c, removePunctuation)
c <- tm_map(c, removeNumbers)
c <- tm_map(c, removeWords, stopwords("english"))
c <- tm_map(c, stripWhitespace)
Compute a term-document matrix and aggregate counts by season:
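The code for this step is not shown above; here is a sketch of what it could look like, using the standard tm and wordcloud APIs. The way the season meta-data is retrieved (a "season" tag per document) is an assumption:

```r
library(tm)
library(wordcloud)

# Term-document matrix: one column per episode
tdm <- TermDocumentMatrix(c)
m <- as.matrix(tdm)

# Aggregate term counts by season ("season" meta-data tag assumed)
seasons <- unlist(meta(c, "season"))
m.season <- t(rowsum(t(m), group = seasons))

# One wordcloud of 100 words per season
for (i in seq_len(ncol(m.season))) {
  wordcloud(rownames(m.season), m.season[, i], max.words = 100)
}
```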
A few words about this plot. Like every wordcloud, I think it’s a very simple and limited descriptive way to represent the information. However, I like it. People who have watched the TV show will look at it and say « Oh, of course! ». In one hundred words, the cloud does not reveal the scenario of GoT, but for each season I can see one or two critical events popping out (wedding, kill/joffrey, queen/shame, hodor/hold/door).
What I find funny here (perhaps interesting?) is that these very important and emotional moments are supported in the dialogues by the repetition of one or two keywords. I don’t know if this is exclusive to GoT, if it’s a trick of my mind, or something else. And I will not make any hypothesis, I’m not a linguist. But this is the first idea that came to my mind and I wanted to write it down.
Now, there are plenty of text-mining ideas and hypotheses and methods that can be tested with movies and series subtitles. So have fun 🙂
The idea of this post is to create a kind of map of the R ecosystem showing dependency relationships between packages. It is actually quite simple to quickly generate such a graph object, since all the data we need can be obtained with one call to the function available.packages(). There are a few additional steps to clean and arrange the data, but with the following code anyone can generate a graph of dependencies in less than 5s.
First, we load the igraph package and we retrieve the data from our current repository.
dat <- available.packages()
For each package, we produce a character string with all its dependencies (Imports and Depends fields) separated with commas.
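The code for this step is not shown above; a sketch consistent with the next step (which expects a two-column edge list named dat.imports) could be:

```r
# Concatenate the Imports and Depends fields into one comma-separated string
deps <- paste(dat[, "Imports"], dat[, "Depends"], sep = ", ")

# Split the strings, strip version requirements and whitespace
deps <- strsplit(deps, ",")
deps <- lapply(deps, function(x) gsub("\\(.*\\)", "", x))
deps <- lapply(deps, function(x) gsub("^\\s+|\\s+$", "", x))

# Build a two-column edge list: package -> dependency,
# dropping empty fields, NAs and the R dependency itself
dat.imports <- data.frame(pkg = rep(rownames(dat), sapply(deps, length)),
                          dep = unlist(deps))
dat.imports <- dat.imports[!dat.imports$dep %in% c("", "NA", "R"), ]
dat.imports <- as.matrix(dat.imports)
```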
Finally, we create the graph from the list of edges. Here I select the largest connected component, because the many isolated vertices would make the graph harder to represent.
g <- graph.edgelist(dat.imports)
g <- decompose.graph(g)
g <- g[[which(sapply(g, vcount) == max(sapply(g, vcount)))]]
Now that we have the graph we can compute many graph-theory related statistics but here I just want to visualize these data. Plotting a large graph like that with igraph is possible but slow and tedious. I prefer to export the graph and open it in Gephi, a free and open-source software for network visualization.
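Exporting could be as simple as writing the graph to a format Gephi can open directly, for example GraphML (the file name is hypothetical):

```r
# Export to GraphML, which Gephi opens directly
write.graph(g, file = "r-packages.graphml", format = "graphml")
```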
The figure below was created with Gephi. Nodes are sized according to their number of incoming edges. This is pretty useless and uninformative, but I still like it. It looks like a sky map, and it is great to feel that we, users and developers, are part of this huge R galaxy.
This is the second « big » feature coming with branch 0.2 of rleafmap (now on CRAN!). With this new version you can write popup content in R Markdown, which will be processed when you generate the map. This can be useful to format popups using Markdown syntax (if you need more control, remember that popups can also be formatted with HTML tags). More interesting with R Markdown is the possibility to include the output of R code chunks. Thus, results, simulations and plots can easily be embedded within popup content.
To activate R Markdown for a layer you just have to pass the R Markdown code to the popup argument and set popup.rmd to TRUE.
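A minimal sketch (pts stands for a hypothetical SpatialPointsDataFrame; the popup and popup.rmd arguments come from the description above, the rest follows the usual rleafmap workflow):

```r
library(rleafmap)

# R Markdown popup content: the inline chunk output is embedded in the popup
pop <- "#### Station\nMean of 100 random values: `r round(mean(rnorm(100)), 2)`"

bm <- basemap("stamen.toner")
pts.layer <- spLayer(pts, popup = pop, popup.rmd = TRUE)
writeMap(bm, pts.layer)
```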
The chunkerize function
You can write your R code chunks manually but you can also use the function chunkerize which tries to make your life simpler. This function has two purposes:
It turns a function and its arguments into an R code chunk.
It can decompose the elements of a list and use them as values for a given argument.
The latter may be useful when you have different features on your map and you want to execute the same function for each of them but with different data as input. This can be done by tagging the list containing the data with a star character (*). See the following example:
Data you need for the example: download and unzip
First we load the packages we need:
Then we load the data: a shapefile of regions and a dataframe giving the evolution of the population for each region.
reg <- readShapePoly("regions-20140306-100m")
reg <- reg[c(-11,-12,-16,-19,-20),]
pop <- read.csv("population.txt", sep = "\t", row.names = 1)
pop <- pop[rev(rownames(pop)), sort(names(pop))]
We prepare the data for chunkerize: pop is already a list (since it is a dataframe), but for clarity we turn it into a simple list.
Year <- as.numeric(rownames(pop))
L <- as.list(pop)
L2 <- as.list(names(L))
Here is the trick. We create a chunk based on the plot function. We provide 5 arguments (names and values). Each arg.values element is going to be recycled, except L and L2, which are tagged with a star: in that case, each element of these lists is used as a value.
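A sketch of what this call could look like. The argument names (fun, arg.names, arg.values) are assumptions based on the description above; check the chunkerize documentation for the actual signature:

```r
# One plot chunk per region: elements of L (population values) and L2
# (region names) are used one by one thanks to the star tag, while the
# other values are recycled.
popup <- chunkerize(fun = plot,
                    arg.names = c("x", "y", "type", "main", "ylab"),
                    arg.values = c("Year", "*L", "'b'", "*L2", "'Population'"))
```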
Do you know the Maucha diagram? If you are not a Hungarian limnologist, probably not! This diagram was proposed by Rezso Maucha in 1932 as a way to visualize the relative ionic composition of water samples. However, as far as I know, this diagram has had little success in the community. I had never heard about it until my coworker Kalman (who is also Hungarian) asked me if I knew how to plot it in R.
First, I have to admit I was a bit skeptical… But finally, we decided it could be an interesting and funny programming exercise.
We found instructions to draw the diagram in Broch and Yake (1969), but we quickly became interested in finding the original paper of Maucha (1932). This paper is apparently not available online, and we could only locate a hard copy at the University of Grenoble (a 2-hour drive). Nonetheless, we had a look in the library of the lab and… miracle! We found it, between two old dusty books, where it had probably been waiting for decades!
Meticulously following Maucha’s instructions, we could write a function to draw the diagram. Then we added some options: colors, labels and the possibility to draw multiple diagrams from a matrix. Finally, we put the code in a package (hosted on GitHub) with the dataset included in the original publication.
To install the package, install devtools from your CRAN repo and run:
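A sketch of the install commands (the GitHub repository path is an assumption):

```r
# Install the package from GitHub (repository path assumed here)
install.packages("devtools")
devtools::install_github("fkeck/mauchaR")
```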
Then you can load the dataset used by Maucha (1932) to introduce his diagram:
And then you can use the function maucha, which will plot one diagram for each row of the matrix.
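For example (the package and dataset names here are assumptions; only the maucha function name comes from the text above):

```r
library(mauchaR)        # package name assumed
data(maucha1932)        # dataset name assumed

# One Maucha diagram per row of the matrix
maucha(maucha1932)
```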
Here we are. And if you are interested in the ionic composition of waters, stay tuned: we are planning to add some more tools like the Stiff diagram and the Piper diagram.
Broch, E. S., & Yake, W. (1969). A modification of Maucha’s ionic diagram to include ionic concentrations. Limnology and Oceanography, 14(6), 933-935.
Maucha, R. (1932). Hydrochemische Methoden in der Limnologie. Die Binnengewässer, 12. 173 p.
This is a functionality I had wanted to add for some time… and finally it’s here! I just pushed to GitHub a new version of rleafmap which brings the possibility to attach legends to data layers. You simply need to create a legend object with the function layerLegend and then pass this object via the legend argument when you create your data layer. Thus, a map can contain different legends, each of them independent. This is cool because it means that when you hide a data layer with the layer control, its legend also disappears.
You can create legends with five different styles to better suit your data: points, lines, polygons, icons and gradient (see the graphic).
Legends for John Snow’s cholera map
As an example, I give a new version of the cholera map.
Here is the code:
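The original code is not reproduced here; a minimal sketch of the legend workflow, assuming a hypothetical SpatialPointsDataFrame called deaths (the style and legend arguments come from the text above, the other argument names are assumptions):

```r
library(rleafmap)

# A point-style legend for the deaths layer
deaths.leg <- layerLegend(style = "points",
                          title = "Cholera deaths",
                          cols = "red")

# Attach the legend to the data layer via the legend argument
deaths.layer <- spLayer(deaths, legend = deaths.leg)
writeMap(basemap("stamen.toner"), deaths.layer)
```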
Seeing Andrea again with the others in early July at the SEFS reminded me of a memorable evening at the château! Andrea had brought us a small collection of beers from his native Germany, and we had organized a tasting and rating session…
Made with R, ade4 and… I admit, Inkscape for the post-production.
In February, I participated in a hackathon organized by Thibaut Jombart at Imperial College London to work on visualization tools for outbreak data. This was a great time spent with great people! Thanks again, Thibaut, for organizing. I took part in the development of epimap, an R package for statistical mapping. The aim of epimap is to provide tools to quickly and efficiently visualize spatial data. There is a set of functions designed for that, and you can check out the GitHub page for a demo.
This package also provides two functions to coerce complex objects to Spatial classes so that they can be easily included in a map.
The contour2sp function takes a SpatialGrid object and returns contour lines as a SpatialLinesDataFrame.
The graph2sp function takes a graph (from the igraph package) with geolocated vertices and returns a list of Spatial objects (points and lines).
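In use, the two coercion functions could look like this (the input object names are hypothetical):

```r
library(epimap)
library(igraph)

# Contour lines of a density surface as a SpatialLinesDataFrame
dens.sp <- contour2sp(dens.grid)    # dens.grid: a SpatialGrid object

# A geolocated igraph network as a list of Spatial points and lines
net.sp <- graph2sp(pumps.graph)     # pumps.graph: igraph with vertex coordinates
```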
Following this post by Arthur Charpentier (who plays nicely with rleafmap!), I decided to include John Snow’s cholera dataset in epimap so it can simply be used for tests.
In this post I want to show how epimap and rleafmap can be combined to create fully customizable interactive maps with complex objects. The cholera dataset gives the locations of cholera deaths and the locations of water pumps in London. The maps will show the location of cholera deaths with points, the local density of deaths with colored contour lines and the location of water pumps with icons. Moreover, the pumps will be represented within a network where two pumps are connected if they are close enough.
I released rleafmap about a year ago and I am only writing this blog post today. During this time, I presented the package to the French R-users community at the 3èmes Rencontres R in Montpellier and got some good feedback. Now I would like to communicate better on the project. My idea is to post news about development and present new features illustrated with examples on this blog. The documentation and tutorials will be published on the project website (http://www.francoiskeck.fr/rleafmap/) if I can find time for that.
Purpose and philosophy
There are two important things to be aware of for a good start with rleafmap.
First, the package exclusively uses input data inheriting from the Spatial class of the sp package. These classes are central and allow you to work with many packages. If you don’t know them, a good place to start is the vignette of the sp package on CRAN. If you prefer a good book, have a look at Bivand et al. 2013.
The second point is about how the package works. Data layers are stored in independent R objects with their own symbology. The map consists of a compilation of these objects. The idea is to stick to the philosophy of traditional GIS software.
For a more complete view of the package, I strongly recommend having a look at the website.
Bivand R.S., Pebesma E.J. & Gómez-Rubio V. (2013) Applied Spatial Data Analysis with R, 2nd edn. Springer, New York.
At work, we are now required to clock in four times a day via a web interface. I will not go over everything I dislike about the time clock here. The problem now is to deal with this system and not forget to clock in when you have a thousand more interesting things on your mind. Hence the idea of using Ubuntu’s notification system in a cron task, to send yourself little reminder messages.
Start by installing the libnotify-bin package:
sudo apt-get install libnotify-bin
To add a scheduled task, edit the crontab file:
crontab -e
Then add as many tasks as you want (see a usage example here).
In our case, to send ourselves a little message every day at 8:30, we can add:
30 8 * * * DISPLAY=:0 notify-send -u critical -i /usr/share/icons/gnome/256x256/emotes/face-wink.png "Hi, did you remember to clock in?"