
The Mathematics Genealogy Project: Customizing my mathematical family tree

Some time ago, Maëlle Salmon published a very nice post showing how she scraped her mathematical family tree from the Mathematics Genealogy Project. Of course, I immediately wanted to produce my own! I am not a mathematician myself, but one of my PhD supervisors has a PhD in mathematics, which makes me the indirect descendant of a long lineage of famous mathematicians! As Maëlle kindly invited me to share my tree on Twitter, I decided to write this post to show how I customized the appearance of my tree to make it sexier. [It is actually part of my job to plot sexy phylogenetic trees…] So, this post is a remix of Maëlle’s post, which was a remix of Nathalie Vialaneix’s post 🙂

1. Scraping more data

First, I modified Maëlle’s scraping function a bit to collect more data: I added XPath selectors for the degree, the university, the country, the year, and the title of the thesis. I am not going to use all this information here, but it is now available if you need it.

library(magrittr)
library(DiagrammeR)
library(dplyr)
library(igraph)
library(purrr)
library(stringr)
library(rvest)
library(xml2)


.get_advisors <- function(id_string = "id.php?id=143630", sleep_time, terminal = FALSE){
  # small break to be nice
  Sys.sleep(sleep_time)
  
  # try to get the page
  page <- glue::glue("https://www.genealogy.math.ndsu.nodak.edu/{id_string}") %>%
    httr::GET() 
  
  # retry until it works, but no more than 5 times
  n_try <- 1
  while(httr::status_code(page) != 200 && n_try <= 5){
    Sys.sleep(sleep_time)
    page <- glue::glue("https://www.genealogy.math.ndsu.nodak.edu/{id_string}") %>%
      httr::GET()
    n_try <- n_try + 1
  }
  
  # Now get student's data
  student_name <- httr::content(page) %>%
    rvest::xml_nodes(xpath = '//h2[@style="text-align: center; margin-bottom: 0.5ex; margin-top: 1ex"]') %>%
    rvest::html_text() %>%
    stringr::str_remove("\\\n")
  
  degree <- httr::content(page) %>%
    rvest::xml_nodes(xpath = '//span[@style="margin-right: 0.5em"]/span/preceding-sibling::text()') %>%
    rvest::html_text()
  
  university <- httr::content(page) %>%
    rvest::xml_nodes(xpath = '//span[@style="margin-right: 0.5em"]/span') %>%
    rvest::html_text()
  
  year <- httr::content(page) %>%
    rvest::xml_nodes(xpath = '//span[@style="margin-right: 0.5em"]/span/following-sibling::text()') %>%
    rvest::html_text() %>% 
    stringr::str_trim()
  
  country <- httr::content(page) %>%
    rvest::xml_nodes(xpath = '//div[@style="line-height: 30px; text-align: center; margin-bottom: 1ex"]/img') %>%
    rvest::html_attr("title")
  
  thesis_title <- httr::content(page) %>%
    rvest::xml_nodes(xpath = '//span[@id="thesisTitle"]') %>%
    rvest::html_text() %>%
    stringr::str_remove_all("\\\n")
  
  # Get all nodes corresponding to advisors
  # Thanks to their... formatting but it works
  all_advisors <- httr::content(page) %>%
    rvest::xml_nodes(xpath = "//p[@style='text-align: center; line-height: 2.75ex']") %>%
    rvest::html_nodes("a")
  
  if(terminal){
    name <- NA
    id_string_advisors <- NA
  } else {
    name <- purrr::map_chr(all_advisors, rvest::html_text)
    id_string_advisors <- purrr::map_chr(all_advisors, rvest::html_attr,
                                         "href")
  }
  
  # Export results
  tibble::tibble(student_name = student_name,
                 degree = list(degree),
                 university = list(university),
                 year = list(year),
                 country = list(country),
                 thesis_title = list(thesis_title),
                 id_string_student = id_string,
                 name = name,
                 id_string = id_string_advisors)
}

# memoise the function so already-visited pages are not scraped twice
get_advisors <- memoise::memoise(.get_advisors)



#### Download and prepare data ####
# initial data.frame (me is the id_string of the starting node)
me <- "id.php?id=143630"
df <- get_advisors(me, 5)
new_df <- df
keep_growing <- TRUE

while(keep_growing){
  # get size to compare to size after a bit more scraping
  nrow1 <- nrow(df)
  # get advisors for all new lines 
  # from the previous iterations
  new_df <- purrr::map_df(new_df$id_string, get_advisors, sleep_time = 30)
  df <- unique(rbind(df, new_df))
  # if the data.frame didn't grow, stop
  if(nrow(df) == nrow1){
    keep_growing <- FALSE
  }
}


df <- df %>% mutate(student_name = stringr::str_trim(student_name),
                    name = stringr::str_trim(name))


terminal_df <- setdiff(df$id_string, df$id_string_student) %>% 
  map(get_advisors, terminal = TRUE, sleep_time = 30) %>% 
  bind_rows() %>%
  mutate(student_name = stringr::str_trim(student_name),
         name = stringr::str_trim(name))


2. Labels

I wanted to create labels with the name of the mathematician, the flag of the country, and the year. Displaying a picture in a node with graphviz (via DiagrammeR) was not simple. Apparently it is possible to use some kind of basic HTML to format nodes, but I failed to include images. Finally, I decided to use emojis 🙂 Thanks to Hadley Wickham’s emo package, it was fairly easy.
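For instance, here is the kind of lookup used below to turn a country name into its flag emoji (a quick sanity check; emo lives on GitHub at hadley/emo):

# remotes::install_github("hadley/emo")
as.character(emo::flag("^France$"))
# [1] "🇫🇷"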

#### Construct the graph ####
# create nodes
labels <-  unique(c(df$student_name, df$name))
nodes_df <- create_node_df(n = length(labels))
nodes_df$label <- labels

df_red <- df %>%
  bind_rows(terminal_df) %>%
  filter(!duplicated(student_name)) %>% 
  right_join(nodes_df, by = c("student_name" = "label"))


# create edges
edges_df <- df[, c("name", "student_name")]
edges_df <- dplyr::left_join(edges_df, nodes_df,
                             by = c("name" = "label"))
edges_df <- dplyr::rename(edges_df, from = id)
edges_df <- dplyr::left_join(edges_df, nodes_df,
                             by = c("student_name" = "label"))
edges_df <- dplyr::rename(edges_df, to = id)

edges_df <- mutate(edges_df, rel = "a", color = "BurlyWood4")


# Create labels
years <- map_chr(df_red$year, paste, collapse = " ") %>% 
  map_chr(str_trim) %>%
  str_replace_all("(?<=[0-9]{4}) (?=[0-9]{4})", ", ") %>% 
  str_replace_all(" /", ", ") %>% 
  paste0("(", ., ")") %>% 
  str_replace_all("\\(\\)", "")


df_red$country <-
  map_if(df_red$country,
         (map_lgl(df_red$country,~ is.null(.x)) | map_lgl(df_red$country,~ length(.x) == 0)),
         function(x) "") %>% 
  map(stringr::str_replace, "UnitedKingdom", "United Kingdom")

country_flag <- map_if(df_red$country,
                       map_lgl(df_red$country,~ any(!.x == "")),
                       function(x) map_chr(x, ~ as.character(emo::flag(paste0("^", .x, "$"))))) %>% 
  map_chr(paste, collapse = " ⋅ ")


# name, flag(s) and year(s) on three lines
label <- paste0(nodes_df$label, "\n", country_flag, "\n", years)

nodes_df$label <- label %>%
  str_replace_all("'", " ") %>%
  str_replace_all("[[:space:]]{2,}", " ")

3. Adding myself

The Genealogy Project obviously knows nothing about me, so I manually add a node for myself and an edge coming from my supervisor (node 1).

nodes_df <- bind_rows(nodes_df,
              tibble(id = nrow(nodes_df) + 1,
                type = NA, label = "François Keck\n🇫🇷\n(2016)"))
          
edges_df <- bind_rows(edges_df,
              tibble(name = "AF", student_name = "FK",
                 from = 1, type.x = NA,
                 to = nrow(nodes_df), type.y = NA,
                 rel = "a", color = "BurlyWood4"))

4. Customizing the style of the nodes

I changed the color and shape of the nodes. For some obscure reason, the rectangle-based shapes of graphviz are not correctly rendered in Firefox on Ubuntu (the labels overlap). It worked on Windows, but there I couldn’t display my flags as colored emojis 🙁 The only solution I found was to fix the font size manually (see the next point).

# Customizing the nodes
nodes_df <- mutate(nodes_df,
                   shape = "note",
                   color = "Tan",
                   fillcolor = "Moccasin",
                   style = "filled")


# create a DiagrammeR dgr_graph object
dgr <- create_graph(nodes_df = nodes_df,
                    edges_df = edges_df[, c(-1, -2)],
                    directed = TRUE)

# export the object to igraph format
# and then write it to a GraphViz DOT file
DiagrammeR::to_igraph(dgr) %>%
  igraph::write.graph(file = "graph.dot",
                      format = "dot") 
DiagrammeR::grViz("graph.dot", width = 4000, height = 5000)

5. The final touch

The viz.js library, which stands behind DiagrammeR, renders DOT objects in the browser as SVG. SVG files are XML-based and can therefore be processed directly with R. This gives us great power to manipulate the look of our tree.

First, we need to convert the widget to static HTML/SVG. We can do that from the command line, using Chromium in headless mode to render the widget page.

DiagrammeR::grViz("graph.dot", width = 4000, height = 5000) %>% 
  htmlwidgets::saveWidget("index.html")
system("chromium-browser --headless --dump-dom index.html > genealogy.html")

Finally, I used R and xml2 to edit the SVG content directly and improve the look of the tree. The code below shows how to fill the page and label backgrounds with a texture image, how to fix the text size, and how to add a shadow effect to the labels.


# Load and clean html
html <- read_html("index.html") 

xml_find_all(html, '//script') %>% 
  xml_remove()

# Background
xml_find_all(html, '/html/body') %>% 
  xml_set_attr('style', 'background-image: url("ricepaper2.png"); margin: 0px; padding: 40px;')

xml_find_all(html, '/html/body/div/div/svg/g/polygon') %>% 
  xml_set_attr('fill', 'transparent')


# Labels text size
xml_find_all(html, '//text') %>% 
  xml_text() %>% 
  str_detect("(^[A-Z])|(^\\()") %>%
  extract(xml_find_all(html, '//text'), .) %>% 
  xml_set_attr('font-size', '12')


# Labels background
xnodes <- xml_find_all(html, '//polygon[@fill="#ffe4b5"]')
xml_set_attr(xnodes, 'id', 'paper_tag')
xml_set_attr(xnodes, 'fill', 'url(#img1)')

xml_find_all(html, '/html/body/div/div/svg') %>%
  xml_add_child(read_xml('
  <defs>
    <pattern id="img1" patternUnits="userSpaceOnUse" width="100" height="100">
      <image href="beige_paper2.png" x="0" y="0" width="100" height="100" />
    </pattern>
  </defs>'), .where = 0)


# Labels shadow fx
xml_find_all(html, '//polygon[@id="paper_tag"]') %>% 
  xml_set_attr('filter', 'url(#f3)')

xnodes <- xml_find_all(html, '//*[@stroke="#d2b48c"]')
xml_set_attr(xnodes, 'stroke', '#000000')
xml_set_attr(xnodes, 'stroke-width', '0.5')

xml_find_all(html, '/html/body/div/div/svg') %>%
  xml_add_child(read_xml('
  <defs>
    <filter id="f3" x="-10%" y="-10%" width="200%" height="200%">
      <feOffset result="offOut" in="SourceAlpha" dx="1" dy="1" />
      <feGaussianBlur result="blurOut" in="offOut" stdDeviation="2" />
      <feBlend in="SourceGraphic" in2="blurOut" mode="normal" />
    </filter>
  </defs>'), .where = 0)

write_html(html, "genealogy.html")

I find the result pretty nice 🙂

And this is my mathematical family tree! I recognize some illustrious names here! Do you?

[Figure: my mathematical family tree]

Bonus:

For a better parchment look, we can use the calligraphic font Tangerine for the labels. Note that some glyphs are unfortunately not supported by this font.

  
# Labels font. 
xml_find_all(html, '//text') %>% 
  xml_text() %>% 
  str_detect("(^[A-Z])|(^\\()") %>%
  extract(xml_find_all(html, '//text'), .) %>% 
  xml_set_attr('font-size', '20')

xml_find_all(html, '//text') %>% 
  xml_text() %>% 
  str_detect("(^[A-Z])|(^\\()") %>%
  extract(xml_find_all(html, '//text'), .) %>% 
  xml_set_attr('font-family', 'Tangerine')

xml_find_all(html, '/html/head') %>%
  xml_add_child(read_xml('<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Tangerine"/>'))

Introducing editheme: Palettes and graphics matching your RStudio editor

I use RStudio every day at work. And every day my eyes thank the developers for implementing theme support for the editor and, more recently, a complete dark skin for the GUI. Dark themes are not only important for your health and the health of the planet. They are also absolutely essential to look cool in the office and make your colleagues believe you’re working on very complicated projects. 😀

However, one little thing could dispel the illusion in a second: an ugly, flashy graphic with a white background in your plot pane.

To solve this serious issue of my daily life I wrote editheme, a tiny package providing color palettes and functions to create graphics matching the RStudio editor. The following animated screenshot is worth a thousand words.

Read more about the package on its GitHub page.

Exploring the CRAN social network

A few months ago, I published a post where I tried to map the dependency relationships between R packages. Today I want to do something similar with package contributors. My idea is to reconstruct a social graph where each node is a person (presumably a developer), and two persons are connected by an edge if they have collaborated on the same package. Thus I will be able to explore the CRAN social network!

This post is also a bit special for me: it’s the very first time I’m using dplyr and the tidyverse. I used to write my code in base R, but the amazing work of Thomas Lin Pedersen on tidygraph and ggraph convinced me to take the plunge. And it was a lot of fun!

First, we load the packages and the data. In the first version of this project, I used the XML file of each package, which I retrieved from CRAN using wget and then read and parsed with xml2. Since R 3.4.0, things are much, much simpler with the function CRAN_package_db.

library(tidyverse)
library(stringr)
library(igraph)
library(tidygraph)
library(ggraph)
library(magrittr)

pdb <- tools::CRAN_package_db()
aut <- pdb$Author

That’s all we need to get the full list of contributors. But it comes in a very messy form. R provides a structured system to describe persons, which should be used in the Authors@R field of the DESCRIPTION file, but most people use a simple character string.
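To see the difference, compare a structured description built with person() to the kind of free-text string you actually find (both entries below are made up):

# structured: what an Authors@R field should contain
person("Jane", "Doe", email = "jane@example.org", role = c("aut", "cre"))
# free text: what the Author field often looks like in practice
# "Jane Doe [aut, cre], John Smith [ctb], with contributions from A. N. Other"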

The big cleanup

I combined a series of regular expressions and string manipulations to extract the names. It was the first time I used stringr and, I have to say, I don’t miss gsub, grepl, etc. I just miss the support for some PCRE features…

aut <- aut %>%
  str_replace_all("\\(([^)]+)\\)", "") %>%
  str_replace_all("\\[([^]]+)\\]", "") %>%
  str_replace_all("<([^>]+)>", "") %>%
  str_replace_all("\n", " ") %>%
  str_replace_all("[Cc]ontribution.* from|[Cc]ontribution.* by|[Cc]ontributors", " ") %>%
  str_replace_all("\\(|\\)|\\[|\\]", " ") %>%
  iconv(to = "ASCII//TRANSLIT") %>%
  str_replace_all("'$|^'", "") %>%
  gsub("([A-Z])([A-Z]{1,})", "\\1\\L\\2", ., perl = TRUE) %>%
  gsub("\\b([A-Z]{1}) \\b", "\\1\\. ", .) %>%
  map(str_split, ",|;|&| \\. |--|(?<=[a-z])\\.| [Aa]nd | [Ww]ith | [Bb]y ", simplify = TRUE) %>%
  map(str_replace_all, "[[:space:]]+", " ") %>%
  map(str_replace_all, " $|^ | \\.", "") %>%
  map(function(x) x[str_length(x) != 0]) %>%
  set_names(pdb$Package) %>%
  extract(map_lgl(., function(x) length(x) > 1))

aut_list <- aut %>%
  unlist() %>%
  dplyr::as_data_frame() %>%
  count(value) %>% 
  rename(Name = value, Package = n) 

At this point, we have a list of 13668 unique names, and this number increases almost every day. The R community is huge! Of course, there are still some errors. And we can find surprising contributors:

aut_list$Name[4922]
# [1] "Her Majesty the Queen in Right of Canada"

The Social Network

The next step is to produce a two-column matrix describing all the connections of the network (an edge list). An edge list can be turned into a graph object with the igraph package. Finally, we convert the igraph graph into a tidy graph, so we can use the API provided by the tidygraph package, for example to filter nodes/edges or to keep only the main component.

edge_list <- aut %>%
  map(combn, m = 2) %>%
  do.call("cbind", .) %>%
  t() %>%
  dplyr::as_data_frame() %>%
  arrange(V1, V2) %>%
  count(V1, V2)

g <- edge_list %>%
  select(V1, V2) %>%
  as.matrix() %>%
  graph.edgelist(directed = FALSE) %>% 
  as_tbl_graph() %>%
  activate("edges") %>%
  mutate(Weight = edge_list$n) %>%
  activate("nodes") %>%
  rename(Name = name) %>% 
  mutate(Component = group_components()) %>%
  filter(Component == names(table(Component))[which.max(table(Component))])

Thomas, the author of tidygraph, also developed the ggraph package, an implementation of the grammar of graphics for relational data. With ggraph, it is very easy to visualize our graph.

ggraph(g, layout = 'lgl') +
  geom_edge_fan(alpha = 0.1) +
  theme_graph()

[Figure: the main component of the CRAN social network]

This is it! This hairy ink stain is the main connected component of the CRAN social network, representing collaborations between 6758 developers!

We can look for the « most important » node in the graph according to Google’s PageRank algorithm, although it’s not straightforward what a high PageRank would mean here…

g %>%
  mutate(Page_rank = centrality_pagerank()) %>% 
  top_n(10, Page_rank) %>% 
  as_tibble() %>% 
  ggplot() +
    geom_col(aes(forcats::fct_reorder(Name, Page_rank), Page_rank)) +
    coord_flip() + theme_minimal() +
    labs(title = "Top 10 contributors based on Page Rank", x = "")

[Figure: barplot of the top 10 contributors by PageRank]

Focus on the core network

The complete graph makes a nice painting to hang on the wall of your living room, but it is hard to say anything about it. So we will reduce the data and focus on the contributors involved in more than 4 packages.

g <- g %>% 
  left_join(aut_list) %>% 
  filter(Package > 4) %>% 
  mutate(Component = group_components()) %>%
  filter(Component == names(table(Component))[which.max(table(Component))])

ggraph(g, layout = 'lgl') +
  geom_edge_fan(alpha = 0.1) +
  theme_graph()

[Figure: the CRAN network restricted to contributors with more than 4 packages]

Nice! It seems like there are two distinct clusters in the middle. Let’s see if we can identify them using a community detection algorithm.

g <- mutate(g, Community = group_edge_betweenness(),
            Degree = centrality_degree())

filter(g, Community == names(sort(table(Community), decr = TRUE))[1]) %>%
  select(Name, Package) %>%
  arrange(desc(Package)) %>%
  top_n(10, Package) %>%
  as_tibble() %>% 
  knitr::kable(format = "html", caption = "Cluster 1")
filter(g, Community == names(sort(table(Community), decr = TRUE))[2]) %>%
  select(Name, Package) %>%
  arrange(desc(Package)) %>%
  top_n(10, Package) %>%
  as_tibble() %>% 
  knitr::kable(format = "html", caption = "Cluster 2")
Cluster 1
Name Package
Kurt Hornik 56
Martin Maechler 46
Achim Zeileis 44
Dirk Eddelbuettel 41
Romain Francois 27
Ben Bolker 25
Brian Ripley 25
Torsten Hothorn 24
Roger Bivand 23
Douglas Bates 23
Cluster 2
Name Package
Hadley Wickham 95
Rstudio 92
Scott Chamberlain 44
Inc 36
Yihui Xie 34
Jeroen Ooms 33
Jj Allaire 29
R. Core Team 27
Bob Rudis 26
Gabor Csardi 22
Oliver Keyes 22
Winston Chang 22

If you know R and its community a bit, you will certainly recognize some names. The challenge is to interpret this classification. It is not a good thing to put people in boxes, but it is mandatory for making nice, colorful visualizations! So, I’ll give it a try. To me, the first group includes people from the early days, who contributed to major packages that constitute the heart of R. The second group is more related to the second generation, associated with RStudio products, which has been particularly prolific in recent years. Of course, the two groups are strongly connected.

How can we label these two groups? After long consideration, I chose to go with « The Ancients » and « The Moderns », without any value judgment! This is a reference to an old debate in French literature. This dubious comparison would certainly be more appropriate for the ongoing base vs. tidyverse debate, but… not today!

Let’s see how we can visualize these two groups.

g <- g %>% 
  mutate(Community = case_when(Community == names(sort(table(Community),
                                                       decr = TRUE))[1] ~ "The Ancients",
                               Community == names(sort(table(Community),
                                                       decr = TRUE))[2] ~ "The Moderns",
                               Community %in% names(sort(table(Community),
                                                       decr = TRUE))[-1:-2] ~ "Unclassified")) %>% 
  mutate(Community = factor(Community))

g <- g %>%
  filter(Degree > 5) %>% 
  mutate(Degree = centrality_degree())

ggraph(g, layout = 'lgl') +
  geom_edge_fan(alpha = 0.1) +
  geom_node_point(aes(color = Community, size = Package)) +
  theme_graph() +
  scale_color_manual(breaks = c("The Ancients", "The Moderns"),
                     values=c("#F8766D", "#00BFC4", "#969696"))

[Figure: the CRAN network colored by community]

ggraph does a great job here! I really like the result.

BTW, the R community is not really divided…

g %>% activate("edges") %>% 
  mutate(Community_from = .N()$Community[from],
         Community_to = .N()$Community[to]) %>% 
  filter((Community_from == "The Ancients" & Community_to == "The Moderns")|
           Community_from == "The Moderns" & Community_to == "The Ancients") %>% 
  as_tibble() %>% 
  nrow()

… there are 254 edges connecting the Ancients and the Moderns! But I think this is a nice illustration of how social graphs keep the imprint of history and past events. The next step would be to look at the graph dynamics over time. Maybe an idea for a future post…

Visualizing the R packages galaxy

The idea of this post is to create a kind of map of the R ecosystem, showing the dependency relationships between packages. It is actually quite simple to quickly generate such a graph object, since all the data we need can be obtained with one call to the function available.packages(). There are a few additional steps to clean and arrange the data, but with the following code anyone can generate a dependency graph in less than 5 seconds.

First, we load the igraph package and retrieve the data from our current repository.

library(igraph)
dat <- available.packages()

For each package, we produce a character string with all its dependencies (Imports and Depends fields) separated by commas.

dat.imports <- paste(dat[, "Imports"], dat[, "Depends"], sep = ", ")
dat.imports <- as.list(dat.imports)
dat.imports <- lapply(dat.imports, function(x) gsub("\\(([^\\)]+)\\)", "", x))
dat.imports <- lapply(dat.imports, function(x) gsub("\n", "", x))
dat.imports <- lapply(dat.imports, function(x) gsub(" ", "", x))

Next, we split the strings and use the stack function to get the complete list of edges of the graph.

dat.imports <- sapply(dat.imports, function(x) strsplit(x, split = ","))
dat.imports <- lapply(dat.imports, function(x) x[!x %in% c("NA", "R")])
names(dat.imports) <- rownames(dat)
dat.imports <- stack(dat.imports)
dat.imports <- as.matrix(dat.imports)
dat.imports <- dat.imports[dat.imports[, 1] != "", ]

Finally, we create the graph from the list of edges. Here I select the largest connected component, because there are many isolated vertices which would make the graph harder to represent.

g <- graph.edgelist(dat.imports)
g <- decompose.graph(g)
g <- g[[which(sapply(g, vcount) == max(sapply(g, vcount)))]]

Now that we have the graph, we could compute many graph-theory statistics, but here I just want to visualize the data. Plotting such a large graph with igraph is possible, but slow and tedious. I prefer to export the graph and open it in Gephi, a free and open-source software for network visualization.
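For reference, here is one way to do the export (the file name is my choice; Gephi reads GraphML, among other formats):

igraph::write.graph(g, file = "cran_packages.graphml", format = "graphml")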
The figure below was created with Gephi. Nodes are sized according to their number of incoming edges. This is pretty useless and uninformative, but I still like it. It looks like a sky map, and it is great to feel that we, users and developers, are part of this huge R galaxy.

[Figure: the R packages network — click on the picture to see the HD version]
rleafmap: R Markdown in interactive popups

This is the second « big » feature coming with branch 0.2 of rleafmap (now on CRAN!). With this new version, you can write popup content in R Markdown, which will be processed when you generate the map. This can be useful to format popups with Markdown syntax (if you need more control, remember that popups can also be formatted with HTML tags). Even more interesting, R Markdown makes it possible to include the output of R code chunks, so results, simulations, and plots can easily be embedded in popup content.

To activate R Markdown for a layer, you just have to pass the R Markdown code to the popup argument and set popup.rmd to TRUE.
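Something like this minimal sketch, where my.spdf stands for any Spatial object you want to display (both names are placeholders):

# R Markdown content with an inline R chunk (made-up example)
popup.content <- "#### My station\nMean of 10 random values: `r round(mean(rnorm(10)), 2)`"
my.layer <- spLayer(my.spdf, popup = popup.content, popup.rmd = TRUE)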

The chunkerize function

You can write your R code chunks manually but you can also use the function chunkerize which tries to make your life simpler. This function has two purposes:

  • It turns a function and its arguments into an R code chunk.
  • It can decompose the elements of a list and use them as values for a given argument.

The latter may be useful when you have several features on your map and want to execute the same function for each of them, but with different data as input. This can be done by tagging the list containing the data with a star character (*). See the following example:

Data you need for the example: download and unzip
First we load the packages we need:

library(rleafmap)
library(maptools)

Then we load the data: a shapefile of regions and a dataframe giving the evolution of the population for each region.

reg <- readShapePoly("regions-20140306-100m")
reg <- reg[c(-11,-12,-16,-19,-20),]
pop <- read.csv("population.txt", sep = "\t", row.names = 1)
pop <- pop[rev(rownames(pop)), sort(names(pop))]

We prepare some colors for the map and a legend:

gcol <- rev(heat.colors(5))
gcut <- cut(as.numeric(pop["2012", ]),
            breaks = c(0, 2000000, 3000000, 4000000, 8000000, 12000000))
reg.col <- gcol[as.numeric(gcut)]
reg.leg <- layerLegend(style = "polygons",
                       title = "Population",
                       labels = levels(gcut),
                       fill.col = gcol)

We prepare the data for chunkerize. pop is technically already a list, since it is a dataframe, but for clarity we turn it into a simple list.

Year <- as.numeric(rownames(pop))
L <- as.list(pop)
L2 <- as.list(names(L))

Now comes the trick. We create a chunk based on the plot function, providing 5 arguments (names and values). Each element of arg.values is going to be recycled, except L and L2, which are tagged with a star: for those, each element of the list is used as a value.

popup <- chunkerize(FUN = plot,
                    arg.names = c("x", "y", "type", "ylab", "main"),
                    arg.values = c("Year", "*L", "'b'", "'Population'", "*L2"))

Now we just have to create the layers and compile the map:

cdbpos.bm <- basemap("cartodb.positron.nolab")
reg.map <- spLayer(reg,
                   fill.col = reg.col,
                   legend = reg.leg,
                   popup = popup,
                   popup.rmd = TRUE)
writeMap(cdbpos.bm, reg.map)

Et voilà ! Note that there is a problem with RStudio and Firefox: popups sometimes do not appear where they should on the first click. It works fine in Chromium.

Rummaging through dusty books: Maucha diagrams in R

Do you know the Maucha diagram? If you are not a Hungarian limnologist, probably not! This diagram was proposed by Rezso Maucha in 1932 as a way to visualize the relative ionic composition of water samples. However, as far as I know, the diagram has had little success in the community. I had never heard of it until my coworker Kalman (who is also Hungarian) asked me if I knew how to plot it in R.

First, I have to admit I was a bit skeptical… But we finally decided it could be an interesting and fun programming exercise.

We found instructions for drawing the diagram in Broch and Yake (1969) [1], but we quickly became interested in finding the original paper by Maucha [2]. This paper is apparently not available online, and the only hard copy we could locate was at the University of Grenoble (a 2-hour drive). Nonetheless, we had a look in the library of our lab and… miracle! We found it between two old dusty books, where it had probably been waiting for decades!

The famous book of Rezso Maucha!

Meticulously following Maucha’s instructions, we were able to write a function to draw the diagram. Then we added some options: colors, labels, and the possibility to draw multiple diagrams from a matrix. Finally, we put the code in a package (hosted on GitHub) together with the dataset included in the original publication.

To install the package, install devtools from your CRAN repo and run:

devtools::install_github("fkeck/oviz")

Then you can load the dataset used by Maucha [2] to introduce his diagram:

data(ionwaters)

And then you can use the function maucha, which will plot one diagram for each row of the matrix.

maucha(ionwaters)

[Figure: Maucha diagrams for the ionwaters dataset]

Here we are. And if you are interested in the ionic composition of waters, stay tuned: we are planning to add more, such as the Stiff diagram and the Piper diagram.

[1] Broch, E. S., & Yake, W. (1969). A modification of Maucha’s ionic diagram to include ionic concentrations. Limnology and Oceanography, 14(6), 933-935.
[2] Maucha, R. (1932). Hydrochemische Methoden in der Limnologie. Binnengewasser, 12. 173p.

Fully customizable legends for rleafmap

This is a functionality I had wanted to add for some time… and it’s finally here! I just pushed to GitHub a new version of rleafmap which brings the possibility to attach legends to data layers. You simply need to create a legend object with the function layerLegend, and then pass this object via the legend argument when you create your data layer. A map can thus contain several legends, each of them independent. This is cool because it means that when you hide a data layer with the layer control, its legend disappears too.

You can create legends with five different styles to better suit your data: points, lines, polygons, icons, and gradient (see the graphic below).

[Figure: the five legend styles of rleafmap]

Legends for John Snow’s cholera map

As an example, I made a new version of the cholera map.
Here is the code:

devtools::install_github("fkeck/rleafmap")
devtools::install_github("Hackout2/epimap")
library(rleafmap)
library(epimap)

data(cholera)
n.chol <- ifelse(cholera$deaths$Count > 4, 6, cholera$deaths$Count) + 2

# Basemap layer
cdbpos.bm <- basemap("cartodb.positron.nolab")

# Legends
death.leg <- layerLegend(style = "points", title = "Deaths",
                         labels = c("1", "2", "3", "4", "> 4"), size = c(3, 4, 5, 6, 8),
                         fill.col = "red", fill.alpha = 0.9)
pumps.leg <- layerLegend(style = "icons", title = NA, labels = "Water pump",
                         png = "/home/francois/water.png", png.width = 31, png.height = 31)

# Data layers
death.points <- spLayer(cholera$deaths, legend = death.leg,
                        size = n.chol, fill.col =  "red", fill.alpha = 0.9)
pumps.points <- spLayer(cholera$pumps, legend = pumps.leg,
                        png = "/home/francois/water.png",
                        png.width=31, png.height=31)

my.ui <- ui(layers = "topright")

writeMap(cdbpos.bm, pumps.points, death.points, interface = my.ui,
         setView = c(51.5135, -0.137), setZoom = 17)

And here is the result. Enjoy, and check out the legend interactivity when you play with the layer selector 😉

Contours and Networks with epimap and rleafmap

In February, I participated in a hackathon organized by Thibaut Jombart at Imperial College London, to work on visualization tools for outbreak data. This was a great time spent with great people! Thanks again, Thibaut, for organizing. I took part in the development of epimap, an R package for statistical mapping. The aim of epimap is to provide tools to quickly and efficiently visualize spatial data. There is a set of functions designed to do that, and you can check out the GitHub page for a demo.

This package also provides two functions to coerce complex objects to Spatial classes so that they can be easily included in a map.

  • The contour2sp function takes a SpatialGrid object and returns contour lines as a SpatialLinesDataFrame.
  • The graph2sp function takes a graph (from igraph package) with geolocated vertices and returns a list of Spatial objects (points and lines).

Following this post by Arthur Charpentier (who plays nicely with rleafmap!), I decided to include John Snow’s cholera dataset in epimap so it can easily be used for tests.

In this post, I want to show how epimap and rleafmap can be combined to create fully customizable interactive maps with complex objects. The cholera dataset gives the locations of cholera deaths and the locations of water pumps in London. The map will show the locations of cholera deaths as points, the local density of deaths as colored contour lines, and the locations of the water pumps as icons. Moreover, the pumps will be represented within a network where two pumps are connected if they are close enough.

library(rleafmap)
library(epimap)
library(igraph)   # for graph.adjacency() and V()

data(cholera)

# Create a network of pumps
pump.adj <- as.matrix(dist(sp::coordinates(cholera$pumps)))
pump.graph <- graph.adjacency(pump.adj < 0.003, diag = FALSE)
V(pump.graph)$lat <- sp::coordinates(cholera$pumps)[, 2]
V(pump.graph)$lon <- sp::coordinates(cholera$pumps)[, 1]

# Convert death density SpatialGrid to contour SpatialLines
death.cont <- contour2sp(cholera$deaths.den, nlevels = 10)

# Basemap layer
cdbdark.bm <- basemap("cartodb.darkmatter.nolab")

# Data layers
death.points <- spLayer(cholera$deaths,
                        size = 1,
                        fill.col =  "white",
                        fill.alpha = 0.5,
                        stroke = FALSE)
death.contour <- spLayer(death.cont,
                         stroke.col = heat.colors(12)[cut(death.cont$level, 12)],
                         stroke.lwd = 1.5,
                         stroke.alpha = 1)
pumps.points <- spLayer(graph2sp(pump.graph)[[1]],
                        png = "/home/francois/water.png",
                        png.width=31 ,
                        png.height=31)
pumps.links <- spLayer(graph2sp(pump.graph)[[2]],
                       stroke.lwd = 3,
                       stroke.col = "white")

my.ui <- ui(layers = "topright")

writeMap(cdbdark.bm, death.points, death.contour,
         pumps.links, pumps.points, interface = my.ui)

And here is the map we get:

Introducing rleafmap. An R package for interactive maps with Leaflet.

Obviously, I am late…

I released rleafmap about one year ago and I am only writing this blog post today. In the meantime, I presented the package to the French R users community at the 3eme Rencontres R in Montpellier and got some good feedback. Now I would like to communicate better about the project. My idea is to post news about the development and present new features, illustrated with examples, on this blog. The documentation and tutorials will be published on the project website (http://www.francoiskeck.fr/rleafmap/) if I can find the time.

Purpose and philosophy

rleafmap is an R package that can be used to generate interactive maps with your data. If you manipulate spatial data in the R environment, at some point you will probably want to visualize it, and the most common way to visualize spatial data is a map. Like other packages (googleVis, rMaps…), rleafmap is designed to produce maps with interactivity, bringing a richer experience to the end user. This is made possible by the use of Leaflet, the amazing open-source JavaScript library created by Vladimir Agafonkin.

There are two important things to be aware of for a good start with rleafmap.

  • First, the package exclusively uses input data inheriting from the Spatial class of the sp package. These classes are central and allow you to work with many packages. If you don’t know them, a good place to start is the vignette of the sp package on CRAN. If you prefer a good book, have a look at Bivand et al. 2013 [1].
  • The second point is about how the package works. Data layers are stored in independent R objects, each with its own symbology, and the map is a compilation of these objects (see the minimal sketch below). The idea is to stick to the philosophy of traditional GIS software.
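To give a very minimal sketch of this layer logic (the point data and object names below are made up):

library(rleafmap)
library(sp)

# ten random points around Paris
pts <- SpatialPoints(cbind(runif(10, 2.2, 2.4), runif(10, 48.8, 48.9)))

bm <- basemap("cartodb.positron.nolab")     # a basemap layer
pts.lay <- spLayer(pts, fill.col = "red")   # a data layer with its own symbology
writeMap(bm, pts.lay)                       # compile the layers into one map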

For a more complete view of the package, I strongly recommend having a look at the website.

[1] Bivand R.S., Pebesma E.J. & Gómez-Rubio V. (2013) Applied Spatial Data Analysis with R, 2nd edn. Springer, New York.

Google at random

Sometimes, in moments of great solitude, I find myself staring blankly at www.google.fr, lost, without really knowing what I am doing there. I am then surprised to see how my mind starts thinking about how to find an answer before I have even formulated my question. But this evolution of my behavior is quite logical after all, since whatever the question, 9 times out of 10 it is through Google that I get the answer.

That is how I explain that I can end up on Google with nothing to ask. And sometimes, facing the blinking cursor, I have this (simian?) reflex of hitting keys at random.

Mykl Roventine on Flickr (CC BY-NC-SA)

Even if, when you think about it, this is probably quite far from true randomness, I am always impressed by the engine’s ability to return results. For example, the search sdbxce returns 612,000 results (65 if you search for the exact term). We can nevertheless observe, quite unsurprisingly, that the number of results decreases with each new letter added.

So today’s little experiment is to probe the enormity of Google’s index a bit, by determining up to how many randomly typed letters the beast still returns results.
We could of course do this by hand, but R has a few decisive advantages:

  • It is better than we are at simulating randomness
  • It can run the Google searches for us, which is not without interest given that several hundred of them are on the program…

So let’s start by generating our search terms. A 100×15 matrix: 100 replicates of words of 1 to 15 randomly drawn letters.

nb.letters <- 15
nb.samp <- 100

req <- data.frame()
for (i in 1:nb.letters){
  for(j in 1:nb.samp){
    samp <- sample(letters, i, replace = TRUE)
    req[j, i] <- paste(samp, collapse = "")
  }
}

Now, here is a function that runs a Google search from a character string and returns the number of results reported by the engine. I chose to search for exact terms (search within quotation marks) because I found it more telling.

library(XML)
library(RCurl)

goonum <- function(q){
  Sys.sleep(runif(1))
  url <- paste("http://www.google.fr/search?aq=f&gcx=w&sourceid=chrome&ie=UTF-8&q=\"", q, "\"", sep="")
  gpage <- getURL(url)
  gpage <- htmlParse(gpage)
  nbres <- xpathApply(gpage, "//div[@id='resultStats']", xmlValue)
  nbres.num <- unlist(gregexpr("[0-9]", nbres))
  nbres.val <- vector()
  for(i in 1:length(nbres.num)) 
    {nbres.val[i] <- substr(nbres, nbres.num[i], nbres.num[i])}
  nbres.val <- as.numeric(paste(nbres.val, collapse=""))
  nbres.val <- ifelse(is.na(nbres.val), 0, nbres.val)
  return(nbres.val)
}

All that remains is to loop the function over our query matrix, and to draw two small plots, which are quite telling.

g.ans <- apply(req, FUN = goonum, MARGIN = c(1, 2))
plot(apply(log(g.ans + 1), 2, mean), type = "b", ylab = "Number of results (log+1)", xlab = "Number of letters")
plot(apply(g.ans == 0, 2, sum), type = "b", ylab = "% of queries with no results", xlab = "Number of letters")

Hmm, talk about telling… When you come from the world of biology (and worse still, ecology), you are always a bit stunned by this kind of data! It is smooth as silk!

As expected, the number of results decreases exponentially as the number of letters increases… up to 7 letters. At 8 letters, 90% of the queries no longer return any results. From 9 letters on, nothing at all, at least in this sample. Still, 7 letters is already quite something: there are 26^7 possible 7-letter combinations, i.e. more than 8 billion…
