Exploring the CRAN social network

12 septembre 2017 Francois Keck Commentaires 5 commentaires

A few months ago, I published a post where I was trying to map the dependencies relationships between R packages. Today I want to do something similar with package contributors. My idea is to reconstruct a social graph where each node would be a person (presumably a developer), and two persons would be connected by an edge if they have collaborated on the same package. Thus I would be able to explore the CRAN social network!

This post is also a bit special for me. It’s the very first time I’m using dplyr and the tidyverse. I used to write my code in base R but the amazing work of Thomas Lin Pedersen around tiygraph and ggraph convinced me to take the plunge. And it was a lot of fun!

First, we load the packages and the data. In the first version of the project I used the XML file of each package that I retrieved from CRAN using wget and then read and parsed with xml2. Since R 3.4.0, things are much, much simpler with the function CRAN_package_db.

library(tidyverse)
library(stringr)
library(igraph)
library(tidygraph)
library(ggraph)
library(magrittr)

pdb <- tools::CRAN_package_db()
aut <- pdb$Author

That’s all we need to get the full list of contributors. But it comes in a very messy way. R provides a structured system to describe persons which should be used in the Authors@R field of the DESCRIPTION file, but most people use a simple character string.

The big cleanup

I combined a series of regular expressions and string manipulations to extract every name. It was the first time I was using stringr and I have to say, I don’t regret gsub, grepl, etc. I just miss the support for some PCRE functionalities…

aut <- aut %>%
  str_replace_all("\\(([^)]+)\\)", "") %>%
  str_replace_all("\\[([^]]+)\\]", "") %>%
  str_replace_all("<([^>]+)>", "") %>%
  str_replace_all("\n", " ") %>%
  str_replace_all("[Cc]ontribution.* from|[Cc]ontribution.* by|[Cc]ontributors", " ") %>%
  str_replace_all("\\(|\\)|\\[|\\]", " ") %>%
  iconv(to = "ASCII//TRANSLIT") %>%
  str_replace_all("'$|^'", "") %>%
  gsub("([A-Z])([A-Z]{1,})", "\\1\\L\\2", ., perl = TRUE) %>%
  gsub("\\b([A-Z]{1}) \\b", "\\1\\. ", .) %>%
  map(str_split, ",|;|&| \\. |--|(?<=[a-z])\\.| [Aa]nd | [Ww]ith | [Bb]y ", simplify = TRUE) %>%
  map(str_replace_all, "[[:space:]]+", " ") %>%
  map(str_replace_all, " $|^ | \\.", "") %>%
  map(function(x) x[str_length(x) != 0]) %>%
  set_names(pdb$Package) %>%
  extract(map_lgl(., function(x) length(x) > 1))

aut_list <- aut %>%
  unlist() %>%
  dplyr::as_data_frame() %>%
  count(value) %>% 
  rename(Name = value, Package = n)

At this point we have a list of 13668 (unique) names, and this number is increasing almost every day. The R community is huge! Of course there are still some errors. And we can find surprising contributors:

aut_list$Name[4922]

# [1] "Her Majesty the Queen in Right of Canada"

The Social Network

The next step is to produce a two-column matrix describing all the connections of the network (edge list). An edge list can be turned into a graph object with the igraph package. Finally we can convert the igraph graph into a tidy graph so we can use the API provided by the tidygraph package. For example to filter nodes/edges or select only the main component.

edge_list <- aut %>%
  map(combn, m = 2) %>%
  do.call("cbind", .) %>%
  t() %>%
  dplyr::as_data_frame() %>%
  arrange(V1, V2) %>%
  count(V1, V2)

g <- edge_list %>%
  select(V1, V2) %>%
  as.matrix() %>%
  graph.edgelist(directed = FALSE) %>% 
  as_tbl_graph() %>%
  activate("edges") %>%
  mutate(Weight = edge_list$n) %>%
  activate("nodes") %>%
  rename(Name = name) %>% 
  mutate(Component = group_components()) %>%
  filter(Component == names(table(Component))[which.max(table(Component))])

The author of tidygraph, Thomas, also developed the ggraph package, an implementation of the grammar of graphics for relational data. Using ggraph, it is very easy to visualize our graph.

ggraph(g, layout = 'lgl') +
  geom_edge_fan(alpha = 0.1) +
  theme_graph()

This is it! This hairy ink stain is the main connected component of the CRAN social network, representing collaborations between 6758 developers!

We can look for who’s the « most important » node in the graph according to the Google’s PageRank algorithm, although it’s not straightforward what could mean a high PR here…

g %>%
  mutate(Page_rank = centrality_pagerank()) %>% 
  top_n(10, Page_rank) %>% 
  as_tibble() %>% 
  ggplot() +
    geom_col(aes(forcats::fct_reorder(Name, Page_rank), Page_rank)) +
    coord_flip() + theme_minimal() +
    labs(title = "Top 10 contributors based on Page Rank", x = "")

Focus on the core network

The complete graph makes a nice painting to hang on the wall of your living room. But hard to say anything about it. So, we will reduce the data and focus on the contributors involved in more than 4 packages.

g <- g %>% 
  left_join(aut_list) %>% 
  filter(Package > 4) %>% 
  mutate(Component = group_components()) %>%
  filter(Component == names(table(Component))[which.max(table(Component))])

ggraph(g, layout = 'lgl') +
  geom_edge_fan(alpha = 0.1) +
  theme_graph()

Nice! It seems like there are two distinct clusters in the middle. Let’s see if we can identify them using a community detection algorithm.

g <- mutate(g, Community = group_edge_betweenness(),
            Degree = centrality_degree())

filter(g, Community == names(sort(table(Community), decr = TRUE))[1]) %>%
  select(Name, Package) %>%
  arrange(desc(Package)) %>%
  top_n(10, Package) %>%
  as_tibble() %>% 
  knitr::kable(format = "html", caption = "Cluster 1")
filter(g, Community == names(sort(table(Community), decr = TRUE))[2]) %>%
  select(Name, Package) %>%
  arrange(desc(Package)) %>%
  top_n(10, Package) %>%
  as_tibble() %>% 
  knitr::kable(format = "html", caption = "Cluster 2")

Cluster 1
Name	Package
Kurt Hornik	56
Martin Maechler	46
Achim Zeileis	44
Dirk Eddelbuettel	41
Romain Francois	27
Ben Bolker	25
Brian Ripley	25
Torsten Hothorn	24
Roger Bivand	23
Douglas Bates	23

Cluster 2
Name	Package
Hadley Wickham	95
Rstudio	92
Scott Chamberlain	44
Inc	36
Yihui Xie	34
Jeroen Ooms	33
Jj Allaire	29
R. Core Team	27
Bob Rudis	26
Gabor Csardi	22
Oliver Keyes	22
Winston Chang	22

If you know a bit R and its community, you will certainly recognize some names. The challenge is to interpret this classification. It is not a good thing to put people in boxes, but this is mandatory to make nice and colorful visualizations! So, I give it a try. For me, the first group include people of the early days who contributed to major packages which constitute the heart of R. The second group is more related to the second generation, associated with Rstudio products and who have been particularly prolific in the last years. Of course the two groups are strongly connected.

How can we label these two groups? After long consideration, I chose to go for « The Ancients » and « The Moderns », without any value judgment! This is a reference to an old debate in French literature. This dubious comparison would certainly be more appropriate to designate the ongoing debate base vs. tidyverse, but… not today!

Let’s see how we can visualize these two groups.

g <- g %>% 
  mutate(Community = case_when(Community == names(sort(table(Community),
                                                       decr = TRUE))[1] ~ "The Ancients",
                               Community == names(sort(table(Community),
                                                       decr = TRUE))[2] ~ "The Moderns",
                               Community %in% names(sort(table(Community),
                                                       decr = TRUE))[-1:-2] ~ "Unclassified")) %>% 
  mutate(Community = factor(Community))

g <- g %>%
  filter(Degree > 5) %>% 
  mutate(Degree = centrality_degree())

ggraph(g, layout = 'lgl') +
  geom_edge_fan(alpha = 0.1) +
  geom_node_point(aes(color = Community, size = Package)) +
  theme_graph() +
  scale_color_manual(breaks = c("The Ancients", "The Moderns"),
                     values=c("#F8766D", "#00BFC4", "#969696"))

ggraph really does a great job! I really like the result.

BTW, the R community is not really divided…

g %>% activate("edges") %>% 
  mutate(Community_from = .N()$Community[from],
         Community_to = .N()$Community[to]) %>% 
  filter((Community_from == "The Ancients" & Community_to == "The Moderns")|
           Community_from == "The Moderns" & Community_to == "The Ancients") %>% 
  as_tibble() %>% 
  nrow()

… there are 254 edges connecting the the Ancients and the Moderns! But I think this is a nice illustration of how social graphs keeps the imprint of history and past events. Next step would be to look at the graph dynamics over time. Maybe an idea for a future post…

5 réactions au sujet de « Exploring the CRAN social network »

Sungpil Han dit :

13 septembre 2017 à 2 h 10 min

This is an interesting analysis indeed! The top beautiful figure is quite similar to an oriental painting style. So artistic.

Répondre
Joshua Kunst dit :

13 septembre 2017 à 3 h 13 min

Thanks so much for the nice post! 😀

Répondre
Florian Zenoni dit :

13 septembre 2017 à 11 h 20 min

Thank you very much for this great post.
I’d like to stress that I only have ~ 6 months of experience in R, so going through the post however I encountered a couple of issues.

First of all, running blindly the « big cleanup » I obtained the following error:
Error in parse(text = dbname) : :1:7: unexpected symbol
1: Scott Fortmann
^

Then, going through the single steps I found out that I am unable to interpret the following regex: « \\[([^]]+)\\] ». I looked for solutions in https://regexr.com/ but to no avail.

Répondre
1. Francois Keck dit :
  
  13 septembre 2017 à 22 h 37 min
  
  Hi Florian, I’m afraid I have no idea why it is not working for you. I tried again and it works on my computer, the syntax of this regex is correct. Hope you can find a solution.
  
  Répondre
  1. Florian Zenoni dit :
    
    14 septembre 2017 à 11 h 34 min
    
    Hi Francois,
    Thanks a lot for your answer. I will keep investigating. 🙂
    
    Répondre

Piece of K