Extraire son arbre du Mathematics Genealogy Project avec RScrapping the Mathematics Genealogy Project with R

Having had a look at Erwan’s mathematical genealogy tree, I wanted to have a full overview on my mathematical ancestors as well. This is pretty much easy using the Mathematics Genealogy Project, at least

  • once you’ve found one of your mathematics ancestor in this database (I am lucky enough to have Henri Caussinus for mathematics grandpa’ and I must say this is not only a chance because I was able, thanks to him, to build my mathematics genealogy tree but also because he is one of the most optimistic man you could dream about, because he is a great philosopher with an handsome accent from Toulouse):

    Portrait d'Henri Caussinus, mathématicien… par Universite_de_Toulouse
  • if you know how to scrap this web site automatically. I explain in this post how to do it automatically with R.

Scrapping the Mathematics Genealogy Project

First, I scrapped the Mathematics Genealogy Project, by using the R package XML. Basically, I started from Henri Caussinus’s MGP id (157063) and proceeded recursively to:

  • search for the name of the current person’s page;
  • find a way to extract the number of advisors he/she (actually, always he…) had, carefully taking care of all the cases (this part was the trickiest because I had to take into account several different cases, persons with no advisor, persons with several advisor or having done several thesis…);
  • extract the first MGP ids found on the web page according to the number of supervisors: these give access to the supervisors’ web page.

The process was stopped when no more advisors were found on the visiting webpages and I just had to add myself and my mathematics daddy (Louis Ferré) at the beginning of the data.

## MGP scrapping
library(XML)

base.url = "http://genealogy.math.ndsu.nodak.edu/id.php?id="
start.id  = 157063
genealogy = data.frame("id"=1,"mgp.id"=start.id,"name"=NA)
supervise = NULL

cur.person = 1
while (sum(is.na(genealogy$name)>0)) {
  print(cur.person)
  # create the url for the current person and extract the data from the web
  cur.mgp.id = genealogy$mgp.id[cur.person]
  cur.url = paste(base.url,cur.mgp.id,sep="")
  cur.page = htmlTreeParse(cur.url, useInternalNodes = TRUE, encoding="utf-8")
  
  # search for the mathematician's name
  cur.name = xpathApply(cur.page, "//h2", xmlValue)
  cur.name = gsub("\n","",cur.name)
  print(cur.name)
  genealogy$name[genealogy$id==cur.person] = cur.name
  
  # search for the number of supervisors
  nbadv = grep("Advisor",xpathApply(cur.page,"//p",xmlValue),value=TRUE)
  countadv = sum(sapply(nbadv, function(x) sum(unlist(sapply(1:10,function(ind)
                                  grep(as.character(ind),x))))))
  if (countadv==0) {
    if (length(grep("Unknown",nbadv))==0) countadv = 1
  }
  
  # search for the supervisors ids
  if (countadv>0) {
    advisors = xpathSApply(cur.page, "//a[contains(@href, 'id.php?id')]",
                            xmlAttrs)
    advisors = advisors[1:countadv]
    all.ids = sapply(advisors,function(x) gsub("[id.php?id=]","",x,perl=FALSE))
    # removed already existing supervisors
    existing.ids = all.ids[all.ids%in%genealogy[,2]]
    all.ids = setdiff(all.ids,existing.ids)
    if (length(all.ids)>0) {
      adv.data = data.frame("id"=seq(max(genealogy$id)+1,
                                      max(genealogy$id)+length(all.ids),by=1),
                             "mgp.id"=all.ids,"name"=rep(NA,length(all.ids)))
      # update supervise
      supervise = rbind(supervise,cbind(rep(cur.person,countadv),
                                         c(genealogy$id[match(existing.ids,
                                                              genealogy$mgp.id)],
                                           seq(max(genealogy$id)+1,
                                               max(genealogy$id)+length(all.ids),
                                              by=1))))
      # add new advisors
      genealogy = rbind(genealogy,adv.data)
    } else {
      # if no new advisors, just update supervise
      supervise = rbind(supervise,cbind(rep(cur.person,countadv),
                                         genealogy$id[match(existing.ids,
                                                              genealogy$mgp.id)]))
    }
  }
  
  cur.person = cur.person+1
}


save(supervise,genealogy,file="mygenealogy.rda")

# add Louis and me at the beginning of the list
genealogy$id = genealogy$id+2
supervise = supervise+2
starting.persons = data.frame("id"=c(1,2),"mgp.id"=c(NA,NA),
                               name=c("Nathalie Vialaneix","Louis Ferré"))
genealogy = rbind(starting.persons,genealogy)
supervise = rbind(rbind(c(1,2),c(2,3)),supervise)

Using data to display your genealogy tree

Then, I used the package R package igraph to built a igraph object from the collected data (persons’ names and supervising relations). None of the layouts available in the package was suited to display a genealogy tree so I tried to figure out a way to do it by myself. It was pretty much messy and did not strictly respect the genealogy order so I gave up in this direction and simply exported the igraph object to a dot file (using the name as “label” attributes):

# define the tree as an igraph object
library(igraph)
genealogy.tree = graph.data.frame(cbind(supervise[,2],supervise[,1]),
                                  directed=TRUE,vertices=genealogy)
save(supervise,genealogy,genealogy.tree,file="mynewgenealogy.rda")

# export it for graphviz
V(genealogy.tree)$label = V(genealogy.tree)$name
write.graph(genealogy.tree,file="genealogyTree.dot",format="dot")

readable by Graphviz:

dot -Tpng genealogyTree.dot > genealogyTree.png

So… there it is

Among my prestigious mathematics ancestors are Émile Borel, Pierre Simon de Laplace, Siméon Denis Poisson, Joseph Louis Lagrange, Leonhard Euler (but of course not Évariste Galois with whom I am secretly in love for years)… not that I can compare in any way to them!

This post is dedicated to my real grandpa whose life has always been nourishing my political opinions and who is gone last spring.