Installing R package ‘rmr2’ (part of RHadoop) without Hadoop

For pedagogical purposes, I needed to install rmr2, an R package designed to run MapReduce code in R. This package is part of the RHadoop project, which aims to let users manage and analyze data with Hadoop from R. I followed two strategies:

  • the first one (not shown in this tutorial) consisted in installing a virtual Linux machine (XUbuntu OS) with a single-node Hadoop installation on it;
  • the second one, which is the topic of this tutorial, consisted in installing rmr2 without Hadoop on various operating systems. This strategy (and its results) is explained below.

Overview of the tutorial:

  1. Installation on Linux
  2. Installation on Windows
  3. Installation on Mac OS
  4. Your first mapreduce job

Make it easy: Linux…

Installation on Linux went smoothly. My setup is:

  • KUbuntu 12.04 LTS
  • R, installed from a CRAN repository; updated packages for my distribution can be installed from the RutteR PPA, as explained in this previous post.

From here, everything is straightforward:

  1. install the dependencies by starting R and running the command:
    install.packages(c("Rcpp","RJSONIO","bitops","digest","functional","reshape2","stringr","plyr","caTools"))

    or by using sudo apt-get install r-cran-*** for the packages *** that are available in the RutteR repositories (note that the caTools version in the RutteR repository was too old, so I had to install that package directly from CRAN);

  2. then go to this page, download the latest build of the rmr2 package (for me, it was rmr2 3.1.1) and run the following command (in a terminal, not in R):
    R CMD INSTALL rmr2_3.1.1.tar.gz
    which should work properly.

You can finally test your installation by starting R and running:

library(rmr2)
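
If library(rmr2) fails, it can help to see which packages are actually missing. The following is a minimal sketch using only base R; the helper name missing_packages is hypothetical (it is not part of rmr2), and the package list is simply the one from the install.packages() call above, plus rmr2 itself.

```r
# Dependencies listed in the install.packages() call above.
rmr2_deps <- c("Rcpp", "RJSONIO", "bitops", "digest", "functional",
               "reshape2", "stringr", "plyr", "caTools")

# Hypothetical helper: return the names in `pkgs` that are not installed
# in any library known to this R session.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(c(rmr2_deps, "rmr2"))
if (length(to_install) == 0) {
  message("All dependencies and rmr2 are installed.")
} else {
  message("Still missing: ", paste(to_install, collapse = ", "))
}
```

Anything reported as missing (other than rmr2) can then be passed to install.packages().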

Additionally, I provide a use case example to test mapreduce commands in R, at the end of this tutorial.

A nightmarish installation… Windows

Installation on Windows was a lot trickier¹, partly because I was using a virtual machine and also because… well… Windows (what else?). Luckily, thanks to this tutorial, you should be able to avoid most of the troubles I encountered.

64bit installation: easy!

If you are running R on 64-bit Windows (I used 64-bit Windows 7 with the latest R release at the time, 3.1.0), everything should be easy. You just have to:

  1. run the 64-bit version of R and install the dependencies:
    install.packages(c("Rcpp","RJSONIO","bitops","digest","functional","reshape2","stringr","plyr","caTools"))
  2. download the Windows build at this link and use the menu “Packages / Install package(s) from zip file” in R.
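
Step 1 asks for the 64-bit version of R. A quick way to confirm which version is actually running is sketched below, using base R only. The variable name r_bits is mine, and the architecture strings mentioned in the comments are the usual values, not an exhaustive list.

```r
# Pointer size in bytes: 8 on a 64-bit build of R, 4 on a 32-bit build.
r_bits <- 8 * .Machine$sizeof.pointer

# R.version$arch is typically "x86_64" for 64-bit R and "i386" for 32-bit R.
message("R architecture: ", R.version$arch, " (", r_bits, "-bit)")

if (r_bits != 64) {
  warning("This R session is not 64-bit; a 64-bit build of rmr2 is unlikely to load.")
}
```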

I hear you say: “easy! So why are you so grumpy about Windows?” Because I did not want to spoil my computer by installing the OS on it, so I used a Windows virtual machine through VirtualBox. My original VM was a 32-bit Windows (see the next section), on which I was not able to install rmr2. While trying to set up a new 64-bit Windows VM, I got a message telling me that VT-x was not enabled on my system, and I was not able to install Windows in VirtualBox. More precisely, when starting the installation of Windows, I got the following error:

VT-x/AMD-V hardware acceleration has been enabled,
 but is not operational. Your 64-bit guest will
 fail to detect a 64-bit CPU and will not be able
 to boot. Please ensure that you have enabled
 VT-x/AMD-V properly in the BIOS of your host
 computer.

It took me a while to figure out that I had to restart my Ubuntu OS (the host OS, on which VirtualBox is running) and enter my computer’s BIOS (for me, by pressing Esc before startup and then F10). There, look for an option (checkbox) called “Virtualization (VT-x)” and tick it before starting your computer again: this will allow you to install a 64-bit OS in VirtualBox.

32bit installation: forget it!

If you have a 32-bit Windows installation, you should probably forget about installing rmr2. The main reason is that the prebuilt package available on this page is a 64-bit build. If you install it on 32-bit Windows (I personally had a 32-bit Windows XP installed as a virtual machine with VirtualBox), the installation itself will probably succeed, but you will certainly get the following error message when trying to load the package into R:

Error: package ‘rmr2’ is not installed for 'arch = i386'

… “and that simple message stopped you?”, you may be wondering… hell no! Of course, I tried to install it from source using the following steps:

  1. I first installed a proper build environment using Christophe Genolini’s tutorial (see section “Configuration de votre ordinateur”; yes, sorry, it’s in French…), which explains how to compile an R package on Windows (even if why anybody would want to build an R package on Windows was long a mystery to me…);
  2. because I still had an error saying g++: unknown command, and because the g++ compiler proposed in Christophe Genolini’s tutorial still led to a compilation error, I also followed this post to install g++. As the Path environment variable would not update using the instructions given there, I set it by adding the following line to the Rpath file (Rpath is the script that loads the different programs needed to compile the package, as explained in Christophe Genolini’s tutorial):
    set Path=%PATH%;C:\cygnus\cygwin-b20\H-i586-cygwin32\bin

  3. I downloaded the pkg directory from the rmr2 GitHub repository (for that, I used git on Linux);
  4. and I finally used the standard command lines to build and install an R package on Windows:
    C:\Rpath
    R CMD build pkg
    R CMD INSTALL rmr2_3.2.0.tar.gz

    which ended with:

    * installing to library 'C:/Program Files/R/R-3.0.2/library'
    * installing *source* package 'rmr2' ...
    ** libs
    cygwin warning:
      MS-DOS style path detected: C:/PROGRA~1/R/R-3.0.2/etc/i386/Makeconf
      Preferred POSIX equivalent is: /cygdrive/c/PROGRA~1/R/R-3.0.2/etc/i386/Makeconf
      CYGWIN environment variable option "nodosfilewarning" turns off this warning.
      Consult the user's guide for more details about POSIX paths:
        http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
    g++ -m32 -I"C:/PROGRA~1/R/R-3.0.2/include" -DNDEBUG     -I"d:/RCompile/CRANpkg/extralibs64/local/include"  `C:/PROGRA~1/R/R-3.0.2/bin/Rscript -e "Rcpp:::CxxFlags()"`   -O2 -Wall  -mtune=core2 -c extras.cpp -o extras.o
    In file included from C:\PROGRA~1\R\R-30~1.2\library\Rcpp\include\Rcpp.h:27,
                     from extras.h:18,
                     from extras.cpp:15:
    C:\PROGRA~1\R\R-30~1.2\library\Rcpp\include\RcppCommon.h:64: sstream: No such file or directory
    In file included from C:\PROGRA~1\R\R-30~1.2\library\Rcpp\include\Rcpp.h:27,
                     from extras.h:18,
                     from extras.cpp:15:
    C:\PROGRA~1\R\R-30~1.2\library\Rcpp\include\RcppCommon.h:76: limits: No such file or directory
    In file included from C:\PROGRA~1\R\R-30~1.2\library\Rcpp\include\RcppCommon.h:185,
                     from C:\PROGRA~1\R\R-30~1.2\library\Rcpp\include\Rcpp.h:27,
                     from extras.h:18,
                     from extras.cpp:15:
    C:\PROGRA~1\R\R-30~1.2\library\Rcpp\include\Rcpp/iostream/Rstreambuf.h:26: streambuf: No such file or directory
    In file included from C:\PROGRA~1\R\R-30~1.2\library\Rcpp\include\Rcpp/sugar/sugar.h:28,
                     from C:\PROGRA~1\R\R-30~1.2\library\Rcpp\include\Rcpp.h:68,
                     from extras.h:18,
                     from extras.cpp:15:
    C:\PROGRA~1\R\R-30~1.2\library\Rcpp\include\Rcpp/hash/hash.h:25: inttypes.h: No such file or directory
    make: *** [extras.o] Error 1
    ERROR: compilation failed for package 'rmr2'
    * removing 'C:/Program Files/R/R-3.0.2/library/rmr2'
    * restoring previous 'C:/Program Files/R/R-3.0.2/library/rmr2'

    That’s where I stopped. If anybody has a hint, I would be happy to take it.

Mac OS: no clue

Installation on Mac OS is almost as easy as installation on Linux (thank you Elise for sending me the command line):

  1. install the dependencies by starting R and running the command:
    install.packages(c("Rcpp","RJSONIO","bitops","digest","functional","reshape2","stringr","plyr","caTools"))
  2. then go to this page, download the latest build of the rmr2 package (for me, it was rmr2 3.1.1), start R and run the command:
    install.packages("rmr2_3.1.1.tar.gz", repos=NULL, type="source")
    after having set the working directory to the directory in which the archive has been downloaded.
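
If you would rather not change the working directory, the same install.packages() call also accepts a full path to the archive. Here is a small sketch of that variant; the helper name install_local_tarball and the example path are hypothetical, so adjust them to your own setup.

```r
# Hypothetical helper: install a source tarball given its full path,
# failing early with a readable message if the file is not there.
install_local_tarball <- function(path) {
  if (!file.exists(path)) {
    stop("Archive not found: ", path)
  }
  # repos = NULL tells install.packages() to treat `path` as a local file.
  install.packages(normalizePath(path), repos = NULL, type = "source")
}

# Example call (the path is an assumption; use your own download folder):
# install_local_tarball("~/Downloads/rmr2_3.1.1.tar.gz")
```

The early file.exists() check is only there to turn a mistyped path into an explicit error before the installation starts.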

You can finally test your installation by starting R and running:

library(rmr2)

Does it work?

The final step is to check that the installation was successful by running a mapreduce job. Start R and do not forget to tell R that you actually don’t have Hadoop installed (or it will complain when trying to run a mapreduce job):

library(rmr2)
rmr.options(backend="local")

Then, you can run the following commands:

# send group IDs (randomly generated from a binomial) to the Hadoop filesystem
groups <- rbinom(n = 50, size = 32, prob = 0.4)
groups <- to.dfs(groups)
# run a mapreduce job
## map: the key is the group ID, the value is 1
## reduce: count the number of observations in each group
## then, retrieve the result from the Hadoop filesystem
output <- from.dfs(mapreduce(input = groups,
                             map = function(., v) keyval(v, 1),
                             reduce = function(k, vv) keyval(k, length(vv))))
# print results
## keys: group IDs
## values: results of the reduce job (i.e., frequencies)
data.frame(key = keys(output), val = values(output))
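
To see what this job computes, here is a rough base R equivalent of the same map and reduce steps, with no rmr2 involved. It is only an illustration of the logic, not how rmr2 works internally. The seed is an arbitrary choice of mine to make the draw reproducible, and the rows may not come out in the same order as in the rmr2 output.

```r
set.seed(1)  # arbitrary seed, only to make the random draw reproducible
groups <- rbinom(n = 50, size = 32, prob = 0.4)

# "map": emit the pair (group ID, 1) for every observation
mapped <- data.frame(key = groups, val = 1)

# "reduce": for each key, count how many values were emitted
counts <- tapply(mapped$val, mapped$key, length)

result <- data.frame(key = as.integer(names(counts)),
                     val = as.vector(counts))
print(result)
```

The val column sums to 50, the number of observations, which is a quick sanity check that also applies to the rmr2 output.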

Further examples of mapreduce jobs can be found on the official RHadoop wiki.


¹ Actually, a real p*** in the a***: it took me half a day, far too much for what this OS deserves, but since a bunch of people, and especially students, seem to be using it, I suppose it was worth the effort…