Over the last three or four weeks, as time has permitted, I’ve been working on making the Bridge to R for WPS more powerful. For those who are unaware of the Bridge to R, it’s a set of macros and programs that allows a WPS user to access the R statistics package from within WPS. WPS is an affordable SAS/Base alternative available from World Programming Ltd. The original version that I wrote allows you to run R routines in a sequential stream and automatically grab the R log and listing files and route them back into WPS. The next version, which I’ve been working on and testing, adds to the Bridge’s functionality by allowing you to run multiple R processes in parallel.
Being able to launch multiple R processes from within WPS is relatively easy. The hard part has been capturing the log and list files when all of the processes complete. In theory, this shouldn’t be too hard, but in reality it’s quite difficult with R. Writing that last sentence reminded me of one of my graduate school professors, who was fond of saying, “If it works in theory but not in reality, then obviously you have bad theory.” Here’s the problem: R doesn’t lock its log or lst files during its session. It just opens, appends to, and closes the files as it needs. So testing whether a file is locked (or in use) is of little value here. I also tried setting up semaphores to signal that a job is complete. That doesn’t work too well either. I discovered that even when the job is complete and I’ve set the semaphore to indicate that state, the log and lst files may still be incomplete because they have been cached by the OS and have yet to be completely written to disk.
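Since neither file locks nor semaphores are reliable here, one workable approach is to treat a log or lst file as finished only once it exists and has stopped growing for a while. A minimal sketch of that idea in Python (the function name and thresholds are my own for illustration; the Bridge itself is SAS macro code plus external helper programs):

```python
import os
import time

def wait_for_stable_file(path, quiet_secs=2.0, timeout=300.0, poll=0.5):
    """Wait until `path` exists and its size has stopped changing for
    `quiet_secs` seconds, i.e. the writer (and the OS file cache)
    appears to be done. Returns True on success, False on timeout."""
    deadline = time.time() + timeout
    last_size, stable_since = -1, None
    while time.time() < deadline:
        if os.path.exists(path):
            size = os.path.getsize(path)
            if size == last_size and size > 0:
                # size unchanged since last poll: start/continue the
                # quiet-period clock
                if stable_since is None:
                    stable_since = time.time()
                elif time.time() - stable_since >= quiet_secs:
                    return True
            else:
                # file appeared or grew: reset the quiet-period clock
                last_size, stable_since = size, None
        time.sleep(poll)
    return False
```

This is heuristic by nature: it trades a short fixed delay per file for confidence that the OS has finished flushing what R wrote.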
So, long story short, I’ve totally re-architected the Bridge to R so that running parallel R jobs works properly and you get the results you expect. This is about 1,600 lines of SAS macro code, easily the densest SAS code I’ve ever written. That doesn’t include the external programs I had to write to check whether a file exists or is in use, and to run the jobs in the background so you don’t see them splashing all over your screen.
To provide the ability to run multiple R jobs in parallel, I’ve had to introduce two new macros. They are only needed if you want to run the jobs in parallel; you can continue to use your original code unchanged. You simply prefix your R job stream with the macro %RexecMode(NOWAIT/WAIT) and append the macro %WaitForR to the end of the job stream. For those who use SAS/CONNECT’s MP CONNECT processing, the WAITFOR paradigm will be familiar.
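In other words, an existing sequential job stream only needs two extra lines to run in parallel. A sketch of the pattern (only %RexecMode and %WaitForR are the new macros described here; the comments are mine):

```sas
%RexecMode(NOWAIT);   /* queue each R job instead of waiting for it */

/* ... your existing RStart/RStop blocks go here, unchanged ... */

%WaitForR;            /* wait for every R job to finish, then read
                         the log and lst files back into WPS        */
```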
The %RexecMode() macro sets up all the parameters needed to run the R tasks in parallel. The %WaitForR macro watches for the completion of all the R jobs and reads in the log and lst files once the R tasks have finished running.
I still have to decide how many parallel processes to allow in the system; the upper limit is hard coded. The machine I’ve been developing and running the Bridge to R on is a quad-core with 8 GB of RAM and decent I/O for the most part. Interestingly, I can usually run six to seven parallel R processes, and once in a while I can get eight going. It’s highly dependent on the size of the data sets (the amount of I/O) and on how heavily the CPU is utilized, which controls whether the Bridge can spawn another R process. It’s not hard to peg all four cores at 100% with this software.
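The throttling idea can be sketched as a worker pool capped at the hard-coded limit and the logical CPU count. Python is used here only to illustrate the concept; MAX_R_PROCESSES and run_r_job are hypothetical names, and the real Bridge launches R through external helper programs:

```python
import os
from concurrent.futures import ThreadPoolExecutor

MAX_R_PROCESSES = 8  # hard-coded upper limit, as in the Bridge

def run_r_job(script_path):
    # The real Bridge would launch an R batch session in the background
    # here; returning the path keeps this sketch self-contained.
    return script_path

def run_parallel(scripts, limit=MAX_R_PROCESSES):
    # Cap concurrency at the smaller of the hard limit and the logical
    # CPU count: extra R sessions mostly add I/O and CPU contention.
    workers = max(1, min(limit, os.cpu_count() or 1, len(scripts)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_r_job, scripts))
```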
Based on my testing over the last few days, this is what you can expect in terms of performance improvements running R sequentially versus running in parallel. These tests were with data sets of 3,000,000 records and five variables. Here is the R code I used in each RStart/RStop block:
ptm <- proc.time()                 # start the timer
t.test(score1, score3, paired=T)   # paired t-tests on variables from the
t.test(score2, score4, paired=T)   #   data set the Bridge exports to R
proc.time() - ptm                  # elapsed time for the two tests
Note that each of the eight RStart/RStop blocks had its own data set. Hence, the first job used data set measures1, the second job used measures2, and so on.
Running these eight jobs sequentially took 10:42 (mm:ss). Running them in parallel and creating the corresponding CSV files took 4:19 (mm:ss). Running the eight jobs against existing CSV files (i.e., not creating the files on the fly) took 2:57 (mm:ss). So basically, I was able to reduce my real-world execution time by more than two thirds.
Time   # Tasks  Description
10:42  8        Run jobs sequentially
04:19  8        Run in parallel (create CSV files for R processing)
02:57  8        Run in parallel (CSV files already exist)
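To sanity-check the “more than two thirds” claim, here is the arithmetic on those timings (Python used purely as a calculator):

```python
def mmss_to_secs(t):
    # convert an "mm:ss" string to total seconds
    m, s = t.split(":")
    return int(m) * 60 + int(s)

sequential   = mmss_to_secs("10:42")  # 642 s
parallel_new = mmss_to_secs("04:19")  # 259 s, CSV files created on the fly
parallel_pre = mmss_to_secs("02:57")  # 177 s, CSV files already exist

reduction = 1 - parallel_pre / sequential  # about 0.72, i.e. > 2/3
speedup = sequential / parallel_pre        # about 3.6x
```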
There’s definitely room for improvement in the numbers above. The Bridge to R still has a fair amount of code that can be stripped out; there are lots of little checks in there that I needed while developing. The other thing that could help considerably is an ODBC driver for WPS data sets. R can read and write data using ODBC, and currently the Bridge requires that the data be transformed to a CSV file for processing by R.
So, back to the question: how many parallel R jobs would be considered an adequate number? With hardware like the Cray CX1, you can have eight blades with two quad-core Xeons per blade for a total of 128 logical CPUs. If you figure two jobs per logical CPU, then an upper limit of 256 parallel tasks seems to be the high end of the spectrum. The problem, though, is that there is no 64-bit version of R that runs on Windows at this time, so you are still memory constrained by a 32-bit system. I have seen some comments about a possible port of R to the 64-bit Windows platform, so perhaps this might be solvable sooner rather than later. Revolution Computing has been working on this, and comments from the company can be found at: http://tinyurl.com/5dxt35
Here is what I see happening next year with regard to SAS/WPS software. R will be 64-bit, and you will be able to run R jobs either sequentially or in parallel on your desktop. The ability to address large memory segments under 64-bit Windows 2008 or Vista 64 using R will no longer be a confining factor for most analysts who create statistical models. WPS will be the engine to access, organize, summarize, and report data, and R will be the platform for advanced analytics. Once R is running on a 64-bit Windows platform, the default language for statistical modeling will be R rather than SAS and its myriad mix of expensive products, due to affordability and the sheer number of platforms R runs on. Finally, the Bridge to R will make it very cost effective for companies that want to create vertical-market applications to use WPS and R in lieu of the SAS product line.