Category Archives: Bridge to R

That damn blog!

I spent some time over the last weekend trying to update some aspects of my blog. I updated the version of WordPress I was using from 2.1.x to 2.7.x and the upgrade didn’t go so well. I lost all the categories for each of my blog postings and couldn’t recover them.

So, over the next few weeks, I’ll be going back and adding in the categories for each posting. Sorry about the inconvenience; it wasn’t something that I had expected.

I’ll be putting together some information and possibly a video that shows how to use ODBC to connect to an external database using the Bridge to R for WPS. I know a lot of you are curious about how to do this so I’ll give it “that old college effort.”

I had an interesting discussion yesterday with a client at a large Fortune 500 insurance company. They are interested in the Bridge to R for SAS mainly because a few new analysts have come on board who prefer R for their analytics. They want to be able to run the R code from within SAS in case the R folks are laid off or decide to leave.

But what surprised me the most was the statement that they have to find a vendor they can buy R from because they require some type of support for any software they acquire. So I guess there’s a market out there for R support and maintenance agreements from businesses.

One company that has an interesting web site is Revolution Computing (http://www.revolution-computing.com/) and they have such a product. However, there’s no damn phone number to contact anyone and they have not replied to the emails I sent through their “Contact Us” page asking how I can license their product and what it costs. Edit: I was sent an email pointing out that they do have a phone number listed on the “about us” page. My screw up!

I’m sure there are other companies out there as well, but there is an interest in R on the Analyst Desktop and I suspect it will continue to grow as time goes on.

Clearing up any confusion…

A couple of questions I was asked about the Bridge to R are probably something I should answer publicly.

1. The Bridge to R is not open source or available for free. It’s affordably priced at $50 a desktop. If you need a site license, contact me and we can talk about your needs and see what we can do for you.

2. The Bridge to R can run R jobs in parallel, and the number of jobs it can execute simultaneously depends on your hardware. The number of cores and your hard drive array throughput both play a role in how many jobs can run at the same time. On my development machine, which is a quad-core with 8GB of RAM and a fast RAID-0 work area, I can run between six and eight R jobs in parallel.

Also, if you want to include the Bridge to R in your application for resale, we will aggressively price our product so you can include it in your vertical market application and won’t have to worry about licensing cost putting you out of the market. With the ability to run R jobs in parallel, companies can now start building forecasting systems and other applications that require greater throughput than what was achievable when running in sequential mode.

Version 2.2 of The Bridge to R now Available

So I worked hard this week and finally have put a trial version (Version 2.2) of the Bridge to R for WPS Users out on the MineQuest website. The Bridge to R not only offers you access to advanced statistical computing using R, but Version 2.2 includes the ability to run R jobs in parallel that can radically improve your statistical computing efficiencies.

I’ve updated the documentation for the Bridge and have included some additional sample programs that illustrate how easily you can run R jobs from within the WPS Workbench. I’ve tested this quite a bit over the last few weeks, and the Bridge catches a lot of errors that stem from the user: for example, incorrect data set names, incorrect transport methods, etc.

There are a few things that I’m not crazy about with the Bridge, but they will have to wait until the developers at WPC fix a few things. First, the output from a PROC CONTENTS appears in the listing even though I use the noprint option. The second anomaly is the number of blank lines that appear in the log. This is caused by turning notes on and off so that the user isn’t flooded with hundreds of lines of notes generated by WPS. When I turn off notes, WPS simply issues a blank line instead of issuing no lines in the log. It still makes the log easier to read, but aesthetically it could be much better.

During the next month, until the end of the trial period, I will update the installation and the user guide document for clarity. I don’t expect to update the software unless the folks at WPC issue an ODBC driver or someone finds a major bug.

The trial version will run until 1/31/2009 and will not run after that. If you would like a copy, you can either purchase an annual license for $49.95 or get it for free when you license a copy of WPS through MineQuest. You can download a copy of the trial by going to http://www.minequest.com/downloads and selecting the files in the Bridge to R Macros section. Make sure you get both the installation guide and the zip file that contains the compiled macros.

WPS, R and High Performance Computing

Over the last three or four weeks, as time has permitted, I’ve been working at making the Bridge to R for WPS more powerful. For those who are unaware of the Bridge to R, it’s a set of macros and programs that allows a WPS user to access the R Statistics Package from within WPS. WPS is an affordable SAS/Base alternative that is available from World Programming Ltd. The original version that I wrote allows you to run R Software routines in a sequential stream, automatically grab the R log and listing files and route them back into WPS. The next version, which I’ve been working on and testing, adds to the Bridge’s functionality by allowing you to run multiple R processes in parallel.

Being able to launch multiple R processes from within WPS is relatively easy. The hard part has been capturing the log and lst files when all of the processes complete. In theory, this shouldn’t be too hard, but in reality it’s quite difficult with R. Writing that last sentence made me think of one of my graduate school professors who was fond of saying, “If it works in theory but not in reality, then obviously you have bad theory.”

Here’s the problem… R doesn’t lock its log or lst files during its session. It just opens, appends to and closes the files as it needs. So testing whether a file is locked (or in use) is of little value here. I’ve also tried setting up semaphores to let me know that a job is complete. That doesn’t work too well either. I’ve discovered that even if the job is complete and I’ve set the semaphore to indicate that state, the log and lst files may still be incomplete because they have been cached by the OS and have yet to be completely written.
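One common workaround for this kind of problem, sketched here in Python rather than SAS macro code, is to have each job write a marker as the very last line of its log and to treat the log as finished only once that marker is visible on disk. This is a minimal sketch under stated assumptions and not how the Bridge to R actually does it: the marker text, the function names and the polling interval are all hypothetical.

```python
import time
from pathlib import Path

# Hypothetical marker that each job would print as its final log line.
SENTINEL = "### JOB COMPLETE ###"


def log_is_complete(path, sentinel=SENTINEL):
    """Return True once the last non-empty line of the log is the sentinel."""
    p = Path(path)
    if not p.exists():
        return False
    lines = [ln for ln in p.read_text(errors="replace").splitlines() if ln.strip()]
    return bool(lines) and lines[-1].strip() == sentinel


def wait_for_log(path, timeout=60.0, poll=0.25):
    """Poll the log until it ends with the sentinel or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if log_is_complete(path):
            return True
        time.sleep(poll)
    # One last check so a write that landed during the final sleep is not missed.
    return log_is_complete(path)
```

Checking the file contents rather than a lock or a separate semaphore means the test only passes when the data itself has reached the file, which is the condition that actually matters when reading the log back in.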

So long story short, I’ve totally re-architected the Bridge to R so that running parallel R jobs works properly and you can get the results you expect. This is about 1600 lines of SAS macro code that is easily the most dense SAS code that I’ve ever written. That doesn’t include the external programs that I had to write to check if a file exists, is in use, and to run the jobs in the background so you don’t see it splashing all over your screen.

To provide the ability to run multiple R jobs in parallel, I’ve had to introduce two new macros. The new macros are only needed if you want to run the jobs in parallel. Your original code can stay unchanged; to run it in parallel, prefix your R job stream with the macro %RexecMode(NOWAIT/WAIT) and end the job stream with the macro %WaitForR. For those who use SAS/CONNECT’s MP CONNECT feature, the WAITFOR paradigm will be familiar.

The %RexecMode macro sets up all the parameters to run the R tasks in parallel. The %WaitForR macro watches for the completion of all the R jobs and reads in the log and lst files when the R tasks have finished running.
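The launch-then-wait pattern behind these two macros can be illustrated outside of SAS. The Python sketch below is only an analogy, not the Bridge’s implementation: `start_jobs` plays the NOWAIT role by starting every job without blocking, and `wait_for_jobs` plays the %WaitForR role by collecting each job’s output once it finishes. The function names are hypothetical, and `stand_in_job` uses a tiny Python child process in place of a real R job so the sketch runs without R installed.

```python
import subprocess
import sys


def start_jobs(commands):
    """Start every command without waiting for it to finish (the NOWAIT idea)."""
    return [
        subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
        for cmd in commands
    ]


def wait_for_jobs(procs):
    """Block until all jobs end; return (exit_code, output) in launch order."""
    results = []
    for proc in procs:
        out, _ = proc.communicate()  # waits for this job and reads its output
        results.append((proc.returncode, out))
    return results


def stand_in_job(n):
    """Hypothetical placeholder for an R job, e.g. ['Rscript', 'job1.R']."""
    return [sys.executable, "-c", f"print('job {n} done')"]
```

A call such as `wait_for_jobs(start_jobs([stand_in_job(i) for i in range(1, 9)]))` starts eight jobs at once and returns only after the last one has finished.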

I still have to make a decision on how many parallel processes to allow in the system. The upper limit is hard coded. The machine I’ve been developing and running the Bridge to R on is a quad-core with 8GB of RAM. It has decent I/O for the most part. Interestingly, I can usually run six to seven parallel R processes. Once in a while, I can get eight going. It’s highly dependent on the size of the data sets (amount of I/O) and on how heavily the CPU is utilized, which determines whether the Bridge can spawn another R process. It’s not hard to peg all four cores at 100% with this software.
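A hard-coded ceiling on concurrent jobs can be sketched with a bounded worker pool. This Python example is a simplified illustration and makes assumptions the post does not: the ceiling of 8 and the names `MAX_PARALLEL`, `run_job` and `run_with_cap` are hypothetical, and the cap here is fixed, whereas the Bridge also looks at CPU utilization before spawning another process.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 8  # hypothetical hard-coded ceiling on simultaneous jobs


def run_job(cmd):
    """Run one job to completion and return (exit_code, stdout)."""
    done = subprocess.run(cmd, capture_output=True, text=True)
    return done.returncode, done.stdout


def run_with_cap(commands, limit=MAX_PARALLEL):
    """Run all commands, never more than `limit` at once; results keep input order."""
    workers = max(1, min(limit, len(commands)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_job, commands))
```

With `limit=2` and five commands, two jobs start immediately and each remaining job starts as soon as a slot frees up, so the machine is never asked to run more than the ceiling allows.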

Based on my testing over the last few days, this is what you can expect in terms of performance improvements running R sequentially versus running in parallel. These tests were run with data sets of 3,000,000 records and five variables. Here is the R code I used in each %Rstart/%Rstop block:

%Rstart(CSV,measures1,NOGRAPHWINDOW);

cards;

ptm <- proc.time()

attach( &sastablenm_)

t.test(score1, score3, paired=T)

t.test(score2, score4, paired=T)

proc.time() - ptm

%Rstop;

Note that each of the eight %Rstart/%Rstop blocks had its own data set. Hence, the first job had data set measures1, the second job had measures2, etc.

To run these eight jobs sequentially, it took 10:42 (mm:ss). Running them in parallel and creating the corresponding CSV files took 4:19 (mm:ss). Running the eight jobs and using the existing CSV files (i.e. not creating the files on the fly) took 2:57 (mm:ss). So with the CSV files already in place, I was able to reduce my real world execution time by more than two thirds (about 72%); even when creating the CSV files on the fly, the reduction was about 60%.

Time    # Tasks   Description
10:42   8         Run jobs sequentially
04:19   8         Run in parallel (create CSV files for R processing)
02:57   8         Run in parallel (CSV files already exist)
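For anyone who wants to check the arithmetic, the short Python sketch below converts the mm:ss timings above to seconds and works out the speedup and the percentage of time saved against the sequential run. The function names are just illustrative.

```python
def to_seconds(mmss):
    """Convert an 'mm:ss' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def compare(baseline, candidate):
    """Return (speedup factor, percent of time saved) versus the baseline."""
    base, cand = to_seconds(baseline), to_seconds(candidate)
    return base / cand, 100.0 * (base - cand) / base


for label, timing in [("parallel, creating CSV", "4:19"), ("parallel, CSV exists", "2:57")]:
    speedup, saved = compare("10:42", timing)
    print(f"{label}: {speedup:.2f}x faster, {saved:.1f}% less time")

# parallel, creating CSV: 2.48x faster, 59.7% less time
# parallel, CSV exists: 3.63x faster, 72.4% less time
```

The gap between the two parallel runs (259 versus 177 seconds) is the cost of writing the CSV files, which is why removing that step matters.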

There’s definitely room for improvement in the numbers above. The Bridge to R still has a fair amount of code that can be stripped out. There are lots of little checks in there that I needed while developing. The other thing that could help considerably is getting an ODBC driver for the WPS data set. R can read and write data using ODBC, but currently the Bridge requires that the data be transformed to a CSV file for processing by R.

So back to the question, how many parallel R jobs would be considered an adequate number? With hardware like the Cray CX1, you can have eight blades with two quad-core Xeons per blade, which works out to 64 physical cores (8 × 2 × 4), or 128 logical CPUs if each core presents two hardware threads. If you consider two jobs per LCPU, then an upper limit of 256 parallel tasks seems to be the high end of the spectrum. The problem though is that there is not a 64-bit version of R that runs on Windows at this time, so you are still memory constrained by a 32-bit system. I have seen some comments about a possible port of R to the Windows 64-bit platform, so perhaps this might be solved sooner rather than later. Revolution Computing has been working on this and comments from the company can be found at: http://tinyurl.com/5dxt35

This is what I see happening next year with regard to SAS/WPS software. R will be 64-bit and you will be able to run R jobs, either sequentially or in parallel, on your desktop. With R able to address large memory segments under Windows 2008 in 64-bit mode or Vista 64, memory will no longer be a confining factor for most analysts who create statistical models. WPS will be the engine to access, organize, summarize and report data and R will be the platform for advanced analytics. Once R is running on a Windows 64-bit platform, the default language for statistical modeling will be R and not SAS and its myriad mix of expensive products. This will be due to affordability and portability in light of the sheer number of platforms R runs on. Finally, the Bridge to R will make it very cost effective for companies who want to create vertical market applications to use WPS and R in lieu of the SAS product line.