A Parallel Method for Analytical Processing
August 26th, 2008
I’ve been doing some interesting work with R and WPS the last few days, and I want to share a little bit about it. I developed the Bridge to R a few months back and I’m basically satisfied with how it works. In the next release I will be introducing some new features to make things a bit easier to use, and I will clean up the documentation. One of my interests, however, is how to maximize CPU usage with WPS and R so that I can complete statistical tasks more quickly.
What I have running on my development machine is a system written in WPS, Delphi and R (I’m doing most of my SAS language development in WPS now due to the better IDE). This system lets you launch multiple R programs and run them in parallel. After the last R program completes, WPS grabs the R listing and log files, writes them to the appropriate windows and continues processing WPS statements. It’s nice and tidy, and it’s fun to watch the CPU meter bounce around like crazy on a CPU-intensive set of programs.
At this point, I let the operating system decide which CPU runs which job, but I might change that. I am able to force R to use a specific core or logical processor, and since R is not multi-threaded (i.e. a single R process doesn’t use more than one logical CPU), it might benefit the user to specify in code which processor each job should use. I’m undecided on this aspect, though, because I suspect the OS (Vista in this case) can balance the workload more efficiently than I or a user can.
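To make the idea of tying a job to one logical processor concrete, here is a minimal sketch in Python. It is not how my system does it: the helper name pin_to_core is hypothetical, and the os.sched_setaffinity call it relies on exists only on some Unix platforms such as Linux, not on Vista, where a different mechanism would be needed. On platforms without that call, the helper simply reports failure.

```python
import os


def pin_to_core(core, pid=0):
    """Try to restrict a process to one logical CPU; pid 0 means the calling process.

    Returns True only if the process ends up bound to exactly that core.
    (Hypothetical helper for illustration; not part of WPS or the Bridge to R.)
    """
    if not hasattr(os, "sched_setaffinity"):
        return False  # this platform's os module has no affinity call (e.g. Windows)
    try:
        if core not in os.sched_getaffinity(pid):
            return False  # the requested core is not available to this process
        os.sched_setaffinity(pid, {core})
        # Re-read the mask so we only report success if the change took effect.
        return os.sched_getaffinity(pid) == {core}
    except OSError:
        return False
```

A launcher could call something like this once per child process right after starting it, or skip the call entirely and leave the placement to the OS scheduler, which is the choice I am leaning towards.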
The calling conventions I’m using are straightforward too. Consider the macro calls below:
%RppStart(machineName, Mode);
%Rstart();
R code here…
%Rstop;
%RppStop;
Where
%RppStart - is the macro that opens a block of R programs; the name stands for “R Parallel Program Start”.
machineName - currently, the name is always the local machine name. This could be expanded later to allow other machines on the network to execute the R code.
Mode - takes a value of P for parallel or S for sequential processing.
%RppStop - is the macro that pulls together all the output from the completed R programs and writes information to the proper WPS windows.
Say you want to run three R programs in parallel: a cluster analysis, a factor analysis, and a logistic regression. The complete sequence of calls would look like:
%RppStart(Phil-Vista,P);
%Rstart(Auto,MySASdataSetName,NoGraphWindow);
R code for cluster here…
%Rstop;
%Rstart(Auto,MySASdataSetName,NoGraphWindow);
R code for factor here…
%Rstop;
%Rstart(Auto,MySASdataSetName,NoGraphWindow);
R code for logistic here…
%Rstop;
%RppStop;
All three programs above would be launched almost simultaneously and execute at the same time. Any code after %RppStop would not run until all the programs in the %RppStart / %RppStop block had completed.
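For readers who want to see the launch-then-wait pattern outside of WPS, here is a rough sketch in Python. It is only an illustration of the idea, not the actual implementation (which is written in WPS, Delphi and R). The function name run_block and the stand-in jobs are assumptions; each job here is a small Python child process so the example runs without R installed, where a real job might be a command such as Rscript followed by a script name.

```python
import subprocess
import sys


def run_block(commands, mode="P"):
    """Run each command as a child process and return their captured stdout, in order.

    mode "P" starts every child before waiting on any of them;
    mode "S" runs the children one at a time.
    (Hypothetical helper that mimics the %RppStart / %RppStop block.)
    """
    if mode not in ("P", "S"):
        raise ValueError("mode must be 'P' or 'S'")
    if mode == "S":
        return [
            subprocess.run(cmd, capture_output=True, text=True).stdout
            for cmd in commands
        ]
    # Parallel: launch all children first so they run at the same time.
    children = [
        subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
        for cmd in commands
    ]
    # communicate() blocks until that child exits, so this list is complete
    # only after the last child has finished (the role %RppStop plays).
    return [child.communicate()[0] for child in children]


# Stand-in jobs; each just prints its own name.
jobs = [
    [sys.executable, "-c", "print('cluster')"],
    [sys.executable, "-c", "print('factor')"],
    [sys.executable, "-c", "print('logistic')"],
]

if __name__ == "__main__":
    for output in run_block(jobs, mode="P"):
        print(output, end="")
```

Nothing after the call to run_block executes until every child has exited, which is the same guarantee the %RppStart / %RppStop block gives. Reading the pipes one child at a time is fine for small listings; jobs that produce a lot of output would be better off writing to their own log files.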
One interesting use for this method is running CPU-intensive programs such as forecasting models, which really don’t take a lot of storage but often require testing numerous methods to obtain a best fit on literally hundreds of products. As desktop computers get more powerful, with more cores for processing, such analysis will become more common. On the server side, we see Intel announcing six-core CPUs and multiple-socket motherboards, so enterprise analytics will become quicker and more affordable too.