Category Archives: RSTATS

Thoughts on Mapping and Geocoding in WPS

I’ve been working on a framework for a set of new macros to be included in the Bridge to R that I think will be very useful for many WPS users. Coming from a background in Demography, I’ve always been partial to maps and charts. There’s a plethora of open source products out there, as well as APIs from Google, Bing, etc., that allow a user to create some pretty darn nice maps.

I recently became aware of a new R library by Hadley Wickham and David Kahle called ggmap. Professor Wickham has created some phenomenal software for the open source R system. Hadley created ggplot2, which is truly the standard for graphics in the R world. He has also written a book on ggplot2 that is well worth purchasing and can be found on Amazon.

For most of us in the WPS world, we are rather limited to the native graphics available in the product. That was mostly overcome with the Bridge to R that we created a few years ago. So there are two things that I see as important at this juncture that need to be addressed: geocoding and mapping.

First I want to discuss geocoding. Geocoding has always been this strange process that is 10x more convoluted than it really needs to be. Most of us want to provide an address from our data set and get back the latitude and longitude for that address. A much smaller group of users wants to enhance their data with latitude and longitude as well as zip code, etc. Either way, using external services from commercial companies to do such a thing is often expensive. This is especially true for smaller data sets, where there is a standard fee plus so much per name.

The second aspect worth discussing is the availability of mapping software and the associated cost. Some of these programs are expensive to say the least. Unless you intend to make a career out of creating a map (and I am not) then other alternatives need to be looked at to keep costs down.

So, long story short, it’s worth investigating interfacing with open source and cost-free solutions for map making using WPS. I have briefly looked at Google, Bing and OpenStreetMap. There are pros and cons to each one of them, but I want easy and nice looking, and that’s the driver behind my development.

Professors Wickham and Kahle have done a lot in this area trying to address the shortcomings of R for mapping. I could not approach their creative genius and determination in creating ggmap, but I can create an interface that makes using ggmap easier for WPS users. So, my summer adventure is to create a clean interface using WPS and the Bridge to R so that WPS users will have some extraordinary maps that they can create using ggmap.

About the author: Phil Rack is President of MineQuest Business Analytics, LLC located in Grand Rapids, Michigan. Phil has been a SAS language developer for more than 25 years. MineQuest provides WPS and SAS consulting and contract programming services and is an authorized reseller of WPS in North America.

Another View of R and Big Data

I was reading a blog entry the other day that just blew me away. Librestats has a blog entry entitled, “R at 12,000 Cores” and it is a very good (and fun) read. It’s amazing what can be done by the open source advocates and this article is a great example of that.

After reading the article, I can’t help but think about the relationship between extremely large data, server size (both CPUs and RAM) and how fast data is growing. There has to be a way to crunch through the amount of data that is piling up and this article addresses that issue.

I believe you will begin seeing vendors embrace R more openly, mainly because they have to embrace it. There aren’t any companies that can develop code at the breakneck pace at which the R community is putting out packages. Modeling data the way the above article describes is truly amazing, cost effective and state of the art.

Even small companies can make use of multiple servers with dozens of cores and lots of RAM rather inexpensively. Using Linux and R on a set of servers, an organization can have a hundred cores at its disposal for crunching data without paying very much in licensing fees.

I have been giving some thought to making the Bridge to R run in parallel on a single server as well as across a set of servers using WPS and pbdR or Rmpi. This way, WPS would handle the management between the servers and the data transparently and provide for number crunching at very low cost. God knows we have a few extra multiple core servers lying around here so it may be an interesting adventure to give this a spin!

My first thought and intention is to make the code backward compatible. Perhaps just add a macro that can be called that contains the information needed to implement running R across cores and on a grid. It could be something as simple as:

%Rconfig(RconfigFile=xyz, RunInParallel=True||False);

The remaining statements in the Bridge to R would continue as they are and the R code would be pushed to the servers based on the information in the RconfigFile. WPS would still collect the output from these jobs and route the appropriate information to the log and listing window as well as the graphics to the graphics viewing window (wrapped in HTML) for users to view their output.
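To give a feel for what the R side of a parallel run might look like on a single server, here is a minimal sketch. It uses the parallel package that ships with base R rather than pbdR or Rmpi, and the boot_mean function and the two-worker cluster are purely illustrative assumptions, not anything that exists in the Bridge to R today.

```r
library(parallel)  # ships with base R; stands in here for pbdR or Rmpi

# Hypothetical workload: one bootstrap estimate of the mean per task
boot_mean <- function(seed, x) {
  set.seed(seed)
  mean(sample(x, replace = TRUE))
}

x <- rnorm(10000)

# Start two local workers; on a grid, host names would come from a config file
cl <- makeCluster(2)
res <- parLapply(cl, 1:8, boot_mean, x = x)
stopCluster(cl)

# Collect the eight estimates into a plain numeric vector
boot_means <- unlist(res)
print(length(boot_means))
```

In a grid setup, the list of hosts passed to makeCluster would presumably be read from something like the RconfigFile mentioned above.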

 

Is R Worthy of the Enterprise?

I’ve been a big proponent of R for the last few years and have written extensively about R on this blog as well. A lot of folks have written that R is worthy of being in the Enterprise, and I have to say, at this point I’m just not so sure of that.

My gripe with R is just how slow it seems to be at performing the basics such as descriptive statistics and frequency tables. When you compare the timings for these procedures against WPS or SAS using moderately sized data sets (e.g. 500,000 records), R is left in the dust.

What really caused my reversal in thought towards R is that I started to test the R library sas7bdat to read a SAS version 7 data set. I thought it might be a nice addition to the Bridge to R to be able to read a SAS data set directly. As I got into testing the library for performance issues, I was a little surprised by what I discovered. Just reading in a SAS v7 data set that has five variables and 500,000 observations (or records) to perform a simple t-test, WPS was up to 18 times faster. The larger the data set, the faster WPS was relative to R.
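For readers who want to try this kind of comparison themselves, here is a rough sketch of how the R side could be timed. The data set is synthetic (five made-up variables, 500,000 rows) and the sketch only times the in-memory steps, so it is not a reproduction of the test described above, which also included reading the SAS data set from disk.

```r
set.seed(42)
n <- 500000

# Synthetic stand-in for a five-variable, 500,000-row data set
dat <- data.frame(
  id    = seq_len(n),
  group = sample(c("A", "B"), n, replace = TRUE),
  x1    = rnorm(n),
  x2    = rnorm(n),
  x3    = runif(n)
)

# Time descriptive statistics, a frequency table and a two-sample t-test
t_desc <- system.time(desc <- summary(dat[, c("x1", "x2", "x3")]))
t_freq <- system.time(freq <- table(dat$group))
t_test <- system.time(tt <- t.test(x1 ~ group, data = dat))

# Elapsed wall-clock seconds for each step
timings <- c(descriptives = t_desc[["elapsed"]],
             frequencies  = t_freq[["elapsed"]],
             t_test       = t_test[["elapsed"]])
print(timings)
```

Timings will vary by machine and R version, so any comparison against WPS or SAS should be run on the same hardware with the same data.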

I have always heard that R is supposed to be fast because the data frame is held in memory. I also think it has its place in education for learning statistics and data analysis. But the corporate world is another story. Using WPS, I can often blow R out of the water in terms of performance and this is with reading the WPS data set from the hard drive AND performing the computations.

Personally, I think the strength of R is in the development of algorithms for models and graphics. ggplot2 is absolutely awesome and allows you to create some amazing graphs. But for running production jobs, especially time critical jobs, using WPS for the models when appropriate is a much better solution to the problem.

Don’t forget there’s still time to get into the action to win a Google Nexus 7 Tablet. If you register to take out a WPS evaluation before September 30th, 2012, you will automatically be registered in the drawing for the tablet. Certain conditions apply so read the earlier blog post for all the details. You can request a WPS evaluation by going to the MineQuest Business Analytics website at the WPS evaluation page.

Bridge to R Demo Reel

Finally making some headway on a short demo reel for the Bridge to R. We decided against narrating anything and just included some pop-ups and pointers to show what is happening in the video.

The video shows a few things that I believe are important. First, the Bridge to R handles R statements and gets the text listing and log back into the Eclipse Workbench using WPS. Second, in addition to handling the text side of things, the Bridge can handle graphs, plots and maps. Third, and quite important, the Bridge to R does a lot of work in the background to catch errors coming back from R and report them in the log.

The video can be viewed at:

http://www.minequest.com/downloads/Bridge2Rvids/bridgev3hd/Bridge2rV3HD.mp4

or tinyurl:

http://tinyurl.com/89rsrcr


Adventures in Porting

We’ve been busy porting the new version of the Bridge 2 R over to both the Mac and Linux platforms from Windows. The Windows release of the Bridge always allowed for the use of the Bridge from within the Eclipse Workbench. WPS didn’t have the Workbench as the GUI on the Mac or Linux until version 3, which is the latest release. So, here’s what I found porting a large program that has to talk to different operating platforms (i.e. calls to the OS) for such things as deleting files, moving files, copying files, reading directories, etc., and still interface with R.

The mundane part of porting was converting a lot of “\” to “/” throughout the code. In retrospect, we could have done a better job writing the Bridge in the first place to accommodate these conventions, but we didn’t have the intention of porting code back then either.
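For any R code that builds file names, the same separator problem can be sidestepped fairly easily. The sketch below is only an illustration with made-up paths; it is not taken from the Bridge itself.

```r
# Hypothetical Windows-style path as it might appear in older code
win_path <- "C:\\Bridge2R\\samples\\demo.R"

# Replace every backslash with a forward slash; fixed = TRUE avoids regex escaping
portable_path <- gsub("\\", "/", win_path, fixed = TRUE)
print(portable_path)

# Building paths with file.path() avoids hard-coding a separator in the first place
sample_file <- file.path("Bridge2R", "samples", "demo.R")
print(sample_file)
```

R accepts forward slashes on Windows as well, so the converted form works on all three platforms.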

Here are a couple of the gotchas that we experienced. When you read a directory on Linux or OS X, the structure is slightly different between the two and you have to accommodate that issue. The other BIG issue is that the pathnames are much longer on Linux and OS X when reading and writing to the WPS work folders. We ended up resizing our string variables to handle that specific difference.

The above might sound trivial, but one thing we discovered is that when you restart your server on OS X and Linux, the new work folder is contained inside the previous folder. For example, your original folder, let’s call it work1, is now hosting work2, your new folder. Now the path name is /work1/work2. But in reality, the names of the work folders are not work1 or work2 but long strings that can be hundreds of characters long. If you have a user who likes to restart their WPS Server, you can eat up a lot of string space quickly.

Since we store a lot of metadata for the Bridge 2 R inside the work folder, R has to be able to cope with very long filenames and I’m not convinced that it really copes all that well. Speaking of file names, here’s another anomaly between Windows and Linux/OS X systems. If you have a filename such as “myfile.txt ” (note the blank space at the end of .txt), Windows handles that just fine. Windows will interpret that as meaning you wanted “myfile.txt”. However, if you write such a file or try to read a file with that name under Linux or OS X, then those two names are distinctly different. On Ubuntu or Fedora, that name shows up as “myfile.txt\” when you list the files from the terminal.
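One defensive habit that helps with the trailing-blank problem is to trim file names before they ever reach the file system. Here is a small sketch of the idea in R; the safe_path helper is a hypothetical name used only for illustration.

```r
# Two names that differ only by a trailing blank
clean_name  <- "myfile.txt"
padded_name <- "myfile.txt "

# As strings, the two names are not the same
same_raw <- identical(clean_name, padded_name)
# After trimming, they match
same_trimmed <- identical(clean_name, trimws(padded_name))

# Hypothetical helper: strip stray whitespace before touching the file system
safe_path <- function(path) trimws(path, which = "both")

# Write through the helper using a padded name, then look for the clean name
tmp <- file.path(tempdir(), "myfile.txt")
writeLines("hello", safe_path(paste0(tmp, " ")))
print(file.exists(tmp))
```

Trimming at the point where the name is built means the same code behaves identically on Windows, Linux and OS X.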

It took us about three days to port the Linux version of the Bridge over from Windows. Much of that time was spent dealing with the issues in the previous paragraph. We then took the ported Linux code and tested it on OS X. It took about 20 minutes to modify the section dealing with the difference in reading directories between the two platforms, and we then had a new version of the Bridge to R running on OS X.

In retrospect, porting the code over to Unix/Linux systems was worth the effort. It took a few days for us to do the porting and much of that was due to being naive about the new ported destinations. I will talk soon about the new enhancements (and a programming change users will have to make) in the Bridge to R in my next post.


Submitting R Programs Remotely using Dropbox

One of the great software applications currently available is a product called Dropbox. Dropbox is a piece of downloadable software that allows you to access your files from different computers by dropping a file into your Dropbox folder. Dropbox automatically syncs the files between all the computers that have access to your Dropbox folder. The great thing about Dropbox is that it just works and is smooth as can be.

I’ve been using Dropbox for about two or three months now and thought how great it would be to extend the functionality of Dropbox by being able to place a WPS or R file into a specific folder and have it automatically execute and write the output back into the Dropbox folder. Basically, you would have access to your organization’s server for executing programs while travelling or working onsite.

My experimentation with this is under Windows, and I put together a little application that will allow you to remotely submit an R job. On my server, I have a filewatcher program that monitors the Dropbox folder of my choosing, and when it sees a new R program (i.e. one with a .R extension), it fires up R and processes the program. The system writes back any output to the Dropbox folder so you also have your .lst and .log files to review. You can also directly write output from your program (say, an R data frame file you created) by referencing the folder in your program.
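The actual filewatcher is a Windows application, but the basic idea can be sketched in a few lines of R. Everything below is an assumption made for illustration: the process_new_scripts function, the rjobs folder name, and the choice to send standard output to a .lst file and errors to a .log file are not how Drop4R is implemented.

```r
# Hypothetical helper: run each .R file in a watched folder that has not been seen yet
process_new_scripts <- function(watch_dir, seen = character(0)) {
  scripts <- list.files(watch_dir, pattern = "\\.R$", full.names = TRUE)
  new_scripts <- setdiff(scripts, seen)
  rscript <- file.path(R.home("bin"), "Rscript")
  for (script in new_scripts) {
    lst_file <- sub("\\.R$", ".lst", script)  # standard output goes here
    log_file <- sub("\\.R$", ".log", script)  # messages and errors go here
    system2(rscript, args = shQuote(script), stdout = lst_file, stderr = log_file)
  }
  # Return the updated list of scripts that have been processed
  c(seen, new_scripts)
}

# Demonstration in a temporary folder standing in for the Dropbox folder
watch_dir <- file.path(tempdir(), "rjobs")
dir.create(watch_dir, showWarnings = FALSE)
writeLines('cat("hello from the job\\n")', file.path(watch_dir, "job1.R"))

seen <- process_new_scripts(watch_dir)
seen <- process_new_scripts(watch_dir, seen)  # second pass finds nothing new
print(basename(seen))
```

A real watcher would call process_new_scripts in a loop with a pause (for example Sys.sleep(5)) between passes, and would need to wait until Dropbox has finished syncing a file before running it.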

I’ve included a little video of how R and Dropbox can be used to submit R programs on a remote server using a browser and place the output back into a Dropbox folder.

Click here to view a short 02:30 minute video of Drop4R

Of course, you don’t have to use a browser to place the files in the Dropbox folder. You can always just copy and paste or drag and drop the R program into the Dropbox folder and the Job Spawner will simply execute the R program.

I’ve created a small zip file that contains a first draft of an installation guide on how you can set up Drop4R on your Windows computers. I’ve made the application freely available and you can use it without any restrictions.

Links:

Installation Guide: Dropbox Guide

Drop4R Installation File: drop4r.zip


Configuring and Monitoring Your Linux Desktop

As most of the regular readers of my blog know, I’ve been dipping my toe into the world of Linux for the last six to eight months now. I have to admit, I had a predisposition against the OS for a number of reasons, but as I’ve become more comfortable with it, I can see why so many Quants and BI specialists have gravitated towards it.

I’ve noticed as I’ve ported the Bridge to R over to Linux that most of the hardcore R specialists are using Linux. They’ve needed the large memory address space that 64-bit Linux has offered for years. The other aspect is that the cost of Linux for many versions is pretty much just the cost of your time and bandwidth to download it.

Linux has been 64-bit for quite a while, well before Windows as far as I can tell. Although Windows XP 64-bit was available, it never really gained much in the way of popularity. Vista 64 really carried the banner forward and now Windows 7’s 64-bit edition is extremely popular.

As I’ve started developing WPS code on Linux, I’ve found some great programs that have made my transition a bit easier. UltraEdit for Linux is what I use as my editor to write WPS code and I’ve blogged about that before. One thing I’ve kind of missed was something similar to the Vista Gadget bar where you can have gadgets that monitor CPU usage, disk space and other system functions.

I found something that is quite useful on the Linux side called Conky. Conky allows you to monitor your system, and allows for notifications of incoming emails, disk space usage, Logical CPU usage, etc… If you’re like me, the eye candy is important on a desktop machine and Conky helps with that. Below is a shot of Conky running on Fedora 12.

[Screenshot: Conky running on Fedora 12]

As you can see from the screen shot, Conky can be configured to provide information on uptime, RAM usage, swap file usage and the utilization of each core. You can also configure it to display information on your file systems (i.e. disk usage) and network utilization. I like to listen to Shoutcast while I work, so I almost always have some download utilization showing.

You can also see how your drives are being utilized. On my Linux development machine, I monitor the disk I/O for my WPS Work drive and the drive that houses my permanent WPS datasets. Finally, I have Conky display the top five apps in terms of CPU utilization.

The nice part about Conky is that you can get the application for free. Conky is available on Sourceforge at: http://conky.sourceforge.net/. There are numerous configuration files as well as examples you can look at to create your own unique Conky sidebar.


Extending the Bridge to R – Statistical Processing

We put the Bridge to R out on the internets for a free 60 day trial to (1) gauge users’ interest in R and (2) gain some exposure for the software. So far, there have been more downloads of the WPS version than the SAS version of the Bridge. Although it’s only been available for four days now, I did receive an interesting email on Sunday morning. The question posed in the email was whether it would be possible to create a standardized set of macros and a standard calling and implementation convention as part of the Bridge to R that would allow developers to create statistical macros using R as the calculation engine.

I have to admit, this has me really intrigued. So what would be involved in creating a standardized suite of macros that other developers can use to create user defined routines that would use R from either WPS or SAS? Personally, I can’t see the value in replicating anything that already exists in the SAS/Base or WPS-Core library. I can see value in replicating some of the most popular statistical procedures as a macro that takes a predefined set of parameters. As an example, let’s take a look at what would be required to create a forecast using the R library forecast created by Rob Hyndman.

If we are trying to forecast the variable pop and have another variable called startyr, all we really need to do is to pass to R the start date of the forecast, the frequency of the series, and the variable we want to forecast (pop). If start = 1970 and the frequency of the series is 1, then the R code would look like:

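library(forecast)                          # provides ets(), forecast() and accuracy()

yr  <- ts(year, start=1970, frequency=1)   # annual series starting in 1970
est <- ets(pop)                            # fit an exponential smoothing model
accuracy(est)                              # in-sample accuracy measures

fit  <- fitted(est)                        # fitted values
res  <- residuals(est)                     # residuals
pred <- forecast(est)                      # forecasted values

# convert the results to data frames so they can be brought back
fit <- as.data.frame(fit)
res <- as.data.frame(res)
pop <- as.data.frame(pop)
yr  <- as.data.frame(yr)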

We can easily generate this code to run within the Bridge to R and populate the parameters using the macro language. A simple template that would run the R code would look like:

%let startyr = 1970; *--> do some preprocessing to get rid of this;

%Rstart(dataformat=csv,data=mydata,rGraphicsViewer=False);
datalines4;

library(forecast)
attach(&data)

yr <- ts(year,start=&startyr, frequency=&freq)

est <- ets(&var)
accuracy(est)
fit <- fitted(est)
res <- residuals(est)
pred <- forecast(est)
fit <- as.data.frame(fit)
res <- as.data.frame(res)
&var <- as.data.frame(&var)
&date <- as.data.frame(&date)

;;;;
%Rstop(import=&var fit res &date pred);

The Bridge will take care of validating the existence of the data sets as well as reading in the output (log and list files) from R, including importing the R data frames back into WPS or SAS. What would have to be added are routines to parse out variable names from a list (easily done), check that they exist in the data set, check that the variables are of the correct type (alphanumeric, numeric) to be passed to R, and handle missing values.
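In practice these checks would most likely live in the macro code on the WPS or SAS side, but the logic is easy to illustrate in R. The sketch below is only that, an illustration: the check_vars function and the sample data are hypothetical and not part of the Bridge.

```r
# Hypothetical helper: validate a space-separated variable list against a data frame
check_vars <- function(var_list, data, numeric_only = TRUE) {
  # Parse the list into individual variable names
  vars <- strsplit(trimws(var_list), "\\s+")[[1]]

  # Check that every variable exists in the data set
  missing_vars <- setdiff(vars, names(data))
  if (length(missing_vars) > 0) {
    stop("Variables not found: ", paste(missing_vars, collapse = ", "))
  }

  # Check that the variables are of the expected type
  if (numeric_only) {
    non_numeric <- vars[!vapply(data[vars], is.numeric, logical(1))]
    if (length(non_numeric) > 0) {
      stop("Non-numeric variables: ", paste(non_numeric, collapse = ", "))
    }
  }

  # Report how many missing values each variable carries
  vapply(data[vars], function(col) sum(is.na(col)), integer(1))
}

mydata <- data.frame(yr = 1970:1974,
                     pop = c(10.1, 10.4, NA, 11.0, 11.3),
                     region = "MI")
na_counts <- check_vars("yr pop", mydata)
print(na_counts)
```

Returning the missing value counts, rather than silently dropping rows, leaves the decision about how to handle them to the calling macro.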

Thus, a very simple macro that a developer might implement for the automated forecasting of a univariate time series might look like:

%AutoForecast(dataset=mydata,
          date=Yr,
          Freq=1,
          var=pop,
          output=dataset that contains all the forecasted values);

Of course, the above example is very elemental. The developer would probably want to add some bells and whistles such as being able to suppress the printing of the output, creating plots and capturing them into a catalog, processing multiple variables, etc…

The value of creating a standardized set of macros and routines for statistical developers includes:

1. Ability to create a custom statistical routine in WPS or SAS that is not possible with just WPS or SAS by itself.

2. Inexpensively distribute these custom routines without requiring users to have specific statistical libraries.

3. Cost savings where one doesn’t have to license the SAS/Toolkit.

4. Reduce cost by replacing those statistical libraries where your organization uses just one or two procedures.

5. Use it as a basis for developing cost effective vertical market applications because your customers will not have to license additional modules/libraries from SAS.


Bridge to R v2.4.2 Available for a 60 Day Trial

The latest Bridge to R (version 2.4.2) is now available for download on an extended 60 day trial. The Bridge to R allows you to execute R syntax from within your WPS or SAS IDE and return the log and listing files from R into the SAS or WPS log and listing window. The Bridge alleviates the need to license SAS/IML Studio to access R using SAS. Also, this version of the Bridge brings SAS back into the picture in that both platforms, WPS and SAS are supported.

Requirements

The Bridge has minimal requirements. They are:

· WPS 2.4.x or SAS 9.2.x

· Windows Desktop Operating System

· R versions 2.7.x through 2.11.0.

Note that R release 2.11.0 is fairly new and not all the R packages from CRAN have been brought forward yet. Specifically, the package Hmisc still has not been released and there are some example programs that we use that rely on the Hmisc library.

The Bridge to R has also been tested on the x64 R build (i.e. the 64-bit alpha build for Windows) and so far, seems to work fine with that release as well.

Download

You can download the Bridge to R by going to the MineQuest website at:

http://minequest.com/BridgePreview.html

From the above web page, you can download the Bridge for your specific installation (i.e. WPS or SAS) as well as watch a tortuous video of what the Bridge is able to do. At least the video is only six minutes long but it does provide the background you need to decide if this is something you want to add to your software portfolio.

Installation

Place the Bridge2R.zip file on your desktop and unzip the package. The structure and contents of the folder should be:

\Bridge2R

\Bridge2R\SASMACR.WPCCAT

\Bridge2R\Bridge to R v242.pdf

\Bridge2R\samples\

There’s also a short installation and user guide that you can read before downloading the software. The installation guide is also included in the zip file.

If you have questions on installation issues, please visit the support forum that we just set up to help answer these kinds of questions.

Notes on the Next Release of the Bridge to R

Thought I’d write a little about what MineQuest has been working on for the next release of the Bridge to R. We just wrapped up the programming for the latest release and I’m pretty happy with what is right around the corner for our current users and new users as well.

First, we’ve added the ability to export WPS data sets directly into your R workspace as R data frames. We’ve always provided support for taking a single WPS data set into a data frame but this release makes it easy to export multiple data sets into R. This actually required a lot of effort to do and is based on requests from numerous customers who are using the Bridge.

Originally, I envisioned that people would use the Bridge in a similar way that they would use a WPS or SAS procedure. They would create a data set that contained all the variables they needed for a specific statistical routine in R and use that for their analysis. But I was easily convinced that this was short-sighted because it didn’t allow for the analyst to move all the data sets needed for such things as matrix operations into the R work space.

The other thing that convinced me that this was necessary is that I recently became aware of a book called "A Practical Guide to Geostatistical Mapping" by Tomislav Hengl. Tomislav writes about mapping, and to create maps, you need to have multiple data sets. You need one that contains the data to be displayed and one that contains the coordinate files. I eventually want to provide some mapping data sets for the Bridge to R so one can create maps using the Bridge, so the ability to read multiple WPS data sets is necessary.

Exporting WPS data sets to R is accomplished by specifying the names of the WPS data sets in the %Rstart() clause. Here’s an example:

%Rstart(DataFormat=xpt, data=a b c, rGraphicsViewer=No)

The data sets a, b and c are automatically exported to R data frames for you without any other commands or programming.

The other improvement in the next release of the Bridge to R is that you can import multiple data frames from your R session to WPS. This is easily done and just requires the analyst to list the R data frames on the Import= clause of the %Rstop macro to bring all the frames back into WPS. For example:

%Rstop(import=dataframe1 dataframe2 dataframe3);

where dataframe1, dataframe2 and dataframe3 are the names of the R data frames that you want to import back into WPS. This will create three WPS data sets named dataframe1, dataframe2, and dataframe3, respectively.

We’ve also added more error checking to the Bridge. We now catch errors when using the XPORT transport format. One problem with using XPORT as a transport format is that it’s limited to eight character variable names. We now examine all the WPS data sets before they are exported to make sure that the variable names are eight characters or less in length and if not, we throw an exception, report on it and don’t try to process the R code because we already know it won’t execute.
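The Bridge performs this check on the WPS side before anything is exported, but the rule itself is simple enough to illustrate in R. The sketch below is not the Bridge's code; the xport_name_problems function and the two sample data frames are made up for the example.

```r
# Hypothetical helper: find variable names too long for the XPORT transport format
xport_name_problems <- function(datasets, max_len = 8) {
  problems <- lapply(datasets, function(df) {
    nm <- names(df)
    nm[nchar(nm) > max_len]
  })
  # Keep only the data sets that actually have offending names
  problems[lengths(problems) > 0]
}

a <- data.frame(id = 1:3, income = c(10, 20, 30))
b <- data.frame(id = 1:3, household_size = c(2, 4, 1))

bad <- xport_name_problems(list(a = a, b = b))
if (length(bad) > 0) {
  # Report the offending names instead of attempting the export
  message("Names longer than 8 characters: ",
          paste(names(bad), sapply(bad, paste, collapse = ", "),
                sep = ": ", collapse = "; "))
}
```

Checking every data set up front and reporting all the offending names at once saves the user from fixing them one failed run at a time.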

By the way, the reason we support the transport format is due to customer requests from those in the biostats area. They wanted to make sure that they can pass a possible data processing audit and they felt much more comfortable with the XPORT format than passing data via a CSV format.

So what’s left? With the next release of the Bridge to R (by the end of April 2010) we are updating the documentation and adding more sample R programs that demonstrate how to use the Bridge. We are adding another half dozen R graphic sample programs and a few more statistical type programs as well.

I’m very confident that the Bridge to R when used with WPS can complement the WPS system by allowing the analysts to do just about any kind of graphics or statistical procedures all from within the WPS IDE. With the low cost of the Bridge (free if you license WPS from MineQuest) and the use of open source R, you can replace SAS/IML, SAS/Graph and many of the SAS statistical modules and be state-of-the-art on your analytics platform.
