Category Archives: Open Source

Thoughts on Open Source

[Photo: Unbelievable sunsets in Tucson, AZ]

I recently read an article on Linux.com by Esther Shein titled “The Case for Open Source Software at Work,” where she discusses the results of a survey on the use of Open Source software in companies. It's a pretty interesting read, and it makes the case that IT workers place real importance on having access to source code.

The elephant in the room that never gets addressed is how that value is measured by accounting or, say, purchasing. For example, how much of the perceived value in other parts of the company comes from viewing the software as free… i.e. cost free?

Individuals differ in their purchasing and use habits. Most individuals I know are driven first by price, with popularity and completeness as secondary factors in their consideration.

I can’t recall ever seeing a survey of corporate buyers that measures whether the desire for specific software is driven by its being free or by its being open source. I imagine that it may be measured internally by some companies, but I would love to see a public survey that addresses the issue.

My own opinion, derived from looking at MS Office vs. LibreOffice, is that quality and support are the most important drivers for desktop office software. Every large company that I have consulted with uses MS Office. They may use an older version, but they use MS Office.

When I switch my thoughts to analytical software, I see the same thing. Corporations purchase or license software like WPS or SAS because of support and completeness. Documentation is a big factor here too. Individuals who don’t have the financial resources to license analytical software like the aforementioned products gravitate towards free software.

I do grudgingly use R when needed, but I prefer WPS over any other analytical software. It’s based on a language that I have used for 30 years and feel very comfortable with. I find it much easier to debug my code, and I like that if I choose to build a product, I know it will run on Windows, OS X, Linux and the mainframe.

When I factor in that I can license WPS for a bit over $3 a day on a Windows or Mac workstation (our competitor charges just north of $41 a day for your first year), I find it compelling to have WPS in my BI stack. I can still use R and Python, but the language of SAS is just too rich and broad to ignore.
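For the curious, the annualized arithmetic behind that comparison is easy to check. Here's a quick back-of-the-envelope sketch in R; the exact per-day rates are approximations taken from the figures quoted above:

# Rough annualized cost from the per-day figures quoted above (approximate)
wps_per_day <- 3.30    # "a bit over $3 a day" for WPS on a workstation
alt_per_day <- 41.00   # "just north of $41 a day" for the competitor
c(WPS = wps_per_day * 365, Competitor = alt_per_day * 365)
# roughly $1,200/year vs. $15,000/year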

About the author: Phil Rack is President of MineQuest Business Analytics, LLC located in beautiful Tucson, Arizona. Phil has been a SAS language developer for more than 25 years. MineQuest provides WPS consulting and contract programming services and is an authorized reseller of WPS in North America.

Complexity and Cost

This past weekend, my wife and I went to a lovely wedding. It was a Catholic wedding that was amazingly short, but the priest gave a very interesting sermon on complexity and cost. He talked about complexity in our lives and the cost, both direct and indirect, that we each experience. One example he gave was smartphones: how expensive they are in terms of the outright cost of service, as well as the indirect cost of how much time we spend playing with and looking at these gadgets at the expense of the people and relationships around us.

His sermon got me thinking. This is true for software, and business intelligence in particular. The cost of non-open-source software can be pretty high. And the reason for that? Support costs, sales costs, maintenance costs, legal costs, etc.

I often see how companies have purposely fragmented their products so that they can charge more for additional libraries and modules. This has increased cost tremendously for the consumer. Our competitor is a prime example of this. They send out a local or regional salesperson to chat up the prospect. Often, they can’t answer the questions the customer has because of the complexity of the product. So they send out a Sales Engineer or two to visit the prospect, answer those questions, and chat them up a second time. Now we have three people, each making a hundred grand a year (at least), involved in the sale. The price of the software has to increase to cover all the people involved in selling it.

Here’s another example of added complexity: different pricing for the same product depending on how you use it. Take companies that are B2B in nature. Actuarial firms, claims processors, advertising agencies, and so on are often labeled as data service providers because they want to use the software in a B2B capacity. Sometimes this is as innocuous as being a Contract Research Organization providing statistical analysis. The cost here comes from a different license (think lawyers), people to audit the customer, and employees to enforce the license. It all adds up!

The above examples illustrate everything that is wrong with traditional thinking about software. At MineQuest Business Analytics, we’re proud that we are able to help keep costs down for the customer. We don’t have such draconian licensing for companies that are DSPs. We don’t have an organization that is set up to milk and churn the customer for every last cent. What we do have is a company that is dedicated to providing the best service and software at an affordable price.

About the author: Phil Rack is President of MineQuest Business Analytics, LLC located in Grand Rapids, Michigan. Phil has been a SAS language developer for more than 25 years. MineQuest provides WPS and SAS consulting and contract programming services and is an authorized reseller of WPS in North America.

Another View of R and Big Data

I was reading a blog entry the other day that just blew me away. Librestats has a post entitled “R at 12,000 Cores,” and it is a very good (and fun) read. It’s amazing what the open source advocates can accomplish, and this article is a great example of that.

After reading the article, I can’t help but think about the relationship between extremely large data, server size (both CPUs and RAM), and how fast data is growing. There has to be a way to crunch through the amount of data that is piling up, and this article addresses that issue.

I believe you will begin seeing vendors embrace R more openly, mainly because they have to. No single company can develop code at the breakneck pace at which the R community is putting out packages. It’s truly amazing, and it is cost effective to model data in the state-of-the-art way the article above describes.

Even small companies can make use of multiple servers with dozens of cores and lots of RAM rather inexpensively. Using Linux and R on a set of servers, an organization can have a hundred cores at its disposal for crunching data while paying very little in licensing fees.

I have been giving some thought to making the Bridge to R run in parallel on a single server as well as across a set of servers using WPS and pbdR or Rmpi. This way, WPS would handle the management between the servers and the data transparently and provide number crunching at very low cost. God knows we have a few extra multi-core servers lying around here, so it may be an interesting adventure to give this a spin!

My first thought and intention is to make the code backward compatible. Perhaps just add a macro that can be called that contains the information needed to run R across cores and on a grid. It could be something as simple as:

%Rconfig(RconfigFile=xyz, RunInParallel=True||False);

The remaining statements in the Bridge to R would continue as they are, and the R code would be pushed to the servers based on the information in the RconfigFile. WPS would still collect the output from these jobs and route the appropriate information to the log and listing windows, as well as the graphics (wrapped in HTML) to the graphics viewer for users to view their output.
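To make the idea concrete, here is a minimal sketch of the kind of fan-out the Bridge might generate under the covers when RunInParallel=True. This is illustrative only, not actual Bridge to R code: it uses base R's parallel package on a single multi-core server, and the run_chunk helper and the work folder are hypothetical. A multi-server grid would build its cluster with Rmpi or pbdR instead.

# Illustrative sketch only -- not the actual Bridge to R implementation.
# Fan generated R programs out across local cores; a grid version would
# create its cluster with Rmpi/pbdR instead of makeCluster().
library(parallel)

run_chunk <- function(chunk_file) {          # hypothetical helper
  source(chunk_file, local = TRUE)           # each worker runs one program
  chunk_file                                 # return the name for bookkeeping
}

cl     <- makeCluster(detectCores())         # one worker per core
chunks <- list.files("work", pattern = "\\.R$", full.names = TRUE)
done   <- parLapply(cl, chunks, run_chunk)   # WPS would gather these outputs
stopCluster(cl)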

 

Software We Use and Love on a Daily Basis

WPS – World Programming System from World Programming Ltd. WPS is a SAS language-compatible software system that implements many components of the SAS language on multiple platforms. WPS starts at just over $1,206 on a desktop. Check out the MineQuest website for more information.

R – R Project for Statistical Computing. The 64-bit port is definitely the way to fly if you are using R. R has amazing graphics and Hadley Wickham’s ggplot2 is worth the effort of learning R.
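Just as a taste of why ggplot2 gets so much praise, here is a small sketch using R's built-in mtcars data (nothing specific to our setup): a scatterplot of car weight against mileage, with a fitted smoother overlaid, in two lines.

library(ggplot2)
# scatterplot of weight vs. mpg with a fitted smoother laid on top
qplot(wt, mpg, data = mtcars) + geom_smooth()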

Microsoft Office 2007 and 2010 – Microsoft Office is the standard for writing documentation, using spreadsheets, and email on both Windows and OS X platforms. I probably use Excel, Word and Outlook more than any other office productivity tool.

Skype – used a lot for both domestic and foreign phone calls and text messaging. Skype is easy to use and can provide your company with the ability to do business overseas at reasonable costs. Also, for a mere $5 a month, you can have group video calling as well as calls to any phone in the US and Canada.

MeetingBurner – MeetingBurner is relatively new, and since we have so few attendees on a typical web conference (five or six), it makes perfect economic sense. We’ve not used it much yet, but it is fast, and it is free for organizations that have fewer than 15 attendees in a meeting. One great plus is that it integrates Skype for audio.

Oracle’s VirtualBox – We use this to reduce our exposure to running multiple physical servers. VirtualBox saves a lot of money for testing software because it can dramatically reduce power consumption by not requiring individual physical servers.

Nuance PaperPort – It’s amazing how much paper we scan here: bills, invoices, checks and all kinds of business-related materials. With both a Canon MFC printer and a Brother MFC printer, we just throw documents into the hopper and scan away. You can learn more about PaperPort on the Nuance website.

UltraEdit – UltraEdit from IDM is the standard for programming editors. We use it on the Linux and Windows desktops. Soon we will be using it on the Mac desktop to port the Bridge to R over to OS X.

About the author: Phil Rack is President of MineQuest, LLC and has been a SAS language developer for more than 25 years. MineQuest provides WPS and SAS consulting and contract programming services and is a reseller of WPS in North America.

What We Use: Our Favorite Hardware Gear

I love to hear what hardware and software other people and organizations are using, not only to run their businesses but also to enhance personal productivity. I thought it would be fun to share what we have in house for development. This particular post will focus on the hardware side.

Custom Linux Server – 16GB of RAM and four logical CPUs. This box is the basis for our WPS development and testing environment on the Linux platform. Lots of work space (three 640GB SATA III drives in RAID0) for temp work, and four 640GB drives in RAID5 for permanent data sets.

Windows Server 2008 R2 – A four-core CPU with 8GB of RAM and two RAID1 arrays, each consisting of two 1.5TB drives. There’s another 4TB of assorted space for business data. This box is used mainly for testing WPS Windows apps and for backing up all the various desktops. It also provides us with remote connectivity when we are on the road.

Desktops – a number of assorted desktops from various manufacturers. All have at least 8GB of RAM and a couple of terabytes of storage. All desktops have at least four cores, so performance is decent.

Apple Mac Mini – This is a recent purchase. It has a four-core Intel i5 CPU, and the box (if you want to call it that) came with 2GB of RAM. I immediately upgraded it to 8GB, but it might be time to go to 16GB since memory is so inexpensive. The latest pricing for two 8GB sticks that would work in that machine is about $160.

All the desktops have dual monitors, even the Mac Mini. By the way, if you need to buy a Mini DisplayPort adapter for the second monitor, check out the prices at MonoPrice.com. I paid less than $7.00 USD for an adapter to output to DVI. Apple wanted $29.00 for it!

Notebooks – a couple of notebook computers. Both are dual core and great for traveling, but doing any hardcore development on them would be painful.

Printers – We just recently gave a Tektronix Phaser color printer to charity. In house we have a Canon MFC 4150 for everyday printing and a Lexmark 543DN color laser. Both are wonderful printers, but the next one we buy will have to be wireless. The Canon just doesn’t support printing and scanning from Apple’s OS X operating system.

Of course, we have the assorted headsets and telephone systems to complement our business requirements. It’s amazing how quickly one can load up on junk hardware. It’s very hard for me to throw or give older equipment away… it’s the packrat in me.

About the author: Phil Rack is President of MineQuest, LLC and has been a SAS language developer for more than 25 years. MineQuest provides WPS and SAS consulting and contract programming services and is a reseller of WPS in North America.


Open Source BI

I ran across an interesting blog post the other day and thought it was worth sharing. The article, Open source BI: New kids on the block, is a set of viewpoints from open source vendors discussing Jim Goodnight’s comment that “We haven’t noticed [open source BI] a lot. Most of our companies need industrial-strength software that has been tested; put through every possible scenario or failure to make sure everything works correctly. That’s what you’re getting from software companies like us – they’re well tested and it scales to very, very large amounts of data.”

I find it an entertaining read and agree with some of what is argued, but I think the bigger point being missed is not whether open source BI will continue to gain momentum and replace commercial BI products, but how open source will become integrated into, and begin working in tandem with, commercial products.


About the author: Phil Rack is President of MineQuest, LLC and has been a SAS language developer for more than 25 years. MineQuest provides WPS and SAS consulting and contract programming services and is a reseller of WPS in North America.

Business Analytics Predictions for 2011

1. Companies will take a hard look at their present enterprise agreements and, due to economic uncertainty, start renegotiating them in an effort to cut costs. One familiar refrain will be that basic analytical software has become a commodity and that companies will not continue to pay high annual licensing costs. We will see this trend accelerate dramatically with local and state governments, which are being crushed by looming deficits.

2. Open Source will continue to make inroads in the analytics sphere. R will continue to grow in the enterprise by virtue of its popularity in academic circles. As students enter the workforce, they will want to use the software they’re most comfortable with.

3. Enterprises will start offering trade partners analysis, reports and data that show how they can improve their services to each other. This will be a win-win scenario for both organizations.

4. As hardware capability increases, analytical software pricing will become a major concern as businesses will want to use the software on more platforms and in more areas of the company. Linux will be the platform of choice for most of these companies due to low cost and high performance.

5. Desktop analytics, contrary to popular opinion, will continue to dominate the enterprise. This is where the hardcore data analysts live, on the desktop, and this is also where the new algorithms will be developed. Visualization software will also start to become common on the desktop. Businesses that shortchange their analytical development staff with low-powered desktops and small LCD monitors will see less active development by their staffs.

6. We will see enterprises that have invested in specific high-cost analytic languages, and that have put rules, reports and algorithms into production on large servers, either recode to a new language or migrate to compatible, lower-cost languages.

7. The role of innovation will be double-edged. There will be companies and organizations that invest heavily in analytics and see advantages over their competitors. There will also be companies that gain competitive advantage by utilizing their BI stack more effectively, making it available throughout the company.

8. Licensing will continue to hamper companies and organizations, as well as constrain growth, by restricting what companies can provide (reports, data, etc.) to their customers by virtue of being labeled Data Service Providers. Processing third-party data will be a monumental problem for companies due to license issues.

9. The days of processing large amounts of data on z/OS are all but over. I know this has been said before, but there just isn’t growth on that platform. Plus, all the innovation in analytics is taking place on the desktop and smaller servers. Companies will look at moving their analytics to z/Linux and other Linux platforms in an attempt to save money on hardware and software costs.

10. Multi-threaded applications running on the BI stack will be all the rage. As core counts and memory availability continue to expand, the ability to make use of SMP and MPP hardware will be more important than ever.

11. Server pricing based on client access or client counts will begin to decline. Competition for customers will make such pricing ill-advised.

12. The allure of cloud computing will be strong, but with regulatory constraints, privacy laws, and the fear of losing control of data (i.e. WikiLeaks), the two largest service sectors in the United States, banking and healthcare, have taken note and will continue to avoid the use of public clouds.

13. Just as in other parts of the economy where we see the creation and bursting of bubbles, in 2011 social media such as Facebook and Twitter will start to be seen as a venue for narcissists and a time waste for many people. Companies that have invested millions of dollars to “mine” tweets will see such analysis as less than helpful given its low ROI, and such analysis will begin to fall out of favor.

14. Mobile applications will be hot. The delivery of analytics on devices such as iPads, other tablets, and smartphones will become much more common.

15. Since “flat is the new norm,” cell phone providers will find that the high cost of data plans drives potential customers away. And since Wi-Fi connectivity has become so common (almost everywhere I go there is free Wi-Fi), we will see a decrease in 3G and 4G use, and Wi-Fi-only tablets will dominate. We are already seeing this trend with the Apple iPod touch vs. the iPhone, and now the RIM BlackBerry PlayBook will also be offered in a Wi-Fi-only version.

About the author: Phil Rack is President of MineQuest, LLC and has been a SAS language developer for more than 25 years. MineQuest provides WPS and SAS consulting and contract programming services and is a reseller of WPS in North America.

Submitting R Programs Remotely using Dropbox

One of the great software applications currently available is a product called Dropbox. Dropbox is a downloadable application that allows you to access your files from different computers by dropping a file into your Dropbox folder. Dropbox automatically syncs the files between all the computers that have access to your Dropbox folder. The great thing about Dropbox is that it just works, and it is as smooth as can be.

I’ve been using Dropbox for two or three months now and thought how great it would be to extend its functionality by placing a WPS or R file into a specific folder and having it automatically execute and write the output back into the Dropbox folder. Basically, you would have access to your organization’s server for executing programs while traveling or working onsite.

My experimentation with this is under Windows, and I put together a little application that allows you to remotely submit an R job. On my server, I have a filewatcher program that monitors the Dropbox folder of my choosing, and when it sees a new R program (i.e. one with a .R extension), it fires up R and processes the program. The system writes any output back to the Dropbox folder, so you also have your .lst and .log files to review. You can also directly write output from your program (say, an R data frame you created) by referencing the folder in your program.
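The mechanics are simple enough to sketch. Below is a minimal polling loop written in R rather than as the Windows filewatcher Drop4R actually uses; the watch_dir path is hypothetical, the duplicate-file bookkeeping is deliberately naive, and it assumes R is on the PATH. But it shows the basic idea of watching a folder and batch-running whatever new .R programs land in it:

# A minimal sketch of the filewatcher idea (not the actual Drop4R code).
# Poll a Dropbox folder, batch-run any new .R program, and write the
# .Rout output back beside it so Dropbox syncs it out again.
watch_dir <- "C:/Users/me/Dropbox/jobs"   # hypothetical folder
seen <- character(0)

repeat {
  jobs <- list.files(watch_dir, pattern = "\\.R$", full.names = TRUE)
  for (job in setdiff(jobs, seen)) {
    # name the output file after the program so it lands in Dropbox too
    system2("R", c("CMD", "BATCH", shQuote(job), shQuote(paste0(job, "out"))))
    seen <- c(seen, job)
  }
  Sys.sleep(10)                           # check again every 10 seconds
}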

I’ve included a little video of how R and Dropbox can be used to submit R programs on a remote server using a browser and place the output back into a Dropbox folder.

Click here to view a short 2:30 video of Drop4R

Of course, you don’t have to use a browser to place the files in the Dropbox folder. You can always just copy and paste or drag and drop the R program into the Dropbox folder, and the job spawner will simply execute the R program.

I’ve created a small zip file that contains a first draft of an installation guide on how you can set up Drop4R on your Windows computers. I’ve made the application freely available, and you can use it without any restrictions.

Links:

Installation Guide: Dropbox Guide

Drop4R Installation File: drop4r.zip

About the author: Phil Rack is President of MineQuest, LLC and has been a SAS language developer for more than 25 years. MineQuest provides WPS and SAS consulting and contract programming services and is a reseller of WPS in North America.


Creating Maps with WPS and the Bridge to R

A while back, I demonstrated how you can use the Bridge to R to create almost any graph or plot using WPS in combination with R. I showed how you can create the cowboy hat as well as some basic and not-so-basic charts and plots. One thing that I didn’t demonstrate was how you can create thematic maps (aka choropleth maps) using the Bridge.

Today, I want to delve into that area a little bit and provide some programming samples that you can use to create these maps. First, you need to have a copy of the Bridge to R and WPS (or SAS) to run these demos. Some of the later code also uses the county2000 dataset available from the downloads section of the minequest.com website.

First, a little background. Thematic mapping is a great way to show how certain attributes vary across political boundaries: for example, how states differ in income tax assessment, or which counties are the most populous in the country. Providing a visual map that helps your users understand variation across geography is, in my opinion, always helpful. R provides a library called “maps” that contains polygons for drawing thematic maps and a means for attaching the variable you want to display across a given geography. I will show how you can use the state and county outlines from R to do just that with the Bridge to R.

To draw a simple outline of the United States using the Bridge, it only takes three lines of R code. For example:

Program 1. Displaying U.S. State Outlines.

*--> Outline of the United States - by state;
%Rstart(dataformat=manual, data=, rGraphicsViewer=true);
datalines4;

library(maps)  # load the boundary file
map("state", interior = TRUE, projection="polyconic", col="blue")  # draw the map
title('United States')

;;;;
%Rstop(import=);

Map 1. U.S. State Outlines Map.


We can expand on the above map by adding one more line of code, which will draw the county outlines inside the state boundary outline.

Program 2: Creating State and County Outlines.

*--> Outline of the United States - by state/county;
%Rstart(dataformat=manual, data=, rGraphicsViewer=true);
datalines4;

library(maps)

map('county', boundary=TRUE,
    interior=TRUE, projection="polyconic", col='lightgray', fill=TRUE, resolution=0, lty=1)
map('state', boundary=FALSE,
    projection="polyconic", col='white', fill=FALSE, add=TRUE, lty=1)
;;;;
%Rstop(import=);

Map 2. State County Outlines.


We can take this one step further by selecting only the geographic areas we are interested in displaying, passing just those regions as an argument to R. In this case, I’ve taken the liberty of passing the string containing the regions to R using a macro variable. The Bridge to R can pass macro variables to R to help minimize typing and mistakes.

Program 3. Selecting specific areas to map.

*--> Great Lakes States by county - How to map a subset;
%let geogarea = 'ohio','michigan','indiana','illinois','wisconsin';

%Rstart(dataformat=manual, data=, rGraphicsViewer=true);
datalines4;

library(maps)
map('county', region=c(&geogarea), boundary=TRUE,
    interior=TRUE, projection="polyconic", col='lightgray', fill=TRUE, resolution=0, lty=1)
map('state', region=c(&geogarea), boundary=FALSE,
    projection="polyconic", col='white', fill=FALSE, add=TRUE, lty=1)

title('Great Lakes States')

;;;;
%Rstop(import=);

When we run the code above (Program 3), we are presented with a map that contains just the counties for the Great Lakes States: Ohio, Michigan, Indiana, Illinois, and Wisconsin.

Map 3. Great Lakes States.


So far, I’ve shown you how to (1) create a map, (2) overlay two geographic areas (state and county) on a map, and (3) select a specific subset of the data (the Great Lakes States) to display. Let’s move on and see how you can map your own data using the Bridge to R and the R maps library.

The data I’m using to create the county population density map below is from a zip file that you can download from the MineQuest website. Basically, I’m using WPS to manipulate the data into a format that R can use, and then, via the Bridge to R, calling the mapping routines to display the data.

Program 4. Displaying your data in a thematic map.

libname cntydata 'c:\data';

proc format;
  value popval
    0-24999       = 1
    25000-99999   = 2
    100000-249999 = 3
    250000-499999 = 4
    500000-749999 = 5
    750000-high   = 6;
run;

data cntydata(keep=names cntypop);
  set cntydata.county2000;
  length names $ 32 cntypop 8;
  cntypop = pop100;
  if state in('02','15','72') then delete;  /* drop Alaska, Hawaii, Puerto Rico */

  x=indexw(name,'County');
  if x > 0 then cntyname=substr(name,1,x-1);

  y=indexw(name,'Parish');
  if y > 0 then cntyname=substr(name,1,y-1);

  names=trim(lowcase(fipname(state)))||','||trim(lowcase(cntyname));
  format cntypop popval.;
run;

*--> create a US map at county level showing population density;
%Rstart(dataformat=csv, data=cntydata, rGraphicsViewer=true);
datalines4;

library(maps)  # Load the maps library
popdata <- cntydata

# define the color map to be used
cols <- c("#F1EEF6", "#D4B9DA", "#C994C7", "#DF65B0", "#DD1C77", "#980043")

mp <- map("county", plot=FALSE, namesonly=TRUE)

# draw the county outlines
map("county", col=cols[popdata[match(mp, popdata$names),]$cntypop], fill=TRUE, projection="polyconic")

# draw the state outlines
map('state', boundary=FALSE, projection="polyconic", col='white', fill=FALSE, add=TRUE, lty=1)

title('U.S. County Population Density')
;;;;
%Rstop(import=);

Map 4. U.S. County Population Density.


Above is the map generated by the code in Program Listing 4. Personally, I think it’s a nice thematic map, and it does demonstrate population density by county. It obviously could be enhanced by adding a legend and perhaps a footnote, but I will leave that up to you to figure out (the sketch below is a starting point for the legend). The R code that creates the map is only seven lines long. This could easily be turned into a template for further expanding the map, as well as for code reuse.
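If you want to try the legend, a minimal sketch might look like the following. It assumes the cols vector and the popval category breaks from Program 4 are in scope, and it would be added to the R block right after the title() call:

# Minimal legend sketch -- assumes the cols vector and the popval
# breaks from Program 4; place after the title() call in the R block.
legend("bottomright",
       legend = c("< 25K", "25K-100K", "100K-250K",
                  "250K-500K", "500K-750K", "750K+"),
       fill = cols, cex = 0.8, title = "County population")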

For more information on creating maps with R, visit CRAN at:

http://cran.r-project.org/web/packages/maps/index.html and download the maps.pdf file.

About the author: Phil Rack is President of MineQuest, LLC and has been a SAS language developer for more than 25 years. MineQuest provides WPS and SAS consulting and contract programming services and is a reseller of WPS in North America.

Bridge to R Preview Still Available

Just a quick reminder to those who are interested in the Bridge to R: the free trial preview of the Bridge is still available for download at http://minequest.com/BridgePreview.html. The trial is time-limited and will stop working after June 30th, 2010.

There are two versions available, depending on whether you’re a WPS user or a SAS user. If you’re on the fence about whether R and the Bridge to R are something you want to explore and would like to see a short web video on the Bridge, there are additional links as well as installation instructions at the link above.

About the author: Phil Rack is President of MineQuest, LLC and has been a SAS language developer for more than 25 years. MineQuest provides WPS and SAS consulting and contract programming services and is a reseller of WPS in North America.