Thoughts on Open Source

[Photo: Unbelievable sunsets in Tucson, AZ]

I recently read an article on Linux.com by Esther Shein titled “The Case for Open Source Software at Work,” in which she discusses the results of a survey on the use of open source software in companies. It’s a pretty interesting read, and it makes the case that IT workers feel strongly about the importance of having access to source code.

The elephant in the room that is never addressed is how that value is measured by accounting or, say, purchasing. For example, how much perceived value do other parts of the company derive because they look at the software as being free… i.e. cost free?

Individuals differ in their purchasing and use habits. Most individuals I know are driven by price as the first factor, with popularity and completeness as additional considerations.

I can’t recall ever seeing a survey of corporate buyers that measures whether the desire for specific software derives from it being cost free or from it being open source. I imagine some companies measure this internally, but I would love to see a public survey that addresses the issue.

My own opinion, derived from looking at MS Office vs. LibreOffice, is that quality and support are the most important drivers for desktop office software. Every large company that I have consulted with uses MS Office. They may use an older version, but they use MS Office.

When I switch my thoughts to analytical software, I see the same thing. Corporations purchase or license software like WPS or SAS because of support and completeness. Documentation is a big factor here too. Individuals who don’t have the financial resources to license analytical software like the aforementioned products gravitate towards free software.

I do grudgingly use R when needed, but I prefer WPS over any other analytical software. It’s based on a language that I have used for 30 years and feel very comfortable with. I find it much easier to debug my code, and I like that if I choose to build a product, I know it will run on Windows, OS X, Linux and the mainframe.

When I factor in that I can license WPS for a bit over $3 a day on a Windows or Mac workstation, while our competitor charges just north of $41 a day for your first year (call it roughly $1,100 a year versus about $15,000), I find it compelling to have WPS in my BI stack. I can still use R and Python, but the language of SAS is just too rich and broad to ignore.

About the author: Phil Rack is President of MineQuest Business Analytics, LLC located in beautiful Tucson, Arizona. Phil has been a SAS language developer for more than 25 years. MineQuest provides WPS consulting and contract programming services and is an authorized reseller of WPS in North America.

Analytical Data Marts

Recently, there has been a conversation about what defines “Big Data”. It’s my position (and that of others) that Big Data is data so large that a single computer cannot process it in a timely manner. Hence, we have grid computing. Grid computing is not inexpensive, and it is overkill for many organizations.

The term “Huge Data” has been bandied about as well. In the conversations regarding what is Big Data, it was sort of agreed that Huge Data is a data set that sits somewhere between 10GB and 20GB in size. (Note: In about two years I will look back at this article and laugh about writing that a 20GB data set is huge for desktops and small servers.) The term Big Data is so abused and misused by the technical press and even many of the BI vendors that it’s almost an irrelevant term. But Huge Data has my interest and I will tell you why.

The other day I read a blog article on the failure of Big Data projects. The article talks about a failure rate of 55%. I was not surprised by that kind of failure rate. I was surprised that no solutions were being offered. In the analytics world, especially in finance and health care, we tend to work with data that comes from a data warehouse or a specialized data mart. The specialized data mart is really an analytics data mart, with the data cleaned and transformed into a form that is useful for analysis.
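To make that concrete, here is a minimal sketch, in the language of SAS, of the kind of cleaning and summarizing that lands in an analytics data mart. The library, table and variable names (warehouse.claims, dmart.claims_analytic and so on) are hypothetical, not from any real system:

  /* Shape raw warehouse claims into an analysis-ready table.
     All library, table and variable names are hypothetical. */
  proc sql;
     create table dmart.claims_analytic as
     select member_id,
            count(*)          as claim_count,
            sum(paid_amount)  as total_paid,
            max(service_date) as last_service_date format=date9.
     from warehouse.claims
     where paid_amount is not null   /* drop unpriced claims */
     group by member_id;
  quit;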

Analytical data marts are cost effective. This is especially true when the server that is required is modest compared to the monster databases running on large iron. Departments can almost always afford a smaller server, and they can expect and receive much better turnaround on jobs than most data warehouses provide. Data marts are more easily expandable and can be tuned more effectively for analytics. Heck, I’ve yet to work on a mainframe or large data warehouse that could outrun a smaller server or desktop for most of my needs.

The cost of a WPS server license on a four-, eight- or even sixteen-core analytics data mart is quite reasonable. With WPS on the desktop and a WPS Linux server, analysts can remotely submit code to the data mart and receive the log, listings and graphics right back in their desktop workbench. But the biggest beauty of running WPS as your data mart platform is that WPS comes with all the database access engines as part of the package. If you have worked in a large environment with multiple database vendors, you can see how cost effective this can be when importing data from all those different databases into an analytical data mart.
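As a rough sketch of that remote-submit workflow, assuming a SAS/CONNECT-style link of the kind WPS supports; the server name, port and Oracle connection details below are all placeholders, not a real configuration:

  %let dmart = linuxsrv01 8591;   /* hypothetical host and port */
  options comamid=tcp;
  signon dmart;                   /* connect to the data mart server */

  rsubmit dmart;
     /* a database access engine reads Oracle directly via libname;
        credentials here are placeholders */
     libname dw oracle user=analyst password="XXXXXX" path=dmartdb;

     proc means data=dw.sales n mean sum;
        var revenue;
     run;
  endrsubmit;                     /* log and listing return to the desktop */

  signoff dmart;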

About the author: Phil Rack is President of MineQuest Business Analytics, LLC located in Grand Rapids, Michigan. Phil has been a SAS language developer for more than 25 years. MineQuest provides WPS and SAS consulting and contract programming services and is an authorized reseller of WPS in North America.

Linux, Open Source and Analytics

I spent part of the weekend at a beer tasting party and the other part brushing up my skills on Linux. The beer tasting party was much more rewarding, and it’s almost embarrassing how well we do each year at this thing. I’m proud to say that we are experts in identifying domestic beers!

So how did the Linux research/testing go? Not bad, really. I created a couple of virtual machines (VMs) using Sun’s VirtualBox, one in Ubuntu and one in Fedora 11. After struggling for about an hour to figure out how to get the VirtualBox Guest Additions working in Fedora so I could change display resolutions and work in seamless mode, the rest was pretty straightforward.

I built an analytics VM that I think is pretty nice… especially if you are a student or a professional who wants to test and expand their knowledge of some open source software. I installed a number of open source analytical applications in the Fedora and Ubuntu VMs. These include:

  1. R – open source statistics
  2. Rcmdr – a front-end GUI for R
  3. Rattle – a data mining application for R, from Togaware
  4. Eclipse – a GUI/workbench framework
  5. StatET – an Eclipse plugin that lets you run R, from Walware
  6. OpenOffice – office applications
  7. BIRT – a reporting suite that integrates into Eclipse
  8. PSPP – an open source (and unfinished) SPSS clone

What’s nice about using a VM is that it doesn’t corrupt your current installation. It’s also handy in that you can simply copy the VM onto a DVD or another storage medium, then delete it locally to regain the space the VM was using.

The downside to using a VM is that it is slower than running in native mode. This is especially true when you are doing a lot of disk access to read and write data. The two things that I do like about Sun’s VirtualBox are that I can easily assign a number of CPU cores to the VM and that I can run the VM in seamless mode. Running in seamless mode takes away a lot of the negative views and pain that I have when using Linux at this stage.

About the author: Phil Rack is President of MineQuest, LLC and has been a SAS language developer for more than 25 years. MineQuest provides WPS and SAS consulting and contract programming services and is a reseller of WPS in North America.