Category Archives: Cloud Computing

Disruptive Analytics

A few days ago I picked up Disruptive Analytics, Thomas Dinsmore’s recent book (available on Amazon), and thought I would share my impressions. Note that this is not a review! First, I really enjoyed the history of the analytics platforms. The second and third chapters (History and Open Source, respectively) were very informative, and I learned a few things!

Regarding Open Source, I agree that we will see Python supplant R as the “go to language” for analytics in the Open Source arena. It might take a few years, but if my customers’ interests are indicative of this trend, it will happen.

Dinsmore does an admirable job in Chapter 4 on Hadoop. The chapter is fairly dense reading for me, mainly because it packs in a lot of terms and definitions. If you were ever looking for an overview of the Hadoop ecosystem, this is probably a good place to start.

The other chapter I really liked was Chapter 6, which deals with streaming analytics; I believe we are just in the infancy of this revolution. Smart Cities will be a very visible platform for many people to see and benefit from streaming analytics.

I would like to see a future edition cover the role of the analytics workstation and flash memory in the analytics framework. Data scientists who are developing algorithms and processing data often use workstations in lieu of servers. Perhaps even a few pages on how NVIDIA is revolutionizing the analytics world with CUDA processing on high-powered workstations. I think I would enjoy that.

About the author: Phil Rack is President of MineQuest Business Analytics, LLC, located in beautiful Tucson, Arizona. Phil has been a SAS language developer for more than 25 years. MineQuest provides WPS and SAS consulting and contract programming services and is an authorized reseller of WPS in North America.

How Does BI Perform in a Virtual Machine?

I’ve been working on setting up some baseline tests of how virtual machines perform running WPS. This proved to be a bigger PITA than I originally thought because the Linux VM I had was too small to do any testing with data sets of appreciable size. Since there aren’t any easy-to-use tools for expanding a VirtualBox VM partition, I ended up just recreating the VM with a larger footprint.

What I wanted to test is how much faster WPS is when running on the native host, and to compare and contrast those timings with WPS running in an XP VM and in a Linux Fedora VM, both running under Sun’s VirtualBox software.

The VMs are set up to be as identical as I can make them. Both have 3GB of RAM, and each VM has two logical CPUs dedicated to it. The host system is an Intel quad-core Q6600 with 8GB of RAM and 1.2TB of hard drive space spanning four drives in a RAID-1 configuration. I set the graphics card parameters to be identical in each machine as well.

One thing I did learn in this exercise is that there’s a fair amount of tuning you can do to a VM, from setting the number of cores dedicated to the VM to installing a real storage/I/O driver inside the VM, all to get the best performance you can.

One area where I made a mistake early on was trying to use a shared folder as the temporary disk space for WPS. VirtualBox is slow as molasses reading and writing to a shared folder; performance improved dramatically once I had WPS use a temp folder inside the VM rather than the shared folder.
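For anyone reproducing this, the change amounts to keeping WPS’s work location and data libraries on a disk inside the guest rather than on a VirtualBox shared folder. Here is a minimal sketch of what that might look like, with made-up paths, and assuming WPS honors a SAS-style -work startup option (an assumption on my part):

```sas
/* Hypothetical invocation: point the WORK library at a scratch
   directory on the guest's own virtual disk rather than a shared
   folder (assumes WPS, like SAS, accepts a -work startup option):

      wps benchmark.sas -work /home/phil/wpswork
*/

/* Keep permanent data sets on a guest-local disk as well, not on a
   VirtualBox shared folder such as /media/sf_shared */
libname perm "/home/phil/wpsdata";
```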

What I ended up doing was writing a simple benchmark program that invokes as many WPS PROCs as I could (and that I typically use), along with some DATA step and SQL steps. I’m not trying to be exhaustive with the benchmark, but I want to get a feel for how VirtualBox performs in contrast to running the same application in a non-VM environment. I also wanted a reasonably realistic number of records for the test runs, so I decided to run a 100,000-record test and then a 1,000,000-record test.
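To give a flavor of what was timed, here is a stripped-down sketch in the same spirit as the benchmark (not the actual program); the data set, variable names, and record count are placeholders, and OPTIONS FULLSTIMER is what writes the real and CPU times for each step to the log:

```sas
options fullstimer;  /* report real and CPU time for each step in the log */

/* Build a 1,000,000-row test data set with a mix of numeric variables */
data work.bench;
   do id = 1 to 1000000;
      x1 = ranuni(1);
      x2 = rannor(1);
      x3 = int(ranuni(1) * 10);   /* low-cardinality variable for class/freq tests */
      output;
   end;
run;

/* A few of the kinds of steps that were timed */
proc means data=work.bench n mean std min max;
   var x1 x2;
run;

proc freq data=work.bench;
   tables x3;
run;

proc sort data=work.bench out=work.bench_sorted;
   by x1;
run;

proc sql;
   create table work.subset as
   select * from work.bench
   where x1 < 0.10;              /* simple WHERE returning ~10% of rows */
quit;
```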

I’m afraid the table below will wrap unless you have your text size set to small. Just in case it does, you can also view the results as an Excel spreadsheet by clicking here.

Below are the results from the benchmark program (all times are in seconds).

 

| Procedures and Data Steps | Obs. | Vista x64 Native (Real) | Vista x64 Native (CPU) | XP 3GB Vbox (Real) | XP 3GB Vbox (CPU) | Fedora 3GB Vbox (Real) | Fedora 3GB Vbox (CPU) |
|---|---|---|---|---|---|---|---|
| Statistically Oriented Procedures | | | | | | | |
| Proc Corr – 12 Vars | 100K | 0.453 | 0.452 | 0.57 | 0.55 | 0.586 | 0.135 |
| | 1,000K | 4.252 | 4.258 | 5.187 | 5.157 | 5.718 | 1.804 |
| Proc Means – 19 Vars | 100K | 0.184 | 0.202 | 0.27 | 0.26 | 0.358 | 0.122 |
| | 1,000K | 1.768 | 1.887 | 2.343 | 2.323 | 3.383 | 0.425 |
| Proc Summary – 19 Vars | 100K | 0.131 | 0.14 | 0.27 | 0.18 | 0.32 | 0.03 |
| | 1,000K | 1.357 | 1.404 | 1.802 | 1.782 | 3.349 | 0.439 |
| Proc Univariate – 19 Vars | 100K | 1.484 | 1.513 | 2.343 | 2.133 | 2.235 | 1.441 |
| | 1,000K | 13.768 | 14.258 | 18.957 | 18.887 | 18.327 | 11.127 |
| Proc Standard – 12 Vars | 100K | 0.493 | 0.499 | 0.65 | 0.64 | 0.922 | 0.17 |
| | 1,000K | 5.006 | 5.257 | 11.626 | 6.99 | 8.829 | 1.459 |
| Proc Rank – 12 Vars | 100K | 0.425 | 0.452 | 0.62 | 0.61 | 1.016 | 0.348 |
| | 1,000K | 5.129 | 5.288 | 11.416 | 7.57 | 13.047 | 4.494 |
| Proc Freq – two var crosstab | 100K | 0.157 | 0.156 | 0.3 | 0.21 | 0.391 | 0.078 |
| | 1,000K | 1.567 | 1.638 | 2.193 | 2.183 | 3.634 | 0.491 |
| Data Manipulation | | | | | | | |
| Proc Append – Two 50K obs data sets, 24 Vars | 100K | 0.059 | 0.046 | 0.11 | 0.1 | 0.216 | 0.009 |
| | 1,000K | 0.459 | 0.53 | 11.326 | 1.231 | 2.285 | 0.127 |
| Proc Sort – 24 Vars | 100K | 0.153 | 0.171 | 0.34 | 0.32 | 1.028 | 0.167 |
| | 1,000K | 1.269 | 1.856 | 20.349 | 3.845 | 12.235 | 2.893 |
| Proc Datasets – Create Simple Index | 100K | 1.725 | 1.809 | 1.612 | 1.462 | 8.335 | 0.796 |
| | 1,000K | 22.937 | 23.384 | 18.797 | 18.396 | 117.437 | 6.967 |
| Proc SQL – Simple Where Returns 10% of Obs. | 100K | 0.074 | 0.093 | 0.12 | 0.12 | 0.269 | 0.033 |
| | 1,000K | 0.871 | 0.92 | 1.792 | 1.261 | 2.606 | 0.241 |
| Data Step Create Records | 100K | 0.341 | 0.343 | 0.56 | 0.48 | 0.556 | 0.261 |
| | 1,000K | 3.294 | 3.307 | 8.872 | 4.676 | 5.874 | 3.587 |
| Proc Transpose – 1 Var, 1 ID var, by var | 100K | 2.779 | 2.979 | 3.184 | 3.124 | 3.791 | 3.12 |
| | 1,000K | 27.263 | 28.204 | 32.156 | 31.815 | 34.932 | 29.461 |
| Proc SQL – Join | 100K | 0.266 | 0.28 | 0.731 | 0.55 | 0.805 | 0.095 |
| | 1,000K | 5.783 | 6.676 | 35.1 | 13.669 | 23.381 | 3.72 |
| Data Step Merge | 100K | 0.407 | 0.39 | 1.291 | 0.68 | 0.932 | 0.199 |
| | 1,000K | 3.883 | 4.087 | 9.233 | 6.719 | 6.197 | 2.712 |
| Reporting | | | | | | | |
| Proc Tabulate – 2 Vars | 100K | 0.238 | 0.202 | 1.001 | 0.25 | 0.442 | 0.165 |
| | 1,000K | 2.11 | 2.262 | 2.984 | 2.804 | 4.458 | 1.88 |
| Proc Print – 1,000 Obs. 24 Vars | 100K | 0.09 | 0.093 | 0.931 | 0.1 | 0.119 | 0.09 |
| | 1,000K | 0.087 | 0.062 | 0.1 | 0.09 | 0.118 | 0.095 |
| Data Access | | | | | | | |
| Create Transport File – 24 Vars | 100K | 0.254 | 0.187 | 0.69 | 0.3 | 0.62 | 0.063 |
| | 1,000K | 2.973 | 1.856 | 9.103 | 3.304 | 6.204 | 1.056 |
| Read Transport File – 24 Vars | 100K | 0.232 | 0.249 | 0.38 | 0.37 | 0.633 | 0.197 |
| | 1,000K | 2.341 | 2.355 | 7.3 | 4.025 | 6.351 | 1.412 |
| Create SPSS file – 24 Vars | 100K | 0.167 | 0.156 | 1.061 | 0.34 | 1.29 | 0.154 |
| | 1,000K | 1.707 | 1.887 | 5.467 | 3.505 | 8.55 | 0.769 |
| Read SPSS file – 24 Vars | 100K | 0.222 | 0.218 | 0.971 | 0.35 | 0.873 | 0.12 |
| | 1,000K | 2.19 | 2.308 | 7.34 | 3.975 | 7.451 | 1.657 |
| Proc Import – CSV; 24 Vars | 100K | 0.816 | 0.811 | 1.161 | 1.091 | 1.386 | 0.446 |
| | 1,000K | 8.262 | 8.58 | 12.788 | 11.055 | 13.74 | 7.045 |
| Proc Export – CSV; 24 Vars | 100K | 1.816 | 1.887 | 2.343 | 2.153 | 3.077 | 2.244 |
| | 1,000K | 17.987 | 18.517 | 22.882 | 21.971 | 30.306 | 20.288 |
| Data Management | | | | | | | |
| Proc Delete | 100K | 0.002 | 0 | 0.1 | 0.1 | 0.014 | 0.005 |
| | 1,000K | 0.11 | 0 | 0.2 | 0.2 | 0.034 | 0.004 |
| Proc Copy – 24 Vars | 100K | 0.88 | 0.109 | 0.24 | 0.23 | 0.62 | 0.063 |
| | 1,000K | 2.973 | 1.856 | 9.103 | 3.304 | 4.629 | 0.187 |
| Proc Contents | 100K | 0.002 | 0 | 0 | 0 | 0.003 | 0.001 |
| | 1,000K | 0.008 | 0 | 0.781 | 0 | 0.002 | 0 |
| Graphics | | | | | | | |
| Proc Plot | 100K | 0.12 | 0.14 | 0.64 | 0.22 | 0.453 | 0.083 |
| | 1,000K | 1.243 | 1.294 | 2.113 | 2.093 | 4.147 | 0.577 |
| Proc Chart | 100K | 0.06 | 0.062 | 0.16 | 0.11 | 0.217 | 0.041 |
| | 1,000K | 0.594 | 0.624 | 1.171 | 1.131 | 1.981 | 0.269 |
| Proc Gplot | 100K | 0.258 | 0.28 | 0.46 | 0.41 | 1.135 | 0.179 |
| | 1,000K | 2.435 | 2.371 | 5.247 | 3.785 | 9.985 | 1.319 |
| Proc Gchart | 100K | 0.081 | 0.109 | 0.14 | 0.11 | 0.237 | 0.049 |
| | 1,000K | 0.613 | 0.592 | 1.131 | 1.121 | 2.046 | 2.22 |
| Total of all Times | | 158.608 | 161.546 | 302.108 | 206.42 | 394.115 | 119.629 |

 

One of the first things that jumps out at me is the elapsed time for creating an index in the Linux VM: the test took 117 seconds to complete. At first I thought this might just be an artifact of running a large data set in a virtual machine, but I now believe it reflects a first release of WPS on Linux, and that this aspect needs some performance tuning.
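For reference, a simple-index step of the sort I timed can be sketched roughly as follows (hypothetical data set and key variable):

```sas
/* Sketch of a simple-index step (hypothetical data set and key variable) */
proc datasets library=work nolist;
   modify bench;
      index create id;     /* build a simple index on a single variable */
quit;
```

Given that the Fedora VM needed roughly 117 seconds of real time but under 7 seconds of CPU time for the 1,000K run, most of that gap looks like I/O wait inside the VM.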

The other issue I see (and I don’t have the CIS background to figure this out on my own) is the CPU time for the Linux tests. Those numbers seem rather small to me on many of the PROCs I benchmarked, especially when you contrast them with the Windows tests.

Other than those two issues, nothing else really jumps out at me. The performance timings you see in the chart are pretty indicative of how other applications run in terms of real time. I will say that with some work, and perhaps by throwing $1,000 at the problem, I could get the timings for the VMs to drop by 30% or 40%. I’d start by adding the Intel Matrix Storage Controller to the VMs, as well as adding enough disk storage to have two additional virtual disks. That way I could have a separate temp work space as well as a disk for my permanent files, read and write simultaneously, and I’m certain I could get I/O down considerably.

In my opinion, if you have modest-sized data sets that you process on a regular basis (modest meaning no more than a few million records), running WPS or most any other BI application in a VM is quite feasible. The reason for running a BI application in a VM is to save money, in that you only pay for the cores that you need for your application and not all the cores on the server. Hence, if you have a 16-core server and you only need six cores for your research group, you only pay for six cores. That holds for WPS licensing but not SAS licensing. The greedy folks at SAS want you to pay for all 16 cores whether you use them or not.

I hope this information helps those folks who are looking at WPS and thinking about running it in a VM. I could have included some other PROCs such as LOGISTIC, REG, and COMPARE, but quite frankly I became exhausted doing this study.

About the author: Phil Rack is President of MineQuest, LLC and has been a SAS language developer for more than 25 years. MineQuest provides WPS and SAS consulting and contract programming services and is a reseller of WPS in North America.

Cloud Computing and Integration

In my previous post, I wrote a little about Cloud Computing and BI and a couple of scenarios where such a service could pay some dividends. Those scenarios are actually real-world examples of how Cloud Computing could be applied. But what all those scenarios lacked is the need to have Cloud Computing as part of your everyday arsenal or toolkit, that is, having the cloud available at a moment’s notice to help solve your real-world problems.

But what do you do if you have a problem that is too big for your laptop or desktop to solve? What can you do if you don’t have adequate storage space to process all your data? Imagine having a large, powerful server up and running in a few seconds by just clicking a few buttons and connecting to your own server in the cloud. Suddenly, you have access to terabytes of storage, gigabytes of RAM, and lots of processors to offload your work. As an added service (or product, whatever you want to call it), this has a lot of appeal.

There are BI companies doing this to some degree or another. Mathematica offers such a service, and as far as I can tell it’s fairly transparent: the ability to connect to Amazon’s EC2 is built into their system. Of course, this doesn’t come free (though it does come cheap), because there are additional charges to use EC2. But Amazon has commoditized cloud computing, so hourly charges are quite reasonable.

Speaking of Amazon’s EC2, Amazon has announced an Eclipse plug-in that makes setting up a remote server on EC2 a point-and-click affair. If you’re lucky enough to be using Eclipse as your IDE, you have access to a number of valuable plug-ins to extend your environment.

So I strongly suspect that in the near future we’ll see more BI and analytics companies extending their products into the cloud. Whether they decide to build their own server farms or leverage clouds like Amazon’s will be a matter of taste and control. However, making a compute-intensive service broadly available to customers will be a big plus in the mind of the client, and it will also become an additional revenue stream for these companies to exploit.

Cloud Computing and BI

One of the interesting aspects of cloud computing is the ability to lower your cost of doing business and, for software vendors, to make software products and services available to more people 24×7. Cloud Computing is gaining attention for a number of reasons, but the most prevalent are the low cost of entry and the sheer convenience of having such services available.

I’ve had a hard time getting my arms around Cloud Computing as it applies to BI, and specifically to SAS programming. I think the reason for this is the way I use my own tool sets, which has colored my ability to see how Cloud Computing can be utilized. However, looking back a few years and examining how some of my clients used SAS (or WPS, for that matter) is somewhat revealing.

First, let’s examine some ideas on how organizations could use the cloud. The most obvious one to me is to use the cloud to handle situations where you just don’t have the resources in-house (server hardware, software, and application licenses) to handle a project and you need more capacity than you currently have.

The other situation is one where you know that, a few times a year, demand for one of your applications will spike. For example, perhaps you have hundreds of users who have to enter data into a system for sales forecasting and goal setting. Or maybe you’re an educational organization that needs to collect data from government-funded daycares and preschools a few times a year, with a lot of clients entering data. Maintaining the infrastructure in-house to do these tasks just a few times a year isn’t rational from a cost perspective.

Another example that I’m familiar with is a non-profit organization that is mandated to create an annual booklet on the state of health care in the region. The organization compares healthcare utilization for the region against state as well as federal usage. It takes two people about ten weeks to create all the tables and graphs for the annual book. This is the only use of the SAS license at this organization, and they spend almost $7,000 a year for a two-user license.

I’ve not talked about how developers might want to use the cloud for stress testing their applications, or how companies could develop vertical-market applications and use the cloud for hosting. The huge bandwidth and cheap servers that a provider like Amazon offers make such things a reality.

So, wouldn’t it be nice if these organizations were able to pay only for the hardware and software resources they need? I imagine a savvy BI firm could create a standard Amazon Machine Image with their product preinstalled, along with a few utilities like compression programs and PDF readers, and open up a whole new line of business as well as a new sales channel.

In my view, not offering such a product/service is just leaving money on the table. The most likely recourse a customer has, if you don’t offer such a service, is to find some other way of handling the problem: most likely moving to a new, lower-cost BI platform or rewriting the application to take advantage of other economies of scale. Paying a nominal hourly or monthly fee to access and use your products makes good sense for both the client and the software company.

 
