Tag Archives: Parallel Processing

High Performance Workstations for BI

There’s one thing I really enjoy, and that’s a powerful workstation for performing analytics. It’s fun to play around with, and it can be insightful to speculate on a design and then build a custom higher-end workstation for running BI applications like WPS and R.

Ars Technica Builds

Every quarter, Ars Technica goes through an exercise where they build three PCs, mainly to assess gaming performance, and then do a price-versus-performance comparison. A trend you will soon notice after reading a few of these quarterly builds is that the graphics card plays the major role in their performance assessment. The CPU, the number of cores and fixed storage play only a minor part when comparing the machines.

This of course is in contrast to what we want to do for our performance benchmarks. We are looking at a holistic approach of CPU throughput, disk I/O and graphics for getting the most for the dollar on a workstation build. But Ars does have a lot to recommend when it comes to benchmarking, and I think it’s worthwhile including some of their ideas.

What Constitutes a High End Analytics Workstation?

This is an interesting question and one that I will throw out for debate. It’s easy to get caught up in spending thousands of dollars, if not ten thousand dollars (see the next section), on a workstation. One thing that even the casual observer will soon notice is that being on the bleeding edge is a very expensive proposition. There’s an old adage that you are only as good as your tools. There’s also the adage that it’s a poor craftsman who blames his tools. In the BI world, especially when speed means success, it’s important to have good tools.

As a basis for what constitutes a high end workstation, I will offer the following as a point of entry.

  • At least 4 logical CPUs.
  • At least 8GB of RAM, preferably 16GB to 32GB.
  • Multiple hard drives for the OS, temporary workspace and permanent data set storage.
  • A graphics card that can be used for more than displaying graphics, i.e. parallel computing.
  • A large display – 24” capable of at least 1920×1080.

As a mid-tier solution, I would think that a workstation comprised of the following components would be ideal.

  • At least 8 logical CPUs.
  • A minimum of 16GB of RAM.
  • Multiple hard drives for the OS, temporary workspace and permanent data set storage, with emphasis on RAID storage solutions and SSD caching.
  • A graphics card that can be used for more than displaying graphics, i.e. parallel computing.
  • A large display – 24” capable of at least 1920×1080.

As a high end solution, I would think that a workstation built with the following hardware would be close to ultimate for many (if not most) analysts.

  • Eight to 16 logical CPUs – Xeon class (or possibly a step down to an Intel i7).
  • A minimum of 32GB of RAM and up to 64GB.
  • Multiple hard drives for the OS, temporary workspace and permanent data set storage, with emphasis on RAID storage solutions and SSD caching.
  • A graphics card that can be used for more than displaying graphics, i.e. parallel computing.
  • Multiple 24” displays capable of at least 1920×1200 each.

I do have a bias towards hardware that is upgradeable. All-in-one solutions tend to be one-shot deals and thus expensive. I like upgradability in graphics cards, memory, hard drives and even CPUs. Expandability can save you thousands of dollars over a period of a few years.

The New Mac Pro – a Game Changer?

The new Mac Pro is pretty radical from a number of perspectives. It’s obviously built for video editing, but its small size is radical in my opinion. As a business analytics computer it offers some intriguing prospects. You have multiple cores, lots of RAM and high-end graphics, but limited internal storage. That’s my main criticism of the new Mac Pro. The base machine comes with 256GB of storage, and that’s not much for handling large data sets; you are forced to go to external storage solutions to process them. Although I haven’t priced out the cost of adding external storage, I’m sure it’s not inexpensive.

Benchmarks

This is a tough one for me because organizations have such an array of hardware, and some benchmarks are going to require hardware with specific capabilities: for example, graphics cards that are CUDA-enabled to do parallel processing in R. There’s also the fact that we use the Bridge to R for invoking R code, and the Bridge to R only runs on WPS (and not SAS).

I did write a benchmark a while ago that I like a lot. It provides information on the hardware platform (i.e. the amount of memory and the number of LCPUs available) and runs the basic suite of PROCs that I know are available in both WPS and SAS. Moving to more statistically oriented PROCs such as LOGISTIC and GLM may be difficult because SAS license holders may not have the statistical libraries necessary to run the tests. That’s a major drawback to licensing the SAS System: you are nickel-and-dimed to death all the time. The alternative is to have a workstation benchmark that is specific to WPS.

Perhaps the benchmark can be written so that it tests whether certain PROCs and libraries are available, and also determines whether the required hardware (such as CUDA processors) is present, before running that specific benchmark. Really, the idea is to determine the performance of the specific software on a specific set of hardware, not a comparison between R, WPS and SAS.
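As a rough sketch of that idea, a small probe macro could run a trivial step and check &SYSERR before unleashing the corresponding benchmark section. The macro name and the benchmark call below are my own hypothetical placeholders, and I’m assuming sashelp.class is present; only the &SYSERR test itself is standard WPS/SAS behavior.

%macro probe_glm;
   /* Try a trivial GLM step; if the PROC or its library is absent, */
   /* the step fails and &syserr comes back non-zero.               */
   proc glm data=sashelp.class;
      model weight = height;
   run; quit;

   %if &syserr = 0 %then %do;
      %put NOTE: PROC GLM is available - running the GLM benchmark.;
      /* %glm_benchmark;   <- hypothetical benchmark section */
   %end;
   %else %put NOTE: PROC GLM is not available - skipping this section.;
%mend probe_glm;

%probe_glm;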

Price and Performance Metrics

One aspect of Ars that I really like is that when they do their benchmarks, they calculate the cost comparison for each build, based on hardware pricing at the time of the benchmark. What they don’t do is price in the cost of the software for such things as video editing, etc. I think it’s important to include both hardware and software costs in a price/performance benchmark.

Moving Forward

I’m going to take some time and modify the WPS Workstation Benchmark Program that I wrote so that it doesn’t spew so much unnecessary output into the listing window; I would like it to show only the output from the benchmark report. I think it would also be prudent to see if some R code could be included in the benchmark, to compare and contrast performance when CUDA cores are available to assist in the computations.
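For the listing cleanup, one straightforward approach is to park the routine PROC output in a scratch file with PROC PRINTTO and restore the listing only for the final report. A minimal sketch, with a placeholder path:

/* Send listing output to a scratch file while the timed PROCs run */
proc printto print='c:\temp\benchmark_scratch.lst' new;
run;

/* ... the benchmark PROC steps execute here, unseen ... */

/* Restore the normal listing destination for the benchmark report */
proc printto;
run;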

About the author: Phil Rack is President of MineQuest Business Analytics, LLC, located in Grand Rapids, Michigan. Phil has been a SAS language developer for more than 25 years. MineQuest provides WPS and SAS consulting and contract programming services and is an authorized reseller of WPS in North America.

Services for WPS Conversions and Evaluations

It’s been a while since we’ve reviewed and re-jiggered the services we offer, and the ones we do offer need a bit of a makeover. Although our existing services are still pretty pertinent, we are looking at expanding and rounding out our consulting services in the area of SAS-to-WPS conversions. In the next few weeks we’ll be modifying our website to reflect these changes, as well as creating some marketing brochures that can be downloaded and shared.

Many organizations are looking to significantly reduce software costs over the next two to three years. They don’t necessarily want to change their current architecture, and most want to continue using existing source code whenever and wherever possible. Based on those premises, we’ll soon be putting together a services portfolio that encompasses the following practices.

Work with IT or an organization’s Analytics Department in providing a WPS Proof of Concept.

  1. Evaluate Price/Performance
  2. Define the requirements for an analytical and/or reporting replacement.

Assist in the evaluation of WPS software as a replacement for existing SAS products.

Perform detailed code evaluation on existing SAS user and production code libraries to evaluate compatibility with WPS, and provide or recommend workarounds as necessary.

Recommend hardware and specific configurations for a WPS Server Installation.

Provide SMP libraries for Symmetrical Multi-Processor Hardware.

Install and test The Bridge to R.

Provide guidance to companies who are Data Service Providers on how best to reduce their exposure to SAS DSP fees.

Provide Consulting to departments and users who are focused on particular projects, i.e.

  1. For re-architecting their systems.
  2. For jump start/quick start scenarios.

Although these last two may seem a little “out there” to many people, you would be surprised how common it is for a company to acquire another organization and inherit a system that requires immediate attention in terms of licensing, cost reduction or consulting assistance to move the system to a new platform. It’s also not a rare situation for a company to need to immediately move its source code off of SAS due to DSP issues, escalating server costs or license problems. In these situations, MineQuest and World Programming can be of immense help.

About the author: Phil Rack is President of MineQuest, LLC and has been a SAS language developer for more than 25 years. MineQuest provides WPS and SAS consulting and contract programming services and is a reseller of WPS in North America.

Looking Down the Road

We’ve had the Bridge to R out for about a week now and, so far, the feedback has been very positive. The Bridge is pretty stable and, I think, fairly easy to use once you set up R and install the Bridge itself.

Which brings us to what we may want to include in the next release. We’re thinking about including MPExec, plus a nifty little macro called WPS2XML, which allows you to create an XML file as well as generate a schema (DTD or XSD) from your WPS data set. The code for both utilities is already written and just requires testing in a more stressful environment.

So, what is MPExec? It’s a very robust utility that we wrote for WPS customers last year. MPExec stands for Multi-Processor Execution, and it allows the WPS user to thread their programs so that multiple parts of a program run at the same time. On a multi-core desktop or server, one can dramatically reduce a program’s execution time, depending on how well the program can be threaded.

Most programmers (and SAS programmers especially) think in a top-down fashion when designing their programs. For example, you may have multiple steps to extract and clean data from a database, another set of steps that access data in a transport file created on the mainframe that also needs to be cleaned and sorted, and finally some historical data that is sitting in another database on another server.

None of the three steps outlined above has anything in common with the others (i.e. no data sharing) that would force them to run sequentially. Why not run them all at the same time on your multi-core desktop or server and save time? That’s exactly what MPExec allows you to do.
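To make that concrete, here is a minimal sketch of those three streams using the MPExec macro calls that appear in the benchmark code further down this page. The libnames, engine choices, paths and data set names are all hypothetical; the point is simply that each %startThread/%stopThread block runs as its own process:

%MPExec;

%startThread(db_extract);
    /* Stream 1: extract and clean current data from the database */
    libname dw odbc dsn=warehouse;             /* hypothetical DSN  */
    data work.current;
       set dw.transactions;
       where status = 'OPEN';
    run;
;;;;
%stopThread;

%startThread(mf_transport);
    /* Stream 2: read the mainframe transport file, clean and sort */
    libname xpt xport 'c:\data\mainframe.xpt'; /* hypothetical path */
    proc copy in=xpt out=work;
    run;
    proc sort data=work.mfdata;                /* hypothetical member */
       by acct_id;
    run;
;;;;
%stopThread;

%startThread(history);
    /* Stream 3: summarize historical data from the other server */
    libname hist 'h:\history';                 /* hypothetical path */
    proc means data=hist.accounts;
    run;
;;;;
%stopThread;

%WaitForThreads;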

I’m interested in hearing what other users may want in terms of utilities to enhance and expand their use of WPS. Currently, the Bridge to R and the accompanying utilities are only available on Windows platforms. Is there a need or interest in having them also execute on Linux/Unix/Solaris? Of course, we’re always interested in hearing ideas about how we can expand our utilities to integrate R in a more seamless fashion as well.

About the author: Phil Rack is President of MineQuest, LLC and has been a SAS language developer for more than 25 years. MineQuest provides WPS and SAS consulting and contract programming services and is a reseller of WPS in North America.

MPExec Documentation Preview

We have the first draft of the documentation for MPExec. MPExec is a software add-in that allows for multi-threading WPS applications. I’ll place the document out on the MineQuest website for anyone interested in taking a look at it. Be forewarned: it is a first draft, so expect some grammar and spelling issues.

The documentation is pretty short, only five pages long, but you can easily read it and understand what is required to run MPExec. Adding threading to a WPS program using MPExec is easy: simply add a few statements around existing code segments that are good candidates for concurrent processing, and you can dramatically reduce your application’s execution time.

Right now, we are planning on including it as part of the MineQuest MacroLib. You get the MacroLib for free if you choose to license WPS (for either Windows desktops or Windows servers) from MineQuest directly. The MacroLib includes some other useful applications such as XMLRead and XMLWrite, and the Bridge to R for WPS. We also plan on supporting MPExec on Unix/Linux platforms in the future, and we’ll announce availability when we are thoroughly done testing.

If you’ve licensed WPS from WPC directly or from some other reseller, we can offer an annual license for the MacroLib so you have access to all the goodies as if you had purchased directly from us. Right now, the pricing for the MacroLib is $119 for the desktop version. Pricing for the Windows server version is set at 15% of the annual WPS license fee.

The documentation for MPExec can be found at: http://www.minequest.com/Misc/mpexec_users_guide.pdf

MPExec requires WPS release 2.3.5. By the end of June, we expect to be able to begin fulfilling phone orders. Make sure you check our website for an announcement!

Benchmarking WPS in a Threaded Environment

I finally completed writing MPExec over the weekend. MPExec is the name of the macros that create an environment where you can run multiple threads or processes at the same time. All in all, I have maybe 40 hours in the project, and I’m quite satisfied with the results. I’m sure it would normally have taken me more than 40 hours, but I was able to piggy-back on existing code that I had written for the Bridge to R.

I decided to make MPExec as portable as possible, so I didn’t use any of the outside exec files that I used in the Bridge to R to do such things as check for file existence and spawn new threads. By the way, FILEEXIST is a new function in WPS 2.3.5 and works great! The other thing I see in version 2.3.5 is that I can shut off log messages/notes/source and not have the log fill up with blank lines as in earlier releases. This is fantastic and allows my logs to look professional without lots of white space.

I don’t think I can get the code to execute any faster than it does on my development machine when executing multiple threads. So let’s take a look at some benchmarks. Below is a table that shows how well a Quad-Core PC can execute multiple WPS threads. All times are in minutes and seconds.

 

Threads    1,000,000 Records    2,000,000 Records    5,000,000 Records
           Par / Seq            Par / Seq            Par / Seq
2          0:18 / 0:22          0:18 / 0:28          0:25 / 0:41
4          0:19 / 0:44          0:20 / 0:54          0:42 / 1:28
6          0:22 / 1:06          0:30 / 1:20          0:58 / 2:01
8          0:28 / 1:44          0:34 / 1:56          2:16 / 2:49

A brief explanation is in order so the above table makes sense. I ran each test three times, took the average of the three runs, and rounded to the higher value. I designed the test programs so that one thread created the data and then performed a SORT, MEANS and FREQ on it. The other thread always ran a UNIVARIATE and a CORR against a permanent data set with 600,000 records. I kept this balance of creating temp data sets and using permanent data sets when I had more than two threads. So, when running four threads, I had two CORR and UNIVARIATE jobs and two SORT, MEANS and FREQ jobs running. With six threads, I had three of each, and with eight threads, four of each. For an example of the code, see the bottom of this post.

I also ran the tests sequentially. This gives us the time it would take to run the programs without threading. Comparing the parallel times with the sequential times gives an idea of how much faster the code runs when threaded.

One thing to note: since we are always running the CORR and UNIVARIATE against 600,000 records (and from a different drive array), those times tend to be pretty constant. This is especially true with two and four threads at one or two million records. The time differences start to disappear when we use 5,000,000 records and six or eight threads: the test machine’s temp drive(s) become overwhelmed and the runs are I/O bound.

With a fast drive array for your work space (temp files), you can see some amazing decreases in execution time by using threading. The system I’m running these tests on has a two-drive RAID-0 setup for temp space. If I were to add another drive to that array, I’m sure the execution times with eight threads and five million records would be much lower, perhaps by 30 to 40%.

WPS Code for benchmarking two threads.

%MPExec;

%let iter=1e6;    /* records to generate in the data-creation thread */

%startThread(Job_A);
     /* Thread 1: build a temp data set, then SORT, MEANS and FREQ it */
     data a;
       do ii=1 to &iter;
         a=ranuni(0);
         b=ranuni(0);
         c=ranuni(0);
         d=ranuni(0);
         e=ranuni(0);
         f=ranuni(0);
         g=ranuni(0);
         h=ranuni(0);
         i=ranuni(0);
         aa=round(a*10,1);
         output;
       end;
     run;

     proc sort data=a; by ii; run;

     proc means data=a;
     run;

     proc freq data=a;
       tables aa;
     run;
;;;;
%stopThread;

%startThread(Corr_Run);
     /* Thread 2: UNIVARIATE and CORR against a permanent data set */
     libname tstdata 'c:\wpstestdata\';

     proc univariate data=tstdata.d;
       var j k l;
     run;

     proc corr data=tstdata.d;
     run;
;;;;
%stopThread;

%WaitForThreads;

 

Implementing Threaded Processes in WPS

I’ve been working the last few days on creating code that will allow a WPS user to run parts of a WPS program in parallel. I’ve made a lot of progress and it seems to be working fine. It needs a lot of cleaning up and some tuning yet, but I’m fairly satisfied with the progress.

The good news is that it’s fairly easy to implement in your existing code. Simply adding a few lines of code before and after each segment will spawn a new thread that is executed separately. At the end of all the threads’ execution, we read in the LOG and LST files so the output appears in the host program’s log and list files.

There are a couple of nifty attributes worth mentioning. First, you can pass a macro variable’s value to the spawned program. Second, from your host program you have access to all the work files that were created by the threads. This way, you can use the temp data sets in subsequent steps and PROCs.
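A minimal sketch of those two features, patterned on the benchmark code above; the library, data set and variable names here are hypothetical:

%MPExec;

%let cutoff = 500;                  /* host macro variable */

%startThread(subset);
    /* The host's &cutoff value is visible inside the thread */
    libname src 'c:\wpstestdata\';  /* hypothetical permanent library */
    data work.big_accts;
       set src.accounts;
       where balance > &cutoff;
    run;
;;;;
%stopThread;

%WaitForThreads;

/* Back in the host program: the thread's work data set is usable */
proc print data=work.big_accts(obs=10);
run;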

Performance seems pretty decent on my workstation, which is a quad-core with 8GB of RAM. Once I get the code cleaned up and documented, I’ll run some benchmarks on a server. On my workstation, however, running four tasks with 10 million records each in parallel took 42 seconds; running them sequentially took 1:30 (mm:ss). When I upped that to five tasks with 10 million records and a different mix of PROCs and data steps, the sequential processing time was 2:37 (mm:ss) and the parallel time was 1:23. So you can see, there are some real potential performance increases available.

It will be interesting to see how well this scales with the number of threads. I’m seeing that the larger the data set to be processed, the better the performance. I suppose this is reasonable when you consider that each thread must fire up another instance of WPS, so smaller threads are handicapped by the start-up and initialization of the system.

There are some ugly issues though. Log analyzers are going to be confused by duplicate line numbers (each threaded process will have line numbers that also appear in other threaded processes), and making the log appear in such a way that it’s not just a hard-to-follow mass of text also needs to be addressed.

But these are things that are doable and I’ll provide some more information as time permits.

 

Code Constructs for Running WPS in Parallel

In my last two blog posts, I’ve discussed some issues surrounding running WPS jobs in parallel. One thing that I feel is worth mentioning (and that I see as a WPS reseller) is the number of small servers that SAS runs on. Over the years, almost every server I see is a single- or dual-core model. SAS licensing fees are so horribly high that companies cannot afford to (a) put as many developers on a server as they would like and (b) given the smaller size of the server, do the kind of processing they would like in their analysis and reporting groups.

Of course, more sanely priced software lets you overcome these two hurdles, and that’s where the ability to implement parallel processing in your code comes in. With a larger server, you can run more data and support more users by utilizing multiple cores.

In thinking about the code constructs, I want to make this as simple as possible for any developer to implement. The key is to minimize the number of new keywords introduced and to keep processing on a single machine (whether a desktop or a server).

We can use something along the lines of ParallelR for the keyword constructs to implement running WPS jobs in parallel. Creating a parallel WPS implementation is much easier than implementing the parallel R processing code. We already know such things as the name of the work directory, and we can use the workinit, workterm and work options to control the placement of the work libraries used by the spawned processes. Also, there’s much less checking that has to take place for things such as file types and whether other programs exist.
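Conceptually, each spawned process is just another WPS instance pointed at its own program, log, listing and work location. The command line below is only illustrative; the executable name and the SAS-style options (-sysin, -log, -print, -work) are my assumptions about the WPS invocation, not documented syntax:

/* Hypothetical spawn of one thread as a separate WPS session */
/*   wps -sysin c:\jobs\thread1.sas -log c:\jobs\thread1.log  */
/*       -print c:\jobs\thread1.lst -work c:\temp\thread1     */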

Since I want to reuse as much of my older code as possible, I’m going to frame my processing in a very similar manner. Consider the following code segments:

%ParallelWPS;

  %StartThread();

  proc sort data=mylib.dda0109;
    by descending avgbal;
  run;

  proc print data=mylib.dda0109(obs=50);
  run;

  %StopThread();

  %StartThread();

  proc sort data=mylib.dda0209;
    by descending avgbal;
  run;

  proc print data=mylib.dda0209(obs=50);
  run;

  %StopThread();

%WaitFor;

The above code spawns two parallel processes. Each new process starts with a %StartThread() call and ends with a %StopThread() call. Let’s take a closer look at the keywords.

The Macro %ParallelWPS sets up the environment for parallel processing.

%StartThread() begins the block of WPS/SAS code that will run as its own parallel process.

%StopThread() marks where the code block ends. In other words, everything between %StartThread() and %StopThread() is executed in a single process.

%WaitFor simply waits for both of the programs to terminate and then reads the log and lst files back into the master program.

One nice thing that could be added to the ParallelWPS construct in the future is parameters such as Priority=[Low/Normal/High] or Remote=[MachineName] to control execution even further.

Using these keywords and this style of implementation, I think most users could eyeball their code and decide pretty easily which parts can run in parallel. By spawning threads for execution, the smart WPS developer can achieve huge performance gains in their SAS code and can purchase and fully utilize high-performance hardware for their BI solutions.

 


Programs and Areas of Code that are Good Candidates for Parallelization

What parts of a production system or WPS program are good candidates for running in parallel? I touched on it a bit in the last posting but it’s worth talking about in more detail.

Programs that have substantial data preparation for later processing are usually excellent candidates. Specifically, we are looking for programs and code where it can be determined that the data output is not used by any other process at the same time and thus can be run in parallel. Begin by looking for data steps and procedures such as PROC CIMPORT, PROC CPORT, multiple PROC SORTs, and PROC DATASETS steps where data is being indexed. Also, look at long-running procedures and data steps as potential candidates. Oftentimes, these can be broken into multiple groups and run simultaneously, as sketched below.
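For instance, two independent prep steps like the following could each live in their own thread, using the constructs from the previous post (the data set names are hypothetical):

%ParallelWPS;

  %StartThread();
  /* Long-running sort with no dependency on the other thread */
  proc sort data=mylib.claims;
     by member_id;
  run;
  %StopThread();

  %StartThread();
  /* Index creation on an unrelated data set */
  proc datasets library=mylib nolist;
     modify accounts;
     index create acct_id;
  quit;
  %StopThread();

%WaitFor;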

Another area ripe for parallelization is reporting and the creation of export-type data sets, that is, Excel or MS Access tables created for end users. If the data is processed using BY groups, you can more than likely split those groups into separate threads for processing, as in the sketch below.
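As a sketch of that BY-group split (the report library, branch variable and output paths are hypothetical), each thread exports its own slice of the groups:

%ParallelWPS;

  %StartThread();
  /* Branches A through M go to the first workbook */
  proc export data=rpt.branches(where=(branch le 'M'))
       outfile='c:\reports\branches_a_m.xls'
       dbms=excel replace;
  run;
  %StopThread();

  %StartThread();
  /* Branches N through Z go to the second workbook */
  proc export data=rpt.branches(where=(branch gt 'M'))
       outfile='c:\reports\branches_n_z.xls'
       dbms=excel replace;
  run;
  %StopThread();

%WaitFor;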

Determining what can run in parallel can be tricky if you don’t take the time to really understand your program’s code. Since WPS or SAS processes data in a top-down fashion, take your time and eyeball your code to determine where the natural divisions take place. In other words, try to identify where code is manipulating large amounts of data, where code is analyzing the data, and where code is reporting the data. Based on these three processing segments, you can often rapidly determine what can be processed in parallel.

The other handy item is a log profiler or analyzer program. You can profile the log of an existing program and look at run times to determine where the greatest amount of time is being spent. There are a few macro programs available from SAS to help you do this, and Savian has a log analyzer as well. The Savian log analyzer does not work with WPS at this time, however.

So, those are a few hints on how to identify which programs and program segments are good candidates for running in parallel.

Running SAS and WPS Programs in Parallel

There’s been some discussion on numerous sites (e.g. SAS-L) about running SAS programs, or pieces of programs, in parallel. It’s an interesting discussion, and for the many who work on the development and implementation of production systems on multi-core servers, it makes a lot of sense.

So why would you want to run your programs in parallel? Quite simply, to save time and to use your hardware resources most efficiently. In the work I tend to get involved in, I often see opportunities to reduce processing time by 20 to 40 percent. Here’s a real-world example: I have a production job that downloads files from the mainframe to a Windows server. There are actually 16 transport files that are downloaded, appended and indexed for the actual conversion and processing on the Windows server. Running these processes sequentially takes over 25 minutes; running them in parallel takes about nine minutes.
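Using the thread constructs shown earlier on this page, a stripped-down sketch of that job (hypothetical file names, and only two of the sixteen transport files shown) might look like this:

%ParallelWPS;

  %StartThread();
  /* Convert the first mainframe transport file */
  filename t1 'c:\downloads\tran01.xpt';
  proc cimport infile=t1 library=work;
  run;
  %StopThread();

  %StartThread();
  /* Convert the second transport file at the same time */
  filename t2 'c:\downloads\tran02.xpt';
  proc cimport infile=t2 library=work;
  run;
  %StopThread();

%WaitFor;

/* Appending and indexing happen here, after all threads finish */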

Another real-world example is the reporting of financial portfolios. One of my clients maintains six portfolios that they track and analyze separately. However, there’s some cross-pollination between the portfolios because some customers can be in more than one portfolio. At the end of the job sequence, dozens of graphs and reports are created for each portfolio. Dozens of data sets are exported to Excel and Access databases for other departments to review. Also created are hundreds of branch reports and spreadsheets for branch managers to review and analyze. All the reporting and ancillary database creation is run in parallel. The time savings for this part of the process is almost 30 minutes.

In a SAS environment, running parallel processes can take two forms: using SAS/CONNECT or the new Table Server. But looking at some of the documentation, I’m not so sure how straightforward the Table Server product is going to be to use, and it’s still up in the air whether the product is included in Base SAS or is an additional license requiring more money. Besides the new SAS Table Server, you can always run your SAS code in parallel using SAS/CONNECT. Again, this is an extra cost and requires you to license that module to run your code in parallel.
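For SAS users who do license SAS/CONNECT, the MP CONNECT pattern of asynchronous remote submits is the usual route. A minimal sketch, with hypothetical library paths and the data set names from the earlier example:

options autosignon=yes sascmd='!sascmd';

rsubmit task1 wait=no;
   /* Each rsubmit runs in its own spawned SAS session */
   libname perm 'c:\data';          /* hypothetical path */
   proc sort data=perm.dda0109; by acct; run;
endrsubmit;

rsubmit task2 wait=no;
   libname perm 'c:\data';          /* hypothetical path */
   proc sort data=perm.dda0209; by acct; run;
endrsubmit;

waitfor _all_ task1 task2;
signoff _all_;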

This is all fine and dandy if you’re a SAS user, because you’re used to throwing money at SAS for your solutions. But what if you are a WPS user? Can you run programs or sections of programs in parallel? The answer is most certainly "yes." It’s a matter of understanding your code and writing it so you can make the most of an environment that supports this methodology. The second factor is having a set of macros that makes it easy to implement.

Creating a set of macros that provide you with the ability to run your WPS code in parallel is something MineQuest can do. Based on the Bridge to R, where we run R code in parallel, we can do nearly the same thing with WPS. This could be another tool in your arsenal for running code faster and more efficiently using WPS.