2

I have a daily ETL process in SSIS that builds my warehouse so we can provide day-over-day reports.

I have two servers - one for SSIS and the other for the SQL Server Database. The SSIS server (SSIS-Server01) is an 8CPU, 32GB RAM box. The SQL Server database (DB-Server) is another8CPU, 32GB RAM box. Both are VMWare virtual machines.

In its oversimplified form, the SSIS reads 17 Million rows (about 9GB) from a single table on the DB-Server, unpivots them to 408M rows, does a few lookups and a ton of calculations, and then aggregates it back to about 8M rows that are written to a brand new table on the same DB-Server every time (this table will then be moved into a partition to provide day-over-day reports).

I have a loop that processes 18 months worth of data at a time - a grand total of 10 years of data. I chose 18 months based on my observation of RAM Usage on SSIS-Server - at 18 months it consumes 27GB of RAM. Any higher than that, and SSIS starts buffering to disk and the performance nosedives.

Here is my dataflow http://img207.imageshack.us/img207/4105/dataflow.jpg

enter image description here

I am using Microsoft's Balanced Data Distributor to send data down 8 parallel paths to maximize resource usage. I do a union before starting work on my aggregations.

Here is the task manager graph from the SSIS server

alt text

Here is another graph showing the 8 individual CPUs

alt text

As you can see from these images, the memory usage slowly increases to about 27G as more and more rows are read and processed. However the CPU usage is constant around 40%.

The second graph shows that we are only using 4 (sometimes 5) CPUs out of 8.

I am trying to make the process run faster (it is only using 40% of the available CPU).

How do I go about making this process run more efficiently (least time, most resources)?

Glorfindel
  • 1,213
  • 3
  • 15
  • 22
Raj More
  • 327
  • 1
  • 3
  • 11
  • Could you add a screen shot of the data flow? – billinkc Nov 21 '11 at 15:03
  • @billinkc Here is my data flow http://img207.imageshack.us/img207/4105/dataflow.jpg – Raj More Nov 22 '11 at 13:18
  • 1
    @RajMore, out of curiosity: any progress on this? Please post your solution for community to benefit. – Filip Popović Dec 06 '11 at 15:23
  • I tried to do the `Unpivot` in the TSQL figuring I would share the workload between my database and SSIS servers (and maybe that would somehow coax the data to flow faster). That turned out to be several degrees slower. – Raj More Dec 07 '11 at 22:20

3 Answers3

2

After good suggestions made by bilinkc, and without knowing where bottleneck is I would try another few things.

As You already noted, You should work on parallelism, not processing more data (months) in the same dataflow. You have already made transformations run in parallel, but source and destination (as well as aggregation) is not running in parallel! So read to the end and keep in mind that You should make them run in parallel too in order to utilize your CPU power. And don't forget that You are memory bound (can't aggregate infinite number of months in one batch) so the way to go ("scale-out") is to get chunk of data, process it and put it into destination database as soon as possible. This requires to eliminate common components (one source, one union All) because each chunk of data is limited to speed of those common components.

Source related optimization:

  • try multiple sources (and destinations) in the same dataflow instead of Balanced Data Distributor - You use clustered index on date column so your database server is capable to quickly retrieve data in date based ranges; if you run package on different server than database resides on, You will increase network utilization

Transformations related optimization:

  • Do You really need to do Union All before Aggregate? If not, take a look at Destination related optimization regarding multiple destinations
  • Set Keys, KeyScale, and AutoExtendFactor for Aggregate component to avoid re-hashing - if these properties are incorrectly set then You will see warning during package execution; note that predicting optimal values are easier for batches of fixed number of months than on infinite number (like in your case 18 and raising)
  • Consider aggregating and (un)pivotting in SQL Server instead doing it in SSIS package - SQL Server outperforms Integration Services in these tasks; of course, transformations logic may be such that forbids aggregating before performing some transformations in package
  • if You can aggregate (and pivot/unpivot) (for instance) monthly data in database, try doing it in source query or in destination database with SQL; depending on your environment, writing to separate table in destination database, building index, SELECT INTO with aggregating with SQL might be faster than doing it in package; note that parallelizing such activities will put a lot of pressure on your storage
  • You have a multicast at the end; I don't know how many rows get there, but consider following: write to destination to the right (on the screenshot) then populate records to the destination to the left in SQL query (to eliminate second aggregation and to release resources - SQL Server will probably do it much faster)

Destination related optimization:

  • use SQL Server Destination if possible (package must be run on the same server as database and destination database must be SQL Server); note that it requires exact columns data type match (pipeline -> table columns)
  • consider setting Recovery Model to Simple on your destination (data warehouse) database
  • paralelize destinations - instead of union all + aggregate + destination use separate aggregates and separate destinations (to the same table); here You should consider partitioning your destination table and putting partitions on separate filegroups; if you process data month by month make partitions by month and use partition switching

Seems like I stayed unclear about which way to go with parallelism. You can try:

  • putting multiple sources in single dataflow requires You to copy and paste transformation logic and destination for each source
  • running multiple dataflows in parallel where each dataflow processes only one month
  • running multiple packages in parallel where each package has one dataflow which processes only one month; and one master package to control execution of each (month) package - this is preferred way because You will probably run package only for one month once You get into production
  • or the same as previous but with Balanced Data Distributor and Union All and Aggregate

Before You do anything else, You might want to do a quick test: get your original package, change it to use 1 month, make exact copy that processes another month and run those packages in parallel. Compare it to your original package processing 2 months. Do the same for 2 separate 6-months package and single 12month package at time. It should run your server at full CPU usage.

Try not to overparalellize because You will have multiple writes to destination, so You don't want to start 18 parallel monthly packages, but rather 3 or 4 for start.

And finally, I strongly believe that memory and destionation I/O pressures are the ones to be eliminated.

Please inform us on your progress.

1

Use Process Explorer to reveal some more resource usage (memory and IO). Note that the Disk-IO graph may be a bit misleading since the peaks in the graph are often due to the hard drives caching capabilities, so when disk IO is the bottleneck it doesn't always reveal itself immediately in the graph.

In some cases you can benefit from installing a ram-drive and putting the temp directories there. I've successfully used this one to reduce the time our build-machine used to do a full nightly build and run tests. I'm not sure if SSIS would benefit though.

Louis Somers
  • 616
  • 6
  • 15
  • I will check if they will allow me to install process explorer. I do not think I will benefit from a RAM drive. – Raj More Nov 21 '11 at 14:10
1

(reposting my initial response, have not factored in BDD)

At the end of the day, all processing is bound by one of four factors

  • Memory
  • CPU
  • Disk
  • Network

The first step is to identify what the limiting factor is and then determine whether you can influence it (acquire more of or reduce usage of)

Component choices

The reason your server's memory runs out when you do more than 18 months is related to why it takes so long for it to process. The Pivot and Aggregate transformations are asynchronous components. Every row coming in from the source component has N bytes of memory allocated to it. That same bucket of data visits all the transformations, has their operations applied and is emptied at the destination. That memory bucket is reused over and over again.

When an async component enters the arena, the pipeline is split. The bucket that was transporting that row of data must now be emptied into a new bucket to complete the pipeline. That copying of data between execution trees is an expensive operation in terms of execution time and memory (could double it). This also reduces the opportunity for the engine to parallelize some of the execution opportunities as it's waiting on the async operations to complete. A further slow down to operations is encountered from the nature of the transformations. The Aggregate is a fully blocking component so all the data has to arrive and be processed before the transformation will release a single row to the downstream transformations.

If it's possible, can you push the pivot and/or the aggregate onto the server? That should decrease the time spent in the data flow as well as the resources consumed.

You can try increasing the amount of parallel operations the engine can chose. Jamie's article, SQL CAT's article

If you really want to know where your time is being spent in the data flow, log the OnPipelineRowsSent for an execution. Then you can use this query to rip it apart (after substituting sysssislog for the sysdtslog90)

Network transfer

Based on your graphs, it doesn't appear the CPU or Memory is taxed on either box. I believe you have indicated the source and destination server are on a single box but the SSIS package is hosted and processed on another box. You're paying a not-insignificant cost to transfer that data over the wire and back again. Is it possible to process the data on the source server? You'd need to allocate more resources to that box and I'm crossing my fingers that's a big beefy VM and that's not a problem.

If that's not an option, try settings the Packet Size property of the connection manager to 32767 and talk to network ops about whether jumbo frames are right for you. Both of those tips are in the Tune your Network section.

I suck at disk counters but you should be able to see if the wait types are disk related.

billinkc
  • 276
  • 2
  • 16