Back to Declic Consulting web site.. Data Generation with VSDBPro

Data Generation with VSDBPro

by patricet28 3. January 2008 09:23

You've built your data warehouse, created your OLAP cube, written your ETL scripts to populate your relational database. Now come the famous questions: "How long will it take to process the cube?", "How much storage do I need to store the cube?", "what if I throw these additional aggregations to the mix?"

Of course, there is always the popular rule of thumb that gives you a 1.5-2M rows per minute processing time and a size of about 1/4th to 1/6th the size of the relational database (without indexes) for the OLAP store. But what if you need more precise metrics? There are two ways to get better results: either you get your relational store filled with real data (most accurate, but not always possible in the development phase) or you have a way to fabricate a high volume of fake data.

The remainder of this post explores a way to create a high volume of test data, which is pretty much what you need to a) have a tangible measurement of how long it takes to process the cube, and b) determine if you have adequate hardware resources to support your app (CPU, RAM and disk storage wise).

I've been using Visual Studio for Database Professionals (http://msdn2.microsoft.com/en-us/teamsystem/aa718807.aspx) and its data generation feature to create high volumes of test data. My goal was to populate 30+ SQL tables (used to create about 10 dimensions) and my fact table with 20M rows. My initials tests have been quite fustrating, as the process of generating data threw an Out-of-memory exception after roughly 3M rows. (Configuration: Dual-core T7200, 3GB RAM, Windows Server 2003 Enterprise Edition x64, SQL Server 2005 with SP2 and Cumulative update package 4). I've tried multiple times on various hardware platforms (some with 8GB RAM), but no matter what I did, I always ended up with this memory exception. Usage of memory would slowly increase over time, until it reached a point where it threw an exception and stopped the generation process.

AS2005 best practices teach us that a good design for a fact table should only have numeric aggregatable values and foreign keys to dimension tables. Indeed, that was exactly my situation.

But it looked like VSDBPro had issues with handling the foreign keys. Here are the steps I followed to get around this problem:

  • Install PowerTools for VSDBPro (http://blogs.msdn.com/gertd/archive/2007/08/07/it-is-august-6th.aspx)
  • In your SQL Server relational schema, drop relationships between your fact table and your dimension tables (I used a copy of my original database schema). Make sure that there are no foreign key constraints between the fact table and the dimension tables.  The screenshots below show an example with the AdventureWorks database that ships with the SQL Server samples.


  • In VSDBPro, import your new database schema
  • Create a data generation plan for your dimensions. I usually have two data generation plans: one for generating dimensions, the other one for generating data for the fact table.
  • Create another dgen plan for your fact table. For each dimension key ield, in the fact table column details, replace the foreign key generator with a sequential data bound generator.
  • Provide the SELECT statement that will retrieve keys from your dimension tables.



  • I then run the dimensions dgen plan, followed by the fact dgen plan. Notice that only the fact table is selected in the second plan. Had you used foreign keys, you'd have to generate both the dimensions tables and the fact table at the same time (VSDBPro can not create keys based on existing data. It has to generate both the primary and the foreign keys).


Back to my 20M rows fact table. On my machine, although memory usage slowly increased to 91%, it remained steady until it completed generation of my 20M rows. It took a couple of hours to generate that amount of data, but it was worth the wait! Now I can answer the questions raised at the beginning of this post with tangible numbers. I now know exactly the amount of disk space used for storage as well as the time it takes to process my cube and how much memory is consumed in this process.

Good luck!

 

Be the first to rate this post

  • Currently 0/5 Stars.
  • 1
  • 2
  • 3
  • 4
  • 5

Tags:

Comments

Add comment


(Will show your Gravatar icon)  

  Country flag

biuquote
  • Comment
  • Preview
Loading



Powered by BlogEngine.NET 1.4.5.0
Theme by Mads Kristensen

TextBox

Tag cloud

RecentPosts