You've built your data warehouse, created your OLAP cube, written
your ETL scripts to populate your relational database. Now come the
famous questions: "How long will it take to process the cube?", "How
much storage do I need to store the cube?", "what if I throw these
additional aggregations to the mix?"
Of course, there is always the popular rule of thumb that
gives you a 1.5-2M rows per minute processing time and a size of about
1/4th to 1/6th the size of the relational database (without indexes)
for the OLAP store. But what if you need more precise metrics? There
are two ways to get better results: either you get your relational
store filled with real data (most accurate, but not always possible in
the development phase) or you have a way to fabricate a high volume of
fake data.
The remainder of this post explores a way to create a high
volume of test data, which is pretty much what you need to a) have a
tangible measurement of how long it takes to process the cube, and b)
determine if you have adequate hardware resources to support your app
(CPU, RAM and disk storage wise).
I've been using Visual Studio for Database Professionals (http://msdn2.microsoft.com/en-us/teamsystem/aa718807.aspx)
and its data generation feature to create high volumes of test data. My
goal was to populate 30+ SQL tables (used to create about 10
dimensions) and my fact table with 20M rows. My initials tests have
been quite fustrating, as the process of generating data threw an
Out-of-memory exception after roughly 3M rows. (Configuration:
Dual-core T7200, 3GB RAM, Windows Server 2003 Enterprise Edition x64,
SQL Server 2005 with SP2 and Cumulative update package 4). I've tried
multiple times on various hardware platforms (some with 8GB RAM), but
no matter what I did, I always ended up with this memory exception.
Usage of memory would slowly increase over time, until it reached a
point where it threw an exception and stopped the generation process.
AS2005 best practices teach us that a good design for a fact
table should only have numeric aggregatable values and foreign keys to
dimension tables. Indeed, that was exactly my situation.
But it looked like VSDBPro had issues with handling the foreign keys. Here are the steps I followed to get around this problem:
- Install PowerTools for VSDBPro (http://blogs.msdn.com/gertd/archive/2007/08/07/it-is-august-6th.aspx)
- In your SQL Server relational schema, drop relationships
between your fact table and your dimension tables (I used a copy of my
original database schema). Make sure that there are no foreign key
constraints between the fact table and the dimension tables. The
screenshots below show an example with the AdventureWorks database that
ships with the SQL Server samples.
- In VSDBPro, import your new database schema
- Create a data generation plan for your dimensions. I usually
have two data generation plans: one for generating dimensions, the
other one for generating data for the fact table.
- Create another dgen plan for your fact table. For each
dimension key ield, in the fact table column details, replace the
foreign key generator with a sequential data bound generator.
- Provide the SELECT statement that will retrieve keys from your dimension tables.
- I then run the dimensions dgen plan, followed by the fact dgen plan.
Notice that only the fact table is selected in the second plan. Had you
used foreign keys, you'd have to generate both the dimensions tables
and the fact table at the same time (VSDBPro can not create keys based
on existing data. It has to generate both the primary and the foreign
keys).
Back to my 20M rows fact table. On my machine, although memory usage
slowly increased to 91%, it remained steady until it completed
generation of my 20M rows. It took a couple of hours to generate that
amount of data, but it was worth the wait! Now I can answer the
questions raised at the beginning of this post with tangible numbers. I
now know exactly the amount of disk space used for storage as well as
the time it takes to process my cube and how much memory is consumed in
this process.
Good luck!