There are 2 suggestions I hear about when performance tuning the Lookup transformation on a large dimension table when getting the surrogate key for a fact table. I hope to introduce here a third.
The first is the caching option.
The Full Cache option will cache the whole table (or query columns). If you have a large dimension table, this might use a lot of memory.
The Partial Cache will load the matching rows but will clear least used lookup values once the memory size of the cache is exceed.
No Cache will query the data each and every time a lookup value is needed. No good is running lots of lookups on a large table
I believe using Full Cache is good when you have no memory limitations and a small dimension table. The Partial Cache is good for large tables. Not sure where No Cache would be used unless the lookup is not used much in the Data Flow Task.
The next suggestion is instead of selecting the whole table, use a T-SQL query to select only the columns you need,
By using only the needed columns in the SELECT T-SQL statement, you limit the amount of information retrieve into SQL Server’s Buffer Pool and the SSIS cache.
One more option can help improve this query even more – a proper index.
SELECT ProductSKey, ProductID
FROM DimProduct
WHERE Status = 1
This query used above will still do a Clustered Index scan or Non-clustered Index with a Key Lookup in the execution plan based on the indexes available like an index on the just the status column.
But, if you can create a covering index with the Status as a Non-Clustered Index including ProductSKey (surrogate key) and ProductID (natural key) in the index, you can get better performance on the SQL instance side.
The cost of the query went from 0.0178649 to 0.0065309. The statistics IO went from 21 logical reads to 7 and the execution went from a Clustered Index scan to a Non-clustered Index Seek.
Using the partial cache and proper index on dimension table helped reduce a look up on one of our dimension tables from 3 seconds to less than one second on a 200,000+ row dimension table at my current employer. The execution performance was viewed in the new SSISDB statistics report in the 2012 version of SSIS. More on that in another blog.