Using -1 for Unknown Dimension member with Netezza Can Cause Slowness
There could be a problem when using typical practice dimensional design with the IBM PureData for Analytics server. The extremely fast data warehouse appliance can slow to a stop when distributing on a primary key that uses a single row for unknown. This happens when a large amount of data is associated with this row causing a skew to a single data slice.
A common practice with dimensional modeling is to use a surrogate primary key that consists of sequential meaningless numbers. A business key in the dimension is used to lookup the surrogate key when loading facts. To handle the case where the business key from the fact table is missing or not found in the dimension a typical design is to use an entry using something like -1 that will never collide with the normal numbers.
A table in Netezza is distributed across a number of data slices and when a fact and dimension are joined in a query, their related rows must be physically together on the same slice before parallel processing. If they are not, the rows are redistributed or broadcast so that they are. When there are two very large tables to join, performance can be increased by setting the distribution on each table to the columns they share in the relationship.
If this is done when a large amount of data is associated with the unknown row, that data will be skewed to a single data slice. Since your query is only as fast as the slowest data slice, this will be the bottleneck.
A solution to this problem can be found here. (I never thought I would advocate adding 10,000 dummy records.) When applying this approach to a large dimension table, try to generate negative numbers that span the approximate range as the positive numbers. Otherwise, they could all still skew to the same data slice.