Friday, February 24, 2012

Interval/Bucket Dimension

I am trying to find the most efficient way of handelling discreet buckets over continuous/discreet set of values.

In english I am looking to have a dimension that says if age is between 0 and 5 they are an infant, 6 to 18 child and > 18 adult. I know I can do this in a case statement but I have similar senarios that require multiple levels of nesting and the ability to easily change the banding criteria with many fact table sharing the same banding information.

I have done this in the past by preprocessing the fact table and updating the key value based on between joins, but this is very inefficient on large datasets (upwards of 30 million rows).

I also need to be able to do this across continuous values such a monetary amounts.

Can anyone help

Philip Coupar

Dear Philip,

Sorry for interrupting in. Actually, I'm facing the same problem as you did. Have you solved this by any mean? Please share with us if you do. Thanks!

Regards,
Alex|||

Take a look at the DiscretizationMethod property of an attribute

http://msdn2.microsoft.com/en-US/library/ms174810.aspx

Try and build a new attribute that discetizes the coninuous set of values.

Edward Melomed.
--
This posting is provided "AS IS" with no warranties, and confers no rights.

|||Dear Edward,

I've tried looking into the DiscretizationMethod before. However, it only got 3 modes: automatic, EqualAreas and Clusters only. On the other hand, what I am trying to do is using a customizable bucket, said 0~49, 50~99, 100~149 and so on. I can't find any method to control the DiscretizationMethod in this way. Thanks!

Regards,
Alex|||

Your custom buckets are fitting into EqualAreas schenario :)

On the other hand you can create column in the relational database and create custom mapping yourself. Alternatively you can create a named calculation in DSV and use Case statement to map into the values you'd like. http://msdn2.microsoft.com/en-us/library/ms181765.aspx

Edward Melomed.
--
This posting is provided "AS IS" with no warranties, and confers no rights.

|||Dear Edward,

I don't think the EqualAreas can do what I want. The Areas range are calculated by SSAS and I have no control over it. So, it's impossible to group in my way. It'll make sense to me if it provide an option for me to specify the size of each area.

By the way, I'm currently using the CASE method to handle my problem. The drawbacks are a long SQL as I need to break down from 0 to 1000 with an interval of 50. Moreover, I need to insert some dummy records into the Fact table. Otherwise, I will get a discontinuous list if there are not such value in the Fact table. I don't want my user to see the range jumping from 0~49 to 500~549.

Regards,
Alex|||

I agree it is possible to solve this issue as a case statement. It is also possible to solve this problem by using a table that specifies the start and end point of each bucket and allocate the surrogate key neccessary to have a nice dimension table.

However if we take the example where the fact table contains someone's age at the time of an event, we can see that we may first of all break this down by Adult/Child, then have further age bandings under this. The descritization functionality does not accomodate this as it will not build multiple levels, and you cannot assert the start and end ages for each bucket, required for this type of analysis.

I do not really want to find the surrogate key by doing a between join on a dimension table as this would lead to something quite inefficient over very large fact tables. It also does not make sense to me, in terms of performance, to have the case statement in the DSV as this will require a large amount of processing over a degenerate dimension attribute, a significant issue if you need to reprocess 250 Million fact rows.

The original post was asking for ideas on the most efficient ways of achieving this. Edward's posts may help some people who have different decretization requirements, or are handling smaller datasets. Alex is looking at the same issue I referred to in my original post. It may well be that there are no more effcient ways of doing this at the moment, if anyone else has any ideas of how to improve on this kind of requirement I am sure we are all looking forward to hearing from you.

|||

Check out my articles doing aging buckets this way:

... In SSAS with Named Calculations:

http://www.databasejournal.com/features/mssql/article.php/10894_3590866_7

... And a slightly different approach in MSAS 2000 (same logic applies in SSAS 2005):

http://www.databasejournal.com/features/mssql/article.php/3525516

I've done it other ways for clients with specific needs, too. Let me know if you wish further amplification, etc.

Good Luck!

William E. Pearson III
CPA, CMA, CIA, MCSE, MCDBA
Island Technologies Inc.
931 Monroe Drive
Suite 102-321
Atlanta, GA 30308

404.872.5972 Office

wep3@.islandtechnologies.com
wep3@.msas-architect.com

www.msas-architect.com
-- -- --

Publisher Sites:

http://www.databasejournal.com/article.php/1459531

http://www.sql-server-performance.com/bill_pearson.asp

http://www.informit.com/authors/bio.asp?a=862acd62-4662-49ae-879d-541c8b4d656f

http://www.2000trainers.com/section.aspx?sectionID=17

|||

I like the idea of adding the aging logic to the time dimension, this combines the best of the surrogate key approach with the simplicity of the embeded case statements, and would overcome the performance issues around large fact tables.

However this has a very limited application, as within a single UDM model there may be many different custom range requirements, so we may have both transaction aging and the person's age each of which would be banded very differently. and this approach would mean exposing both aging structures on every use of the time dimension.

No comments:

Post a Comment