I used a decision-tree mining model to describe and predict fraud. The table contains 1039 records with 775 distinct values of A-number (the calling party). I used 9 columns in the model. SQL Server reports that only 3 columns are significant in predicting fraud:
- BPN_is_too_short (called party-number is too short)
- Duration_is_zero
- Invalid_area_code
The key column is A-number, and the predicted column is Is_Fraud, whose only values are 0 and 1. No record has NULL (a missing value) in the Is_Fraud column.
For the first split, the Mining Legend shows these case counts:
- 625 cases of fraud
- 150 cases of non-fraud
- 0 cases of missing
In addition, the Mining Legend shows these percentages:
- 79.69% fraud
- 19.64% non-fraud
- 0.67% missing
Now when I compare those values, they don't match.
(A) 625/775 is 80.645%, not 79.69%
(B) 150/775 is 19.355%, not 19.64%
(C) 0 cases of NULL (missing value) should imply 0% of missing, not 0.67% of missing
Furthermore, in one node (split on Duration_is_zero), there are 541 cases of fraud and 0 cases of non-fraud, which implies that the node is a leaf node. However, the Mining Legend shows:
[D] 514 cases of fraud, 99.35%
[E] 0 cases of non-fraud, 0.33%
[F] 0 cases of missing, 0.33%
My questions:
(1) Why don't the values match in cases A through C?
(2) Why don't the values match even in cases D through F, where there is no subtree at all?
I've searched for an explanation by reading about the mathematical reasoning, entropy, and the Gini index, but none of it accounts for the discrepancies between the case counts and the percentages in the Mining Legend.
Regards,
Bernaridho
Our decision tree (DT) algorithm uses a Bayesian prior in all calculations. This means that it assumes that all possible states have equal probability at the beginning. This prior gets distributed throughout the tree. Therefore, the percentages you are seeing are prior-adjusted probabilities, while the case counts are the actual support.
Hi,
I need more detail on this for my thesis, which also uses DT.
Could you give an example of how to calculate these percentages? I tried calculating them with a Bayesian prior, but my results don't match the displayed percentages.
Thank you for your answer.
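Not an official formula, but here is a minimal sketch of one way the root-node numbers can be reproduced. It assumes the prior behaves like an equal pseudo-count (called alpha here, a hypothetical name) added to each of the three states: fraud, non-fraud, and missing. The value of alpha is back-solved from the displayed 0.67% for missing; it is not taken from SQL Server documentation.

```python
# Observed support at the root node, taken from the Mining Legend in the question.
counts = {"fraud": 625, "non_fraud": 150, "missing": 0}
total = sum(counts.values())  # 775 cases

# Raw ratios, as computed in (A) through (C) of the question.
raw = {state: 100 * n / total for state, n in counts.items()}

# ASSUMPTION: the prior acts like a pseudo-count alpha added to every state.
# Solving p_missing = alpha / (total + k * alpha) for alpha, using the
# displayed 0.67%. This is an inferred value, not a documented constant.
k = len(counts)
p_missing = 0.0067
alpha = p_missing * total / (1 - k * p_missing)

# Prior-adjusted percentages under that assumption.
adjusted = {
    state: 100 * (n + alpha) / (total + k * alpha)
    for state, n in counts.items()
}

print(f"alpha = {alpha:.3f}")
for state in counts:
    print(f"{state:10s} raw {raw[state]:6.2f}%  adjusted {adjusted[state]:6.2f}%")
```

Under this assumption alpha comes out at about 5.3, and the adjusted values land on roughly 79.69%, 19.64%, and 0.67%, which is what the legend shows for the first split. The same alpha does not reproduce the 0.33% at the Duration_is_zero node; that would need a smaller pseudo-count, which fits the statement that the prior gets distributed throughout the tree, but the exact rule for dividing it among child nodes is not given in this thread.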