Problem with the CHAID procedure

Paulo Rodrigues

Join Date: Oct 2018

Posts: 3
#1

31 Oct 2018, 04:00

Dear Statalist Users,

I want to learn how to build decision trees using the CHAID algorithm. To do that, I looked at the simplest example in the help document of the procedure. As indicated there, I ran the following code:

set seed 1234567
webuse auto
chaid foreign, unordered(rep78) minnode(4) minsplit(10) xtile(length, n(3))

The result shown in the help is:

Chi-Square Automated Interaction Detection (CHAID) Tree Branching Results
--------------------------------------------------------------------------------

    1             2             3             4
  +---------------------------------------------------------+
1 | xtlength@1    xtlength@2    xtlength@2    xtlength@3    |
2 |               rep78@1 3 2   rep78@4 5                   |
3 | Cluster #1    Cluster #2    Cluster #4    Cluster #3    |
  +---------------------------------------------------------+

The result I get is:

Chi-Square Automated Interaction Detection (CHAID) Tree Branching Results
--------------------------------------------------------------------------------

    1             2             3             4
  +---------------------------------------------------------+
1 | xtlength@1    xtlength@2    xtlength@2    xtlength@3    |
2 |               rep78@1 4 5   rep78@2 3                   |
3 | Cluster #1    Cluster #2    Cluster #4    Cluster #3    |
  +---------------------------------------------------------+

The difference is in Cluster #2 and Cluster #4: rep78@1 is merged with rep78@4 5 instead of with rep78@2 3.
The first of the two results is the one the procedure should return, since the contingency table of origin (D = domestic, F = foreign) by rep78, conditional on xtlength@2, is:

rep78    1    2    3    4    5
D        2    3   13    0    0
F        0    0    0    2    3

This shows that the algorithm should merge categories 1, 2, and 3 into one group and categories 4 and 5 into another.
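
The table above can be approximated outside chaid with a short sketch like the one below. It assumes that chaid's xtile(length, n(3)) option cuts length into the same terciles as Stata's own xtile command; the variable name xtlength is chosen here only to match the labels in the output.

```stata
* Sketch only: assumes chaid's xtile() option matches Stata's xtile terciles
webuse auto, clear
xtile xtlength = length, nquantiles(3)

* Cross-tabulate origin by repair record within the middle length tercile
tabulate foreign rep78 if xtlength == 2
```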

My question is: Why does the procedure no longer replicate the result from the help document? Is there a bug, or am I missing something?

Thank you very much for your help in advance!

Paulo
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

31 Oct 2018, 06:09

Welcome to Statalist, Paulo.

The chaid command is a user-written extension to Stata installed from SSC. In the output of ssc describe chaid (as well as at the end of the output of help chaid) you'll see the author's name and email address for support. I'd write to him directly, if you have not already done so.
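
For reference, the two commands below show the package description on SSC and the location and version line of the installed ado-file, which can help when comparing results against an older helpfile. Both are standard Stata commands; what they print depends on the copy installed locally.

```stata
* Package description on SSC, including author contact details
ssc describe chaid

* Path and version comment of the chaid ado-file currently installed
which chaid
```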

I hypothesized that the recent change in Stata's random number generator was at the root of the problem, but adding

Code:

set rng kiss32

before your sample code did not affect the results.
Joseph Luchman

Join Date: Mar 2014

Posts: 114
#3

31 Oct 2018, 09:34

Hi Paulo,

It's a misreport in the helpfile.

The result reported is based on the original version of the software (put out in 2013) that had a few category merging bugs that have been subsequently resolved. That result needs an update in the helpfile.

- joe

Joseph Nicholas Luchman, Ph.D., PStat® (American Statistical Association)
----
Research Fellow
Fors Marsh
----
Version 18.0 MP
Paulo Rodrigues

Join Date: Oct 2018

Posts: 3
#4

02 Nov 2018, 02:56

Hello William and Joseph,

thank you very much for your answers.

It's just not clear to me why (conditional on xtlength@2) the procedure would not merge rep78 1, 2, and 3. All three have counts in "domestic" and none in "foreign", whereas rep78 4 and 5 have counts in "foreign" and none in "domestic". The difference between rep78 1 and rep78 4 5 is larger than that between rep78 1 and rep78 2 3, since in the first case the count vectors are linearly independent and in the second case they are not.
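
The two candidate merges can be compared with ordinary Pearson chi-square tests, which is how a contingency-table implementation of CHAID would judge them. This is only a sketch: it assumes that Stata's xtile reproduces chaid's length terciles, and the variable names one_vs_23 and one_vs_45 are made up for illustration.

```stata
* Sketch only: assumes Stata's xtile matches chaid's xtile(length, n(3))
webuse auto, clear
xtile xtlength = length, nquantiles(3)

* Candidate merge A: rep78 1 versus rep78 2 and 3 (middle tercile only)
generate byte one_vs_23 = (rep78 == 1) if xtlength == 2 & inlist(rep78, 1, 2, 3)
tabulate foreign one_vs_23, chi2

* Candidate merge B: rep78 1 versus rep78 4 and 5 (middle tercile only)
generate byte one_vs_45 = (rep78 == 1) if xtlength == 2 & inlist(rep78, 1, 4, 5)
tabulate foreign one_vs_45, chi2
```

Going by the counts in the table in post #1, comparison A contains no foreign cars at all, so nothing distinguishes rep78 1 from rep78 2 3, while comparison B is the fully separated 2x2 table (2, 0 / 0, 5), whose Pearson chi-square is 7 on 1 degree of freedom.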

It would be great if you could give me some intuition for this. I guess it has something to do with CHAID not explicitly aiming for node purity, or is the number of observations in this example just too low for the test to give reliable results?

Thanks again.

Best,
Paulo
Joseph Luchman

Join Date: Mar 2014

Posts: 114
#5

02 Nov 2018, 07:45

Hi Paulo,

The helpfile notes that:

The current implementation of chaid differs from the traditional use of contingency tables in that it uses logistic models to estimate chi-square values and, as such, may require somewhat larger sample sizes than do other implementations of chaid for the ml algorithm to converge. The default estimation method for chaid is mlogit. The use of Stata's ml based commands greatly increases the flexibility of the kinds of data chaid can accommodate.

The idea was that chaid's structure could accommodate other kinds of dependent variables. I had intended to extend -chaid- further to accommodate other GLMs, but time simply has not permitted that work.

Ultimately, in this situation, models with perfect prediction (i.e., the 0s vs. not-0s across rep78) led to separation in the logit models and to p-values that are treated as 1. Thus, as you note, it's due to the small sample.
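
A rough way to see the separation directly is to fit a binary logit on the same subsample. This sketch is not chaid's internal code: it assumes that Stata's xtile reproduces the length terciles, and low_rep is a hypothetical indicator for rep78 1, 2, or 3.

```stata
* Sketch only, not chaid's internals
webuse auto, clear
xtile xtlength = length, nquantiles(3)

* Indicator for the low repair categories within the middle length tercile
generate byte low_rep = inlist(rep78, 1, 2, 3) if xtlength == 2 & !missing(rep78)
tabulate foreign low_rep

* capture noisily displays any perfect-prediction note or error
* without halting a do-file
capture noisily logit foreign low_rep
```

If the counts match the table in post #1, low_rep splits domestic and foreign cars perfectly, so no finite maximum likelihood estimate exists for its coefficient.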

Unexpected, perhaps, but the user is warned about this kind of behavior.

- joe

Paulo Rodrigues

Join Date: Oct 2018

Posts: 3
#6

02 Nov 2018, 08:46

Hi Joe,

that clarifies it.

Thank you very much for your help and for providing the procedure to the community.

Best,
Paulo