  • Problem with the CHAID procedure

    Dear Statalist Users,

    I want to learn how to build decision trees using the CHAID algorithm. To do that, I looked at the simplest example in the procedure's help file. As indicated there, I ran the following code:

    set seed 1234567
    webuse auto
    chaid foreign, unordered(rep78) minnode(4) minsplit(10) xtile(length, n(3))

    The result shown in the help is:

    Chi-Square Automated Interaction Detection (CHAID) Tree Branching Results
    --------------------------------------------------------------------------------

            1             2             3             4
      +---------------------------------------------------------+
    1 | xtlength@1    xtlength@2    xtlength@2    xtlength@3    |
    2 |               rep78@1 3 2   rep78@4 5                   |
    3 | Cluster #1    Cluster #2    Cluster #4    Cluster #3    |
      +---------------------------------------------------------+

    The result I get is:

    Chi-Square Automated Interaction Detection (CHAID) Tree Branching Results
    --------------------------------------------------------------------------------

            1             2             3             4
      +---------------------------------------------------------+
    1 | xtlength@1    xtlength@2    xtlength@2    xtlength@3    |
    2 |               rep78@1 4 5   rep78@2 3                   |
    3 | Cluster #1    Cluster #2    Cluster #4    Cluster #3    |
      +---------------------------------------------------------+

    The difference is in Cluster #2 and Cluster #4: in my output, rep78@1 is merged with rep78@4 5 instead of with rep78@2 3.
    The first of the two results is the solution the procedure should return, since the contingency table of foreign by rep78 (conditional on xtlength@2) is:

    rep78       1    2    3    4    5
    Domestic    2    3   13    0    0
    Foreign     0    0    0    2    3

    From this table it can be seen that the algorithm should merge categories 1, 2, 3 into one group and categories 4, 5 into another.

    My question is: Why does the procedure no longer replicate the result from the help document? Is there a bug, or am I missing something?

    Thank you very much for your help in advance!

    Paulo


  • #2
    Welcome to Statalist, Paulo.

    The chaid command is a user-written extension to Stata installed from SSC. In the output of ssc describe chaid (as well as at the end of the output of help chaid) you'll see the author's name and email address for support. I'd write to him directly, if you have not already done so.

    I hypothesized that the recent change in Stata's random number generator was at the root of the problem, but adding
    Code:
    set rng kiss32
    before your sample code did not affect the results.



    • #3
      Hi Paulo,

      It's a misreport in the helpfile.

      The result reported is based on the original version of the software (put out in 2013) that had a few category merging bugs that have been subsequently resolved. That result needs an update in the helpfile.

      - joe
      Joseph Nicholas Luchman, Ph.D., PStat® (American Statistical Association)
      ----
      Research Fellow
      Fors Marsh

      ----
      Version 18.0 MP



      • #4
        Hello William and Joseph,

        thank you very much for your answers.

        It's just not clear to me why (conditional on xtlength@2) the procedure would not merge rep78 1, 2, and 3. All three have counts in "domestic" and none in "foreign". Categories rep78 4 and 5 have counts in "foreign" and none in "domestic". The difference between rep78 1 and rep78 4 5 is larger than the difference between rep78 1 and rep78 2 3, since in the first case the count vectors are linearly independent and in the second case they are not.
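
To make this comparison concrete, here is a minimal sketch in Python (standard library only) of the Pearson chi-square that a traditional contingency-table CHAID would compute when deciding which pair of categories to merge. It uses only the counts posted above for xtlength@2. The helper names `pearson_chi2` and `group` are made up for this illustration, and this is not how the chaid command itself computes its statistics.

```python
# Pearson chi-square for comparing two groups of rep78 categories,
# using the counts from the thread (conditional on xtlength@2).
# Each entry is (domestic, foreign) for one rep78 category.
counts = {1: (2, 0), 2: (3, 0), 3: (13, 0), 4: (0, 2), 5: (0, 3)}


def group(*cats):
    """Sum the (domestic, foreign) counts over the given rep78 categories."""
    return tuple(sum(counts[c][k] for c in cats) for k in range(2))


def pearson_chi2(table):
    """Pearson chi-square statistic for a table given as a list of rows.

    Outcome columns with a zero total are dropped first, because they
    carry no information and would give zero expected counts.
    """
    cols = [c for c in zip(*table) if sum(c) > 0]
    rows = list(zip(*cols))
    n = sum(sum(r) for r in rows)
    row_tot = [sum(r) for r in rows]
    col_tot = [sum(c) for c in cols]
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Candidate merge 1: category 1 against categories 2 and 3.
chi2_1_vs_23 = pearson_chi2([group(1), group(2, 3)])
# Candidate merge 2: category 1 against categories 4 and 5.
chi2_1_vs_45 = pearson_chi2([group(1), group(4, 5)])

print("1 vs {2,3}:", chi2_1_vs_23)
print("1 vs {4,5}:", chi2_1_vs_45)
```

With these counts the statistic is 0 for category 1 against {2, 3} (identical outcome distributions) and 7, the full sample size of that 2 x 2 table, for category 1 against {4, 5} (perfectly opposite outcomes). A contingency-table implementation that merges the least different pair would therefore join 1 with 2 and 3.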

        It would be great if you could give me some intuition for this. I guess it has something to do with CHAID not explicitly aiming for node purity, or is the number of observations in this example just too low for the test to give reliable results?

        Thanks again.

        Best,
        Paulo



        • #5
          Hi Paulo,

          The helpfile notes that:

          The current implementation of chaid differs from the traditional use of contingency tables in that it uses logistic models to estimate chi-square values and, as such, may require somewhat larger sample sizes than do other implementations of chaid for the ml algorithm to converge. The default estimation method for chaid is mlogit. The use of Stata's ml based commands greatly increases the flexibility of the kinds of data chaid can accommodate.
          The idea is that chaid's structure could accommodate other kinds of dependent variables. I had intended to extend -chaid- further to accommodate other GLMs, but time simply has not permitted that work.

          Ultimately, in this situation, models with perfect prediction (i.e., the 0s vs not-0s across rep78) led to separation in the logit models and to p-values that are treated as 1. Thus, as you note, it's due to the small sample.
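
As a rough illustration of what separation means here, the sketch below (Python, standard library only) evaluates the log-likelihood of a toy two-parameter logit for the comparison of rep78 1 against rep78 4 5, using the counts posted earlier in the thread. The parameterization and the function names are assumptions made for this example. It does not reproduce chaid's internals or how chaid assigns the p-value; it only shows that the likelihood keeps improving as the coefficients grow, so no finite maximum exists.

```python
import math

# Toy logit for rep78 category 1 versus categories {4, 5} at xtlength@2:
# 2 domestic cars with x = 0, and 5 foreign cars with x = 1.
# Assumed model for illustration: P(foreign) = sigmoid(a + b * x).


def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def log_likelihood(a, b):
    """Log-likelihood of the toy model on the perfectly separated data."""
    # Domestic cars (x = 0): log P(domestic) = log(1 - sigmoid(a)) = log_sigmoid(-a)
    # Foreign cars (x = 1): log P(foreign) = log_sigmoid(a + b)
    return 2 * log_sigmoid(-a) + 5 * log_sigmoid(a + b)


# Move along the direction a = -s, b = 2s and record the fit at each step.
scales = [1, 2, 4, 8, 16]
logliks = [log_likelihood(-s, 2 * s) for s in scales]

for s, ll in zip(scales, logliks):
    print(f"s = {s:2d}  log-likelihood = {ll:.8f}")
```

Each larger value of s gives a log-likelihood closer to its upper bound of 0 without ever reaching it, which is why a maximum likelihood routine cannot settle on finite estimates for such a table.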

          Unexpected, perhaps, but the user is warned about this kind of behavior.

          - joe



          • #6
            Hi Joe,

            that clarifies it.

            Thank you very much for your help and for providing the procedure to the community.

            Best,
            Paulo
