  • New on SSC: chaidforest - random forest ensemble classifier based on CHAID trees as base learners

    With many thanks to Kit Baum, the package chaidforest is now available from SSC.

    chaidforest is an implementation of a random forest (Breiman, 2001) ensemble classification algorithm. A random forest is built on the idea that many relatively weak classifiers, when combined, can produce more accurate predictions than a single, more accurate classifier such as a logit-based model, especially for out-of-sample prediction.
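    To see why combining weak learners helps, consider a minimal simulation in plain Stata (an illustration of the ensemble principle only, not part of chaidforest): 25 independent classifiers that are each correct only 60% of the time are, by majority vote, correct about 85% of the time.

    Code:
    clear
    set obs 10000
    set seed 12345
    // each of 25 independent weak classifiers is correct with probability 0.6
    forvalues b = 1/25 {
        generate byte correct`b' = runiform() < 0.6
    }
    // the majority vote is correct when 13 or more of the 25 learners are
    egen votes = rowtotal(correct1-correct25)
    generate byte majority = votes >= 13
    summarize majority   // mean is roughly 0.85, versus 0.60 for any one learner

    The catch is that the learners must not be too highly correlated, which is exactly what the random variable and observation selection described below is designed to encourage.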

    The chaidforest classifier uses the CHAID algorithm as its "base learner" (type findit chaid for details about the CHAID algorithm), which means that the chaidforest classifier is built as a "forest" of individual CHAID "trees." After the forest of individual CHAID trees is "grown," predictions or observed probabilities for each observation can be obtained. The multiple trees grown constitute the "forest" component of the random forest. Traditionally, a classification and regression tree is used as the base learner for a random forest, so the chaidforest ensemble classifier can be expected to differ somewhat from other random forest classifiers; it is similar in spirit to the cforest() function in the R package party.

    chaidforest proceeds by growing a user-specified number of trees. For each individual tree, chaidforest by default randomly selects a subset of splitting variables (naturally, without replacement) and a subset of observations (either with replacement [the default] or without). The random selection of splitting variables and observations constitutes the "random" component of the random forest.
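    For intuition about the observation sampling, here is how a single tree's bootstrap sample could be drawn with Stata's built-in bsample command (an illustration of the resampling idea only, not chaidforest's internal code):

    Code:
    sysuse auto, clear
    set seed 2468
    bsample              // redraw all 74 observations, with replacement
    duplicates report    // some rows now appear several times, others not at all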

    Because many individual trees are combined, each with randomly selected splitting variables and observations, influential variables and observations have less impact on any individual tree. In particular, limiting the influence of dominant splitting variables allows less influential variables to actually be used to split/partition the data in any given tree, increasing their representation in the final predictions and (potentially) enhancing out-of-sample prediction.

    Currently, chaidforest can be used only to obtain predicted values and probabilities. Many standard random forest routines (splitting variable importance, a proximity matrix, out-of-sample fit) have not yet been incorporated; post-estimation routines for these random forest products are in development. Additionally, an alternative merging algorithm based on random binary mergers of splitting variable levels is in development; it could speed forest growth and require less pre-processing by the user (user-defined merging), while remaining in the spirit of the random forest.
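    As a usage sketch only: the syntax below is an assumption modeled on chaid (SSC), not confirmed chaidforest syntax, and the unordered(), ntree(), and predict options shown are hypothetical; after installing, see help chaidforest for the actual options.

    Code:
    * hypothetical syntax, for illustration only; check help chaidforest
    sysuse auto, clear
    chaidforest foreign, unordered(rep78) ntree(100)
    predict double phat, pr    // assumed option for predicted probabilities
    predict byte class         // assumed default: predicted class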

    chaidforest, like the chaid command (SSC) on which it is based, allows only categorical response and splitting variables. Thus, continuous variables must be categorized. However, chaidforest contains a convenience option, xtile(), to break continuous splitting variables into quantiles.
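    If you prefer to categorize continuous variables yourself before fitting, Stata's built-in xtile command does the same kind of binning as the package's xtile() convenience option; for example:

    Code:
    sysuse auto, clear
    xtile mpg4 = mpg, nquantiles(4)   // bin continuous mpg into quartiles
    tabulate mpg4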

    To install, type:

    Code:
    ssc install chaidforest

    chaidforest requires Stata version 12.1 and moremata (SSC).

    Please feel free to contact me with bug reports, suggestions, or other comments regarding chaidforest.

    - joe


    References

    Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
    Joseph Nicholas Luchman, Ph.D., PStat® (American Statistical Association)
    ----
    Research Fellow
    Fors Marsh

    ----
    Version 18.0 MP