
  • Stata with Machine Learning

    I am working on a data science/machine learning project and was happy to see Stata packages on machine learning topics:
    https://www.stata.com/stata-news/news33-4/users-corner/
    but I am not sure how effective these packages are in comparison with R or Python algorithms.

    1. If any of these Stata packages has worked well for you, please let me know; it would be fantastic if you could share your example or project outcome. If you know of situations in which one does not work, please share that as well:
    • Support Vector Machines (SVM)
    • Random forest
    • Neural Network
    • Decision Trees
    • Bayesian Approach
    • Naive Bayes
    • kNN
    • Deep Learning
    • AI
    2. I have a few questions on this particular example of decision trees.

    I have tried decision trees using R and it worked pretty well both statistically and graphically. In Stata, I tried -rforest- and -chaidforest- as follows:

    Code:
    clear all
    
    webuse auto
    
    chaidforest foreign, unordered(rep78) minnode(2) minsplit(5) xtile(length weight, nquantiles(3)) alpha(.8)
    estat gettree, tree(1) graph
    
    
    rforest foreign weight length rep78 mpg, type(reg) iter(500)
    
    rforest foreign weight length rep78 mpg, type(class)
    a. Which one is really a package for decision trees? Both packages are named for random forests, but -chaidforest- seems closer to decision trees to me because it provides decision nodes and a graph of the tree.
    b. Can -rforest- graph its decision trees as well? I only see the graph of variable importance.
    c. -chaidforest- only allows one variable to start the splitting with, i.e. "foreign". If the user does not know which "important" variable to start with, what would be an efficient way to start the decision tree? How can we pass in the whole set of variables and let the algorithm make the decision automatically?

    Hope to hear from your experience! And thank you in advance for your answers.
    I hope Stata can provide great packages and quick solutions for machine learning, as it always does for other areas.
    Best regards,
    Victoria Nguyen
    PhD in Econometrics
    Help you move forward and achieve your goals faster
    https://statatutoring.weebly.com/

  • #2
    Hi Victoria,

    Neither -chaidforest- nor -rforest- is a decision tree command. Both grow ensembles (or forests) of decision trees. As you outline above, however, you can use the post-estimation command -estat gettree- to access the results from a single tree grown by -chaidforest-.

    If your goal is to obtain a single decision tree, -chaid- (also on SSC), the precursor to -chaidforest-, grows a single decision tree and shares the same format as -estat gettree-.
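
    A minimal sketch of the single-tree route, assuming -chaid- accepts the same options as the -chaidforest- call in the first post. The option names are carried over from that call rather than confirmed, so please check -help chaid- after installing:

    Code:
    * install the single-tree command from SSC
    ssc install chaid
    
    webuse auto, clear
    
    * grow one CHAID tree for foreign; the predictors go in the options
    chaid foreign, unordered(rep78) minnode(2) minsplit(5) xtile(length weight, nquantiles(3))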

    -chaidforest- only allows a single response variable (here, "foreign"). The splitting/prediction features/variables go in the "unordered()", "ordered()", and "xtile()" options. -chaidforest- is a supervised learning algorithm and attempts to learn how to predict the one response variable. It sounds like you are looking for unsupervised learning/clustering, given your question here.
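
    As a sketch of how more candidate predictors could be supplied, they can all be listed in those options. The extra variables mpg and price are only an illustrative choice, and the nquantiles(3) binning is copied from the call in the first post:

    Code:
    webuse auto, clear
    
    * foreign is the response; every other variable listed is a candidate predictor
    chaidforest foreign, unordered(rep78) xtile(length weight mpg price, nquantiles(3))
    
    * inspect one tree from the forest
    estat gettree, tree(1) graph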

    - joe
    Joseph Nicholas Luchman, Ph.D., PStat® (American Statistical Association)
    ----
    Research Fellow
    Fors Marsh

    ----
    Version 18.0 MP



    • #3
      Hello Victoria,

      Your list is missing neural networks. The brain module uses C plugins and multiprocessing, and compares well performance-wise with the Python and R implementations. Unfortunately, the plugins are not yet tested on Unix and Mac operating systems. If you have Windows, it will run just fine. As long as it is not tested on the other two operating systems, I am reluctant to put it on SSC. The following post directs you to the download location on GitHub:

      https://www.statalist.org/forums/for...s-for-unix-mac

      Best,
      Thorsten
