Stata with Machine Learning

Victoria Nguyen

Join Date: Mar 2020

Posts: 8
#1

01 Apr 2020, 10:30

I am working on a data science/machine learning project and was happy to see Stata packages on machine learning topics:
https://www.stata.com/stata-news/news33-4/users-corner/
but I am not sure how effective these packages are in comparison with R or Python algorithms.

1. If any of these Stata packages has worked well for you, please let me know; a shared example or project outcome would be fantastic. If you know of situations where one does not work, please share those as well:
Support Vector Machines (SVM)

Random forest

Neural Network

Decision Trees

Bayesian Approach

Naive Bayes

kNN

Deep Learning

AI

2. I have a few questions on this particular example of decision trees.

I have tried decision trees in R, and they worked well both statistically and graphically. In Stata, I tried -rforest- and -chaidforest- as follows:

Code:

clear all
webuse auto
chaidforest foreign, unordered(rep78) minnode(2) minsplit(5) xtile(length weight, nquantiles(3)) alpha(.8)
estat gettree, tree(1) graph
rforest foreign weight length rep78 mpg, type(reg) iter(500)
rforest foreign weight length rep78 mpg, type(class)

a. Which one is really a package for decision trees? Both package names refer to random forests, but -chaidforest- seems closer to a decision tree command because it reports decision nodes and draws a graph of the tree.
b. Can -rforest- draw a graph of a decision tree as well? I only see the variable importance graph.
c. -chaidforest- only allows one variable to start the splitting with, i.e. "foreign". If the user does not know which "important" variable to start with, what is an efficient way to start the decision tree? How can we pass in the whole set of variables and let the algorithm make the decision automatically?

I hope to hear about your experience, and thank you in advance for your answers.
I hope Stata can provide great packages and quick solutions for machine learning, as it always does in other areas.
Best regards,

Victoria Nguyen
PhD in Econometrics
Help you move forward and achieve your goals faster
https://statatutoring.weebly.com/
Joseph Luchman

Join Date: Mar 2014

Posts: 114
#2

17 Apr 2020, 11:42

Hi Victoria,

Neither -chaidforest- nor -rforest- is a decision tree command. Both grow ensembles (or forests) of decision trees. As you outline above, however, you can access the results from a single tree in -chaidforest- using the post-estimation command -estat gettree-.

If your goal is to obtain a single decision tree, -chaid- (also on SSC), the precursor to -chaidforest-, grows a single decision tree and reports it in the same format as -estat gettree-.

-chaidforest- only allows a single response variable. The splitting/prediction features/variables are specified in the "unordered()", "ordered()", and "xtile()" options. -chaidforest- is a supervised learning algorithm and attempts to learn how to predict the one response variable. Given your question here, it sounds like you are looking for unsupervised learning/clustering.
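As a rough sketch of the points above, a single tree for the auto data might be requested as follows. This assumes -chaid- has been installed from SSC and accepts the same "unordered()", "xtile()", "minnode()", and "minsplit()" options used in the -chaidforest- call in #1; the choice of predictors and option values is illustrative only, so please check -help chaid- before relying on it.

```stata
* Sketch only: assumes -chaid- is installed (ssc install chaid)
* and shares its splitting options with -chaidforest-.
clear all
webuse auto

* foreign is the single response variable.
* Every candidate splitting variable is listed in an option:
*   unordered() for categorical predictors,
*   xtile() for continuous predictors that are binned into quantiles first.
chaid foreign, unordered(rep78) xtile(length weight mpg, nquantiles(3)) minnode(2) minsplit(5)
```

The algorithm then chooses which of the listed predictors to split on first, so no "starting" predictor has to be picked by hand.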

- joe

Joseph Nicholas Luchman, Ph.D., PStat® (American Statistical Association)
----
Research Fellow
Fors Marsh
----
Version 18.0 MP
Thorsten Doherr

Join Date: Nov 2018

Posts: 20
#3

21 Apr 2020, 08:37

Hello Victoria,

Your list is missing Neural Networks. The brain module uses C plugins and multiprocessing, and it compares well performance-wise with the Python and R implementations. Unfortunately, the plugins have not yet been tested on Unix and Mac operating systems. If you have Windows, it will run just fine. As long as it is not tested on the other two operating systems, I am reluctant to put it on SSC. The following post directs you to the download location on GitHub:

https://www.statalist.org/forums/for...s-for-unix-mac

Best,
Thorsten