
  • How to inform Latent Dirichlet Allocation about two domains of topics and ignore irrelevant ones

    I am using -ldagibbs- to find topics in open-ended question responses. Conceptually and empirically, a response may span three domains: area of work X + encountered problem Y + irrelevant filler Z. The Stata command -ldagibbs- identifies statistical clusters as topics that contain elements of X, Y, and Z (standalone or in combination). I wonder if there is a way to run LDA with predefined domains, so that the output identifies, for each response, the area of work x and the encountered problem y, and ignores all elements of Z.

    Alternatively, I could run "supervised" ML using -svmachines- against predefined X categories and predefined Y categories, but that would require constructing a training data set, which is quite labor-intensive.

    I searched for -ldagibbs- on this forum and didn't find many discussions.

    Thanks!

  • #2
    I found a workaround/shortcut, as opposed to supervised ML (which requires a training set), for simultaneously identifying the "area of work X" and the "challenge Y" using -ldagibbs-.

    Conceptually there may be 10 topics within X and 5 topics within Y.
    1. Run -ldagibbs- with topic(15) to initially explore strong/prominent combinations of x and y;
    2. Manually identify keywords specific to Y and save them as “stopwords.Y”;
    3. Run -ldagibbs- with topic(10) after using -txttool- to remove “stopwords.Y”; this identifies the area-of-work topics;
    4. Save all keywords specific to X (top 20 per topic, automated) as “stopwords.X”;
    5. Load the original data set and use -txttool- to remove “stopwords.X”;
    6. Run -ldagibbs- with topic(5); this identifies the challenge topics.

    The results are quite clean and coherent.
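    The two-pass logic above can be sketched outside Stata as well. The Python sketch below covers only the -txttool-style stopword removal of steps 3 and 5 and the automated top-20 keyword harvesting of step 4; the LDA runs themselves would still be the -ldagibbs- calls, and all variable names and the toy data here are purely illustrative:

```python
from collections import Counter

def remove_stopwords(docs, stopwords):
    """txttool-style filtering: drop every stopword token from each document."""
    sw = set(stopwords)
    return [[tok for tok in doc if tok not in sw] for doc in docs]

def top_keywords(topic_word_counts, k=20):
    """Step 4: collect the top-k keywords of each topic into one stopword list."""
    keywords = set()
    for counts in topic_word_counts:          # one Counter per topic
        keywords.update(w for w, _ in counts.most_common(k))
    return sorted(keywords)

# Toy responses: area-of-work tokens (x_*), challenge tokens (y_*), noise (z_*)
docs = [["x_audit", "y_staffing", "z_thanks"],
        ["x_audit", "y_budget"],
        ["x_nursing", "y_staffing", "z_nothing"]]

# Step 2 (manual): keywords specific to the challenge domain Y
stopwords_Y = ["y_staffing", "y_budget"]

# Step 3: strip the Y words, then run the X-pass LDA on the result
docs_X_only = remove_stopwords(docs, stopwords_Y)

# Step 4: suppose the X-pass produced these per-topic word counts;
# harvest their top keywords as stopwords.X
topic_word_counts = [Counter({"x_audit": 5}), Counter({"x_nursing": 3})]
stopwords_X = top_keywords(topic_word_counts, k=20)

# Step 5: reload the original data and strip the X words before the Y-pass
docs_Y_only = remove_stopwords(docs, stopwords_X)

print(docs_X_only[0])   # ['x_audit', 'z_thanks']
print(docs_Y_only[0])   # ['y_staffing', 'z_thanks']
```

    Note that the irrelevant Z tokens survive both passes, as in the original workflow; they are left for the final LDA runs to absorb.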
