
  • How to inform Latent Dirichlet Allocation about two domains of topics and ignore irrelevant ones

    I am using -ldagibbs- to find topics in open-ended question responses. Conceptually and empirically, a response may span three domains: area of work X + encountered problem Y + irrelevant filler Z. The Stata command -ldagibbs- identifies statistical clusters as topics that contain elements of X, Y, and Z (standalone or in combination). I wonder if there is a way to run LDA with predefined domains, so that the output identifies, for each response, the area of work x and the encountered problem y, and ignores all elements of Z.

    Alternatively, I could run "supervised" ML using -svmachines- against predefined X categories and predefined Y categories, but that would require constructing a training data set, which is quite labor-intensive.

    I searched for -ldagibbs- on this forum and didn't find many discussions.

    Thanks!

  • #2
    I found a workaround/shortcut, as opposed to supervised ML (which requires a training set), for simultaneously identifying the "area of work X" and the "challenge Y" using -ldagibbs-.

    Conceptually there may be 10 topics within X and 5 topics within Y.
    1. Run -ldagibbs- with topic(15) to initially explore strong/prominent combinations of x and y;
    2. Manually identify keywords specific to Y and save them as “stopwords.Y”;
    3. Run -ldagibbs- with topic(10) after using -txttool- to remove “stopwords.Y”; this identifies the area-of-work topics;
    4. Save all keywords specific to X (top 20 per topic, automated) as “stopwords.X”;
    5. Load the original data set and use -txttool- to remove “stopwords.X”;
    6. Run -ldagibbs- with topic(5); this identifies the challenge topics.

    The results are quite clean and coherent.
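    The two-pass logic above can be sketched outside Stata as well. The Python sketch below covers only the -txttool-style stopword removal of steps 3 and 5 and the automated top-20 keyword harvesting of step 4; the LDA runs themselves would still be the -ldagibbs- calls, and all variable names and the toy data here are purely illustrative:

```python
from collections import Counter

def remove_stopwords(docs, stopwords):
    """txttool-style filtering: drop every stopword token from each document."""
    sw = set(stopwords)
    return [[tok for tok in doc if tok not in sw] for doc in docs]

def top_keywords(topic_word_counts, k=20):
    """Step 4: collect the top-k keywords of each topic into one stopword list."""
    keywords = set()
    for counts in topic_word_counts:          # one Counter per topic
        keywords.update(w for w, _ in counts.most_common(k))
    return sorted(keywords)

# Toy responses: area-of-work tokens (x_*), challenge tokens (y_*), noise (z_*)
docs = [["x_audit", "y_staffing", "z_thanks"],
        ["x_audit", "y_budget"],
        ["x_nursing", "y_staffing", "z_nothing"]]

# Step 2 (manual): keywords specific to the challenge domain Y
stopwords_Y = ["y_staffing", "y_budget"]

# Step 3: strip the Y words, then run the X-pass LDA on the result
docs_X_only = remove_stopwords(docs, stopwords_Y)

# Step 4: suppose the X-pass produced these per-topic word counts;
# harvest their top keywords as stopwords.X
topic_word_counts = [Counter({"x_audit": 5}), Counter({"x_nursing": 3})]
stopwords_X = top_keywords(topic_word_counts, k=20)

# Step 5: reload the original data and strip the X words before the Y-pass
docs_Y_only = remove_stopwords(docs, stopwords_X)

print(docs_X_only[0])   # ['x_audit', 'z_thanks']
print(docs_Y_only[0])   # ['y_staffing', 'z_thanks']
```

    Note that the irrelevant Z tokens survive both passes, as in the original workflow; they are left for the final LDA runs to absorb.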
