  • Discrete choice model with multiple choices... is this the best way?


    Hi,
    I am trying to use a discrete choice model for my paper and I have the following two questions. Feel free to share your opinion.



    Concern 1.
    I am examining whether a customer’s initial interest affects whether they buy the product.
    • customer_ID : unique id for each customer
    • product_ID : unique id for each product
    • initial_int : binary, 1 if the customer said he/she has interest in the product
    • purchase : DV, 1 if purchased
    • max_purchase : total number of products the customer purchased (control)
    • customer_income : customer income (control)
    customer_ID product_ID purchase max_purchase initial_int customer_income
    1 A 1 2 1 70
    1 B 0 2 0 70
    1 C 1 2 0 70
    1 D 0 2 0 70
    1 E 0 2 0 70
    2 A 1 2 1 90
    2 B 1 2 0 90
    2 C 0 2 1 90
    2 D 0 2 0 90
    2 E 0 2 0 90
    3 A 0 1 0 70
    3 B 1 1 0 70
    3 C 0 1 1 70
    3 D 0 1 0 70
    3 E 0 1 0 70
    4 A 1 3 1 100
    4 B 1 3 0 100
    4 C 1 3 1 100
    4 D 0 3 0 100
    4 E 0 3 0 100


    For this analysis I did the following:
    cmset customer_ID product_ID
    cmclogit purchase initial_int, casevar (max_purchase customer_income)

    The problem is that the majority of customers bought more than one product.

    According to page 13 here (https://www.stata.com/manuals/cmcmclogit.pdf#cmcmclogit ), it seems like I have to make observations (cmsets) for these purchases.
    For example, for customer_ID #1, I make another five sets of observations for the second purchase…And for these second purchases, I identify them with the variable "purchase_number"

    customer_ID product_ID initial_int purchase purchase_number
    1 A 1 1 1
    1 B 0 0 1
    1 C 0 0 1
    1 D 0 0 1
    1 E 0 0 1
    1 A 1 0 2
    1 B 0 0 2
    1 C 0 1 2
    1 D 0 0 2
    1 E 0 0 2
    This is an example, so there are not many observations and controls. However, the number of customers I have is 700 and products are 300, making the data big.
    Is adding new sets of observations the only way? Are there any options or commands that would make this process simpler? Any feedback would be helpful!




    Concern 2.
    Also, for clarification, regarding the command below, do I include controls inside the parentheses or before the comma?
    1. cmclogit purchase initial_int, casevar (max_purchase customer_income CONTROLS HERE? )
    2. cmclogit purchase initial_int CONTROLS HERE?, casevar (max_purchase customer_income)
    Thank you for sharing your thoughts.


    Last edited by olivia kim; 02 May 2025, 14:02.

  • #2
    Hi there,

    You're on the right path. Modeling each purchase as a separate choice event is necessary when using cmclogit, but you don’t need to do it by hand. Alternatively, if your research questions don’t depend on modeling the choice set structure explicitly, panel models like xtlogit or cmxtmixlogit might get you to your answer more efficiently.
    1. Do I really need to create a new choice set for every purchase?

    Yes, you do. The cmclogit command is built to model a single choice from a set of alternatives, and it assumes one “winning” option per choice set. So, when a customer makes more than one purchase, you need to create multiple choice sets—one for each purchase occasion.

    That said, you’re right to be concerned about the size of the data. With 700 customers and 300 products, expanding the data to accommodate multiple purchases will make it grow quickly. But that expansion is necessary for the model to correctly interpret each purchase as its own decision.
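
    The expansion itself can be scripted. Here is a minimal sketch, assuming the data start with one row per customer-product pair as in your first table. The names buy_rank, n_buy, chosen, case_id, and alt_id are made up for illustration, and the order in which purchases are numbered is arbitrary (alphabetical by product) because the data do not record purchase order:

```stata
* Assumes one row per customer-product pair with purchase coded 0/1

* Number each customer's purchased products 1, 2, ... (arbitrary order)
bysort customer_ID (product_ID): gen long buy_rank = sum(purchase) if purchase == 1

* Count purchases per customer; customers with none contribute no choice sets
bysort customer_ID: egen n_buy = total(purchase)
drop if n_buy == 0

* Replicate every row once per purchase occasion and number the copies
expand n_buy
bysort customer_ID product_ID: gen purchase_number = _n

* In occasion k, only the k-th purchased product is the chosen alternative
gen byte chosen = (purchase == 1 & buy_rank == purchase_number)

* One case per customer-occasion pair; numeric id for the alternatives
egen long case_id = group(customer_ID purchase_number)
egen long alt_id  = group(product_ID)

cmset case_id alt_id
cmclogit chosen initial_int, casevars(customer_income) vce(cluster customer_ID)
```

    Clustering on customer_ID is one way to acknowledge that the choice sets from the same customer are not independent. It adjusts the standard errors only, not the model itself.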

    You don’t have to do this manually. Stata makes it possible to automate the process using a combination of expand, bysort, and custom logic to assign purchase numbers and replicate the choice sets. This can save a lot of time and reduce errors.
    2. Are there simpler alternatives?

    If your primary goal is to estimate how initial interest affects purchase behavior, and you're less concerned with modeling the exact choice structure over alternatives, you might consider using a fixed-effects logistic regression (xtlogit) or a mixed logit model (cmxtmixlogit).
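    • With xtlogit, you can account for customer-level unobserved heterogeneity without expanding the data. Customer-level constants such as customer_income do not vary within a customer, so the fixed-effects estimator cannot identify them and they are left out here:
      xtset customer_ID
      xtlogit purchase initial_int, fe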
    • With logit and clustered standard errors, you keep the data in its current structure:
      logit purchase initial_int customer_income, vce(cluster customer_ID)
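    • If you want to remain in a discrete choice framework and allow coefficients to vary across customers, consider cmxtmixlogit. It still needs one choice set per purchase occasion, so the expansion discussed under question 1 applies, and the panel is declared through cmset rather than through an option. Here purchase_number is the occasion identifier from your expanded example:
      cmset customer_ID purchase_number product_ID
      cmxtmixlogit purchase, random(initial_int) casevars(customer_income)
    This lets you model multiple decisions per customer while allowing for random coefficients and alternative-specific covariates.
    3. Where do the controls go in the cmclogit command?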

    This is a good question, and it's something that trips up many users. Here's the general rule:
    • Variables that vary across alternatives within a choice set (like initial_int) should go before the comma.
    • Variables that are constant within a choice set (like customer_income, max_purchase) should go inside the casevars() option.
    So your syntax should look like this:
    cmclogit purchase initial_int, casevars(customer_income max_purchase)
    This tells Stata which variables are shared across all alternatives in a given decision, and which ones help explain variation in choices.
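
    If you are unsure whether a control really is constant within each choice set, a quick check along these lines may help. It is only a sketch and assumes the unexpanded data, where customer_ID identifies the case; the variable varies is a throwaway name:

```stata
* For each candidate case-specific variable, count observations that
* belong to a case where the variable takes more than one value
foreach v of varlist customer_income max_purchase {
    bysort customer_ID (`v'): gen byte varies = (`v'[1] != `v'[_N])
    quietly count if varies
    display "`v': " r(N) " observations in cases where the value varies"
    drop varies
}
```

    A count of zero suggests the variable can go inside casevars(). Anything else means it either varies by alternative and belongs before the comma, or the data contain an inconsistency worth fixing first.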

    Best,
    Josh



    • #3
      Thanks Josh!

      Your answers are super helpful!


      To add,...

      My logic for using cmclogit was that a customer's one purchase would impact his/her second purchase. So, each purchase is not independent.

      Also, since my data is not panel data it seems like I can't use cmxtmixlogit. Let me know your thoughts.


      Thanks,
      Olivia
