  • Path Analysis Using sem – Dummy Coding for Categorical Independent Variables

    Hello everyone,

    Thank you so much in advance for your time and support.

    I am conducting a path analysis using the sem command in Stata. All of my mediators and outcome variables are continuous. The independent variables include continuous, dichotomous, and categorical variables.

    As I understand it, the sem command does not support factor variable notation (e.g., i.varname), so I created dummy variables manually for the categorical variables with three or more categories. I would appreciate it if you could review my approach and let me know if it is correct.


    Example: Race/Ethnicity Variable

    Code:
    tab racehisp_2015
    
                              racehisp_2015 |      Freq.     Percent        Cum.
    ----------------------------------------+-----------------------------------
                       0 Non-Hispanic White |        894       61.53       61.53
                       1 Non-Hispanic Black |        405       27.87       89.40
    2 Others (AI/AN/Asian/NHPI/Other/Hispan |        154       10.60      100.00
    ----------------------------------------+-----------------------------------
                                      Total |      1,453      100.00
    To set Non-Hispanic White (0) as the reference group, I created the following dummy variables:

    Code:
    gen black_dummy_2015 = (racehisp_2015 == 1)
    gen others_dummy_2015 = (racehisp_2015 == 2)
    These dummy variables are coded as 1 if the participant belongs to the specified group, and 0 otherwise.

    Code:
     tab black_dummy_2015
    
    black_dummy |
          _2015 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      1,048       72.13       72.13
              1 |        405       27.87      100.00
    ------------+-----------------------------------
          Total |      1,453      100.00
    Code:
     tab others_dummy_2015
    
    others_dumm |
         y_2015 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      1,299       89.40       89.40
              1 |        154       10.60      100.00
    ------------+-----------------------------------
          Total |      1,453      100.00
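    One way to double-check the coding is to cross-tabulate the original variable against each dummy. This is only a sketch: black_dummy_2015_m is a hypothetical variable name, and the last line matters only if racehisp_2015 has missing values, since an expression such as (racehisp_2015 == 1) evaluates to 0, not missing, for those observations.
    Code:
    * cross-tabulate the original variable against each dummy, showing missing values
    tab racehisp_2015 black_dummy_2015, missing
    tab racehisp_2015 others_dummy_2015, missing
    * no observation should belong to both groups
    assert black_dummy_2015 + others_dummy_2015 <= 1
    * variant that keeps missing values missing (hypothetical variable name)
    gen black_dummy_2015_m = (racehisp_2015 == 1) if !missing(racehisp_2015)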
    Questions:

    1. Does this approach correctly treat Non-Hispanic White as the reference group?
      I understand that each dummy variable includes Non-Hispanic White and the remaining group(s) in the 0 category. Is this appropriate for creating dummy variables?
    2. Should all dichotomous variables in the model be coded as 0 and 1?
      For example, should “1” consistently indicate the presence of a characteristic or condition, and “0” indicate absence?
    Thank you again for your time and support. I would greatly appreciate your feedback.

  • #2
    1. Your coding is fine. Simply include black_dummy_2015 and others_dummy_2015 in your model, and Non-Hispanic White will be the reference group. You can also save time by letting tab create the dummies for you:
    Code:
    tab racehisp_2015, gen(race_dummies)
    2. Stata does not care whether 1 means "presence" or "absence" of a group or characteristic. Of course, a consistent pattern makes the interpretation of the results easier for you and helps you avoid mistakes.
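    As a rough sketch of how the generated indicators might then be used (y, m, and x are hypothetical placeholders, not variables from this thread): tab with gen() creates one indicator per category, here race_dummies1 to race_dummies3, so the indicator for the reference group has to be left out of the model.
    Code:
    tab racehisp_2015, gen(race_dummies)
    describe race_dummies*
    * race_dummies1 corresponds to the lowest value (0 = Non-Hispanic White);
    * omit it so that group is the reference
    sem (m <- x race_dummies2 race_dummies3) (y <- m x race_dummies2 race_dummies3)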
    Best wishes

    Stata 18.0 MP | ORCID | Google Scholar


    • #3
      Felix Bittmann Thank you so much for your review and suggestions. It is really helpful.


      • #4
        Another way to automate the creation of indicator ("dummy") variables is with the -xi- command or the -xi:- prefix. Although -xi- is largely obsolete, it remains useful for commands that, like -sem-, don't support factor-variable notation. Its syntax looks a great deal like factor-variable notation. In your situation you could write your -sem- command as:
        Code:
        xi: sem (dv <- iv1 iv2 i.racehisp_2015)
        replacing the placeholders dv, iv1, and iv2 with the actual corresponding variables in your model. Note that although this approach resembles factor-variable notation typographically, it does not enable you to subsequently use -margins- correctly. See -help xi- for a full explanation of the -xi- command.
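        A small sketch of what -xi- does behind the scenes, assuming its default settings: run on its own, it creates indicator variables with the prefix _I and omits the category with the smallest value (here 0, Non-Hispanic White), so that group serves as the reference.
        Code:
        xi i.racehisp_2015
        * list the indicator variables that xi created
        describe _I*
        * optional: to omit a different category (e.g. value 1) in later xi calls,
        * set the omit characteristic first
        char racehisp_2015[omit] 1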


        • #5
          Clyde Schechter Thank you very much for suggesting an alternative approach. This is very helpful!


          • #6
            Note too that -gsem- allows factor variables. See [SEM] Intro 3 in the Stata manual for more info.
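            For illustration, a minimal sketch of a path model in -gsem- using factor-variable notation; dv, med, iv1, and iv2 are hypothetical placeholders. For continuous outcomes, -gsem- defaults to a Gaussian family with an identity link, that is, linear regression paths.
            Code:
            gsem (med <- iv1 iv2 i.racehisp_2015) (dv <- med iv1 iv2 i.racehisp_2015)
            * ib0. makes the reference category explicit (0 = Non-Hispanic White)
            gsem (med <- iv1 iv2 ib0.racehisp_2015) (dv <- med iv1 iv2 ib0.racehisp_2015)
            Not every -sem- postestimation command is available after -gsem-, so it may be worth checking -help gsem postestimation- before switching.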


            PS- You mentioned "mediators" in #1. IMO, it is better to call them putative or presumed mediators. The article by Fiedler et al. (2018) explains why I say that. I think this article should be required reading for doing mediation analysis. YMMV.

            Fiedler, K., Harris, C., & Schott, M. (2018). Unwarranted inferences from statistical mediation tests–An analysis of articles published in 2015. Journal of Experimental Social Psychology, 75, 95-102.
            Last edited by Bruce Weaver; 16 Jun 2025, 15:56. Reason: Added the PS.
            --
            Bruce Weaver
            Version: Stata/MP 19.5 (Windows)


            • #7
              Bruce Weaver Thank you so much for your invaluable information and suggestions regarding the terminology. This has been truly helpful!
