  • clogit for matched cases and controls with a cluster variable and missing data

    Hello,

    I have 168 participants who meet the criteria for being a case and 860 controls; a further 211 have missing data and so cannot be classified as either. On the variables of interest, about 300 participants have entirely missing data. Should they be excluded from the case/control groups?

    So far I have matched controls to cases 2:1 on age and gender. The overall age range of the sample is 7 years 8 months - 10 years 6 months. This is my code:

    Code:
     preserve
    
    . keep if case
    (860 observations deleted)
    
    . drop case
    
    . tempfile cases
    
    . save `cases'
    file C:\Users\Guest\AppData\Local\Temp\ST_4790_000008.tmp saved
    
    . restore
    
    . drop if case
    (379 observations deleted)
    
    . drop case
    
    . ds c1dage cdgender, not
    uniqueid      variables etc
    
    . rename (`r(varlist)') =_ctrl
    
    . tempfile controls
    
    . save `controls'
    file C:\Users\Guest\AppData\Local\Temp\ST_4790_000009.tmp saved
    
    . use `cases'
    
    . rangejoin c1dage -1 1 using `controls', by(cdgender)
      (using rangestat version 1.1.1)
    
    . set seed 8846
    
    . gen double shuffle = runiform()
    
    . duplicates drop
    
    Duplicates in terms of all variables
    
    (0 observations are duplicates)
    
    . by uniqueid (shuffle), sort: keep if _n <= 2
    (152,179 observations deleted)
    
    . by uniqueid (shuffle), sort: keep if _n <= 2
    (0 observations deleted)
    
    . drop shuffle
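    A note on the log above: Stata treats a missing numeric value as true in an if condition, because missing is nonzero. That would explain why -drop if case- removed 379 observations (168 + 211), and it suggests that -keep if case- retained the 211 participants with missing case status alongside the 168 cases. A minimal sketch with made-up data, assuming case is a numeric 0/1 variable with missing values, shows the difference:

```stata
* Synthetic illustration only; the name case mirrors the log, the data are made up
clear
set obs 6
gen byte case = cond(_n <= 2, 1, cond(_n <= 4, 0, .))   // 2 cases, 2 controls, 2 missing

count if case          // missing is nonzero, so the 2 missing rows are counted too
count if case == 1     // confirmed cases only
count if case == 0     // confirmed controls only
```

    If that is what happened, -keep if case == 1- for the cases file and -keep if case == 0- for the controls file would leave the unclassified participants out of both.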
    I have then reshaped to long in preparation for clogit with the following code:

    Code:
    ds *_ctrl, not
    uniqueid      c1dage      variables etc
    . local vbles `r(varlist)'
    
    . rename (`vbles') =_case
    
    . gen long obs_num = _n
    
    . clonevar group_id = uniqueid_case
    
    . reshape long  `vbles', i(obs_num) j(cc) string
    (note: j = _case _ctrl)
    (note: cdgender_ctrl not found)
    (note: c1dage_ctrl not found)
    (note: c1dage_U_ctrl not found)
    
    Data                               wide   ->   long
    -----------------------------------------------------------------------------
    Number of obs.                      758   ->    1516
    Number of variables                  49   ->      28
    j variable (2 values)                     ->   cc
    xij variables:
                uniqueid_case uniqueid_ctrl   ->   uniqueid
            
                c1dage_U_case c1dage_U_ctrl   ->   c1dage_U
    other variables etc
    -----------------------------------------------------------------------------
    
    . drop obs_num
    
    . duplicates drop if cc == "_case"
    
    Duplicates in terms of all variables
    
    (2 observations deleted)
    I am now a bit stuck. I want to do a very simple test at this point, just to see whether there is a difference between cases and controls for certain variables and whether it is statistically significant. For example, I have a continuous variable, BMI. I want to know whether cases have a higher BMI than controls and whether this is statistically significant.
    I also have another variable that records whether the participant has short sleep, which I have recoded from a continuous variable to a categorical one where 1 = yes and 0 = no. Can I enter both the continuous variable and the categorical variable into clogit? Is it possible, though, to find out whether each variable individually differs significantly between cases and controls before entering them into a model? I really just want something very simple.

    To complicate things further I have a cluster variable that has rather a large number of values that I need to control for (geographical area).

    Any help gratefully appreciated.
    Last edited by sladmin; 18 Jul 2025, 12:11. Reason: anonymize original poster

  • #2
    For the simple test, look at the documentation for ttest. Alternatively, you can run a regression with BMI as the dependent variable and a case-control dummy on the right-hand side.
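    As a rough sketch of both suggestions, assuming made-up data and the hypothetical variable names bmi and case (neither is taken from the actual dataset), and ignoring the matched design:

```stata
* Synthetic data for illustration only; bmi and case are hypothetical names
clear
set seed 12345
set obs 300
gen byte case = mod(_n, 3) == 0              // 1 case for every 2 controls
gen double bmi = 17 + 0.8*case + rnormal(0, 2)

* Two-sample t-test of BMI by case status (equal variances assumed by default)
ttest bmi, by(case)

* Regression version: the coefficient on 1.case is the difference in mean BMI
regress bmi i.case
```

    The coefficient on 1.case equals the difference in mean BMI between the two groups, so the two commands tell the same story.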

    It is seldom a good idea to recode continuous to binary - it throws away information and adds measurement error.

    You certainly can have continuous and dummy variables in clogit. You can run the model with just one of the two if you like.

    Your final statement about the cluster variable complicates things. If you control for a pile of other stuff, you're not doing a "simple test" - you're looking for an effect holding other stuff constant. But, in general, clogit is quite happy with many variables as long as you have sufficient usable observations.



    • #3
      I'd second what Phil said.

      Doing the simple t-test is (unfortunately) a bit trickier in the long format. I'm confused here because I don't quite understand variable names, structure, etc., but in general outline, the paired t-test could be done by an approach like this:
      Code:
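      sort uniqueid cc   // order the two rows within each pair by cc
      by uniqueid: gen diff = BMI[1] - BMI[2] if _n == 1   // one difference per pair (row 1 minus row 2); BMI as example from your earlier posting
      ttest diff == 0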
      Having shown this, I'd say again that I'm not endorsing this t-test as a desirable analytic approach.

      Now, regarding your clogit model. Presuming that you have matched case and control pairs identified by uniqueid, and that cc is a variable that is 1 for cases and 0 for controls, you'd want something like:
      Code:
      clogit cc x1 x2 i.ClusterVariable, group(uniqueid)
      where x1 and x2 are covariates of interest. I suspect that entering the ClusterVariable as I have shown will cause trouble if it generates a large number of indicator variables (e.g., 15 or so).

      What can you do if there are lots of values for the ClusterVariable? One thing that comes to mind is the possibility of using the uniqueid as a "fixed effect" and the ClusterVariable as a "random effect" in a mixed-effects logistic model (-melogit-), but I don't know enough to know whether this will give you the right estimates in a data set with case-control sampling. And all the missing data sounds problematic. I'd say that the Stata issues are no big deal here. What I'd worry about is figuring out an appropriate statistical treatment of what could be a complicated and non-standard situation. That would require a detailed description of your problem and data collection, presented to someone with better epi/biostat knowledge than I have. So, I'd say your problem here is best framed as "How should I analyze this data if my goal is to understand ..... "



      • #4
        Right, I think I did the clogit wrong then, as I ran it like this:

        Code:
        clogit cc i.smoke area*, group(group_id) or
        which led to a crazy odds ratio of 8. So in that example I think I've used cc (case/control) as the DV, smoking as the independent variable and the area (cluster variable) as a covariate. I've also used group_id instead of uniqueid. This variable was produced when I reshaped to long.

        I will post a dataex example when I get into work, but can anyone see my error immediately from that code?



        • #5
          Code:
          clogit cc i.smoke area*, group( group_id ) iterate(1) or

          note: multiple positive outcomes within groups encountered.
          note: 228 groups (407 obs) dropped because of all positive or
                all negative outcomes.

          Iteration 0:   log likelihood = -226.34064
          Iteration 1:   log likelihood = -225.14419
          convergence not achieved

          Conditional (fixed-effects) logistic regression

                                                          Number of obs     =        551
                                                          LR chi2(2)        =      23.23
                                                          Prob > chi2       =     0.0000
          Log likelihood = -225.14419                     Pseudo R2         =     0.0491

          ------------------------------------------------------------------------------
                    cc | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                 smoke |
                  Yes  |   8.219045   4.254183     4.07   0.000     2.980156    22.66751
                  area |     1.0006   .0005465     1.10   0.272     .9995295    1.001672
          ------------------------------------------------------------------------------

          Warning: convergence not achieved
          Help?
          Last edited by sladmin; 18 Jul 2025, 12:03. Reason: anonymize original poster



          • #6
            I presume your area variable is categorical. Stata does not know that, and has treated it as though it is continuous, which almost certainly makes no sense in your situation. No data analysis program I know of can know something like that without the user letting it know. That's why I used the i. notation. In Stata, see -help fvvarlist-.
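            To illustrate the difference the i. prefix makes, here is a small sketch using lowbirth2, the matched case-control example dataset from the Stata documentation, rather than your own variables. It assumes internet access for -webuse-; race stands in for a categorical variable like area.

```stata
* Assumes internet access; lowbirth2 is Stata's matched case-control example data
webuse lowbirth2, clear

* race entered bare: treated as continuous, so it gets a single coefficient
clogit low smoke race, group(pairid) or

* i.race: treated as categorical, one indicator per non-base category
clogit low smoke i.race, group(pairid) or
```

            In the second model race contributes one odds ratio for each category other than the base level, which is the behaviour you would want for a categorical area variable.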

            The kinds of difficulties you are running into suggest that you need broader and deeper advice than is possible in an online forum like this. I'd encourage you to seek out a consultation in a real-life setting, where give and take and the solicitation of detail is more efficient.

