
  • stptime - SMR, "using data not sorted" error

    Hello,

    I have individual-level survival data for a cohort of people with my exposure variable of interest coded 0-3. I would like to calculate SMR for exposure groups 1-3 using exposure = 0 as the reference group. I've therefore saved the age group-specific mortality rates in a separate file, sorted by age group.

    I can't share the data itself due to confidentiality restrictions but here is an example of what it looks like (simplifying to 3 age groups):

    Code:
    clear
    input byte exposure float(age_group failure persontime id)
    0 1 0  15  1
    0 1 0  15  2
    0 1 0   7  3
    0 1 1   3  4
    0 1 1 2.5  5
    0 2 0   5  6
    0 2 0  15  7
    0 2 0  15  8
    0 2 0  11  9
    0 2 1   5 10
    0 2 1   4 11
    0 2 1 2.2 12
    0 3 0 3.2 13
    0 3 0  15 14
    0 3 0  15 15
    0 3 0   7 16
    0 3 1 3.2 17
    0 3 1 9.4 18
    0 3 1  14 19
    1 1 0  15 20
    1 1 0  15 21
    1 1 0   8 22
    1 1 1   3 23
    1 2 0   9 24
    1 2 0  15 25
    1 2 0  15 26
    1 2 0   4 27
    1 2 1  12 28
    1 2 1  .5 29
    1 3 0   6 30
    1 3 0  15 31
    1 3 0  15 32
    1 3 1   8 33
    1 3 1   2 34
    1 3 1   4 35
    2 1 0  15 36
    2 1 0  14 37
    2 1 0   8 38
    2 1 1   6 39
    2 2 0  13 40
    2 2 0  15 41
    2 2 0   8 42
    2 2 0   5 43
    2 2 1   1 44
    2 2 1  15 45
    2 3 0   9 46
    2 3 0  15 47
    2 3 0  15 48
    2 3 1   9 49
    2 3 1   6 50
    2 3 1   1 51
    3 1 0  13 52
    3 1 0  10 53
    3 1 1   3 54
    3 1 1   7 55
    3 2 0  15 56
    3 2 0  15 57
    3 2 0   9 58
    3 2 1 6.3 59
    3 2 1   7 60
    3 2 1   1 61
    3 3 0  14 62
    3 3 0  15 63
    3 3 0  12 64
    3 3 0   8 65
    3 3 1  11 66
    3 3 1  12 67
    end


    When I try to run stptime with the SMR option, I get the following error message:
    "using data not sorted"

    Code:
    stptime, smr(age_group rate) using("U:\working\unexposed_rates.dta" by(exposure) per(100000)
    This is despite the using data definitely being sorted. I also get the same message if I try

    Code:
    stptime if exposure>0, smr(age_group rate) using("U:\working\unexposed_rates.dta" by(exposure) per(100000)
    I read this response that suggests that sorting the using data manually would fix the problem, but that hasn't been the case on my attempts: https://www.statalist.org/forums/for...-using-stptime

    Any suggestions as to where I might be going wrong gratefully received.

    Thanks

  • #2
    Stata is rarely wrong about whether data is or is not sorted, and Clyde Schechter is rarely wrong in his advice.

    If you run
    Code:
    describe using "U:\working\unexposed_rates.dta"
    just before you run the stptime command, the bottom of the output from describe will have either
    Code:
    Sorted by: exposure
    if it is sorted by exposure, which is your by-variable, or
    Code:
    Sorted by:
    if it is not sorted.



    • #3
      Hi

      Thanks William - you're right, it hadn't saved the sorting order correctly when I created the file using post.

      However, I'm now getting the error "no observations merged, at() option not specified or incorrectly specified" when I run the stptime command with the syntax below. I haven't specified an at() option because I'm looking for an overall SMR, but when I add one matching the age bands in the individual-level and standard data, I still get the same message:

      Code:
      stptime if exposure>0, smr(age_group rate) using("U:\working\unexposed_rates.dta" by(exposure) per(100000)

      The structure of the standard population is as follows, in case it helps:
      Code:
      input byte exposure float(age_group rate)
      0 1 52.9
      0 2 88.2
      0 3 125.7
      end
      Also, I thought I should be sorting on age group (which is essentially the 'matching' variable between the individual-level data and the standard population), as exposure will be 0 for all rows in the standard population?
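
      If age_group is indeed the matching key (an assumption; post #2 suggests exposure instead), re-saving the rates file with its sort order recorded could look like the sketch below. It relies on the fact that save stores the current sort order in the .dta file, which a file built with post will not have until it is sorted and saved again.
      Code:
      * Sketch only: this replaces the data in memory, so reload the cohort data afterwards
      use "U:\working\unexposed_rates.dta", clear
      sort age_group
      save "U:\working\unexposed_rates.dta", replace
      * The last line of the output should now read "Sorted by: age_group"
      describe using "U:\working\unexposed_rates.dta"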

      Any advice gratefully received.



      • #4
        I see a missing right parenthesis at the end of the -using- option:
        Code:
        using("U:\working\unexposed_rates.dta") by(exposure) ...



        • #5
          Well spotted, Mike. That was a transcription error* on my part, I'm afraid; it still doesn't work even with the parenthesis.

          (*The secure analysis environment I'm using doesn't allow copy/paste)



          • #6
            A casual look at stptime.ado suggests that perhaps stptime determines the levels of exposure without taking the if clause into account, and thus attempts its calculation (sorry, I'm just a guy who trusts Clyde's advice and Stata's assertions about sorting) for exposure==0, for which no observations remain once the if clause is applied.

            Perhaps
            Code:
            keep if exposure>0
            stptime, smr(age_group rate) using("U:\working\unexposed_rates.dta") by(exposure) per(100000)
            will succeed.



            • #7
              It shouldn't matter, but you don't need the variable exposure in the using file.

              I can't help with your specific problem without seeing more details of what you are doing. You could also try -strate- or estimate the SMRs from first principles (i.e., do the merging and calculation of expected rates yourself).
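
              A minimal sketch of the "first principles" route, assuming the example variables from post #1 are in memory and that the rates file contains age_group and rate. The scaling by 100,000 is an assumption, since the thread does not state the units of rate. Unlike the older merge syntax, merge m:1 does not need the using file to be sorted beforehand.
              Code:
              * Sketch only; the rate units (per 100,000 units of persontime) are an assumption
              keep if exposure > 0
              merge m:1 age_group using "U:\working\unexposed_rates.dta", keepusing(rate) keep(match) nogenerate

              * Expected deaths = person-time x reference rate
              generate double expected = persontime * rate / 100000

              * Observed and expected deaths per exposure group, then SMR = observed / expected
              collapse (sum) observed = failure expected, by(exposure)
              generate double smr = observed / expected
              list exposure observed expected smr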

              However, it's not clear why you are calculating SMRs. Your analytic approach seems non-standard.

              "I have individual-level survival data for a cohort of people with my exposure variable of interest coded 0-3. I would like to calculate SMR for exposure groups 1-3 using exposure = 0 as the reference group. I've therefore saved the age group-specific mortality rates in a separate file, sorted by age group."
              Why not just use the individual data to estimate the rate ratios for exposure groups 1-3 compared to group 0?
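
              A minimal sketch of that direct comparison, assuming the example variables from post #1. Poisson regression with person-time as the exposure is one common choice here, not necessarily the model the poster had in mind.
              Code:
              * Sketch only: age-adjusted rate ratios for exposure groups 1-3 versus group 0
              * Note: exposure() below is the poisson option for person-time, not the variable exposure
              poisson failure i.exposure i.age_group, exposure(persontime) irr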

              SMRs are typically used when one does not have individual data on the unexposed and so uses tabulated rates instead. You have individual data, which you are using to tabulate rates, which then serve as the denominator in the rate ratio. This seems to be a lot of extra steps for no reason. Or maybe you have a reason that's not clear from your OP? I may be wrong, but it's possible the standard error of the SMR is calculated under the assumption that only the observed count is a random variable and the expected count is fixed and known. If so, your approach assumes that the rate in exposure category 0 has zero variance (and zero covariance with the observed count), so your variance estimates may be understated because the random error in the reference rate is ignored.
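
              The concern about the standard error can be made concrete with a sketch that builds the SMRs from the individual records in post #1 and compares two approximate standard errors for log(SMR): one treating the expected count as fixed, and one adding a delta-method term for the variability of the reference-group deaths. The names d0, T0, e_k and v_k are illustrative, and the assumption of independent Poisson counts is mine rather than something stated in this thread. Run it as a single do-file so the temporary file stays available.
              Code:
              * Reference deaths (d0) and person-time (T0) per age group, from exposure == 0
              preserve
              keep if exposure == 0
              collapse (sum) d0 = failure T0 = persontime, by(age_group)
              tempfile ref
              save `ref'
              restore

              * Observed deaths and person-time per exposed group and age group
              keep if exposure > 0
              collapse (sum) observed = failure T = persontime, by(exposure age_group)
              merge m:1 age_group using `ref', keep(match) nogenerate

              * Expected deaths and their variance, assuming d0 is Poisson
              generate double e_k = T * d0 / T0
              generate double v_k = (T / T0)^2 * d0
              collapse (sum) observed expected = e_k v_e = v_k, by(exposure)

              generate double smr = observed / expected
              * SE of log(SMR) with expected treated as fixed and known
              generate double se_fixed = 1 / sqrt(observed)
              * SE of log(SMR) allowing for random error in the reference rates
              generate double se_full = sqrt(1 / observed + v_e / expected^2)
              list exposure observed expected smr se_fixed se_full

              Because se_full adds a non-negative term under the square root, it can never be smaller than se_fixed; how much larger it is depends on how many deaths the reference group contributes.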
