
  • Matching cases to controls (1:4) by age category, sex, state

    I have tried to follow the examples in the forum on matching cases / controls but have been unsuccessful.

    I have a large dataset of 21,700 observations, of which 187 are cases. I have information on age, sex, state, and cause of death for the whole study population. I would like to match each case to 4 controls. I am drawing my controls from the same population as the cases. I would like to match on sex, age (within a 5-year range), state of residence, and cause of death (3 categories).
    Can I match in steps: first on sex, age (within a 5-year range), and state of residence, and then, from this first pool, on cause of death?
    How do I do this in Stata 9 (what is the code), and come up with cases and controls to answer my question of interest: whether the odds in my cases are higher than in the controls?

  • #2
    If you are using Stata 9, this is not easy. Frankly, just for the convenience of being able to do this simply, it would probably be worth your while to get the current (or at least a recent) version of Stata, or get a friend who has it to do this for you. With version 11 or beyond you can use Robert Picard's -rangejoin- command (available from SSC) and the whole thing becomes simpler and a whole lot faster.

    If you really must do this in version 9, let me assume you have already broken your cases and controls into two separate files, cases.dta and controls.dta. In addition to variables age, sex, state, cause_of_death, I assume that every observation has a distinct id number in a variable called id. First the slow part:

    Code:
    use cases, clear
    rename id case_id
    rename age case_age
    joinby sex state cause_of_death using controls, unmatched(master)
    keep if abs(age-case_age) <= 5
    Note that the -joinby- command will run slowly, and you will end up with a very large dataset, possibly exceeding your memory limitations.

    The data in memory now have each case paired up with every control to which it is a satisfactory match. The next part is to reduce it to just 4 per case. If you are willing to have the same control matched to more than one case (control sampling with replacement), then it's very easy from here:

    Code:
    set seed your_lucky_number_here
    gen double shuffle = uniform()   // in Stata 9 use uniform(); runiform() is the name used in later releases
    by case_id (shuffle), sort: keep if _n <= 4
    drop shuffle
    If you must sample controls without replacement, then it's more complicated than that. Post back if that is the case.




    • #3
      I am not sure why you want to do this in steps (rather than all at once), but I tend to use one of the programs available at SSC (e.g., vmatch) for this kind of thing. Also, do you really need or want exactly 4 controls per case? What if a case doesn't have 4 matches? Do you then drop that case and whatever matches exist?

      I assume that one could use, e.g., vmatch for each step and then drop the unmatched controls and use it again for the next step, but doing it all at once appears to be much more efficient and less error-prone



      • #4
        Thank you Clyde and Rich. I have now updated to Stata 13. Would the same code work, or is it easier to use the program (and how do I go about this)? I am OK with having one control matched to more than one case (with replacement). I do want to keep all the cases, even those that do not have 4 matches. My assumption is that, as I have a small number of cases and a large sample to draw controls from, and am matching within an age range rather than on exact age, I will get better matches.
        Thanks



        • #5
          First install -rangejoin- from SSC (-ssc install rangejoin-). Then you want code like this:

          Code:
          use cases, clear
          
          rangejoin age -5 5 using controls, by(sex state cause_of_death)
          set seed your_lucky_number_here
          gen double shuffle = runiform()
          by id (shuffle), sort: keep if _n <= 4
          You will find that -rangejoin- is much faster than -joinby- and demands much less memory. And it enables you to both exactly match sex, state, and cause_of_death, and match within the desired numerical range on age. Again, the above code assumes that the variables id, age, state, sex, and cause_of_death exist, under those names, in both data sets.

          Do read -help rangejoin- for more information about things like renaming the variables in the -using- data set after they are imported.
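For what it's worth, here is a minimal sketch of that renaming, not a tested recipe. It assumes -rangejoin- accepts a prefix() option (I believe it does, but confirm in -help rangejoin-), and the prefix ctrl_ is only an illustrative choice. As far as I know, -rangejoin- also needs -rangestat- from SSC to be installed.

```stata
* Sketch only: prefix(ctrl_) is an assumed option with an illustrative name;
* confirm the exact syntax in -help rangejoin- before relying on it.
ssc install rangestat      // -rangejoin- relies on -rangestat- (one-time install)
ssc install rangejoin

use cases, clear
rangejoin age -5 5 using controls, by(sex state cause_of_death) prefix(ctrl_)

* Inspect what the imported control variables ended up being called
describe
```

Running -describe- right after the join is the safest way to see which names the control variables actually received before you refer to them in later commands.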



          • #6
            Thank you Clyde. After using the above code I now have a dataset of 555 observations. Is this the number of observations that could be matched using the specified criteria? Would I be correct to say they are the controls in my case-control study? Thank you.



            • #7
              Following that code, each observation in your data set should contain information about one case and one of its matched controls. (It is also possible that some of your cases drew no matches at all and those observations could contain only case information.) There can be up to 4 such observations per case. Evidently some of your cases could not draw 4 matches that met your criteria; you would have had more observations at the end if they had. Anyway, whatever control IDs appear in that data set of 555 observations are the controls in your study. That doesn't mean you necessarily have 555 separate controls, as some controls may have matched to more than one case. You would need to get a count of the number of distinct values of the control id variable to know how many there are.
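A minimal sketch of that count, under two assumptions you should check with -describe-: that the case id is still called id, and that the control's id was imported by -rangejoin- as id_U. Substitute whatever names actually appear in your data.

```stata
* Assumptions: id = case id; id_U = control id as imported by -rangejoin-
egen byte one_per_control = tag(id_U)     // flags one observation per distinct control
count if one_per_control                  // number of distinct controls

* Number of matched controls each case actually received (0 if it drew none)
by id, sort: egen n_controls = total(!missing(id_U))
egen byte one_per_case = tag(id)          // flags one observation per case
tabulate n_controls if one_per_case
```

The tabulation shows how many cases ended up with 0, 1, 2, 3, or 4 controls, which is worth knowing before any analysis.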



              • #8
                Dear Prof. Clyde,

                My name is Akihiko, and I am a PhD student in Japan.
                In my research assessing the complications of ART pregnancy, I would like to match each case to one or two controls (normal pregnancy).
                I would like to match on age within a range of one year.

                In the process, I found this thread and have attempted to incorporate the above code using "rangejoin" for the control sampling.
                Yet I finally realized that I must sample controls without replacement. How do I do this in Stata 14?



                • #9
                  Sampling the controls without replacement is more complicated, and I do not have time to work it out today. If you search this Forum for other posts on matching, you will find some where this was implemented, and you can adapt the code there. I'm sorry I can't be of more help, but I have limited time today.



                  • #10
                    Prof. Clyde,

                    Thank you for your response.
                    I finally got it working, using the following post!
                    https://www.statalist.org/forums/for...s-and-controls

                    Best,

                    Akihiko
