Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counts in a long dataset taking into account other variables including time

    Hello,

    I have a very large dataset that includes people ("personid") who could have been seen at 4 multiple sites (distinguished by "siteid"). The variable "monthyear" is the month and the year that the visit occurred. I want to determine:

    1) the counts of people who were seen at more than one site and what the combination of these sites were (for example, in the below dataex example of 4 sites, how many people were seen at sites 19 and 32, 19 and 24, 19 and 47, 32 and 47, etc.)
    2) how many people were seen at site 24 AFTER being seen at site 19 (essentially the same table as in my first question but incorporating time)
    3) how many people were seen specifically at site 24 AFTER site 19 within 3 months

    Any advice much appreciated!

    Sarah


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long personid byte siteid str7 monthyear
       2 19 "2021_03"
       2 19 "2020_07"
       3 19 "2019_11"
      51 19 "2020_03"
      51 19 "2019_01"
      51 19 "2019_07"
      51 19 "2019_01"
      51 19 "2020_12"
      52 32 "21-Jan" 
      52 19 "2020_03"
      52 32 "21-Feb" 
      52 19 "2020_01"
      52 19 "2020_03"
      53 19 "2021_09"
      53 19 "2020_02"
      82 19 "2019_10"
      82 19 "2021_04"
      82 19 "2020_02"
      83 19 "2022_02"
      83 19 "2019_07"
      83 19 "2022_02"
      83 19 "2021_02"
      83 19 "2022_02"
      84 19 "2019_01"
      84 47 "19-Oct" 
      84 19 "2019_04"
      84 19 "2019_01"
      84 19 "2019_05"
      84 19 "2020_08"
      84 19 "2019_01"
     145 32 "21-Apr" 
     145 32 "21-Mar" 
     145 32 "21-Feb" 
     214 32 "21-Jan" 
     217 47 "20-Jan" 
     246 47 "20-Sep" 
     257 32 "21-Jan" 
     300 47 "20-May" 
     306 47 "18-Apr" 
     335 47 "18-Dec" 
     347 32 "21-Sep" 
     347 32 "21-Feb" 
     375 47 "18-Oct" 
     379 32 "21-Jan" 
     379 32 "20-Nov" 
     399 47 "19-Feb" 
     432 32 "20-Oct" 
     432 32 "21-Mar" 
     432 32 "21-Jan" 
     432 32 "21-Apr" 
     432 32 "20-Sep" 
     432 32 "21-Jan" 
     432 32 "20-Mar" 
     432 32 "20-Dec" 
     432 32 "22-Jan" 
     432 32 "21-Jan" 
     432 32 "20-Dec" 
     451 47 "18-Mar" 
     451 47 "18-Mar" 
     451 47 "20-Oct" 
     451 47 "19-Jul" 
     451 47 "18-Jul" 
     451 47 "18-Mar" 
     452 32 "19-Oct" 
     452 32 "20-Dec" 
     457 47 "19-Mar" 
     463 47 "18-Jun" 
     539 47 "19-Feb" 
     545 47 "20-Jan" 
     545 47 "20-Sep" 
     545 47 "20-Jan" 
     552 47 "20-Mar" 
     570 47 "18-Jan" 
     570 47 "18-Mar" 
     570 47 "18-Mar" 
    8775 47 "20-Jan" 
    8779 24 "17-Feb" 
    8779 24 "17-Sep" 
    8781 47 "19-Aug" 
    8781 47 "18-Oct" 
    8795 47 "20-Oct" 
    8795 47 "18-Oct" 
    8798 32 "21-Nov" 
    8806 47 "22-Feb" 
    8806 47 "21-Aug" 
    8806 47 "21-Oct" 
    8806 47 "21-Sep" 
    8806 47 "21-Mar" 
    8814 47 "18-Nov" 
    8815 32 "19-May" 
    8815 32 "19-Nov" 
    8815 32 "19-Dec" 
    8815 32 "20-Mar" 
    8815 32 "19-Nov" 
    8815 32 "19-Nov" 
    8815 32 "20-Mar" 
    8828 47 "19-Sep" 
    8828 47 "19-Sep" 
    8838 47 "18-Oct" 
    8838 47 "18-Oct" 
    end

  • #2
    Here is the code I used on your data example. Your 3) is set as an exercise.

    See the comments for where techniques and tricks are documented. For example, search dm0055, entry in Stata yields a reference to a paper.

    In a much larger dataset, some of the dates might be more awkward, and there may be more values than table commands will show, but list will always work.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long personid byte siteid str7 monthyear
       2 19 "2021_03"
       2 19 "2020_07"
       3 19 "2019_11"
      51 19 "2020_03"
      51 19 "2019_01"
      51 19 "2019_07"
      51 19 "2019_01"
      51 19 "2020_12"
      52 32 "21-Jan" 
      52 19 "2020_03"
      52 32 "21-Feb" 
      52 19 "2020_01"
      52 19 "2020_03"
      53 19 "2021_09"
      53 19 "2020_02"
      82 19 "2019_10"
      82 19 "2021_04"
      82 19 "2020_02"
      83 19 "2022_02"
      83 19 "2019_07"
      83 19 "2022_02"
      83 19 "2021_02"
      83 19 "2022_02"
      84 19 "2019_01"
      84 47 "19-Oct" 
      84 19 "2019_04"
      84 19 "2019_01"
      84 19 "2019_05"
      84 19 "2020_08"
      84 19 "2019_01"
     145 32 "21-Apr" 
     145 32 "21-Mar" 
     145 32 "21-Feb" 
     214 32 "21-Jan" 
     217 47 "20-Jan" 
     246 47 "20-Sep" 
     257 32 "21-Jan" 
     300 47 "20-May" 
     306 47 "18-Apr" 
     335 47 "18-Dec" 
     347 32 "21-Sep" 
     347 32 "21-Feb" 
     375 47 "18-Oct" 
     379 32 "21-Jan" 
     379 32 "20-Nov" 
     399 47 "19-Feb" 
     432 32 "20-Oct" 
     432 32 "21-Mar" 
     432 32 "21-Jan" 
     432 32 "21-Apr" 
     432 32 "20-Sep" 
     432 32 "21-Jan" 
     432 32 "20-Mar" 
     432 32 "20-Dec" 
     432 32 "22-Jan" 
     432 32 "21-Jan" 
     432 32 "20-Dec" 
     451 47 "18-Mar" 
     451 47 "18-Mar" 
     451 47 "20-Oct" 
     451 47 "19-Jul" 
     451 47 "18-Jul" 
     451 47 "18-Mar" 
     452 32 "19-Oct" 
     452 32 "20-Dec" 
     457 47 "19-Mar" 
     463 47 "18-Jun" 
     539 47 "19-Feb" 
     545 47 "20-Jan" 
     545 47 "20-Sep" 
     545 47 "20-Jan" 
     552 47 "20-Mar" 
     570 47 "18-Jan" 
     570 47 "18-Mar" 
     570 47 "18-Mar" 
    8775 47 "20-Jan" 
    8779 24 "17-Feb" 
    8779 24 "17-Sep" 
    8781 47 "19-Aug" 
    8781 47 "18-Oct" 
    8795 47 "20-Oct" 
    8795 47 "18-Oct" 
    8798 32 "21-Nov" 
    8806 47 "22-Feb" 
    8806 47 "21-Aug" 
    8806 47 "21-Oct" 
    8806 47 "21-Sep" 
    8806 47 "21-Mar" 
    8814 47 "18-Nov" 
    8815 32 "19-May" 
    8815 32 "19-Nov" 
    8815 32 "19-Dec" 
    8815 32 "20-Mar" 
    8815 32 "19-Nov" 
    8815 32 "19-Nov" 
    8815 32 "20-Mar" 
    8828 47 "19-Sep" 
    8828 47 "19-Sep" 
    8838 47 "18-Oct" 
    8838 47 "18-Oct" 
    end
    
    * cleaned up monthly date variable: see -help datetime- 
    gen mdate = monthly("20" + monthyear, "YM") if strpos(monthyear, "-") == 3 
    replace mdate = monthly(monthyear, "YM") if missing(mdate)
    format mdate %tm 
    tab mdate if missing(mdate)
    
    * row concatenation: see -search pr0071, entry- 
    bysort personid (siteid mdate) : gen sites = strofreal(siteid) if _n == 1 
    by personid : replace sites = cond(siteid != siteid[_n-1], sites[_n-1] + " " + strofreal(siteid), sites[_n-1]) if _n > 1 
    by personid : replace sites = sites[_N]
    
    bysort personid (mdate) : gen history = strofreal(siteid) if _n == 1 
    by personid : replace history = cond(siteid != siteid[_n-1], history[_n-1] + " " + strofreal(siteid), history[_n-1]) if _n > 1 
    by personid : replace history = history[_N]
    
    tabdisp personid, c(sites history) 
    tab1 sites history 
    
    tab sites if wordcount(sites) > 1 
    tab history if wordcount(history) > 1 
    
    levelsof siteid, local(sites) 
    
    * see -search dm0055, entry- Section 9 
    foreach s of local sites {
        egen first`s' = min(cond(siteid == `s', mdate, .)), by(personid)
        local firstvars `firstvars' first`s'
    }
    
    tabdisp personid, c(`firstvars')
    Results

    Code:
    . * cleaned up monthly date variable: see -help datetime- 
    . gen mdate = monthly("20" + monthyear, "YM") if strpos(monthyear, "-") == 3 
    (27 missing values generated)
    
    . replace mdate = monthly(monthyear, "YM") if missing(mdate)
    (27 real changes made)
    
    . format mdate %tm 
    
    . tab mdate if missing(mdate)
    no observations
    
    . 
    . * row concatenation: see -search pr0071, entry- 
    . bysort personid (siteid mdate) : gen sites = strofreal(siteid) if _n == 1 
    (61 missing values generated)
    
    . by personid : replace sites = cond(siteid != siteid[_n-1], sites[_n-1] + " " + strofreal(siteid), s
    > ites[_n-1]) if _n > 1 
    (61 real changes made)
    
    . by personid : replace sites = sites[_N]
    (9 real changes made)
    
    . 
    . bysort personid (mdate) : gen history = strofreal(siteid) if _n == 1 
    (61 missing values generated)
    
    . by personid : replace history = cond(siteid != siteid[_n-1], history[_n-1] + " " + strofreal(siteid
    > ), history[_n-1]) if _n > 1 
    (61 real changes made)
    
    . by personid : replace history = history[_N]
    (9 real changes made)
    
    . 
    . tabdisp personid, c(sites history) 
    
    ----------------------------------
     personid |      sites     history
    ----------+-----------------------
            2 |         19          19
            3 |         19          19
           51 |         19          19
           52 |      19 32       19 32
           53 |         19          19
           82 |         19          19
           83 |         19          19
           84 |      19 47    19 47 19
          145 |         32          32
          214 |         32          32
          217 |         47          47
          246 |         47          47
          257 |         32          32
          300 |         47          47
          306 |         47          47
          335 |         47          47
          347 |         32          32
          375 |         47          47
          379 |         32          32
          399 |         47          47
          432 |         32          32
          451 |         47          47
          452 |         32          32
          457 |         47          47
          463 |         47          47
          539 |         47          47
          545 |         47          47
          552 |         47          47
          570 |         47          47
         8775 |         47          47
         8779 |         24          24
         8781 |         47          47
         8795 |         47          47
         8798 |         32          32
         8806 |         47          47
         8814 |         47          47
         8815 |         32          32
         8828 |         47          47
         8838 |         47          47
    ----------------------------------
    
    . tab1 sites history 
    
    -> tabulation of sites  
    
          sites |      Freq.     Percent        Cum.
    ------------+-----------------------------------
             19 |         18       18.00       18.00
          19 32 |          5        5.00       23.00
          19 47 |          7        7.00       30.00
             24 |          2        2.00       32.00
             32 |         30       30.00       62.00
             47 |         38       38.00      100.00
    ------------+-----------------------------------
          Total |        100      100.00
    
    -> tabulation of history  
    
        history |      Freq.     Percent        Cum.
    ------------+-----------------------------------
             19 |         18       18.00       18.00
          19 32 |          5        5.00       23.00
       19 47 19 |          7        7.00       30.00
             24 |          2        2.00       32.00
             32 |         30       30.00       62.00
             47 |         38       38.00      100.00
    ------------+-----------------------------------
          Total |        100      100.00
    
    . 
    . tab sites if wordcount(sites) > 1 
    
          sites |      Freq.     Percent        Cum.
    ------------+-----------------------------------
          19 32 |          5       41.67       41.67
          19 47 |          7       58.33      100.00
    ------------+-----------------------------------
          Total |         12      100.00
    
    . tab history if wordcount(history) > 1 
    
        history |      Freq.     Percent        Cum.
    ------------+-----------------------------------
          19 32 |          5       41.67       41.67
       19 47 19 |          7       58.33      100.00
    ------------+-----------------------------------
          Total |         12      100.00
    
    . 
    . levelsof siteid, local(sites) 
    19 24 32 47
    
    . 
    . * see -search dm0055, entry- Section 9 
    . foreach s of local sites {
      2.         egen first`s' = min(cond(siteid == `s', mdate, .)), by(personid)
      3.         local firstvars `firstvars' first`s'
      4. }
    (70 missing values generated)
    (98 missing values generated)
    (65 missing values generated)
    (55 missing values generated)
    
    . 
    . tabdisp personid, c(`firstvars')
    
    ----------------------------------------------------------
     personid |    first19     first24     first32     first47
    ----------+-----------------------------------------------
            2 |        726                                    
            3 |        718                                    
           51 |        708                                    
           52 |        720                     732            
           53 |        721                                    
           82 |        717                                    
           83 |        714                                    
           84 |        708                                 717
          145 |                                733            
          214 |                                732            
          217 |                                            720
          246 |                                            728
          257 |                                732            
          300 |                                            724
          306 |                                            699
          335 |                                            707
          347 |                                733            
          375 |                                            705
          379 |                                730            
          399 |                                            709
          432 |                                722            
          451 |                                            698
          452 |                                717            
          457 |                                            710
          463 |                                            701
          539 |                                            709
          545 |                                            720
          552 |                                            722
          570 |                                            696
         8775 |                                            720
         8779 |                    685                        
         8781 |                                            705
         8795 |                                            705
         8798 |                                742            
         8806 |                                            734
         8814 |                                            706
         8815 |                                712            
         8828 |                                            716
         8838 |                                            705
    ----------------------------------------------------------

    Comment


    • #3
      Note that the results of tabulate (when called as tab or through tab1) in #2 count observations, not individuals. The way to count individuals is to create

      Code:
      egen tag = tag(personid)
      which tags each individual just once. Then subsequent commands can add the qualifier if tag to avoid multiple counting. Naturally you may wish to combine that with other conditions.

      In contrast, it is a feature of tabdisp that it selects each distinct category just once, and the trick is to ensure that you get shown what you want to see.
      Last edited by Nick Cox; 25 Mar 2023, 03:34.

      Comment


      • #4
        Nick,

        This is SO helpful- thank you so much. I have successfully run the code except the last tabdisp:
        tabdisp personid, c(`firstvars') which is giving me the error cellvar() required. Any thoughts on why that might be happening?

        Thank you very very much!

        Sarah

        Comment


        • #5
          Did you define the local firstvars ?Are you running the code in chunks from a do-file editor? If so, the reference to firstvars can't see the definition given earlier.

          See

          Code:
          SJ-20-2 dm0102  . . . . . . . . . Stata tip 138: Local macros have local scope
                  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
                  Q2/20   SJ 20(2):499--503                                (no commands)
                  focuses on a common misunderstanding of how local macros work


          -- or (the solution) run the code as a script from the Do-file editor.

          Comment


          • #6
            Hi Nick,

            Thank you. I guess I need to educate myself on local macros. When I run it for the first time, it does work (although it says too many values). I think I need to create some additional variables that could be used to summarize. Is there a way to create a binary variable that just indicates whether or not the personid was seen within 3 months at organization 47 after being seen at organization 19 (for example?)

            Thank you!

            Sarah

            Comment


            • #7
              First seen at 19? Last seen at 19?

              You could create indicators such as

              Code:
              bysort personid (mdate) : gen wanted = siteid==42 & siteid[_n-1]==19 & (mdate - mdate[_n-1]) <= 3

              Comment

              Working...
              X