Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • panel data xtabs first and last occurrence

    Hello,

    I am new to the Stata community and was hoping for some help with panel data analysis. I have health care data with 'clientid' as the panel variable. However, a client may have multiple visits on the same day with different or missing data recorded on the outcome variable (surveyid provides a unique identifier for observations on the same day). Here is a sample of my dataset:

    clientid date medfreq severity surveyid
    1 01/02/18 1 4 12345
    1 02/21/18 4 12789
    1 03/08/18 12890
    1 06/15/18 3 2 12990
    2 01/02/18 4 12681
    2 01/02/18 2 3 12688
    2 03/08/18 1 3 13450
    2 06/15/18 3 3 13560

    I have declared my dataset as panel data without indicating the time variable since there are repeated dates, which I want to preserve. Is this the correct approach?

    I want to identify the first and last occurrences that are ≥3 months apart with data recorded on "medfreq." I would then like to run a xtab to see how clients changed over time from their first to last occurrence. I didn't find any literature on this specific issue but would be grateful for any suggested code or if someone can kindly refer me to an existing reference.

    Thank you,
    Samira
    Last edited by Samira Soleimanpour; 25 Feb 2019, 13:10.

  • #2
    Your example implies that your aim is not well-defined. For example, individual 2 has two observations for 1 Feb 2018. Which is to be regarded as "first"?

    Comment


    • #3
      Thank you, Nick. The "first" would have to be identified by either the first observation where the data appear in that field and/or the smaller surveyid. Does that help clarify?

      Comment


      • #4
        OK, but your data example doesn't make full sense as you have different numbers of values in each observation. Nevertheless this may help.

        After sorting, the first and last measurements are directly available, but the trick is to put values only in one observation for each client to avoid multiple counting. That must be the same observation to allow tabulation to take place at all. Note that lag operators are out of the question here unless you have an artificial time variable, which you don't really need.

        Code:
        clear 
        input clientid str8 date medfreq severity surveyid
        1 "01/02/18" 1 4 12345
        1 "02/21/18" 4 . 12789
        1 "03/08/18" . . 12890
        1 "06/15/18" 3 2 12990
        2 "01/02/18" 4 . 12681
        2 "01/02/18" 2 3 12688
        2 "03/08/18" 1 3 13450
        2 "06/15/18" 3 3 13560
        end 
        
        bysort clientid (surveyid) : gen first = medfreq[1] if _n == 1 
        by clientid : gen last = medfreq[_N] if _n == 1 
        
        tab first last 
        
        
                   |    last
             first |         3 |     Total
        -----------+-----------+----------
                 1 |         1 |         1 
                 2 |         1 |         1 
        -----------+-----------+----------
             Total |         2 |         2
        There is at least one extra twist. Suppose a patient is recorded only once but you don't want the last observation (of one) to be regarded as the last observation because it wasn't at a different time. Then one command should be different:


        Code:
        by clientid : gen last = medfreq[_N] if _n == 1  &  _N > 1

        Comment


        • #5
          Hi Nick,

          The coding worked, thank you! Do you know how I can get Stata to ignore missing values and only use the first (and last) occurrence where data appear in the "medfreq" field? The sample data I included is part of a larger dataset (n=2,000+) where clients often have missing values in their first and/or last visits but will have values recorded during their visits in between.

          Thanks again,
          Samira

          Comment


          • #6
            Something like (not tested)

            Code:
            gen ismissing = missing(medfreq)
            bysort ismissing clientid (surveyid) : gen first = medfreq[1] if _n == 1 & !ismissing  
             by ismissing clientid : gen last = medfreq[_N] if _n == 1 & !ismissing

            Comment

            Working...
            X