  • Best way to mark a sample containing the balanced panel of observations with nonmissing data.

    I'm trying to mark a sample for a user program so that the marked sample is a balanced panel, given an if condition and potentially missing values in a set of input variables. The estimation routine requires a balanced panel, but I would prefer not to require the user to pre-validate that the input data is balanced. Rather, I would like the program to take an input varlist and mark the sample in which the data are nonmissing at every date spanned by the panel id with the longest time series (and then report the sample that was used).

    I wrote the following test code, which checks that the total number of nonmissing observations for each id equals the maximum number of conditional observations. It breaks down if there is a date mismatch, as in the last case. That case is invalid because no group has valid data at both t=5 and t=8 (or at the minimum and maximum date of the panel, if used unconditionally). I can add a failure check that calculates the in-sample minimum and maximum of the date, but I wanted to ask whether there is a more natural way to do this. I assume this is a fairly common concern, but I don't have a good sense of the best way to approach it.
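One possible form of the failure check described above, as a minimal sketch only: compare each id's first and last in-sample date with the overall in-sample range. The variable names `insample`, `first`, `last`, and `spans` are made up for illustration, and the data are a cut-down version of the example (ids 1 and 4 only).

```stata
clear
input float(id date variable)
1 5 .88
1 6 .2
1 7 .89
4 6 .7
4 7 .69
4 8 .93
end

* Illustrative flag: nonmissing data inside the requested window
gen byte insample = !missing(variable) & inrange(date, 5, 8)

* Overall in-sample date range
quietly summarize date if insample, meanonly
local tmin = r(min)
local tmax = r(max)

* First and last in-sample date within each id
bysort id: egen first = min(cond(insample, date, .))
bysort id: egen last  = max(cond(insample, date, .))

* An id spans the window only if it reaches both ends
gen byte spans = (first == `tmin') & (last == `tmax')

* Complain if no id covers the full in-sample range
quietly count if spans
if r(N) == 0 {
    display as error "no id has data at both date `tmin' and date `tmax'"
}
```

This only checks the endpoints, so it would not catch gaps in the middle of a series; it is meant to complement the observation count, not replace it.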


    Code:
    clear
    input float(id date variable)
    1 5 .88
    1 6  .2
    1 7 .89
    2 5 .58
    2 6 .37
    2 7 .85
    3 5 .39
    3 6 .12
    3 7   .
    4 6  .7
    4 7 .69
    4 8 .93
    end
    
    
    capture program drop balanced
    program define balanced
        syntax varlist [if], Generate(string)
        marksample touse
        tempvar obs balanced
        by id (date): gen `obs' = sum(`touse')
        qui sum `obs', meanonly
        local maxobs = `r(max)'
        qui by id (date): replace `touse' = 0 if `obs'[_N] != `maxobs'
        gen `generate' = `touse'
    end
    tsset id date
    balanced variable if inrange(date,5,7), g(bal57)
    balanced variable if inrange(date,5,6), g(bal67)
    balanced variable if inrange(date,6,7), g(bal56)
    balanced variable if inrange(date,5,8), g(bal58)  // Produces an incorrect result; should probably raise an error instead
    Here's the listed output:
    Code:
         +------------------------------------------------------+
         | id   date   variable   bal57   bal67   bal56   bal58 |
         |------------------------------------------------------|
      1. |  1      5        .88       1       1       0       1 |
      2. |  1      6         .2       1       1       1       1 |
      3. |  1      7        .89       1       0       1       1 |
      4. |  2      5        .58       1       1       0       1 |
      5. |  2      6        .37       1       1       1       1 |
      6. |  2      7        .85       1       0       1       1 |
      7. |  3      5        .39       0       1       0       0 |
      8. |  3      6        .12       0       1       0       0 |
      9. |  3      7          .       0       0       0       0 |
     10. |  4      6         .7       0       0       1       1 |
     11. |  4      7        .69       0       0       1       1 |
     12. |  4      8        .93       0       0       0       1 |
         +------------------------------------------------------+
    Edit: For added context, the user program this logic is being used for doesn't expect an arbitrary unbalanced panel. It expects a mostly balanced panel with a well-defined beginning and end date, where most observations of the dependent variable exist between the beginning and end dates but may occasionally be missing, and where the independent variables may be missing for certain groups in certain specifications but not others. The goal therefore isn't to search through arbitrary data looking for the "best" balanced panel, but rather to mark out the ids that have missing data somewhere, and probably to throw an error if the date range is not as "expected."
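Under that reading, one way the marking could be sketched is to count the distinct in-sample dates and keep only the ids with a valid observation at every one of them, exiting with an error when no id qualifies. This is an illustration, not a tested replacement: the program name `balanced2` and its `id()` and `time()` options are hypothetical, and it assumes numeric id and date variables with unique id-date pairs.

```stata
clear
input float(id date variable)
1 5 .88
1 6  .2
1 7 .89
2 5 .58
2 6 .37
2 7 .85
3 5 .39
3 6 .12
3 7   .
4 6  .7
4 7 .69
4 8 .93
end

capture program drop balanced2
program define balanced2, sortpreserve
    syntax varlist [if] [in], Generate(name) ID(varname) Time(varname)
    confirm new variable `generate'
    marksample touse
    markout `touse' `id' `time'
    tempvar tag nobs

    * T = number of distinct dates among in-sample observations
    quietly egen byte `tag' = tag(`time') if `touse'
    quietly count if `tag' == 1
    local T = r(N)

    * Number of in-sample observations for each id
    quietly bysort `id': egen long `nobs' = total(`touse')

    * Keep only ids observed at all T dates (assumes unique id-date pairs)
    quietly replace `touse' = 0 if `nobs' != `T'

    * Stop with "no observations" if nothing survives
    quietly count if `touse'
    if r(N) == 0 {
        display as error "no id has nonmissing data at all `T' in-sample dates"
        exit 2000
    }
    generate byte `generate' = `touse'
end

* Dates 5-7: only ids 1 and 2 have valid data at all three dates
balanced2 variable if inrange(date,5,7), g(ok57) id(id) time(date)

* Dates 5-8: no id covers all four dates, so the program exits with an error
capture noisily balanced2 variable if inrange(date,5,8), g(ok58) id(id) time(date)
```

Counting distinct dates rather than taking the per-id maximum is the design choice here: an id with the most observations still fails if some other id contributes a date it lacks, which is the situation in the 5 to 8 case.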
    Last edited by Malcolm Wardlaw; 09 Dec 2022, 21:48.