Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dropping panels with too few observations in some years

    I have a panel dataset where the rows (panels) are local government areas and the columns are years. The outcome of interest is violent re-offending and the principal independent variable is amphetamine use. I want drop local government areas where the number of amphetamine users in any given year falls below (say) five. Can anyone advise how to do this?

  • #2
    When asking for help with code it is always wise and usually essential to show example data. The -dataex- command is the best way to do that. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    In the meantime, from your description (though even the best descriptions are often misleading or incomplete, which is why example data is so important), it sounds like your data is in wide layout. You will find that there is very little you can do in Stata with that. So the first step is almost surely going to be to -reshape- the data to long. Once that is done, it will be very simple to remove the observations you want to omit. If you post back with example data, I will respond with code.

    Also please clarify whether you want to omit just those particular combinations of local government and year where the number of amphetamine users is below 5, or whether you want to omit all of the observations for a local government if the number of amphetamine users ever falls below 5 in any year.
    Last edited by Clyde Schechter; 21 Dec 2020, 17:25.

    Comment


    • #3
      Sorry about that Clyde. I've done what you suggest but a bit unsure how I convey the sample data to you so I'll just paste it (see below). The sample only shows some 2001 data but the years run up to 2016. I don't want to omit just those particular combinations of local government and year where the number of amphetamine users falls below five. I only want to include lgas that have a reasonable number of amphetamine users (actually: arrests for amphetamine use). So I want to drop all lgas that have less than 5 observations in any year. Hope this is clearer, regards, Don


      input long lga float(year prop_methuser mean15_25 prop_ATSI prop_male prop_priorviol prop_priordrug prop_violentreoff)
      50 2001 .001858736 .4749071 .1262136 .8364312 .2444238 .15520446 .1695652
      50 2001 .001858736 .4749071 .1262136 .8364312 .2444238 .15520446 .1695652
      50 2001 .001858736 .4749071 .1262136 .8364312 .2444238 .15520446 .1695652
      50 2001 .001858736 .4749071 .1262136 .8364312 .2444238 .15520446 .1695652
      50 2001 .001858736 .4749071 .1262136 .8364312 .2444238 .15520446 .1695652
      50 2001 .001858736 .4749071 .1262136 .8364312 .2444238 .15520446 .1695652
      50 2001 .001858736 .4749071 .1262136 .8364312 .2444238 .15520446 .1695652
      50 2001 .001858736 .4749071 .1262136 .8364312 .2444238 .15520446 .1695652
      50 2001 .001858736 .4749071 .1262136 .8364312 .2444238 .15520446 .1695652
      50 2001 .001858736 .4749071 .1262136 .8364312 .2444238 .15520446 .1695652
      50 2001 .001858736 .4749071 .1262136 .8364312 .2444238 .15520446 .1695652
      50 2001 .001858736 .4749071 .1262136 .8364312 .2444238 .15520446 .1695652
      50 2001 .001858736 .4749071 .1262136 .8364312 .2444238 .15520446 .1695652

      Comment


      • #4
        I can't make any sense of this example data in relation to your stated problem and description. First of all, what you show is 13 copies of the exact same numbers. Why do you have the same observation repeated over and over? That is, at best, an odd data structure and usually indicates something seriously wrong with the data management up to this point.

        Next, you have no variable that designates the number of users. The closest you come to that is prop_methuser which, I guess, is the proportion of all drug arrests or of all arrests (or all something????) that are for methamphetamine. But without a variable that gives the count of the total from which this is a proportion, you're stuck.

        Comment


        • #5
          Well I don't want to tax your patience Clyde but below is what I get when I type dataex to the original (individual level) dataset. the mspdi is an identifier for each individual. Here's the code I used to generate the variables:

          drop if indexyear>2016

          gen RODcutoff=mdy(1,10,2019)
          gen days_post_first_reoff = RODcutoff-reoff1offdate
          replace days_post_first_reoff = dod-reoff1offdate if !missing(dod)
          gen offender_exptime_reoff = reoff1freetime + days_post_first_reoff if !missing(reoff1offdate)
          gen violent_reoff = .
          // offended within 24-months free-time window
          replace violent_reoff = 1 if !missing(reoff1offdate) & (reoff1freetime <= 730) & (offender_exptime_reoff>=730) & (reoff1anzsoc > 210 & reoff1anzsoc < 299)
          // offended but outside of 24-months free-time window and have 24-months free-time window to observe
          replace violent_reoff = 0 if !missing(reoff1offdate) & (reoff1freetime >= 730) & (reoff1anzsoc > 210 & reoff1anzsoc < 299)
          // did not offend and have 24-months free-time window to observe
          replace violent_reoff = 0 if missing(reoff1offdate) & (reoff1freetime >= 730)

          gen methuser=0
          replace methuser = 1 if drugtype==1 & index_poanzsoc==1041
          g year = year(indexdate)


          ge x1 = 1 if inrange(age,15,25)
          replace x1 = 0 if !inrange(age,15,25)
          bysort lga year: egen prop15_25 = mean(x1)
          drop x1


          gen x1 = 1 if priorviol==1
          replace x1 = 0 if priorviol==0
          bysort lga year: egen prop_priorviol= mean(x1)
          drop x1


          gen x1 = 1 if priordrug==1
          replace x1 = 0 if priordrug==0
          bysort lga year: egen prop_priordrug=mean(priordrug)

          drop x1






          input long(mspdi indexcaseno indexdate lga) int age long(dob dod) int index_poanzsoc long reoff1offdate int(reoff1anzsoc reoff1freetime) long sex float(priordrug offender_exptime_reoff violent_reoff methuser year prop_priorviol prop_priordrug prop15_25)
          900355248 1001001111 15150 50 36 1832 . 1041 . . 6430 2 0 . 0 0 2001 .2444238 .15520446 .4749071
          900838151 601008409 15320 50 14 10202 . 823 . . 6260 1 0 . 0 0 2001 .2444238 .15520446 .4749071
          213057677 601001346 15217 50 16 9246 . . 15628 1041 411 2 0 6342 . 0 2001 .2444238 .15520446 .4749071
          204115259 201007145 15020 50 18 8172 . 831 15302 831 282 2 0 6539 . 0 2001 .2444238 .15520446 .4749071
          295058257 1001000285 15275 50 33 3148 . . 20510 1434 5235 2 0 6284 . 0 2001 .2444238 .15520446 .4749071
          197003799 201094693 15264 50 20 7728 . 1311 15548 213 284 2 0 6295 1 0 2001 .2444238 .15520446 .4749071
          201082476 201044802 15074 50 26 5253 . 211 . . 6506 2 0 . 0 0 2001 .2444238 .15520446 .4749071
          201059608 201044791 15088 50 31 3553 . 212 19628 213 4540 2 0 6471 0 0 2001 .2444238 .15520446 .4749071
          295058257 201036443 15053 50 32 3148 . 1219 15274 . 221 2 0 6506 . 0 2001 .2444238 .15520446 .4749071
          308002710 201058665 15111 50 35 2294 . 831 15187 829 76 1 1 6448 . 0 2001 .2444238 .15520446 .4749071

          Comment


          • #6
            OK, this makes a lot more sense now! Thanks.

            So each observation is a person. And you want to drop all observations on any lga that has fewer than five meth users in any year. Now what remains unclear to me is which variable identifies a person: is it mspdi or indexcaseno? For the purposes of illustrating code, I will assume it is mspdi. If not, you can just replace all references to mspdi with the name of the correct identifier. I will assume that the same person can occur more than once in the same lga in the same year--that this is basically a dataset of arrests and some people offend multiple times.

            Code:
            //  GET A COUNT OF NUMBER OF METH USERS IN EACH LGA IN EACH YEAR
            by lga year mspdi, sort: egen byte meth_this_year = max(methuser)
            by lga year mspdi: gen byte flag = (_n == 1)
            by lga year: egen n_methusers = total(cond(flag), meth_this_year, 0)
            
            //  NOW DROP THE ENTIRE LGA IF ANY YEAR HAS FEWER THAN 5 METH USERS
            by lga (n_methusers), sort:  drop if n_methusers[1] < 5

            Comment


            • #7
              It works! This is fantastic Clyde. Thank you so much. However, if you'll indulge me I have a couple of other questions that will deepen my understanding of what you've done (1) I understand that 'byte' is a way of storing numbers but what's its use in this context? (2) Why do you seek the maximum number of methusers in each lga/year rather than the minimum in line 1? (3) what is happening in line three? It's opaque to me.

              thanks again, Don

              Comment


              • #8
                (1) It's of no importance. It's a habit I developed back in the 60's when I first started programming computers. In those days, memories were very small and you had to make maximum use of every bit of it. So the practice of forcing things into the smallest amount of storage that could hold them became a habit for me. It is rare for programs in Stata to run up against memory constraints in modern computers, so it isn't really necessary to do this. It's just something that almost comes from my fingers without even giving it much thought.

                (2) That's not seeking the maximum number of methusers. At that point, the variable methuser is not a count of methusers. It's a 0/1 variable showing whether the person was designated as a methuser in that record. By getting the maximum value of that variable, I'm determining whether that person was ever designated as a methuser that year.

                (3) To understand line 3 you first have to understand line 2. Line 2 simply picks out a single observation for each person (mspdi) in a given lga#year combination. Then in line 3, cond(flag, meth_this_year, 0) tells Stata to look at each observation and determine whether or not the variable flag is true. If so, the value of the expression is meth_this_year, and if not, the value of the expression is 0. We then total those up over all observations for the lga#year combination. The result is I'm adding up the number of methuser indications in the lga#year combination, but counting only one of them from each mspdi.

                Comment


                • #9
                  You are a very clever man Clyde. Thanks again for your help. regards, Don

                  Comment

                  Working...
                  X