Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Deleting observations from a panel data according to a criteria

    Hi,

    I am quite new to STATA (shame on me) and am trying to figure out how to get a desired dataset from a big longitudinal dataset. I have a balanced dataset with individuals (id) over 9 years (2006-2014). But I am not interested in all of them. I only want to have a dataset of individuals who lived in two groups of districts in the first 5 years (2006-2010). So let's say one set contains Treatment={a, b, c, d} and the other set contains Control={e, f, g, h}.

    So for each id in the dataset, check if
    -for all t from 2006 to 2010 if id(district) is in Treatment then keep
    OR
    -for all t from 2006 to 2010 if id(district) is in Control then keep
    IF NEITHER HOLDS
    then delete the id and all observations for all variables for all the years in the dataset (i.e. 2006-2004).

    I know in my head what I want STATA to do, but I cannot write the code for it.. I am sorry if it is a too primitive question. I hope I'll improve over time and come with better questions.

    Best,
    Merve

  • #2
    Being new to Stata is fine, but please give a data example. https://www.statalist.org/forums/help#stata explains. See also #18 in that document.

    Comment


    • #3
      Yes of course. I will be using the HILDA dataset (Australia), which is a longitudinal database. I have already made the dataset to be balanced. Individuals are identified with their "xwaveid" (e.g. 010003, 010014, ..) and I have observations for each one of these individuals for each "wave" which is the year and takes on values from 2006 - 2014. There are many variables that I will take into consideration (around 100). One of them is the geographical location code "hhssa1". This is a number and there are in total 57,523 many different ssa1's.

      I introduce my data as follows:
      xtset xwaveid wave, yearly

      What I want to do now is to:
      keep all "xwaveid"s which satisfy for wave=2006 & 2007 & 2008 & 2009 & 2010 the condition that hhssa1 is an element of Treatment={31000, 33960, 34580, 36580}
      AND
      keep all "xwaveid"s which satisfy for wave=2006 & 2007 & 2008 & 2009 & 2010 the condition that hhssa1 is an element of Control={50210, 50350, 50420, 50490}
      AND delete all 'xwaveid" rows which do not satisfy these two conditions.

      What this means is that, I only want to keep individuals who lived in either a control district or a treatment district and delete all the rest so that I can perform a difference in difference model later on. Is that a clear enough data example?

      P.S. Thank you for guiding me to #18 too.

      Comment


      • #4
        A data example means an example of your data presented like this. (A schematic explanation is helpful, but -- unless exceptionally someone knows your data source -- we still have to invent data for us to be able to experiment easily and show code that will work.)

        Code:
         
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float(company year invest mvalue kstock time)
        1 1935  317.6 3078.5    2.8  1
        1 1936  391.8 4661.7   52.6  2
        1 1937  410.6 5387.1  156.9  3
        1 1938  257.7 2792.2  209.2  4
        1 1939  330.8 4313.2  203.4  5
        1 1940  461.2 4643.9  207.2  6
        1 1941    512 4551.2  255.2  7
        1 1942    448 3244.1  303.7  8
        1 1943  499.6 4053.7  264.1  9
        1 1944  547.5 4379.3  201.6 10
        1 1945  561.2 4840.9    265 11
        1 1946  688.1 4900.9  402.2 12
        1 1947  568.9 3526.5  761.5 13
        1 1948  529.2 3254.7  922.4 14
        1 1949  555.1 3700.2 1020.1 15
        1 1950  642.9 3755.6   1099 16
        1 1951  755.9   4833 1207.7 17
        1 1952  891.2 4924.9 1430.5 18
        1 1953 1304.4 6241.7 1777.3 19
        1 1954 1486.7 5593.6 2226.3 20
        2 1935  209.9 1362.4   53.8  1
        2 1936  355.3 1807.1   50.5  2
        2 1937  469.9 2676.3  118.1  3
        2 1938  262.3 1801.9  260.2  4
        2 1939  230.4 1957.3  312.7  5
        2 1940  361.6 2202.9  254.2  6
        2 1941  472.8 2380.5  261.4  7
        2 1942  445.6 2168.6  298.7  8
        2 1943  361.6 1985.1  301.8  9
        2 1944  288.2 1813.9  279.1 10
        2 1945  258.7 1850.2  213.8 11
        2 1946  420.3 2067.7  132.6 12
        2 1947  420.5 1796.7  264.8 13
        2 1948  494.5 1625.8  306.9 14
        2 1949  405.1   1667  351.1 15
        2 1950  418.8 1677.4  357.8 16
        2 1951  588.2 2289.5  342.1 17
        2 1952  645.5 2159.4  444.2 18
        2 1953    641 2031.3  623.6 19
        2 1954  459.3 2115.5  669.7 20
        3 1935   33.1 1170.6   97.8  1
        3 1936     45 2015.8  104.4  2
        3 1937   77.2 2803.3    118  3
        3 1938   44.6 2039.7  156.2  4
        3 1939   48.1 2256.2  172.6  5
        3 1940   74.4 2132.2  186.6  6
        3 1941    113 1834.1  220.9  7
        3 1942   91.9   1588  287.8  8
        3 1943   61.3 1749.4  319.9  9
        3 1944   56.8 1687.2  321.3 10
        3 1945   93.6 2007.7  319.6 11
        3 1946  159.9 2208.3    346 12
        3 1947  147.2 1656.7  456.4 13
        3 1948  146.3 1604.4  543.4 14
        3 1949   98.3 1431.8  618.3 15
        3 1950   93.5 1610.5  647.4 16
        3 1951  135.2 1819.4  671.3 17
        3 1952  157.3 2079.7  726.1 18
        3 1953  179.5 2371.6  800.3 19
        3 1954  189.6 2759.9  888.9 20
        4 1935  40.29  417.5   10.5  1
        4 1936  72.76  837.8   10.2  2
        4 1937  66.26  883.9   34.7  3
        4 1938   51.6  437.9   51.8  4
        4 1939  52.41  679.7   64.3  5
        4 1940  69.41  727.8   67.1  6
        4 1941  68.35  643.6   75.2  7
        4 1942   46.8  410.9   71.4  8
        4 1943   47.4  588.4   67.1  9
        4 1944  59.57  698.4   60.5 10
        4 1945  88.78  846.4   54.6 11
        4 1946  74.12  893.8   84.8 12
        4 1947  62.68    579   96.8 13
        4 1948  89.36  694.6  110.2 14
        4 1949  78.98  590.3  147.4 15
        4 1950 100.66  693.5  163.2 16
        4 1951 160.62    809  203.5 17
        4 1952    145    727  290.6 18
        4 1953 174.93 1001.5  346.1 19
        4 1954 172.49  703.2  414.9 20
        5 1935  39.68  157.7  183.2  1
        5 1936  50.73  167.9    204  2
        5 1937  74.24  192.9    236  3
        5 1938  53.51  156.7  291.7  4
        5 1939  42.65  191.4  323.1  5
        5 1940  46.48  185.5    344  6
        5 1941   61.4  199.6  367.7  7
        5 1942  39.67  189.5  407.2  8
        5 1943  62.24  151.2  426.6  9
        5 1944  52.32  187.7    470 10
        5 1945  63.21  214.7  499.2 11
        5 1946  59.37  232.9  534.6 12
        5 1947  58.02    249  566.6 13
        5 1948  70.34  224.5  595.3 14
        5 1949  67.42  237.3  631.4 15
        5 1950  55.74  240.1  662.3 16
        5 1951   80.3  327.3  683.9 17
        5 1952   85.4  359.4  729.3 18
        5 1953   91.9  398.4  774.3 19
        5 1954  81.43  365.7  804.9 20
        end
        format %ty year
        I got that output in a form fit for copy and paste by doing this

        Code:
        webuse grunfeld 
        dataex
        If you try that and dataex triggers an error message, then please follow the link in #2 for explanation. Problems with sensitivity, security, privacy or confidentiality of data are discussed at that place too.

        Comment


        • #5
          Welcome to Statalist, Merve.

          Without data, like Nick I'm reluctant to supply untested code.

          Here's an essay outlining the approach I would take.
          1. Create a variable I'll call inwave which is 1 if the observation is in one of the five waves you care about, and 0 otherwise.
          2. Create a variable I'll call treatment which is 1 if the observation is in one of the four hssa1 values that represent treatment.
          3. Create a variable I'll call control which is 1 if the observation is one of the four hssa1 values that represent control.
          4. Create a variable I'll call keep_t if the observation has inwave==1 and treatment==1
          5. Create a variable I'll call keep_c if the observation has inwave==1 and control==1
          Once you've done that, you want to look at the totals of keep_t and keep_c for each xwaveid and keep those xwaveid's for which one of those totals is 5.
          Code:
          bysort xwaveid (wave): egen tot_keep_t = total(keep_t)
          will create a variable I'll call tot_keep_t which will have, in every observation for a given xwaveid, the number of observations for which keep_t is 1.

          You create a variable I'll call tot_keep_c similarly, which will have, in every observation for a given xwaveid, the number of observations for which keep_c is 1.

          Then
          Code:
          keep if tot_keep_t==5 | tot_keep_c==5
          will, I think, do what I understand you to want.

          Perhaps this is enough to get you started in the right direction in the absence of example data on which to build more complete code.

          But you can see, an essay about code is nowhere near as helpful as actual code, takes a lot more effort to produce, is subject to misunderstanding in a way that code is not, and is possibly misleading if the data is not as I imagine it. Even the best descriptions of data are no substitute for an actual example of the data.

          If you need further guidance on this, please use the dataex command to show example data. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run ssc install dataex to get it. Either way, run help dataex and read the simple instructions for using it. dataex will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

          When asking for help with code, always show example data. When showing example data, always use dataex.

          The more you help others understand your problem, the more likely others are to be able to help you solve your problem.

          Comment


          • #6
            Hi Merve Kucuk, and welcome to Statalist!

            So, as both Nick & William mentioned, it is a lot easier to help if you can post a small snippet of your data using Stata's dataex command. If you need help with that, I created a YouTube video to help walk people through the process.

            See Statalist - Using CODE delimiters and Stata's dataex command (I would watch at speed 1.5x or 2x ).
            Part on using code delimiters on Statalist (1:30 to 3:00)
            Part on using -dataex- (3:00 to 5:30, I keep going on, but you can stop there).

            To do the condition that "hhssa1 is an element of Treatment={31000, 33960, 34580, 36580}" you will probably use Stata's inlist command:

            Code:
            * I created some toy data to illustrate
            * dataex id year hhssa1
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input byte id int year long hhssa1
            1 2006 31000
            1 2007 31000
            1 2008 31000
            1 2009 31000
            1 2010 31000
            1 2011 31000
            1 2012 31000
            2 2006 31500
            2 2008 31500
            2 2009 31500
            2 2010 31500
            2 2011 31500
            2 2012 31500
            3 2006 50210
            3 2009 50210
            3 2010 50210
            4 2006 50500
            4 2007 50500
            4 2008 50500
            4 2009 50500
            4 2010 50500
            5 2006 50350
            5 2007 50350
            5 2008 50350
            5 2009 50350
            5 2010 50350
            end

            Code:
            * This is a variation on William's code in post #5
            bysort id (year): gen count = _N  // total number of years person is in data
            bysort id (year): egen year_count = count(year) if year>=2006 & year<=2010
            
            gen treat_dist = 0
            replace treat_dist = 1 if inlist(hhssa1, 31000, 33960, 34580, 36580)
            * This could be combined in single command as "gen treat_district = inlist(hhssa1, 31000, 33960, 34580, 36580)"
            
            gen control_dist = inlist(hhssa1, 50210, 50350, 50420, 50490)
            
            gen keep_years = (year_count >=5 & year_count!=.)   // 1 if in data all years 2006-2010,  0 otherwise
            gen keep_dist = (treat_dist ==1 | control_dist ==1)  // 1 if in either the treatment or control districts
            gen to_keep = (keep_years==1 & keep_dist==1)
            
            
            * This is what data looks like at end
            list, sepby(id) abbrev(12)
            
                 +--------------------------------------------------------------------------------------------------------+
                 | id   year   hhssa1   count   year_count   treat_dist   control_dist   keep_years   keep_dist   to_keep |
                 |--------------------------------------------------------------------------------------------------------|
              1. |  1   2006    31000       7            5            1              0            1           1         1 |
              2. |  1   2007    31000       7            5            1              0            1           1         1 |
              3. |  1   2008    31000       7            5            1              0            1           1         1 |
              4. |  1   2009    31000       7            5            1              0            1           1         1 |
              5. |  1   2010    31000       7            5            1              0            1           1         1 |
              6. |  1   2011    31000       7            .            1              0            0           1         0 |
              7. |  1   2012    31000       7            .            1              0            0           1         0 |
                 |--------------------------------------------------------------------------------------------------------|
              8. |  2   2006    31500       6            4            0              0            0           0         0 |
              9. |  2   2008    31500       6            4            0              0            0           0         0 |
             10. |  2   2009    31500       6            4            0              0            0           0         0 |
             11. |  2   2010    31500       6            4            0              0            0           0         0 |
             12. |  2   2011    31500       6            .            0              0            0           0         0 |
             13. |  2   2012    31500       6            .            0              0            0           0         0 |
                 |--------------------------------------------------------------------------------------------------------|
             14. |  3   2006    50210       3            3            0              1            0           1         0 |
             15. |  3   2009    50210       3            3            0              1            0           1         0 |
             16. |  3   2010    50210       3            3            0              1            0           1         0 |
                 |--------------------------------------------------------------------------------------------------------|
             17. |  4   2006    50500       5            5            0              0            1           0         0 |
             18. |  4   2007    50500       5            5            0              0            1           0         0 |
             19. |  4   2008    50500       5            5            0              0            1           0         0 |
             20. |  4   2009    50500       5            5            0              0            1           0         0 |
             21. |  4   2010    50500       5            5            0              0            1           0         0 |
                 |--------------------------------------------------------------------------------------------------------|
             22. |  5   2006    50350       5            5            0              1            1           1         1 |
             23. |  5   2007    50350       5            5            0              1            1           1         1 |
             24. |  5   2008    50350       5            5            0              1            1           1         1 |
             25. |  5   2009    50350       5            5            0              1            1           1         1 |
             26. |  5   2010    50350       5            5            0              1            1           1         1 |
                 +--------------------------------------------------------------------------------------------------------+
            At the end a couple of questions remain:
            1) If individual #1 and #5 (the two that meet all the sample conditions), do you want to keep years 2011-2014 if it is in the data, or can those years be safely deleted?
            2) What happens if id #1 is in district 31000 for the first three years, and then moves to a non-included district in 2009? In other words, does the person have to be in the same district all 5 yrs to be included in your analysis?

            Comment


            • #7
              Hi Nick, William and David,

              Thank you very much for the information and answers. I should have done my background research better before posting. Following the steps by William (thank you very much), I managed to write the following code after some trials and errors. It looks so far like it works.

              Code:
              // Make the string "xwaveid" to a numeric variable starting from 1 on.
              // If we would have added ", label" next to the below command we would have gotten the same names for the variables as in xwaveid.
              egen newid = group(xwaveid), label
              // Introducing the panel data
              xtset newid wave, yearly
              // Now we will make sure that we are only left with individuals from either our treatment or control group.
              // Create a variable I'll call inwave (I chose predisaster) which is 1 if the observation is in one of the five waves you care about, and 0 otherwise.
              gen predisaster=0
              replace predisaster=1 if wave<2011
              // Create a variable I'll call treatment (I chose treatgroup) which is 1 if the observation is in one of the four hssa1 values that represent treatment.
              gen treatgroup=0
              replace treatgroup=1 if (hhslga==31000 | hhslga==33960 | hhslga==34580 | hhslga==36580)
              // Create a variable I'll call control (I chose contgroup) which is 1 if the observation is one of the four hssa1 values that represent control.
              gen controlgroup=0
              replace controlgroup=1 if (hhslga==50210 | hhslga==50350 | hhslga==50420 | hhslga==50490 | hhslga==51330 | hhslga==51750 | hhslga==53150 | hhslga==53430 | hhslga==53780 | hhslga==54200 | hhslga==55320 | hhslga==55740 | hhslga==56090 | hhslga==56580 | hhslga==56930 | hhslga==57080 | hhslga==57840 | hhslga==57980 | hhslga==58050 | hhslga==58330 | hhslga==58510 | hhslga==58570 | hhslga==58680 | hhslga==59370)
              // Create a variable I'll call keep_t if the observation has inwave==1 and treatment==1
              gen keep_t=0
              replace keep_t=1 if (predisaster==1 & treatgroup==1)
              // Create a variable I'll call keep_c if the observation has inwave==1 and control==1
              gen keep_c=0
              replace keep_c=1 if (predisaster==1 & controlgroup==1)
              // Now we want to look at the totals of keep_t and keep_c for each xwaveid and keep those xwaveid's for which one of those totals is 5.
              // Create a variable I'll call tot_keep_t which will have, in every observation for a given xwaveid, the number of observations for which keep_t is 1.
              bysort newid (wave): egen tot_keep_t = total(keep_t)
              // Similarly, create a variable I'll call tot_keep_c similarly, which will have, in every observation for a given xwaveid, the number of observations for which keep_c is 1.
              bysort newid (wave): egen tot_keep_c = total(keep_c)
              // Keeping only the ones who were in the treatment or control area in the entire predisaster period, which is 2006-2010.
              keep if tot_keep_t==5 | tot_keep_c==5
              I really loose a lot of observations (from 55000 down to 4000) but I think it is due to my restrictions probably, because the code looked like it worked. I am very sorry that it took me do long with dataex. Thank you for the link. Please find below the data example.

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input float(newid wave) long hhslga
                6 2006 55320
                6 2007 55320
                6 2008 55320
                6 2009 55320
                6 2010 55320
                6 2011 55320
                6 2012 55320
                6 2013 55320
                6 2014 55320
                7 2006 55320
                7 2007 55320
                7 2008 55320
                7 2009 55320
                7 2010 55320
                7 2011 55320
                7 2012 55320
                7 2013 55320
                7 2014 55320
               35 2006 58510
               35 2007 58510
               35 2008 58510
               35 2009 58510
               35 2010 58510
               35 2011 58510
               35 2012 58510
               35 2013 58510
               35 2014 58510
               39 2006 31000
               39 2007 31000
               39 2008 31000
               39 2009 31000
               39 2010 31000
               39 2011 33960
               39 2012 33960
               39 2013 33960
               39 2014 33960
               40 2006 58510
               40 2007 58510
               40 2008 58510
               40 2009 58510
               40 2010 58510
               40 2011 58510
               40 2012 58510
               40 2013 58510
               40 2014 58510
               48 2006 31000
               48 2007 31000
               48 2008 31000
               48 2009 31000
               48 2010 31000
               48 2011 31000
               48 2012 31000
               48 2013 31000
               48 2014 31000
              Thank you very much for your patience with me. I will certainly give my best to improve over time.
              Best.

              Comment


              • #8
                Congratulations! It does indeed look like you have succeeded, your code looks as I would expect it to look, given my understanding of your data and your problem. Note that I assumed you wanted to keep all the observations for the selected individuals, while David assumed you only wanted the observations from what you call the pre-disaster years. I point this out only to emphasize the many ways explanations can go wrong in the absence of data to demonstrate it on, and to use to explain the desired results with concrete examples. As I'm fond of posting, the more you help others understand your problem, the more likely others are to be able to help you solve your problem.

                Let me share a few Stata tips for your learning pleasure.

                Although xwaveid is a string variable, it contains seven digits, including leading zeroes, and there is no need for your panel identifier to be a number starting at one. So
                Code:
                destring xwaveid, generate(newid)
                format newid %07.0f
                would be adequate to create a numeric variable (it will be created as type long) suitable for use as the panel identifier, with the side benefit that when you look at the data, it is immediately obvious what the original xwaveid is (the format will cause newid to be displayed as a seven-digit number with leading zeroes). Actually, once you're certain you are done merging all your data using the string version of xwaveid common across your input datasets, you can equally well do
                Code:
                destring xwaveid, replace
                format xwaveid %07.0f
                and use the now-numeric xwaveid as your panel identifier.

                Logical expressions yield a value of 0 or 1, so constructs like
                Code:
                gen predisaster=0
                replace predisaster=1 if wave<2011
                can be expressed as
                Code:
                gen predisaster = wave<2011
                Similarly, the inlist() function allows you to shorten
                Code:
                gen treatgroup=0
                replace treatgroup=1 if (hhslga==31000 | hhslga==33960 | hhslga==34580 | hhslga==36580)
                to
                Code:
                gen treatgroup = inlist(hhslga,31000,33960,34580,36580)
                You might find the command
                Code:
                xtdescribe
                useful in confirming that your data meets your expectations. See the output of help xtdescribe for details on its use.

                Comment


                • #9
                  Hi William,

                  I am very thankful for your additional suggestions. I have implemented them in my code. Particularly the inlist() command saved me a lot of copy-pasting. Thank you very very much. And you are very right about me wanting to keep all the observations for the remaining individuals. I have a long way to go and learn, but I thank you very much for your guidance and hope to have more sophisticated questions in the not so distant future. Thank you..

                  (Merry Christmas)

                  Comment


                  • #10

                    I wonder if there is any difference between "bysort id (year): gen count = _N" and "bysort id: gen count=_N". There does not seem to be any difference. Thank you.

                    Comment


                    • #11
                      #10 No difference in terms of result, as counting observations does not depend on their order. The difference may lie in a different sort order once the command has finished.

                      #10 has nothing (obviously) to do with the title of the thread!

                      Comment

                      Working...
                      X