Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating composite counts of variables in panel data

    Hello,

    I have the following dataset of certain incidents occurring in districts across India. The data is in the following format:
    district count_totalincidents startingyear titleofincident (string)
    AP 8 1995 abc
    AP 8 1998 def
    AP 8 2000 ghi
    AP 8 2002 jkl
    AP 8 2004 mno
    AP 8 2007 pqr
    AP 8 2008 stu
    AP 8 2009 vwx

    I want to convert this data into a panel dataset, with a new variable year that will have values from 2001-10. I also want three new variables that measures the count of the incidents based on their starting year. If you look below it is clear the count_before2001 and count_after2010 will 0 since those years are not being included in the panel. The variable count_before2001 has 3 because in the table above, there are 3 incidents occurring before 2001.

    What I am facing a problem in doing is generating the count_between200110 variable. I want this variable to dynamically change based on the starting year variable. As you can see below, this variable takes on the value of 1 when the year becomes 2002, (there is an incident in 2002 in the table above), increases to 2 in 2004 (another incident in 2004), and so on.

    year district count_totalincidents count_before2001 count_between200110 count_after 2010
    2001 AP 8 3 0 0
    2002 AP 8 3 1 0
    2003 AP 8 3 1 0
    2004 AP 8 3 2 0
    2005 AP 8 3 2 0
    2006 AP 8 3 2 0
    2007 AP 8 3 3 0
    2008 AP 8 3 4 0
    2009 AP 8 3 5 0

    I need to replicate this for all the districts in my dataset.
    Hope this information is sufficient, if not, I will provide further details on asking.

    Thanking the community in advance.

  • #2
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str3 district byte count_totalincidents int startingyear str3 titleofincidentstring
    "AP " 8 1995 "abc"
    "AP " 8 1998 "def"
    "AP " 8 2000 "ghi"
    "AP " 8 2002 "jkl"
    "AP " 8 2004 "mno"
    "AP " 8 2007 "pqr"
    "AP " 8 2008 "stu"
    "AP " 8 2009 "vwx"
    end
    
    encode district, gen(n_district)
    tsset n_district startingyear
    tsfill
    by n_district (district), sort: replace district = district[_N]
    by district (count_totalincidents), sort: replace count_totalincidents = ///
        count_totalincidents[1]
        
    isid district startingyear, sort
    by district (startingyear): egen count_before2001 = ///
        total(cond(startingyear < 2001, !missing(titleofincidentstring), .))
    by district (startingyear): egen count_after_2010 = ///
        total(cond(startingyear > 2010, !missing(titleofincidentstring), .))
    by district (startingyear): gen count_between200110 = ///
        sum(cond(inrange(startingyear, 2001, 2010), !missing(titleofincidentstring), .)), ///
        before(count_after_2010)
    keep if inrange(startingyear, 2001, 2009)
    In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.



    Comment


    • #3
      Much thanks, Professor Schechter. I will keep in mind your advice about using -dataex-

      On running the code, I'm getting the following error: "repeated time values within panel".

      This maybe due to repeated values of the variable 'startingyear'. I'm putting a snippet of the data that has repeated values of the startingyear (1995 and 2004 are repeated twice).

      clear
      input str3 district byte count_totalincidents int startingyear str3 titleofincidentstring
      "MH" 8 1995 "abc"
      "MH" 8 1995 "def"
      "MH" 8 2000 "ghi"
      "MH" 8 2002 "jkl"
      "MH" 8 2004 "mno"
      "MH" 8 2004 "pqr"
      "MH" 8 2008 "stu"
      "MH" 8 2009 "vwx"
      end

      What can be the workaround in this case?

      Thanks,
      Dibyajyoti

      Comment


      • #4
        OK. That requires a few changes to the code. They are sufficiently numerous that rather than highlighting them in the original, I'm just posting the entire revision--just discard what you were given in #2.

        Code:
        clear
        input str3 district byte count_totalincidents int startingyear str3 titleofincidentstring
        "MH" 8 1995 "abc"
        "MH" 8 1995 "def"
        "MH" 8 2000 "ghi"
        "MH" 8 2002 "jkl"
        "MH" 8 2004 "mno"
        "MH" 8 2004 "pqr"
        "MH" 8 2008 "stu"
        "MH" 8 2009 "vwx"
        end
        
        encode district, gen(district_num)
        gen n_incidents = !missing(titleofincidentstring)
        collapse (sum) n_incidents (first) district_num count_totalincidents, ///
            by(district startingyear)
        
        tsset district_num startingyear
        tsfill
        by district_num (district), sort: replace district = district[_N]
        by district (count_totalincidents), sort: ///
            replace count_totalincidents = count_totalincidents[1]
        replace n_incidents = 0 if missing(n_incidents)    
        isid district startingyear, sort
        by district (startingyear): egen count_before2001 = ///
            total(cond(startingyear < 2001, n_incidents, .))
        by district (startingyear): egen count_after_2010 = ///
            total(cond(startingyear > 2010, n_incidents, .))
        by district (startingyear): gen count_between200110 = ///
            sum(cond(inrange(startingyear, 2001, 2010), n_incidents, .)), ///
            before(count_after_2010)
        keep if inrange(startingyear, 2001, 2009)

        Comment


        • #5
          Thank you so much for your help, Professor! This code worked like a treat.

          Comment

          Working...
          X