Collapse panel data

Alex Carr

Join Date: May 2017

Posts: 38
#1

Collapse panel data

16 May 2017, 15:05

Hi, i have a panel data with 3 years.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str27 revenue str44 schooldistrict long year "100" "school101" 2009 "200" "school101" 2010 "150" "school101" 2011 "110" "school102" 2009 "120" "school102" 2010 "250" "school102" 2011 end

When I collapse the data by school districts:

Code:

collapse(sum) Revenue, by(SchoolDistrict)

it ignores the year variable and makes it a one-year data. Is there anyone I can keep the year variable so that I can still have a 3-year panel data?
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30127
#2

16 May 2017, 15:52

I don't get what you're asking for. If you keep the year variables, then you will have a separate record for each year, but all of the information for a given school district will be the same in all years. Why would you want that? It doesn't make any sense to me.
Comment
Alex Carr

Join Date: May 2017

Posts: 38
#3

16 May 2017, 16:31

Originally posted by Clyde Schechter View Post

I don't get what you're asking for. If you keep the year variables, then you will have a separate record for each year, but all of the information for a given school district will be the same in all years. Why would you want that? It doesn't make any sense to me.

I'm sorry I missed coded the example. The data looks like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str27 revenue str44 schooldistrict long year "100" "school101" 2009 "110" "school101" 2009 "200" "school101" 2010 "150" "school101" 2010 "210" "school101" 2011 "230" "school101" 2011 end

So i want my data looks like this after collapsing:

Code:

"210" "school101" 2009 "350" "school101" 2010 "440" "school101" 2011

basically I want to sum the revenue by school district within the same year, instead of summing across all years. Thank you.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

16 May 2017, 17:43

Perhaps just adding Year to the by option is what you need.

Code:

. input long Revenue str16 SchoolDistrict long Year

          Revenue    SchoolDistrict          Year
  1. 100 "school101" 2009
  2. 110 "school101" 2009
  3. 200 "school101" 2010
  4. 150 "school101" 2010
  5. 210 "school101" 2011
  6. 230 "school101" 2011 
  7. end

. collapse(sum) Revenue, by(SchoolDistrict Year)

. list, clean

       SchoolD~t   Year   Revenue  
  1.   school101   2009       210  
  2.   school101   2010       350  
  3.   school101   2011       440  

.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35736
#5

16 May 2017, 17:53

One implication of William's excellent advice is to destring Revenue before you try to do additions!
1 like
Comment

Chinmay Korgaonkar

Join Date: Dec 2022
Posts: 34

18 Nov 2023, 14:08

Hello everyone.

I have the panel data about property tax rates organized by the school districts in NY state for 15 years.

Sample:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int Roll_Year long SchoolCode str30 SchoolName str23 Municipality str13 County long County_Tax_Levy double(County_Tax_Rate_Outside_Village County_Tax_Rate_Inside_Village)
2021  10100 "Albany"                  "Albany"       "Albany" 19199755 3.48 0
2021  10300 "Cohoes"                  "Cohoes"       "Albany"  3541490 3.48 0
2021  11800 "Watervliet"              "Watervliet"   "Albany"  1556372 3.48 0
2021  12001 "Berne-Knox-Westerlo"     "Berne"        "Albany"  1095964 3.48 0
2021  13403 "Voorheesville"           "Berne"        "Albany"  1095964 3.48 0
2021 193201 "Greenville"              "Berne"        "Albany"  1095964 3.48 0
2021 433801 "Middleburgh"             "Berne"        "Albany"  1095964 3.48 0
2021  12206 "Bethlehem"               "Bethlehem"    "Albany" 14328013 3.48 0
2021  12402 "Ravena-Coeymans-Selkirk" "Bethlehem"    "Albany" 14328013 3.48 0
2021  13002 "Guilderland"             "Bethlehem"    "Albany" 14328013 3.48 0
2021  12402 "Ravena-Coeymans-Selkirk" "Coeymans"     "Albany"  2059921 3.48 0
2021 193201 "Greenville"              "Coeymans"     "Albany"  2059921 3.48 0
2021  11800 "Watervliet"              "Colonie"      "Albany" 36046478 3.48 0
2021  12601 "South Colonie"           "Colonie"      "Albany" 36046478 3.48 0
2021  12605 "North Colonie"           "Colonie"      "Albany" 36046478 3.48 0
2021  12615 "Menands"                 "Colonie"      "Albany" 36046478 3.48 0
2021 422401 "Niskayuna"               "Colonie"      "Albany" 36046478 3.48 0
2021 422803 "Rotterdam-Mohonasen"     "Colonie"      "Albany" 36046478 3.48 0
2021  12801 "Green Island"            "Green Island" "Albany"   727901 3.48 0
2021  12601 "South Colonie"           "Guilderland"  "Albany" 15046953 3.48 0
end

I want to collapse the data by county and municipality separately.

I tried the following code for the county, but it does not work for the County Tax Levy.

Code:

collapse (sum) County_Tax_Levy (mean) County_Tax_Rate_Outside_Village (mean) County_Tax_Rate_Inside_Village (mean), by(County Roll_Year)

I think the reason is that, as seen from the data, the County Tax Levy figures are the same for school districts within a municipality in a state.

Thanks in advance!

Last edited by Chinmay Korgaonkar; 18 Nov 2023, 14:55.

Comment

Chinmay Korgaonkar

Join Date: Dec 2022

Posts: 34
#7

18 Nov 2023, 21:54

Hello everyone,

The following code worked in my case thanks to https://www.stata.com/support/faqs/d...-observations/

Code:

sort Roll_Year County Municipality County_Tax_Levy quietly by Roll_Year County Municipality County_Tax_Levy: gen dup = cond(_N==1,0,_n) drop if dup>1 collapse (sum) County_Tax_Levy (mean) County_Tax_Rate_Outside_Village (mean) County_Tax_Rate_Inside_Village, by(County Roll_Year)

I realized that I had to delete duplicate entries. I could not sort only by "County Municipality" as some municipality names were repeated with a county.

Thanks.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30127
#8

19 Nov 2023, 10:15

What you show in #7 may work, but it is dangerous and it may create the appearance of working while producing incorrect and irreproducible results.

The problem is this: if the observations that are duplicates on Roll_Year, County, Municipality, and County_Tax_Levy contain different information for the variables County_Tax_Rate_Outside_Village and County_Tax_Rate_Inside_Village, then your -sort- does not uniquely identify observations. When you do a -sort- that does not uniquely identify observations, the order of the observations on variables not in the sort key is randomized and irreproducible. In other words, your code will select one arbitrary observation at random from among these not-quite-duplicate observations. Consequently the values you get at the end for the means of County_Tax_Rate_Outside_Village and County_Tax_Rate_Inside_Village will be wrong based on incomplete, and irreproducibly created, data.

Now, it may be that this situation doesn't actually arise in your data. But before you do something like this, you should verify that this is the case. The following code will, if your data is OK, get you the same results. If your data is not OK, it will break with an error message after the -assert- command and you will know that this approach is not viable.

Code:

keep Roll_Year County Municipality County_Tax_Levy /// County_Tax_Rate_Outside_Village County_Tax_Rate_Inside_Village /// County Roll_Year duplicates drop by Roll_Year County Municipality County_Tax_Levy, sort: assert _N == 1 collapse (sum) County_Tax_Levy (mean) County_Tax_Rate_Outside_Village /// (mean) County_Tax_Rate_Inside_Village, by(County Roll_Year)

Finally, the question naturally arises why you have these duplicate observations on the variables Roll_Year, County, Municipality, and County_Tax_Level that were interfering with your original computation. If they are complete duplicates (i.e. they agree on all variables in the data set), then why are they there in the first place? Often this situation represents an error in the data management that created the data set. So you should review the data management that created this data set--the optimal solution here is to fix the errors that led to this situation. And sometimes you will find other errors as well, and you can correct those. If they are not complete duplicates, then there is less concern about the preceding dta management, but then working with these data requires the kind of caution that I am urging on you in this post.
Comment
Chinmay Korgaonkar

Join Date: Dec 2022

Posts: 34
#9

02 Dec 2023, 18:52

Thanks a lot. My code indeed produced inconsistent results when we did random checks. Your code worked perfectly.
The duplicate entries were present in the raw data downloaded from the state website, as the County Tax Levy figures were the same for all school districts within a municipality in a state.
Thank you once again.
Comment

Announcement