Table of Counts by Variable

Vish Kavshi

Join Date: Feb 2021

Posts: 7
#1

Table of Counts by Variable

21 Feb 2023, 09:39

Hello, I have a panel dataset of people and various characteristics. I want to generate a table that displays counts for each of these variables and group them by year. It's easy to do for one variable using the 'tab' function, but I cannot seem to find an option for multiple variables. For some given data (ID, Year, Sex, Age, Education), I have:

ID Year Age Sex Education

1 2000 25 1 PhD

1 2002 27 1 PhD

2 2000 20 0 --

2 2001 21 0 --

3 2000 23 1 College

3 2003 24 1 College

And I'd like to generate an output similar to the one below, where I'm counting the frequency of observations by year:

Sex Age Education

2000 3 3 2

2001 1 1 0

2002 1 1 1

2003 1 1 1

Is there a feasible way to do this in Stata? I can't seem to do it with the tab function. Any help would be appreciated!
Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35696

21 Feb 2023, 10:04

Please follow FAQ Advice #12 and use dataex to give a data example.

You seem to want counts of non-missing values. If so, string values of -- are not helpful. This may help:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte d int year byte(age sex) str7 education
1 2000 25 1 "PhD"    
1 2002 27 1 "PhD"    
2 2000 20 0 "--"     
2 2001 21 0 "--"     
3 2000 23 1 "College"
3 2003 24 1 "College"
end

replace education = "" if education == "--"
preserve
gen educ = cond(education == "", ., 1) 
collapse (count) age sex educ, by(year)
list
restore

Results

Code:

 
    +-------------------------+
     | year   age   sex   educ |
     |-------------------------|
  1. | 2000     3     3      2 |
  2. | 2001     1     1      0 |
  3. | 2002     1     1      1 |
  4. | 2003     1     1      1 |
     +-------------------------+

Comment

Vish Kavshi

Join Date: Feb 2021

Posts: 7
#3

21 Feb 2023, 11:18

Thank you, Nick! This worked greatly.

Luckily, my variables are all doubles, so the missing value conversion is not necessary for my dataset. However, my actual variable list has 10+ variables, so I don't suppose there is a more clean way to display these tables in the way you've set it up? So far, the output appears to be an individual table for each year (probably because my actual var list is quite large). Any further suggestions? Thank you!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#4

21 Feb 2023, 12:14

Further suggestions would benefit from a data example, as requested in #2,
Comment

Vish Kavshi

Join Date: Feb 2021
Posts: 7

21 Feb 2023, 14:08

Sure, here is an example with more variables.

Code:

clear
input byte d int year byte(age sex income industry vehicles household_size city jobs circle square) str7 education
1 2000 25 1 70 30 1 2 2 3 4 5 "PhD"    
1 2002 27 1 70 30 1 2 2 3 4 5 "PhD"    
2 2000 20 0 55 12 2 1 1 1 5 2 "--"     
2 2001 21 0 55 12 3 1 1 1 5 2 "--"     
3 2000 23 1 65 21 2 3 1 3 6 3 "College"
3 2003 24 1 65 21 2 4 1 3 6 3 "College"
end

replace education = "" if education == "--"
preserve
gen educ = cond(education == "", ., 1) 
collapse (count) age sex income industry vehicles household_size city jobs circle square educ, by(year)
list
restore

The resulting tables (which I'm not sure how to display here) are not as organized as before. I believe that it has to do with the amount of variables that I'm trying to count.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35696
#6

22 Feb 2023, 03:30

Thanks for your data example in #5 which flatly contradicts the statement in #3 that "my variables are all doubles".

#5 is just a variation on what I guessed in #2, so that is welcome confirmation, but the answer is not going to be much different.

Once more, what to just about any researcher is obviously an indication of missing -- namely "--" -- will not be regarded as missing by Stata, and if you have any other similar conventions they will frustrate your goal.

The only extra trick here is that you can segregate string variables.

The code here assumes that you have no variable work, but any other name not in use would be fine, as would using a temporary variable.

Code:

clear input byte d int year byte(age sex income industry vehicles household_size city jobs circle square) str7 education 1 2000 25 1 70 30 1 2 2 3 4 5 "PhD" 1 2002 27 1 70 30 1 2 2 3 4 5 "PhD" 2 2000 20 0 55 12 2 1 1 1 5 2 "--" 2 2001 21 0 55 12 3 1 1 1 5 2 "--" 3 2000 23 1 65 21 2 3 1 3 6 3 "College" 3 2003 24 1 65 21 2 4 1 3 6 3 "College" end preserve ds, has(type string) foreach v in `r(varlist)' { replace `v' = "" if `v' == "--" gen work = cond(missing(`v'), ., 1) drop `v' rename work `v' } ds year, not collapse (count) `r(varlist)', by(year) list restore
Comment
Vish Kavshi

Join Date: Feb 2021

Posts: 7
#7

22 Feb 2023, 19:57

Thank you Nick! Yes, my example does contradict what you said before, as this is just a toy example with multiple variables. This should be helpful in other scenarios.

It appears that the output tables are similar to what I've generated before; they are separate tables instead of one large organized table. I suppose that is fine if Stata cannot aggregate the output tables into a single large table.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#8

23 Feb 2023, 00:55

You showed what you wanted in =1.In principle and in practice, that will be so many rows and so many columns.

If you really or also want something else, you also need to specify what that is. For example, you can reshape the dataset produced by collapse to another layout

Much of the difficulty in coding here arises from needing to fix a non-standard method of holding missing values.

Last edited by Nick Cox; 23 Feb 2023, 01:42.
Comment

ID	Year	Age	Sex	Education
1	2000	25	1	PhD
1	2002	27	1	PhD
2	2000	20	0	--
2	2001	21	0	--
3	2000	23	1	College
3	2003	24	1	College

Announcement

Table of Counts by Variable

Comment

Comment

Comment

Comment

Comment

Comment

Comment