Frequency distribution for multiple variables (with same categories) conditioned by another variable

Ale Dima

Join Date: May 2022
Posts: 1

Frequency distribution for multiple variables (with same categories) conditioned by another variable

24 May 2022, 13:03

Hi all,
I apologize if this question is already on the forum, but I could not find it. I also apologize for my not perfect English.
I would need to find a way to get a frequency table of several variables (actually the "same variable" spread over several columns), conditioning the result to a category of an additional variable.
Here a dataset example, in wh:

City	Var1	Var2	Var3
New York	Eggs	Bread
London	Bread	Eggs	Milk
New York	Eggs	Milk
New York	Milk	Eggs
London	Bread	Milk	Egg
New York	Milk	Eggs
Paris	Bread	Milk	Eggs
London	Milk	Bread
Mumbai	Eggs	Milk
Mumbai	Eggs	Bread

My "final result" should be something like this:

	New York	New York%	London	London%	Paris	Paris%	Mumbai	Mumbai%
Eggs	4	50.0	2	25.0	1	33.3	2	50.0
Bread	1	12.5	3	37.5	1	33.3	1	25.0
Milk	3	25.0	3	37.5	1	33.3	1	25.0

If it is possible, I would like to get the result without creating an additional variable.
Thank you in advance!

Tags: frequency distribution, multiple variables, question

Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#2

24 May 2022, 14:04

I have no idea what you mean when you say "without creating an additional variable" since none of the variables you ask for in your result exist in the original data, so there is no way to avoid creating new variables for this.

Next, the exact layout you ask for in the final result is not possible in Stata. This is because the first variable (column) in the result must have a name, and because neither blank spaces nor % characters are permitted in the names of the other variables. But you can come close with:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str9 city str6(var1 var2) str4 var3 "New York " "Eggs " "Bread " "" "London " "Bread " "Eggs " "Milk" "New York " "Eggs " "Milk " "" "New York " "Milk " "Eggs " "" "London " "Bread " "Milk " "Egg" "New York " "Milk " "Eggs " "" "Paris " "Bread " "Milk " "Eggs" "London " "Milk " "Bread " "" "Mumbai " "Eggs " "Milk " "" "Mumbai " "Eggs " "Bread " "" end // CORRECT DATA INCONSISTENCIES replace var3 = "Eggs" if var3 == "Egg" foreach v of varlist var1-var3 { replace `v' = trim(itrim(`v')) } gen long obs_no = _n reshape long var, i(obs_no) drop if missing(var) contract city var by city (var), sort: egen _pct = pc(_freq) replace city = strtoname(city) reshape wide @_freq @_pct, i(var) j(city) string format *_pct %2.1f

Note that the process begins with some correction of data inconsistencies. I suspect the real data set is even more in need of cleaning than the example. But remember that to Stata, "Eggs " is not the same thing as "Eggs", nor is either of those the same as "Egg". The data must be consistent about the use of spacing, spelling, punctuation (if any), and capitalization for things to work out properly.

You will find at least one instance in which the percentage produced by this code disagrees with your example result table, but hand calculation reveals that the code is correct and the example is not.

In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment

Announcement

Frequency distribution for multiple variables (with same categories) conditioned by another variable

Comment