Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Frequency distribution for multiple variables (with same categories) conditioned by another variable

    Hi all,
    I apologize if this question is already on the forum, but I could not find it. I also apologize for my not perfect English.
    I would need to find a way to get a frequency table of several variables (actually the "same variable" spread over several columns), conditioning the result to a category of an additional variable.
    Here a dataset example, in wh:

    City Var1 Var2 Var3
    New York Eggs Bread
    London Bread Eggs Milk
    New York Eggs Milk
    New York Milk Eggs
    London Bread Milk Egg
    New York Milk Eggs
    Paris Bread Milk Eggs
    London Milk Bread
    Mumbai Eggs Milk
    Mumbai Eggs Bread

    My "final result" should be something like this:

    New York New York% London London% Paris Paris% Mumbai Mumbai%
    Eggs 4 50.0 2 25.0 1 33.3 2 50.0
    Bread 1 12.5 3 37.5 1 33.3 1 25.0
    Milk 3 25.0 3 37.5 1 33.3 1 25.0

    If it is possible, I would like to get the result without creating an additional variable.
    Thank you in advance!

  • #2
    I have no idea what you mean when you say "without creating an additional variable" since none of the variables you ask for in your result exist in the original data, so there is no way to avoid creating new variables for this.

    Next, the exact layout you ask for in the final result is not possible in Stata. This is because the first variable (column) in the result must have a name, and because neither blank spaces nor % characters are permitted in the names of the other variables. But you can come close with:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str9 city str6(var1 var2) str4 var3
    "New York " "Eggs "  "Bread " ""    
    "London "   "Bread " "Eggs "  "Milk"
    "New York " "Eggs "  "Milk "  ""    
    "New York " "Milk "  "Eggs "  ""    
    "London "   "Bread " "Milk "  "Egg"
    "New York " "Milk "  "Eggs "  ""    
    "Paris "    "Bread " "Milk "  "Eggs"
    "London "   "Milk "  "Bread " ""    
    "Mumbai "   "Eggs "  "Milk "  ""    
    "Mumbai "   "Eggs "  "Bread " ""    
    end
    
    //  CORRECT DATA INCONSISTENCIES
    replace var3 = "Eggs" if var3 == "Egg"
    foreach v of varlist var1-var3 {
        replace `v' = trim(itrim(`v'))
    }
    
    gen long obs_no = _n
    reshape long var, i(obs_no)
    drop if missing(var)
    
    contract city var
    by city (var), sort: egen _pct = pc(_freq)
    
    replace city = strtoname(city)
    reshape wide @_freq @_pct, i(var) j(city) string
    format *_pct %2.1f
    Note that the process begins with some correction of data inconsistencies. I suspect the real data set is even more in need of cleaning than the example. But remember that to Stata, "Eggs " is not the same thing as "Eggs", nor is either of those the same as "Egg". The data must be consistent about the use of spacing, spelling, punctuation (if any), and capitalization for things to work out properly.

    You will find at least one instance in which the percentage produced by this code disagrees with your example result table, but hand calculation reveals that the code is correct and the example is not.

    In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    Comment

    Working...
    X