Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reshaping many variables to wide (when wildcards are not allowed in reshape)

    Hi all

    I am trying to make a dataset showing some statistics of the collective of graduates from different universities in different years. Currently I have one observation and then a lot of variables in the form 'variablejx', where j is the university and x is the year. So for example the proportion of males among students who graduated from university 7 in year 2006 is shown by 'male_7_6' and the grade point average from highschool among the students who graduated from university 2 in year 2008 is shown by 'gpa_2_8'. As I have 8 variables, 8 universities and 16 years I have 8*8*16=1.024 variables. I want to reshape this very wide dataset into long format with each row corresponding to a university-year observation. I suppose I could do this by first reshaping into wide having the "j-variable" be the year and then reshaping into wide again having the "j-variable" be the university. My problem is that I don't want to type in 1.024 different stubs in the first reshape command, but wildcards are - as far as I have understood - not allowed in reshape commands.

    Here is an example of what my data looks like (where I have restricted the number of variables to 2, universities to 3 and years to 3, so to make the data-example not too comphrehensive):

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(gpa_1_1 gpa_1_2 gpa_1_3 gpa_2_1 gpa_2_2 gpa_2_3 gpa_3_1 gpa_3_2 gpa_3_3) float(male_1_1 male_1_2 male_1_3 male_2_1 male_2_2 male_2_3 male_3_1 male_3_2 male_3_3 i)
    1 2 3 4 5 6 7 8 9 .1 .2 .3 .4 .5 .6 .7 .8 .9 1
    end
    Is there some workaround the no-wildcard rule, which can be applied in my case?


    ****

    In case anybody wonders how I got into this akward situation (and in case someone perhaps could provide an easier solution to my "real" task by looking at the steps before) I started by having a dataset with individual data on all individuals graduating from the 8 universities. I am trying to get from there, where my rows correspond to individuals observed in years, to a dataset where my rows correspond to graduate-years from the universities. I have currently made variables for each university in each graduate-year by

    Code:
    forval j = 1/8{
    forval x = 1/16{
    egen male_`j'_`x' = mean(male) if university==`j' & gradyear==`x'
    }
    }
    ...where male shows the sex of the individual (1=male) and 'university' and 'gradyear' shows the university and year the individual graduated from. Then after doing this for each of my 8 variables I wrote (just writing male in the code as an example, so not to put too much text in this post):

    Code:
    foreach x of varlist male*{
    sort `x'
    replace `x' = `x'[1]
    }
    
    keep in 1
    And now I am in the situation with 1 observation and 1.024 variables which I hope to reduce to 8 variables after two succesful reshape commands. I would be happy if anyone could see a smarter solution to my problem which in generic terms could be stated as "how to go from a dataset on individuals clustered in institutions to a dataset on institutions", but for the purpose of knowing how to work around the problem of not being able to use wildcards in reshape-commands I would also much appreciate if anyone could point me in the direction of a solution to first problem.
    Last edited by Emil Alnor; 13 Sep 2022, 07:51.

  • #2
    From the present ultra-wide data set you can get to what you want with a single-reshape and no worries about writing out a long series of variables:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(gpa_1_1 gpa_1_2 gpa_1_3 gpa_2_1 gpa_2_2 gpa_2_3 gpa_3_1 gpa_3_2 gpa_3_3) float(male_1_1 male_1_2 male_1_3 male_2_1 male_2_2 male_2_3 male_3_1 male_3_2 male_3_3 i)
    1 2 3 4 5 6 7 8 9 .1 .2 .3 .4 .5 .6 .7 .8 .9 1
    end
    
    gen long obs_no = _n
    reshape long gpa_ male_, i(obs_no) j(_j) string
    rename *_ *
    split _j, parse("_") destring gen(x)
    rename x1 university
    rename x2 grad_year
    drop obs_no _j
    order university grad_year, first
    There is this variable i at the end of your example data that I don't understand. Perhaps it is intended as an observation number. If so, you don't need to create obs_no, you can use i in the -i()- option of -reshape-. If it is something else, the above code just carries it along, extending it into each new observation created by the -reshape-.

    And, yes, it does sound like you could have avoided this problem with a sharper approach to management of the original data. You do not show example data from that, but looking at the loops you showed, it seems like you could have gotten directly from the individual data to the long data set with something like:
    Code:
    collapse (mean) male gpa, by(university grad_year)

    Comment


    • #3
      In order to avoid the complicated -reshape-, try:
      Code:
      collapse (mean) male, by(university gradyear)

      Comment


      • #4
        Originally posted by Clyde Schechter View Post
        From the present ultra-wide data set you can get to what you want with a single-reshape and no worries about writing out a long series of variables:
        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input byte(gpa_1_1 gpa_1_2 gpa_1_3 gpa_2_1 gpa_2_2 gpa_2_3 gpa_3_1 gpa_3_2 gpa_3_3) float(male_1_1 male_1_2 male_1_3 male_2_1 male_2_2 male_2_3 male_3_1 male_3_2 male_3_3 i)
        1 2 3 4 5 6 7 8 9 .1 .2 .3 .4 .5 .6 .7 .8 .9 1
        end
        
        gen long obs_no = _n
        reshape long gpa_ male_, i(obs_no) j(_j) string
        rename *_ *
        split _j, parse("_") destring gen(x)
        rename x1 university
        rename x2 grad_year
        drop obs_no _j
        order university grad_year, first
        There is this variable i at the end of your example data that I don't understand. Perhaps it is intended as an observation number. If so, you don't need to create obs_no, you can use i in the -i()- option of -reshape-. If it is something else, the above code just carries it along, extending it into each new observation created by the -reshape-.

        And, yes, it does sound like you could have avoided this problem with a sharper approach to management of the original data. You do not show example data from that, but looking at the loops you showed, it seems like you could have gotten directly from the individual data to the long data set with something like:
        Code:
        collapse (mean) male gpa, by(university grad_year)
        Hi Clyde, thanks for providing an answer to both of my questions. It is always fascinating to see when Stata has a one-line solution to something you tried to do by a long wall of coding. With regards to the 'i' in my dataexample that was indeed intended as an observation number.

        Comment

        Working...
        X