New to understanding stata/ Help with multiple variables

kyle coran

Join Date: Feb 2023

Posts: 12
#1

11 Feb 2023, 20:11

Hello,

I am new to Stata, though somewhat familiar with how to use it. I have a question about how to create variables from items with multiple response values. For example, I want to combine the following items, each of which has several possible responses. I have already coded for the missing items (96, 97, 98). Apologies if this does not make sense; I am unsure how to describe exactly what I want to do and am unfamiliar with the names for things.
Item 1 H2RM11 (resident mother) and H2RF11 (resident father): How often is the resident mother/father at home when you leave for school?
Response label and response value (#):
Always (1), Most of the time (2), Some of the time (3), Almost never (4), Never (5), He/She takes me to school (6), Refused (96), Legitimate Skip (97), Don’t know (98), Not applicable (99).

Item 2 H2RM12 (resident mother) and H2RF12 (resident father): How often is the resident mother/father at home when you return from school?
Response label and response value (#):
Always (1), Most of the time (2), Some of the time (3), Almost never (4), Never (5), He/She takes me to school (6), Refused (96), Legitimate Skip (97), Don’t know (98), Not applicable (99).

Item 3 H2RM13 (resident mother) and H2RF13 (resident father): How often is the resident mother/father at home when you go to bed?
Response label and response value (#):
Always (1), Most of the time (2), Some of the time (3), Almost never (4), Never (5), Refused (96), Legitimate Skip (97), Don’t know (98), Not applicable (99).
Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#2

11 Feb 2023, 21:24

It is unclear what your question is. Do you already have a Stata data set containing these variables? If so, you are far more likely to get a useful response if you show example data from it. For that, we have the -dataex- command. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data. So please post back using -dataex- to show example data.
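As a rough sketch of what a -dataex- call can look like: the toy data below are made up purely so the example runs on its own, and only the variable names come from #1. Restricting -dataex- to the relevant variables and a handful of observations keeps the output small, which is usually the way around complaints about the data set being too large to show.

```stata
* Toy stand-in for the real data (all values are invented for illustration)
clear
input byte(H2RM11 H2RF11 H2RM12 H2RF12 H2RM13 H2RF13)
1 2 3 1 1 2
2 . 1 5 1 1
6 4 2 2 . 3
end

* Show only the six items of interest, and only the first few observations
dataex H2RM11 H2RF11 H2RM12 H2RF12 H2RM13 H2RF13 in 1/3
```

With the real data set in memory, only the last command is needed, with the observation range widened to something like 1/20.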

If the data have not yet been imported to Stata, in what form do they exist? Do you have them in a single spreadsheet? Multiple spreadsheets? Text files of some kind? Data files intended for some other application (relational data base, non-Stata statistics package)? Something else?

Finally, in general terms, what sorts of calculations do you hope to perform with this data? Different kinds of data organization may facilitate different kinds of analysis.

In short, in order for somebody to help you draw a roadmap, they need to know where you are and where you want to go. I also suggest that you read the Forum FAQ for excellent guidance on how to post questions in ways that maximize your chances of getting a timely and helpful response before responding. You can get to the FAQ by clicking in the black area underneath the Statalist banner.

I have already coded for the missing items (96, 97, 98).

Do you mean that you created the codes 96, 97, and 98 for these categories of non-response? If so, I'm sorry to tell you that's a terrible idea in Stata and will lead to no end of problems with data analysis. One of the first things you will need to do is convert those to Stata missing values.

If you mean that you have already eliminated the 96, 97, and 98 values in the original data, converting them to Stata missing values, that was a great first step towards making your data usable.
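A minimal sketch of one way to do that conversion, assuming the non-response codes are exactly the ones listed in #1. The toy values and the choice of which extended missing value (.a through .d) stands for which code are assumptions made for illustration, not something taken from the original data.

```stata
* Toy data using the non-response codes from #1 (values are invented)
clear
input int(H2RM11 H2RF11)
1 96
97 2
98 99
6 5
end

* Map each non-response code to its own extended missing value,
* so the reason for missingness is not lost
mvdecode H2RM11 H2RF11, mv(96=.a \ 97=.b \ 98=.c \ 99=.d)

* Optional: label the extended missing values so tabulations stay readable
label define nonresp .a "Refused" .b "Legitimate skip" .c "Don't know" .d "Not applicable"
label values H2RM11 H2RF11 nonresp

tabulate H2RM11, missing
```

Using plain system missing (.) for all of the codes also works; extended missing values just keep the distinction available if it is needed later.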
George Ford

Join Date: Aug 2014

Posts: 3152
#3

12 Feb 2023, 10:17

This will create a variable for all possible combinations, but there may be thousands of them and sorting it out would be difficult.

Code:

egen athome_grp = group(H2RM11 H2RF11 H2RM12 H2RF12 H2RM13 H2RF13)

It may make sense to combine some groupings to shrink the dimension.

G
kyle coran

Join Date: Feb 2023

Posts: 12
#4

12 Feb 2023, 13:05

Hi Clyde,

Yes, I already have a dataset containing those variables. I am unsure of what you mean by dataex...when I type it in, it is coming up as an error saying that the dataset is too large to generate.

Maybe to make my question more clear, I have a set of variables that use a Likert scale (values ranging 0-3). I would like to create the variables so that each value corresponds to a specific value. For example variable X1 = 1 if X1 = 1, X1 =2 if X2 = 2, etc. I then want to sum up all the scores to create a 'parental supervision' scale.

Yes, the values 96-98 were converted to missing values. After converting the dataset to stata, I went in with the model of the following code to change those to missing: replace varname = . if varname == #.
Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#5

12 Feb 2023, 13:26

I would like to create the variables so that each value corresponds to a specific value. For example variable X1 = 1 if X1 = 1, X1 =2 if X2 = 2, etc.

???
X1 =1 if X1 = 1 is inevitably true. I still don't get what you want. Let me ignore that part and run with

I have a set of variables that use a Likert scale (values ranging 0-3). ...
I then want to sum up all the scores to create a 'parental supervision' scale.

I'll assume that these variables are, in fact, numeric--it seems like that given what you did with missing values. Then these data are ready, as is, for what you want to do. All you need is:

Code:

egen parental_supervision = rowtotal(H2R?11 H2R?12 H2R?13)

and you have it.

That said, I think that creating this scale by adding up the variables is not a good idea. Because there can be missing values, adding them up treats the missing values as if they were zero responses. Apart from the fact that zero isn't even a valid item response option, ordinally it acts as something like "more often than always." I think this approach is psychometrically unsound. I would use the mean of the item responses instead.

Code:

egen parental_supervision = rowmean(H2R?11 H2R?12 H2R?13)

This approach simply skips over the missing values, which is mathematically equivalent to treating them as equal to the mean of the same person's responses to the non-missing items. This approach is called ipsative mean imputation. Often it is qualified by restricting it to people who have non-missing responses to at least some minimum number of the items, leaving the scale score as missing for those with too many items missed. For example, if you want to use ipsative mean imputation, but not score people who only answered 1 item, it would be:

Code:

egen valid_responses = rownonmiss(H2R?11 H2R?12 H2R?13)
egen parental_supervision = rowmean(H2R?11 H2R?12 H2R?13) if valid_responses > 1
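To make the difference between the total and the mean concrete, here is a small illustration on invented data: three respondents who answer 2 to every item they answer, but who skip different numbers of items. The variable names sum_score, n_valid and mean_score are hypothetical.

```stata
* Three made-up respondents: all items answered, half answered, one answered
clear
input byte(H2RM11 H2RF11 H2RM12 H2RF12 H2RM13 H2RF13)
2 2 2 2 2 2
2 2 2 . . .
2 . . . . .
end

* rowtotal() silently treats missing items as zero
egen sum_score = rowtotal(H2R?11 H2R?12 H2R?13)

* rownonmiss() counts the items actually answered
egen n_valid = rownonmiss(H2R?11 H2R?12 H2R?13)

* rowmean() averages over answered items; require at least 2 of them
egen mean_score = rowmean(H2R?11 H2R?12 H2R?13) if n_valid > 1

list sum_score n_valid mean_score
```

The totals come out as 12, 6 and 2 even though all three respondents gave identical answers wherever they answered, whereas the mean is 2 for the first two and missing for the respondent with only one valid item.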

All of that said, I'm still puzzled about the data. In #1 you show items with a response set ranging from 1 to 6 (excluding the 9# missing value codes), yet now you say you have scores ranging from 0 to 3. So which is it? Or did you transform the 1 to 6 variables into 0 to 3 by combining some categories together? I would also note that the response set ranging from 1 to 6 looks psychometrically inappropriate to me. I say that because option 6, "..takes me to school", is not mutually exclusive of the 1-5 responses. That is to say, one parent might be home any proportion of time at all, between never and always, inclusive, and also take the child to school. So it would seem to me that the responses to those items would be contaminated with noise reflecting which response was chosen when both 6 and one of 1-5 were true. Not good item design, it seems to me.
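If the items have to be used as they are, one possible workaround (an assumption on my part, not something the data documentation prescribes) is to pull response 6 out into its own indicator and keep the 1-5 frequency responses in a separate variable. The names takes_m11 and home_m11 and the toy values are hypothetical.

```stata
* Toy data for one item (values are invented)
clear
input byte H2RM11
1
3
6
5
.
end

* Indicator for "takes me to school"; stays missing when the item is missing
generate byte takes_m11 = (H2RM11 == 6) if !missing(H2RM11)

* Frequency-at-home on the 1-5 scale only; response 6 becomes extended missing .e
generate byte home_m11 = H2RM11
replace home_m11 = .e if H2RM11 == 6

list
```

This does not recover how often the parent is home for those who chose 6; it only stops that response from being treated as a point on the frequency scale.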
George Ford

Join Date: Aug 2014

Posts: 3152
#6

13 Feb 2023, 08:52

Taking a guess, it sounds like the goal is to identify the location of parents at 3 important time periods a day for a child (going to school, getting home from school, and going to bed). Are both always home? Are both always away? and so forth.

With 6 legitimate options for each of 6 questions, the dimension of the problem is huge (unworkably so, I think: 6^6 = 46656 possibilities). Cut it to 3 answers by grouping responses (say, always+almostalways, almostnever+never), and it will become more manageable, though 3^6 = 729 is still a lot of possible combinations.

And as Clyde notes, the "takes me to school" response would have to be grouped, but I'd call that "always/almostalways home". But, it may be that taking your kid to school is different than just being home.
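A rough sketch of the grouping idea on simulated data. It assumes response 6 has already been dealt with separately, so only the 1-5 frequency responses remain, and the particular cut points (1-2, 3, 4-5) and the _3 suffix are just one hypothetical choice.

```stata
* Simulated 1-5 responses for the six items (not real data)
clear
set seed 12345
set obs 500
foreach v in H2RM11 H2RF11 H2RM12 H2RF12 H2RM13 H2RF13 {
    generate byte `v' = runiformint(1, 5)
}

* Collapse five levels into three: 1 = mostly home, 2 = sometimes, 3 = rarely home
foreach v of varlist H2R?11 H2R?12 H2R?13 {
    recode `v' (1 2 = 1) (3 = 2) (4 5 = 3), generate(`v'_3)
}

* One group id per observed combination of the six collapsed items (at most 729)
egen athome_grp3 = group(H2R?11_3 H2R?12_3 H2R?13_3)

summarize athome_grp3
```

The maximum reported by summarize is the number of distinct combinations that actually occur, which is often far smaller than the theoretical count.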
Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#7

13 Feb 2023, 12:31

But, it may be that taking your kid to school is different than just being home.

Yes. Consider a situation with divorced parents where the non-custodial parent picks the children up and takes them to school. That parent is never home, but always takes the children to school. The other parent is always at home but never takes the children to school.

Or consider a situation with divorced parents who share custody, with the children alternating weeks in the different homes, but one of them retains full-time responsibility for taking the children to school. That parent is at home half the time, but always takes the children to school.

One can imagine numerous other situations as well. I think pretty much all combinations of at-home-ness and taking the children to school are possible.

Last edited by Clyde Schechter; 13 Feb 2023, 12:34.

George Ford

Join Date: Aug 2014
Posts: 3152

13 Feb 2023, 14:28

Not sure why you'd create a variable X1=1 if X1=1, since you already have it.

but a sum of multiple variables is just

egen Xsum = rowtotal(X1 X2 X3)

Not sure exactly what that gets you, as you are summing ordinal variables. It would make more sense to create dummies for each type and then sum those.

I'd be thinking of "high engagement" "moderate engagement" and "low engagement".

Code:

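clear
set obs 1000

* Simulate six items with integer responses 1-6
forv i = 1/6 {
    g x`i' = int(runiform(1,7))
}

* Count of items answered 1 or 2 (high engagement)
g high = cond(x1<=2,1,0)
forv i = 2/6 {
    replace high = high+cond(x`i'<=2,1,0)
}

* Count of items answered 3 or 4 (moderate engagement)
g mod = cond(x1>=3 & x1<=4,1,0)
forv i = 2/6 {
    replace mod = mod+cond(x`i'>=3 & x`i'<=4,1,0)
}

* Count of items answered 5 or 6 (low engagement)
* With real data, also require !missing(x`i'), since missing values compare as larger than any number
g low = cond(x1>=5,1,0)
forv i = 2/6 {
    replace low = low+cond(x`i'>=5,1,0)
}

* Higher values of high indicate higher engagement; higher values of low indicate lower engagement, but you could recode so that it flows in a sensible order.

* Or you can create a continuous measure of engagement. I've seen some research saying that combining multiple Likert-scale variables into a continuous variable is legitimate.

forv i = 1/6 {
    g z`i' = x`i'/6
}

egen engaged = rowtotal(z1 z2 z3 z4 z5 z6)
* Reverse the scale so that higher values mean more engagement
replace engaged = 6-engaged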

HTML Code:

https://www.statisticssolutions.com/can-an-ordinal-likert-scale-be-a-continuous-variable/
