Multiple observations per ID according to a second variable

CEdward

Join Date: Nov 2014

Posts: 131
#1

Multiple observations per ID according to a second variable

02 Dec 2014, 12:07

Hi,

As the title suggests I am having a problem where I have multiple observations per ID according to a second variable.

My data looks roughly like this:

ID M ME
1 1 30
1 1 30
1 1 30
1 1 30
1 2 45
2 3 60
2 3 60
2 3 60
2 3 60

I am basically trying to reshape my data from long to wide but it says my j() isn't unique within i. How do I fix this?
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#2

02 Dec 2014, 12:17

Well clearly, within ID, you have multiple observations with the same values of M and ME. If these are the only variables in your data set, and if there is nothing informative about how many copies of a particular observation there are, you could just eliminate all the duplicates with -duplicates drop-.

But if there are other variables in your data which, perhaps, do vary within ID X M groups, then you should be wary of proceeding in that way, unless you are sure that the information in those other variables (or the information reflected by the number of copies of a given observation) is of no importance to you. You might then need to develop a new variable that does take on distinct values within ID. Something like -by ID M, sort: gen _j = _n- would do in order to pass muster with -reshape-. Better, though, would be if you could find a variable that distinguishes the observations within ID X M groups in a more meaningful way.

By the way, you have received help on this forum before. Please, now, help us maintain the professional atmosphere by conforming to our preference that user names be real given and family names. Use the "Contact Us" button to request a change of your user name. Thank you.
Comment
CEdward

Join Date: Nov 2014

Posts: 131
#3

02 Dec 2014, 12:20

Originally posted by Nick Cox

mad_titan889:

Here's a signal in public but a purely personal one.

Often I have a go with questions like these, but I won't here.

You've been asked more than once to change your identifier. It's your right not to do that, and my right to decline to support you given that.

I've sent in a request for it. If you don't want to help, then don't. There's no need to make it explicit as you have done. Nevertheless, I have sent in the request for the name change.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35694
#4

02 Dec 2014, 12:25

Thanks for requesting a change, already effective.

I made my signal explicit for the same reason that traffic cops make themselves evident. Drivers should know the law, but that is not sufficient. It is important that others see signals of what we ask for when it's evident that they have not yet found out, or paid attention.

Clyde's request of a similar kind underlined a strong collective feeling about using real names among the people who answer most questions here.

And it worked, and we need say no more about it, except thanks again.

Last edited by Nick Cox; 02 Dec 2014, 12:28.
Comment
CEdward

Join Date: Nov 2014

Posts: 131
#5

02 Dec 2014, 12:38

Originally posted by Clyde Schechter View Post

Well clearly, within ID, you have multiple observations with the same values of M and ME. If these are the only variables in your data set, and if there is nothing informative about how many copies of a particular observation there are, you could just eliminate all the duplicates with -duplicates drop-.

But if there are other variables in your data which, perhaps, do vary within ID X M groups, then you should be wary of proceeding in that way, unless you are sure that the information in those other variables (or the information reflected by the number of copies of a given observation) is of no importance to you. You might then need to develop a new variable that does take on distinct values within ID. Something like -by ID M, sort: gen _j = _n- would do in order to pass muster with -reshape-. Better, though, would be if you could find a variable that distinguishes the observations within ID X M groups in a more meaningful way.

By the way, you have received help on this forum before. Please, now, help us maintain the professional atmosphere by conforming to our preference that user names be real given and family names. Use the "Contact Us" button to request a change of your user name. Thank you.

Hi Clyde,

I am very new to STATA so please bear with me.

ID M ME
1 1 30
1 1 30
1 1 30
1 1 30
1 2 45
2 3 60
2 3 60
2 3 60
2 3 60

As shown by the variable ME the duplicates of, say 30, are basically the averages of a measurement pertaining to the same M "1". I have other variables that are of interest as well and some of them do vary. What should I do?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35694
#6

02 Dec 2014, 12:45

You can only reshape wide if observations are individually identifiable. Perhaps something distinguishes the different observations other than the variables you show us, or you could create an arbitrary identifier.

But you haven't yet given a specific reason for wanting to reshape wide. If you spell out what you want to do that you think requires a reshape wide, then we might get a clue on how should you should do it -- or indeed whether.

I'd venture a wild generalisation that 95% of problems in Stata are easier with a long structure and 5% are easier with a wide structure when there's a choice between them. So, you already have the more tractable structure.
Comment
CEdward

Join Date: Nov 2014

Posts: 131
#7

02 Dec 2014, 12:57

The reason why I want to reshape wide is basically I want to work with one ID (e.g. 1 of ID=1, 1 of ID=2, etc.) Having multiple IDs=1 doesn't help me.
Comment
CEdward

Join Date: Nov 2014

Posts: 131
#8

02 Dec 2014, 13:02

My problem is very similar to this:
http://www.stata.com/support/faqs/da...-with-reshape/

Re: Question regarding multiple observations.
Comment
Sarah Edgington

Join Date: Apr 2014

Posts: 284
#9

02 Dec 2014, 13:13

I'll reiterate Nick's point that many data management and creation tasks can be done as easily, if not more-so, with long format. It's frequently the case that new Stata users will look for solutions in wide format because that seems more intuitive on the surface. Once you learn a few Stata tricks, though, it often requires much less code to do the same calculations in long form.

To be able to help you we really need to know more about what's in your data. If you're willing and able to tell us why you want the data in wide form we'll be even more likely to be able to help you achieve your goals. In the absence of more details about what data you have or what outcome you desire, we can't do much more than guess at the right answer.

Given what you've shared, the only way forward I can see is to make an arbitrary j identifier within ID. You would do this by issuing a command like (as Clyde already indicated)

Code:

bysort id: gen j=_n

Then you would proceed with your reshape command. This may not be the right answer for your data. However, we would need more information to help you determine that.

I would encourage you to re-read the FAQ (link located in the upper left of every forum screen). This document describes the kinds of information that is helpful and offers advice on how to ask questions in a way that increases the chances that forum members will be able to help you find the right solution to the problem.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35694
#10

02 Dec 2014, 13:20

I am familiar with that FAQ, as its author, but I don't see which part you are referring to.

With repeated occurrences of the same identifier, a user is often moving back and forth between two versions of the same data, a complete dataset and a reduced dataset with one observation per identifier. But that need not mean a wide structure. Although there are tricks for working across rows, it is often hard work even for an experienced user.
Comment
CEdward

Join Date: Nov 2014

Posts: 131
#11

02 Dec 2014, 13:24

Originally posted by Sarah Edgington View Post

I'll reiterate Nick's point that many data management and creation tasks can be done as easily, if not more-so, with long format. It's frequently the case that new Stata users will look for solutions in wide format because that seems more intuitive on the surface. Once you learn a few Stata tricks, though, it often requires much less code to do the same calculations in long form.

To be able to help you we really need to know more about what's in your data. If you're willing and able to tell us why you want the data in wide form we'll be even more likely to be able to help you achieve your goals. In the absence of more details about what data you have or what outcome you desire, we can't do much more than guess at the right answer.

Given what you've shared, the only way forward I can see is to make an arbitrary j identifier within ID. You would do this by issuing a command like (as Clyde already indicated)

Code:

bysort id: gen j=_n

Then you would proceed with your reshape command. This may not be the right answer for your data. However, we would need more information to help you determine that.

I would encourage you to re-read the FAQ (link located in the upper left of every forum screen). This document describes the kinds of information that is helpful and offers advice on how to ask questions in a way that increases the chances that forum members will be able to help you find the right solution to the problem.

The reason I want to switch to a wide format is because I want to see among all of the IDs which ones satisfy some criterion, e.g. they scored over 500 on a test. I could do this in long format but each ID is repeated multiple times over and over (hence long dataset) and that wouldn't tell me how many of the IDs out of say 100 scored over 500 (it would give me much larger number).
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35694
#12

02 Dec 2014, 13:45

Not so. Others have travelled this path before you and created tools to help. That's a perfectly easy problem with just one trick needed.

Code:

egen max = max(score), by(id) egen tag = tag(id) count if max > 500 & tag

Besides, how you would do this with a wide structure? In this case, there is a rowmax() function to egen, but there are other fairly simple problems harder to tackle with a wide structure.
1 like
Comment
CEdward

Join Date: Nov 2014

Posts: 131
#13

02 Dec 2014, 14:10

Originally posted by Nick Cox View Post

Not so. Others have travelled this path before you and created tools to help. That's a perfectly easy problem with just one trick needed.

Code:

egen max = max(score), by(id) egen tag = tag(id) count if max > 500 & tag

Besides, how you would do this with a wide structure? In this case, there is a rowmax() function to egen, but there are other fairly simple problems harder to tackle with a wide structure.

Hi Nick,

I think this may be it! But, I just want to clarify that I have done the right thing. So, if I have, e.g. 100 unique IDs, but for each ID it gets repeated multiple times (as shown above) and I want to figure out how many of my unique IDs satisfy some criterion (e.g. test score>500), then your input there should tell me that?

Also, what if I want to repeat the same procedure, but I want to do it within certain subgroups, e.g. 15 grade eleven students scored >500, 20 grade 12 students scored>500, etc.

Thanks.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35694
#14

02 Dec 2014, 14:22

Yes on the first, with the quibble that "distinct" is a much better word here than "unique". See http://www.stata-journal.com/sjpdf.h...iclenum=dm0042

On the second you have already calculated what you need, so you just keeping insisting on the tag to avoid counting each identifier repeatedly:

Code:

egen max = max(score), by(id) egen tag = tag(id) tab grade if max > 500 & tag

Incidentally, what tricks there are for rowwise calculations are reviewed in http://www.stata-journal.com/sjpdf.h...iclenum=pr0046
Comment
CEdward

Join Date: Nov 2014

Posts: 131
#15

07 Dec 2014, 08:19

Hi Nick,

Thanks so much this is fantastic. But, I have one last question. How would I do what you just did above, but add a restriction. For example, I only want those counts to apply to certain scores that were obtained after a certain time period?

Thanks again.
Comment

Announcement

Multiple observations per ID according to a second variable

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment