Reshaping issues

Jessica Brooks

Join Date: Sep 2017

Posts: 5
#1

23 Sep 2017, 07:24

Hi all,

I have been struggling to reshape one of my datasets from long to wide. It has 3 variables and 8,076,812 observations. I am relatively new to Stata/SE as well. I am using the maximum allowed number of variables. I also optimized the variables' storage types without forcing.

Here is a partial listing of the dataset, which is currently in long format. My dupersid variable, the primary id for participants, has duplicates. I would like to reshape the dataset so that the narcANALS values are spread wide for each dupersid (so that no dupersid is duplicated). It is narcANALS that causes the duplicates of dupersid: narcANALS is 0 or 1 for a certain medication, and participants receive scripts for multiple medications.

I used reshape for other datasets and it has worked for me. However, it is not working for the current dataset. I am receiving the error message under my code below. Do I need to split up my dataset?

reshape wide narcANALS, i(dupersid) j(newid)

variable newid takes on too many values
r(134);

     +-----------------------------+
     | dupersid   narcAN~S   newid |
     |-----------------------------|
  1. | 20004103          0       1 |
  2. | 20005101          0       2 |
  3. | 20005101          0       2 |
  4. | 20005101          0       2 |
  5. | 20005101          0       2 |
     |-----------------------------|
  6. | 20005101          0       2 |
  7. | 20005101          0       2 |
  8. | 20005101          0       2 |
  9. | 20005101          0       2 |
 10. | 20005101          0       2 |
     |-----------------------------|
 11. | 20005101          0       2 |
 12. | 20005101          0       2 |
 13. | 20005101          0       2 |
 14. | 20005101          0       2 |
 15. | 20005101          0       2 |
     +-----------------------------+
Rich Goldstein

Join Date: Mar 2014

Posts: 4485
#2

23 Sep 2017, 09:47

I'm confused by your data setup (in the future, please follow the advice in the FAQ and use -dataex- to show data). Each of your observations with a dupersid equal to 20005101 has a newid of 2, and this will not work for -reshape-; see:

Code:

help reshape

in general, Stata routines are easier and faster if data are in long format; why do you want to reshape to wide format?
Jessica Brooks

Join Date: Sep 2017

Posts: 5
#3

25 Sep 2017, 11:31

Thank you for your reply, Rich. I will use the FAQ section for reference in the future when showing data on here. For the newid variable, the numbers increase from 1 to 31653 across the observations.

I wanted to reshape to wide format in order to combine/merge my observations. I have multiple observations per person (i.e., dupersid) in this dataset's medication file--due to multiple medications prescribed to each person. I had multiple observations in a medical conditions file as well and I was able to successfully reshape that dataset. I have not conducted any further analyses though because I need to further merge my demographics/medical conditions file with the medications file. Do you know if there would be a better method for reorganizing this current dataset on medications in Stata? I may be missing something obvious. Thank you again for your support.
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#4

25 Sep 2017, 12:01

There is something rather bizarre about your data. It seems as if newid is just a recoding into consecutive integers of dupersid, starting from 1. Is that true in your data as a whole? Also, at least in the example you show, the variable narcANALS never changes within person. Is that true in your data as a whole?

If both of those are true, then you can reduce to one observation per person with:

Code:

by dupersid: keep if _n == 1

Let me assume, however, that in the real data narcANALS does, at least for some people, change within person. But I will retain the assumption that newid is just a redundant recoding of dupersid. I gather what you want is to have the consecutive values of narcANALS laid out in separate variables for each person in that person's observation, but not in other people's observations. In that case, what you need is a sequence number variable that identifies separate observations in each person (but restarts at 1 in each new person.) Then you can use that variable in the -j()- option of -reshape-.

Code:

by dupersid: gen int seq = _n
reshape wide narcANALS, i(dupersid) j(seq)
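As a small illustration of the sequence-number approach, here is a sketch on made-up toy data (the values below are invented, not taken from the real file). It also hints at why the original command failed: -reshape wide- creates one new variable per distinct value of the -j()- variable, so a j variable with tens of thousands of values cannot work, whereas seq only runs up to the largest number of observations any one person has.

```stata
* Toy data, invented for illustration: two people, one with three prescriptions
clear
input long dupersid byte narcANALS
20004103 0
20005101 0
20005101 1
20005101 0
end

* Sequence number that restarts at 1 within each person
* (bysort sorts by dupersid first, so no prior -sort- is needed)
bysort dupersid: gen int seq = _n

* One observation per person, with narcANALS1, narcANALS2, narcANALS3
reshape wide narcANALS, i(dupersid) j(seq)

* Checks: dupersid is now unique and there are two observations
isid dupersid
assert _N == 2
list
```

People with fewer observations than the maximum simply get missing values in the later narcANALS variables.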

Now let's talk about whether you really should be doing this. As Rich has indicated, and as I find myself reiterating several times a week here on Statalist, nearly everything in Stata is easier in long layout. So one should always think twice, three times, and a fourth time at least before going wide.

It seems your major concern here is that you want to be able to merge this data set with another data set having patient demographics and diagnoses. But such a data set would typically have just one observation per person. If that is true in your case, you don't need to reduce the narcANALS data to one observation per person: you can just -merge m:1 dupersid using demographics_file-. It is only if your demographics file also contains multiple observations per person that a problem arises. In that case, in order to merge, one of the data sets must be reduced to a single observation per person. Which one is better to reduce would depend on the actual contents and where you're going with it, so I won't go farther along this path.
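A minimal sketch of the -merge m:1- route, using invented toy data and a temporary file in place of a real demographics file (the variable age and all values are assumptions for illustration only):

```stata
* Toy demographics data, invented for illustration: one observation per person
clear
input long dupersid byte age
20004103 34
20005101 61
end

* Confirm dupersid uniquely identifies observations; -isid- errors out if not
isid dupersid
tempfile demographics
save `demographics'

* Toy medications data in long layout: several observations per person
clear
input long dupersid byte narcANALS
20004103 0
20005101 0
20005101 1
20005101 0
end

* Many medication rows per person matched to one demographics row per person
merge m:1 dupersid using `demographics'

* _merge == 3 marks observations found in both data sets
tabulate _merge
assert _merge == 3
```

The medications data stay in long layout throughout; each prescription row just picks up that person's demographics.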

Last edited by Clyde Schechter; 25 Sep 2017, 12:03.
Nick Cox

Join Date: Mar 2014

Posts: 35754
#5

25 Sep 2017, 12:49

It's Schechter's Law.

https://www.statalist.org/forums/for...served-0-value
Jessica Brooks

Join Date: Sep 2017

Posts: 5
#6

25 Sep 2017, 13:00

Wow, thank you, Clyde. Your post is extremely helpful, in spite of my difficult-to-follow questions/explanations. I very much appreciate it. You are correct that newid is a redundant recoding of dupersid. narcANALS changes--it can include multiples of "0" but only up to one value of "1" for each person, so it can change across individuals and within the person (most people will only have multiples of "0"; the minority will have one value of "1" and also multiples of "0"). You have given me a lot to think about. I think I will probably have to use the second set of code to reduce the observations and then reshape.

By the way, I had been reading different opinions on merge m:1 before and was nervous about this approach, but hearing your encouragement is helpful. I would like to use this command in the future.

Another quick question for the future. If I was able to use the current dataset for your first set of code, would I need to sort narcANALS before using the "keep if _n == 1" code? Will this code potentially discard values of "1" if they are not the first displayed observation? For the current example, I would want to keep all "1" values instead of "0" values for persons with "1" and then keep only one "0" value for persons without a "1" value.

I will work on this tonight--thanks again for your tremendous support!
Jessica Brooks

Join Date: Sep 2017

Posts: 5
#7

25 Sep 2017, 13:04

Thank you, Nick. If I needed to reshape from long to wide initially, is it possible to reshape back to long? Without messing up the original intention of the reshape? In my case, the intention is to reduce multiple, unneeded observations.
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#8

25 Sep 2017, 13:06

I don't think there is much controversy about -merge m:1-. What is "controversial" is -merge m:m-. StataCorp insists on keeping this a legal command, and unwary users continue to use it and unknowingly create a meaningless jumble of data, whereas you will find frequent posts from experienced users on Statalist warning that -merge m:m- has almost no legitimate real world applications. If you're thinking about using -merge m:m-, my unequivocal advice is: don't go there. I can guarantee you that it's wrong in your situation. But -merge m:1- and -merge 1:m- are both perfectly fine in their contexts.

As for your quick question for the future, if you want to keep only the observation with narcANALS = 1 when there is one, and a zero observation when that is the only value, then you do, indeed, need to worry about the sort order. Remember that -sort-ing narcANALS will put the 1 observation last. So what you would want to do is:

Code:

by dupersid (narcANALS), sort: keep if _n == _N
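To see what that line does, here is a sketch on invented toy data (not the real file): within each person the observations are sorted by narcANALS, so a 1, if present, ends up last, and keeping _n == _N retains it.

```stata
* Toy data, invented for illustration
clear
input long dupersid byte narcANALS
20004103 0
20004103 0
20005101 0
20005101 1
20005101 0
end

* Sort by narcANALS within dupersid, then keep each person's last observation
by dupersid (narcANALS), sort: keep if _n == _N

* Checks: one observation per person; the person with a 1 keeps it
isid dupersid
assert _N == 2
assert narcANALS == 1 if dupersid == 20005101
assert narcANALS == 0 if dupersid == 20004103
list
```

If the person-level indicator is the only variable needed, -collapse (max) narcANALS, by(dupersid)- should give the same values, but note that it is a different technique: it drops every variable not named in the command, whereas -keep- retains the whole observation. Also note that a missing narcANALS would sort after 1, so it is worth checking for missing values first.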
Jessica Brooks

Join Date: Sep 2017

Posts: 5
#9

25 Sep 2017, 21:02

Thank you again, Clyde! The last line of code you provided worked for both datasets. I was able to keep them in the long format by using that.
