Trouble with reshaping data

Victoriaaa Spelt

Join Date: Aug 2018
Posts: 1

Trouble with reshaping data

19 Aug 2018, 14:23

Hi all, was hoping someone could please help me as this is important for my THESIS.

I have data in this long format. In the total data the ID ranges from 1001 to 2019. The Year ranges from 2006 to 2016.
An example of what the first 14 rows of data look like are below.

I have tried several times to reshape the data wide. I used the following ''reshape wide VARIABLE A, i(ID) j(YEAR).

Unfortunately, this does not work and the stata output says

----- ''values of variable year not unique within ID.
Your data are currently long. You are performing a reshape wide. You specified i(ID) and j(YEAR). There are observations within i(ID) with the same value of
j(YEAR). In the long data, variables i() and j() together must uniquely identify the observations-----

ID	VARIABLE A	YEAR

1001	200	2015
1001	333	2016
1001	444	2017
1002	555	2010
1003	662	2011
1003	456	2012
1003	783	2014
1004	459	2015
1005	239	2016
1005	348	2016
1006	452	2012
1006	324	2013
1008	365	2014
1009	216	2011

I thought to change ID numbering from 4 digits to 1 digit to prevent ID and YEAR having the same numbers further down the data. So I changed the ID from 1001 to 1, and 1002 to 2. 1003 to 3 and so on. I then attempted to reshape again but it still did not work and the output again was

----- ''values of variable year not unique within ID.
Your data are currently long. You are performing a reshape wide. You specified i(ID) and j(YEAR). There are observations within i(ID) with the same value of
j(YEAR). In the long data, variables i() and j() together must uniquely identify the observations-----

I really need to reshape this data wide so that I have VARIABLE A_YEAR columns. All data is numerical! No strings.

Please could someone kindly help me out?

Many thanks

ID	VARIABLE A	YEAR

1	200	2015
1	333	2016
1	444	2017
2	555	2010
3	662	2011
3	456	2012
3	783	2014
4	459	2015
5	239	2016
5	348	2016
6	452	2012
6	324	2013
8	365	2014
9	216	2011

Tags: continuous, reshape, wide

Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#2

19 Aug 2018, 14:38

No, nobody can help you do this because what you are asking is not possible with your data. Your data is probably wrong and has to be fixed (and you are lucky that Stata discovered the problem now rather than later when it might cause obscure difficulties). Once you do that, the code you already are using will work just fine.

Even within your example (and I assume the problem is still more pervasive in the full data set) look at the 9th and 10th observations. Both have id 1005 and year 2016. Think about what you are asking for. You want an observation for id 1005 to have variables called variable_a2006 through variable_a2016. But what values should Stata put for variable_a2010? Is it the 239 that appears in observation 9 or the 348 that occurs in observation 10? So your data poses insurmountable problems for reshape.

So you have to think about how your data came to be this way. Why are there two observations with id = 1005 and year = 2016? Perhaps the data are just wrong and need to be cleaned better to eliminate the duplicates. Running

Code:

duplicates tag id year, gen(flag) browse if flag

will show you all the offending observations. Then you can see whether some of them need to be removed from the data set because they are wrong. Or perhaps they are not wrong, but need to be combined in some way, such as, perhaps averaging the values of variable_a, or some other such maneuver.

Or perhaps you will realize that there is nothing wrong with the data: the reason there are two such observations is that these observations really do some in pairs. Perhaps id refers to a household, and one of the observations refers to the head of the household and the other refers to the spouse. In that case, the solution is not to change the data, but to include the variable(s) distinguishing these different observations in the -i()- option of your reshape command.

So do review your data and figure out what's going on.

That said, you probably shouldn't be doing this reshaping in the first place. Nearly all Stata analysis and data management commands work optimally (or even only) with long layout data. Wide layout is used by a small handful of commands, and is really mostly needed for creating a data set that is not to be further analyzed but is to be grasped quickly by human eyes, or for creating graphs with multiple y-variables plotted against a single x. Other than those things, you will probably wish you had just left your data long to begin with. (Still, you should investigate why id and year do not uniquely identify observations in your data as, if the data are wrong, they should be corrected even if you ultimately stick with long layout.)

Last edited by Clyde Schechter; 19 Aug 2018, 14:42.
2 likes
Comment

Announcement

Trouble with reshaping data

Comment