Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Trouble with reshaping data

    Hi all, was hoping someone could please help me as this is important for my THESIS.

    I have data in this long format. In the total data the ID ranges from 1001 to 2019. The Year ranges from 2006 to 2016.
    An example of what the first 14 rows of data look like are below.

    I have tried several times to reshape the data wide. I used the following ''reshape wide VARIABLE A, i(ID) j(YEAR).

    Unfortunately, this does not work and the stata output says

    ----- ''values of variable year not unique within ID.
    Your data are currently long. You are performing a reshape wide. You specified i(ID) and j(YEAR). There are observations within i(ID) with the same value of
    j(YEAR). In the long data, variables i() and j() together must uniquely identify the observations-----

    ID VARIABLE A YEAR
    1001 200 2015
    1001 333 2016
    1001 444 2017
    1002 555 2010
    1003 662 2011
    1003 456 2012
    1003 783 2014
    1004 459 2015
    1005 239 2016
    1005 348 2016
    1006 452 2012
    1006 324 2013
    1008 365 2014
    1009 216 2011

    I thought to change ID numbering from 4 digits to 1 digit to prevent ID and YEAR having the same numbers further down the data. So I changed the ID from 1001 to 1, and 1002 to 2. 1003 to 3 and so on. I then attempted to reshape again but it still did not work and the output again was

    ----- ''values of variable year not unique within ID.
    Your data are currently long. You are performing a reshape wide. You specified i(ID) and j(YEAR). There are observations within i(ID) with the same value of
    j(YEAR). In the long data, variables i() and j() together must uniquely identify the observations-----


    I really need to reshape this data wide so that I have VARIABLE A_YEAR columns. All data is numerical! No strings.

    Please could someone kindly help me out?

    Many thanks

    ID VARIABLE A YEAR
    1 200 2015
    1 333 2016
    1 444 2017
    2 555 2010
    3 662 2011
    3 456 2012
    3 783 2014
    4 459 2015
    5 239 2016
    5 348 2016
    6 452 2012
    6 324 2013
    8 365 2014
    9 216 2011










  • #2
    No, nobody can help you do this because what you are asking is not possible with your data. Your data is probably wrong and has to be fixed (and you are lucky that Stata discovered the problem now rather than later when it might cause obscure difficulties). Once you do that, the code you already are using will work just fine.

    Even within your example (and I assume the problem is still more pervasive in the full data set) look at the 9th and 10th observations. Both have id 1005 and year 2016. Think about what you are asking for. You want an observation for id 1005 to have variables called variable_a2006 through variable_a2016. But what values should Stata put for variable_a2010? Is it the 239 that appears in observation 9 or the 348 that occurs in observation 10? So your data poses insurmountable problems for reshape.

    So you have to think about how your data came to be this way. Why are there two observations with id = 1005 and year = 2016? Perhaps the data are just wrong and need to be cleaned better to eliminate the duplicates. Running
    Code:
    duplicates tag id year, gen(flag)
    browse if flag
    will show you all the offending observations. Then you can see whether some of them need to be removed from the data set because they are wrong. Or perhaps they are not wrong, but need to be combined in some way, such as, perhaps averaging the values of variable_a, or some other such maneuver.

    Or perhaps you will realize that there is nothing wrong with the data: the reason there are two such observations is that these observations really do some in pairs. Perhaps id refers to a household, and one of the observations refers to the head of the household and the other refers to the spouse. In that case, the solution is not to change the data, but to include the variable(s) distinguishing these different observations in the -i()- option of your reshape command.

    So do review your data and figure out what's going on.

    That said, you probably shouldn't be doing this reshaping in the first place. Nearly all Stata analysis and data management commands work optimally (or even only) with long layout data. Wide layout is used by a small handful of commands, and is really mostly needed for creating a data set that is not to be further analyzed but is to be grasped quickly by human eyes, or for creating graphs with multiple y-variables plotted against a single x. Other than those things, you will probably wish you had just left your data long to begin with. (Still, you should investigate why id and year do not uniquely identify observations in your data as, if the data are wrong, they should be corrected even if you ultimately stick with long layout.)
    Last edited by Clyde Schechter; 19 Aug 2018, 14:42.

    Comment

    Working...
    X