reshape error

Anne-Claire Jo

Join Date: Feb 2021
Posts: 162

03 Jul 2022, 11:17

Dear Stata users,

I am facing a problem with the -reshape wide- command.
More specifically, I would like my dataset to have the following format:
[country] [year] [wealth_inequality_p0p100] [wealth_inequality_p50p100] ... etc

Here is an example of my dataset:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str55 country str7 percentile int year double wealth_inequality
"Afghanistan" "p90p100" 1990     .
"Afghanistan" "p90p100" 1991     .
"Afghanistan" "p90p100" 1992     .
"Afghanistan" "p90p100" 1993     .
"Afghanistan" "p90p100" 1994     .
"Afghanistan" "p90p100" 1995 .5836
"Afghanistan" "p90p100" 1996  .582
"Afghanistan" "p90p100" 1997 .5831
"Afghanistan" "p90p100" 1998 .5843
"Afghanistan" "p90p100" 1999 .5835
"Afghanistan" "p90p100" 2000 .5834
"Afghanistan" "p90p100" 2001 .5859
"Afghanistan" "p90p100" 2002 .5871
"Afghanistan" "p90p100" 2003 .5856
"Afghanistan" "p90p100" 2004 .5871
"Afghanistan" "p90p100" 2005 .5883
"Afghanistan" "p90p100" 2006 .5899
"Afghanistan" "p90p100" 2007  .586
"Afghanistan" "p90p100" 2008 .5852
"Afghanistan" "p90p100" 2009 .5853
"Afghanistan" "p90p100" 2010 .5867
"Afghanistan" "p90p100" 2011  .584
"Afghanistan" "p90p100" 2012 .5847
"Afghanistan" "p90p100" 2013 .5851
"Afghanistan" "p90p100" 2014 .5847
"Afghanistan" "p90p100" 2015  .584
"Afghanistan" "p90p100" 2016 .5862
"Afghanistan" "p90p100" 2017 .5885
"Afghanistan" "p90p100" 2018 .5887
"Afghanistan" "p90p100" 2019 .5884
"Afghanistan" "p90p100" 2020 .5869
"Afghanistan" "p90p100" 2021 .5884
"Afghanistan" "p0p50"   1990     .
"Afghanistan" "p0p50"   1991     .
"Afghanistan" "p0p50"   1992     .
"Afghanistan" "p0p50"   1993     .
"Afghanistan" "p0p50"   1994     .
"Afghanistan" "p0p50"   1995 .0481
"Afghanistan" "p0p50"   1996 .0483
"Afghanistan" "p0p50"   1997 .0482
"Afghanistan" "p0p50"   1998  .048
"Afghanistan" "p0p50"   1999 .0481
"Afghanistan" "p0p50"   2000 .0481
"Afghanistan" "p0p50"   2001 .0478
"Afghanistan" "p0p50"   2002 .0476
"Afghanistan" "p0p50"   2003 .0478
"Afghanistan" "p0p50"   2004 .0476
"Afghanistan" "p0p50"   2005 .0474
"Afghanistan" "p0p50"   2006 .0472
"Afghanistan" "p0p50"   2007 .0478
"Afghanistan" "p0p50"   2008 .0479
"Afghanistan" "p0p50"   2009 .0479
"Afghanistan" "p0p50"   2010 .0477
"Afghanistan" "p0p50"   2011 .0481
"Afghanistan" "p0p50"   2012  .048
"Afghanistan" "p0p50"   2013 .0479
"Afghanistan" "p0p50"   2014  .048
"Afghanistan" "p0p50"   2015 .0481
"Afghanistan" "p0p50"   2016 .0478
"Afghanistan" "p0p50"   2017 .0475
"Afghanistan" "p0p50"   2018 .0475
"Afghanistan" "p0p50"   2019 .0475
"Afghanistan" "p0p50"   2020 .0477
"Afghanistan" "p0p50"   2021 .0475
"Afghanistan" "p99p100" 1990     .
"Afghanistan" "p99p100" 1991     .
"Afghanistan" "p99p100" 1992     .
"Afghanistan" "p99p100" 1993     .
"Afghanistan" "p99p100" 1994     .
"Afghanistan" "p99p100" 1995 .2458
"Afghanistan" "p99p100" 1996 .2438
"Afghanistan" "p99p100" 1997 .2448
"Afghanistan" "p99p100" 1998 .2462
"Afghanistan" "p99p100" 1999 .2458
"Afghanistan" "p99p100" 2000 .2448
"Afghanistan" "p99p100" 2001 .2484
"Afghanistan" "p99p100" 2002 .2498
"Afghanistan" "p99p100" 2003 .2481
"Afghanistan" "p99p100" 2004  .249
"Afghanistan" "p99p100" 2005 .2505
"Afghanistan" "p99p100" 2006 .2521
"Afghanistan" "p99p100" 2007 .2479
"Afghanistan" "p99p100" 2008  .247
"Afghanistan" "p99p100" 2009  .247
"Afghanistan" "p99p100" 2010 .2497
"Afghanistan" "p99p100" 2011 .2471
"Afghanistan" "p99p100" 2012  .248
"Afghanistan" "p99p100" 2013 .2484
"Afghanistan" "p99p100" 2014 .2491
"Afghanistan" "p99p100" 2015 .2475
"Afghanistan" "p99p100" 2016 .2496
"Afghanistan" "p99p100" 2017 .2527
"Afghanistan" "p99p100" 2018  .253
"Afghanistan" "p99p100" 2019 .2525
"Afghanistan" "p99p100" 2020 .2498
"Afghanistan" "p99p100" 2021 .2525
"Africa"      "p90p100" 1990     .
"Africa"      "p90p100" 1991     .
"Africa"      "p90p100" 1992     .
"Africa"      "p90p100" 1993     .
end

I have tried this command:
reshape wide wealth_inequality, i(country year) j(percentile) string

However, Stata showed this error:
values of variable percentile not unique within country year
Your data are currently long. You are performing a reshape wide. You specified i(country year) and
j(percentile). There are observations within i(country year) with the same value of j(percentile). In the
long data, variables i() and j() together must uniquely identify the observations.

     long                                wide
    +---------------+                   +------------------+
    | i   j   a   b |                   | i   a1 a2  b1 b2 |
    |---------------| <--- reshape ---> |------------------|
    | 1   1   1   2 |                   | 1   1   3   2  4 |
    | 1   2   3   4 |                   | 2   5   7   6  8 |
    | 2   1   5   6 |                   +------------------+
    | 2   2   7   8 |
    +---------------+
Type reshape error for a list of the problem variables.

Could someone please help me with this issue?

Thank you in advance,
AC


Clyde Schechter

Join Date: Apr 2014

Posts: 30356
#2

03 Jul 2022, 11:36

Well, the code you show works just fine with the example data you show:

Code:

. reshape wide wealth_inequality, i(country year) j(percentile) string
(j = p0p50 p90p100 p99p100)

Data                               Long   ->   Wide
-----------------------------------------------------------------------------
Number of observations              100   ->   36
Number of variables                   4   ->   5
j variable (3 values)        percentile   ->   (dropped)
xij variables:
                      wealth_inequality   ->   wealth_inequalityp0p50 wealth_inequalityp90p100 wealth_inequalityp99p100
-----------------------------------------------------------------------------

. end of do-file

So the problem lies elsewhere in your data set. What Stata is telling you is that there is (are) some combination(s) of country and year where one or more of the values "p0p50", "p90p100", "p99p100" appears more than once. Consequently, Stata has no way to know which of those observations, if any, is the correct one for that category of variable percentile.
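To make that concrete, here is a minimal sketch of the situation Stata is complaining about. It is not your actual data: the values are taken from your example, but the repeated "p0p50" row for 1995 is something I have made up for illustration.

```stata
* Toy data: the second "p0p50" row for 1995 is a hypothetical surplus observation
clear
input str11 country str7 percentile int year double wealth_inequality
"Afghanistan" "p0p50"   1995 .0481
"Afghanistan" "p0p50"   1995 .0481
"Afghanistan" "p90p100" 1995 .5836
end

* -isid- exits with an error when the variables do not uniquely identify observations
capture isid country year percentile
display "isid return code: " _rc    // nonzero here because of the repeated row

* the same condition makes -reshape wide- stop with the error you saw
capture noisily reshape wide wealth_inequality, i(country year) j(percentile) string
```

With only one "p0p50" row per country and year, both commands would run without complaint.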

So you need to fix your data set. The first step is to find the surplus observations. You can do that with:

Code:

duplicates tag country year percentile, gen(flag)
browse if flag

That will show them to you. After you inspect them, you have to figure out two things:
1. How did this happen in the first place?
2. How to fix it.

As for how did this happen: when a data set like this contains surplus observations it usually means that errors were made in the data management that created the data set up to that point. This can be true even if the data was purchased from a reputable curator. If the data management is within your purview, you should review it carefully and try to find out where the excess observations slipped in. And since code that contains one error often contains others, also look for other problems that have not yet bitten. Fix up all the code, re-run the data management, verify that the revised data set contains no surplus observations, and then try your -reshape- again.
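One way to do that verification, as a sketch rather than a prescription, is to put -isid- in front of the -reshape-. The toy data below reuses a few values from your example and assumes the surplus observations have already been removed upstream.

```stata
* Toy data with exactly one observation per country, year and percentile
clear
input str11 country str7 percentile int year double wealth_inequality
"Afghanistan" "p0p50"   1995 .0481
"Afghanistan" "p90p100" 1995 .5836
"Afghanistan" "p99p100" 1995 .2458
"Afghanistan" "p0p50"   1996 .0483
"Afghanistan" "p90p100" 1996 .582
"Afghanistan" "p99p100" 1996 .2438
end

* stops the do-file with an error unless these variables jointly identify every observation
isid country year percentile

* reached only if the check above passes
reshape wide wealth_inequality, i(country year) j(percentile) string
describe
```

Because -isid- halts the do-file on failure, the -reshape- is never attempted on data that still contain surplus observations.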

As for how to fix it: if the data management that created the data set is not within your purview, then you should notify whoever created or provided it and ask them to fix the problem. If that is not feasible, then you have no choice but to patch the data set in some way. There are several possibilities, depending on the nature of the surplus observations. If the surplus observations are exact duplicates on all variables, then running -duplicates drop- will retain only one copy from each group, and you will be good to go. If the surplus observations, however, disagree on some other variables, then you may have to do some research to figure out which observation contains the correct value, and remove the incorrect ones. Or perhaps one observation has correct values for some variables and another has correct values for others. Then you will have to write some code to create a new observation that combines the correct values in a single observation and deletes the original ones. And so on.
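A rough sketch of the first two cases, on made-up data: the repeated "p0p50" row and the conflicting .584 value are hypothetical, and the variable name conflict is just my choice.

```stata
* Toy data: one exact duplicate (p0p50) and one conflicting pair (p90p100)
clear
input str11 country str7 percentile int year double wealth_inequality
"Afghanistan" "p0p50"   1995 .0481
"Afghanistan" "p0p50"   1995 .0481
"Afghanistan" "p90p100" 1995 .5836
"Afghanistan" "p90p100" 1995 .584
end

* Case 1: observations identical on all variables; keep one copy of each
duplicates drop

* Case 2: whatever is still surplus disagrees on some variable and needs a decision
duplicates tag country year percentile, gen(conflict)
list if conflict, sepby(country year percentile)
```

Here -duplicates drop- removes the extra "p0p50" row, and the listing shows the two "p90p100" rows that still have to be resolved by hand before -reshape- will work.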

Anyway, the problem is not with your code, it's the data.