Reshape error: variable id does not uniquely identify the observations

Jim Callegari

Join Date: Jan 2021

Posts: 35
#1

Reshape error: variable id does not uniquely identify the observations

10 Jan 2021, 15:21

Edit: Please see my reply below--instead of file uploads, I attached an output of what my data looks like! Thanks so much

Dear All,

I am trying to use the reshape command on the data attached, but I keep receiving this error.

Code:

reshape long a, i(combined) j(prod, string) (note: j = a aa ab ac ad ae af ag ah ai aj ak al am an ao ap aq ar as at au av aw ax ay az b ba bb bc bd be bf bg bh b > i bj bk bl bm bn bu bv bw bx by bz c ca cb cc cd ce cf cg ch ck cl cm cn co cp cq cr cs ct cu cv cw cx cy cz d da db > dc dd de df dg dh di dj dk dl dm dn do dp dq dr ds dt du dv dw dx dy dz e ea eb ec ed ee ef eg eh ei ej ek el em en > eo ep eq er es et eu f g h i j k l m n o p q r s t u v w x y z)

Code:

variable id does not uniquely identify the observations Your data are currently wide. You are performing a reshape long. You specified i(combined) and j(prod). In the current wide form, variable combined should uniquely identify the observations. Remember this picture: long wide +---------------+ +------------------+ | i j a b | | i a1 a2 b1 b2 | |---------------| <--- reshape ---> |------------------| | 1 1 1 2 | | 1 1 3 2 4 | | 1 2 3 4 | | 2 5 7 6 8 | | 2 1 5 6 | +------------------+ | 2 2 7 8 | +---------------+ Type reshape error for a list of the problem observations. r(9);

My goal of this code is to create a "prod" variable with all of the variable names (such as "White bread, 1 kg (supermarket)") inputted as individual entries underneath the "prod" variable. The reason I have 2 rows which (one of which has variable names aa, ab, ac, etc. and the other with the actual product names themselves, "White bread, 1 kg (supermarket)," is due to STATA's 32 character limit. It was causing me issues, so I tried to work around it by reshaping with an additional row.

I would also like to make another variable "store_type" in which it either says "supermarket" or "mid-priced store" and no longer having this information captured in the individual data entires themselves. So, to clarify, in the end I would have variables: year, country, city, currency, prod, and store_type. Under "prod" it would display "White bread, 1 kg," for example, and under "store_type" it would show either "supermarket" or "mid-priced store."

I'm sorry if this is a simple question--I'm just becoming more familiar with STATA and appreciate any and all suggestions you all may have.

Many thanks,
Jim

Last edited by Jim Callegari; 10 Jan 2021, 15:56.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30096
#2

10 Jan 2021, 15:49

May I suggest you review the Forum FAQ for advice on how to get the most out of your Statalist experience. In particular, attachments are discouraged. Many people who respond here will not download attachments from people they do not know--it just isn't safe for people who rely on their computers for their livelihoods. To show example data, use the -dataex- command. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When you post back in that way, you are likely to get a timely and helpful response. (Also, make sure the example data you post exhibits the same problem you are encountering.)
Comment

Jim Callegari

Join Date: Jan 2021
Posts: 35

10 Jan 2021, 15:55

Thanks Clyde! I did not know that. Here is an example of the data that I outputted with that command:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str23 combined str40 aa str45 ab str36 ac
"YearCountryCityCurrency" "abs_priceWhite bread, 1 kg (supermarket)" "abs_priceWhite bread, 1 kg (mid-priced store)" "abs_priceButter, 500 g (supermarket)"
"2014UKManchesterGBP"     "1.75"                                     "1.75"                                          "3.4"                                 
"2015UKManchesterGBP"     "1.38"                                     "1.6"                                           "3.58"                                
"2016UKManchesterGBP"     "1.3"                                      "1.94"                                          "3.4"                                 
"2017UKManchesterGBP"     "1.3"                                      "2.13"                                          "3.3"                                 
"2018UKManchesterGBP"     "1.38"                                     "2.03"                                          "4"                                   
"2019UKManchesterGBP"     "1.38"                                     "1.55"                                          "3.7"                                 
"2020UKManchesterGBP"     "1.44"                                     "1.59"                                          "3.95"                                
"2014UKLondonGBP"         "1.3"                                      "1.69"                                          "3.4"                                 
"2015UKLondonGBP"         "1.2"                                      "1.69"                                          "3.4"                                 
"2016UKLondonGBP"         "1.13"                                     "1.6"                                           "3"                                   
"2017UKLondonGBP"         "1.3"                                      "1.4"                                           "3.4"                                 
"2018UKLondonGBP"         "1.3"                                      "1.3"                                           "4"                                   
"2019UKLondonGBP"         "1.38"                                     "1.39"                                          "4"                                   
"2020UKLondonGBP"         "1.41"                                     "1.44"                                          "3.68"                                
"2014IrelandDublinEUR"    "1.86"                                     "2.15"                                          "3.25"                                
"2015IrelandDublinEUR"    "1.81"                                     "1.88"                                          "3.25"                                
"2016IrelandDublinEUR"    "1.86"                                     "1.88"                                          "3.25"                                
"2017IrelandDublinEUR"    "1.86"                                     "1.88"                                          "3.25"                                
"2018IrelandDublinEUR"    "1.61"                                     "1.75"                                          "3.63"                                
"2019IrelandDublinEUR"    "1.86"                                     "1.88"                                          "3.29"                                
"2020IrelandDublinEUR"    "1.86"                                     "1.88"                                          "3.29"                                
                              
end

I only took from a handful of variable names "aa," "ab," and "ac," but I have many more than these as well.

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30096

10 Jan 2021, 17:33

OK. This is substantially more complicated than a simple -reshape-, though -reshape- is ultimately at the center of it. The first observation in your data set is not data, it is meta-data and needs to be taken out, but preserved for subsequent use.

I have to say that I cannot explain from your example why you are getting the particular error message you have gotten. Your variable combined does, indeed uniquely identify observations. And I have never known -reshape- to be wrong when it gives that error message. This tells me that somewhere in your data, you have some observations with duplicate values of combined. You can find those by running:

Code:

duplicates list combined

You should then investigate why those are there--they probably reflect errors in the data management that created your data in the first place. So you will need to review that and fix it. Alternatively, perhaps there is no reason that duplicate observations of combined should be excluded. My code below will work whether there are duplicates of combined or not. But I beg you not to use my code until you have first looked into the matter: don't just barge ahead because I have shown you code that eliminates the error message. Error messages are usually warnings that something is wrong, and sweeping them under the rug just means that trouble, often worse and harder to fix, will pop up again later. So investigate that before proceeding. Once you have done that:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str23 combined str40 aa str45 ab str36 ac
"YearCountryCityCurrency" "abs_priceWhite bread, 1 kg (supermarket)" "abs_priceWhite bread, 1 kg (mid-priced store)" "abs_priceButter, 500 g (supermarket)"
"2014UKManchesterGBP"     "1.75"                                     "1.75"                                          "3.4"                                
"2015UKManchesterGBP"     "1.38"                                     "1.6"                                           "3.58"                                
"2016UKManchesterGBP"     "1.3"                                      "1.94"                                          "3.4"                                
"2017UKManchesterGBP"     "1.3"                                      "2.13"                                          "3.3"                                
"2018UKManchesterGBP"     "1.38"                                     "2.03"                                          "4"                                  
"2019UKManchesterGBP"     "1.38"                                     "1.55"                                          "3.7"                                
"2020UKManchesterGBP"     "1.44"                                     "1.59"                                          "3.95"                                
"2014UKLondonGBP"         "1.3"                                      "1.69"                                          "3.4"                                
"2015UKLondonGBP"         "1.2"                                      "1.69"                                          "3.4"                                
"2016UKLondonGBP"         "1.13"                                     "1.6"                                           "3"                                  
"2017UKLondonGBP"         "1.3"                                      "1.4"                                           "3.4"                                
"2018UKLondonGBP"         "1.3"                                      "1.3"                                           "4"                                  
"2019UKLondonGBP"         "1.38"                                     "1.39"                                          "4"                                  
"2020UKLondonGBP"         "1.41"                                     "1.44"                                          "3.68"                                
"2014IrelandDublinEUR"    "1.86"                                     "2.15"                                          "3.25"                                
"2015IrelandDublinEUR"    "1.81"                                     "1.88"                                          "3.25"                                
"2016IrelandDublinEUR"    "1.86"                                     "1.88"                                          "3.25"                                
"2017IrelandDublinEUR"    "1.86"                                     "1.88"                                          "3.25"                                
"2018IrelandDublinEUR"    "1.61"                                     "1.75"                                          "3.63"                                
"2019IrelandDublinEUR"    "1.86"                                     "1.88"                                          "3.29"                                
"2020IrelandDublinEUR"    "1.86"                                     "1.88"                                          "3.29"                                
end

//  SAVE THE METADATA FROM OBSERVATION 1
local titles
foreach v of varlist aa-ac {
    local `v'_title = `v'[1]
    local titles `titles' `v'_title
    rename `v' price`v'
}

//  REMOVE ROW 1; IT IS NOT DATA
drop in 1

//  GIVE EACH OBSERVATION A UNIQUE IDENTIFIER
gen long obs_no = _n

//  GO LONG
reshape long price, i(obs_no) j(item) string

//  REPLACE THE PROXY NAMES AA-AC (ETC.) WITH WHAT THEY STAND FOR
levelsof item, local(items)
foreach i of local items {
    replace item = `"``i'_title'"' if item == `"`i'"'
}

//  MAKE PRICE A NUMERIC VARIABLE
destring price, replace

//  obs_no IS NO LONGER NEEDED
drop obs_no

Of course, where I refer to aa-ac, replace that with an expression that includes all of the variables in question.

Comment

Jim Callegari

Join Date: Jan 2021

Posts: 35
#5

11 Jan 2021, 16:08

Thank you so much Clyde! I first found the issue with duplicates (apparently, there were many blank spaces that did not have any data in them which registered as duplicate data). Your code did wonders! My remaining question is: how can I separate the variable "combined" into multiple variables without having a particular letter I want to parse? There are no spaces between the words, and there are varying year, country, city, and currency measurements. Please see what the current variable looks like below (I deleted some of the generated values because there are many, but these three show the differences that I'm talking about:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str23 combined "2014UKManchesterGBP" "2015UKLondonGBP" "2016IrelandDublinEUR" end

I am trying to create "year" "country" "city" and "currency" variables with either strings or I can encode them to be numbers if that would be easier--either way, do you possibly have any suggestions for this? Thank you so much, again--I really appreciate it.

Last edited by Jim Callegari; 11 Jan 2021, 16:11.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30096
#6

11 Jan 2021, 16:43

Yeah, that variable is a mess. I was hoping that you had created it yourself from its constituents and still had those available. I don't see a way to completely undo these. I can get you part of the way there:

Code:

gen year = substr(combined, 1, 4) gen currency = substr(combined, -3, 3) gen country_city = subinstr(combined, year, "", 1) replace country_city = substr(country_city, 1, strlen(country_city)-3) destring year, replace

Note: Assumes that all currencies are, as in your example, 3 characters.

But I do not see a way to pry the city and country apart from each other. There is nothing of constant string length to rely on, nor any parsing character. Even the location of upper case letters isn't a reliable clue.

What you might do for that is create a new data set consisting of only the variable country_city that my code above creates. Then drop the duplicates (-duplicates drop-). Now open that data set in the Data Editor and, by eye, create the country and city as separate variables. Save that crosswalk data file and -merge- it with the full data. That's tedious, but I can't think of anything better.

Last edited by Clyde Schechter; 11 Jan 2021, 16:45.
Comment
Jim Callegari

Join Date: Jan 2021

Posts: 35
#7

11 Jan 2021, 18:51

Thanks Clyde, that was great. I ended up solving it using this:

Code:

gen country="UK" replace country="Ireland" if country_city=="IrelandDublin" gen city="Manchester" replace city="London" if country_city=="UKLondon" replace city="Dublin" if country_city=="IrelandDublin" drop country_city

Thanks so much for your help getting it down to just those two that I had to fix up!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30096
#8

11 Jan 2021, 19:15

Oh, if I had realized that the entire data set, not just your example, was in either UK Manchester, UK London, or Ireland Dublin, I would have come up with that, too. I just assumed that there would be a plethora of combinations of country and city that would be impossible to systematically sort out without intelligence about what is a country name and what is a city name. Glad your problem is resolved.
Comment

Announcement