how to convert multi-variable stacked long format dataset into multi-variable wide format ?

Raymon Lucas

Join Date: Jul 2021

Posts: 25
#1

how to convert multi-variable stacked long format dataset into multi-variable wide format ?

27 Sep 2022, 08:42

Hey colleagues,

I encountered a setback in converting the attached long format dataset (the source) into the wide format (the target). It is primarily because in the long format the target variable "seriesname" contains two prospective variables: tpop and upop.

My logfiles are as follows:

encode seriesname, gen(seriesname2) /*convert seriesname into a numeric variable*/
reshape wide yr, i(country) j(seriesname2)

However, Stata responds with the folloiwng: "values of variable seriesname2 not unique within country."

I have no idea how to solve this problem and make it to convert the dataset. Is it that "reshape" is not the right instrument for my purpose?

Looking forward to replies!

I greatly appreciate your help!

Sincerely
Raymong Lucas

Attached Files

long format.dta (8.0 KB, 1 view)

wide format.dta (5.8 KB, 1 view)
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#2

27 Sep 2022, 09:42

Please note that attachments are discouraged in this Forum. Many members will not risk downloading attachments from strangers--I am among them. So I have not looked at your data. The preferred way to show example data is by using the -dataex- command. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

As I have not seen your data, the advice below is generic with regard to this very common type of problem encountered with -reshape-. It may be that your situation is different enough that something else is going on, however. If this approach does not work, please post back showing example data using -dataex- so that proper troubleshooting can be done.

The error message that Stata gave you says that there are some instances of the same country having two or more observations with the same value of seriesname2. It is not possible to -reshape- the data to wide in a situation like that. Suppose, for example that country "USA" has two different observations both with seriesname2 = 1. In the wide data set, from which of those observations should Stata pick the value of yr to put into the new variable yr1 that you want to create?

Either your data set is incorrect, or you do not fully understand your data set. The first step to solving this problem is to identify the offending observations:

Code:

duplicates tag country seriesname2, gen(flag) browse if flag

Examine these offending observations carefully. If the surplus observations are exact duplicates on all variables, you will be able to proceed by simply running -duplicates drop-. But I don't recommend doing that. With correct data management, one does not normally end up with exact duplicate observations in a data set. So probably something was done wrong in the process of creating this data set, and where one mistake has been made, it is likely that others have as well. So a full review of how this data set was created is in order, and any errors found should be fixed.

If the surplus observations are not exact duplicates then there are two possible situations:
The observations differ on some other variable(s) that, together with country and seriesname2, do uniquely identify observations in the data set in some natural way. For example, there might be a variable that specifies regions within countries, and each of the offending observations is from a different region. If this is the case, you have only to add the additional variable(s) to the -i()- option of -reshape- and all will be well.

The other variables on which the observations differ are not sub-identifiers but are actual data (worst case scenario, they are values of the variable yr.) This means that your data are inconsistent. There is nothing useful that can be done with a self-contradictory data set. So you must again review how this data set got created, find out how this came about, and fix it. You must find out which of the contradictory observations, if any, are correct, and revise the data management so you retain only the correct ones. If that is not possible, you may need to find some other way to reconcile the contradictions, such as averaging their values or taking the minimum, or the first, or whatever, and using only that.

Last edited by Clyde Schechter; 27 Sep 2022, 09:46.
Comment

Announcement

how to convert multi-variable stacked long format dataset into multi-variable wide format ?

Comment