Reshape long to wide multiple variables renaming

Elena Draghici

Join Date: Jan 2021

Posts: 43
#16

28 Feb 2023, 10:31

Thank you so much for the reply, Clyde! I should have clarified that each year is a stand alone datafile. So the loop would go into something like:

Code:

cd "C:\Users\Elena\Desktop\New folder" local files: dir "." files "*dta" tempfile combined clear save `combined', emptyok foreach f of local files { use `"`f'"', clear ds *2016 //except here would account for the other possible years local stubs `r(varlist)' local stubs: subinstr local stubs "2016" "", all reshape long `stubs', i(person_id) j(year) destring `stubs' , replace gen source_file = `"`f'"' append using `combined' save `"`combined'"', replace } save all_years_data, replace

The goal is to go into each year of data, reshape it and create a year variable, then append them all into one panel dataset. Even doing it one by one with the code your provided and changing the year is helpful. but if there is a straightforward way to get above loop to run to come up with that goal, that would be appreciated.

Last edited by Elena Draghici; 28 Feb 2023, 10:58.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#17

28 Feb 2023, 11:33

I think the code you show in #16 is the best way to do it in principle. The question is what to do with that -ds *2016- command, because that is going to change from file to file. Here's one way to do it:

Code:

cd "C:\Users\Elena\Desktop\New folder" local files: dir "." files "*dta" tempfile combined clear save `combined', emptyok foreach f of local files { use `"`f'"', clear ds taxes???? local year = substr("`r(varlist)'", -4, 4) ds *`year' local stubs `r(varlist)' local stubs: subinstr local stubs "`year'" "", all reshape long `stubs', i(person_id) j(year) destring `stubs' , replace gen source_file = `"`f'"' append using `combined' save `"`combined'"', replace } save all_years_data, replace

The assumption here is that each of the files will have a variable called taxes#### (where #### is the four-digit year). So we extract the final four characters of that variable name to get the year in a local macro named year. Then after that the code is the same as before but with all references to 2016 replaced by `year'.

Last edited by Clyde Schechter; 28 Feb 2023, 11:41.
Comment
Elena Draghici

Join Date: Jan 2021

Posts: 43
#18

28 Feb 2023, 14:01

Thank you so much! Unfortunately, the issue is that all the variables vary in length at the stub, but all have the year at the end. for the ds taxes part:

Code:

ds taxes????

I tried removing it, replacing with

Code:

ds *

and

Code:

ds

. All led to the same error: invalid 'i' (r198).
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#19

28 Feb 2023, 14:16

Show me the output you got from -ds taxes????-. Based on the example data you showed, I was expecting it to be just taxes2016 in the 2016 file, taxes2017 in the 2017 file, etc. No other variables were expected. But perhaps there is something else going on that is not reflected in the examples you gave.

Also, I see that the code block in #17 got messed up a little:

Code:

local stubs `r(varlist)' local stubs: subinstr local stubs "`year'" "", all

should be two separate lines

Code:

local stubs `r(varlist)' local stubs: subinstr local stubs "`year'" "", all

Last edited by Clyde Schechter; 28 Feb 2023, 14:19.
Comment

Elena Draghici

Join Date: Jan 2021
Posts: 43

#20

28 Feb 2023, 14:36

Apologies, I was not clear. What I tried to convey was that the variable names are all different lengths. So in the 2016 file, for example, there are variables named:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str9(A B) str12 C str15 D str16 E str7 F str11 G
"person_id" "taxes2016" "fedtaxes2016" "childcredit2016" "famtaxcredit2016" "age2016" "marital2016"
"1"         "3565"      "4985"         "5687"            "8798"             "65"      "1"          
"2"         "2049"      "7453"         "6989"            "2566"             "25"      "2"          
"3"         "1649"      "6366"         "629"             "4875"             "19"      "2"          
"4"         "1000"      "200"          "300"             "203"              "45"      "1"          
"5"         "500"       "100"          "789"             "100"              "70"      "6"          
end

So they are indeed all different lengths. The only difference in the other files is that the year is different, the stubs are the same.

Now that I have updated the code as per your instruction:

Code:

cd “path”
local files: dir "." files "*dta"
tempfile combined
clear
save `combined', emptyok
foreach f of local files {
    use `"`f'"', clear
              ds taxes??
     local year = substr("`r(varlist)'", -4, 4)
    ds *`year'
    local stubs `r(varlist)'
              local stubs: subinstr local stubs "`year'" "", all 
    reshape long `stubs', i(person_id) j(year) destring `stubs' , replace 
    gen source_file = `"`f'"'
    append using `combined'
    save `"`combined'"', replace
}
save all_years_data, replace

I got -variable taxes???? not found- r(111).

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#21

28 Feb 2023, 15:56

Oh, now I see what went wrong with your initial example: you had edited out the variable names. Your variable names are not taxes2016, etc. They are A, B, C, ... That's the problem. It looks like these data sets were imported from spreadsheets, and in the importation, the names in the first row were just left as data, rather than being used to name the variables. It also explains why all of your numeric variables are actually strings. You need to rename those variables to the values shown in the first observation. So:

Code:

cd "path" local files: dir "." files "*dta" tempfile combined clear save `combined', emptyok foreach f of local files { use `"`f'"', clear foreach v of varlist _all { rename `v' `=`v'[1]' } drop in 1 ds taxes???? local year = substr("`r(varlist)'", -4, 4) ds *`year' local stubs `r(varlist)' local stubs: subinstr local stubs "`year'" "", all reshape long `stubs', i(person_id) j(year) destring `stubs' , replace gen source_file = `"`f'"' append using `combined' save `"`combined'"', replace } save all_years_data, replace

By the way, make sure you write -ds taxes????- with four, and only four ?'s. (In the code in #20 you have only 2.) That's absolutely crucial.
Comment
Elena Draghici

Join Date: Jan 2021

Posts: 43
#22

01 Mar 2023, 06:26

Hi Clyde, Thank you so much for all your help! I tried that code and received an error stating "1 new variable name invalid you attempted to rename PID_I2 to 470667. That is an invalid Stata Variable name. r(198)." PID is the person_id variable, and it does not have a stub.

I realize that preparing an example of the variable names in excel was perhaps not the best approach and I see how that can cause confusion and issues. I apologize. I cannot show you my actual data due to confidentiality reasons, but I think maybe this example will be more clear. This is based on real data that I attached the year stub to, which mimics my data better, except my real data are mostly string variables.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input byte(lfsstat2016 prov2016 cma2016 age2016 sex2016 marstat2016 taxcredit2016 educ902016 mjh2016) 6 11 4 10 . 1 1 . 5 . 1 24 4 8 . 2 6 . 4 1 6 35 4 4 . 1 2 . 1 . 6 59 4 5 . 1 1 . 4 . 6 24 4 1 1 1 6 . 1 . 1 46 4 10 . 2 1 . 4 2 6 35 4 11 . 1 1 . 0 . 1 59 3 6 . 2 1 . 2 1 1 48 4 3 6 1 6 . 4 1 6 35 4 4 . 2 1 . 1 . end

I will be sure to more explicit next time. Thank you again!

Also, I am curious what the line "-ds taxes????- does?

Last edited by Elena Draghici; 01 Mar 2023, 06:32.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#23

01 Mar 2023, 09:59

First, if some of your data files have string variables and others have numeric variables, then you cannot combine them until you harmonize that across all of the data sets. Given the nature of your variables, making them all numeric (except for the ID variable) makes the most sense. But the point, then, is that it is not sensible to, in a single operation, import and combine the data sets. It will be more effective to do one loop that imports each spreadsheet and saves it in a separate file, and then run the -precombine- command (available from SSC) to identify the inconsistencies and incompatibilities among the data sets. Then fix all of those incompatibilities in each data set as needed. Finally combine the fixed data sets in a loop that -appends- them all together.

We are still wrestling with the problem of finding a way to extract the year from the variable names in a way that is consistent across the files. I had assumed that there is always a variable named taxes2016, or taxes2017, or taxeswhatever_the_year_is in each data set. But your new example suggests this is not the case. Is there any variable that appears in every one of the data sets with the same name, except for ending in a different year? Let's assume there is one. For the purposes of demonstration with your current example data, I'll assume that age#### is such a variable. I also assume that each data set has a person_id variable. Then the code that will work for importing all of the data sets (but saving them separately as I suggested in the preceding paragraph) looks like:

Code:

cd "path" local files: dir "." files "*dta" tempfile combined clear save `combined', emptyok foreach f of local files { ds age???? local year = substr("`r(varlist)'", -4, 4) ds *`year' local stubs `r(varlist)' local stubs: subinstr local stubs "`year'" "", all reshape long `stubs', i(person_id) j(year) save data_`year', replace }

This will read each file, identify the year of the data in the file by extracting it from the age#### variable, -reshape- the data to long layout, and then save the resulting long data in a file named data_#### (where #### refers to the four-digit year). Note that there is no longer a -destring- command, since only some of the files will require this. -precombine- will help you by pointing out which files have string variables and which ones have numeric variables under the same names.

Also, I am curious what the line "-ds taxes????- does?

The -ds- command creates a list of variables that match the pattern(s) specified in the command. The pattern taxes???? will match any variable name that begins with taxes and is immediately followed by exactly four characters. So taxes2016 will match. So will taxes2018, or taxesany_four_digit_number. So, will taxesAb_C if there is such a variable in the data set. The list of matching variable names is displayed in the Results window and is also temporarily saved in -r(varlist)-.

Since it seems that not all file contain any variable that will match the taxes???? pattern, I have changed it in the new code to -ds age????- on the hope that each file will contain an age variable named agewhatever_the_year_is. If that's not the case, replace age by whatever variable actually will appear in every file. The following line in the code extracts the year as the final four characters of that variable name. The next line -ds *`year'- then creates a list of all variables whose name ends in that year. We then copy that list from -r(varlist)- to local macro stubs. The next command removes the year from those variable names so that only the stubs (lfsstat prov cma age sex...) remain in the local macro stubs. From there it's just a reshape and a save.

I cannot show you my actual data due to confidentiality reasons,

Understood. That is not a problem in your situation: the actual values of the variables are irrelevant to your question. But it is crucial that the examples you show accurately reflect the organization (variable names, data storage types, etc.) of your Stata data set. The simplest way to do that is to -use- your Stata data set. Then drop all but, say, the first 25 observations. Then open the Data Editor and change the actual numbers in the data. Just type nonsense numbers over them. (Or, if you want to be more sophisticated write a short program to replace the actual data with random numbers.). Then run -dataex-. You will not have disclosed any real data, but the organization of the data set will not have been modified. Code that works on the -dataex- example should then work with the real data set.

There sometimes are questions people ask where the actual data values matter: these are usually problems involving a lot of numerical computation. Sometimes it isn't possible to provide help with those questions without seeing the real data. But this is not your present situation at all.
Comment
Elena Draghici

Join Date: Jan 2021

Posts: 43
#24

01 Mar 2023, 10:28

Thank you Clyde, that was most helpful! I tried the code, and there is a bug: Stata keeps saying "AGE__X????" not found, which is the exact naming convention of the variable. I typed -ds AGE__X????- in one of the files on its own and it returned the correct variable. I wonder why the program is not reading -ds AGE__X????- within the actual loop?

Also, couldn't I use the append, force option? Or is that not recommended?

And thank you for telling me about precombine, I will download!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#25

01 Mar 2023, 10:51

You need to send me -dataex- output for the actual file that is causing the problem and I will try to figure it out. I can't do it in a vacuum.

Added: Wait, I see a problem. In that loop, I omitted the statement that opens the data sets. So after the first iteration, the code is trying to -ds AGE_????- on the reshaped data from the previous file, which is still lingering in the data. Sorry. It should be:

Code:

cd "path" local files: dir "." files "*dta" tempfile combined clear save `combined', emptyok foreach f of local files { use `"`f'"', clear ds AGE__X???? local year = substr("`r(varlist)'", -4, 4) ds *`year' local stubs `r(varlist)' local stubs: subinstr local stubs "`year'" "", all reshape long `stubs', i(person_id) j(year) save data_`year', replace }

If that doesn't fix the problem, please send -dataex- output for the file that is exhibiting the problem.

Continuation of original post:

Also, couldn't I use the append, force option? Or is that not recommended?

It is most definitely NOT recommended. It will not reconcile the inconsistency between the data sets. Instead it will drop the incompatible data. So you will lose data and end up with lots of missing data.

-force- options are, in general, dangerous and should only be used in circumstances where you clearly understand the damage they do to your data and are 1000% certain that this damage makes no difference for your purposes. (E.g. the incompatible variables are just ones will never be used in your analyses anyway. Even then, it is usually much better to just -drop- those variables from the data sets before appending them and avoid -force-.) Such situations are relatively uncommon in the first place, and it is usually quite simple to avoid them by first clearing up the problems that are leading you to want to use -force-. I would say that if you use Stata in your work on a daily basis and you find yourself writing commands with -force- more than one or two times a year, you are probably ending up with mangled data sets and unwittingly doing analyses with garbage data. And you know what garbage in leads to.

Last edited by Clyde Schechter; 01 Mar 2023, 10:58.
Comment

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment