Loop to open, keep variables, and join different files

Simon Wimbly

Join Date: Dec 2018

Posts: 3
#1

Loop to open, keep variables, and join different files

18 Dec 2018, 16:22

Hello,

I am struggling with the following task that already kept me bugging for hours.

This is the process step by step:

1. Import individual file (names: file2002, file2004, file2006, file2008, file2010)
2. Keep only the variables: id aa001_* aa004_* ba016_* ca001_* ea104_* eb001_* eb002_* ec023_* eh041_*. * should be replaced with 2002, 2004, 2006, 2008, 2010.
3. Join files on the first file using the id variable as key

Having spent more than 5 hours now, I didn't get far. The main problem involves the joining of multiple files. Is there a possibility to perform this task using simple code? I can imagine that there's are fairly frequently performed operations.

Thank you very much.

Sincerely,
Simon
Tags: None
Mike Lacy

Join Date: Apr 2014

Posts: 2417
#2

18 Dec 2018, 17:29

The command you want is -merge- to do your "join."

There's a lot of other stuff to do here, too, such as importing each file (from a csv, an Excel file or what--you didn't say), and possibly a fancy use of the rename command. (See -help rename group-, which can be tricky) Your reference to a renaming with 2002 etc. won't work, since you can't have numbers as variable names in Stata. (Perhaps I misunderstand what you want there.) You'll need a loop, since the approach here will involve a loop to import each of your files and save it, and then and then another loop to merge them onto a master file.

However: We'd need a more detailed description of your data, with actual file names, actual variable names, and so forth, in order for some member here to help you. What you want to do is a common programming problem, and in fact has appeared before on this list in one form or another. What you want will likely require about 5 minutes programming work and will result in a 10 line file, but there's not enough information here to write code that will be very helpful to you, I don't think. (I could sketch a high level algorithm, but my sense is you need something more finished than that.)
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2417
#3

18 Dec 2018, 17:34

P. S. The post on another topic by William Lisowski refreshed my memory that there's a community-contributed command named -xls2dta- that might be relevant to automating your problem, presuming the files to be imported are in Excel format. See -ssc describe xls2dta-
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

18 Dec 2018, 17:40

Welcome to Statalist, Simon.

Your description of your data is unclear: are the variable names the same in each file (aa001) and you hope to rename them to include the year of the file from which they came (aa001_2002)? Or does each file already have a different set of variable names that already incorporates the year (file2002 contains aa001_2002)? I'm going to assume the former.

The approach I would take to do what you describe is a little different than you might imagine.

Code:

clear save file_all, replace emptyok forvalues y = 2002(2)2010 { use file`y', clear keep id aa001 aa004 ba016 ca001 ea104 eb001 eb002 ec023 eh041 generate int year = `y' append using `file_all' save file_all, replace }

At this point, file_all contains between 1 and 5 observations for each id, one for each of the five years in which the id appears. This is known as having your data in a long layout. What you described as your objective is a wide layout. Your data can be changed from a long layout to a wide layout, with the year incorporated into the variable names, by using the reshape command. But I'm not going to get into the details of that. I'm going to recommend you not do that.

The experienced users here generally agree that, with few exceptions, Stata makes it much more straightforward to accomplish complex analyses using a long layout of your data rather than a wide layout of the same data. You should try to achieve what you need with the data in a log layout, and seek the help of Statalist in doing so. The sort of problems you will encounter trying to use your reshaped data will almost certainly be solved by reshaping the data. It is much easier, for example, to compare the second observation to the first, the third to the second, and so on, than it is to compare the second variable to the first, the third to the second, etc.

Added in edit: Mike's comment in post #3 makes me realize that I assumed your files are Stata datasets. If not, xls2data is a good way of getting them converted before running the code in this post. The verb "import" suggests this may not be the case. Mike is right, more details are needed. But my goal, actually, wasn't so much to help with the import, as to strongly recommend the long layout over the wide layout.

Last edited by William Lisowski; 18 Dec 2018, 17:43.
1 like
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2417
#5

18 Dec 2018, 19:14

Oh, yes, I rigidly mistook your use of "join" as a synonym for "merge." You may very well be in -append- territory, per William's suggestion, if you have the same variables for different years. In that case, long format would definitely be better.
Comment
Simon Wimbly

Join Date: Dec 2018

Posts: 3
#6

19 Dec 2018, 01:46

Hello,

Thank you all for your helpful comments. Although I was able to make some progress, I still get stuck. This is the code I currently have:

Code:

global DATASET "file_2004_2014" save $DATASET, replace emptyok forvalues year = 2004(2)2014 { local file `"file `year'"' use "`file'", clear keep id aa001_* aa004_* ba016_* ca001_* ea104_* eb001_* eb002_* ec023_* eh041_* merge 1:1 id using $DATASET, generate("Merged_`year'") save $DATASET, replace }

Since I didn't provide I detailed explanation of the datasets before, I will do so now.

- I have the datasets: "file 2004", "file 2006", "file 2008", "file 2010", "file 2012", "file "2014".
- All files have the variable id which includes the unique person id. Not all participants may be in each file. All rows should however be retained.
- Each file has the variables: aa001_* aa004_* ba016_* ca001_* ea104_* eb001_* eb002_* ec023_* eh041_*. * represents the year of the file. For example, datafile 2004 has variables named using the postfix _04, while the datafile 2006 has a postfix of _06 for all variables, except for id (this is id in very file)
- I want to merge these datasets on the id (i think a full outer join?) such that all records are retained.
- The merged datasets are saved as "file_2004_2014". In this file, the variable id, including aa001_* aa004_* ba016_* ca001_* ea104_* eb001_* eb002_* ec023_* eh041_* for datafiles 2004, 2006, 2008, 2010, 2012, and 2014 are stored.

When running the code above, however, I get the error "No variables defined".

It seems that the first time the forvalues loop is iterated, I am trying to merge the dataset 2004 with an empty dataset (file_2004_2014). This appears to give the error. How can I overcome this?

Last edited by Simon Wimbly; 19 Dec 2018, 01:50. Reason: Found the cause of the error, see last sentence.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#7

19 Dec 2018, 02:23

Stata is right: You didn't define any variables. The idea of variables being defined but there being no observations may seem a little weird, but it is just like a table with headers but no entries yet, and is what is needed for your approach..

But I wouldn't do it that way at all. As has been explained, you need a long layout (some say structure or format) and thus to append. You might as well do it straight away and fix the variables before you do it.

Necessarily this is untested.

Code:

forvalues year = 2004(2)2014 { use "file `year'", clear keep id aa001_* aa004_* ba016_* ca001_* ea104_* eb001_* eb002_* ec023_* eh041_* local suffix : di %02.0f mod(`year', 100) rename (*_`suffix') (*) gen year = `year' save "`file' `year'_2", replace } append using "file 2004_2" "file 2006_2" "file 2008_2" "file 2010_2" "file 2012_2" save "file_2004_2014"

Note that at the end of the loop, the 2014 observations are in memory, so you just need to append the previous years' data.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#8

19 Dec 2018, 03:49

Code:

save "`file' `year'_2", replace

should be

Code:

save "file `year'_2", replace

Sorry about that.
Comment
Simon Wimbly

Join Date: Dec 2018

Posts: 3
#9

19 Dec 2018, 04:29

Hello Nick,
Thank you for your support. I appreciate this very much.
When appending data, how can the variables from multiple files be added while the data is joined on the variable id, that is the unique identifier which needs to be used to add these data to the corresponding respondents?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#10

19 Dec 2018, 05:08

I don't understand the question. Suppose the file for 2004 contains id 1 and 2 and that for 2006 contains id 2 and 3. Then you should end with observations like this:

Code:

id year 1 2004 2 2004 2 2006 3 2006

That is not a problem except that it's an unbalanced panel, but it would be an unbalanced panel with any other data structure.

I guess that you are thinking of append as if it were like merge. It's not like merge at all. There is not, and cannot be, matching of identifiers in the append process. There is no checking of what the values are. What needs to match are the variable names, but different observations having different identifiers is what you expect and what you should get. Naturally if the identifiers are in error, you have a problem; but you would have the same problem with merge.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#11

19 Dec 2018, 05:49

Nick Cox - It is likely that what Simon Wimbly requested in post #1 does indeed require merge, as Mike Lacy suggested in post #2.

In post #4 I gave instructions for using append because the weak description of the data in post #1 led me to believe Simon's problem was that the variables had the same name in every dataset, and because, as I argued in the two paragraphs following my code, merging the data into a single observation per id is not appropriate for most purposes in Stata.

Simon, you apparently have longitudinal data - each id is observed at multiple points in time. Stata has an extensive collection of commands specifically for analyzing longitudinal data, which you can learn about in the Stata Longitudinal-Data/Panel-Data Reference Manual PDF including in your Stata installation and accessible through Stata's Help menu. You should review that material and see if it does indeed describe what you are doing. You will see that right off the bat, it tells you that data organized as a single observation per id needs to be reshaped to a long layout.

Alternatively, explain to us what it is that you need to do that requires an Excel-like organization of your data, which as I say in post 4 usually makes your work more difficult in Stata.

There is limited interest on Statalist for answering questions that seem to be going in the wrong direction - it usually means another round of questions to be answered to get the analysis back on the right track. Sometimes the trick to answering a question effectively is to first make sure the right question is being asked. Apologies if this seems unhelpful, but showing you the shortest direction to the foot of the mountain by sending you over a cliff would not be helpful, either.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#12

19 Dec 2018, 06:19

Well, three experienced users of Stata can't understand or agree exactly what the problem is -- and we have yet to see what the data files look like. All we need is a simplified example with real(istic) variable names.Otherwise this is too much of a guessing game.
Comment

Announcement

Loop to open, keep variables, and join different files

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment