Designing a macro to import and manipulate data based on values of variable in an initial dataset

Samir Kelada

Join Date: Jun 2014

Posts: 11
#1

Designing a macro to import and manipulate data based on values of variable in an initial dataset

16 Oct 2014, 07:48

I would like to write a macro that would allow me to import and manipulate data based on values of variable in an initial dataset (but nothing I have tried so far works due to my ignorance of stata coding). As an example, for the following dataset:

fruit site locus

apples a 1

orange b 2

bananas c 3

I would like to import files based on values of the variable fruit , then generate variables (and manipulate them) based on the values of the variable locus, and finally save those files using filenames that depend on the values of variable site.

My two big stumbling blocks are (1) how to keep the values of the variable in the initial dataset in some type of memory that can be accessed later and (2) how to refer to the values of the variables in the macro.

I tried the commands below but the clear command in the foreach loop is obviously a death knell.
Code:

use data, clear levelsof fruit, local(levels) foreach rep of local levels { clear insheet using "`apples'.txt" gen locus_2=`locus'+1 save "`apples'_`site'.dta", replace }
It looks like levelsof could be a useful tool in this case, but I can't figure how to use that for three different variables at one time. Suggestions for how to deal with this would be much appreciated!
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

16 Oct 2014, 09:20

First, you appear to be using the term "macro" in the SAS sense of the word: in Stata, macro means something entirely different. A SAS "macro" is known in Stata as a "program."

The solution to your problem will indeed require the use of Stata macros! Let's call the data set you are starting out with above dataset1.

Code:

clear use dataset1 // STORE THE FRUIT/SITE/LOCUS INFORMATION IN LOCAL Stata MACROS forvalues j = 1/`=_N' { local fruit`j' = fruit[`j'] local site`j' = site[`j'] local locus`j' = locus[`j'] } // NOW USE THE INFORMATION TO PROCESS THE OTHER DATASETS forvalues j = 1/`=_N' { clear insheet using "`fruit`j''" gen locus_2 = `locus`j''+1 save "`fruit`j''_`site`j''", replace }

Last edited by Clyde Schechter; 16 Oct 2014, 09:21. Reason: Correct typo
Comment
Samir Kelada

Join Date: Jun 2014

Posts: 11
#3

16 Oct 2014, 12:52

Thanks for that very helpful input. The one thing I am still confused about is that the macro sets the values of each variable (e.g., fruit`j' ) to be the rank ordered position in the dataset (i.e., 1, 2 or 3) as opposed to the actual value of the variable, namely apples, orange, or bananas. So when the insheet command is used, instead of getting

Code:

insheet using "fruit.txt"

, I get

Code:

insheet using "1.txt"

So is there a way to specify the value as opposed to rank?

Last edited by Samir Kelada; 16 Oct 2014, 13:41. Reason: typos
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#4

16 Oct 2014, 13:07

How is your first dataset structured? I wrote the code assuming that fruit is a string variable. If it is in fact a numeric variable with value labels, then the code for that one must be changed to reflect that.

So run -describe fruit-. If the storage type is strX (for some number X) or strL, my code should have worked. So then I would ask that you run the entire code I sent earlier and post what you got by copying the output in the Results window into a code block.

If the storage type is not string, then we are working with a labeled numeric variable. In that case change -local fruit`j' = fruit[`j']- to:

Code:

local fruit`j': label (fruit) `=fruit[`j']'

You may also need to make similar adjustments to deal with your site variable if it, too, is actually a labeled numeric variable rather than a string. The locus variable appears on its face to be numeric, and the code should work without modification. Even if it is numbers stored as a string variable, it should still work. It will fail, however, if it is a numeric variable with value labels and the value labels differ from the underlying values. That would be an odd situation, but I have seen it on occasion. If that's what you've got, the fix for that variable is also the same.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#5

16 Oct 2014, 13:13

Clyde's code looks good to me. Also, I see no reason why you should expect it to produce your first line, nor why it should produce your second line.

Some larger points and a smaller one:

1. You have to copy code from experienced users precisely as they give it unless you know that their code contains a mistake.

2. You have to report code that you use precisely as entered into Stata. I guess some of the bizarre features in your code segments are just typos, such as

Code:

in sheet

rather than

Code:

insheet

but once it's clear that typos are present it becomes hard to work out what's a typo and what's a more serious mistake.

3. Clyde's code hints at a serious problem with your strategy. You want to put details of different datasets into a kind of master dataset and then read in those datasets one by one. Clyde's code shows that putting such details into a master dataset obliges you to take them out again, as the first time you will read in a new dataset you will overwrite the first dataset. It's arguably a limitation of Stata that it allows only one dataset at a time in memory, but the limitation implies that the whole strategy is self-defeating. Presumably your little example is a toy to explore principles and that's fine, but we need an idea of how many datasets you want to deal with to give good advice. It's a better approach just to put the filenames directly into a local macro, for example.

Smaller point: Please register with your full real name. See FAQ Advice Section 6 for why and how.
Comment
Samir Kelada

Join Date: Jun 2014

Posts: 11
#6

16 Oct 2014, 13:49

Thanks for both responses. To respond to Nick's question, indeed this is just a toy example so I can learn basic principles. For my actual dataset, the master dataset includes ~100 observations with 5 variables of interest (some string, some numeric) that I want to use to (1) read in 100 different datasets and for each (2) manipulate and graph that data. I agree that it's a clumsy approach, but if the capacity exists then I would like to use it because I don't know of another way to accomplish my goal.

ps- I put in a request to change my username.
Comment

fruit	site	locus
apples	a	1
orange	b	2
bananas	c	3

Announcement

Designing a macro to import and manipulate data based on values of variable in an initial dataset

Comment

Comment

Comment

Comment

Comment