Basic Recoding Issue

Dylan Connor

Join Date: Jul 2014

Posts: 3
#1


30 Jul 2014, 11:32

Dear Colleagues,

I am having an issue with recoding a datafile of approximately 4 million observations in batch mode. My issue is that I currently have 1.3 million recode lines in my do-file, which Stata will not accept. The do-file resembles:

-----

use "DF1"
replace var1 = x if var 2 == "Y"
replace var1 = x if var 2 == "Y1"
replace var1 = x if var 2 == "Y2"

etc.

save "DF2"

-----

I was wondering if somebody could suggest a way around this issue. It would be sincerely appreciated.

Many thanks,
Dylan
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#2

30 Jul 2014, 11:46

What does "batch mode" mean? Are you running Stata in a Unix terminal? If so, does it work with the Stata GUI?

Do you mean you have 1.3 Million lines of code, all starting with replace? (That doesn't sound right.)

What is "x", a variable or a string?

What is your problem exactly? Please post the exact error given by Stata.

See the FAQ for advice on posting questions.

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Sarah Edgington

Join Date: Apr 2014

Posts: 284
#3

30 Jul 2014, 11:48

You'll likely get a more helpful answer if you explain what you mean by "which Stata will not accept." What error message are you getting?

My guess is that you want:

Code:

replace var1="x" if var2=="Y"

Currently you're instructing Stata to look for a variable x and replace var1 with the value of that variable if the if condition is met. My guess is that maybe you actually want to set var1 to the string value "x" instead. If that's the case, you need the quotation marks.
Regardless of what value you want var1 to take, the space in the variable name var 2 makes it not a legal Stata variable name and will cause errors.

Please see the FAQ for more advice on how to write questions that are the most likely to get helpful answers. Also note our strong preference for full names. You can contact the forum administrators to request a change in your user name by using the contact us button at the bottom right of the screen.
ben earnhart

Join Date: May 2014

Posts: 1027
#4

30 Jul 2014, 11:49

To make sure we're on the same page, do you have 1.3 million unique values for var2? Also, based on what you show of your code, var1 is always taking the value of x if var2 begins with Y?

If this really is the case, then it might be as simple as:

Code:

replace var1=x if upper(substr(trim(var2),1,1))=="Y"
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#5

30 Jul 2014, 12:22

Also, if you really have 1.3 million replace commands, I cannot imagine you possibly generated them by hand. There must have been some pattern that enabled you to generate them. Tell us what the pattern is, and probably we can reduce it to a small number of replace commands (maybe even just one) inside one or more loops.
Dylan Connor

Join Date: Jul 2014

Posts: 3
#6

30 Jul 2014, 14:06

Hi,

Sincere apologies for not framing my question correctly. In answer to the questions above:
- I am using a Unix terminal but I believe the error is identical in the GUI. The error is: 'system limit exceeded - see manual, r(1000);'
- I do indeed have 1.3 million lines of code, each beginning with "replace".
- I read in the manual or on a help forum that Stata cannot process do-files of this length.

I am recoding a string variable, which may contain over a million unique values, into a second variable which will have approximately 400 values. Hence the many lines of code. I thought I could perhaps write macros for each new code, but I do not think macros can hold that many characters in Stata.

A real line of code looks like:
replace occupation_code = 113 if occupation_string == "deep sea fisherman"
replace occupation_code = 114 if occupation_string == "construction worker"

In terms of how I generated the pattern:
I have a file which contains the string variable which has 1.3 million unique observations. I also have a coding scheme (400 codes) for those 1.3 million unique observations. I wrote the scripts using concatenation between the two values.
I'm not sure if that makes things any clearer?

Many thanks,
Dylan

Last edited by Dylan Connor; 30 Jul 2014, 14:09.
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#7

30 Jul 2014, 14:17

Well, I can't say I'm surprised that your setup is choking on a 1.3 million line do-file.

I don't quite know what you mean when you say you "have a coding scheme (400 codes) for those 1.3 million unique observations." What is this coding scheme? Is it an algorithm? If so, that algorithm can be translated into Stata code and the result should be far shorter than 1.3 million lines.

If you mean you have a data set that crosswalks the occupation_string to the occupation_code, then what you probably want to do is merge your data set with that crosswalk data set, something like:

Code:

merge m:1 occupation_string using crosswalk_data_set, update replace

and that will fill in the values of occupation_code with the corresponding values from the crosswalk data set.
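As a minimal sketch of how that merge behaves, assuming the crosswalk has exactly one row per occupation_string: the toy data below, the temporary file name and the "lighthouse keeper" entry are all made up for illustration and are not from the original poster's files.

```stata
* Hypothetical crosswalk: one row per unique occupation_string.
clear
input str30 occupation_string int occupation_code
"deep sea fisherman" 113
"construction worker" 114
end
tempfile crosswalk
save `crosswalk'

* Hypothetical main data set with occupation_code not yet filled in.
clear
input str30 occupation_string
"deep sea fisherman"
"construction worker"
"deep sea fisherman"
"lighthouse keeper"
end
generate int occupation_code = .

* update fills in missing occupation_code values from the crosswalk;
* replace additionally overwrites nonmissing values that disagree.
merge m:1 occupation_string using `crosswalk', update replace

* Strings with no crosswalk entry remain, flagged by _merge == 1.
list occupation_string occupation_code _merge
```

Tabulating _merge afterwards is a quick way to see how many strings had no entry in the crosswalk.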

Hope this helps. I'm still not sure I understand your setup.
ben earnhart

Join Date: May 2014

Posts: 1027
#8

30 Jul 2014, 14:33

Clyde's merge approach is almost certainly the way to go. To simplify things, you may want to pre-process your two files with trim() and upper() if you suspect leading or trailing spaces, or if capitalization varies.
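A rough sketch of that pre-processing, assuming the same normalization is applied to the key in both files before merging. The toy rows, the occupation_raw variable and the temporary file name are hypothetical.

```stata
* Hypothetical crosswalk whose key has stray spaces and mixed case.
clear
input str30 occupation_string int occupation_code
"Deep Sea Fisherman " 113
end
replace occupation_string = upper(trim(occupation_string))
* m:1 needs a unique key; isid stops with an error if normalizing
* collapsed two crosswalk rows onto the same string.
isid occupation_string
tempfile crosswalk
save `crosswalk'

* Hypothetical main data with inconsistent spacing and capitalization.
clear
input str30 occupation_string
"  Deep Sea Fisherman"
"deep sea fisherman"
end
* Keep the original text, then normalize the merge key the same way.
generate str30 occupation_raw = occupation_string
replace occupation_string = upper(trim(occupation_string))

merge m:1 occupation_string using `crosswalk', keep(master match) nogenerate
list occupation_raw occupation_string occupation_code
```

Keeping a copy of the raw string makes it possible to check later which original spellings were folded into each code.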
Dylan Connor

Join Date: Jul 2014

Posts: 3
#9

31 Jul 2014, 07:23

Thank you all for your suggestions. Clyde and Ben, those suggestions were really helpful and the merge was a success. Your help is really appreciated and saved me much time.