Matching data after input command

Benedikt Walker

Join Date: Nov 2017

Posts: 18
#1

Matching data after input command

06 Apr 2018, 07:00

Dear Statalist,

I am working with a set of panel data which I, due to data protection, am only able to access via e-mail, sending code to the institution providing me with the data. (It`s the German socio-economic panel, I`m using their soep-remote system in case anyone reading this worked with it before or will in the future) Therefore, I send my do-files as text in an e-mail, the institution runs it through and I receive the log-file in another e-mail.
This process bears some limitations one of them making it impossible to merge data from my own computer with their data. The only way the system allows me to work with external data is by sending it within an e-mail, using the input command. My problem is that I cannot think of an easy solution working with the external data after inputting it with the input command. Here`s what I`m trying to do:

The panel data of the institution looks like this:

Code:

clear input long(id region income syear) 1 1001 4634 2000 1 1001 5380 2001 1 1001 3981 2002 1 1001 4574 2003 2 1001 3181 2000 2 1001 3430 2001 2 1001 2803 2002 2 1001 2349 2003 3 1001 2824 2001 4 1002 2041 2000 4 1002 3100 2001 4 1001 3300 2002 5 1057 6000 2000 end

Where id= PersonID, Region=Region Identifier that person is living in, syear= survey year

(This is a fictious example, sorry for not being able to post the original data. The reason is that I can neither use the list or the dataex command for the same reason of data protection)

My additional data is structured as follows:

Code:

input long (region2 expend syear2) 1001 9000 2000 1001 8000 2001 1001 9000 2002 1001 10000 2003 1002 7000 2000 1002 200 2001 1002 5600 2002 1002 900 2003 1006 10000 2000 1006 2000 2001 end

region2= same region ID, expend= government expenditures within region

I`m searching for a way to create a variable that provides the information about the government expenditures for the region the individuals are living in,
e.g. in order to calculate regressions on the relationship between government expenditures and income.

I tried to generate a new variable and replace it, but that`s not the right approach, as a lot of information is being lost this way.

Code:

gen expend2=. replace expend2==expend if syear==syear2 & region=region2

I guess I would need a code (probably a loop?) that runs through the data and checks the region and year for all ids and then replaces expend2 with the fitting expenditures per region2/syear2. But I have no clue, maybe there might be a much easier solution?
If anyone can think of a solution I would be really grateful!
Thank you in advance! Any further information that might be required I will post asap.

Thanks a lot and have a nice day!

Benedikt
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30065

06 Apr 2018, 08:35

I don't understand the problem. Your expenditure data is already well-organized and compatible with the data in the institutional data for merging. All you need to do is rename the variables so they are the same in both data sets. Why doesn't this work?

Code:

clear
input long (region2 expend syear2)
1001 9000 2000
1001 8000 2001
1001 9000 2002
1001 10000 2003
1002 7000 2000
1002 200 2001
1002 5600 2002
1002 900 2003
1006 10000 2000
1006 2000 2001
end
rename region2 region
rename syear2 syear
tempfile expenditures
save `expenditures'

use institutional_data, clear
merge 1:1 region syear using `expenditures'

Comment

Benedikt Walker

Join Date: Nov 2017

Posts: 18
#3

06 Apr 2018, 09:17

Dear Clyde,

because due to the limitations the e-mail sending process entails, I can neither save the data or merge it afterwards. All the commands that access their memory are deactivated. I`ll try to explain the process again: I write a do-file and copy its content into an e-mail. I send the e-mail to the institution that provides me with the panel data. They run the e-mail content automatically through their system and I receive the log-file in another e-mail. I guess in order for users to not mess up their data, commands like save, etc. aren`t working in this "special" Stata version. (I don`t know how this exactly works technically)

Therefore, I cannot use:

merge
_getfilename
cd
copy
cscript
dir
do
file
infile
list
log
macro
mkdir
net
print
set defaults
shell
ssc
sysuse
webuse
type

In fact my struggle is to work around the merge command. All other commands (except for user written programs that one would have to install) are working as usually.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30065

06 Apr 2018, 09:58

I see. I think this will work.

Code:

clear
input long(id region income syear)
1 1001 4634 2000
1 1001 5380 2001
1 1001 3981 2002
1 1001 4574 2003
2 1001 3181 2000
2 1001 3430 2001
2 1001 2803 2002
2 1001 2349 2003
3 1001 2824 2001
4 1002 2041 2000
4 1002 3100 2001
4 1001 3300 2002
5 1057 6000 2000
end

input long (region2 expend2 syear2)
1001 9000 2000
1001 8000 2001
1001 9000 2002
1001 10000 2003
1002 7000 2000
1002 200 2001
1002 5600 2002
1002 900 2003
1006 10000 2000
1006 2000 2001
end

gen expend = .
count
forvalues i = 1/`r(N)' {
    summ expend2 if region2 == region[`i'] & syear2 == syear[`i'], meanonly
    if `r(N)' == 1 {
        replace expend = `r(mean)' in `i'
    }
}

drop *2

Note that I have specified the expenditure variable as expend2 in the -input- statement, and then created a new variable called expend for the results.

This is really quite a challenge. Good luck with it.

Comment

Benedikt Walker

Join Date: Nov 2017

Posts: 18
#5

07 Apr 2018, 01:06

Dear Clyde,

thanks a lot for the code and the fast reply! It works perfectly! In order to learn a bit more, I would like to explain what I understand of the code, if there`s anything incorrect, please correct me. About the first part:

forvalues i = 1/`r(N)' { summ expend2 if region2 == region[`i'] & syear2 == syear[`i'], meanonly

Here [`i'] is being substituted for the values of region/syear line by line, so you basically run through the data and do the matching. The mean will be the same as expend2, as there is only one value, so r(mean) will hold that very same value later.

Code:

if `r(N)' == 1 { replace expend = `r(mean)' in `i' }

The second part I don`t quite understand. First, I didn`t know one could start a command with an if-condition within a forvalue-loop. Second, isn`t r(N) being substituted for the count, which would never be 1? I though `r(N)' would be 13 as there are 13 observations in this example. Therefore I think, the if condition could be a test for whether r(N) to be true. Something like that?
I ran the code without the if condition and it works fine until the last replace command, which is invalid. So the if condition must be important here, but I don`t understand it. What I don`t understand either is how the second pair of parentheses work. As far as I understand,

Code:

replace expend = `r(mean)' in `i' if `r(N)' == 1

should do the same, but it doesn`t. The same invalid occurs as with executing the command without the if condition.

If you could give me a clue how this works?

Thanks again for the amazing help!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#6

07 Apr 2018, 09:21

You understand the first part of the code correctly.

First, I didn`t know one could start a command with an if-condition within a forvalue-loop.

You can't, and that isn't what this is. Stata has, in addition to if-conditions, if-commands, and this is the latter. Let me explain the difference. An if-condition appears after the main part of a command, and before the options, and it serves to restrict the application of the command it appears in to the subset of the observations that satisfy the condition. The if-command is different. It is a command in its own right, and it appears on a separate line in the file. It applies not just to a single command, but to an entire block of commands that is enclosed in curly braces {}. Moreover, it does not restrict the application of those commands to a subset of the data. Rather it assesses whether or not the condition is true. If the condition is true, all of the commands between the braces are executed. If the condition is false, all of the commands are skipped. Thus the conditions that appear in -if- commands are typically not about properties of particular observations, rather they are typically global properties of the data set and even of the program running. Typically the conditions seen in if commands don't even make sense if you try to apply them to a single observation in the data. (Though occasionally they do, and when that is the case, they are interpreted as being applied only to the first observation in the data set.)

Second, isn`t r(N) being substituted for the count, which would never be 1? I though `r(N)' would be 13 as there are 13 observations in this example

Anything returned in r() by a command is "volatile." It ceases to exist as soon as some other command returns something in r(). In this case, before we enter the loop, r(N) is the product of the -count- command, and it contains the number of observations in the data set. We then enter the loop, referring to r(N) to define when it will end. Next comes the -summ- command. It returns new values to r(), so the value of r(N) that was created by -count- is now gone. At this point r(N) is the value of r(N) created by the -summ- command, which is the number of non-missing observations of expand that satisfy the stated conditions on syear and region and syear2 and region2. That number, by design of the data, will either be 0 or 1.

I ran the code without the if condition and it works fine until the last replace command, which is invalid. So the if condition must be important here, but I don`t understand it. What I don`t understand either is how the second pair of parentheses work.

When you reach the last replace command, as it happens in the example data, there are no observations where syear and region match syear2 and region2. So what does -summ- do with that? It returns 0 in r(N). And it doesn't create r(mean) at all because there is nothing to calculate a mean of. Consequently when you then encounter -replace expend = `r(mean)' in `i'-, r(mean) does not exist, so the command looks to Stata like -replace expand = in `i'-, which is a syntax error because there is no expression on the right of the equals sign. Adding the if-condition -if `r(N)' == 1- to the command doesn't save it because r(mean) still doesn't exist, and Stata hasn't even gotten to looking at the -if `r(N)' == 1- part at that point in the command parsing. Referring back to my explanation for your second question, this is exactly the difference between an if-condition and an if-command. By guarding the -replace- command with an if-command, Stata first checks whether -summ- has returned 1 in r(N). If so, the -replace- command is executed. If not, the -replace- command is skipped entirely--it isn't even parsed.
Comment
Benedikt Walker

Join Date: Nov 2017

Posts: 18
#7

08 Apr 2018, 01:22

Thank you for the detailed explanation and the code! I learned a lot!
Comment

Announcement