Translate R to STATA

Marc Audrna

Join Date: Feb 2015

Posts: 59
#1

Translate R to STATA

12 Feb 2019, 10:17

Hello,

I would like to translate this R formula below in Stata.

My data :
translate <- contains : translate$ID translate$WORDS (list of words) translate$RESUME
home <- contains : home$ID home$adress home$RESUME (empty field)

I whant to complete home$RESUME from translate$RESUME looping with translate$ID and translate$WORDS

[I]for(i in 1:nrow(translate)){
home$RESUME[grepl(translate$WORDS[i], home$adress, ignore.case = TRUE) & home$ID == translate$ID[i]] <- translate$RESUME
}

Thank you for your help !
Marc
Tags: None
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

12 Feb 2019, 10:49

I gather you need to - merge - both "datasets" (well, here in R they are just "objects").

Best regards,

Marcos
Comment
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#3

12 Feb 2019, 11:57

If you don't tell us more about your data, you run the risk that people will wrongly guess what your data structure is. That said, it looks like you have two datasets/data frames:

1. translate has variables ID, WORDS, and RESUME. (Q: is the dataset unique by ID and WORDS?)
2. home has variables ID, address, and RESUME.

translate and home match by ID. There may be a many to one relationship, or a one to one relationship. See the question.

It looks like you're trying to update the value of RESUME in the home dataset using the same value in translate. However, the grepl function appears to search for matches, so it appears you're wanting to match only if, in a row of the home dataset, the address has a string match to WORDS in translate. You're looping over all the rows in translate to do this. At least, that's what I think is happening - my R is beginner level, and I'm still getting a handle on R's data structures and references.

Either way, this is now sounding a bit more complex than just merge. We are going to need a better description of the data involved before we can advise you properly. If I've mis-described your intent, you'll need to tell us what you're trying to do.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
Comment
Marc Audrna

Join Date: Feb 2015

Posts: 59
#4

13 Feb 2019, 02:20

Indeed, I should have presented my data.

I have this dataset (home) :
zipcode adress RESUME

30001 blabla lolo

30001 blabla lolo bla

30005 12 bobo lala

30005
lolo bobo titi

30006
tiTi

A correspondance table (translate) :
zipcode WORDS RESUME

30001 lolo The Lolo

30001 bobo The Bobo

30005 bobo The Bobo

30006 titi The Titi

And I would like to get :
zipcode adress RESUME

30001 blabla lolo The Lolo

30001 blabla lolo bla The Lolo

30005
12 bobo lala

The Bobo

30005
lolo bobo titi

The Bobo

30006
tiTi

The Titi

I can do it with the R code below :

Code:

[I]for(i in 1:nrow(translate)){ home$RESUME[grepl(translate$WORDS[i], home$adress, ignore.case = TRUE) & home$zipcode == translate$zipcode [i]] <- translate$RESUME }

but I need to do it with Stata.

Thank you !
Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#5

13 Feb 2019, 02:35

What is the logic behind selecting the value from the first rather than the second occurrence of zipcode==30001 from your dataset 'translate' ?
Comment
Marc Audrna

Join Date: Feb 2015

Posts: 59
#6

13 Feb 2019, 02:50

From a lookup table (translate), I want to look for words (from the WORDS field) in the ADRESS field and harmonize the addresses in the RESUME field.
Comment
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#7

13 Feb 2019, 03:13

As recommended n #2, you could take a look at the - merge - command.

Best regards,

Marcos
Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#8

13 Feb 2019, 04:07

I'd suggest joinby (gives all possible combinations on zipcode) and then keeping only those where words match. Intermediate step is making both strings all lowercase.

Code:

use home.dta, clear drop resume joinby zipcode using translate.dta, update replace adress =lower(adress) replace words =lower(words) keep if strpos(adress, words)
Comment

Daniel Bela

Join Date: Apr 2014
Posts: 246

13 Feb 2019, 04:07

Hi,

I agree that -merge- is the way to go in Stata. I would try to get a list of words from the lookup table; and run an -merge , update- for any occuring word in order to fill in the RESUME variable. This works with your example data, but can be tweaked in terms of performance:

Code:

// create example data: lookup table
clear
input zipcode  str20 WORDS str20 RESUME
30001  "lolo" "The Lolo"
30001  "bobo"  "The Bobo"
30005  "bobo"  "The Bobo"
30006  "titi"  "The Titi"
end
tempfile translate
save `"`translate'"'
list

// create example data: source data
clear
input zipcode str220 adress str1 RESUME
30001 "blabla lolo" ""
30001 "blabla lolo bla" ""
30005 "12 bobo lala" ""
30005 "lolo bobo titi" ""
30006 "tiTi" ""
end

// get list of words from lookup table
preserve
use WORDS using `"`translate'"' , clear
quietly : levelsof WORDS , local(valid_words)
restore

// create identifier variable filled with relevant words one by one,
// than update-merge with the lookup table
generate WORDS=""
foreach word of local valid_words {
    replace WORDS=`"`word'"' if strmatch(ustrlower(adress),`"*`word'*"')
    merge m:1 zipcode WORDS using `"`translate'"' , update nogenerate keep(master match_update match)
}

// clean up
drop WORDS

// show result
list , sepby(zipcode)

Regards
Bela

Comment

Marc Audrna

Join Date: Feb 2015

Posts: 59
#10

13 Feb 2019, 06:38

Thank you for your answers.
I think that my Stata version (ic 13.1) is to old to recognize "ustrlower" function... ? I will manage the characters case upstream.

Thanks again for everything,
Regards
Marc
Comment
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#11

13 Feb 2019, 07:17

You can use lower in place of ustrlower. The latter is an alternative for unicode characters. Stata 13 does not support unicode and therefore doesn't include ustrlower.
Comment

zipcode	WORDS	RESUME
30001	lolo	The Lolo
30001	bobo	The Bobo
30005	bobo	The Bobo
30006	titi	The Titi

Announcement

Translate R to STATA

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment