Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Translate R to STATA

    Hello,

    I would like to translate this R formula below in Stata.

    My data :
    translate <- contains : translate$ID translate$WORDS (list of words) translate$RESUME
    home <- contains : home$ID home$adress home$RESUME (empty field)

    I whant to complete home$RESUME from translate$RESUME looping with translate$ID and translate$WORDS


    [I]for(i in 1:nrow(translate)){
    home$RESUME[grepl(translate$WORDS[i], home$adress, ignore.case = TRUE) & home$ID == translate$ID[i]] <- translate$RESUME
    }


    Thank you for your help !
    Marc

  • #2
    I gather you need to - merge - both "datasets" (well, here in R they are just "objects").
    Best regards,

    Marcos

    Comment


    • #3
      If you don't tell us more about your data, you run the risk that people will wrongly guess what your data structure is. That said, it looks like you have two datasets/data frames:

      1. translate has variables ID, WORDS, and RESUME. (Q: is the dataset unique by ID and WORDS?)
      2. home has variables ID, address, and RESUME.

      translate and home match by ID. There may be a many to one relationship, or a one to one relationship. See the question.

      It looks like you're trying to update the value of RESUME in the home dataset using the same value in translate. However, the grepl function appears to search for matches, so it appears you're wanting to match only if, in a row of the home dataset, the address has a string match to WORDS in translate. You're looping over all the rows in translate to do this. At least, that's what I think is happening - my R is beginner level, and I'm still getting a handle on R's data structures and references.

      Either way, this is now sounding a bit more complex than just merge. We are going to need a better description of the data involved before we can advise you properly. If I've mis-described your intent, you'll need to tell us what you're trying to do.
      Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

      When presenting code or results, please use the code delimiters format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

      Comment


      • #4
        Indeed, I should have presented my data.

        I have this dataset (home) :
        zipcode adress RESUME
        30001 blabla lolo
        30001 blabla lolo bla
        30005 12 bobo lala
        30005
        lolo bobo titi
        30006
        tiTi

        A correspondance table (translate) :
        zipcode WORDS RESUME
        30001 lolo The Lolo
        30001 bobo The Bobo
        30005 bobo The Bobo
        30006 titi The Titi

        And I would like to get :
        zipcode adress RESUME
        30001 blabla lolo The Lolo
        30001 blabla lolo bla The Lolo
        30005
        12 bobo lala
        The Bobo
        30005
        lolo bobo titi
        The Bobo
        30006
        tiTi
        The Titi

        I can do it with the R code below :
        Code:
        [I]for(i in 1:nrow(translate)){
        home$RESUME[grepl(translate$WORDS[i], home$adress, ignore.case = TRUE) & home$zipcode == translate$zipcode [i]] <- translate$RESUME
        }
        but I need to do it with Stata.

        Thank you !



        Comment


        • #5
          What is the logic behind selecting the value from the first rather than the second occurrence of zipcode==30001 from your dataset 'translate' ?

          Comment


          • #6
            From a lookup table (translate), I want to look for words (from the WORDS field) in the ADRESS field and harmonize the addresses in the RESUME field.

            Comment


            • #7
              As recommended n #2, you could take a look at the - merge - command.
              Best regards,

              Marcos

              Comment


              • #8
                I'd suggest joinby (gives all possible combinations on zipcode) and then keeping only those where words match. Intermediate step is making both strings all lowercase.


                Code:
                use home.dta, clear
                drop resume
                joinby zipcode using translate.dta, update
                
                replace adress =lower(adress)
                replace words =lower(words)
                keep if strpos(adress, words)

                Comment


                • #9
                  Hi,

                  I agree that -merge- is the way to go in Stata. I would try to get a list of words from the lookup table; and run an -merge , update- for any occuring word in order to fill in the RESUME variable. This works with your example data, but can be tweaked in terms of performance:
                  Code:
                  // create example data: lookup table
                  clear
                  input zipcode  str20 WORDS str20 RESUME
                  30001  "lolo" "The Lolo"
                  30001  "bobo"  "The Bobo"
                  30005  "bobo"  "The Bobo"
                  30006  "titi"  "The Titi"
                  end
                  tempfile translate
                  save `"`translate'"'
                  list
                  
                  // create example data: source data
                  clear
                  input zipcode str220 adress str1 RESUME
                  30001 "blabla lolo" ""
                  30001 "blabla lolo bla" ""
                  30005 "12 bobo lala" ""
                  30005 "lolo bobo titi" ""
                  30006 "tiTi" ""
                  end
                  
                  // get list of words from lookup table
                  preserve
                  use WORDS using `"`translate'"' , clear
                  quietly : levelsof WORDS , local(valid_words)
                  restore
                  
                  // create identifier variable filled with relevant words one by one,
                  // than update-merge with the lookup table
                  generate WORDS=""
                  foreach word of local valid_words {
                      replace WORDS=`"`word'"' if strmatch(ustrlower(adress),`"*`word'*"')
                      merge m:1 zipcode WORDS using `"`translate'"' , update nogenerate keep(master match_update match)
                  }
                  
                  // clean up
                  drop WORDS
                  
                  // show result
                  list , sepby(zipcode)
                  Regards
                  Bela

                  Comment


                  • #10
                    Thank you for your answers.
                    I think that my Stata version (ic 13.1) is to old to recognize "ustrlower" function... ? I will manage the characters case upstream.

                    Thanks again for everything,
                    Regards
                    Marc

                    Comment


                    • #11
                      You can use lower in place of ustrlower. The latter is an alternative for unicode characters. Stata 13 does not support unicode and therefore doesn't include ustrlower.

                      Comment

                      Working...
                      X