Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generating variables by selective search of variable’s entries

    Dear Stata users,

    I have a database about patents. In this database the individual’s address is written in the following format:
    address1, address2, address3 and address4 (e.g., Euclid’s Road 15258; P. O. Box 15269; Boston, MA; MA). Information on address3 and address4 may appear with missing values.

    What I want to create is a variable called state. This variable will take the name of the state given information from address2, address3 and address4.

    What I have in mind is that I need to tell Stata to generate the variable given specific words found in entries of address2, address3 and address4. For the sake of an example: fill state variable with the name Massachusetts if the words Boston or MA (or both) are found at least once in address2, address3 and address4. Since the entries are not the same (i.e., Boston, MA != Boston != MA), a simple if statement does not seem to be the right way to attack this problem.

    Can we somehow fix this in Stata?

    Thank you for your time!
    Last edited by Pantelis Kazakis; 02 Oct 2014, 09:43.

  • #2
    be careful, there are states other than MA that have a town/city named "Boston"

    second, you have, in address4, the state; this seems to be all you need but clearly that is not what you think; please clarify

    Comment


    • #3
      Originally posted by Rich Goldstein View Post
      be careful, there are states other than MA that have a town/city named "Boston"

      second, you have, in address4, the state; this seems to be all you need but clearly that is not what you think; please clarify

      I understand that well . Here in Columbus, OH we have Pennsylvania Road for example. I forgot to say that address4 is many times null, so one needs to fill the gap using the other information.

      Comment


      • #4
        Be warned: these problems are usually much messier than you want to believe. If this were my problem I would look for the two-letter abbreviations to start with.

        Comment


        • #5
          I second Nick's caution about how messy this can be. I would use regular expressions to search for two-letter (uppercase) matches at the end of an address variable. I would allow for a number of trailing spaces and digits for zip codes. Since you have a variable number of address lines (variables), you would have to repeat the search. Something like:

          Code:
          clear
          input str50 address1 str50 address2 str50 address3 str50 address4
          "Euclid's Road 15258" "P. O. Box 15269" "Boston, MA"
          "1 Infinite Loop" "Cupertino, CA 95014"
          "Washington, DC 20001"
          "Kendall Square/MIT" "209 Broadway" "Cambridge" "MA 02139"
          end
          
          * the following is from SSC; to install, type ssc install leftalign
          leftalign
          list
          
          gen state = regexs(1) if regexm(address4, "([A-Z][A-Z])[0-9 ]*$")
          replace state = regexs(1) if regexm(address3, "([A-Z][A-Z])[0-9 ]*$") & mi(state)
          replace state = regexs(1) if regexm(address2, "([A-Z][A-Z])[0-9 ]*$") & mi(state)
          replace state = regexs(1) if regexm(address1, "([A-Z][A-Z])[0-9 ]*$") & mi(state)
          
          * the following is from SSC; to install, type ssc install listsome
          listsome, random noobs
          I highly recommend using listsome to check this type of data cleaning work, especially if you are working with lots of observations.
          Last edited by Robert Picard; 02 Oct 2014, 10:33.

          Comment


          • #6
            OK, thank you for your suggestions.

            I have tell you that specific search must be made in address4 as well. For it is not only with 2 letters (now seeing the data better I can check that), for example it can be MA.; M. A.; Boston, Massachusetts as well. So, I will try to focus on the very last address using selective search and see what happens.

            Comment


            • #7
              Yes, this is why it's messy. Again, use listsome to get a sense of what did not work in the first pass

              Code:
              listsome if mi(state), random
              This gives you a random sample of observations where the initial strategy did not work. You then pick the easiest target and make another pass. Repeat until you reach your threshold of diminishing returns.

              Comment


              • #8
                Robert,

                the code that you provided works pretty well, but as you said is not without errors. See for example this:

                address1 address2 address3 address4 state
                3500 DEER CREEK ROAD M/S 26U-25 PALO ALTO, CA 94304-1317 null AD

                As you can see, this should have returned CA.

                The problem is that the code continues to address1 and takes the last two letters (AD), while it should have stopped in address3. How can we make it to not continue after finding something starting from the very last address? Some kind of loop?

                Another thing, after passing some time with the data is that some state indicators differ. Take for example Virginia, which should be VA. I found that you may also have V.A. or V. A. or even VA. all of them differ with each other.

                Plus the case for Michigan is written many time as MICHI or MICHI.

                Generally I would say that the more you advance in address, the more likely to get a mistake. Although the sample increases, I may need to stay at the very last level (address4).

                Comment


                • #9
                  I would switch to a strategy that looks for the State in the last address field only. There's no perfect recipe for how to proceed, you just need to carefully check the matches as you proceed. You should also get a master list of State abbreviations that you can merge with your matches to validate your results. Here are some additional techniques

                  Code:
                  clear
                  input str50 address1 str50 address2 str50 address3 str50 address4
                  "Euclid's Road 15258" "P. O. Box 15269" "Boston, MA"
                  "1 Infinite Loop" "Cupertino, CA 95014"
                  "Washington, DC 20001"
                  "Kendall Square/MIT" "209 Broadway" "Cambridge" "MA 02139"
                  "3500 DEER CREEK ROAD" "M/S 26U-25" "PALO ALTO, CA 94304-1317"
                  "Somewhere, V. A."
                  "Somewhere else, V.A"
                  "Here" "Ann Arbor, Michigan"
                  "There" "Cambridge, Massachusetts"
                  end
                  
                  * the following is from SSC; to install, type ssc install leftalign
                  leftalign
                  list
                  
                  * switch to using only the last address line
                  gen lastaddr = address4
                  replace lastaddr = address3 if mi(lastaddr)
                  replace lastaddr = address2 if mi(lastaddr)
                  replace lastaddr = address1 if mi(lastaddr)
                  
                  * first attempt, 2 uppercase letters with optional space/zip code
                  * tweaked to allow also a "-" for zip+4
                  gen state = regexs(1) if regexm(lastaddr, "([A-Z][A-Z])[0-9 -]*$")
                  
                  * the following is from SSC; to install, type ssc install listsome
                  listsome, random noobs
                  listsome if mi(state), random noobs
                  
                  * redo but allow for a space and . ( V.A. or V. A. or VA.)
                  clonevar state0 = state
                  replace state = regexs(1) + regexs(2) if regexm(lastaddr, "([A-Z])\.? ?([A-Z])\.? ?[0-9 -]*$")
                  listsome if state0 != state, random noobs
                  
                  * Let's look for states, fully spelled out
                  listsome if mi(state), random noobs
                  replace state0 = state
                  replace state = "MI" if strpos(upper(lastaddr), "MICHIGAN")
                  replace state = "MA" if strpos(upper(lastaddr), "MASSACHUSETTS")
                  listsome if state0 != state, random noobs

                  Comment

                  Working...
                  X