  • Merge observations with the same name

    Hello,

    I am having trouble combining two observations with the same meaning in Stata.
    I am using Stata/IC 15.1 for Windows.

    I have imported educational data from the NCES into Stata. One of the variables is the state.

    Stata has made each state name a number. In addition, some states appear twice under different capitalization (ILLINOIS and Illinois). I would like to combine these into one state (preferably the mixed-case one, Illinois).
    Below is a data example:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long state
    15
    10
    10
     6
    10
    10
     3
     6
    10
     3
     3
     3
    end
    label values state state
    label def state 1 "ILLINOIS", modify
    label def state 3 "Illinois", modify
    label def state 4 "Indiana", modify
    label def state 6 "Kentucky", modify
    label def state 8 "Missouri", modify
    label def state 10 "Ohio", modify
    label def state 13 "Tennessee", modify
    label def state 15 "Virginia", modify
    label def state 16 "West Virginia", modify

    Therefore, my questions are:
    1) How can I combine duplicate state names (OHIO and Ohio, ILLINOIS and Illinois, etc.)?
    2) Why has Stata converted state names into numbers?

    Thank you.

  • #2
    Well, the commonest typographical variants are capitalization (as here) and extra blank space. Those are easy enough to correct (see below) in a string variable. You, however, have a numeric variable with value labels. It looks like it was probably created by somebody who used the -encode- command, which is often a good idea once the strings have been cleaned up. On the other hand, using -encode- on a messy string variable riddled with errors is a really terrible idea. So I think you first have to convert this back to string, then clean it up:

    Code:
    decode state, gen(str_state) // UNDO ENCODE
    label drop state // THIS IS A CORRUPTED LABEL AND NEEDS TO GO
    replace str_state = proper(trim(itrim(str_state))) // SIMPLE CLEANING

    Before going any further, you should then inspect the values of str_state to make sure they are all as expected. The code above will not correct spelling errors. That may have to be done with some idiosyncratic -replace- statements.
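
    As a rough sketch of that inspection step, assuming the cleaning code above has already been run: -tabulate- lists each distinct name with its frequency, which makes stray variants easy to spot. The misspelling "Ilinois" below is invented purely for illustration and does not appear in the posted data.

    ```stata
    * List every distinct cleaned name and how often it occurs
    tabulate str_state, missing

    * Hypothetical one-off fix: "Ilinois" is an invented misspelling,
    * shown only to illustrate the pattern; substitute whatever -tabulate- reveals
    replace str_state = "Illinois" if str_state == "Ilinois"
    ```

    One -replace- per misspelling is usually enough when only a handful of variants turn up.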

    That will leave you with a simple string version of the state names.

    As for why Stata converted state names into numbers: undoubtedly because somebody asked it to, probably with the -encode- command. You might wonder why somebody would want to do that. Actually, once the string variable is clean, it's often necessary to convert it to a value-labeled numeric variable. For example, if this is panel data and you want to set state as the panel identifier, then you must make it numeric. Similarly, for it to serve as a variable in a regression model, it must be numerically encoded. Also, even when none of these considerations apply, in a large data set with many observations and a relatively small number of values taken on by the variable, -encode-ing it saves a lot of memory.
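
    Once str_state is clean, re-encoding it might look like the sketch below. The variable name state_id is just a suggestion, and year, y, and x are assumed variables that are not part of the posted example, so those two lines are left commented out.

    ```stata
    * Re-encode the cleaned string; -encode- assigns codes 1, 2, ... in alphabetical order
    * and stores the names in a value label called state_id
    encode str_state, generate(state_id)

    * Check that each state now has exactly one code
    label list state_id

    * Hypothetical uses (year, y, and x are assumed variables, not in the example data):
    * xtset state_id year
    * regress y x i.state_id
    ```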
