  • Cleaning string variable

    Dear all,

    I have a variable called place_birth in my dataset. Some of the locations weren't recorded properly.

    place_birth

    Feucherolles (Saint-James = Le château royal de Sainte-Gemme)
    ST B(?), canton de Chaillot
    (?Chanvrand)Canton de La Guiche
    Seine-Inférieure (Seine-Maritime)
    Épinay-sur-Seine ,
    Autine (?) Outines
    Darrois ? Darvois

    I would like to do two things.
    First, separate whatever sits inside parentheses (), after a comma, or around an = sign from the rest of the text, and store the separated part in a new variable called place_new.
    Second, clean both variables of stray characters such as ?, =, /, a . at the end, etc.

    For example
    Épinay-sur-Seine ,

    should look like
    Épinay-sur-Seine

    Replace ? and (?) with a comma, so that
    Autine (?) Outines
    becomes
    Autine , Outines

    For this one:
    Feucherolles (Saint-James = Le château royal de Sainte-Gemme)

    Eliminate "Saint-James =" and just leave:
    Feucherolles (Le château royal de Sainte-Gemme)

    Then I can separate the strings by comma and parenthesis so that for example:

    place_birth
    (?Chanvrand)Canton de La Guiche

    becomes:
    place_new
    Chanvrand

    Or:
    place_birth
    Seine-Inférieure (Seine-Maritime)

    Becomes in the new var:
    place_new
    Seine-Maritime

  • #2
    marco lecci, you essentially need a series of string manipulations using functions like subinstr() and commands like split. You can look up the help for these, try them out, and come back to the forum if you're stuck.
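
    As a rough illustration of that suggestion, here is a minimal sketch that uses subinstr() and split on three of the example values from the first post. The names place_clean and part are hypothetical, and the rules (turn "(?)" and "?" into a comma, drop a trailing comma) are only one reading of what was asked for, so treat it as a starting point rather than a finished solution.

    ```stata
    * Sketch only: place_clean and part* are hypothetical names
    clear
    input str80 place_birth
    "Épinay-sur-Seine ,"
    "Autine (?) Outines"
    "Darrois ? Darvois"
    end

    gen place_clean = place_birth

    * Replace "(?)" first, then any remaining "?", with a comma
    replace place_clean = subinstr(place_clean, "(?)", ",", .)
    replace place_clean = subinstr(place_clean, "?", ",", .)

    * Drop a single trailing comma and the spaces around it
    replace place_clean = strtrim(place_clean)
    replace place_clean = strtrim(substr(place_clean, 1, strlen(place_clean) - 1)) ///
        if substr(place_clean, -1, 1) == ","

    * Split on commas into part1, part2, ... and trim each piece
    split place_clean, parse(",") generate(part)
    foreach v of varlist part* {
        replace `v' = strtrim(`v')
    }

    list place_birth place_clean part*
    ```

    The order of the two subinstr() calls matters: replacing "?" before "(?)" would leave "(,)" behind.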

    • #3
      Thanks, Hemanshu Kumar.
      I did try:
      gen dep_new = ustrregexs(0) if ustrregexm(départementdenaissance,"((?<=\().*?(?=\ )))|(^(^\(\))*$)")
      This lets me extract the text inside the parentheses (). However, when I try to clean out the ?, commas, etc., I don't get the results I want.
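
      For comparison, here is a hedged sketch of one way the regex route could be combined with the cleaning step. The pattern as it appears above seems to have unbalanced parentheses, so this uses a simpler one with a capture group. It assumes Stata 14 or later for the Unicode functions, uses place_birth from the first post rather than départementdenaissance, and place_main is a hypothetical name for the text left outside the parentheses.

      ```stata
      * Sketch only: place_main is a hypothetical name
      clear
      input str80 place_birth
      "Feucherolles (Saint-James = Le château royal de Sainte-Gemme)"
      "(?Chanvrand)Canton de La Guiche"
      "Seine-Inférieure (Seine-Maritime)"
      end

      * Capture the text between the first "(" and the next ")"
      gen place_new = ustrregexs(1) if ustrregexm(place_birth, "\(([^()]*)\)")

      * Inside the parentheses, keep only what follows an "="
      replace place_new = ustrregexra(place_new, "^.*=", "")

      * Remove question marks, then trim surrounding spaces
      replace place_new = ustrtrim(subinstr(place_new, "?", "", .))

      * Text outside the parentheses
      gen place_main = ustrtrim(ustrregexra(place_birth, "\([^()]*\)", ""))

      list place_birth place_main place_new
      ```

      On these three rows place_new should come out as "Le château royal de Sainte-Gemme", "Chanvrand" and "Seine-Maritime". A value such as "ST B(?), canton de Chaillot" would need the "(?)" replaced before this step, otherwise the parentheses around the question mark are treated as a place name.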
