  • Forloops and variable generation

    I have a series of variables fam1 - fam14 that represent the sex and age of all family members. They were entered as H45 or M35 (where H = male and M = female, with the age following the letter). I need to separate the age from the sex in order to run any analyses.


    I've figured out that I can go line by line, and make male1 - male14 to represent the sex of each family member and age1 - age14 to represent their age with the following commands:

    gen male1 = regexs(2) if regexm(fam1, "(([A-Z]+)*([0-9]+))")
    gen age1 = regexs(3) if regexm(fam1, "(([A-Z]+)*([0-9]+))")

    However, I know this is rather labor intensive, and I feel like I should be able to do it with a loop, but I can't seem to figure it out.

    Any guidance would be greatly appreciated.

    Thanks,
    Taylor

  • #2
    Hi Taylor, welcome to Statalist!

    A forvalues loop is what you want and I would use Stata's substr() function rather than regular expressions.

    Doing it with regex
    Code:
    forvalues i=1/14  {
        gen male`i' = regexs(2) if regexm(fam`i', "(([A-Z]+)*([0-9]+))")
        gen age`i' = regexs(3) if regexm(fam`i', "(([A-Z]+)*([0-9]+))")
    }
    Doing it with substr() and destring
    Code:
    forvalues i=1/14  {
        gen male`i' = (substr(fam`i', 1, 1)=="H")
        destring male`i', gen(age`i') ignore("HM")
    }


    • #3
      Welcome to Statalist, Taylor.

      You can use a forvalues loop to streamline the code you have from 14 pairs of commands to a single pair.
      Code:
      forvalues i=1/14 {
        gen male`i' = regexs(2) if regexm(fam`i', "(([A-Z]+)*([0-9]+))")
        gen age`i' = regexs(3) if regexm(fam`i', "(([A-Z]+)*([0-9]+))")
      }
      Added in edit: this post crossed with #2, in which David's elegant second approach is the one I would recommend.
      Last edited by William Lisowski; 26 Oct 2018, 17:56.


      • #4
        Hi David,

        It seems like your code works to turn male1 into a numeric variable where 1 = male and 0 = female, but a step seems to be missing: none of the age variables are generated, and every male`i' variable after male1 is all 0. Any thoughts? Maybe you can't do both of these within the same loop?

        In running the loop I kept getting this feedback for each male`i':

        male1 already numeric; no generate
        male2 already numeric; no generate

        etc.

        Thank you for your guidance!


        • #5
          Sorry, I should have written it as:

          Code:
          forvalues i=1/14  {
            gen male`i' = (substr(fam`i', 1, 1)=="H")
            destring fam`i', gen(age`i') ignore("HM")
          }
          Also, note that you'll need to drop the male1-male14 variables before re-running this code (or you'll get an error like "variable male1 already defined.") Or, just run the loop over the destring command.
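The cleanup step mentioned above could be sketched as follows. This is only an illustration: the toy data and the leftover male1 variable are made up, and `capture` is used so the loop does not stop when a variable from an earlier attempt does not exist.

```stata
* Toy data standing in for the real file (values are invented)
clear
input str6 fam1
"H45"
"M38"
end
generate male1 = 0    // leftover from an earlier, unsuccessful run

* Drop any male/age variables left behind before re-running the loop.
* capture suppresses the error for variables that were never created.
forvalues i = 1/14 {
    capture drop male`i'
    capture drop age`i'
}
```

After this runs, only the original fam variables remain, so the gen and destring commands can be run again without the "already defined" error.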
          Last edited by David Benson; 26 Oct 2018, 18:57.


          • #6
            Thank you for your help! I was able to successfully get the ages extracted, however I'm still not having much success with the male/female portion of the loop.

            Using the code you provided in #5, I'm able to generate the proper number of males (1) and females (0) for male1. However, male2 - male14 all equal 0, which isn't correct. Also, it seems to be generating the variable without reference to how many observations are in the corresponding fam`i'.

            Apologies for the additional queries - thank you for your help.

            ~Taylor


            • #7
              Hi Taylor,

              Not sure why it works for male1 but not male2-male14, but a couple of thoughts (or things you might try):
              1. I wondered if your data was initially in a foreign language (I wondered that because it was coded H=male and M=woman), and so maybe it wasn't really an H but a foreign letter / character with an umlaut or tilde or something.
              2. substr() is case sensitive and so "H" is treated differently than "h".
              3. Sometimes the letter "H" or "M" isn't first (i.e. it was coded as "45H" rather than "H45" or, more likely, there is an extra space at the beginning of the value: " H45" rather than "H45").
              Of all of these, I suspect it is #3, because you said creating the age variables worked fine and the problems #1 and #2 mean they wouldn't.

              Code:
              forvalues i=1/14 {
              replace fam`i' = trim(fam`i')
                replace fam`i' = itrim(fam`i')
                replace fam`i' = strupper(fam`i')
              }
              You could also check whether there are spaces (or funny characters) in your values by doing the following. Note that charlist (SSC) is a user-created command that can be installed using ssc install charlist.

              Code:
              gen sex2 = substr(fam2, 1, 1)  // This should always end up being "H" or "M".
              * If it is anything else or blank, then you know whether the problem is #1-#3 above.
              * I'm just creating one here for simplicity.  Obviously, you would want to do this for all of them, or run them through a loop.
              
              ssc install charlist  // just to install the package if you haven't already
              
              forvalues i=2/14 {
              charlist fam`i'
              }
              Regarding your comment:
              "Also, it seems to be generating the variable without reference to how many observations are in the corresponding fam`i'"
              You should update the code so that observations where fam`i'=="" are left missing rather than set to 0 or 1.

              Code:
              forvalues i = 1/14 {
              gen male`i' = 0 if fam`i' !=""
              replace male`i' = strpos(fam`i', "H") > 0
              }
              Finally, if you can't get it to work, I often use the LEFT() and RIGHT() functions in Excel for stuff like this.


              • #8
                If none of those solves the problem, you'll probably need to post a sample of your data using -dataex-.
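A minimal sketch of producing such a sample, assuming dataex is available (recent versions of Stata include it; on older ones it can be installed with ssc install dataex). The toy data and the choice of variables and rows are purely illustrative.

```stata
* Toy data standing in for the real file (values are invented)
clear
input str6 fam1 str6 fam2
"H45" " M12"
"M38" "H7"
""    ""
end

* Print a copy-and-paste data example for the chosen variables and rows.
* The output keeps leading spaces visible, which helps diagnose entry problems.
dataex fam1 fam2 in 1/3
```

The output can then be pasted into a post between code delimiters so that others can recreate the exact values.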


                • #9
                  Sorry, just realized I left something out in that last piece of code. It should be:
                  Code:
                  forvalues i = 1/14 {
                  gen male`i' = 0 if fam`i' !=""
                  replace male`i' = 1 if strpos(fam`i', "H") > 0
                  }


                  • #10
                    Thank you so much for your help, it now works! It must have been that the data were entered in with extra spaces at the beginning or end of the observation - I greatly appreciate all of the support.
                    ~Taylor
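For readers landing on this thread, the pieces discussed above can be combined into a single loop. This is a sketch under stated assumptions rather than the exact code Taylor ran: the toy data, the two-variable range, and the decision to clean the fam variables in place are all illustrative.

```stata
* Toy data standing in for the real file (values are invented);
* note the stray spaces and the lower-case letter
clear
input str6 fam1 str6 fam2
"H45" " m12"
"M38" "H7 "
""    ""
end

forvalues i = 1/2 {
    * Remove leading/trailing blanks and force upper case
    replace fam`i' = strupper(trim(fam`i'))
    * 1 = male (H), 0 = female (M); left missing when fam`i' is empty
    generate male`i' = (substr(fam`i', 1, 1) == "H") if fam`i' != ""
    * Age: convert to numeric while ignoring the sex letter
    destring fam`i', generate(age`i') ignore("HM")
}
list
```

With the real data, the loop range would be 1/14. Cleaning the strings first matters because substr() compares the literal first character, so a leading space or a lower-case "h" would otherwise be coded as 0.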
