  • Improving and Polishing STATA Codes

    Hello all,

    After some time using Stata in different projects, always finding my own ways to get what I want with the knowledge I have, I decided to ask whether there are better and easier ways to do it.

    Example 1:


    Code:
    clear
    input str8 childid byte(grder309 grder308 grder307 grder306 grder305 grder304 grder303 grder302 grder301 grder300 grder399 grder398 grder397 grder396 grder395 grder394) float preschool
    "IN011048"  .  . . . . . . . . .  .  .  .  .  .  . 1
    "IN011048"  .  . . . . . . . . .  .  .  .  .  .  . 1
    "IN011048" 11 10 9 8 7 6 5 4 3 2  1  0  0 88 88 88 1
    "IN011048"  .  . . . . . . . . .  .  .  .  .  .  . 1
    "IN011049"  .  . . . . . . . . .  .  .  .  .  .  . 0
    "IN011049"  .  . . . . . . . . .  .  .  .  .  .  . 0
    "IN011049" 10  9 8 7 6 5 4 3 2 1 88 88 88 88 88 88 0
    "IN011049"  .  . . . . . . . . .  .  .  .  .  .  . 0
    "IN011050"  .  . . . . . . . . .  .  .  .  .  .  . 0
    "IN011050"  .  . . . . . . . . .  .  .  .  .  .  . 0
    "IN011050" 10  9 8 7 6 5 4 3 2 1 88 88 88 88 88 88 0
    "IN011050"  .  . . . . . . . . .  .  .  .  .  .  . 0
    end

    The variable "preschool" has already been created in the only way I could come up with:

    Code:
    gen preschool=0
    replace preschool=1 if  grder309==0 | grder308==0 | grder307==0 | grder306==0 | grder305==0 | grder304==0 | grder303==0 | grder302==0 | grder301==0 | grder300==0 | grder399==0 | grder398==0 | grder397==0 | grder396==0 | grder395==0 | grder394==0
    A value of 0 in any grder3* variable means that the child attended preschool in that year, so preschool should be 1 whenever a child attended preschool in any year from 1994 to 2009 (grder394-grder309).

    I am not looking to create varlists or macros for grder3*, since I will use that group of variables only once.

    I would like to know if anyone knows an easier way to create the preschool variable.

    Thanks in advance.

    Francisco
    Last edited by Francisco Carballo; 03 May 2017, 08:11.

  • #2
    On the face of it, your indicator variable is 1 if any of a bunch of variables is 0, and 0 otherwise. This would be shorter code to write (although not to run):

    Code:
    egen wanted = rowmin(grder*) 
    replace wanted = !wanted
    What's the point of the extra observations with just missings in most of the variables?
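
    To spell out why the two lines work: rowmin() ignores missing values and returns missing only when every argument is missing, and ! maps 0 to 1 and everything else, missing included, to 0. A minimal sketch on made-up data (the variables g1-g3 and their values are hypothetical, and it assumes there are no negative codes, since a negative value would hide a 0 from rowmin()):

    Code:
    clear
    input byte(g1 g2 g3)
     3  0 88
     5  4  3
     .  .  .
    end

    * smallest non-missing value in each row; missing if all are missing
    egen wanted = rowmin(g1-g3)

    * 1 where that minimum is 0, otherwise 0 (all-missing rows also become 0)
    replace wanted = !wanted
    list

    If all-missing rows should stay missing rather than become 0, the flip would need a qualifier such as if !missing(wanted).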


    • #3
      Thanks for the reply Nick.

      I find it difficult to come up with ways of creating new variables that depend on conditions across a bunch of other variables. Your code works and does its job, but only because the value of interest here is 0. What if we want a variable that takes the value 1 if any of the grder* variables is 7?

      I want to get rid of the tedious work of writing out 16 conditions joined by "or", variable names included.

      PS: I wanted to show a couple of lines with different grades attended (preschool and no preschool), and my panel data have that information only in wave 3, so I needed at least 8 lines. I forgot to delete the extra rows with redundant information when I created the post.


      • #4

        Code:
        * canned code 
        egen any7 = anymatch(grder*), value(7) 
        
        * first principles 
        gen ANY7 = 0 
        
        qui foreach v of var grder* { 
            replace ANY7 = 1 if `v' == 7 
        }
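
        Both approaches extend to more than one target value. The following is only a sketch on made-up data (g1-g3, any07, ANY07 and n7 are hypothetical names): anymatch() accepts an integer numlist in values(), inlist() does the same job inside the loop, and anycount() counts the matches in each row instead of flagging them.

        Code:
        clear
        input byte(g1 g2 g3)
         7  6  5
         2  1  0
         4  3  2
        end

        * canned code: 1 if any variable equals 0 or 7
        egen any07 = anymatch(g1-g3), values(0 7)

        * first principles: same logic with inlist()
        gen ANY07 = 0

        qui foreach v of var g1-g3 {
            replace ANY07 = 1 if inlist(`v', 0, 7)
        }

        * number of variables equal to 7 in each row
        egen n7 = anycount(g1-g3), values(7)

        * the two flags should agree on every row
        assert any07 == ANY07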


        • #5
          In short, as pointed out in #2 and #4, egen remains among the best options for data management or "polishing", if you will.
          Best regards,

          Marcos


          • #6
            Just to offer an alternative argument, good code does not necessarily equal short or smart code. Especially if you share the code with others, it can sometimes help to add "useless" code or comments for the sole reason of making the reasoning more explicit.

            Code:
            ** Determine whether kid attended preschool
            ** True if the grade is zero in at least one year (I don't know what grder means)
            ** Assumes the grder variables are never negative
            egen gradeEverZero = rowmin(grder*) // Find smallest value of grder variables
            replace gradeEverZero = (gradeEverZero == 0) // 1 if the smallest grade is zero, i.e. the grade is zero at least once
            gen preschool = (gradeEverZero == 1) // kid attended preschool if the grade is ever zero
            drop gradeEverZero
            In the code above, the variable "gradeEverZero" is redundant from a coding perspective, but it helps the reader understand what's going on. Note also that the variable name is fully self-explanatory. For some reason programmers always abbreviate their variable names (e.g. grEqZero or whatever), often obscuring the meaning of their variables. Is the code above elegant? No. But can anyone with a rudimentary understanding of Stata and the dataset understand its function? Yes! (I'd hope so anyway)


            • #7
              The context is how to write code more concisely -- and I agree strongly: the trade-off with clarity can be tricky.

              I once spent two hours repeatedly reading the entire documentation for the in-built text editor of a programming language which was -- I kid you not -- one sentence long! It was a brilliantly concise statement but in my case only when you had worked out how to use the editor could you understand the documentation. (Cue for someone to tell me this applies to my own writings.)

              I am not feeling criticised but I do assert that rowmin() is far from cryptic as an explanation of what is happening. Similarly, Stata programmers should learn fast that ! flips true and false if they want to be(come) competent.

              It is not difficult to explain why programmers often (not always!) abbreviate variable names:

              1. Less typing

              2. (For programmers dating back to Jurassic or earlier) Still not used to the idea that variable names don't have to be short, because we started with Fortran, or 8-character limits, or whatever.

              But clearly, if anyone votes for obscure code over clear code, and nothing else said, they are either joking or dubious.

              I find comments on the same line to impart a feeling of crowdedness. One outstanding Stata programmer still writes code as if he were writing on the back of a small envelope, which makes his (*) programs hard to read.

              (*) Unfortunately very little information there.


              • #8
                Originally posted by Nick Cox
                I find comments on the same line to impart a feeling of crowdedness. One outstanding Stata programmer still writes code as if he were writing on the back of a small envelope, which makes his (*) programs hard to read.
                I actually agree and have started moving comments to the top more and more. That said, the crowded feeling reduces significantly if one adds a couple of tabs before the comment itself (and aligns them vertically).
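
                To make the two layouts concrete, here is a rough sketch on made-up data (g1 to g3, any7 and ANY7 are hypothetical names). It is meant to be run from a do-file, since // comments are not recognised when typed interactively.

                Code:
                clear
                input byte(g1 g2 g3)
                 7  6  5
                 2  1  0
                end

                * Comments on top:
                * start from "no match", then flag rows where any grade equals 7
                gen byte any7 = 0
                quietly foreach v of varlist g1-g3 {
                    replace any7 = 1 if `v' == 7
                }

                * Comments on the same line, aligned vertically
                gen byte ANY7 = 0                       // start from "no match"
                quietly foreach v of varlist g1-g3 {    // loop over the grade variables
                    replace ANY7 = 1 if `v' == 7        // flag rows where a grade equals 7
                }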
