  • General While Loop Question

    Hello,

    I've been having some problems with loop syntax in Stata. I'm more familiar with R and Python syntax.

    I'm using the code below to take the part of the string variable "B" that follows the word "area" and append it to variable "C", because of some scraping issues with a few observations. I keep getting a syntax error no matter what I try. Please advise.

    gen i = 1
    while i <= _N & trim(B[i]) != "" {
    tokenize B[i], parse("area")
    gen newC`i' = word(B[i], 2, .)
    replace C = C + " " + trim(newC`i') if newC`i' != ""
    gen i = `i' + 1
    }

  • #2
    Hi Jay, welcome to the forum.

    In general with Stata (and much like R) you probably don't want a loop for most tasks like this. You essentially never need to loop through observations in particular - though it sometimes makes sense to loop through objects like variables. Instead you want to think in vectors. So for example:

    Code:
    gen i = 1
    actually creates an entire new vector as a column in your data frame, not a single scalar or a local macro. Here, you might want to start by splitting the entire column by area:

    Code:
    split B, parse("area") gen(B)
    Then generate the new variable with:

    Code:
    gen wanted = B2 + C
    Of course, I haven't tested this code on my end because you don't provide a data example. Please do provide a data example using the -dataex- command for questions like this in the future. It makes things easier for you and whoever is trying to answer your question.
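
    To make the variable-versus-macro point above concrete, here is a minimal sketch on a made-up three-observation dataset. The names i and j are purely illustrative and are not taken from your data:

    ```stata
    clear
    set obs 3

    * -generate- creates a variable: one value in every observation
    generate i = 1
    list i

    * a local macro holds a single value; this is the usual loop counter
    local j = 1
    while `j' <= _N {
        display "observation `j': i = " i[`j']
        local j = `j' + 1
    }
    ```

    Note that the counter is updated with -local-, not -generate-, and is referenced inside macro quotes.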



    • #3
      I think the syntax error you are getting comes from -gen newC`i' = word(B[i], 2, .)-. The -word()- function takes only two arguments, not three. You are probably confusing it with the syntax of the -substr()- function, which does take three.

      Even if you fix that, however, this code will not do what you describe in your text, because the -tokenize- command will not parse on the substring "area". It will parse on the individual characters 'a', 'r', and 'e'--which is not what you want. -tokenize- is simply not the tool for this job.
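
      A quick way to see the difference for yourself is the sketch below. The test strings are made up for illustration, and I have deliberately left out the output of -tokenize- so you can inspect it on your own machine:

      ```stata
      * -tokenize- treats every character listed in parse() as a separator
      tokenize "abcareadef", parse("area")
      display "`1' | `2' | `3'"

      * -word()- takes two arguments: the string and the word number
      display word("Plan Area County", 3)

      * -strpos()- finds where the whole substring "area" starts
      display strpos("abcareadef", "area")
      ```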

      I'll also add that one of the hardest things for people experienced in general programming languages like Python to get used to in Stata is that loops are much less often needed. In particular, looping over observations is very rarely needed. Yes, there are some things that do call for it, but for most of those situations there are Stata programs that "hide" the looping and often get it done more efficiently than an explicit loop would.

      Anyway, what you want to do, if I understand it correctly, is done by:
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str10 b str11 c
      "abcdef"     "xyz"        
      "abcareadef" "123"        
      "aareadef"   "x1ys3z"    
      "abcarea"    "abracadabra"
      "areabcd"    "foobar"    
      end
      
      gen place = strpos(b, "area")
      * strpos() returns 0 when "area" is absent, so leave those rows untouched
      replace c = c + trim(substr(b, place+4, .)) if place > 0
      
      list, noobs clean
      It was easy for me to make up a toy data set to illustrate the code. But, in general, when asking for help with coding problems, it is wise to show example data of your own, as the solution often depends on details of the data, or, perhaps more often, the metadata, details that rarely fare well in verbal descriptions. The most helpful way to show example data is to use the -dataex- command, as I have done here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

      Added: Crossed with #2 who offers another good approach.

      Last edited by Clyde Schechter; 30 Dec 2023, 13:30.



      • #4
        Also, I notice you used a -while- loop here. I'm guessing you tried for a -for- loop first since you know the expected number of iterations, but you discovered that -for- is deprecated. Like I said in #2, loops are not often needed in day-to-day Stata use, and especially as a beginner with the language, if you think you need one, chances are you actually don't. All that said, you were probably looking for either the -foreach- or, in this case, the -forvalues- loop.
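
        For reference, here is a minimal sketch of both loop types. It uses the auto dataset that ships with Stata rather than your data, and the variable list is just an example:

        ```stata
        sysuse auto, clear

        * -forvalues- steps a local macro through a numeric range
        forvalues i = 1/3 {
            display "iteration `i'"
        }

        * -foreach- loops over a list, here a list of variables
        foreach v of varlist price mpg weight {
            quietly summarize `v'
            display "`v': mean = " r(mean)
        }
        ```

        In both cases the loop counter is a local macro, so it is referenced as `i' or `v' inside the braces.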

        Hope that helps!



        • #5
          I'll also add that one of the hardest things for people experienced in general programming languages like Python to get used to in Stata is that loops are much less often needed.
          That was certainly my experience. I have a CS background, a lot of my early training was in object oriented and imperative languages, and I was (and still am) very comfortable thinking through problems iteratively with loops. In undergrad, if you had asked me how to take a mean in Stata, I might have said something like:

          Code:
          local sum = 0
          forv i = 1/`=_N' {
              local sum = `sum' + variable[`i']
          }
          display `sum' / _N
          Part of that is pedagogical. I was aware even then that functions like mean() existed, but I was not generally allowed to use functions like that (especially in my first few classes) because I needed to understand and practice iteration. Fair enough, but I only really started to get over this particular bias when learning R, and I got better about it by reading forum posts here and learning Stata. In many of my early posts I still had a strong C accent, and even now I sometimes need to remind myself that, yes, I know immediately how to do this or that with a loop, but there is probably a cleaner vectorized approach that I should try to think through first.
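
          For comparison, here is a sketch of the loop above next to a vectorized equivalent. It assumes the auto dataset that ships with Stata, with price standing in for "variable":

          ```stata
          sysuse auto, clear

          * explicit loop over observations, as in the snippet above
          local sum = 0
          forvalues i = 1/`=_N' {
              local sum = `sum' + price[`i']
          }
          display `sum' / _N

          * vectorized equivalent: -summarize- leaves the mean in r(mean)
          summarize price, meanonly
          display r(mean)
          ```

          The two agree here because price has no missing values. With missing values the loop would return missing, while -summarize- skips them.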

          By the way, as generalist languages become more functional, loops are starting to disappear in favor of map() functions. Even Python shouldn't require a loop for this particular task in pandas.

          Edit: sum should probably be a local, not a scalar. I also forgot to wrap i in local macro quotes (`i').
          Last edited by Daniel Schaefer; 30 Dec 2023, 13:54.



          • #6
            Thank you so much for the suggestions and advice! I tried them out but I'm still not having any luck. This data was retrieved using a Python PDF reader, but the formatting of one of the pages is causing two of the variables to merge. The "Plan Area" and "County" variables are supposed to be separate. The majority of the data for "Plan Area" ends with the word "area". I'm having the same problem with the variables "FY 2019-20" and "FY 2020-21". How do I split the variables without affecting the rest of the data?


            Code:
            dataex B C D if C==""
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input str54 B str10 C str29 D
            "Plan Area County"                              "" "FY 2019-20 FY 2020-21"  
            "Waterfront UR Plan Area Hood River"            "" "65,052,866 N/A"         
            "Windmaster UR Plan Area Hood River"            "" "24,942,415 27,601,769"  
            "Medford City Center UR Plan Area Jackson"      "" "283,334,426 18,635,299" 
            "Talent UR Plan Area Jackson"                   "" "60,914,324 N/A"         
            "Jacksonville UR Plan Area Jackson"             "" "49,632,898 55,327,975"  
            "Phoenix UR Plan Area Jackson"                  "" "37,147,660 40,576,920"  
            "City Of Culver UR Plan Area Jefferson"         "" "3,812,155 5,371,135"    
            "Madras City UR Plan Area Jefferson"            "" "33,020,744 36,602,994"  
            "Madras Housing UR Plan Area Jefferson"         "" "N/A 540,420"            
            "Grants Pass Urban Renewal Plan Area Josephine" "" "89,219,863 106,606,434" 
            "Klamath Town Center UR Plan Area Klamath"      "" "12,436,310 12,726,900"  
            "Lakefront UR Plan Area Klamath"                "" "4,366,950 7,524,420"    
            "Spring Street UR Plan Area Klamath"            "" "1,223,024 2,581,420"    
            "Eugene Downtown UR Plan Area Lane"             "" "184,216,890 192,453,654"
            "Riverfront UR Plan Area Lane"                  "" "161,823,723 183,981,258"
            "Veneta Downtown UR Plan Area Lane"             "" "53,144,089 55,326,861"  
            "Coburg Industrial Park UR Plan Area Lane"      "" "28,520,812 28,792,287"  
            "Glenwood UR Plan Area Lane"                    "" "76,520,861 81,804,334"  
            "Springfield Downtown UR Plan Area Lane"        "" "60,026,538 64,598,973"  
            "Florence UR Plan Area Lane"                    "" "48,395,956 50,620,781"  
            "Creswell UR Plan Area Lane"                    "" "4,521 1,760,459"        
            "Waldport 2 UR Plan Area Lincoln"               "" "5,867,950 6,115,660"    
            "Lincoln City Yr2000 UR Plan Area Lincoln"      "" "56,204,456 57,958,188"  
            "Newport South Beach UR Plan Area Lincoln"      "" "169,296,249 162,646,589"
            "Mclean Point Plan Area Lincoln"                "" "2,704,270 2,721,760"    
            "Northside Plan Area Lincoln"                   "" "47,493,532 76,970,532"  
            "Yachats UR Plan Area Lincoln"                  "" "40,670,005 44,984,835"  
            "Depoe Bay Plan Area Lincoln"                   "" "27,222,940 28,566,490"  
            "NW Lebanon 2 UR Plan Area Linn"                "" "104,999,999 59,999,999" 
            "Cheadle Lake Lebanon 3 UR Plan Area Linn"      "" "25,631,124 27,451,752"  
            "North Gateway UR Plan Area Linn"               "" "55,520,925 59,870,211"  
            "Lebanon Downtown UR Plan Area Linn"            "" "80,225 797,389"         
            "Harrisburg UR Plan Area Linn"                  "" "28,320,161 32,563,113"  
            "Central Albany UR Plan Area Linn"              "" "246,939,463 276,449,354"
            "Mcgilchrist UR Plan Area Marion"               "" "63,226,321 68,576,956"  
            "Riverfront/Downtown UR Plan Area Marion"       "" "263,051,195 271,697,842"
            "Mill Creek UR Plan Area Marion"                "" "127,330,481 90,391,467" 
            "South Waterfront UR Plan Area Marion"          "" "29,170,507 30,813,298"  
            "North Gateway UR Plan Area Marion"             "" "256,495,856 271,436,276"
            "West Salem UR Plan Area Polk"                  "" "95,274,493 107,259,923" 
            "Woodburn UR Plan Area Marion"                  "" "50,226,653 49,394,832"  
            "Silverton UR Plan Area Marion"                 "" "49,697,293 60,750,515"  
            end



            • #7
              The majority of the data for "Plan Area" ends with the word "area".
              You say the majority here. If there are special cases, they will need to be accounted for separately. We need to follow some kind of pattern to parse the strings. Does something like this work?

              Code:
              drop if _n == 1
              split B, parse("Plan Area")
              split D
              rename B1 Plan_Area
              rename B2 County
              rename D1 FY_2019_20
              rename D2 FY_2020_21
              Edit: If not, please clearly tell us why not.
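
              One possible refinement, sketched on two rows copied from your example: the pieces that -split- returns may keep the blanks that surrounded "Plan Area", so trimming them afterwards is a cheap safeguard. Note that this sketch uses the grouped form of -rename- instead of the four separate commands above:

              ```stata
              clear
              input str54 B str29 D
              "Waterfront UR Plan Area Hood River" "65,052,866 N/A"
              "Windmaster UR Plan Area Hood River" "24,942,415 27,601,769"
              end

              split B, parse("Plan Area")
              split D
              rename (B1 B2 D1 D2) (Plan_Area County FY_2019_20 FY_2020_21)

              * remove any leading or trailing blanks left by the split
              replace Plan_Area = trim(Plan_Area)
              replace County = trim(County)

              list Plan_Area County FY_2019_20 FY_2020_21, noobs clean
              ```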
              Last edited by Daniel Schaefer; 30 Dec 2023, 14:24.



              • #8
                That worked perfectly, thanks! I guess I made that more complicated than it needed to be.
