  • Trouble identifying lowercase-uppercase regular expression

    I have a string variable, chosen_list, that concatenates two strings with proper-noun capitalization (an uppercase letter followed by all lowercase), so at the point where the two strings are joined there is a lowercase letter followed by an uppercase one.

    In order to split them I'm trying to replace the lowercase-uppercase sequence with another character ("/" in this case), but regexm() and regexr() are not picking it up - when I type "gen chosenpta=regexr(chosen_list, "[a-z][A-Z]", "/")" it does nothing.

    Am I doing something wrong in this expression, and if so, what?

    Thanks,
    Zach Groff

  • #2
    Seems sound to me

    Code:
    . di regexr("FrogToad", "[a-z][A-Z]", "/")
    Fro/oad
    
    . set obs 3
    number of observations (_N) was 0, now 3
    
    . gen test = "FrogToad" in 1
    (2 missing values generated)
    
    . replace test = "HungarianHorntail" in 2
    variable test was str8 now str17
    (1 real change made)
    
    . replace test = "ChineseFireball" in 3
    (1 real change made)
    
    . gen test2 = regexr(test, "[a-z][A-Z]", "/")
    
    . l
    
         +--------------------------------------+
         |              test              test2 |
         |--------------------------------------|
      1. |          FrogToad            Fro/oad |
      2. | HungarianHorntail   Hungaria/orntail |
      3. |   ChineseFireball     Chines/ireball |
         +--------------------------------------+

    Hence we need to see the precise circumstances behind the claim of "does nothing".
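
    Note that in the listing above regexr() replaces both matched letters, so "FrogToad" becomes "Fro/oad". If the aim is to keep the letters and only insert a separator, one possible sketch uses regexm() with subexpressions and regexs(). The variable name test3 and the pattern are illustrative assumptions, not something from the original post, and the pattern only splits at the last lowercase-uppercase boundary in each string:

    ```stata
    clear
    set obs 3
    gen test = "FrogToad" in 1
    replace test = "HungarianHorntail" in 2
    replace test = "ChineseFireball" in 3

    * regexm() stores the two parenthesised pieces; regexs(1) and regexs(2)
    * return them, so no letter is lost when "/" is placed between them.
    * test3 (hypothetical name) stays empty where the pattern does not match.
    gen test3 = regexs(1) + "/" + regexs(2) if regexm(test, "^(.*[a-z])([A-Z].*)$")

    list test test3
    ```

    With the toy data this should give "Frog/Toad" for the first observation. Strings with more than one boundary would need repeated passes or a tool such as moss.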


    • #3
      Okay, it is specific to this variable. I think what is going on has to do with something else: while this variable is nonmissing for all observations (as I can see in the spreadsheet or by using tab), when I open up the data browser and click on this variable for any observation the "Value" it shows (in the bar on top of the window) is blank.

      I've seen this happen before, but I don't remember the cause. Do you know why this happens and how to fix it?


      • #4
        Presumably, you want to preserve the lowercase/uppercase letters when you split the string. You can use moss (from SSC) to match capitalized words:

        Code:
        clear
        input str20 chosen_list
        "ZachGroff"
        "NickCox"
        "RobertPicard"
        "Statalist"
        end
        
        moss chosen_list, match("([A-Z][^A-Z]*)") regex

        and the results:
        Code:
        . list
        
             +-------------------------------------------------------------+
             |  chosen_list   _count     _match1   _pos1   _match2   _pos2 |
             |-------------------------------------------------------------|
          1. |    ZachGroff        2        Zach       1     Groff       5 |
          2. |      NickCox        2        Nick       1       Cox       5 |
          3. | RobertPicard        2      Robert       1    Picard       7 |
          4. |    Statalist        1   Statalist       1                 . |
             +-------------------------------------------------------------+


        • #5
          The toy dataset I created satisfies "non-missing for all observations" and it looks fine in the Data Editor.

          Code:
           
          clear 
          set obs 3
          gen test = "FrogToad" in 1
          replace test = "HungarianHorntail" in 2
          replace test = "ChineseFireball" in 3
          gen test2 = regexr(test, "[a-z][A-Z]", "/")
          edit

          Again, you don't provide a reproducible example in terms of any code we can run or a dataset we can try out.

          Sorry, but I don't recognise your error report as something that I've experienced or that makes sense otherwise.


          • #6
            Here is a dataset with just the variable I'm using. When I open it, if I look in the browser, all observations appear nonmissing, but the Value bar at the top of the browser shows nothing as if it were missing: Statalist_example.dta


            • #7
              At a quick glance, the recipe in #1 and #2 doesn't work because you have spaces between words too!

              You probably need some clean up first, e.g. "Site selection"/"Site Selection".

              I think this is closer to what you want (see #4 for explanation of moss)

              Code:
              moss chosen_list, regex match("([A-Z][a-z  &]*)")
              Last edited by Nick Cox; 06 Jun 2016, 12:28.


              • #8
                Do regular expressions not work with spaces? I've used them before and thought they worked. I'm in the process of cleaning, but I don't see why that clean up ("Site Selection"->"Site selection") needs to happen first since the space is not a lowercase letter, no?


                • #9
                  Interesting. You managed to get newline characters into a Stata variable. When I try to view your dataset in the Browser, all values appear missing. If I click in the edit field at the top, I can scroll down and see the text. A simple solution is to bulk remove these and then use moss, as described in #4:

                  Code:
                  use "Statalist_example.dta", clear
                  
                  gen s = subinstr(chosen_list, char(10)," ",.)
                  moss s, match("([A-Z][^A-Z]*)") regex
                  
                  list s _match* in 1/2, string(30)

                  and the results:
                  Code:
                  . list s _match* in 1/2, string(30)
                  
                       +-------------------------------------------------------------------------------------------------------------------------+
                       | s                                  _match1                                          _match2           _match3   _match4 |
                       |-------------------------------------------------------------------------------------------------------------------------|
                    1. |  Site Selection Weed management    Site                                          Selection    Weed management           |
                    2. |  Mulching & organic fertilizer..   Mulching & organic fertilizer ..   Using certified seeds                             |
                       +-------------------------------------------------------------------------------------------------------------------------+
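
                  To check whether a string variable contains hidden line feeds before cleaning it, something like the following may help. It is a sketch on made-up data: the variable names s, has_lf and hit are hypothetical, and it assumes the line feed sits exactly where the two strings were joined, which would explain why "[a-z][A-Z]" found nothing to replace.

                  ```stata
                  clear
                  set obs 2
                  * observation 1 has a line feed, char(10), at the join
                  gen s = "Site selection" + char(10) + "Weed management" in 1
                  replace s = "FrogToad" in 2

                  * flag values that contain a line feed
                  gen byte has_lf = strpos(s, char(10)) > 0

                  * the lowercase-uppercase pattern only matches where the two
                  * letters are adjacent, so it misses observation 1
                  gen byte hit = regexm(s, "[a-z][A-Z]")

                  list has_lf hit
                  ```

                  On the real data, `count if strpos(chosen_list, char(10)) > 0` would report how many observations are affected.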


                  • #10
                    Thank you, Robert - that worked!


                    • #11
                      As an answer to #8:

                      1. Regular expressions work with spaces as you wish if you specify spaces, somehow or other, when they are part of what you are looking for. That was what you weren't doing in #1. The example in #2 was intended partly as a hint that you probably didn't mean to do what your regular expression would do.

                      2. The solutions of #7 and #9 both hinge on looking for an upper case letter to start each string. It follows that "Site Selection" will be parsed as "Site" "Selection" and "Site selection" as itself. That's explicit in the very first observation of #9 output. I don't think that is what you want, which is why I suggested prior clean-up.
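
                      Point 1 can be illustrated with a few one-liners. This is only a sketch on literal strings, not the poster's data; the patterns with a space are illustrative choices:

                      ```stata
                      * 0: a space separates the lowercase and the uppercase letter
                      di regexm("Site Selection", "[a-z][A-Z]")

                      * 1: the space is written into the pattern
                      di regexm("Site Selection", "[a-z] [A-Z]")

                      * 1: " ?" makes the space optional, so both forms match
                      di regexm("SiteSelection", "[a-z] ?[A-Z]")
                      ```

                      The same idea extends to other separators, provided each one is named in the pattern.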
