Trouble identifying lowercase-uppercase regular expression

Zach Groff

Join Date: Jun 2016

Posts: 5
#1

06 Jun 2016, 09:03

I have a string variable, chosen_list, that concatenates two strings with proper noun capitalization (an uppercase letter followed by all lowercase), so where the two strings are joined there is a lowercase letter followed by an uppercase one.

To split them I'm trying to replace the lowercase-uppercase sequence with another character ("/" in this case), but regexm() and regexr() are not picking it up. When I type gen chosenpta = regexr(chosen_list, "[a-z][A-Z]", "/") the new variable is identical to the original.

Am I doing something wrong in this expression, and if so, what?

Thanks,
Zach Groff

Nick Cox

Join Date: Mar 2014
Posts: 35698

06 Jun 2016, 10:37

Seems sound to me

Code:

. di regexr("FrogToad", "[a-z][A-Z]", "/")
Fro/oad

. set obs 3
number of observations (_N) was 0, now 3

. gen test = "FrogToad" in 1
(2 missing values generated)

. replace test = "HungarianHorntail" in 2
variable test was str8 now str17
(1 real change made)

. replace test = "ChineseFireball" in 3
(1 real change made)

. gen test2 = regexr(test, "[a-z][A-Z]", "/")

. l

     +--------------------------------------+
     |              test              test2 |
     |--------------------------------------|
  1. |          FrogToad            Fro/oad |
  2. | HungarianHorntail   Hungaria/orntail |
  3. |   ChineseFireball     Chines/ireball |
     +--------------------------------------+

Hence we need to see the precise circumstances behind the claim of "does nothing".


Zach Groff

Join Date: Jun 2016

Posts: 5
#3

06 Jun 2016, 11:48

Okay, it is specific to this variable. I think what is going on has to do with something else. The variable is nonmissing for all observations (as I can see in the spreadsheet or by using tab), but when I open the data browser and click on this variable for any observation, the "Value" shown in the bar at the top of the window is blank.

I've seen this happen before, but I don't remember the cause. Do you know why this happens and how to fix it?

Robert Picard

Join Date: Mar 2014
Posts: 1536

06 Jun 2016, 11:57

Presumably, you want to preserve the lowercase/uppercase letters when you split the string. You can use moss (from SSC) to match capitalized words:

Code:

clear
input str20 chosen_list
"ZachGroff"
"NickCox"
"RobertPicard"
"Statalist"
end

moss chosen_list, match("([A-Z][^A-Z]*)") regex

and the results:

Code:

. list

     +-------------------------------------------------------------+
     |  chosen_list   _count     _match1   _pos1   _match2   _pos2 |
     |-------------------------------------------------------------|
  1. |    ZachGroff        2        Zach       1     Groff       5 |
  2. |      NickCox        2        Nick       1       Cox       5 |
  3. | RobertPicard        2      Robert       1    Picard       7 |
  4. |    Statalist        1   Statalist       1                 . |
     +-------------------------------------------------------------+
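
For readers on Stata 14 or later, here is a minimal sketch of another way to keep both letters, assuming the Unicode regular-expression function ustrregexra() is available. The lookbehind/lookahead pattern matches the empty position between a lowercase letter and the uppercase letter after it, so neither letter is consumed. The variable name with_slash is made up for illustration.

Code:

clear
input str20 chosen_list
"ZachGroff"
"NickCox"
"RobertPicard"
"Statalist"
end

* insert "/" at each lowercase-to-uppercase boundary without removing either letter
gen with_slash = ustrregexra(chosen_list, "(?<=[a-z])(?=[A-Z])", "/")
list

Unlike regexr(), which replaces only the first match, ustrregexra() replaces every match, so a value with more than two capitalized parts would get a "/" at each boundary.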


Nick Cox

Join Date: Mar 2014

Posts: 35698
#5

06 Jun 2016, 11:58

The toy dataset I created satisfies "non-missing for all observations" and it looks fine in the Data Editor.

Code:

clear
set obs 3
gen test = "FrogToad" in 1
replace test = "HungarianHorntail" in 2
replace test = "ChineseFireball" in 3
gen test2 = regexr(test, "[a-z][A-Z]", "/")
edit

Again, you don't provide a reproducible example in terms of any code we can run or a dataset we can try out.

Sorry, but I don't recognise your error report as something that I've experienced or that makes sense otherwise.
Zach Groff

Join Date: Jun 2016

Posts: 5
#6

06 Jun 2016, 12:05

Here is a dataset with just the variable I'm using. When I open it, if I look in the browser, all observations appear nonmissing, but the Value bar at the top of the browser shows nothing as if it were missing: Statalist_example.dta
Nick Cox

Join Date: Mar 2014

Posts: 35698
#7

06 Jun 2016, 12:07

At a quick glance, the recipe in #1 and #2 doesn't work because you have spaces between words too!

You probably need some clean up first, e.g. "Site selection"/"Site Selection".

I think this is closer to what you want (see #4 for an explanation of moss)

Code:

moss chosen_list, regex match("([A-Z][a-z &]*)")

Last edited by Nick Cox; 06 Jun 2016, 12:28.
Zach Groff

Join Date: Jun 2016

Posts: 5
#8

06 Jun 2016, 12:31

Do regular expressions not work with spaces? I've used them before and thought they worked. I'm in the process of cleaning, but I don't see why that clean up ("Site Selection"->"Site selection") needs to happen first since the space is not a lowercase letter, no?

Robert Picard

Join Date: Mar 2014
Posts: 1536

06 Jun 2016, 12:31

Interesting. You managed to get newline characters into a Stata variable. When I try to view your dataset in the Browser, all values appear missing. If I click in the edit field at the top, I can scroll down and see the text. A simple solution is to bulk remove these and then use moss, as described in #4:

Code:

use "Statalist_example.dta", clear

gen s = subinstr(chosen_list, char(10), " ", .)
moss s, match("([A-Z][^A-Z]*)") regex

list s _match* in 1/2, string(30)

and the results:

Code:

. list s _match* in 1/2, string(30)

     +-------------------------------------------------------------------------------------------------------------------------+
     | s                                  _match1                                          _match2           _match3   _match4 |
     |-------------------------------------------------------------------------------------------------------------------------|
  1. |  Site Selection Weed management    Site                                          Selection    Weed management           |
  2. |  Mulching & organic fertilizer..   Mulching & organic fertilizer ..   Using certified seeds                             |
     +-------------------------------------------------------------------------------------------------------------------------+
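
To check whether a string variable contains embedded line breaks before reaching for regular expressions, something along these lines may help. It is only a sketch: has_lf and has_cr are made-up names, and checking char(13) as well as char(10) is an assumption, in case the text was entered on a system that uses carriage returns.

Code:

use "Statalist_example.dta", clear

* flag observations containing a line feed (10) or a carriage return (13)
gen byte has_lf = strpos(chosen_list, char(10)) > 0
gen byte has_cr = strpos(chosen_list, char(13)) > 0
tab has_lf has_cr

A line break sitting between a lowercase letter and the next uppercase letter would explain why "[a-z][A-Z]" in #1 found nothing to replace: the two letters are never adjacent.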


Zach Groff

Join Date: Jun 2016

Posts: 5
#10

06 Jun 2016, 13:35

Thank you, Robert - that worked!
Nick Cox

Join Date: Mar 2014

Posts: 35698
#11

06 Jun 2016, 14:17

As an answer to #8:

1. Regular expressions work with spaces as you wish if you specify spaces, somehow or other, when they are part of what you are looking for. That was what you weren't doing in #1. The example in #2 was intended partly as a hint that you probably didn't mean to do what your regular expression would do.

2. The solutions of #7 and #9 both hinge on looking for an upper case letter to start each string. It follows that "Site Selection" will be parsed as "Site" "Selection" and "Site selection" as itself. That's explicit in the very first observation of #9 output. I don't think that is what you want, which is why I suggested prior clean-up.
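
A sketch of the kind of prior clean-up meant in point 2, starting from the posted dataset and the newline-free variable s built as in #9. The phrase shown is the one example mentioned in this thread; any other multi-word items with internal capitals would have to be handled the same way.

Code:

use "Statalist_example.dta", clear
gen s = subinstr(chosen_list, char(10), " ", .)

* lowercase the second word so the phrase is not split at "Selection"
replace s = subinstr(s, "Site Selection", "Site selection", .)

* then split at each remaining uppercase letter (moss is from SSC)
moss s, match("([A-Z][^A-Z]*)") regex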