Manipulate complex long strings - Drop everything after word "hectares" and keep number of hectares

Sandra Schafhautle

Join Date: Feb 2022

Posts: 6
#1

11 Feb 2022, 12:53

Dear all,
I have searched on the Forum for quite some time and tried different approaches to manipulating a quite messy and long string. I would appreciate any help in answering my question.

I have a variable labelled "infringement" that contains a lot of text (see two examples below):

infringement
Destruir (danificar, desmatar) florestas ou demais formas de vegetações consideradas de preservação permanente (áreas do art. 2º da Lei 4.771/65)
Ficam embargadas todas e quaisquer atividades em uma área 26,823 hectares, delimitada pelas coordenadas geográficas constantes no processo administrativo correspondente.

My question is: how can I extract only the number of hectares (26,823 in the second example) using Stata 17?
My thought was to drop everything from the word "hectares" onward and then keep the numeric value at the end of the remaining string, reading back to the preceding whitespace. Note that the number of digits can vary and that the number may contain a comma or a dot. I want the full number saved as a string because I intend to destring the variable in a separate step: although the comma should separate decimals in this dataset, the data are quite messy, and commas and dots appear to be used interchangeably.

I hope someone can help!

Thanks a lot.
Sandra
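The approach described in this post can be sketched without regular expressions, using Stata's string functions. This is only an illustration under stated assumptions: the variable is called inf and holds made-up test strings, and the helper variables pos and before are hypothetical names.

```stata
clear
input str80 inf
"atividades em uma área 26,823 hectares, delimitada pelas"
"nothing but 1,234.567 hectares"
"1 empty hectare"
end

* character position of the first "hectare"; 0 if the word is absent
generate pos = ustrpos(inf, "hectare")
* everything before that position, with surrounding blanks removed
generate before = strtrim(usubstr(inf, 1, pos - 1)) if pos > 0
* last space-delimited word of what remains
generate area_s = word(before, -1) if pos > 0
list inf area_s
```

In the third observation this returns the word "empty" rather than a number, so the result would still need a check that it contains only digits, dots, and commas.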

William Lisowski

Join Date: Dec 2014
Posts: 10150

11 Feb 2022, 13:26

Here is one approach that uses Stata's regular expression functions to locate the first string of digits, dots, and commas followed by a space and the characters "hectare" and return the digits, dots, and commas as the result.

Code:

input str80 inf
"atividades em uma área 26,823 hectares, delimitada pelas"
"nothing but 1,234.567 hectares"
"1 empty hectare"
"3 hectares and 4 hectares"
end
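* capture group 1: a run of digits, dots, and commas (possibly empty)
* that is followed by one space and the characters "hectare"
generate area_s = ustrregexs(1) if ustrregexm(inf, "([\d\.,]*) hectare")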
list

Code:

     +----------------------------------------------------------------------+
     |                                                      inf      area_s |
     |----------------------------------------------------------------------|
  1. | atividades em uma área 26,823 hectares, delimitada pelas      26,823 |
  2. |                           nothing but 1,234.567 hectares   1,234.567 |
  3. |                                          1 empty hectare             |
  4. |                                3 hectares and 4 hectares           3 |
     +----------------------------------------------------------------------+
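Because #1 mentions that commas and dots are used interchangeably, a later conversion to a numeric variable needs some rule for telling decimal marks from grouping marks. The following is a minimal sketch under one loudly stated assumption: the last comma or dot in the string is the decimal mark and any earlier ones are grouping marks. That rule reads "1,234" as 1.234, so it is a judgment call, not a property of the data. The variable names clean, lastsep, intpart, decpart, and area are hypothetical.

```stata
clear
input str12 area_s
"26,823"
"1,234.567"
"1.234,5"
"3"
""
end

* ASSUMPTION: the last separator is the decimal mark, earlier ones are grouping marks
* step 1: turn every comma into a dot so only one separator character remains
generate clean = subinstr(area_s, ",", ".", .)
* step 2: position of the last separator; 0 if there is none
generate lastsep = strrpos(clean, ".")
* step 3: strip grouping marks from the integer part and keep the decimal part
generate intpart = subinstr(substr(clean, 1, lastsep - 1), ".", "", .) if lastsep > 0
generate decpart = substr(clean, lastsep + 1, .) if lastsep > 0
replace clean = intpart + "." + decpart if lastsep > 0
* step 4: convert; real() returns missing for strings that are not numbers
generate double area = real(clean)
list area_s area
```

Inspecting the observations where the rule changes the value by a factor of 1,000 would be a sensible check before relying on the result.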


Sandra Schafhautle

Join Date: Feb 2022

Posts: 6
#3

11 Feb 2022, 17:21

Thank you!
Sandra Schafhautle

Join Date: Feb 2022

Posts: 6
#4

16 Feb 2022, 15:05

I just quickly want to come back to this. Again, thanks to William, the answer was fantastic.

For anyone reading this and having similar questions, I found the following link extremely helpful (when you are new to regular expressions etc...): https://medium.com/the-stata-guide/r...a-6e5c200ef27c
Girish Venkataraman

Join Date: Dec 2021

Posts: 281
#5

16 Feb 2022, 16:22

Originally posted by Sandra Schafhautle:

I just quickly want to come back to this. Again, thanks to William, the answer was fantastic.

For anyone reading this and having similar questions, I found the following link extremely helpful (when you are new to regular expressions etc...): https://medium.com/the-stata-guide/r...a-6e5c200ef27c

Yes, I love that cheat sheet he provided. Very detailed tutorial indeed.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

16 Feb 2022, 17:43

Sandra Schafhautle -

Thank you for providing that link to the work from Asjad Naqvi, a member here. I am glad to see a tutorial built around the more powerful Unicode regular expression engine implemented in Stata 14. My thanks to the author. A small but growing group of Statalist members are evangelizing the power of regular expressions, especially in situations like this one. But of course, as with every similarly powerful tool, it's sometimes possible to produce a "write-only" regular expression, meaning that when you return to the program a month later and try to read and remember what it does, that is no longer possible. :-)

I should have included in post #2 my usual regular expression technical background information.

To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's Unicode regular expression parser is the ICU regular expression engine documented at https://unicode-org.github.io/icu/us...gs/regexp.html. A comprehensive discussion of regular expressions can be found at https://www.regular-expressions.info/unicode.html.

I will be adding the link from post #4 to future posts of this information.
Asjad Naqvi

Join Date: Oct 2014

Posts: 93
#7

17 Feb 2022, 02:18

Since I was tagged, I would just like to briefly say that one typically resorts to regular expressions in really messy circumstances. And if one is in such circumstances often, then it really pays off to learn the logic of constructing regular expressions. Otherwise it is hard to find expressions that match your particular and very specific application. Also, as William Lisowski suggested, leave some notes on your regex searches before you forget what they do!
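One way to leave such notes is to build the pattern from named pieces, each with its own comment. This is only a sketch: the local macro names number and unit are hypothetical, the test strings are made up, and the pattern uses + where post #2 uses *, so it requires at least one digit, dot, or comma.

```stata
clear
input str80 inf
"atividades em uma área 26,823 hectares, delimitada pelas"
"1 empty hectare"
end

* capture group 1: one or more digits, dots, or commas
local number "([\d.,]+)"
* one space, then the word stem, so both "hectare" and "hectares" match
local unit " hectare"

* the assembled pattern is ([\d.,]+) hectare
generate area_s = ustrregexs(1) if ustrregexm(inf, "`number'`unit'")
list
```

The second observation has no number directly before " hectare", so area_s stays empty there.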

Sandra Schafhautle your example is also a good case of incorporating a "look back" search. That is, you look for the word hectare and then find the number that exists before it. I don't discuss look back and look forward searches in my guide (still pending!) but the brilliant Hua Peng has a short blog post on it:

https://huapeng01016.github.io/blogs/2021-09-12-dyntext
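As a rough illustration of this kind of search, the number can be matched on its own while the pattern merely asserts that " hectare" follows. In ICU's regular expression syntax the construct (?=...) used for that is called a look-ahead assertion. The sketch below uses made-up test strings; it is not taken from the blog post above.

```stata
clear
input str80 inf
"atividades em uma área 26,823 hectares, delimitada pelas"
"nothing but 1,234.567 hectares"
"1 empty hectare"
"3 hectares and 4 hectares"
end

* [\d.,]+ matches the number itself
* (?= hectare) requires " hectare" to follow without making it part of the match
* ustrregexs(0) returns the whole match, so no capture group is needed
generate area_s = ustrregexs(0) if ustrregexm(inf, "[\d.,]+(?= hectare)")
list
```

As in post #2, only the first match in each string is returned, and the third observation yields an empty result.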
Sandra Schafhautle

Join Date: Feb 2022

Posts: 6
#8

17 Feb 2022, 07:10

Very cool. Thanks everybody for the additional input!