  • Manipulate complex long strings - Drop everything after word "hectares" and keep number of hectares

    Dear all,
    I have searched on the Forum for quite some time and tried different approaches to manipulating a quite messy and long string. I would appreciate any help in answering my question.

    I have a variable labelled "infringement" that contains a lot of text (see two examples below):

    infringement
    Destruir (danificar, desmatar) florestas ou demais formas de vegetações consideradas de preservação permanente (áreas do art. 2º da Lei 4.771/65)
    Ficam embargadas todas e quaisquer atividades em uma área 26,823 hectares, delimitada pelas coordenadas geográficas constantes no processo administrativo correspondente.

    My question is: how can I extract only the number of hectares (26,823 in the second example, highlighted in red in the original post)?
    My thought was to drop everything from the word "hectares" onward and then keep the numerical value at the end of the remaining string, reading back to the preceding whitespace. Note that the number of digits can vary and that the number might contain a comma or a dot. I want the full number saved as a string, as I intend to destring the variable separately afterwards (although the comma should separate decimals in this dataset, the data are quite messy, and commas and dots appear to be used interchangeably).

    I hope someone can help!

    Thanks a lot.
    Sandra

  • #2
    Welcome to Statalist.

    You have accidentally posted your topic in Statalist's Mata Forum, which is used for discussions of Stata's Mata language, which is different than Stata's command language. Your question will see a more appropriate and much larger audience if you post it in Statalist's General Forum.

    Also, if you have not already done so, take a look at the Statalist FAQ linked to at the top of this page for posting guidelines and suggestions.


    • #3
      Oh! Thank you. I will repost this in the General Forum.


      • #4
        I could not find the question in the General Forum, so I am posting a solution here.
        Try regular expressions:
        Code:
        mata regexm(txt,"[0-9,]+ hectares")
        Kind regards

        nhb
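
        The one-liner above only reports whether a match exists (1 or 0) and does not allow for dots inside the number. A minimal sketch of how the matched number might be pulled out in Stata's command language follows. It assumes the string variable is named infringement as in post #1, that the number is immediately followed by optional spaces and the lowercase word "hectares", and it uses hectares_str as a hypothetical name for the new variable. It is an illustration under those assumptions, not the solution given in either forum thread.

        ```stata
        * Sketch only. Assumes a string variable named infringement, as in post #1;
        * hectares_str is a hypothetical name for the new variable.
        * The pattern captures a run of digits, dots, and commas that starts with a
        * digit and is followed by optional spaces and the word "hectares".
        generate hectares_str = regexs(1) if regexm(infringement, "([0-9][0-9.,]*) *hectares")

        * Inspect the observations where a number was found.
        list hectares_str if hectares_str != ""
        ```

        Observations without a match, such as the first example in post #1, are left as empty strings. The captured value stays a string (26,823 for the second example), so the mixed comma and dot separators can be cleaned up before any later destring step. Variants such as "Hectares" or "ha" would need the pattern to be extended.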


        • #5
          The question can be found in the General Forum at

          https://www.statalist.org/forums/for...er-of-hectares

          where I too recommended a solution using regular expressions. I'm pleased to see agreement that regular expressions are likely to be the tool of choice. :-)
