  • changing numbers that surround decimals

    I have 18,000 observations in a variable, which I have cleaned down to 60 left to address.
    They are supposed to be pH values, normal format: 7.40

    The trouble is, many say:
    4.42, when it should be 7.42
    7036, when it should be 7.036
    696, when it should be 6.96
    .701, when it should be 7.01

    For the first example, I need to replace any non-7 integer ahead of the decimal point with a 7, for example a 4.
    I attempted
    Code:
    replace var=7.* if var==4.*
    This errored "cannot find if." I am assuming you cannot use this expression on observations, but only variables? It seems * doesn't work in observations. So is there a way to do similar work on observations? I cannot think of anything.

    For the last 3 examples, I simply need to move the decimal point, which I could do by multiplying or dividing by 10, 100, etc. However, must I adjust each one manually, or can I somehow indicate that each observation "formatted" as ****, *** or .*** should be replaced by *.***? This would need to change the numbers' true values, not just the display format in the variable properties.

    Thank you.
    Last edited by Leonard Scott; 15 Jan 2021, 00:47.

  • #2
    This can probably be done using numerical functions but I think it's easier to look for these patterns by treating it as a string variable and using regular expressions.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float myvar
    4.42
    7036
    696
    .701
    end
    
    // Case 1
    replace myvar = real("7" + regexs(1)) if regexm(string(myvar), "^[0-9](\.[0-9]+)$")
    
    // Case 2 and 3
    replace myvar = real(regexs(1) + "." + regexs(2)) if regexm(string(myvar), "^([0-9])([0-9][0-9][0-9]?)$")
    
    // Case 4
    replace myvar = real(regexs(1) + "." + regexs(2)) if regexm(string(myvar), "^\.([0-9])([0-9]*)$")
    list
    
         +-------+
         | myvar |
         |-------|
      1. |  7.42 |
      2. | 7.036 |
      3. |  6.96 |
      4. |  7.01 |
         +-------+
    Last edited by Wouter Wakker; 15 Jan 2021, 02:13.
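
    The numerical route mentioned above could look roughly like the sketch below. It is an illustration only, under assumptions that are not in the thread: every valid value lies between 1 and 10, and the cutoffs (100, 1000, 10000) are guesses based on the four example values. Arithmetic on float values can leave tiny rounding differences that the string approach avoids, so no listing is shown.

    ```stata
    clear
    input float myvar
    4.42
    7036
    696
    .701
    end

    // Case 1: swap a wrong leading digit for 7, keeping the fractional part.
    // Run this first. Like the regex version, it would also turn an
    // existing 6.x into 7.x, so tighten the condition if 6.x can be valid.
    replace myvar = 7 + mod(myvar, 1) if myvar >= 1 & myvar < 10 & floor(myvar) != 7

    // Cases 2 and 3: the decimal point is missing (assumed cutoffs)
    replace myvar = myvar / 1000 if myvar >= 1000 & myvar < 10000
    replace myvar = myvar / 100 if myvar >= 100 & myvar < 1000

    // Case 4: the leading digit sits behind the decimal point
    replace myvar = myvar * 10 if myvar > 0 & myvar < 1
    list
    ```

    The order matters here: the case 1 rule has to run before 696 becomes 6.96, otherwise that value would be pushed to 7.96.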


    • #3
      You seem to know what the replacement rule is, so -recode- should be sufficient here.
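
      A minimal sketch of what a -recode- call might look like, assuming each of the remaining values is mapped by hand. The variable name myvar_fixed and the two rules are made up for illustration; only whole-number source values are used, to stay clear of float precision when matching.

      ```stata
      clear
      input float myvar
      7036
      696
      end

      // One rule per bad value; generate() keeps the original variable intact
      recode myvar (7036 = 7.036) (696 = 6.96), generate(myvar_fixed)
      list
      ```

      With about 60 values left this stays manageable, but every distinct bad value needs its own rule.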


      • #4
        Oddly I still remember measuring the acidity of a soil in 1968 or so as 4.2. (Not confident about the .2, but the 4, yes.) In context you might be clear that 4 is unacceptable, but it's a perfectly plausible pH measurement in nature.


        • #5
          Originally posted by Nick Cox
          Oddly I still remember measuring the acidity of a soil in 1968 or so as 4.2. (Not confident about the .2, but the 4, yes.) In context you might be clear that 4 is unacceptable, but it's a perfectly plausible pH measurement in nature.
          pH of Humans! Normal is 7.40.
          4.00 wouldn't be compatible with life. However, "dust to dust..."


          • #6
            Thank you all--I am going to evaluate the suggestions and see how it plays out and any ongoing questions.


            • #7
              Interpreting the numbers as strings has worked well for this--to move decimal positions, etc. Thank you!

              I have another question, though, regarding regular expressions, as something is not clear:
              As an example, say I am trying to isolate only 3 character strings (they're actually numbers, but say they are coded as strings for simplicity), beginning with 6 or 7.
              If I try to isolate them using:
              Code:
              list if regexm(string(ph),"(^[6-7][0-9][0-9]$)")
              it will not find them.

              However, if I remove the "$" anchor, it will:
              Code:
              list if regexm(string(ph),"(^[6-7][0-9][0-9])")
              Why is this so? Couldn't I specify between the 2 anchors to say this is the beginning and this is the end, and ensure I have a 3 character string?
              With only the beginning anchor "^", I'm afraid it will match the first 3 characters but also accept strings of any length after that, say 4 or 5 digit numbers.

              Thank you.


              • #8
                This works for me.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input float myvar
                4.42
                7036
                 696
                .701
                end
                
                list if regexm(string(myvar),"(^[6-7][0-9][0-9]$)")
                Res.:

                Code:
                . list if regexm(string(myvar),"(^[6-7][0-9][0-9]$)")
                
                     +-------+
                     | myvar |
                     |-------|
                  3. |   696 |
                     +-------+
                More generally, there is nothing special about a regular expression condition. You can combine it with an extra condition.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input float myvar
                4.42
                7036
                 696
                .701
                end
                
                list if ustrregexm(string(myvar),"(^[6-7][0-9]{2})") & length(string(myvar))==3
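
                Since the variable is numeric, the same selection can probably be made without strings at all. The sketch below assumes "3 characters beginning with 6 or 7" means a whole number from 600 to 799; that reading is an assumption, not something stated in the question.

                ```stata
                clear
                input float myvar
                4.42
                7036
                 696
                .701
                end

                // Whole numbers between 600 and 799 are exactly the 3-digit values starting with 6 or 7
                list if inrange(myvar, 600, 799) & myvar == floor(myvar)
                ```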

                If I try to isolate them using:
                Code:
                list if regexm(string(ph),"(^[6-7][0-9][0-9]$)")
                it will not find them.
                This could be due to leading spaces, although unlikely if your original variable is numeric. You can eliminate these as well as trailing spaces using the -trim()- function.

                Code:
                list if regexm(trim(string(ph)),"(^[6-7][0-9][0-9]$)")
                Last edited by Andrew Musau; 18 Jan 2021, 12:17.
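
                To see why the "$" anchor might reject a value, it can help to look at the string that string() actually produces and at its length. The sketch below uses made-up data; the variable names ph_str and ph_len and the value 696.5 are hypothetical, chosen only to show one way a value can pass the pattern without "$" and fail the pattern with it.

                ```stata
                clear
                input float ph
                696
                7036
                696.5
                7.4
                end

                // Keep the string form and its length for inspection
                generate str20 ph_str = string(ph)
                generate ph_len = strlen(ph_str)

                // Everything the unanchored pattern accepts, with its length
                list ph ph_str ph_len if regexm(ph_str, "^[6-7][0-9][0-9]")

                // Only the values that are exactly three digits long
                list ph ph_str ph_len if regexm(ph_str, "^[6-7][0-9][0-9]$")
                ```

                Any row that appears in the first listing but not in the second has something after the third digit, which ph_len and ph_str should make visible.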
