Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Replacing specific observations of a string variable before destringing.

    Hi,

    I'm new here so hope I am not repeating something that has already been asked. I searched for similar posts (and have been scouring Google) but couldn't find an answer. I am a newbie to Stata, so please bear with me as I'm learning.

    I have a dataset with a variable for a blood biomarker. Most of the observations are absolute concentration of this biomarker, but some of the measurements were below the lower detection limit of the measurement method and have been assigned as "<5" (umol/L), so the variable is a string variable. I am trying to replace "<5" with something like "-1", so that I can destring the variable and know that everything with "-1" was below the detection limit. I generated a new variable (var) identical to the original to test a few things out.

    I tried the following but it didn't work:

    replace var = "-1" if var == "<5"
    (0 real changes made)

    I then tried the following, but it replaced all of the "<5" values with "."

    destring var, force replace

    I considered trying to remove the "<" symbol from the variable, then destringing and replacing the remaining "5"s, but there are some observations that are exactly 5 umol/L, so this won't work.

    Does anyone know if there is a way to replace these observations collectively, or even a way to identify all instances of "<5" within the variable easily so that I can replace these one by one before destringing? I tried the 'list' command to find each instance of "<5" appearing for var, but this command doesn't seem to work for string variables. I noticed that the following does work, and will put "-1" in at row 106.

    replace var = -1 in 106

    So if I can't replace these collectively, I could do it one by one if I can find out where they all are.

    I hope that makes sense, apologies if anything is unclear, and thank you for any help!
    Jamie

  • #2
    Hi Jamie, what's the output you get if you do -tab var-?
    Last edited by George Taylor; 20 Aug 2024, 08:10.

    Comment


    • #3
      Welcome!

      It probably means that they didn't enter "<5" but something like " <5" or "<5 ".

      Try this first and do the replace again:

      Code:
      replace var = trim(var)
      replace var = "-1" if var == "<5"
      Let us know how that works out.

      Comment


      • #4
        Thank you George and Ken!

        I managed to tabulate the variable and I could see the total number of "<5" in the variable, which was very useful.

        Ken's suggestion worked like a charm! I feel a little silly that I didn't think of this, but I'll know in future. A very easy solution to a problem I've been trying to fix for the past 2 hours....

        Thanks again - very much appreciated!!

        Comment


        • #5
          FWIW, Jamie, I've worked with data from several different sectors over my career. Health care data is the poorest quality, and within health data, microbiology labs are the poorest quality, probably because it is mostly, as in your case, free-form text. This kind of data should always be viewed with suspicion and subjected to extensive data cleaning and internal consistency checks before analyzing it. Free-form text variables, in any context, are always difficult to work with. And unless capitalization and spacing are actually informative (a very rare situation) it probably makes sense to routinely do something like
          Code:
          ds, has(type string)
          foreach v in `r(varlist)' {
              replace `v' = trim(itrim(lower(`v'))
          }
          as one of your first steps in data management. Of course, you will still have to worry about spelling errors and missing or stray punctuation, but this is a good start and generally picks up a substantial fraction of the messiness.

          By the way, there was already a clue about the spacing issue in the output Stata originally gave you:
          replace var = "-1" if var == "<5"
          (0 real changes made)
          That 0 real changes means that Stata actually never found any instances of var == "<5". So "<5" doesn't actually occur in the data set.

          Comment

          Working...
          X