  • Comparing two string variables and extract difference into a new variable

    Hi,

    I am trying to compare the text in two string variables and identify (and extract) the “updated” parts.
    String Var1 | String Var2 | Result (new variable)
    “I wrote this in 2020” | “I wrote this in 2020. I updated this in 2021” | I updated this in 2021
    “someone said this” | “In 2020, someone said this” | In 2020,
    “numbers reported in 2020” | “numbers changed in 2021” | changed 2021
    I found a VBA script for Excel, but it only works on two cells (it does not loop over two columns), and I don’t know how to modify VBA scripts. There is a Stata command for sequence analysis (based on Needleman-Wunsch), but I cannot figure out how it applies to comparing sentences. Does anyone know of another program, or how sequence analysis works for comparing sentences?

    Thanks!

    Xiaodong

  • #2
    Code:
    // Wherever you find the string var1 present in the string var2, substitute a blank string,
    // and return the remaining part of var2 as a result variable.
    gen result = subinstr(var2, var1, "", .)
    I don't see anything in your description and illustration of your problem that suggests sequence analysis.

    For future reference, I'd encourage you to read -help string functions-. No one remembers or fully understands that material on first reading, but it will give a background of knowledge such that the next time you face something involving two string variables, you will remember that relevant materials might be found there.
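To make the behavior of the one-liner above concrete, here is a rough Python sketch that mimics it on the three example rows from the first post. Python's `str.replace` stands in for `subinstr(var2, var1, "", .)`; the names `rows`, `remove_whole`, and `results` are made up for this illustration and are not part of any Stata or forum code.

```python
# Hypothetical stand-in for: gen result = subinstr(var2, var1, "", .)
# The three (var1, var2) pairs are the example rows from the first post.
rows = [
    ("I wrote this in 2020", "I wrote this in 2020. I updated this in 2021"),
    ("someone said this", "In 2020, someone said this"),
    ("numbers reported in 2020", "numbers changed in 2021"),
]


def remove_whole(var1, var2):
    """Delete every literal occurrence of the whole of var1 from var2."""
    return var2.replace(var1, "")


results = [remove_whole(v1, v2) for v1, v2 in rows]
for r in results:
    print(repr(r))
# '. I updated this in 2021'
# 'In 2020, '
# 'numbers changed in 2021'
```

The first two rows come back with only leftover punctuation or a trailing space, which a trim would tidy up. The third row comes back unchanged, because var1 never appears in var2 as one unbroken string.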


    • #3
      Hi Mike,

      Thanks for suggesting the "subinstr" function. It wouldn't work here, because not all characters/phrases of Var1 appear in Var2. Suppose one observation of var1 holds a string of two sentences, and the same observation of var2 also holds a string of two sentences. One sentence is the same in both. The desired compare-and-extract command would identify the "new" sentence in var2 and put it in a new/result variable. The subinstr function cannot do that: it looks for the entirety of var1 in var2, so it would fail in this instance.

      I mentioned "sequence analysis" because I saw exactly this function written in VBA (for Excel) based on Needleman-Wunsch, an algorithm developed for DNA sequence analysis and later adapted for Natural Language Processing. I didn't know anything about it until I read something in the Stata Journal today. I have reached out to the authors, but I hope I can also get some help from Statalist :-)
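As a rough illustration of what a word-level sequence comparison could look like, here is a minimal Python sketch. It uses the standard library's `difflib.SequenceMatcher`, which is a longest-matching-block heuristic and not Needleman-Wunsch, so treat it as a swapped-in technique. The function name `added_words` and the choice to ignore leading and trailing punctuation when comparing words are assumptions made for this example.

```python
import string
from difflib import SequenceMatcher


def added_words(var1, var2):
    """Return the words of var2 that an alignment marks as inserted or changed.

    Assumption: words are compared with surrounding punctuation stripped,
    so "2020." and "2020" count as the same word.
    """
    old = var1.split()
    new = var2.split()
    old_keys = [w.strip(string.punctuation) for w in old]
    new_keys = [w.strip(string.punctuation) for w in new]

    matcher = SequenceMatcher(a=old_keys, b=new_keys, autojunk=False)
    kept = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" blocks cover words in var2 with no match in var1.
        if tag in ("insert", "replace"):
            kept.extend(new[j1:j2])
    return " ".join(kept)


print(added_words("I wrote this in 2020",
                  "I wrote this in 2020. I updated this in 2021"))
# I updated this in 2021
print(added_words("someone said this", "In 2020, someone said this"))
# In 2020,
print(added_words("numbers reported in 2020", "numbers changed in 2021"))
# changed 2021
```

On these three rows the sketch reproduces the results listed in the first post, but that is a property of these examples and not a guarantee for arbitrary text. Reordered or repeated phrases can align in ways that may not match what is meant by an "update".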


      • #4
        I understand your rules somewhat better now but not well enough. The issue for me, perhaps not for others that might help you, is that a precise definition of what an "updated part" means for your purpose would be needed, i.e., "A part (word? character? phrase?) in var2 is to be considered part of an update if it meets the following conditions that involve the relation of the content in var2 to the content in var1: ... ."
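To show why the definition matters, here is a small Python sketch of one possible rule: "a sentence in var2 is an update if the same sentence does not appear in var1". The rule itself, the function name `new_sentences`, and the naive sentence splitting on `.`, `!`, or `?` followed by whitespace are all assumptions made for illustration, not something stated in the thread.

```python
import re


def new_sentences(var1, var2):
    """Return the sentences of var2 that do not occur in var1.

    Assumptions: a sentence ends at '.', '!' or '?' followed by whitespace,
    and the final punctuation mark is ignored when comparing sentences.
    """
    def split(text):
        parts = re.split(r"(?<=[.!?])\s+", text.strip())
        return [p for p in parts if p]

    def key(sentence):
        return sentence.rstrip(".!?").strip()

    seen = {key(s) for s in split(var1)}
    return [s for s in split(var2) if key(s) not in seen]


print(new_sentences("I wrote this in 2020",
                    "I wrote this in 2020. I updated this in 2021"))
# ['I updated this in 2021']
print(new_sentences("someone said this", "In 2020, someone said this"))
# ['In 2020, someone said this']
```

Under this sentence-level rule the first example row gives the result requested in the first post. The second row returns the whole of var2 where the first post asks for only "In 2020,". A word-level or character-level rule would give different answers again, which is why the definition has to be pinned down first.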
