Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Converting a variable's values using a concordance table with keywords.

    Dear Statalisters,

    Apologies in advance if this thread contains any English errors. I am glad to join this forum.

    Let's suppose I have a variable named var that is coded according to the classification oldclass at the 4-digits level for individuals. I would like to convert var's values into the latest classification newclass at the 2-digits level, but as some of you may know, the mapping between two classifications isn't always clear-cut as there are codes in oldclass that are split into many codes in newclass and vice-versa, even considering the two-digits level.

    The concordance table I have at my disposal provides a way to deal with this issue. Please have a look at a sample of the table:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str4(oldclass newclass) str33 key byte(repl1 repl2)
    "0111" "0115" "" . .
    "0111" "0119" "" . .
    "0111" "0129" "" . .
    "0111" "0112" "" . .
    "0111" "0116" "" . .
    "0111" "0113" "" . .
    "0111" "0128" "" . .
    "0111" "0163" "" . .
    "0111" "0114" "" . .
    "0111" "0126" "" . .
    end
    As you can see, at the 2-digits level, there is nothing to be concerned about: even if there are changes at the 4-digits level, ultimately var would be coded 01 in both cases. But consider a situation where a value of oldclass gets split into two parts in newclass and those two parts do not have the same first two digits, as in the following example:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str4(oldclass newclass) str33 key byte(repl1 repl2)
    "1551" "2011" "keywordnumber1, keyword2" 20 11
    "1551" "1101" ""                . 11
    end
    Here's where it gets complicated. In my database, I have a variable named details that gives information about the individual, in addition to his oldclass value. To understand what will be his new newclass value, I must use the variable key displayed above: it contains words like keywordnumber1 or keyword2 that can also be found in the variable details. If a keyword of the variable key matches a word of the variable details, then the individual's new value at the 2-digits level in newclass will be given by the variable repl1. Otherwise, if there's no match, then I should give the individual the value given by the variable repl2 displayed above.

    Forgive me if it's unclear, but if you understood well, my request is simple: I would like to find a piece of code to automatically search for certain keywords (maybe those would be stored in a global that would be looped over?) in the variable details, and then, if there is a match, to create a variable newclass_2digits that is equal to the final value of the individual's var according to the newclass classification.

    Please note that I can destring or extract a substring of every variable displayed so far if needed. One solution could be to look for matches by hand, but it seems very long and I am worried I can do mistakes caused by inattention. The tricky part is that each key value can have several keywords in it, each separated by a comma, and I need to have this procedure for each keyword individually.

    For instance, if individual X belonged in code 1551 in oldvar as in the example above, and if her details variable (a string variable that is used to write additional info) matches EITHER keywordnumber1 OR keyword2, or BOTH, then her newclass_2digits should be 20 according to repl1. If there's no match, it should be repl2's number, i.e. 11!

    Thanks a lot for any help that would make me do this correspondance faster than a line-by-line method!






  • #2
    But consider a situation where a value of oldclass gets split into two parts in newclass and those two parts do not have the same first two digits, as in the following example:
    Is there ever a situation where a value of oldclass gets split into more than two parts in newclass at the 2-digit level? If so, how is that handled?

    Also, do I understand correctly that the variable details may contain an arbitrary number of words (also separated by commas), and that repl1 is chosen if any of the words of detail matches any of the words in key, (and repl2 is chosen otherwise).

    Finally, please explain the meaning of the observation "1551" "1101" "" . 11. In particular, why does it not have the same value of key and repl1 as the observation immediately preceding it?

    Also, when you respond, please also post some example data from the person file that can be used to debug and test code with the examples you have shown of the newclass-oldclass crosswalk.


    Last edited by Clyde Schechter; 04 Jul 2022, 11:19.

    Comment


    • #3
      Dear Clyde:

      Is there ever a situation where a value of oldclass gets split into more than two parts in newclass at the 2-digit level? If so, how is that handled?
      Indeed such situation can occur. Here's an example:

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str4 (oldclass newclass) str33 key byte(repl1 repl2)
      "1920" "1512" ""                 . 15
      "1920" "1520" ""                 . 15
      "1920" "1629" ""                 . 15
      "1920" "2219" "keyword1, keyword2" 22 15
      "1920" "2220" "keyword1, keyword2" 22 15
      "1920" "3230" "keyword3"       32 15
      end
      Notice how the same keywords appear for the fourth and fifth line, probably because they share the same two first digits in newclass. If the individuals do not have these words in their details description, then their new code at the 2-digits level should be 15. However I have to admit that I don't understand why the third line do not have "16" for repl2 given that no keyword is provided, which also seem strange.

      Also, do I understand correctly that the variable details may contain an arbitrary number of words (also separated by commas), and that repl1 is chosen if any of the words of detail matches any of the words in key, (and repl2 is chosen otherwise).
      details do not contain an arbitrary number of words separated by commas, but rather a description of the individual's activity. For instance key could have "meat, food" as a value and detailscould have "Production of meat". In that example, there should be a match between key and details.

      Finally, please explain the meaning of the observation "1551" "1101" "" . 11. In particular, why does it not have the same value of key and repl1 as the observation immediately preceding it?
      1511 in oldclass gets split into 2011 and 1101 in newclass. To know what will be an individual's new classification, I need to check her details value. In that case you mentionned, if 1101 in newclass do not have any value in key and repl1, it's because any other individual classified 1511 in oldclass who do not belong to 2011 in newclass would be classified 1101. See this as a choice by elimination.
      Last edited by Adam Sadi; 04 Jul 2022, 11:37.

      Comment


      • #4
        I'm afraid I'm going to take a pass on this one. What deters me is "For instance key could have "meat, food" as a value and detailscould have "Production of meat". In that example, there should be a match between key and details." This is a pretty complicated condition in its own right, because I suspect that the "any word match" condition is not literally true: I suspect that matches on words like "the," "of," "and" etc. would not count. But excluding those matches on small functional words is difficult. In the end, you will probably have to do fuzzy matching (see Julio Raffit's -matchit- program, available from SSC) to get this to work. Adding that level of complexity on what would already be a fairly complicated data management process even if the details matching were simple, takes it beyond a level that I can reasonably devote time and effort to today. Sorry. Perhaps somebody else who sees a simpler way to approach this than I do, or has more time to work on it, will respond.
        Last edited by Clyde Schechter; 04 Jul 2022, 11:54.

        Comment


        • #5
          Clyde: I understand that this issue requires some time. Thank you anyways for yours!

          If this information can help anyone, there are exactly 86 keywords displayed by the variable key (some may be gathered in one cell).

          EDIT : I'm not an expert on Stata, so I'm just trying to bring solutions with what I currently know about it, but what about something this :

          Code:
          local keywords /* The 86 different keywords here - they have to be written by hand because they can appear in one cell only*/
          gen newclass_2d = substr(newclass, 1, 2)
          
          foreach k of local keywords {
          replace newclass_2d = repl1 if strpos(details, "`k'")
          replace newclass_2d = repl2 * I am not sure of this very last line. 
          }
          Maybe this code can inspire someone?
          Last edited by Adam Sadi; 04 Jul 2022, 12:48.

          Comment

          Working...
          X