  • Help with LABELS derived from words in a GLOBAL

    Hi Everyone, I have a question about syntax. I have a huge text database, and I'm searching through hundreds of thousands of entries looking for key words. I want to create indicator variables marking the observations that contain each word, and then combine those indicators (using the egen group() function) into a single variable showing which key words appear in each observation. Sample data and code are below. My code creates the indicators that I want but not the labels. I'd like the variable act_Industry_A to have the label A and the variable act_Industry_B to have the label B. After grouping, I'd like the variable act_Industry to have labels indicating A, B, A B, or blank. In the code that I wrote, the label var command does not seem to create a label, and I cannot figure out why. Please help if you can. Thanks. Gary

    clear
    input year str10 title_proper
    year title_proper
    1900 "z z z A"
    1901 "z z z B"
    1902 "z z A C"
    1903 "z z A B"
    1904 "z z z z"
    end

    global Industries "A B"

    foreach I in $Industries {
        gen act_Industry_`I' = 1 if regexm(title_proper,"`I'")==1
        replace act_Industry_`I' = 0 if act_Industry_`I'==.
        label var act_Industry_`I' "`I'"
    }

    egen act_Industry = group(act_Industry_*), label

    br
    label list

    Note the output from browse is
    year  title_proper  act_Industry_A  act_Industry_B  act_Industry
    1900  z z z A       1               0               1 0
    1901  z z z B       0               1               0 1
    1902  z z A C       1               0               1 0
    1903  z z A B       1               1               1 1
    1904  z z z z       0               0               0 0

    The indicators are correct, but act_Industry_A and act_Industry_B lack labels. The labels for the last variable are strings of numbers, but I wanted strings of letters (e.g. A, B, A B, or blank).

    Thanks

    Gary

  • #2
    This seems a simple misunderstanding. Your code does create variable labels, but label list would not show them anyway: it displays value labels only.
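The distinction can be checked directly. The following is a minimal sketch on a cut-down version of the sample data; the value label name yesno is purely illustrative and does not come from the original code.

```stata
clear
input year str10 title_proper
1900 "z z z A"
1901 "z z z B"
end

generate byte act_Industry_A = regexm(title_proper, "A")
label variable act_Industry_A "A"

* Variable labels appear in describe output ...
describe act_Industry_A

* ... and can be retrieved with an extended macro function.
local vlab : variable label act_Industry_A
display "variable label: `vlab'"

* label list reports value labels only; none are defined at this point.
label list

* Once a value label (hypothetical name: yesno) is defined and attached,
* label list has something to show.
label define yesno 0 "no" 1 "yes"
label values act_Industry_A yesno
label list yesno
```

So label var did its job all along; describe (or the Variables window) is the place to look for its result.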

    The use of globals here is not material, but before my friend Clyde Schechter makes the point, I'd remark that it is not necessary and is poor practice anyway: in programming, nothing should be made global that doesn't need to be.

    More positively, I note that

    Code:
    gen act_Industry_`I' = regexm(title_proper,"`I'")==1
    gets you (0, 1) indicators in one line as explained at (e.g.) https://www.stata.com/support/faqs/d...mmy-variables/ or https://www.stata-journal.com/articl...article=dm0099
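As a quick check that the one-line form matches the original two-step form, here is a small sketch on the sample data. The variable names two_step and one_line are illustrative only.

```stata
clear
input year str10 title_proper
1900 "z z z A"
1901 "z z z B"
1902 "z z A C"
1903 "z z A B"
1904 "z z z z"
end

* Two-step version from the original post: 1 if matched, then fill in 0.
generate byte two_step = 1 if regexm(title_proper, "A") == 1
replace two_step = 0 if two_step == .

* One-line version: a true-or-false expression evaluates to 1 or 0.
generate byte one_line = regexm(title_proper, "A") == 1

* The two indicators agree in every observation.
assert one_line == two_step
tabulate one_line
```

Since regexm() itself returns 1 for a match and 0 otherwise, the trailing ==1 is harmless but could also be dropped.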

    Although you can no doubt find examples in my own code, I recommend against I or l as a macro name, as either is so easy to confuse with 1.



    • #3
      Hi Nick,

      Thanks. That helped. I updated the code as you suggested (see below). Now, I understand my question.

      Label list yields

      act_Industry:
          1 0 0
          2 0 B
          3 A 0
          4 A B
      B:
          0
          1 B
      A:
          0
          1 A


      Browse yields

      year  title_proper  act_Industry_A  act_Industry_B  act_Industry
      1900  z z z A       A                               A 0
      1901  z z z B                       B               0 B
      1902  z z A C       A                               A 0
      1903  z z A B       A               B               A B
      1904  z z z z                                       0 0

      For variable act_Industry_A, the value label is A or a blank string. For variable act_Industry_B, the value label is B or a blank string. For the grouped variable, I want the labels to read A, B, A B, or blank. Right now, the labels are A 0, 0 B, A B, and 0 0. In other words, I want the labels for the grouped variable to read as they do now but without the zeros. Could you suggest a straightforward way of achieving this? The table that I'm hoping to produce appears below; it differs from the one above only in the act_Industry column.

      year  title_proper  act_Industry_A  act_Industry_B  act_Industry
      1900  z z z A       A                               A
      1901  z z z B                       B               B
      1902  z z A C       A                               A
      1903  z z A B       A               B               A B
      1904  z z z z

      In my case, redefining the labels is impractical. I'm reading through several hundred thousand text records. These name about 200 different industries, usually individually, but occasionally in combinations of two or three industries.

      Thanks again.

      Gary

      *********

      clear
      input year str10 title_proper
      year title_proper
      1900 "z z z A"
      1901 "z z z B"
      1902 "z z A C"
      1903 "z z A B"
      1904 "z z z z"
      end

      local Industries "A B"

      foreach X in `Industries' {
          gen act_Industry_`X' = regexm(title_proper,"`X'")==1
          label var act_Industry_`X' "`X'"
          label define `X' 0 "" 1 "`X'"
          label values act_Industry_`X' `X'
      }

      egen act_Industry = group(act_Industry_*), label

      br
      label list
      Last edited by Gary Richardson; 02 Nov 2021, 14:28.



      • #4
        I'd work backwards from whether what you want is compatible with the limit on the length of variable labels (80 characters). However, there are always notes.
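One possible route to labels without the zeros, offered only as a sketch and not as something proposed in this thread: build the combination as a string while looping over the industries, then let encode turn that string into a labelled numeric variable. The helper name act_Industry_str is an assumption made for illustration, and whether this scales to about 200 industries would need checking on the real data.

```stata
clear
input year str10 title_proper
1900 "z z z A"
1901 "z z z B"
1902 "z z A C"
1903 "z z A B"
1904 "z z z z"
end

local Industries "A B"

* Start from an empty string and append the name of each matched industry.
* act_Industry_str is a hypothetical helper variable.
generate act_Industry_str = ""
foreach X of local Industries {
    generate byte act_Industry_`X' = regexm(title_proper, "`X'")
    label variable act_Industry_`X' "`X'"
    replace act_Industry_str = trim(act_Industry_str + " `X'") if act_Industry_`X'
}

* encode numbers the distinct strings in alphabetical order and stores the
* strings as value labels; observations with an empty string become missing.
encode act_Industry_str, generate(act_Industry)

label list act_Industry
list year title_proper act_Industry, noobs
```

On the sample data this gives value labels A, A B, and B, with the 1904 observation left missing rather than labelled 0 0. Because the labels are generated from the matched names, nothing has to be redefined by hand.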
