Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating a dummy variable for a large number of keywords in a dataset

    Hi Statalist,

    I am trying to clean a large dataset. Currently, I’m trying to clean the data using the observations under one variable, called “program”. The data under “program” is string data. In a different frame, I have a list of 500+ keywords (some of which are more than just one word). I’m trying to create a dummy variable (called "keeper_program") with a value of 1 if the observation under “program” contains any of the keywords and a value of 0 if the observation under "program" does not contain any of the keywords.

    How can I do this? I've tried to manually enter some of the keywords in the following format, but I get the error message "strpos not found":

    generate keeper_program = strpos(program, " band ") | strpos(program, " cree ") | strpos(program, " f n ")

    I've also tried the following. It runs, but it just gives me a value of 1 for every observation under "program" that contains anything at all, not just the keywords:

    foreach keywords in keywords_frame {
    local keywordsmacro
    }

    gen keeper_program = 0
    foreach keywords in keywordsmacro {
    quietly replace keeper_program = 1 if strpos(program, "`keywordsmacro'")
    }


    Here's an example of the data in "program:"

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str153 program                                                                                                              
    "Association for Canadian Studies, Montreal, Quebec"                                                                
    "Canada's National History Society, Winnipeg, Manitoba"                                                             
    "Governing Council of the Toronto University, Toronto, Ontario"                                                     
    "Governor General's Canadian Leadership Conference, Toronto, Ontario"                                               
    "Historica Canada, Toronto, Ontario"                                                                                
    "Transfer payments under $100,000 (1 recipient)"                                                                    
    ""                                                                                                                  
    "1454119 Ontario Ltd Teach Magazine, Toronto, Ontario"                                                              
    "1772887 Ontario Ltd, Toronto, Ontario"                                                                             
    "2017 Canada Summer Games Host Society Inc, Winnipeg, Manitoba"                                                     
    "3763455 Canada Inc, Ottawa, Ontario"                                                                               
    "3e Evenements, Quebec, Quebec"                                                                                     
    "4Elements Living Arts, Kagawong, Ontario"                                                                          
    "9291571 Canada Society O/A Elpio Productions, Kanata, Ontario"                                                     
    "Aboriginal Peoples Television Network Inc, Winnipeg, Manitoba"                                                     
    "Action Promotion Grande Allee, Quebec, Quebec"                                                                     
    "Actua, Ottawa, Ontario"                                                                                            
    "Algonquin Anishinabeg Nation Inc, Maniwaki, Quebec"                                                                
    "Arts Ottawa East, Ottawa, Ontario"                                                                                 
    "Association de la Presse francophone, Ottawa, Ontario"                                                             
    "Atlantic Presenters Association Inc, Charlottetown, Prince Edward Island"                                          
    "Brand Live Management Group Inc, Vancouver, British Columbia"                                                      
    "Canada Games Council, Ottawa, Ontario"                                                                             
    "Canada Place Corporation, Vancouver, British Columbia"                                                                                                                   
    end


    And here's an example of the data in the keywords list:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str44 keywords
    ""                          
    " band "                    
    " cree "                    
    " f n "                     
    " fn "                      
    " nation "                  
    " native  "                 
    "170629 canada"             
    "3051802 nova"              
    "4 directions"              
    "4-directions"              
    "4891156 manitoba"          
    "613860 saskatchewan"       
    "a-tlegay fisheries"        
    "abenakis"                  
    "abenaquis"                 
    "aboriginal"                
    "adah'dene cultural healing"
    "adams lake band,"          
    "ahousaht"                  
    end


    Thank you in advance for your help!

    Matthias Hoenisch










  • #2
    Here is oneway:

    Code:
    clear*
    input str44 keywords
    ""                          
    " band "                    
    " cree "                    
    " f n "                     
    " fn "                      
    " nation "                  
    " native  "                 
    "170629 canada"             
    "3051802 nova"              
    "4 directions"              
    "4-directions"              
    "4891156 manitoba"          
    "613860 saskatchewan"       
    "a-tlegay fisheries"        
    "abenakis"                  
    "abenaquis"                 
    "aboriginal"                
    "adah'dene cultural healing"
    "adams lake band,"          
    "ahousaht"   
    "Columbia"    
    "East"           
    end
    count
    local N_keywords = r(N)
    frame copy default keywords
    
    clear
    input str153 program                                                                                                              
    "Association for Canadian Studies, Montreal, Quebec"                                                                
    "Canada's National History Society, Winnipeg, Manitoba"                                                             
    "Governing Council of the Toronto University, Toronto, Ontario"                                                     
    "Governor General's Canadian Leadership Conference, Toronto, Ontario"                                               
    "Historica Canada, Toronto, Ontario"                                                                                
    "Transfer payments under $100,000 (1 recipient)"                                                                    
    ""                                                                                                                  
    "1454119 Ontario Ltd Teach Magazine, Toronto, Ontario"                                                              
    "1772887 Ontario Ltd, Toronto, Ontario"                                                                             
    "2017 Canada Summer Games Host Society Inc, Winnipeg, Manitoba"                                                     
    "3763455 Canada Inc, Ottawa, Ontario"                                                                               
    "3e Evenements, Quebec, Quebec"                                                                                     
    "4Elements Living Arts, Kagawong, Ontario"                                                                          
    "9291571 Canada Society O/A Elpio Productions, Kanata, Ontario"                                                     
    "Aboriginal Peoples Television Network Inc, Winnipeg, Manitoba"                                                     
    "Action Promotion Grande Allee, Quebec, Quebec"                                                                     
    "Actua, Ottawa, Ontario"                                                                                            
    "Algonquin Anishinabeg Nation Inc, Maniwaki, Quebec"                                                                
    "Arts Ottawa East, Ottawa, Ontario"                                                                                 
    "Association de la Presse francophone, Ottawa, Ontario"                                                             
    "Atlantic Presenters Association Inc, Charlottetown, Prince Edward Island"                                          
    "Brand Live Management Group Inc, Vancouver, British Columbia"                                                      
    "Canada Games Council, Ottawa, Ontario"                                                                             
    "Canada Place Corporation, Vancouver, British Columbia"                                                                                                                   
    end
    
    generate keeper_program  = .
    //Pick up empty keywords and empty program name
    replace keeper_program  =  1 if program == ""
    forvalues i = 1/`N_keywords' {
      qui replace  keeper_program  = 1 if  ustrpos(program, _frval(keywords, keywords, `i'))
    }
    list if keeper_program  == 1

    Comment


    • #3
      Hi Scott,

      Thanks for your answer. Unfortunately, I got an "invalid syntax" error message when I tried to input the following commands:

      Code:
        
       forvalues i = 1/`N_keywords' {   qui replace  keeper_program  = 1 if  ustrpos(program, _frval(keywords, keywords, `i')) }
      Last edited by Matthias Hoenisch; 24 Jul 2023, 12:00. Reason: Edited for incorrectly formatted code

      Comment


      • #4
        Are you running the code line-by-line? Place the code in a do-file and run it all at once.

        Comment


        • #5
          Hi Andrew,

          I am not copying the entire code because the example I provided of my data is only a small portion of the overall data; in reality, there are nearly 700 observations for "keywords" and over 100,000 observations in "program." What I've been trying to do with Scott's contribution is the following. It all works smoothly until it gets to the command I reference above, starting with "forvalues i".

          Code:
          use "main_dataset.dta"
          frame rename default maindata
          frame create keywords_frame
          frame change keywords_frame
          
          * Creating a keywords "vector"
          import excel "keywords_spreadsheet"
          local N_keywords = r(N)
          frame copy maindata keywords
          
          frame change maindata
          generate keeper_program = .
          replace keeper_program = 1 if program == ""
          forvalues i = 1/`N_keywords' {
            qui replace keeper_program  = 1 if strpos(program, frval(keywords, keywords, `i'))
          }

          Comment


          • #6
            import excel "keywords_spreadsheet"
            local N_keywords = r(N)
            -r(N)- emerges after count in #2, so the local is empty.

            Code:
            import excel "keywords_spreadsheet"
            count
            local N_keywords = r(N)

            Comment


            • #7
              Thanks for your input Andrew and Scott, but I still can't get it to work. After some fiddling, I have got to the current code:

              Code:
               use "main_data.dta"
              
              . frame rename default maindata
              
              . frame create keywords_frame
              
              . frame change keywords_frame
              
              . 
              . * Creating a keywords "vector"
              . import excel "Vocabulary.xlsx", sheet("Feuil1") firstrow
              (12 vars, 690 obs)
              
              . drop D E F H I J K L
              
              . gen keywords = strlower(recipientprogram)
              (4 missing values generated)
              
              . sort keywords
              
              . quietly by keywords: gen dup = cond(_N==1,0,_n)
              
              . drop if dup>1
              (128 observations deleted)
              
              . count
                562
              
              . local N_keywords = r(N)
              
              . frame change maindata
              
              . gen matchkeywords = ""
              (128,423 missing values generated)
              
              . frlink m:1 matchkeywords, frame(keywords_frame keywords) generate(linkedkeywords)
              (all observations in frame maindata matched)
              . 
              . gen keeper_program = ""
              (128,423 missing values generated)
              
              . forvalues i = 1/`N_keywords' {
                2.   qui replace keeper_program  = "1" if strpos(program, linkedkeywords, `i')
                3. }
              It all works until the final command, the forvalues command. At this point, I get "invalid syntax." Does anyone have any idea where my syntax has gone wrong here? Earlier in the week I was having some problems with data matching because I have string and numerical data in here, so I'm trying to solve that too.

              Thanks so much,

              Matthias

              Comment


              • #8
                Matthias Hoenisch In #5
                Code:
                  qui replace keeper_program = 1 if strpos(program, frval(keywords, keywords, `i'))
                You made two changes to my example code: 1. changed -ustrpos()- to -strpos()-. Your original list of keywords (where I added "East" and "Columbia" so there would be positive matches) contained an empty string. There is a difference in how empty strings are evaluated. See for example:
                Code:
                 clear
                
                . set obs 2
                Number of observations (_N) was 0, now 2.
                
                . gen s1 ="A"
                
                . gen s2 = "A" in 1
                (1 missing value generated)
                
                . gen found1 = strpos(s1, s2)
                
                . gen found2 = ustrpos(s1, s2)
                
                . l
                
                     +---------------------------+
                     | s1   s2   found1   found2 |
                     |---------------------------|
                  1. |  A    A        1        1 |
                  2. |  A             1        0 |
                     +---------------------------+

                2. You changed -_frval()- to -frval()- which have different arguments.

                In #7.
                Code:
                gen matchkeywords = ""
                frlink m:1 matchkeywords, frame(keywords_frame keywords) generate(linkedkeywords)
                You are matching an empty string variable to every keywords variable in the keyword_frame. I don't think this is what you want.

                Code:
                strpos(program, linkedkeywords, `i')
                strpos only has two arguments.

                Comment


                • #9
                  Scott Merryman I just tried again, copying your code from #2. The only differences are that I have loaded the dataset using the complete dataset, not just the variable "program," and I am including all the steps of cleaning up my keywords list in case that might affect something. However, I am once again getting an "invalid syntax" error message near the end of your code. Here's the code I used:

                  Code:
                  import excel "Keywords.xlsx", sheet("Feuil1") firstrow
                  drop D E F H I J K L
                  gen keywords = strlower(recipientprogram)
                  sort keywords
                  quietly by keywords: gen dup = cond(_N==1,0,_n)
                  drop if dup>1
                  count
                  local N_keywords = r(N)
                  frame copy default keywords
                  
                  clear
                  use "Big_Dataset.dta"
                  frame rename default maindata
                  
                  generate keeper_program  = .
                  //Pick up empty keywords and empty program name
                  replace keeper_program  =  1 if program == ""
                  forvalues i = 1/`N_keywords' {
                    qui replace  keeper_program  = 1 if  ustrpos(program, _frval(keywords, keywords, `i'))
                  }
                  list if keeper_program  == 1
                  This code all works until I get to the forvalues command; this is where I get the "invalid syntax" message.

                  Please let me know where I'm going wrong.

                  Thank you for your continued support; I know this is probably really frustrating on your end so I really appreciate your help.

                  Matthias

                  Comment


                  • #10
                    You are probably running your code line by line. That means your local macros disappear after each line. Instead you must run your entire do file in one go.

                    ​​​​​​
                    ---------------------------------
                    Maarten L. Buis
                    University of Konstanz
                    Department of history and sociology
                    box 40
                    78457 Konstanz
                    Germany
                    http://www.maartenbuis.nl
                    ---------------------------------

                    Comment


                    • #11
                      Well, I feel quite stupid. That worked. Thank you so much for all your help Scott Merryman , Andrew Musau, and Maarten Buis. You made this total Stata newbie's experience a lot smoother.

                      Comment

                      Working...
                      X