Hi Statalist,
I am trying to clean a large dataset using one string variable, called "program". In a different frame, I have a list of 500+ keywords (some of which are more than one word). I want to create a dummy variable (called "keeper_program") that equals 1 if the value of "program" in an observation contains any of the keywords and 0 if it contains none of them.
How can I do this? I've tried to manually enter some of the keywords in the following format, but I get the error message "strpos not found":
generate keeper_program = strpos(program, " band ") | strpos(program, " cree ") | strpos(program, " f n ")
I've also tried the following. It runs, but it just gives me a value of 1 for every observation under "program" that contains anything at all, not just the keywords:
foreach keywords in keywords_frame {
    local keywordsmacro
}
gen keeper_program = 0
foreach keywords in keywordsmacro {
    quietly replace keeper_program = 1 if strpos(program, "`keywordsmacro'")
}
Here's an example of the data in "program":
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input str153 program
"Association for Canadian Studies, Montreal, Quebec"
"Canada's National History Society, Winnipeg, Manitoba"
"Governing Council of the Toronto University, Toronto, Ontario"
"Governor General's Canadian Leadership Conference, Toronto, Ontario"
"Historica Canada, Toronto, Ontario"
"Transfer payments under $100,000 (1 recipient)"
""
"1454119 Ontario Ltd Teach Magazine, Toronto, Ontario"
"1772887 Ontario Ltd, Toronto, Ontario"
"2017 Canada Summer Games Host Society Inc, Winnipeg, Manitoba"
"3763455 Canada Inc, Ottawa, Ontario"
"3e Evenements, Quebec, Quebec"
"4Elements Living Arts, Kagawong, Ontario"
"9291571 Canada Society O/A Elpio Productions, Kanata, Ontario"
"Aboriginal Peoples Television Network Inc, Winnipeg, Manitoba"
"Action Promotion Grande Allee, Quebec, Quebec"
"Actua, Ottawa, Ontario"
"Algonquin Anishinabeg Nation Inc, Maniwaki, Quebec"
"Arts Ottawa East, Ottawa, Ontario"
"Association de la Presse francophone, Ottawa, Ontario"
"Atlantic Presenters Association Inc, Charlottetown, Prince Edward Island"
"Brand Live Management Group Inc, Vancouver, British Columbia"
"Canada Games Council, Ottawa, Ontario"
"Canada Place Corporation, Vancouver, British Columbia"
end
And here's an example of the data in the keywords list:
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input str44 keywords
""
" band "
" cree "
" f n "
" fn "
" nation "
" native "
"170629 canada"
"3051802 nova"
"4 directions"
"4-directions"
"4891156 manitoba"
"613860 saskatchewan"
"a-tlegay fisheries"
"abenakis"
"abenaquis"
"aboriginal"
"adah'dene cultural healing"
"adams lake band,"
"ahousaht"
end
Thank you in advance for your help!
Matthias Hoenisch