Hey all,
I have read several posts on matchit and a couple of other matching algorithms (see the end of my post).
I have two datasets that I want to match by firm name: the master data has 600,000 names (across 20-plus years), and the second dataset has 13,000 distinct names. I have not yet run matchit on these two datasets because I am concerned about the matching time.
For matchit, some suggest dropping common words; in the case of firm names, “inc”, “co”, and “limited” might be common. Dropping them should help speed up matching, which matters given the large size of my datasets.
My questions:
- How do you extract those common words (e.g. “inc”, “co”, “limited”) and then drop them?
- In this post (http://www.statalist.org/forums/foru...-two-databases), Clyde and Julio suggest pretreating names.
If there are other such transformations I should apply to firm names, could you share them here, please?
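To make the first question concrete, here is a rough sketch of what I have in mind. It is written in Python purely for illustration (I would want the Stata equivalent), and the function names, the sample firm names, and the stopword list are all made up by me, not taken from matchit or from the linked posts. The idea is to count how many names each word appears in, look at the most frequent ones, and then strip the ones I decide are uninformative.

```python
import re
from collections import Counter

def tokenize(name):
    """Lowercase a firm name and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())

def common_words(names, top_n=20):
    """Return the top_n most frequent tokens, counting each token once per name."""
    counts = Counter()
    for name in names:
        counts.update(set(tokenize(name)))
    return [word for word, _ in counts.most_common(top_n)]

def strip_words(name, stopwords):
    """Drop stopword tokens; keep all tokens if nothing would remain."""
    tokens = tokenize(name)
    kept = [t for t in tokens if t not in stopwords]
    return " ".join(kept or tokens)

# Hypothetical sample data, for illustration only.
names = ["Acme Inc.", "ACME Holdings Co", "Globex Limited", "Globex Co., Inc."]

# Step 1: inspect the most frequent words and decide by hand which to drop.
candidates = common_words(names, top_n=4)

# Step 2: drop the chosen words (an assumed list) before matching.
stopwords = {"inc", "co", "limited"}
cleaned = [strip_words(n, stopwords) for n in names]
```

I assume the frequent-word list still needs a manual check before dropping anything, since a frequent word can also be the distinguishing part of a name (here "acme" and "globex" are as frequent as "inc" and "co").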
Thanks,
Rochelle
http://www.statalist.org/forums/foru...-two-databases
http://www.statalist.org/forums/foru...reclink-syntax
http://www.statalist.org/forums/foru...-e-fuzzy-match