Cluster based on string similarity

Justus Deters

Join Date: Sep 2022
Posts: 1

26 Sep 2022, 05:02

Hey Community,

I'm quite new to Stata and would really appreciate some help! I have a dataset of more than 200 firms with various characteristics, such as their industry affiliation (see the example below). Each firm, however, belongs to several industry groups. My goal is to cluster the firms into 3 clusters based on how similar their industry group affiliations are, and to create a new categorical variable that records each firm's cluster. Does anyone have experience with this kind of problem, or can anyone suggest how best to approach it? Thank you so much in advance!

Data:

firm_id	industry_groups
1	Advertising, Commerce and Shopping, Sales and Marketing
2	Advertising, Media and Entertainment, Mobile, Sales and Marketing, Software
3	Energy, Natural Resources, Sustainability
...	...

Last edited by Justus Deters; 26 Sep 2022, 05:06.

Fei Wang

Join Date: Oct 2021

Posts: 726
#2

26 Sep 2022, 18:54

Justus, you will first have to decide what "similarity" means for your data, and then let Stata do the processing. Stata cannot define "similarity" for you automatically.
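To make this concrete, here is one possible definition, offered only as a minimal sketch: treat each firm's industry_groups string as a set of group names and measure similarity by the Jaccard index (shared groups divided by all distinct groups across the two firms). This is an assumption about what "similarity" could mean, not something the original post specifies. The sketch is written in Python purely to illustrate the arithmetic; the names to_set, jaccard, and similarity are hypothetical, and the data are the three example rows from the first post.

```python
# Illustrative only: the three example firms quoted in the original post.
firms = {
    1: "Advertising, Commerce and Shopping, Sales and Marketing",
    2: "Advertising, Media and Entertainment, Mobile, Sales and Marketing, Software",
    3: "Energy, Natural Resources, Sustainability",
}


def to_set(industry_groups):
    """Split a comma-separated string into a set of trimmed group names."""
    return {g.strip() for g in industry_groups.split(",") if g.strip()}


def jaccard(a, b):
    """Jaccard similarity: number of shared groups / number of distinct groups."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Assumed definition of similarity: Jaccard index on the sets of group names.
groups = {firm_id: to_set(text) for firm_id, text in firms.items()}
ids = sorted(groups)

# Pairwise similarity for every unordered pair of firms.
similarity = {
    (i, j): jaccard(groups[i], groups[j])
    for i in ids
    for j in ids
    if i < j
}

for pair, value in similarity.items():
    print(pair, round(value, 3))
```

Under this definition, firms 1 and 2 share 2 of 6 distinct groups, so their similarity is about 0.333, while firm 3 shares nothing with either and scores 0. A matrix of such pairwise values (or 1 minus them, as dissimilarities) is the kind of input a hierarchical clustering routine could then cut into 3 clusters. Other definitions, for example ones that weight rare groups more heavily, would give different clusters.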