How can I efficiently generate many dummy variables from substrings? (other than one by one using regexm)

Enrique Figueroa

Join Date: May 2018

Posts: 4
#1

15 Apr 2019, 17:25

Excuse my terrible title; I struggled to communicate what I wanted to do in a concise way.

I am working on a database of federal contractors and I'm trying to create dummy variables for the different business types an entity can have. The problem I am facing is that in this dataset a contractor can have multiple business types which are all contained in one string variable. For example:

contractor   Busn_type_str
A            23~2X~PI
B            1D~23~27~A5~A8~H2~HK~PI~QF

Each of these two-character alphanumeric codes represents a different business type. I want to create a dummy variable for each business type so that I can make some tables for the project I am working on. I could go one by one and generate dummy = regexm(Busn_type_str, "{code}"), but there are 78 of these codes. I will do this if there is no alternative, but I would rather work smart, not hard. Any suggestions on how to generate these dummy variables efficiently?

In summary: I have a string of two-character codes (78 different codes) and I want to make 78 dummy variables indicating the presence of each code. I am trying to avoid generating them one by one with regexm. I am considering a loop combined with regexm, but I would like to know whether other methods might save me some time.

Thanks for all your help!

-Enrique A Figueroa
Clyde Schechter

Join Date: Apr 2014

Posts: 30173
#2

15 Apr 2019, 17:40

You don't need to generate a bunch of indicator ("dummy") variables in order to generate a table of business types for your project. In fact, it is fairly rare in modern Stata that you ever need to do that, for any reason. The most common use of indicator variables in early versions of Stata was to represent categorical variables (like business type) in regression models. But factor-variable notation does that much more efficiently now.
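As a small illustration of the factor-variable point, here is a sketch using the auto dataset that ships with Stata. It is only a stand-in, not the contractor data, and the choice of price, rep78, and mpg is an assumption made for the example. The i. prefix asks Stata to treat rep78 as categorical, so no indicator variables have to be created by hand.

```stata
* Stand-in example on a shipped dataset (not the contractor data)
sysuse auto, clear

* i.rep78 expands into one term per category (minus a base level)
* at estimation time; no new variables are added to the dataset
regress price i.rep78 mpg
```

The same notation should work with a labeled numeric category variable such as the bus_type variable created further down.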

What you need to do is reorganize your data. The variable busn_type_str isn't very useful. You need to break it up into the separate codes, and then you need to reorganize your data into long layout. Your life will also be easier if you then convert the resulting business code variable to a numeric version with value labels. The code below does that, and concludes by showing you a table of frequencies of the different business types, which is presumably what you were aiming for.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 contractor str26 busn_type_str
"A" "23~2X~PI"
"B" "1D~23~27~A5~A8~H2~HK~PI~QF"
end

split busn_type_str, parse("~") gen(_bus_type)
drop busn_type_str
reshape long _bus_type, i(contractor)
drop if missing(_bus_type)
encode _bus_type, gen(bus_type)
tab bus_type
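If the 78 indicators are still wanted, a loop avoids typing them one by one. What follows is only a sketch and not part of the original reply. The bt_ prefix, the seq variable, and the use of strpos() in place of regexm() are illustrative choices. It assumes every code can legally follow bt_ in a Stata variable name, which holds for purely alphanumeric codes like those shown.

```stata
* Sketch only: one 0/1 indicator per distinct code in busn_type_str
clear
input str1 contractor str26 busn_type_str
"A" "23~2X~PI"
"B" "1D~23~27~A5~A8~H2~HK~PI~QF"
end

* Collect the distinct codes from a temporary long copy of the data
preserve
split busn_type_str, parse("~") gen(_bus_type)
keep contractor _bus_type*
reshape long _bus_type, i(contractor) j(seq)
drop if missing(_bus_type)
levelsof _bus_type, local(codes)
restore

* Wrap the string in delimiters so each code matches only as a whole token
foreach c of local codes {
    gen byte bt_`c' = strpos("~" + busn_type_str + "~", "~`c'~") > 0
}

list contractor bt_*
```

Padding both the string and the search term with "~" guards against a code matching across a boundary between two adjacent codes, which a bare regexm() or strpos() on the raw string would not rule out.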

In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.