Frequencies of an array / dummy variables from multencode

julie anderson

Join Date: Mar 2020

Posts: 3
#1

26 Mar 2020, 10:02

Hi,
I have a dataset of firms with different characteristics. One variable is a string variable holding many tags separated by commas.
I created the tags* variables using the following command:

. split tags, p(",")
variables created as string:
tags1 tags2 tags3
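
One thing worth checking at this step: parsing on "," alone leaves a leading blank on every piece after the first (" medical", " COV19"), so the same tag can end up as two distinct strings depending on its position. A minimal sketch of trimming the pieces right after the split, assuming a recent Stata where strtrim() is available (older versions call it trim()); the local name newvars is only illustrative:

```stata
* Small self-contained example: split on commas, then strip blanks
clear
input str2 company str50 tags
"a" "tec, medical, COV19"
"b" "robots, COV19, yeast"
"d" "robots, AI"
end

split tags, parse(",")
* split leaves the names of the new variables in r(varlist);
* copy them to a local before another command overwrites r()
local newvars `r(varlist)'

foreach v of local newvars {
    * remove leading and trailing blanks from each piece
    replace `v' = strtrim(`v')
}

list company tags1-tags3
```

With the pieces trimmed, "COV19" in tags2 and "COV19" in tags1 would be treated as the same tag by any later encoding or tabulation.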

Then I used multencode to create a numeric identifier for each tag:
. multencode tags1-tags3, gen(ntags1-ntags3)

See below the dataex.

The number of tags is not uniform: some firms have 49 tags and some have 3, but the tags* variables all draw on the same set of tags.

1) I would like a frequency table across all the tags* variables I created (not tags1, tags2, and tags3 separately, but rather how many times, say, robots appears anywhere in the whole matrix of tags).
2) I would like to create a dummy variable for each tag. Instead of a column tags1 that can take different values, I want one column per tag for each company, e.g. a variable named tec that equals 1 if tags1="tec" or tags2="tec" or tags3="tec", and so forth for all the tags and all the values.
I created the example below, but in reality I have more than 5,000 firms and 50 possible tags, so doing it manually is impossible.

Thanks a lot

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str2 company str50 tags str6 tags1 str8 tags2 str7 tags3 byte(ntags1 ntags2 ntags3)
"a" "tec, medical, COV19"  "tec"    " medical" " COV19"  7 3 2
"b" "robots, COV19, yeast" "robots" " COV19"   " yeast"  6 2 5
"c" "tec, AI, mobile"      "tec"    " AI"      " mobile" 7 1 4
"d" "robots, AI"           "robots" " AI"      ""        6 1 .
end
label values ntags1 tags1
label values ntags2 tags1
label values ntags3 tags1
label def tags1 6 "robots", modify
label def tags1 7 "tec", modify
label def tags1 1 " AI", modify
label def tags1 2 " COV19", modify
label def tags1 3 " medical", modify
label def tags1 4 " mobile", modify
label def tags1 5 " yeast", modify

Andrew Musau

Join Date: Oct 2014
Posts: 10213

26 Mar 2020, 13:10

I do not understand #2. For #1, just reshape long and count the tags:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str2 company str50 tags str6 tags1 str8 tags2 str7 tags3 byte(ntags1 ntags2 ntags3)
"a" "tec, medical, COV19"  "tec"    " medical" " COV19"  7 3 2
"b" "robots, COV19, yeast" "robots" " COV19"   " yeast"  6 2 5
"c" "tec, AI, mobile"      "tec"    " AI"      " mobile" 7 1 4
"d" "robots, AI"           "robots" " AI"      ""        6 1 .
end
label values ntags1 tags1
label values ntags2 tags1
label values ntags3 tags1
label def tags1 6 "robots", modify
label def tags1 7 "tec", modify
label def tags1 1 " AI", modify
label def tags1 2 " COV19", modify
label def tags1 3 " medical", modify
label def tags1 4 " mobile", modify
label def tags1 5 " yeast", modify

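* work on a temporary copy so the original data come back at -restore-
preserve
* keep only company and the string pieces tags1-tags3
drop tags ntag*
* one row per company-tag pair
reshape long tags, i(company)
* collapse to one row per distinct tag, with its count in _freq
contract tags if !missing(tags)
l, sep(10)
restore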

Res.:

Code:

. l, sep(10)

     +------------------+
     |     tags   _freq |
     |------------------|
  1. |       AI       2 |
  2. |    COV19       2 |
  3. |  medical       1 |
  4. |   mobile       1 |
  5. |    yeast       1 |
  6. |   robots       2 |
  7. |      tec       2 |
     +------------------+
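
On #2, one possible reading is a 0/1 indicator per distinct tag that equals 1 whenever the tag appears in any of the split pieces. A minimal sketch under that reading, not tested on the full data: the variable prefixes piece and tag_ and the local names are made up for illustration, and the pieces are trimmed first so that " COV19" and "COV19" count as the same tag.

```stata
clear
input str2 company str50 tags
"a" "tec, medical, COV19"
"b" "robots, COV19, yeast"
"c" "tec, AI, mobile"
"d" "robots, AI"
end

* split into piece1, piece2, ... and strip surrounding blanks
split tags, parse(",") generate(piece)
local pieces `r(varlist)'
foreach v of local pieces {
    replace `v' = strtrim(`v')
}

* collect the distinct tags from a long copy of the data;
* the local macro survives -restore-
preserve
keep company piece*
reshape long piece, i(company) j(slot)
levelsof piece, local(alltags)
restore

* one indicator per tag: 1 if the tag appears in any piece
foreach t of local alltags {
    * strtoname() turns the tag into a legal variable name
    local name = strtoname("tag_`t'")
    generate byte `name' = 0
    foreach v of local pieces {
        replace `name' = 1 if `v' == "`t'"
    }
}

list company tag_*
```

Two tags that differ only in characters strtoname() replaces (for example a blank versus an underscore) would map to the same variable name and make generate fail, so with 50 real tags it may be worth checking the list in alltags first.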
