  • Creating new dummy variable from several dummies

    Hi,
    I have a long list of dummy variables (more than 2000) for each firm, named t1-t2100. These are company tags that I converted to 1/0 dummies.
    In a separate file I have a matrix that maps these t* dummies to categories such as software and hardware:
    tag_m tagss soft hard bio OTHER MEDICAL telcomobile ECOMMERCE CYBER FINTECH INDUSTRY4 AGRI
    t6 3d-technology 1
    t30 adtech 1
    t31 advertisers 1
    t32 advertising 1
    t46 agriculture 1
    t48 agtech 1
    t59 alert-system 1 1
    t63 algorithms 1
    t74 analytics 1
    t87 anti-fraud 1 1
    I want to create a variable in the original dataset using this category matrix, so that basically
    gen software = 1 if t30==1 | t31==1 | t32==1 ......

    But doing this for so many categories and t* variables is very tedious. I am sure there is a neater way than typing it all out manually as above?

    Would appreciate any help or suggestion.
    To clarify, the legend matrix that I copied above is in a separate data file from the main data, which looks like this:
    company_id value amount age t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11
    x 100 10 4 0 1 1 1 0 0 0 1 1 0 0
    y 4000 4 8 0 1 0 0 0 0 0 1 0 0 1
    I want to add columns to this dataset so that, based on the categories matrix above, the new variables are 0/1 indicators, e.g. soft = 1 if t30==1 | t31==1 | t32==1 ......
    company_id value amount age t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 soft hard bio other
    x 100 10 4 0 1 1 1 0 0 0 1 1 0 0
    y 4000 4 8 0 1 0 0 0 0 0 1 0 0 1
    Thanks

  • #2
    I sense from your use of terminology that you have begun to use Stata relatively recently and are steeped in habits of thought and practices appropriate to some other statistical package. One of the keys to becoming efficient in the use of Stata is to break those old habits and develop new ones that are more congenial to the way Stata works.

    For example, the wide layout of your company data, with all those t* variables makes things hard to do in Stata. A more workable layout is the long one, with multiple observations per company, one for each t* that applies to it. Another example is the use of 1/missing value coding for your indicators in the categories data. When Stata evaluates logical expressions, 1 and missing value both evaluate as true: only 0 evaluates as false. So you are setting yourself up for logic errors later on when you use 1 for yes and missing value for no. It should be 1 for yes and 0 for no.
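
    A minimal sketch of the missing-value point, using a made-up variable named flag (it is not in your data): an -if- condition treats any nonzero value as true, and a missing value counts as nonzero.

    Code:
    clear
    input byte flag
    1
    0
    .
    end
    //  BARE CONDITION: TRUE FOR THE 1 AND ALSO FOR THE MISSING VALUE
    count if flag
    //  EXPLICIT COMPARISON: TRUE ONLY FOR THE 1
    count if flag == 1

    The first -count- reports 2 and the second reports 1, which is why 0/1 coding is the safer choice.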

    So, given what you're starting with, some transformations of the data are required in order to put these two data sets together. Once that is done, it's very simple.

    Code:
    clear*
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str3 tag_m str13 tagss byte(var3 hard bio other medical telcomobile ecommerce cyber fintech industry4 agri)
    "t6"  "3d-technology" . 1 . . . . . . . . .
    "t30" "adtech"        1 . . . . . . . . . .
    "t31" "advertisers"   1 . . . . . . . . . .
    "t32" "advertising"   1 . . . . . . . . . .
    "t46" "agriculture"   . . . . . . . . . . 1
    "t48" "agtech"        . . . . . . . . . . 1
    "t59" "alert-system"  1 1 . . . . . . . . .
    "t63" "algorithms"    1 . . . . . . . . . .
    "t74" "analytics"     1 . . . . . . . . . .
    "t87" "anti-fraud"    . . . . . . . . . . .
    end
    ds tag_m tagss, not
    tempfile categories
    save `categories'
    
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str1 company_id int value byte(amount age t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11)
    "x"  100 10 4 0 1 1 1 0 0 0 1 1 0 0
    "y" 4000  4 8 0 1 0 0 0 0 0 1 0 0 1
    end
    
    //  GO TO LONG LAYOUT
    reshape long t, i(company_id) j(tag_m)
    drop if t == 0
    drop t
    tostring tag_m, replace
    replace tag_m = "t" + tag_m
    
    
    frame create categories
    frame change categories
    use `categories'
    
    //  MAKE THE INDICATORS FOR HARD, BIO, ETC. PROPER 0/1 VARIABLES
    ds tag_m tagss, not
    mvencode `r(varlist)', mv(0)
    
    //  NOW PUT THEM TOGETHER
    frame change default
    frlink m:1 tag_m, frame(categories)
    frget hard-agri, from(categories)
    //  VAR3 HOLDS THE SOFT COLUMN OF THE CATEGORIES MATRIX
    frget soft = var3, from(categories)

    In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When asking for help with code, always show example data. When showing example data, always use -dataex-.
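
    As an illustration only (the variable list and observation range here are assumptions, picked to match the tableau in #1), a call might look like this:

    Code:
    //  SHOW THE FIRST TWO OBSERVATIONS OF THE COMPANY VARIABLES AND TAGS
    dataex company_id value amount age t1-t11 in 1/2

    Copy the block that -dataex- writes to the Results window and paste it into your post between code delimiters.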

    Note: In your data example tableaus, the only t* variables that ever take the value 1 are t2, t3, t4, t8, t9, and t11. But none of these appear in the categories data set. So applying the code to your example data finds no matches, and soft, hard, bio, etc. all come out as missing values. Generally when posting it is better to show example data that fits together and illustrates the phenomena you are trying to capture. Presumably in your full data, this problem does not arise.

    Added: It dawns on me that you will actually need to revert to your original data organization with a single observation per company, perhaps using the various categories (hard, soft, etc.) as predictors in some kind of model. So the following code extends what is above to do that. In addition, I have shortened the code somewhat by eliminating some unnecessary steps in the management of the categories data set.

    Code:
    clear*
    
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str1 company_id int value byte(amount age t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11)
    "x"  100 10 4 0 1 1 1 0 0 0 1 1 0 0
    "y" 4000  4 8 0 1 0 0 0 0 0 1 0 0 1
    end
    
    //  RETAIN A COPY OF THE DATA IN ITS ORIGINAL FORM
    frame copy default original
    
    //  GO TO LONG LAYOUT
    reshape long t, i(company_id) j(tag_m)
    drop if t == 0
    drop t
    tostring tag_m, replace
    replace tag_m = "t" + tag_m
    
    
    frame create categories
    frame change categories
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str3 tag_m str13 tagss byte(var3 hard bio other medical telcomobile ecommerce cyber fintech industry4 agri)
    "t6"  "3d-technology" . 1 . . . . . . . . .
    "t30" "adtech"        1 . . . . . . . . . .
    "t31" "advertisers"   1 . . . . . . . . . .
    "t32" "advertising"   1 . . . . . . . . . .
    "t46" "agriculture"   . . . . . . . . . . 1
    "t48" "agtech"        . . . . . . . . . . 1
    "t59" "alert-system"  1 1 . . . . . . . . .
    "t63" "algorithms"    1 . . . . . . . . . .
    "t74" "analytics"     1 . . . . . . . . . .
    "t87" "anti-fraud"    . . . . . . . . . . .
    end
    //  MAKE THE INDICATORS FOR HARD, BIO, ETC. PROPER 0/1 VARIABLES
    ds tag_m tagss, not
    mvencode `r(varlist)', mv(0)
    
    //  NOW GRAB THE CATEGORIES
    frame change default
    frlink m:1 tag_m, frame(categories)
    frget hard-agri, from(categories)
    //  VAR3 HOLDS THE SOFT COLUMN OF THE CATEGORIES MATRIX
    frget soft = var3, from(categories)
    //  AND REDUCE TO ONE OBSERVATION PER COMPANY
    collapse (max) soft hard-agri, by(company_id)
    
    //  AND NOW PUT THAT TOGETHER WITH THE ORIGINAL DATA
    frame change original
    frlink 1:1 company_id, frame(default)
    frget soft hard-agri, from(default)
    drop default
    frame drop default
    Last edited by Clyde Schechter; 01 Apr 2020, 13:31.
