Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Append multiple datasets

    I have multiple panel data sets. All of them contain the same variables, in the same order. Before appending datasets, I have 12 unique "isin" observations in all datasets combined. Once I append the other data sets, the number of distinct values is smaller than 12.

    Using - distinct - in the fully appended dataset shows only 4 distinct values, while it should show 12. The other variables seem to be unaffected, yet this is unconfirmed. I have changed the data type of isin and display format, but it did not help.


    Two types of code possible:

    Code:
    clear all
    use "Results\Add_April.dta"
    append using "Results\Add_June.dta"
    append using "Results\Add_July.dta"
    append using "Results\Add_September.dta"
    append using "Results\Add_December.dta"
    Code:
    clear
    save appended, emptyok
    local filelist: dir . files "*.dta"
    foreach file of local filelist {
      use `"`file'"', clear
      gen source = `"`file'"'
      append using appended.dta
      save appended.dta, replace
    }
    - describe -

    HTML Code:
    Contains data from Results\Add_April.dta
      obs:         3,783                          
     vars:             5                          30 Aug 2018 23:03
     size:        68,094                          
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    date            int     %td                   DATE
    isin            float   %12.0f     isin      
    rollspread      float   %9.0g                
    obs_count       float   %9.0g                
    acc_count       float   %9.0g                
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Sorted by: isin


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int date float(isin rollspread obs_count acc_count)
    19570 1    .358836   1 -631
    19571 1   .3559643   2 -630
    19572 1   .3604009   3 -629
    19575 1   .3710254   4 -628
    19576 1   .3779745   5 -627
    19577 1   .3691023   6 -626
    19578 1   .4060649   7 -625
    19579 1  .19622084   8 -624
    19582 1   .2437022   9 -623
    19583 1  .18175964  10 -622
    19584 1  .14028528  11 -621
    19585 1  .12642506  12 -620
    19586 1  .08998363  13 -619
    19589 1  .09458884  14 -618
    19590 1  .09458884  15 -617
    19591 1  .09458884  16 -616
    19592 1  .09458884  17 -615
    19593 1  .09458884  18 -614
    19596 1  .09458884  19 -613
    19597 1   .1148953  20 -612
    19598 1  .13956119  21 -611
    19599 1  .13956119  22 -610
    19600 1  .13956119  23 -609
    19603 1   .1000957  24 -608
    19604 1  .07179035  25 -607
    19605 1  .07179035  26 -606
    19606 1          0  27 -605
    19607 1          0  28 -604
    19610 1          0  29 -603
    19611 1          0  30 -602
    19612 1 .017816523  31 -601
    19613 1 .017816523  32 -600
    19614 1 .017816523  33 -599
    19617 1 .017816523  34 -598
    19618 1 .017816523  35 -597
    19619 1 .017816523  36 -596
    19620 1 .017816523  37 -595
    19621 1 .017816523  38 -594
    19624 1 .017816523  39 -593
    19625 1 .017816523  40 -592
    19626 1 .017816523  41 -591
    19627 1 .017816523  42 -590
    19628 1 .017816523  43 -589
    19631 1 .017816523  44 -588
    19632 1 .017816523  45 -587
    19633 1 .017816523  46 -586
    19634 1 .017816523  47 -585
    19635 1 .017816523  48 -584
    19638 1 .017816523  49 -583
    19639 1 .017816523  50 -582
    19640 1  .04874808  51 -581
    19641 1   .4369839  52 -580
    19642 1   .4318686  53 -579
    19645 1   .4318686  54 -578
    19646 1   .4318686  55 -577
    19647 1  .43874395  56 -576
    19648 1   .4115384  57 -575
    19649 1   .4162376  58 -574
    19652 1    .411549  59 -573
    19653 1   .4073917  60 -572
    19654 1   .4073917  61 -571
    19655 1   .4073917  62 -570
    19656 1   .4073917  63 -569
    19659 1   .4073917  64 -568
    19660 1   .4073917  65 -567
    19661 1   .4073917  66 -566
    19662 1   .4073917  67 -565
    19663 1   .4073917  68 -564
    19666 1   .4073917  69 -563
    19667 1   .4073917  70 -562
    19668 1   .4123865  71 -561
    19669 1   .3880012  72 -560
    19670 1          0  73 -559
    19673 1          0  74 -558
    19674 1          0  75 -557
    19675 1          0  76 -556
    19676 1          0  77 -555
    19677 1  .05367249  78 -554
    19680 1  .06559968  79 -553
    19681 1   .0757331  80 -552
    19682 1  .07581708  81 -551
    19683 1  .16967574  82 -550
    19684 1  .18460906  83 -549
    19687 1  .18460906  84 -548
    19688 1  .17089532  85 -547
    19689 1  .23653385  86 -546
    19690 1  .24286927  87 -545
    19691 1  .24286927  88 -544
    19694 1  .24286927  89 -543
    19695 1  .24286927  90 -542
    19696 1  .24286927  91 -541
    19697 1  .15771736  92 -540
    19698 1   .3278265  93 -539
    19701 1   .3450763  94 -538
    19702 1  .36467785  95 -537
    19703 1   .3523918  96 -536
    19704 1   .3573144  97 -535
    19705 1   .3754696  98 -534
    19708 1   .3877769  99 -533
    19709 1   .4009039 100 -532
    end
    format %td date
    label values isin isin
    label def isin 1 "DE000A0TU305", modify
    Thanks!
    Last edited by Felix Schrock; 30 Aug 2018, 16:10.

  • #2
    Without seeing all four of the files being appended I have to be a little bit speculative here, but I think I know what is happening.

    The clue is that isin has a value label. I'm going to guess that it was created using the -encode- command at some point. Whether that's the case or not, having a value label means what you see is not what you get. So even though you see DE000A0TU305 in a listing, or in the browser,
    Code:
    label values isin isin
    label def isin 1 "DE000A0TU305", modify
    tells me that what Stata actually has in the data is 1, with a side note to always display that as "DE000A0TU305."

    Now if the variable isin is like that in all four data sets, the value 1 may be associated to "DE000A0TU305" in one of those files, and that same value 1 may be associated to something completely different in each of the other files. When you append the files together, only one of those conventions (if memory serves, it's the one from the first file, but don't count on that) survives. So the four isin's that were 1 in the four separate files now are indistinguishable from each other in the appended file.

    Moral of the Story: Appending files (or, for that matter, merging them) that have variables with value labels is treacherous unless you first verify that all of the separate files are using the exact same value labeling. If not, you will end up with chaos.

    How to fix it: Go back to the original files, and -decode- the isin variable in each of them. Now you will no longer have 1, 2, 3, 4, but the strings that labeled them when you started. If you now -append- all these files together, there will be no confusion. Once you have finished the append, if you prefer to have these as a numeric variable, you can -encode- them, and Stata will create a new value label that applies to, and distinguishes, all 12 distinct isin strings. Or you can just leave the isin as a string variable for many purposes. (If you need to -xtset- with isin as a panel variable, then you must make it numeric. For other purposes, I think keeping it as a string is simpler, although in very large data sets, a labeled numeric value will take up less memory and disk space.)

    Comment


    • #3
      Yes, you are right. I used -encode- in earlier steps, since I was using isin as a panel variable in the beginning. But now it is not necessary anymore. Thank you for your explanation.

      Comment

      Working...
      X