Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Import CSV with duplicate variable names

    I am trying to import a csv file that is structured as below with duplicate variable names:

    VariableA,VariableA, VariableA, VariableX,VariableX, VariableX
    2.5%, 50%, 97.5%,2.5%, 50%, 975%,
    10,20,30,40,30,40
    15,25,35,45,30,40
    20,30,40,50,30,40
    25,35,45,55,30,40
    30,40,50,60,30,60

    When I import the csv file as
    Code:
     import delimited using "filename.csv", clear varnames(1)
    Only the first variable gets the varname, while the duplicates get v2, v3 as
    VariableA, v2, v3, VariableX, v5, v6

    I have a lot of variables. I need a way to use combined values from the first and second row to create variable names

    So, maybe I can start with
    Code:
     import delimited using "filename.csv", clear
    without specifying to use row1 as varnames
    This creates all variables as v1, v2, v3, v4 and so on

    what would be the next step?

    Or, is there any other solution?

  • #2
    I don't think there is a universal recipe here. As you say you need to take metadata from observations 1 and 2. Given this kind of problem I would import everything as string and then clean up within Stata.

    I copied your example into whatever.csv and kept revising a script until it got just about where I guess you want to be.

    Code:
    import delimited "whatever.csv", clear 
    
    forval j = 1/7 {
       capture { 
           local newname = strtoname(v`j'[1] + v`j'[2])
           local newlabel = trim(v`j'[1] + " " + v`j'[2]) 
           label var v`j' "`newlabel'"
           rename v`j' `newname'
       }
    } 
    
    describe 
    
    list 
    
    drop in 1/2 
    
    destring, replace 
    
    rename _* * 
    
    describe 
    
    list

    Here's the result:

    Code:
    . describe 
    
    Contains data
     Observations:             5                  
        Variables:             7                  
    ----------------------------------------------------------------------------------------------------------
    Variable      Storage   Display    Value
        name         type    format    label      Variable label
    ----------------------------------------------------------------------------------------------------------
    VariableA2_5_   byte    %10.0g                VariableA 2.5%
    VariableA50_    byte    %10.0g                VariableA 50%
    VariableA97_5_  byte    %10.0g                VariableA 97.5%
    VariableX2_5_   byte    %10.0g                VariableX 2.5%
    VariableX50_    byte    %10.0g                VariableX 50%
    VariableX975_   byte    %10.0g                VariableX 975%
    v7              byte    %8.0g                 
    ----------------------------------------------------------------------------------------------------------
    Sorted by: 
         Note: Dataset has changed since last saved.
    
    . 
    . list 
    
         +----------------------------------------------------------------------+
         | Va~A2_5_   Var~A50_   Var~7_5_   Va~X2_5_   Var~X50_   Var~975_   v7 |
         |----------------------------------------------------------------------|
      1. |       10         20         30         40         30         40    . |
      2. |       15         25         35         45         30         40    . |
      3. |       20         30         40         50         30         40    . |
      4. |       25         35         45         55         30         40    . |
      5. |       30         40         50         60         30         60    . |
         +----------------------------------------------------------------------+

    The problems that remain include some that are just side-effects of typos or inconsistencies in your example.

    Otherwise you shouldn't expect variable names to include any punctuation other than underscores.

    Comment


    • #3
      Consider this:

      Code:
      *IMPORTED DATASET
      clear
      set obs 10
      forval i=1/12{
          gen var`i'=runiformint(0, 300)
      }
      list, sep(0)
      
      *START HERE
      local k 0
      local vars
      forval i=1/`=c(k)'{
          local j= ceil(`i'/3)
          local ++k 
          local k= cond(`k'==4, 1, `k')  
          rename var`i' name`j'_`k'
          local vars `vars' name`j'
      }
      local vars: list uniq vars
      gen long obsno=_n
      reshape long `vars', i(obsno) j(which) string
      destring which, ignore(_) replace
      sort which obsno
      list, sepby(which)
      This will work as long as variables follow a consistent pattern (consecutive groups of 3 variables) and each sequence has the same variable type (either all numeric or all string). You will have to rename the groups of variables subsequently.

      Res.:

      Code:
      *IMPORTED DATASET
      . list, sep(0)
      
           +--------------------------------------------------------------------------------------+
           | var1   var2   var3   var4   var5   var6   var7   var8   var9   var10   var11   var12 |
           |--------------------------------------------------------------------------------------|
        1. |   75    109    212     30    271     97    150     14    261     276     227     206 |
        2. |  219     11     24    212     87    300     73     82    105      66     245      20 |
        3. |  283      1    136    194    169     90     91      6    296      29      18      41 |
        4. |  254    128    185    222    277    166    150    189    206     196     279     137 |
        5. |   40    142     62    123     68     33    298      5    206     143       6     157 |
        6. |  114    212     53     65    259     51    289    293    188      87     204     225 |
        7. |   96    201    300    120    182     67    284    125    228      10     241      72 |
        8. |    2     90    114    229     41      0    249     37     38     182     221     122 |
        9. |  249    131     54    256     46     34    178    186     73      98     160      46 |
       10. |   85    163    190    136     28    266      4    119    268     140     100      37 |
           +--------------------------------------------------------------------------------------+
      
      *RESHAPED DATASET
      . 
      . list, sepby(which)
      
           +-----------------------------------------------+
           | obsno   which   name1   name2   name3   name4 |
           |-----------------------------------------------|
        1. |     1       1      75      30     150     276 |
        2. |     2       1     219     212      73      66 |
        3. |     3       1     283     194      91      29 |
        4. |     4       1     254     222     150     196 |
        5. |     5       1      40     123     298     143 |
        6. |     6       1     114      65     289      87 |
        7. |     7       1      96     120     284      10 |
        8. |     8       1       2     229     249     182 |
        9. |     9       1     249     256     178      98 |
       10. |    10       1      85     136       4     140 |
           |-----------------------------------------------|
       11. |     1       2     109     271      14     227 |
       12. |     2       2      11      87      82     245 |
       13. |     3       2       1     169       6      18 |
       14. |     4       2     128     277     189     279 |
       15. |     5       2     142      68       5       6 |
       16. |     6       2     212     259     293     204 |
       17. |     7       2     201     182     125     241 |
       18. |     8       2      90      41      37     221 |
       19. |     9       2     131      46     186     160 |
       20. |    10       2     163      28     119     100 |
           |-----------------------------------------------|
       21. |     1       3     212      97     261     206 |
       22. |     2       3      24     300     105      20 |
       23. |     3       3     136      90     296      41 |
       24. |     4       3     185     166     206     137 |
       25. |     5       3      62      33     206     157 |
       26. |     6       3      53      51     188     225 |
       27. |     7       3     300      67     228      72 |
       28. |     8       3     114       0      38     122 |
       29. |     9       3      54      34      73      46 |
       30. |    10       3     190     266     268      37 |
           +-----------------------------------------------+

      Comment


      • #4
        Thank you, it works perfectly!

        Comment

        Working...
        X