Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Splitting variable without creating duplicates

    Hi, I am using Eurostat data for one project, I download them using:

    HTML Code:
    eurostatuse namq_10_gdp, noflags long
    But I have a problem since seasonally adjusted and unadjusted data are being exported into the same column, thus creating duplicates. Is there a way for me to split the variable into separate variables according to the seasonal adjustment identifier in a way that there will be no duplicates?

    Or alternatively, is there a way how to import them individually in the first place?

    So far I tried to generate new variables by the seasonal adjustment identifier, but the problem is that when I do this the new variables have values not next to each other but in a sort of staircase-like fashion:
    year country SA GDP NSA GDP
    1999 AT 1100 .
    1999 AT . 1000
    and thus I have a problem with duplicates. What I would like would be to have them like
    year country SA GDP NSA GDP
    1999 AT 1100 1000


    Thanks for any help.
    Last edited by Jozef Patrnciak; 24 Jun 2018, 16:46.

  • #2
    Although many of the Forum members work in finance or economics, this is not a finance and economics forum, and some of the members who respond most often to questions are from other disciplines. So in posting here, it is best not to refer to things that only those in your discipline will know about, such as Eurostat data. It would have been more helpful had you used the -dataex- command to show an actual example of the data you are starting from.

    That said, I have made some guesses about what is going on from your description. Do your data look like this?
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float year str3 country float gdp str3 adjustment
    1999 "AT"  1100 "SA" 
    1999 "AT"  1000 "NSA"
    2000 "AT"  1250 "SA" 
    2000 "AT"  1225 "NSA"
    1999 "XYZ"  750 "SA" 
    1999 "XYZ"  900 "NSA"
    2000 "XYZ"  800 "SA" 
    2000 "XYZ"  880 "NSA"
    end
    If that is what you are starting with, you can come very close to what you asked for by doing this:
    Code:
    reshape wide gdp, i(year country) j(adjustment) string
    Notes: 1. It is not possible to get exactly what you asked for because "SA GDP" and "NSA GDP", which contain blanks, are not legal variable names in Stata.

    2. The code above will only work on the assumption that, apart from the issue of some observations being SA and others being NSA, the variables country and year uniquely identify observations in the data. If that is not the case, then you will need to modify the code and include additional variables in the -i()- option.

    -reshape- is one of Stata's most useful and important commands. Do read the -reshape- chapter of the [D] volume of your PDF documentation. It takes a little bit of getting used to, and in the beginning you will make mistakes, but at some point it will "click" in your head and using it will become routine and effortless.

    If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When asking for help with code, always show example data. When showing example data, always use -dataex-.


    Comment


    • #3
      First, a piece of advice. Please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question.

      It is asking a lot of someone who wants to help you to first install software, then figure out how to use it to download a small sample of data rather than the full database for all years, all this before attacking your problem.

      The more you help others understand your problem, the more likely others are to be able to help you solve your problem.

      With that said, I did install the software and downloaded a (relatively small) sample of data.
      Code:
      . eurostatuse namq_10_gdp, noflags long start(2000Q1) end(2000Q4) geo(FR)
      
          Dataset: GDP and main components  (output, expenditure and income)
          Last update: 22.06.2018
          Start: 1975Q1
          End: 2018Q1
      
      Downloading and formating data ...
      Reshaping dataset ...
      Downloading and formating labels ...
      
      . describe
      
      Contains data
        obs:         4,320                          
       vars:            10                          
       size:     1,386,720                          
      ------------------------------------------------------------------------------------------------
                    storage   display    value
      variable name   type    format     label      variable label
      ------------------------------------------------------------------------------------------------
      unit            str14   %14s                  Unit of measure
      unit_label      str110  %110s                 
      s_adj           str3    %9s                   Seasonal adjustment
      s_adj_label     str77   %77s                  
      na_item         str11   %11s                  National accounts indicator (ESA 2010)
      na_item_label   str88   %88s                  
      geo             str2    %9s                   Geopolitical entity (reporting)
      geo_label       str6    %9s                   
      time            int     %tq                   
      namq_10_gdp     double  %10.0g                
      ------------------------------------------------------------------------------------------------
      Sorted by: unit  unit_label  s_adj  s_adj_label  na_item  na_item_label  geo  geo_label  time
           Note: Dataset has changed since last saved.
      
      . tab s_adj
      
         Seasonal |
       adjustment |      Freq.     Percent        Cum.
      ------------+-----------------------------------
               CA |        240        5.56        5.56
              NSA |      1,944       45.00       50.56
               SA |         64        1.48       52.04
              SCA |      2,072       47.96      100.00
      ------------+-----------------------------------
            Total |      4,320      100.00
      This suggests to me that the following may start you in a helpful direction.
      Code:
      rename namq_10_gdp gdp_
      drop s_adj_label
      generate seq = _n
      reshape wide gdp_, i(seq) j(s_adj) string
      describe
      which produces the following results
      Code:
      . rename namq_10_gdp gdp_
      
      . drop s_adj_label
      
      . generate seq = _n
      
      . reshape wide gdp_, i(seq) j(s_adj) string
      (note: j = CA NSA SA SCA)
      
      Data                               long   ->   wide
      -----------------------------------------------------------------------------
      Number of obs.                     4320   ->    4320
      Number of variables                  10   ->      12
      j variable (4 values)             s_adj   ->   (dropped)
      xij variables:
                                         gdp_   ->   gdp_CA gdp_NSA ... gdp_SCA
      -----------------------------------------------------------------------------
      
      . describe
      
      Contains data
        obs:         4,320                          
       vars:            12                          
       size:     1,162,080                          
      ------------------------------------------------------------------------------------------------
                    storage   display    value
      variable name   type    format     label      variable label
      ------------------------------------------------------------------------------------------------
      seq             float   %9.0g                 
      gdp_CA          double  %10.0g                CA gdp_
      gdp_NSA         double  %10.0g                NSA gdp_
      gdp_SA          double  %10.0g                SA gdp_
      gdp_SCA         double  %10.0g                SCA gdp_
      unit            str14   %14s                  Unit of measure
      unit_label      str110  %110s                 
      na_item         str11   %11s                  National accounts indicator (ESA 2010)
      na_item_label   str88   %88s                  
      geo             str2    %9s                   Geopolitical entity (reporting)
      geo_label       str6    %9s                   
      time            int     %tq                   
      ------------------------------------------------------------------------------------------------
      Sorted by: seq

      Comment


      • #4
        My apologies for not asking the question in a proper way.

        the following William's solution worked perfectly for me:

        Code:
         
         rename namq_10_gdp gdp_ drop s_adj_label generate seq = _n reshape wide gdp_, i(seq) j(s_adj) string describe
        Also, although I don't need it now, for the future I would be also interested to know if there is a way to reshape it in a way that all individual combinations of na_items (individual components of GDP) and s_adj would be created as new variables?

        I tried to include both na_item and s_adj like j(s_adj na_item) but it did not work

        Thanks again for the help

        Comment


        • #5
          First, an apology. The code in post #3 did not work correctly. A close look at at the output from reshape shows that the number of observations was unchanged by the command. While four new variables were created, they each appeared on a separate observation. That was the result of my not at all understanding the data, and thus using a made-up sequence number to identify groups of observations that were meant to be identical except for the seasonal adjustment indicator. Made-up sequence numbers work for reshape long, but not for reshape wide.

          Here is corrected code and output for the task set about in post #3.
          Code:
          rename namq_10_gdp gdp_
          drop s_adj_label
          reshape wide gdp_, i(unit na_item geo time) j(s_adj) string
          describe
          list unit na_item geo time gdp_* in 1/16, sep(4)
          Code:
          . drop s_adj_label
          
          . reshape wide gdp_, i(unit na_item geo time) j(s_adj) string
          (note: j = CA NSA SA SCA)
          
          Data                               long   ->   wide
          -----------------------------------------------------------------------------
          Number of obs.                     7936   ->    4208
          Number of variables                   9   ->      11
          j variable (4 values)             s_adj   ->   (dropped)
          xij variables:
                                             gdp_   ->   gdp_CA gdp_NSA ... gdp_SCA
          -----------------------------------------------------------------------------
          
          . describe
          
          Contains data
            obs:         4,208                          
           vars:            11                          
           size:     1,291,856                          
          ------------------------------------------------------------------------------------------------
                        storage   display    value
          variable name   type    format     label      variable label
          ------------------------------------------------------------------------------------------------
          unit            str14   %14s                  Unit of measure
          na_item         str11   %11s                  National accounts indicator (ESA 2010)
          geo             str2    %9s                   Geopolitical entity (reporting)
          time            int     %tq                  
          gdp_CA          double  %10.0g                CA gdp_
          gdp_NSA         double  %10.0g                NSA gdp_
          gdp_SA          double  %10.0g                SA gdp_
          gdp_SCA         double  %10.0g                SCA gdp_
          unit_label      str110  %110s                
          na_item_label   str88   %88s                  
          geo_label       str48   %48s                  
          ------------------------------------------------------------------------------------------------
          Sorted by: unit  na_item  geo  time
               Note: Dataset has changed since last saved.
          
          . list unit na_item geo time gdp_* in 1/16, sep(4)
          
               +-------------------------------------------------------------------------------+
               |       unit   na_item   geo     time     gdp_CA    gdp_NSA   gdp_SA    gdp_SCA |
               |-------------------------------------------------------------------------------|
            1. | CLV05_MEUR       B1G    DE   2000q1          .   490212.9        .   496314.7 |
            2. | CLV05_MEUR       B1G    DE   2000q2          .   497432.6        .   500701.9 |
            3. | CLV05_MEUR       B1G    DE   2000q3          .   509206.1        .   502423.5 |
            4. | CLV05_MEUR       B1G    DE   2000q4          .   505651.8        .   503256.5 |
               |-------------------------------------------------------------------------------|
            5. | CLV05_MEUR       B1G    FR   2000q1   362324.7   363607.6        .   360657.9 |
            6. | CLV05_MEUR       B1G    FR   2000q2   366022.7   365468.9        .   364253.2 |
            7. | CLV05_MEUR       B1G    FR   2000q3   357135.8   356423.6        .   366504.4 |
            8. | CLV05_MEUR       B1G    FR   2000q4   375718.4   375105.3        .   369785.2 |
               |-------------------------------------------------------------------------------|
            9. | CLV05_MEUR      B1GQ    DE   2000q1          .   547692.7        .   555231.8 |
           10. | CLV05_MEUR      B1GQ    DE   2000q2          .   556866.8        .   560858.7 |
           11. | CLV05_MEUR      B1GQ    DE   2000q3          .   566713.7        .   560063.6 |
           12. | CLV05_MEUR      B1GQ    DE   2000q4          .   565184.7        .   560491.8 |
               |-------------------------------------------------------------------------------|
           13. | CLV05_MEUR      B1GQ    FR   2000q1   400527.8   401824.2        .   401244.8 |
           14. | CLV05_MEUR      B1GQ    FR   2000q2   406961.4   406432.7        .   404910.4 |
           15. | CLV05_MEUR      B1GQ    FR   2000q3   397637.9     396945        .   407638.2 |
           16. | CLV05_MEUR      B1GQ    FR   2000q4   419763.5   419180.3        .   411097.2 |
               +-------------------------------------------------------------------------------+
          Now to address the question in post #4. There are two approaches to the task, and my preference is to do two successive reshapes, rather than generate a combined "seasonal adjustment and NA item" variable. So starting from the results above
          Code:
          rename (gdp_*) (=_)
          drop na_item_label
          reshape wide gdp_*_, i(unit geo time) j(na_item) string
          describe, short
          list unit geo time in 1/16, sep(4)
          Code:
          . rename (gdp_*) (=_)
          
          . drop na_item_label
          
          . reshape wide gdp_*_, i(unit geo time) j(na_item) string
          (note: j = B11 B111 B112 B1G B1GQ B2A3G D1 D11 D12 D2 D21 D21X31 D2X3 D3 D31 P3 P31_S13 P31_S14
          > P31_S14_S15 P31_S15 P32_S13 P3_P5 P3_P6 P3_S13 P41 P51G P52 P52_P53 P53 P5G P6 P61 P62 P7 P71
          > P72 YA0 YA1 YA2)
          
          Data                               long   ->   wide
          -----------------------------------------------------------------------------
          Number of obs.                     4208   ->     176
          Number of variables                  10   ->     161
          j variable (39 values)          na_item   ->   (dropped)
          xij variables:
                                          gdp_CA_   ->   gdp_CA_B11 gdp_CA_B111 ... gdp_CA_YA2
                                         gdp_NSA_   ->   gdp_NSA_B11 gdp_NSA_B111 ... gdp_NSA_YA2
                                          gdp_SA_   ->   gdp_SA_B11 gdp_SA_B111 ... gdp_SA_YA2
                                         gdp_SCA_   ->   gdp_SCA_B11 gdp_SCA_B111 ... gdp_SCA_YA2
          -----------------------------------------------------------------------------
          
          . describe, short
          
          Contains data
            obs:           176                          
           vars:           161                          
           size:       250,624                          
          Sorted by: unit  geo  time
               Note: Dataset has changed since last saved.
          
          . list unit geo time in 1/16, sep(4)
          
               +---------------------------+
               |       unit   geo     time |
               |---------------------------|
            1. | CLV05_MEUR    DE   2000q1 |
            2. | CLV05_MEUR    DE   2000q2 |
            3. | CLV05_MEUR    DE   2000q3 |
            4. | CLV05_MEUR    DE   2000q4 |
               |---------------------------|
            5. | CLV05_MEUR    FR   2000q1 |
            6. | CLV05_MEUR    FR   2000q2 |
            7. | CLV05_MEUR    FR   2000q3 |
            8. | CLV05_MEUR    FR   2000q4 |
               |---------------------------|
            9. | CLV05_MNAC    DE   2000q1 |
           10. | CLV05_MNAC    DE   2000q2 |
           11. | CLV05_MNAC    DE   2000q3 |
           12. | CLV05_MNAC    DE   2000q4 |
               |---------------------------|
           13. | CLV05_MNAC    FR   2000q1 |
           14. | CLV05_MNAC    FR   2000q2 |
           15. | CLV05_MNAC    FR   2000q3 |
           16. | CLV05_MNAC    FR   2000q4 |
               +---------------------------+

          Comment


          • #6
            Oh yes thanks for correction I first thought it worked but then discovered there is still the problem. Thanks for all the effort.

            Comment

            Working...
            X