Splitting variable without creating duplicates

Jozef Patrnciak

Join Date: Dec 2017

Posts: 24
#1

Splitting variable without creating duplicates

24 Jun 2018, 16:35

Hi, I am using Eurostat data for one project, I download them using:

HTML Code:

eurostatuse namq_10_gdp, noflags long

But I have a problem since seasonally adjusted and unadjusted data are being exported into the same column, thus creating duplicates. Is there a way for me to split the variable into separate variables according to the seasonal adjustment identifier in a way that there will be no duplicates?

Or alternatively, is there a way how to import them individually in the first place?

So far I tried to generate new variables by the seasonal adjustment identifier, but the problem is that when I do this the new variables have values not next to each other but in a sort of staircase-like fashion:
year country SA GDP NSA GDP

1999 AT 1100 .

1999 AT . 1000

and thus I have a problem with duplicates. What I would like would be to have them like
year country SA GDP NSA GDP

1999 AT 1100 1000

Thanks for any help.

Last edited by Jozef Patrnciak; 24 Jun 2018, 16:46.
Tags: data, importing data, panel data
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#2

24 Jun 2018, 17:10

Although many of the Forum members work in finance or economics, this is not a finance and economics forum, and some of the members who respond most often to questions are from other disciplines. So in posting here, it is best not to refer to things that only those in your discipline will know about, such as Eurostat data. It would have been more helpful had you used the -dataex- command to show an actual example of the data you are starting from.

That said, I have made some guesses about what is going on from your description. Do your data look like this?

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float year str3 country float gdp str3 adjustment 1999 "AT" 1100 "SA" 1999 "AT" 1000 "NSA" 2000 "AT" 1250 "SA" 2000 "AT" 1225 "NSA" 1999 "XYZ" 750 "SA" 1999 "XYZ" 900 "NSA" 2000 "XYZ" 800 "SA" 2000 "XYZ" 880 "NSA" end

If that is what you are starting with, you can come very close to what you asked for by doing this:

Code:

reshape wide gdp, i(year country) j(adjustment) string

Notes: 1. It is not possible to get exactly what you asked for because "SA GDP" and "NSA GDP", which contain blanks, are not legal variable names in Stata.

2. The code above will only work on the assumption that, apart from the issue of some observations being SA and others being NSA, the variables country and year uniquely identify observations in the data. If that is not the case, then you will need to modify the code and include additional variables in the -i()- option.

-reshape- is one of Stata's most useful and important commands. Do read the -reshape- chapter of the [D] volume of your PDF documentation. It takes a little bit of getting used to, and in the beginning you will make mistakes, but at some point it will "click" in your head and using it will become routine and effortless.

If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

24 Jun 2018, 17:12

First, a piece of advice. Please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question.

It is asking a lot of someone who wants to help you to first install software, then figure out how to use it to download a small sample of data rather than the full database for all years, all this before attacking your problem.

The more you help others understand your problem, the more likely others are to be able to help you solve your problem.

With that said, I did install the software and downloaded a (relatively small) sample of data.

Code:

. eurostatuse namq_10_gdp, noflags long start(2000Q1) end(2000Q4) geo(FR)

    Dataset: GDP and main components  (output, expenditure and income)
    Last update: 22.06.2018
    Start: 1975Q1
    End: 2018Q1

Downloading and formating data ...
Reshaping dataset ...
Downloading and formating labels ...

. describe

Contains data
  obs:         4,320                          
 vars:            10                          
 size:     1,386,720                          
------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------
unit            str14   %14s                  Unit of measure
unit_label      str110  %110s                 
s_adj           str3    %9s                   Seasonal adjustment
s_adj_label     str77   %77s                  
na_item         str11   %11s                  National accounts indicator (ESA 2010)
na_item_label   str88   %88s                  
geo             str2    %9s                   Geopolitical entity (reporting)
geo_label       str6    %9s                   
time            int     %tq                   
namq_10_gdp     double  %10.0g                
------------------------------------------------------------------------------------------------
Sorted by: unit  unit_label  s_adj  s_adj_label  na_item  na_item_label  geo  geo_label  time
     Note: Dataset has changed since last saved.

. tab s_adj

   Seasonal |
 adjustment |      Freq.     Percent        Cum.
------------+-----------------------------------
         CA |        240        5.56        5.56
        NSA |      1,944       45.00       50.56
         SA |         64        1.48       52.04
        SCA |      2,072       47.96      100.00
------------+-----------------------------------
      Total |      4,320      100.00

This suggests to me that the following may start you in a helpful direction.

Code:

rename namq_10_gdp gdp_
drop s_adj_label
generate seq = _n
reshape wide gdp_, i(seq) j(s_adj) string
describe

which produces the following results

Code:

. rename namq_10_gdp gdp_

. drop s_adj_label

. generate seq = _n

. reshape wide gdp_, i(seq) j(s_adj) string
(note: j = CA NSA SA SCA)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                     4320   ->    4320
Number of variables                  10   ->      12
j variable (4 values)             s_adj   ->   (dropped)
xij variables:
                                   gdp_   ->   gdp_CA gdp_NSA ... gdp_SCA
-----------------------------------------------------------------------------

. describe

Contains data
  obs:         4,320                          
 vars:            12                          
 size:     1,162,080                          
------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------
seq             float   %9.0g                 
gdp_CA          double  %10.0g                CA gdp_
gdp_NSA         double  %10.0g                NSA gdp_
gdp_SA          double  %10.0g                SA gdp_
gdp_SCA         double  %10.0g                SCA gdp_
unit            str14   %14s                  Unit of measure
unit_label      str110  %110s                 
na_item         str11   %11s                  National accounts indicator (ESA 2010)
na_item_label   str88   %88s                  
geo             str2    %9s                   Geopolitical entity (reporting)
geo_label       str6    %9s                   
time            int     %tq                   
------------------------------------------------------------------------------------------------
Sorted by: seq

Comment

Jozef Patrnciak

Join Date: Dec 2017

Posts: 24
#4

25 Jun 2018, 04:22

My apologies for not asking the question in a proper way.

the following William's solution worked perfectly for me:

Code:

rename namq_10_gdp gdp_ drop s_adj_label generate seq = _n reshape wide gdp_, i(seq) j(s_adj) string describe

Also, although I don't need it now, for the future I would be also interested to know if there is a way to reshape it in a way that all individual combinations of na_items (individual components of GDP) and s_adj would be created as new variables?

I tried to include both na_item and s_adj like j(s_adj na_item) but it did not work

Thanks again for the help
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

25 Jun 2018, 08:29

First, an apology. The code in post #3 did not work correctly. A close look at at the output from reshape shows that the number of observations was unchanged by the command. While four new variables were created, they each appeared on a separate observation. That was the result of my not at all understanding the data, and thus using a made-up sequence number to identify groups of observations that were meant to be identical except for the seasonal adjustment indicator. Made-up sequence numbers work for reshape long, but not for reshape wide.

Here is corrected code and output for the task set about in post #3.

Code:

rename namq_10_gdp gdp_
drop s_adj_label
reshape wide gdp_, i(unit na_item geo time) j(s_adj) string
describe
list unit na_item geo time gdp_* in 1/16, sep(4)

Code:

. drop s_adj_label

. reshape wide gdp_, i(unit na_item geo time) j(s_adj) string
(note: j = CA NSA SA SCA)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                     7936   ->    4208
Number of variables                   9   ->      11
j variable (4 values)             s_adj   ->   (dropped)
xij variables:
                                   gdp_   ->   gdp_CA gdp_NSA ... gdp_SCA
-----------------------------------------------------------------------------

. describe

Contains data
  obs:         4,208                          
 vars:            11                          
 size:     1,291,856                          
------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------
unit            str14   %14s                  Unit of measure
na_item         str11   %11s                  National accounts indicator (ESA 2010)
geo             str2    %9s                   Geopolitical entity (reporting)
time            int     %tq                  
gdp_CA          double  %10.0g                CA gdp_
gdp_NSA         double  %10.0g                NSA gdp_
gdp_SA          double  %10.0g                SA gdp_
gdp_SCA         double  %10.0g                SCA gdp_
unit_label      str110  %110s                
na_item_label   str88   %88s                  
geo_label       str48   %48s                  
------------------------------------------------------------------------------------------------
Sorted by: unit  na_item  geo  time
     Note: Dataset has changed since last saved.

. list unit na_item geo time gdp_* in 1/16, sep(4)

     +-------------------------------------------------------------------------------+
     |       unit   na_item   geo     time     gdp_CA    gdp_NSA   gdp_SA    gdp_SCA |
     |-------------------------------------------------------------------------------|
  1. | CLV05_MEUR       B1G    DE   2000q1          .   490212.9        .   496314.7 |
  2. | CLV05_MEUR       B1G    DE   2000q2          .   497432.6        .   500701.9 |
  3. | CLV05_MEUR       B1G    DE   2000q3          .   509206.1        .   502423.5 |
  4. | CLV05_MEUR       B1G    DE   2000q4          .   505651.8        .   503256.5 |
     |-------------------------------------------------------------------------------|
  5. | CLV05_MEUR       B1G    FR   2000q1   362324.7   363607.6        .   360657.9 |
  6. | CLV05_MEUR       B1G    FR   2000q2   366022.7   365468.9        .   364253.2 |
  7. | CLV05_MEUR       B1G    FR   2000q3   357135.8   356423.6        .   366504.4 |
  8. | CLV05_MEUR       B1G    FR   2000q4   375718.4   375105.3        .   369785.2 |
     |-------------------------------------------------------------------------------|
  9. | CLV05_MEUR      B1GQ    DE   2000q1          .   547692.7        .   555231.8 |
 10. | CLV05_MEUR      B1GQ    DE   2000q2          .   556866.8        .   560858.7 |
 11. | CLV05_MEUR      B1GQ    DE   2000q3          .   566713.7        .   560063.6 |
 12. | CLV05_MEUR      B1GQ    DE   2000q4          .   565184.7        .   560491.8 |
     |-------------------------------------------------------------------------------|
 13. | CLV05_MEUR      B1GQ    FR   2000q1   400527.8   401824.2        .   401244.8 |
 14. | CLV05_MEUR      B1GQ    FR   2000q2   406961.4   406432.7        .   404910.4 |
 15. | CLV05_MEUR      B1GQ    FR   2000q3   397637.9     396945        .   407638.2 |
 16. | CLV05_MEUR      B1GQ    FR   2000q4   419763.5   419180.3        .   411097.2 |
     +-------------------------------------------------------------------------------+

Now to address the question in post #4. There are two approaches to the task, and my preference is to do two successive reshapes, rather than generate a combined "seasonal adjustment and NA item" variable. So starting from the results above

Code:

rename (gdp_*) (=_)
drop na_item_label
reshape wide gdp_*_, i(unit geo time) j(na_item) string
describe, short
list unit geo time in 1/16, sep(4)

Code:

. rename (gdp_*) (=_)

. drop na_item_label

. reshape wide gdp_*_, i(unit geo time) j(na_item) string
(note: j = B11 B111 B112 B1G B1GQ B2A3G D1 D11 D12 D2 D21 D21X31 D2X3 D3 D31 P3 P31_S13 P31_S14
> P31_S14_S15 P31_S15 P32_S13 P3_P5 P3_P6 P3_S13 P41 P51G P52 P52_P53 P53 P5G P6 P61 P62 P7 P71
> P72 YA0 YA1 YA2)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                     4208   ->     176
Number of variables                  10   ->     161
j variable (39 values)          na_item   ->   (dropped)
xij variables:
                                gdp_CA_   ->   gdp_CA_B11 gdp_CA_B111 ... gdp_CA_YA2
                               gdp_NSA_   ->   gdp_NSA_B11 gdp_NSA_B111 ... gdp_NSA_YA2
                                gdp_SA_   ->   gdp_SA_B11 gdp_SA_B111 ... gdp_SA_YA2
                               gdp_SCA_   ->   gdp_SCA_B11 gdp_SCA_B111 ... gdp_SCA_YA2
-----------------------------------------------------------------------------

. describe, short

Contains data
  obs:           176                          
 vars:           161                          
 size:       250,624                          
Sorted by: unit  geo  time
     Note: Dataset has changed since last saved.

. list unit geo time in 1/16, sep(4)

     +---------------------------+
     |       unit   geo     time |
     |---------------------------|
  1. | CLV05_MEUR    DE   2000q1 |
  2. | CLV05_MEUR    DE   2000q2 |
  3. | CLV05_MEUR    DE   2000q3 |
  4. | CLV05_MEUR    DE   2000q4 |
     |---------------------------|
  5. | CLV05_MEUR    FR   2000q1 |
  6. | CLV05_MEUR    FR   2000q2 |
  7. | CLV05_MEUR    FR   2000q3 |
  8. | CLV05_MEUR    FR   2000q4 |
     |---------------------------|
  9. | CLV05_MNAC    DE   2000q1 |
 10. | CLV05_MNAC    DE   2000q2 |
 11. | CLV05_MNAC    DE   2000q3 |
 12. | CLV05_MNAC    DE   2000q4 |
     |---------------------------|
 13. | CLV05_MNAC    FR   2000q1 |
 14. | CLV05_MNAC    FR   2000q2 |
 15. | CLV05_MNAC    FR   2000q3 |
 16. | CLV05_MNAC    FR   2000q4 |
     +---------------------------+

Comment

Jozef Patrnciak

Join Date: Dec 2017

Posts: 24
#6

25 Jun 2018, 12:34

Oh yes thanks for correction I first thought it worked but then discovered there is still the problem. Thanks for all the effort.
Comment

year	country	SA GDP	NSA GDP
1999	AT	1100	.
1999	AT	.	1000

year	country	SA GDP	NSA GDP
1999	AT	1100	1000

Announcement