Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Help with merging aggregate data to survey data

    Dear Stata list members,

    I am currently trying to add several macro indicators from the Eurostat database towards the cross-sectional study data (European Social Survey).
    However, I am struggling at a pre-processing stage.

    The excel files were already successfully imported into Stata.

    The macro indicators are differentiated by the NUTS regional classification of European regions. The concept of NUTS classification is that NUTS 3 can be aggregated to NUTS 2, and correspondingly NUTS 2 to NUTS 1 data.
    Now, I do have the issue that some countries like Germany and UK have only NUTS level 1 data, meaning that they have rather broad regions(Federal states), whereas some other countries have smaller spatial data like NUTS level 2 and NUTS level 3.

    My aim is to use the smallest possible scale of regional differentiation. This is only limited to NUTS level 2 due to the availability of macro indicators up to that level in Eurostat, e.g. regional unemployment numbers, population density etc.

    So, I do have two data files containing the spatial area in km² , one with NUTS level 2 data (19 countries) and the other with NUTS level 1 data (Germany and UK).

    Appending results in the following:

    Code:
    use "$area\reg_area3_nuts1.dta", clear
    append using "$area\reg_area3_nuts2.dta", nolabel gen(source)
    
    . describe
    
    Contains data from ...\Data\Macro-indicators\Area in square kilometers\reg_area3_nuts1.dta
      obs:           447                          
     vars:             4                          7 Jul 2019 15:56
     size:        83,589                          
    ------------------------------------------------------------------------------------------------------------------------
                  storage  display     value
    variable name   type   format      label      variable label
    ------------------------------------------------------------------------------------------------------------------------
    nuts_label      str61  %61s                   NUTS label
    area           long   %8.0g       area       Area in km2
    nuts1label      str61  %61s                   NUTS 1 label
    nuts2label      str61  %61s                   NUTS 2 label
    Two things strike me as odd:
    1. The values for the variable "area" are apparently sorted by the first digit oft the value and not in an ascending manner. How is that possible?


    Code:
    . sort area
    . list in 1/10
    +--------------------------------------+
         |   area           nuts_label   source |
         |--------------------------------------|
      1. | 100450           Basilicata        1 |
      2. | 100450               Ísland        0 |
      3. |  11952        Île de France        0 |
      4. |  11952               Ísland        1 |
      5. |  12998   West Midlands (UK)        0 |
         +--------------------------------------+
    Code:
    . list in 17/21
    
    +-------------------------------------+
    area           nuts_label   source
    -------------------------------------
    17.  15408              Abruzzo        1
    18.  15408   Schleswig-Holstein        0
    19.   1553           Nord-Norge        1
    20.   1553                Åland        0
    21.  15623   East Midlands (UK)        0
    +-------------------------------------+
    And the second issue that occurs is:

    The values for the variable "area" of the appended file with NUTS 2 level data have changed and seem to have rather arbitrary values.
    Two examples for illustration purposes: Prov. Antwerpen has 2804 km² as correct and original value in "reg_area3_nuts2.dta" but now has changed to 149 km²
    and Prov. Limburg (BE) has 2390 km² as original value and now is 124 km².

    What went wrong? Is it the storage type?


    In case it helps, I run Stata on a Windows systems and do use version 12.1, but have also tried it on 13. so I guess it is not the Stata version that's amiss here

    Best regards
    Thomas
    Last edited by Thomas Enns; 07 Jul 2019, 09:40.

  • #2
    Welcome to Statalist.

    The values for the variable "area" are apparently sorted by the first digit oft the value and not in an ascending manner. How is that possible?

    The values for the variable "area" of the appended file with NUTS 2 level data have changed and seem to have rather arbitrary values.
    Both of these are a consequence of they way you imported your data from Excel. I believe that area was imported a string variable, and you used the encode command to convert it to numeric values. That is very much a mistake, as the output of help encode tells us.

    Do not use encode if varname contains numbers that merely happen to be stored as strings
    Instead, you should read the output of help destring and use the destring command to convert area from string to numeric. What you currently have is of no use to you.

    As a further explanation of the symptoms you are seeing, as opposed to the mistake that led to the symptoms, the output of the describe command tells us
    Code:
    ----------------------------------------------------------------------------------
                  storage  display     value
    variable name   type   format      label      variable label
    ----------------------------------------------------------------------------------
    area            long   %8.0g       area       Area in km2
    From that we understand that area is a numeric variable that has a value label assigned to it, which translates each numeric value into a text string - which in your case happens to consist of numeric digits. So what you are seeing are the value labels assigned to the actual values, and they are sorted alphabetically. Try
    Code:
    list in 17/21, nolabel
    to see what the actual values coded for area are, and see
    Code:
    label list area
    to see what all possible values and labels are.

    Let me also add a tip. If destring tells you that there are observations of area with nonnumeric values, so it cannot convert area to numeric, then
    Code:
    browse if real(area)==.
    will show you the observations in question, which you will have to fix. Very often "NA" is used in Excel workbooks to indicate a "not applicable" data item, and if that is the case, you will first want to
    Code:
    replace area = "" if area=="NA"
    before running the destring command.

    Comment


    • #3
      Dear William,

      thank you for the warm welcome and the precise help.
      Your answer solved my problems.

      Originally posted by William Lisowski View Post
      Let me also add a tip. If destring tells you that there are observations of area with nonnumeric values, so it cannot convert area to numeric, then
      Ironically, this "error" notice of destring was exactly the reason why I used the encode command in the first place.
      Thanks again, for your help.

      Comment

      Working...
      X