importing text file to STATA

Fuga Iwama

Join Date: Dec 2018

Posts: 28
#1

05 Dec 2018, 06:39

Hi, I am new to Stata and am trying to figure out how to import text files into Stata.

I know that to import text data I go to File → Import → Text data, select a file, and so on.
However, my data do not seem to have variable names, so all I get is a strange series of numbers.

As I cannot attach the data, I have included a link to download them. It is the 2009 survey, and the bottom link on that page lets you download the data. I have also attached the variable list, written in English.

Could anyone help me import the data into Stata?

Thanks in advance!

Attached Files

Diseño Adulto_Publicación.es.en (1).xls (334.5 KB)

INEbase / Society /Health /European Survey of Health in Spain / Results/ Microdata

https://www.ine.es

Mike Lacy

Join Date: Apr 2014

Posts: 2413
#2

05 Dec 2018, 07:31

A file whose type is "xls" is likely a Microsoft Excel file, as the website you list indicates. You should try File > Import > Excel spreadsheet.

Furthermore, few people on this forum will inspect an attached file, especially not ones that are not text files.
Fuga Iwama

Join Date: Dec 2018

Posts: 28
#3

05 Dec 2018, 10:58

Dear sir,

Firstly, thank you so much for the advice.

The file that I attached (the Excel file) is the variable list, not the actual data.

I somehow cannot attach the actual text data file, hence I posted the link.
Further details are shown in the attached image below.

I would appreciate it if you could show me how these numbers map to the variables listed in the Excel file.

If I have misunderstood anything, please do let me know, as I am totally new to Stata!
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#4

05 Dec 2018, 11:05

It seems the file is compressed. Perhaps you need to unzip it first.

Best regards,

Marcos
Fuga Iwama

Join Date: Dec 2018

Posts: 28
#5

05 Dec 2018, 11:20

To Marcos,

Thanks for the comment. I unzipped the file after downloading it. After unzipping, the text file shows a bunch of numbers with no variable names.

David Benson

Join Date: Oct 2018

Posts: 489
#6

05 Dec 2018, 17:41

Fuga,

I was able to import the data into Excel and then paste it into Stata (for some reason I had trouble using Stata's import manager to get the data in). Essentially, each observation (row) is a 532-character string, with each string position meaning something. For example, the first 2 digits are a code for the region of residence (01 Andalucía, 02 Aragón, 03 Asturias (Princ de), 04 Baleares (Islas), etc.). That's all in the Excel file with the variable names, positions, and codes.

What I don't know is an easy way to automate all of the substring calls to parse the long string into all of the various variables, and also to assign the various labels.
N = 22,188 observations

Code:

* Creating the first 3 variables
rename var1 orig_data
label var orig_data "Data from survey; 532-char string"
gen region_resid = substr( orig_data, 1, 2)
gen muni_size = substr( orig_data, 3, 1)
gen hogar = substr( orig_data, 4, 8)

* Creating a note for each var that lists how created (and which digit it is)
note region_resid: substr( orig_data, 1, 2)
note muni_size: substr( orig_data, 3, 1)
note hogar: substr( orig_data, 4, 8)

tabulate region_resid
tabulate muni_size

I combined a couple of tables with the positions and descriptions of the variables.
Hopefully someone can come along and help you create a data dictionary, or automate the creation of the variables.

orig_data = Original 532 char string variable
pos_initial = position where this var starts
var_length = length of variable

So you ought to then be able to do some loop that does

gen CCAA = substr(orig_data, pos_initial, var_length)
and repeat for each variable.
You could label the variable using var_label at the same time

variable	var_length	pos_initial	pos_final	var_label
CCAA	2	1	2	Comunidad Autónoma
TMUNI	1	3	3	Tamaño de municipio
IDENTHOGAR	8	4	11	Sección + Vivienda + Hogar
NORDEN	2	12	13	Identificación de la persona seleccionada: Número de orden
SEXO	1	14	14	Identificación de la persona seleccionada:Sexo
EDAD	3	15	17	Identificación de la persona seleccionada:Edad
HH.PROXY_0	1	18	18	¿El informante es la persona seleccionada?
HH.PROXY_1	1	19	19	Informante proxy: ¿Cuál es el motivo por el que la persona seleccionada no facilita sus datos?
HH.PROXY_2	1	20	20	Informante proxy: ¿Es miembro del hogar el informante?
HH.PROXY_2b	2	21	22	Informante proxy: Número de orden del miembro del hogar
HH.PROXY_4	3	23	25	Informante proxy: Edad del Informante
HH.PROXY_5	1	26	26	Informante proxy: Relación del informante con el adulto seleccionado
HH9_1	1	27	27	País de nacimiento
HH9_2	3	28	30	País de nacimiento (código)
HH10_1a	1	31	31	Nacionalidad:Española
HH10_1b	1	32	32	Nacionalidad:Extranjera
HH10_1c	1	33	33	Nacionalidad:No sabe
HH10_1d	1	34	34	Nacionalidad:No contesta
HH10_2	3	35	37	País Nacionalidad (código)
HH11	1	38	38	Estado civil legal
HH12	1	39	39	Convive actualmente en pareja
HH12b	2	40	41	Número de orden de la pareja del adulto seleccionado
HH13	2	42	43	Nivel de estudios
HH14	1	44	44	Ha trabajado como asalariado o por cuenta propia
HH15a	1	45	45	Situación profesional en trabajo actual
HH15b	1	46	46	Situación profesional en último trabajo
HH16a	1	47	47	Tipo de contrato o relación laboral actual
HH16b	1	48	48	Tipo de contrato o relación laboral en último trabajo
HH17a	1	49	49	Ocupación actual: tiempo completo o parcial
HH17b	1	50	50	Última ocupación: tiempo completo o parcial
HH18a_3	2	51	52	Ocupación, profesión u oficio actual:Código ISCO-88, 2 dígitos
HH18b_3	2	53	54	Última ocupación, profesión u oficio:Código ISCO-88, 2 dígitos
HH19a_2	2	55	56	Actividad del establecimiento en que trabaja: código NACE rev.2, 2 dígitos
HH19b_2	2	57	58	Actividad del establecimiento en que trabajó: código NACE rev.2, 2 dígitos

I put it here in dataex to make it easier for others:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str11 variable byte(var_length pos_initial pos_final) str96 var_label
"CCAA"        2  1  2 "Comunidad Autónoma"                                                                            
"TMUNI"       1  3  3 "Tamaño de municipio"                                                                            
"IDENTHOGAR"  8  4 11 "Sección + Vivienda + Hogar"                                                                    
"NORDEN"      2 12 13 "Identificación de la persona seleccionada: Número de orden"                                    
"SEXO"        1 14 14 "Identificación de la persona seleccionada:Sexo"                                                
"EDAD"        3 15 17 "Identificación de la persona seleccionada:Edad"                                                
"HH.PROXY_0"  1 18 18 "¿El informante es la persona seleccionada?"                                                    
"HH.PROXY_1"  1 19 19 "Informante proxy: ¿Cuál es el motivo por el que la persona seleccionada no facilita sus datos?"
"HH.PROXY_2"  1 20 20 "Informante proxy: ¿Es miembro del hogar el informante?"                                        
"HH.PROXY_2b" 2 21 22 "Informante proxy: Número de orden del miembro del hogar"                                        
"HH.PROXY_4"  3 23 25 "Informante proxy: Edad del Informante"                                                          
"HH.PROXY_5"  1 26 26 "Informante proxy: Relación del informante con el adulto seleccionado"                          
"HH9_1"       1 27 27 "País de nacimiento"                                                                            
"HH9_2"       3 28 30 "País de nacimiento (código)"                                                                  
"HH10_1a"     1 31 31 "Nacionalidad:Española"                                                                          
"HH10_1b"     1 32 32 "Nacionalidad:Extranjera"                                                                        
"HH10_1c"     1 33 33 "Nacionalidad:No sabe"                                                                            
"HH10_1d"     1 34 34 "Nacionalidad:No contesta"                                                                        
"HH10_2"      3 35 37 "País Nacionalidad (código)"                                                                    
"HH11"        1 38 38 "Estado civil legal"                                                                              
"HH12"        1 39 39 "Convive actualmente en pareja"                                                                  
"HH12b"       2 40 41 "Número de orden de la pareja del adulto seleccionado"                                          
"HH13"        2 42 43 "Nivel de estudios"                                                                              
"HH14"        1 44 44 "Ha trabajado como asalariado o por cuenta propia"                                                
"HH15a"       1 45 45 "Situación profesional en trabajo actual"                                                        
"HH15b"       1 46 46 "Situación profesional en último trabajo"                                                      
"HH16a"       1 47 47 "Tipo de contrato o relación laboral actual"                                                    
"HH16b"       1 48 48 "Tipo de contrato o relación laboral en último trabajo"                                        
"HH17a"       1 49 49 "Ocupación actual: tiempo completo o parcial"                                                    
"HH17b"       1 50 50 "Última ocupación: tiempo completo o parcial"                                                  
"HH18a_3"     2 51 52 "Ocupación, profesión u oficio actual:Código ISCO-88, 2 dígitos"                              
"HH18b_3"     2 53 54 "Última ocupación, profesión u oficio:Código ISCO-88, 2 dígitos"                            
"HH19a_2"     2 55 56 "Actividad del establecimiento en que trabaja: código NACE rev.2, 2 dígitos"                    
"HH19b_2"     2 57 58 "Actividad del establecimiento en que trabajó: código NACE rev.2, 2 dígitos"                  
end
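
The loop suggested above could look roughly like the following. This is only a sketch on toy data: the three layout rows are taken from the table, but the two 11-character records are made up to stand in for the real 532-character lines, and the use of `strtoname()` to turn names such as HH.PROXY_0 into legal Stata names is an assumption about how you would want to name them.

```stata
* Sketch only: toy layout (first three fields of the table above)
clear
input str11 variable byte(var_length pos_initial) str40 var_label
"CCAA"       2 1 "Comunidad Autónoma"
"TMUNI"      1 3 "Tamaño de municipio"
"IDENTHOGAR" 8 4 "Sección + Vivienda + Hogar"
end

* copy the layout into numbered local macros so it survives loading the data
local K = _N
forvalues i = 1/`K' {
    local name`i'  = strtoname(variable[`i'])   // e.g. HH.PROXY_0 -> HH_PROXY_0
    local start`i' = pos_initial[`i']
    local len`i'   = var_length[`i']
    local lab`i'   = var_label[`i']
}

* made-up records standing in for the real 532-character strings
clear
input str11 orig_data
"01100000123"
"02300000456"
end

* one substr() call and one label per row of the layout
forvalues i = 1/`K' {
    gen `name`i'' = substr(orig_data, `start`i'', `len`i'')
    label variable `name`i'' "`lab`i''"
}
list
```

With the real data you would load the full layout first, then the dataset holding orig_data, and run the same two loops.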

Last edited by David Benson; 05 Dec 2018, 17:53.


David Benson

Join Date: Oct 2018

Posts: 489
#7

05 Dec 2018, 17:52

I tried to attach the files, but Statalist is telling me that they are too big. So DM me and I will post them to Dropbox or something.

Also, I found I could import the text file into Stata using:

Code:

import delimited "Adulto.txt", delimiter(comma)

Using comma as the delimiter is a trick because there are no commas in the file, but that sticks everything into a single variable that is 532 characters wide. You can then parse as we discussed above.
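
As a possible alternative to the single-string trick, Stata's infix command can read fixed column ranges directly. The sketch below writes a made-up two-line file to a temporary location (the records are invented; only the column positions for the first three fields come from the layout table) and reads it back. For the real file you would replace the temporary file with the path to the survey text file and list all fields.

```stata
* Sketch only: write a toy fixed-width file, then read it by column position
tempfile toy
tempname fh
file open `fh' using "`toy'", write text replace
file write `fh' "01100000123" _n
file write `fh' "02300000456" _n
file close `fh'

* columns 1-2, 3 and 4-11 follow the first three rows of the layout table
infix str ccaa 1-2 str tmuni 3 str identhogar 4-11 using "`toy'", clear
list
```

Reading the fields as strings keeps leading zeros such as "01"; destring can convert them afterwards if numeric codes are preferred.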

Hope that helps!
--David

Last edited by David Benson; 05 Dec 2018, 17:56.
Fuga Iwama

Join Date: Dec 2018

Posts: 28
#8

10 Dec 2018, 17:51

David Benson, thanks for the help! I have sent you a DM; please have a look!

Robert Picard

Join Date: Mar 2014

Posts: 1536
#9

11 Dec 2018, 10:21

The microdata link in #1 points to the following page. The Microdata files (ASCII compressed format) link on that page downloads a zip file that contains the following two text files:

Code:

Hogar.txt
Adulto.txt.TXT

These are fixed-format text files. The layout of the fields can be downloaded using the Register design and valid variable values (EXCEL ZIP file) link; when unzipped, it contains:

Code:

Dise§o Adulto_Publicaci¢n.xls
Dise§o Hogar_Publicaci¢n.xls

The first spreadsheet contains a sheet called "Diseño Registro Adultos" that contains all the information needed to construct a Stata dictionary which can be used by infile to input the data (see help infile2). Here's code to create a dictionary for the "Adulto.txt.TXT" data (leftalign is from SSC):

Code:

* create a Stata dictionary to read "Adulto.txt.TXT"

import excel using "Dise§o Adulto_Publicaci¢n.xls", sheet("Diseño Registro Adultos") clear
leftalign
list A-E in 1/13, string(30)

* check that all fields are contiguous
gen len  = real(B)
gen last = real(D)
gen pos  = sum(len)
keep if !mi(len)
assert last == pos

* must have valid unique variable names
gen vname = ustrlower(ustrtoname(A))
isid vname

* char(34) is a double quote
gen dict = "str" + B + " " + vname + " %" + B + "s " + char(34) + E + char(34) if !mi(len)
replace dict = "dictionary { " + dict in 1
replace dict = dict + "}" in l
leftalign
list A-E dict in 1/10, string(30)

outfile dict using "adulto.dict", noquote replace

type "adulto.dict", lines(10)

and here are the results:

Code:

. import excel using "Dise§o Adulto_Publicaci¢n.xls", sheet("Diseño Registro Adultos") clear

. leftalign

              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------------------------------------------------------------------
A               str63   %-63s                 
B               str8    %-9s                  
C               str16   %-16s                 
D               str15   %-15s                 
E               str129  %-129s                

. list A-E in 1/13, string(30)

     +-------------------------------------------------------------------------------------------------------------------+
     | A                                  B          C                 D                E                                |
     |-------------------------------------------------------------------------------------------------------------------|
  1. | ENCUESTA EUROPEA DE SALUD. CUE..                                                                                  |
  2. |                                                                                                                   |
  3. |                                                                                                                   |
  4. | DATOS DE IDENTIFICACIÓN                                                                                           |
  5. |                                                                                                                   |
     |-------------------------------------------------------------------------------------------------------------------|
  6. | CAMPO                              LONGITUD   POSICIÓN INICIO   POSICIÓN FINAL   DESCRIPCIÓN DEL CAMPO            |
  7. |                                                                                                                   |
  8. | CCAA                               2          1                 2                Comunidad Autónoma               |
  9. | TMUNI                              1          3                 3                Tamaño de municipio              |
 10. | IDENTHOGAR                         8          4                 11               Sección + Vivienda + Hogar       |
     |-------------------------------------------------------------------------------------------------------------------|
 11. | NORDEN                             2          12                13               Identificación de la persona s.. |
 12. | SEXO                               1          14                14               Identificación de la persona s.. |
 13. | EDAD                               3          15                17               Identificación de la persona s.. |
     +-------------------------------------------------------------------------------------------------------------------+

. 
. * check that all field are contiguous
. gen len  = real(B)
(108 missing values generated)

. gen last = real(D)
(108 missing values generated)

. gen pos  = sum(len)

. keep if !mi(len)
(108 observations deleted)

. assert last == pos

. 
. * must have valid unique variable names
. gen vname = ustrlower(ustrtoname(A))

. isid vname

. 
. * char(34) is a double quote
. gen dict = "str" + B + " " + vname + " %" + B + "s " + char(34) + E + char(34) if !mi(len)

. replace dict = "dictionary { " + dict in 1
(1 real change made)

. replace dict = dict + "}" in l
(1 real change made)

. leftalign

              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------------------------------------------------------------------------------------
vname           str12   %-12s                 
dict            str146  %-146s                

. list A-E dict in 1/10, string(30)

     +-------------------------------------------------------------------------------------------------+
     | A             B   C    D    E                                  dict                             |
     |-------------------------------------------------------------------------------------------------|
  1. | CCAA          2   1    2    Comunidad Autónoma                 dictionary { str2 ccaa %2s "Co.. |
  2. | TMUNI         1   3    3    Tamaño de municipio                str1 tmuni %1s "Tamaño de muni.. |
  3. | IDENTHOGAR    8   4    11   Sección + Vivienda + Hogar         str8 identhogar %8s "Sección +.. |
  4. | NORDEN        2   12   13   Identificación de la persona s..   str2 norden %2s "Identificació.. |
  5. | SEXO          1   14   14   Identificación de la persona s..   str1 sexo %1s "Identificación .. |
     |-------------------------------------------------------------------------------------------------|
  6. | EDAD          3   15   17   Identificación de la persona s..   str3 edad %3s "Identificación .. |
  7. | HH.PROXY_0    1   18   18   ¿El informante es la persona s..   str1 hh_proxy_0 %1s "¿El infor.. |
  8. | HH.PROXY_1    1   19   19   Informante proxy: ¿Cuál es el ..   str1 hh_proxy_1 %1s "Informant.. |
  9. | HH.PROXY_2    1   20   20   Informante proxy: ¿Es miembro ..   str1 hh_proxy_2 %1s "Informant.. |
 10. | HH.PROXY_2b   2   21   22   Informante proxy: Número de or..   str2 hh_proxy_2b %2s "Informan.. |
     +-------------------------------------------------------------------------------------------------+

. 
. outfile dict using "adulto.dict", noquote replace

. 
. type "adulto.dict", lines(10)
dictionary { str2 ccaa %2s "Comunidad Autónoma"
str1 tmuni %1s "Tamaño de municipio"
str8 identhogar %8s "Sección + Vivienda + Hogar"
str2 norden %2s "Identificación de la persona seleccionada: Número de orden"
str1 sexo %1s "Identificación de la persona seleccionada:Sexo"
str3 edad %3s "Identificación de la persona seleccionada:Edad"
str1 hh_proxy_0 %1s "¿El informante es la persona seleccionada?"
str1 hh_proxy_1 %1s "Informante proxy: ¿Cuál es el motivo por el que la persona seleccionada no facilita sus datos?"
str1 hh_proxy_2 %1s "Informante proxy: ¿Es miembro del hogar el informante?"
str2 hh_proxy_2b %2s "Informante proxy: Número de orden del miembro del hogar"
.

With a dictionary in hand, you can import the data using

Code:

infile using "adulto.dict", using("Adulto.txt.TXT") clear
destring, replace

The second dataset can be imported using the same technique. The code to create the dictionary is:

Code:

* create a Stata dictionary to read "Hogar.txt"

import excel using "Dise§o Hogar_Publicaci¢n.xls", sheet("Diseño Registro Hogar") clear
leftalign
list A-E in 1/10, string(20)

* check that all fields are contiguous
gen len  = real(B)
gen last = real(D)
gen pos  = sum(len)
keep if !mi(len)
assert last == pos

* must have valid unique variable names
gen vname = ustrlower(ustrtoname(A))
isid vname

* see help infile2 for details on how to construct a dictionary
* char(34) is a double quote
gen dict = "str" + B + " " + vname + " %" + B + "s " + char(34) + E + char(34) if !mi(len)
replace dict = "dictionary { " + dict in 1
replace dict = dict + "}" in l
leftalign
list A-E dict in 1/10, string(20)

outfile dict using "Hogar.dict", noquote replace

type "Hogar.dict", lines(10)

and the data can be imported using:

Code:

infile using "Hogar.dict", using("Hogar.txt") clear
destring, replace

For both datasets, the spreadsheets contain a sheet with the value labels for each variable. You can use similar techniques to process these and create a do-file that attaches value labels to each variable.
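
The do-file produced that way would consist of label define and label values pairs. Here is a minimal sketch of what one such pair might look like, using the four region codes quoted in #6; the label name ccaa_lbl and the toy data are assumptions for illustration, and the real list of codes is in the spreadsheet.

```stata
* Sketch only: toy data with a numeric region code (as it would be after destring)
clear
input byte ccaa
1
2
4
end

* codes and text as quoted in #6; ccaa_lbl is a hypothetical label name
label define ccaa_lbl 1 "Andalucía" 2 "Aragón" 3 "Asturias (Princ de)" 4 "Baleares (Islas)"
label values ccaa ccaa_lbl
tabulate ccaa
```

Value labels attach only to numeric variables, which is why this step comes after destring.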
