Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Importing csv with separate csv file containing variable names, labels, and value labels

    I'd appreciate any suggestions on importing csv data into Stata, where the data comes in 2 separate csv files containing i) the data and ii) the dictionary/codebook. (It is data from weather stations within the MetDB database, in case anyone is interested - but the problem/task is general.)

    The main data file is relatively straightforward and contains a number of variables, each of which is identified simply by a column number. E.g.:
    1 2 3
    2017 -80.367 0
    2017 -76.417 1
    2017 -73.633 2

    The "dictionary" file - I hesitate to call it that since a .dct file is something specific and not this - contains information on the variable name (metdb_short_name) of each column number in the main data file, the corresponding variable label (long_name), and the value label information where appropriate (flag_values and flag_meanings). E.g.:

    long_name 1 Year
    type 1 int
    metdb_short_name 1 YEAR
    long_name 2 Latitude
    type 2 int
    metdb_short_name 2 LTTD
    long_name 3 Station type
    type 3 int
    metdb_short_name 3 STTN_RPRT_TYPE
    flag_values 3 0 1 2
    flag_meanings 3 Automatic Manned Hybrid: both manned and automatic


    My instinct is to try to create 3 different tempfiles (one for each of varnames, varlabs, and value labels), and then try to use those to loop over the data and assign them using the usual rename, label variable etc. commands. Any suggestions?

    Thanks very much in advance!

  • #2
    Well, they certainly went out of their way to make life difficult when they created that!

    I would take a somewhat different approach. Stata will only let you have one active data set in memory at a time, so if you create a bunch of tempfiles you're going to be constantly swapping back and forth between them. I would start by just reading in the main data file using -import delimited-. Then I would use Stata's -file- commands to process the "dictionary" file. That way you keep the data in memory all the time, and you can read the "dictionary" one line at a time. Which is perfect, because except for the value labels, everything you need is always on a single line. So you loop reading one line at a time in the "dictionary" until you reach end of file in the "dictionary" file. For each line, split it into "words." The second word always identifies which variable the line is about. If the first word is metdb_short_name then you know that you will issue a -rename- command, with the target name being the third "word" of the line. Similarly a first word long_name will trigger a -label var- command. You can probably ignore the "type" lines: Stata will get the string vs numeric aspect of that right, and a -compress- at the very end will give you the most economical data storage types. That leaves the flag_values and flag_meanings lines. Those will be a bit harder. On encountering a flag_values line you'll have to proceed to reading the next line to get the flag_meanings information, and then dance back and forth between them to iteratively build a -label define- command, and finally a -label values- command.

    It will be somewhat tedious to code all this and get it right. But I think it will be easier to do it this way than to be constantly using, saving, preserving, and restoring your way through three data sets. The key insight here is that the data does need to be all in memory together, whereas for the "dictionary" file you only need the information from one (or sometimes two) lines at any given point.

    Hope this helps.

    Comment

    Working...
    X