Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Importing a subset of variables from .csv

    Hello,

    I am trying to use "import delimited" to import data from multiple .csv files. The .csv contains 180 columns/variables and I am interested in using only a few of them. Variables' names are written in the first row of each .csv. Is there a way to tell Stata "upload only those columns whose header (variable name) is X Y and Z?
    I am already aware that:
    1. I could import the whole .csv and "keep" those variables. However, this is extremely inefficient (each .csv weighs 1GB and I have 120 of them). So this is not a viable option. I want to import only the variables I need to speed up the process.
    2. "import delimited" has the option "colrange". There are two problems here. Firstly, colrange asks for the number of the column I want to upload but I don't know the number, I know the variable name. Secondly, my variables of interest are not grouped next to one another, they do not form a contiguous subset of the data. However, colrange appears to me to require a contiguous range of column numbers to upload.
    Is there a way to get around to these problems and upload only variable X Y and Z from a .csv?
    Any help or guidance would be greatly appreciated.

    Best,
    Edoardo

    PS: Stata version: 16.0 - OS: both Windows and iOS.

  • #2
    The second syntax in -help import delimited- shows you how to specify your -extvarlist- with -import delimited-. The catch here with using existing variable names is that you will need to specify them according to what Stata determines to be the final name, not necessarily as they are specified in the file. For example, if there are special characters, quotes or spaces in those names, Stata will first convert them to legal variable names. If you are unsure of what the names will be, simply import one observation to see the variable list that is generated from the file.

    Stata will still need to read all variables present in the file in order to know which variables it needs to import, but this should be somewhat faster than your current approach of loading the whole file then dropping variables that aren't needed.

    One thought is that you can work in two steps. The first step simply reads each CSV, takes the variables of interest, then saves those as a group of "minimal" datasets. The second step then reads in each of these minimal datasets and merges them together (or whatever is appropriate in your case).

    Comment


    • #3
      I am not sure of how to do this specifically in Stata. A potential solution to your specific problem can be the fread function in data.table package in R which can be used in Stata through the rcall package (https://github.com/haghish/rcall).

      Comment


      • #4
        Thanks for your help.

        I appreciate it.

        Comment

        Working...
        X