  • Up to 5,000 variables are currently allowed - Explanation

    Hi all,

    I will be performing a phenome wide association study on data from UK Biobank.

    I currently have a file with all sorts of continuous, integer and categorical variables, which I am trying to load and visualize in Stata. The size is about 6GB.

    I get the error

    Code:
    no room to add more variables
        Up to 5,000 variables are currently allowed, although you could reset the maximum using set maxvar
    I tried resetting the limit with set maxvar, but even when I specify 10,000 as the maximum number of variables, running
    Code:
    describe
    won't do anything and I'm not even able to visualize the data. I only get an empty table.

    I have 315 GB in disk C.

    Any thoughts and advice as to how to solve that problem?

    Many thanks in advance all.

  • #2
    More information is necessary to help you. To start with, here are some questions:

    1) How many variables *should* be on each line, at least approximately? If it's greater than the maximum value of -maxvar- for your version of Stata (see -help maxvar-), it would be necessary to work out a way to read only *part* of each data line.

    2) What is the current structure of the file, and what kind of information do you have about the layout of the variables? For example, you might have a CSV file, or your raw data might be in a text file with a listing of variable names and locations in some other file. Is there one data line for each individual in the data set, or are there multiple lines per individual?

    3) Do you need to analyze *all* the variables at once, or can you work with subsets of them?

    4) What do you have in mind by "visualize"? This could mean *many* different things.
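
    For question 1, a quick way to see what your own installation currently allows is sketched below. It assumes Stata/SE or Stata/MP (in other editions -maxvar- cannot be raised), and 10,000 is simply the value mentioned in the original post, not a recommendation.

    ```stata
    * Show which edition of Stata is running
    about

    * Show the current maximum number of variables
    display c(maxvar)

    * Start from an empty dataset before changing the limit
    * (-clear- drops whatever is in memory, so save first if needed)
    clear
    set maxvar 10000

    * Confirm that the new setting took effect
    display c(maxvar)
    ```

    If the last -display- still shows the old value, the edition in use probably does not allow a higher limit; -help limits- lists the ceiling for each edition.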


    • #3
      Many thanks for responding and pointing me in the right direction.

      So, to answer your questions:

      1) The file should contain information on about 60k individuals. I've requested information on about 30 variables for each of those individuals. I've not been able to 'see' the actual data yet, but it should be one row per subject, with columns corresponding to the different phenotypes. While there's data on many people, the number of variables is definitely nothing drastic; it should be anywhere between 30 and 50.

      2) The current structure of the file is a .dta data file. I could request it as a CSV as well; it is a file downloaded from the UK Biobank Showcase that can be converted into many different file types.

      3) Ideally, I would need to be able to analyze all variables at once; this is the aim of my PhD project. I have applied for access to my local high-performance computing cluster, but that may take a long time to arrange.

      4) By visualizing I meant that I wanted to be able to read the data into Stata and browse it to essentially 'see' what I'm looking at, as I haven't been able to do so yet.

      Thanks again for any other thoughts you might have.


      • #4
        "The file should contain information on about 60k individuals. I've requested information on about 30 variables for each of those individuals. I've not been able to 'see' the actual data yet, but it should be: one row per subject, columns correspond to the different phenotypes."

        60,000 observations and 30 variables is not large. I think the best advice is to ask the data providers for a CSV or XLS file, which you can import in parts; you can then change the layout of the data if the issue turns out to be that the observations are arranged as variables. But if the dataset is in .dta format, it cannot have more than 120,000 variables to begin with (the limit for Stata/MP), so you have to check what other contents are present. See

        Code:
        help limits
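
        To check what the file actually contains without loading all of it, something along these lines may help. This is only a sketch: ukb_data.dta and ukb_data.csv are placeholder file names, and the row and column ranges are arbitrary choices.

        ```stata
        * Report the number of observations and variables stored in the file,
        * without loading the data into memory
        describe using "ukb_data.dta", short

        * Load only the first 20 observations to browse them
        * (this still needs maxvar to be at least the number of variables in the file)
        use in 1/20 using "ukb_data.dta", clear
        browse

        * If the data arrive as CSV instead, import one block of rows and columns at a time
        * (row 1 is assumed to hold the variable names)
        import delimited using "ukb_data.csv", varnames(1) rowrange(2:101) colrange(1:50) clear
        describe, short
        ```

        If the short description reports far more variables than the 30 to 50 expected and only a handful of observations, the file is probably laid out with observations as variables, which is the situation described above.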


        • #5
          Many thanks, I will inquire!
