  • How to convert SEER to Stata

    Good afternoon all

    I'm trying to work with the SEER database (specifically the breast cancer patient data), but I haven't been able to convert it to Stata format. Can anyone please help me figure out how to handle this?

    Thanks in advance

  • #2
    SEER has several different data sources. Can you say something more specific? What kind of file do you have?


    • #3
      I have the ASCII files they provide. I need to get them into Stata to run the analysis for this particular group of patients. (I will need all the variables they have.) Thanks


      • #4
        Well, there are two different kinds of ASCII files available from SEER. Some are fixed format: the same variables always start in the same columns, and there are no delimiters (spaces, commas, tabs, pipes, whatever) separating the variables. The -infix- command reads those, and it requires you to specify which variables occupy which columns (that information is typically contained in a readme.txt file that accompanies the data file). See -help infix- for details. Most of the SEER data I work with comes in comma- or tab-delimited files, and these are easily read using -import delimited-.

        So first determine whether you have a fixed format or a delimited file. (Just open it in a text editor and it will be clear at a glance.) Then see the help file for the corresponding command. If you cannot discern which you have, post a small sample of the data.
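
        A minimal sketch of the two routes described above, meant for a do-file. The file names, the variable names, the column ranges, and the assumption that the delimited file has a header row are all placeholders, not taken from any SEER documentation; substitute the values from the documentation that accompanies your file.

        ```stata
        * Route 1: delimited file. Stata splits each row on the delimiter.
        * "seer_sample.csv" is a hypothetical file name; varnames(1) assumes
        * the first row holds variable names.
        import delimited using "seer_sample.csv", delimiters(",") varnames(1) clear

        * Route 2: fixed-format file. You state the column range of each variable.
        * "seer_sample.txt", the variable names, and the ranges are placeholders.
        infix str patient_id 1-8 str registry_id 9-18 using "seer_sample.txt", clear

        * Either way, check what was read before going further.
        describe
        list in 1/5
        ```

        Run only the route that matches your file. For a tab-delimited file, delimiters("\t") can replace delimiters(",").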


        • #5
          Thanks for your help. It seems it is a fixed format. I guess I need to specify how to arrange the variables, but I have one more question. When I use the infix command with specifications, how do I code the spaces in the raw data?
          For instance, the patient ID variable has 8 characters and the registry ID has 10, and then there is a space. (So at first I thought the file was delimited, but when I checked the dictionary I realized that the first two variables run together.) The rest are also separated irregularly. So Stata reads the two variables as one long variable (col1).
          Last edited by Maria Nunez; 14 Sep 2017, 12:26.


          • #6
            The specifications in -infix- ask you to tell Stata the first and last columns that each variable occupies. I don't understand your description of patient ID and registry ID, so I don't know what to tell you specifically about these. But the documentation should say, for each variable, either what range of columns it occupies (which is what Stata asks for) or what column it begins with and how many columns wide it is (from which you can calculate the last column = first column + width - 1). You don't need to say anything to Stata to deal explicitly with spaces. Stata will skip over them if they are in columns not covered in the specification, and will interpret them as part of the variable itself if they lie within a specified column range.
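
            The column arithmetic can be sketched as follows, for a do-file. The layout is an assumption based only on the widths mentioned in post #5 (patient ID 8 wide, registry ID 10 wide, then a space), with the further assumption that patient ID starts in column 1; the file name is hypothetical. Verify all of it against the file documentation.

            ```stata
            * Assumed layout (verify against the documentation):
            * patient ID starts in column 1 and is 8 columns wide;
            * registry ID follows immediately and is 10 columns wide.
            local id_first  = 1
            local id_last   = `id_first' + 8 - 1       // last = first + width - 1 = 8
            local reg_first = `id_last' + 1            // 9
            local reg_last  = `reg_first' + 10 - 1     // 18

            * "seer_lung.txt" is a hypothetical file name.
            * Column 19 (the space) is not covered by any range, so it is skipped.
            infix str patient_id  `id_first'-`id_last'   ///
                  str registry_id `reg_first'-`reg_last' ///
                  using "seer_lung.txt", clear
            ```

            Reading the two IDs as strings keeps any leading zeros, which a numeric type would drop.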

            If you want more specific advice about how to handle patient ID and registry ID, I suggest you post an excerpt from the file documentation that gives the location information for these variables, and perhaps post one or two representative rows from the file itself showing the way the data is laid out.


            • #7
              Thank you for your help. I didn't know how to delimit the variables in each column because the ASCII file doesn't have them properly arranged. But I managed to get the layout from the dictionary that SEER provides, and with the -infile- command I was able to retrieve the information I needed.
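
              For readers following along, a dictionary-based read with -infile- might look roughly like this. This is a sketch under stated assumptions, not the dictionary SEER distributes: the file names seer.dct and seer_lung.txt are hypothetical, and the two variables and their positions simply reuse the widths mentioned in post #5.

              Contents of the hypothetical dictionary file seer.dct:

              ```stata
              infile dictionary using "seer_lung.txt" {
                  _column(1)   str8   patient_id   %8s   "Patient ID"
                  _column(9)   str10  registry_id  %10s  "Registry ID"
              }
              ```

              Then, in Stata:

              ```stata
              * Read the raw data through the dictionary, then inspect the result.
              infile using "seer.dct", clear
              list in 1/5
              ```

              A full dictionary would carry one line per variable, each with its starting column, storage type, name, and input format.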


              • #8
                Hello! I also need help. I am a newbie to Stata and I am having a problem similar to Maria Nunez's. My project is a study of lung cancer and stage of diagnosis in various states.

                Thank you in advance !

                Here is a sample of the SEER data as I downloaded it; it comes as a txt file:

                070000090000001502201 010811914 02041995C3411801238012332104099 99800 4999 6 00 01 217 220301623C341 1161003 0198063110990909 009009 500605006040 359999 19947 00081 99 8 9999990200
                070000110000001502201 010881884 03091973C34998010380103971 70 1999 9 09 01 218 220301629C349 1161003 0198063110 009009 22030220304 359999 19947 00001 99 8 0300
                070000150000001502201 010761902 02091978C34318070380703921 20-41000--000 2999 9 09 01 216 220301625C343 1161003 0298063110 009009 22030220304 359999 19947 00231 99 8 0200
                070000180000001502201 020641910 02021975C34118140381403311 -0 1999 9 09 01 213 220301623C341 1161003 0598063110 009001 22030220304 359999 19947 00041 99 8 0200
                070000230000001502401 020661924 02011991C3431801038010332103085 99800 4999 1 00 01 214 220301625C343 1161003 019806311044040 009009 220302203040 359999 19947 00031 99 8 9999100200
                070000480000001502201 010771897 02021974C34928021380213411 &6 1999



                • #9
                  Which SEER dataset are you referring to? As Clyde mentioned, there are several to choose from.


                  • #10
                    Sorry for the delay; I wasn't notified about your response. I am interested in the entire SEER cohort between 1973 and 2014 for lung cancer diagnoses.
