  • Can Stata or Mata give me the dataset label of a Stata dataset on disk?

    Is it possible, in Stata or Mata, to access the dataset label of a Stata dataset on disk without loading the dataset into memory? I ask because the describe command (with a using qualifier) returns a list of very useful results in r(), including a list of all variables and a list of the variables by which the dataset is sorted. It also prints the dataset label (and the time and date of dataset creation) in its output, which suggests that describe reads these without loading the dataset. However, I have not managed to find out how to get the dataset label (or the time and date of dataset creation) myself. I have checked

    help undocumented

    and discovered the dtaversion, webdescribe and dtaverify commands, but nothing about inputting the dataset label in a form that I could process elsewhere. (For instance, I might like to update my SSC command descgen, which inputs a dataset in memory with 1 obs for each of a set of datasets on disk, to create an output variable containing the dataset labels.)

    Is there a solution?

    Best wishes

    Roger

  • #2
    Roger Newson
    The only solution I can think of would be to parse the dta manually and grab any of those additional elements that you’re wanting. Mata would be the way to go since it has better capabilities for scanning/moving by number of bytes and things like that.



    • #3
      I agree with wbuchanan. Parsing the dataset header should be relatively straightforward, since the dataset label is enclosed in <label> </label> tags (per section 5.1 Header in help dta) and the tags exist even if the label is empty.
      Stata/MP 14.1 (64-bit x86-64)
      Revision 19 May 2016
      Win 8.1
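For illustration, the tag-based parsing suggested above could be sketched in Mata roughly as follows. This is only a sketch under stated assumptions: dta_label_on_disk() is a hypothetical helper name, not an official function; the file is assumed to be in .dta format 117 or later (the tagged format); the one-byte (release 117) or two-byte (later releases) length prefix before the label text is taken from the format description in help dta; and strpos() and substr() are assumed to work bytewise on the binary header string.

```stata
mata:
// Hypothetical helper, not part of official Stata or Mata.
// Returns the dataset label of a tagged-format (117 or later) .dta file,
// or "" if the <label> tags are not found in the first 512 bytes.
string scalar dta_label_on_disk(string scalar fn)
{
    real scalar   fh, p, q, skip
    string scalar head, release

    // read enough bytes to cover the header, including a long label
    fh   = fopen(fn, "r")
    head = fread(fh, 512)
    fclose(fh)

    p = strpos(head, "<label>")
    q = strpos(head, "</label>")
    if (p == 0 | q == 0) return("")

    // assumption: release 117 stores a 1-byte label length, later releases 2 bytes
    release = substr(head, strpos(head, "<release>") + 9, 3)
    skip    = (release == "117" ? 1 : 2)

    // the label text sits between the length prefix and the closing tag
    return(substr(head, p + 7 + skip, q - (p + 7 + skip)))
}
end

findfile auto.dta
mata: dta_label_on_disk(st_global("r(fn)"))
```

Skipping the length prefix by position, rather than stripping control characters, avoids picking up a stray character when the label is 32 bytes or longer and its length byte happens to be printable.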



      • #4
        A regex solution is straightforward:
        Code:
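        findfile auto.dta
        
        mata:
        
           // read the first 200 bytes of the file (enough for the header when the label is short)
           fh = fopen("`r(fn)'", "r")
           firstpart = fread(fh, 200)
           fclose(fh)
           
           // strip control characters (the binary length bytes), then capture the text between the label tags
           ustrregexm( ustrregexra( firstpart,"\p{Cc}", "" ) , "label>(.*)</label" )
           label = ustrregexs(1)
           
           // display the extracted label and the raw bytes that were read
           label
           firstpart
           
        end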
        Code:
        :    label
          1978 Automobile Data
        
        :    firstpart
          <stata_dta><header><release>117</release><byteorder>LSF</byteorder><K>\uc\u0</K><N>J\u0\u0\u0</N><label>\u141978 
         Automobile Data</label><timestamp>\u1113 Apr 2016 17:45</timestamp></header><map>\u0\u0\u0\u0\u0\u0\u0\u0\xad\u0\
         u0\u0\u0\u0\u0\u0(\u1\u0\u0\u0\u0



        • #5
          Many thanks to all 3 of you for these solutions.

          The easy part of these seems to be using fopen(), fread() and fclose() to extract the first 200 bytes from a Stata datafile. The more complicated part seems to be using the ustrregexm() function (in its documented Stata version or in its undocumented Mata version). This function appears to store the matching subexpressions in an unpublicized data cache for retrieval by a later call to ustrregexs(). (Which is weird behaviour for a function.)
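The match-then-retrieve pattern described above can be seen in isolation in plain Stata. The following is a minimal sketch: the string is a made-up stand-in for a header with the binary bytes already removed, and header and dlabel are illustrative local macro names.

```stata
* stand-in for a cleaned header string (illustrative, not read from a file)
local header "<label>1978 Automobile Data</label><timestamp>13 Apr 2016 17:45</timestamp>"

* ustrregexm() returns 1 on a match and remembers the matched subexpressions
if ustrregexm(`"`header'"', "<label>(.*)</label>") {
    * ustrregexs(1) retrieves the first parenthesized subexpression of that match
    local dlabel = ustrregexs(1)
    display "`dlabel'"
}
```

Because ustrregexs() always refers to the most recent ustrregexm() call, it is safest to copy the result into a macro or variable straight away, as done here.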

          Of course, an inelegant solution would use preserve and restore, wrapped around code starting with a command like

          use if 0 using "datafilename", clear

          and continuing with commands to extract the dataset label from the empty dataset in memory. Such a horribly inelegant solution might work for most people, most of the time, especially if the preserved dataset was not too large...
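A minimal sketch of that preserve/restore approach might look like the following. It uses the shipped auto.dta as a stand-in for the file of interest, and fn and dlabel are illustrative local macro names; the label is picked up with the extended macro function data label.

```stata
* locate a dataset on disk (auto.dta is only a stand-in)
findfile auto.dta
local fn "`r(fn)'"

preserve                          // set aside whatever is in memory
use if 0 using "`fn'", clear      // load zero observations, keeping the metadata
local dlabel : data label         // copy the dataset label into a local macro
restore                           // bring the original data back

display `"`dlabel'"'
```

Since if 0 selects no observations, the cost of the load should be small, but preserve still has to copy the data currently in memory, which is where the expense lies for large datasets.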

          Best wishes

          Roger



          • #6
            FYI, the command describe now stores the dataset label in r(datalabel) in Stata 15. For the latest updates, type
            Code:
            update all
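As a rough sketch of how this could be used once Stata 15 is up to date, the label can be read from a file on disk without loading it. The example assumes that r(datalabel) is also stored when describe is given a using qualifier, and uses auto.dta as a stand-in; dlabel is an illustrative local macro name.

```stata
* locate a dataset on disk (auto.dta is only a stand-in)
findfile auto.dta

* describe the file on disk without loading it; short suppresses the variable list
describe using "`r(fn)'", short

* assumption: r(datalabel) is stored for the using form as well
local dlabel "`r(datalabel)'"
display "`dlabel'"
```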

