Can Stata or Mata give me the dataset label of a Stata dataset on disk?

Roger Newson

Join Date: Apr 2014

Posts: 317
#1

13 Oct 2018, 12:16

Is it possible, in Stata or Mata, to access the dataset label of a Stata dataset on disk without loading the dataset into memory? I ask because the describe command (with a using qualifier) returns a list of very useful results in r(), including a list of all variables and a list of the variables by which the dataset is sorted. It also shows the dataset label (and the time and date of dataset creation) in its printed output, which suggests that describe reads these without loading the dataset. However, I have not managed to find out how to get the dataset label (or the time and date of dataset creation). I have checked

help undocumented

and discovered the dtaversion, webdescribe and dtaverify commands, but nothing about reading the dataset label in a form that I could process elsewhere. (For instance, I might like to update my SSC command descgen, which inputs a dataset in memory with 1 obs for each of a set of datasets on disk, to create an output variable containing the dataset labels.)

Is there a solution?

Best wishes

Roger
wbuchanan

Join Date: Mar 2014

Posts: 1362
#2

13 Oct 2018, 21:01

Roger Newson
The only solution I can think of would be to parse the dta manually and grab any of those additional elements that you’re wanting. Mata would be the way to go since it has better capabilities for scanning/moving by number of bytes and things like that.
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#3

13 Oct 2018, 22:40

I agree with wbuchanan. Parsing the dataset header should be relatively straightforward, since the dataset label is enclosed in <label> </label> tags (per section 5.1, Header, in help dta) and the tags exist even if the label is empty.
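
As a rough illustration of parsing the header by byte position, here is a minimal Mata sketch. It assumes a file in format 117 or 118, where (per help dta) the label text is preceded by a 1-byte length in format 117 and a 2-byte length in format 118. The function name dta_label() is hypothetical, not a built-in, and the sketch has not been checked against other formats.

```stata
mata:
// dta_label() is a hypothetical helper, not an official Mata function.
// Assumption: the file is in .dta format 117 or 118 (see -help dta-).
string scalar dta_label(string scalar fn)
{
    real scalar    fh, rel, pos, len
    real rowvector b
    string scalar  buf

    fh  = fopen(fn, "r")
    buf = fread(fh, 512)              // enough to cover the header in these formats
    fclose(fh)

    // the release number is the 3 characters after <release>
    rel = strtoreal(substr(buf, strpos(buf, "<release>") + 9, 3))
    pos = strpos(buf, "<label>")
    if (pos == 0 | (rel != 117 & rel != 118)) return("")
    pos = pos + 7                     // first byte after <label>

    if (rel == 117) {                 // 1-byte length prefix
        len = ascii(substr(buf, pos, 1))
        pos = pos + 1
    }
    else {                            // 2-byte length prefix, byte order from header
        b = ascii(substr(buf, pos, 2))
        if (strpos(buf, "<byteorder>LSF")) len = b[1] + 256*b[2]
        else                               len = 256*b[1] + b[2]
        pos = pos + 2
    }
    return(substr(buf, pos, len))     // the label text itself
}
end

findfile auto.dta
mata: dta_label(st_global("r(fn)"))
```

Using the stored length, rather than searching for the closing tag, means the result does not depend on what characters the label itself contains.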

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1

Bjarte Aagnes

Join Date: Apr 2014
Posts: 785
#4

14 Oct 2018, 05:19

A regex solution is straightforward:

Code:

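findfile auto.dta

mata:

   // read the first 200 bytes of the file, which contain the start of the header
   fh = fopen("`r(fn)'", "r")
   firstpart = fread(fh, 200)
   fclose(fh)

   // strip control characters (the binary counts), then capture the text
   // between the label tags; note that the length byte in front of the label
   // is only a control character when the label is shorter than 32 bytes
   ustrregexm( ustrregexra( firstpart,"\p{Cc}", "" ) , "label>(.*)</label" )
   label = ustrregexs(1)

   // display the extracted label and the raw header bytes
   label
   firstpart

end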

Code:

:    label
  1978 Automobile Data

:    firstpart
  <stata_dta><header><release>117</release><byteorder>LSF</byteorder><K>\uc\u0</K><N>J\u0\u0\u0</N><label>\u141978 
 Automobile Data</label><timestamp>\u1113 Apr 2016 17:45</timestamp></header><map>\u0\u0\u0\u0\u0\u0\u0\u0\xad\u0\
 u0\u0\u0\u0\u0\u0(\u1\u0\u0\u0\u0


Roger Newson

Join Date: Apr 2014

Posts: 317
#5

15 Oct 2018, 05:55

Many thanks to all 3 of you for these solutions.

The easy part of these seems to be using fopen(), fread() and fclose() to extract the first 200 bytes from a Stata datafile. The more complicated part seems to be using the ustrregexm() function (in its documented Stata version or in its undocumented Mata version). This function appears to store the matching subexpressions in an unpublicized data cache, for retrieval by a later call to ustrregexs(). (Which is weird behaviour for a function.)
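
To make the match-then-retrieve behaviour concrete, here is a small sketch in Stata. The string in the local macro hdr is a made-up stand-in for part of a file header, not something read from disk.

```stata
* hdr is an illustrative string, standing in for part of a .dta header
local hdr "<label>1978 Automobile Data</label><timestamp>"

* ustrregexm() returns 1 if the pattern matches and remembers the match
display ustrregexm("`hdr'", "label>(.*)</label")

* ustrregexs(1) then returns the first parenthesized subexpression
* from that most recent match
display ustrregexs(1)
```

The second call only makes sense after a successful match, which is why the two functions are normally used as a pair.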

Of course, an inelegant solution would use preserve and restore, wrapped around code starting with a command like

use if 0 using "datafilename", clear

and continuing with commands to extract the dataset label from the empty dataset in memory. Such a horribly inelegant solution might work for most people, most of the time, especially if the preserved dataset was not too large...
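
For what it is worth, that inelegant route might look something like the sketch below. It uses auto.dta (located with findfile) purely as an example file, and the local macro names fn and dlabel are arbitrary.

```stata
* Sketch only: auto.dta is an example file; fn and dlabel are arbitrary names
findfile auto.dta
local fn "`r(fn)'"

preserve
* load the variables and metadata, but no observations
use if 0 using "`fn'", clear
* extended macro function returning the label of the data in memory
local dlabel : data label
restore

display `"`dlabel'"'
```

The cost is that preserve has to save whatever dataset is currently in memory, which is the part that may be slow for a large dataset.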

Best wishes

Roger
Kreshna Gopal (StataCorp)

StataCorp Employee

Join Date: Apr 2014

Posts: 43
#6

22 Feb 2019, 15:35

FYI, the command describe now stores the dataset label in r(datalabel) in Stata 15. For the latest updates, type

Code:

update all
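
A minimal sketch of how this might be used, assuming an updated Stata 15 or later and assuming that r(datalabel) is also set when describe is given a using qualifier (auto.dta is just an example file):

```stata
* Assumption: describe using ... stores the label in r(datalabel)
findfile auto.dta
describe using "`r(fn)'", short
display `"`r(datalabel)'"'
```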