  • #16
    You can have a Unicode-named ado program too. It is not recommended, since it can easily clutter the namespace, but it is possible. For example:

    Code:
    *! version 1.0.0   17oct2019
    program 线性回归, byable(onecall) prop(svyb svyj svyr bayes)
            if _by() {
                    local by "by `_byvars'`_byrc0':"
            }
            `by' regress `0'
    end
    exit
    Save the above into a file called 线性回归.ado (线性回归 is the Chinese translation of "linear regression"); then the following works:

    Code:
    . sysuse auto
    (1978 Automobile Data)
    
    . 线性回归 mpg price
    
          Source |       SS           df       MS      Number of obs   =        74
    -------------+----------------------------------   F(1, 72)        =     20.26
           Model |  536.541807         1  536.541807   Prob > F        =    0.0000
        Residual |  1906.91765        72  26.4849674   R-squared       =    0.2196
    -------------+----------------------------------   Adj R-squared   =    0.2087
           Total |  2443.45946        73  33.4720474   Root MSE        =    5.1464
    
    ------------------------------------------------------------------------------
             mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           price |  -.0009192   .0002042    -4.50   0.000    -.0013263   -.0005121
           _cons |   26.96417   1.393952    19.34   0.000     24.18538    29.74297
    ------------------------------------------------------------------------------


    • #17
      I can't post the file to the forum without losing the BOM in the first 3
      bytes, but here is an octal dump of the first 3 records. The first record
      holds the variable names. Notice the bytes 357 273 277 (the UTF-8 BOM)
      before the names start:


      0000000 357 273 277   B   v   D       I   D       n   u   m   b   e   r
      0000020  \t   N   A   M   E  \n   R   x   x   x   x   x   x   x  \t   J
      0000040   o   i   n   t   -   S   t   o   c   k       C   o   m   p   a
      0000060   n   y       V   t   b       C   a   p   i   t   a   l  \n   U
      0000100   x   x   x   x   x   x   x   x  \t   W   a   l   m   a   r   t
      0000120       I   n   c   .  \n
      0000126


      If I cat the file, it looks fine - the first three bytes don't print:

      BvD ID number NAME
      Rxxxxxxx Joint-Stock Company Vtb Capital
      Uxxxxxxxx Walmart Inc.
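
For a file too large to open in an editor, the presence of the BOM can also be checked from the shell without dumping whole records. This is a minimal sketch, assuming `head -c` and `od` are available (as on typical Linux systems). The function name `has_bom` and the sample file names are hypothetical and exist only for the demonstration:

```shell
# has_bom FILE: exit status 0 if FILE begins with the UTF-8 BOM (bytes EF BB BF).
has_bom() {
    # od -An -tx1 prints the bytes in hex without the address column;
    # tr removes the blanks and newline so the comparison is exact.
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Hypothetical sample files in a scratch directory, for demonstration only.
bom_demo_dir=$(mktemp -d)
printf '\357\273\277BvD ID number\tNAME\n' > "$bom_demo_dir/with_bom.txt"
printf 'BvD ID number\tNAME\n' > "$bom_demo_dir/without_bom.txt"

for f in "$bom_demo_dir"/*.txt; do
    if has_bom "$f"; then
        echo "$(basename "$f"): BOM present"
    else
        echo "$(basename "$f"): no BOM"
    fi
done
```

Only the first three bytes are read, so the check takes the same time regardless of file size.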


      If Stata 16 reads the file the variable names are lost and become just v1 and v2:

      . import delimited using test.txt. ,encoding(utf8)
      (2 vars, 3 obs)

      . des

      Contains data
        obs:             3
       vars:             2
      -------------------------------------------------------------------------------
                    storage   display    value
      variable name   type    format     label      variable label
      -------------------------------------------------------------------------------
      v1              str13   %13s
      v2              str31   %31s
      -------------------------------------------------------------------------------

      I can specify delimiter("\t") without affecting the result. The actual file is very large, so removing the BOM with an editor is not easy. Thanks for looking into this. I am sorry I couldn't find a fixed-pitch font on the forum page.
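
For a file too large for an editor, the three BOM bytes can be dropped in a single streaming pass from the shell. This is only a sketch, assuming POSIX `tail -c +N` plus `head -c` and `od`. The function name `strip_bom` and the file names are hypothetical, and the small file stands in for the real data:

```shell
# strip_bom IN OUT: copy IN to OUT, dropping a leading UTF-8 BOM if there is one.
strip_bom() {
    if [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]; then
        tail -c +4 "$1" > "$2"    # begin output at byte 4, skipping the 3 BOM bytes
    else
        cat "$1" > "$2"           # no BOM: copy unchanged
    fi
}

# Hypothetical small stand-in for the real file, for demonstration only.
strip_demo_dir=$(mktemp -d)
printf '\357\273\277BvD ID number\tNAME\nUxxxxxxxx\tWalmart Inc.\n' > "$strip_demo_dir/test.txt"

strip_bom "$strip_demo_dir/test.txt" "$strip_demo_dir/test_nobom.txt"
head -c 16 "$strip_demo_dir/test_nobom.txt" | od -c   # dump now begins with B v D
```

The input file is left untouched, so the BOM-free copy can be compared against the original before either is discarded.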

      Daniel Feenberg


      • #18
        Thanks for sharing the data, I will look into it to see what I find.


        • #19
          If Stata 16 reads the file the variable names are lost and become just v1 and v2:

          . import delimited using test.txt. ,encoding(utf8)
          (2 vars, 3 obs)
          Code:
          clear              
          import delimited using test.txt, encoding(utf8) varnames(1) case(preserve)
          describe
          list
          datasignature
          Running the above on a file with a BOM:
          Code:
          . hexdump test.txt                          
                           |                                         |    character
                           |           hex representation            |  representation
                   address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
          -----------------+-----------------------------------------+-----------------
                         0 | efbb bf22 4122 0922 4222 0922 4322 0d0a | "A"."B"."C"..
                        10 | 3109 3209 33                            | 1.2.3            
          
          . import delimited using test.txt, encoding(utf8) varnames(1) case(preserve)
          (3 vars, 1 obs)
          
          . describe
          
          Contains data
            obs:             1                          
           vars:             3                          
          -------------------------------------------------------------------------------------
                        storage   display    value
          variable name   type    format     label      variable label
          -------------------------------------------------------------------------------------
          A               byte    %8.0g                
          B               byte    %8.0g                
          C               byte    %8.0g                
          -------------------------------------------------------------------------------------
          Sorted by:
               Note: Dataset has changed since last saved.
          
          . list
          
               +-----------+
               | A   B   C |
               |-----------|
            1. | 1   2   3 |
               +-----------+
          
          . datasignature
            1:3(64384):1771983699:2294851648
          Removing the BOM using filefilter:
          Code:
          . filefilter test.txt test2.txt, from("\EFh\BBh\BFh") to("") replace
          (file test2.txt was replaced)
          
          . hexdump test2.txt
                           |                                         |    character
                           |           hex representation            |  representation
                   address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
          -----------------+-----------------------------------------+-----------------
                         0 | 2241 2209 2242 2209 2243 220d 0a31 0932 | "A"."B"."C"..1.2
                        10 | 0933                                    | .3              
          
          . clear                    
          
          . import delimited using test2.txt, encoding(utf8) varnames(1) case(preserve)
          (3 vars, 1 obs)
          
          . describe
          
          Contains data
            obs:             1                          
           vars:             3                          
          -------------------------------------------------------------------------------------
                        storage   display    value
          variable name   type    format     label      variable label
          -------------------------------------------------------------------------------------
          A               byte    %8.0g                
          B               byte    %8.0g                
          C               byte    %8.0g                
          -------------------------------------------------------------------------------------
          Sorted by:
               Note: Dataset has changed since last saved.
          
          . list
          
               +-----------+
               | A   B   C |
               |-----------|
            1. | 1   2   3 |
               +-----------+
          
          . datasignature
            1:3(64384):1771983699:2294851648
          Last edited by Bjarte Aagnes; 22 Oct 2019, 01:27.


          • #20
            Deleted.


            • #21
              I notice that you get correct results even when the BOM is left in the file. As far as I can tell, the difference between your test set and my actual data is that your variable names are quoted and you specify the option varnames(1). It turns out that varnames(1) is the key. The automatic detection of variable names seems to fail in the presence of a BOM.


              • #22
                [email protected] it would help greatly if you could make the entire file available through Dropbox, Google, etc. If not, perhaps you can use the split command that is available under Linux to break the file into multiple smaller files.

                Code:
                split ../myverybigfile.txt  -b 1M -d
                The split command can generate many files, so it is best to create an empty directory and split from there. We only need the first file (named x00 or similar), which should contain the BOM.

                If you end up splitting the file rather than posting the whole file to Dropbox etc., then just email the first 1 megabyte file to [email protected].
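
Since only the first piece is needed, one alternative that avoids creating many files is to copy just the first megabyte with `head -c`. This is a sketch, assuming a `head` that supports `-c` (the GNU and BSD versions do). The function name `first_chunk` and the file names are hypothetical:

```shell
# first_chunk IN OUT [BYTES]: copy the first BYTES bytes of IN to OUT (default 1 MiB).
first_chunk() {
    head -c "${3:-1048576}" "$1" > "$2"
}

# Hypothetical stand-in for the large file: a BOM followed by about 2 MB of filler.
chunk_demo_dir=$(mktemp -d)
{ printf '\357\273\277'; head -c 2000000 /dev/zero | tr '\0' 'x'; } > "$chunk_demo_dir/big.txt"

first_chunk "$chunk_demo_dir/big.txt" "$chunk_demo_dir/first_1M.txt"
wc -c < "$chunk_demo_dir/first_1M.txt"                  # size of the extracted piece in bytes
head -c 3 "$chunk_demo_dir/first_1M.txt" | od -An -tx1  # the BOM bytes still lead the piece
```

As with split -b, the cut is made at a byte count, so the last line of the piece may be incomplete.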
