
  • PostgreSQL text field loads into Stata as strL

    Cross posted from https://stackoverflow.com/questions/...-stata-as-strl

    I have data stored in PostgreSQL with the data type text. When I load this data into Stata it has type strL, even if every string in a column is only one character long. This takes up too much memory. I would like to continue using the text type in PostgreSQL.

    Is there a way to specify that `text` data from PostgreSQL is loaded into Stata with type str8? I also want numeric data to be loaded as numeric values, so allstring is not a good solution. I would also like to avoid specifying the data type on a column-by-column basis.

    The command I use to load data into Stata is

    Code:
    odbc load, exec("SELECT * FROM mytable") <connect_options>
    profile.do contains the following:

    Code:
    set odbcmgr unixodbc, permanently
    set odbcdriver ansi, permanently
    odbc.ini contains the following:

    Code:
    [database_name]
    Debug = 0
    CommLog = 0
    ReadOnly = no
    Driver = /usr/local/lib/psqlodbcw.so
    Servername = <server>
    FetchBufferSize = 99
    Port = 5432
    Database = postgres

    In PostgreSQL `mytable` looks like this:


    Code:
    postgres=# \d+ mytable
                       Table "public.mytable"
     Column | Type | Modifiers | Storage  | Stats target | Description
    --------+------+-----------+----------+--------------+-------------
     c1     | text |           | extended |              |
     c2     | text |           | extended |              |
    
    postgres=# select * from mytable;
     c1 |  c2
    ----+-------
     a  | one
     b  | two
     c  | three
    (3 rows)
    In Stata mytable looks like this:

    Code:
    . describe
    
    Contains data
      obs:             3
     vars:             2
     size:           500
    ---------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    ---------------------------------------------------------------------------
    c1              strL    %9s
    c2              strL    %9s
    ---------------------------------------------------------------------------
    Sorted by:
         Note: Dataset has changed since last saved.

    I am using PostgreSQL v9.6.5 and Stata v14.2.

  • #2
    I don't know if you can get -odbc- to do this for you or not (perhaps somebody else does and will respond). But if you can't, you can use the -recast- command to change the strLs to str#s of the appropriate length. If you don't know in advance the maximum lengths of c1 and c2, you can compute them first to make sure that -recast- will allow you to proceed. (-recast- will balk if you specify a str# too small to hold the data without loss of information.)
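
    For example, a minimal sketch along those lines, assuming the data are already in memory and using the variable name c1 from the original post:

    Code:
    * find the longest string in c1, then recast to a matching str#
    generate long len1 = strlen(c1)
    summarize len1, meanonly
    local m = max(r(max), 1)      // guard against an all-empty column
    recast str`m' c1
    drop len1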



    • #3
      Thanks for your reply. That is helpful information. I do hope there is an -odbc- solution because I'm trying to make accessing our Postgres database as easy as possible for Stata users at my organization.



      • #4
        Molly,

        as per the PostgreSQL data type definitions (see Table 8-1), the data type text corresponds to a "variable-length character string". When importing data, a dialog similar to the following occurs:
        • Stata: "What type is this column?"
        • PostGresql : "It is variable-length character string"
        • Stata: "Ah, ok, I will store it as strL then. Let's talk about the next column..."
        Nowhere in this dialog, I believe, does Stata ask "Oh, by the way, you said it is a variable-length character string; could you please clarify the maximum width of the content?" (in part because Stata starts receiving the query output before the query has finished, so Postgres cannot say in advance what the result will contain). So Stata pessimistically assumes a huge width for the column, to be able to absorb the result of the query. Your knowledge that the resulting column contains only a few letters is based either on external facts not known to PostgreSQL or Stata (you could have declared the column with a fixed width of NN to manifest this knowledge), or it is situational, and your data will eventually include a very wide string (in which case a cast may ruin the content, but only if you use it carelessly).
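
        As a sketch of that parenthetical alternative only (the width of 8 is arbitrary, issued once against the database via Stata's -odbc exec-, using the same connection options as the load command):

        Code:
        * declare a bounded width in PostgreSQL itself instead of text
        odbc exec("ALTER TABLE mytable ALTER COLUMN c1 TYPE varchar(8)"), <connect_options>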

        This was not a big deal in earlier versions of Stata, where the width of strings was limited to 244 characters. Nor is the penalty prohibitively large now: you pay just a handful of bytes of overhead for larger content, and there is a saving from data compression (string coalescing - the word I can't pronounce, but meaning simply that only one copy of a long string is stored when it occurs multiple times in the data). The savings grow with more repetition and longer content.

        If my memory is right, StatTransfer by default converts any string columns of width 50 or more characters to strLs (though this is configurable); see the Stata strl threshold parameter in the Output options 1 dialog.

        In short, if you "would like to continue using the text type in PostgreSQL", then you must use the strL type in Stata.

        Best, Sergiy Radyakin



        • #5
          Hi Molly.

          I encountered this issue as well when ingesting data from a Microsoft SQL Server into Stata via odbc load. I did not find a solution, but I found a workaround that may help (depending on the amount and layout of data that you load):
          You can force Stata to ask Sergiy's imaginary question about string width by using the option allstring when invoking odbc load. This will make Stata correctly import the variables as str# with the appropriate width.

          The price I pay for this workaround, however, is that I have to manually convert all numeric information in the data I ingest back to numeric after the import process. As I am importing mainly string information and only a few numeric variables, this is the fastest approach I found with my data layout, but that may be different for yours.
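
          A rough sketch of the workaround applied to the table from the original post (the connection options are the same placeholder as before, and numvar1 numvar2 stand for whichever columns are really numeric):

          Code:
          * import everything as fixed-width strings
          odbc load, exec("SELECT * FROM mytable") allstring <connect_options>
          * then convert the truly numeric columns back to numbers
          destring numvar1 numvar2, replace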

          Kind regards
          Bela



          • #6
            If it's a problem that variable-length string data are imported into Stata as strL, then why not query the database as to the maximum string length of that column and then CAST() or CONVERT() it to a fixed-length string in the SQL statement that retrieves the data?
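
            A hedged sketch of that approach for the table in the original post (the width of 8 is illustrative; query max(length(c1)) first to pick it, and the connection options are the same placeholder as before):

            Code:
            * cast on the SQL side so the driver reports fixed-width strings
            odbc load, exec("SELECT CAST(c1 AS varchar(8)) AS c1, CAST(c2 AS varchar(8)) AS c2 FROM mytable") <connect_options>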



            • #7
              Joseph Coveney: good point; when I encountered this, I experimented with CAST()ing the data in my SQL query first. However, in my setup, I ran some timing tests on both variants, and using odbc load, allstring and afterwards re-converting the numeric variables was considerably faster than casting each and every string column on the SQL side.

              This, of course, is again dependent on the data, and may be different in Molly's scenario. When there are only a few string columns, the CAST() variant should be the quickest way to go. Thanks for the addition; I forgot about this variant.

              Regards
              Bela



              • #8
                Is there a reason you cannot use compress to go from strL to str1 after loading the data? For example:

                Code:
                . clear
                
                . input strL x
                
                             x
                  1. a
                  2. b
                  3. c
                  4. end
                
                . compress x
                  variable x was strL now str1
                  (267 bytes saved)



                • #9
                  Originally posted by Dimitriy V. Masterov
                  Is there a reason you cannot use compress to go from strL to str1 after loading the data: [...]
                  This works for example data, but not for (most) real-life strings: compress only converts strL to str# if this saves memory compared to a coalesced strL. So as soon as there is a considerable amount of redundant string information, it will stick with strL (as documented in help compress):
                  Code:
                  clear
                  input strL x
                  "x"
                  "abcdefghuijklmnopqrstuvwxyzäöüß"
                  "bcdefghuijklmnopqrstuvwxyzäöüßa"
                  "cdefghuijklmnopqrstuvwxyzäöüßab"
                  "abcdefghuijklmnopqrstuvwxyzäöüß"
                  "bcdefghuijklmnopqrstuvwxyzäöüßa"
                  "cdefghuijklmnopqrstuvwxyzäöüßab"
                  "abcdefghuijklmnopqrstuvwxyzäöüß"
                  "bcdefghuijklmnopqrstuvwxyzäöüßa"
                  "cdefghuijklmnopqrstuvwxyzäöüßab"
                  "abcdefghuijklmnopqrstuvwxyzäöüß"
                  "bcdefghuijklmnopqrstuvwxyzäöüßa"
                  "cdefghuijklmnopqrstuvwxyzäöüßab"
                  end
                  replace x=x*5
                  compress
                  However, compress has a nocoalesce option, which overrides the string-coalescing procedure; I have not tried it with real-life data yet, but maybe this is a solution for Molly.
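
                  A minimal sketch of that variant, continuing the example above:

                  Code:
                  * override the string-coalescing procedure so strL can be demoted to str#
                  compress x, nocoalesce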

                  Anyway, from a memory-efficiency perspective, I have the feeling that this would not be the ideal way to go (same problem as with my workaround above): importing in the desired data type straight away should save memory compared to importing the wrong data type and converting afterwards. The latter might especially cause trouble when ingesting large datasets into Stata. Also, I would assume that SQL is a more efficient conversion tool than Stata. But that's more a feeling than a hard fact, and I may be wrong about it.

                  Regards
                  Bela
                  Last edited by Daniel Bela; 11 Jan 2018, 02:03.

