  • odbc insert with Hive ODBC fails

    Hello,
    this is my first post here. I have some issues writing back to a Hive table using ODBC.
    More specifically, I have configured unixODBC with Cloudera's Hive ODBC driver (Simba). The ODBC connection is fine: `odbc query`, `odbc load`, `odbc exec`, and `odbc describe` all work, but I have some issues with `odbc insert`.
    * When the variable is an integer, `odbc insert` with create fails, because the CREATE TABLE statement it generates uses `integer`, which is not a valid type in Hive; the correct one is `int`. I can bypass this issue by creating the table in advance with `odbc exec` and then inserting the data (a sketch of this workaround appears after the trace below). If there is a flag or mapping table that I can modify to change `integer` to `int`, please let me know.
    * Second, and most important, the insert always fails. The error is not helpful (r(682)), so I had to turn on tracing in the driver and run Stata from the command line. The relevant part of the trace is:
    Nov 10 09:14:03.211 TRACE 34121024 ::CInterface::SQLSetConnectAttr: +++++ enter +++++
    Nov 10 09:14:03.211 TRACE 34121024 Simba::ODBC::Connection::SQLSetConnectAttr: +++++ enter +++++
    Nov 10 09:14:03.211 INFO 34121024 Simba::ODBC::Connection::SQLSetConnectAttr: Attribute: SQL_ATTR_AUTOCOMMIT (102)
    Nov 10 09:14:03.211 TRACE 34121024 Simba::ODBC::ConnectionState::SQLSetConnectAttr: +++++ enter +++++
    Nov 10 09:14:03.219 ERROR 34121024 Simba::ODBC::Connection::SQLSetConnectAttr: [Cloudera][ODBC] (11470) Transactions are not supported.
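
    For reference, a sketch of the workaround for the first issue; the names (my_db.my_table, my_var) are illustrative:
    Code:
    * create the table with Hive's own type name via odbc exec, then insert
    odbc exec("CREATE TABLE my_db.my_table (my_var INT)"), dsn("Hive")
    odbc insert my_var, table("my_db.my_table") dsn("Hive")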

    Is there something I can tune on the Stata side to make this work?

    Thanks in advance for your support.

    Information:
    Stata/IC 16.1 for Unix - Revision 29 Sep 2020
    unixODBC: 2.3.4
    OS: Ubuntu 18.04
    Cloudera Hive ODBC: 2.6.9

  • #2
    From the trace log, the fatal error has to do with an attempt to use transactions.

    You don't show your command, but I'm assuming that you're using a DSN. I'm not familiar with Unix/Linux ODBC drivers, but you might be able to set the transaction mode when you set up the data source name.

    If not, then you'll probably need to use the alternative, i.e., a connection string, where you can explicitly set the transaction mode to off. See
    Code:
    help odbc
    for how to use connection strings in lieu of DSNs in Stata's -odbc- suite of commands.


    The documentation for the driver should show the string you'd need to use.
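
    For instance, a DSN-less call might look something like the sketch below; the keys shown (Driver, Host, Port) are illustrative only and need to be checked against the Cloudera driver's documentation:
    Code:
    odbc insert my_var, table("my_db.my_table") ///
        connectionstring("Driver=Cloudera ODBC Driver for Apache Hive;Host=myhost;Port=10000")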



    • #3
      Hello,
      first of all thanks for the reply.
      My mistake; I thought that, writing about unixODBC, it was obvious I am referring to a DSN.
      The full command I am using is:
      Code:
      odbc insert my_var, table("my_db.my_table") dsn("Hive")
      The Hive driver has no such parameter, or at least I could not find one; usually this is set by the application.
      In any case, I had read `help odbc` before opening this post and could not find any useful information. The documentation about `connectionString` is not sufficient. I suppose the connection string is simply DSN-less, with no additional parameters other than the ones provided by the driver.

      Moreover, I don't believe it is an issue with the driver, as I am using the same one to write data from R, and that works fine.



      • #4
        It wasn't obvious to me; I don't work with unixODBC. The information that you've provided might be helpful to others on the list who do.

        As for driver parameters regarding transaction mode, I was thinking of something like this.

        I don't think that Stata explicitly wraps inserts in transactions, but you can check what the application is sending in your case with the sqlshow option.
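
        For example, with your insert command (sqlshow just echoes the SQL statements that odbc issues):
        Code:
        odbc insert my_var, table("my_db.my_table") dsn("Hive") sqlshow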



        • #5
          Make sure that it's not something simple, such as a constraint violation or a name mix-up: attempting to insert duplicates into a primary key, say, or my_var already being in use for another object in the same database.

          I've been focusing on transactions, because that's what the error in the log says (and because a short excursion into the Internet shows that Hive acts squirrelly with respect to them). But sometimes the reported error is downstream from that which triggers it.

          You're inserting data for one column; I assume it's the primary key and that you don't have a few missing-value observations hanging around at the bottom of the Stata dataset.
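
          A couple of quick checks on the Stata side, as a sketch (my_var is the variable from your insert command):
          Code:
          count if missing(my_var)      // observations that would arrive as NULLs
          duplicates report my_var      // values that would collide in a key column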

          I assume that all of the other columns in my_db.my_table have defaults, are nullable and so on.

          If inserts from R haven't been a problem, then try inserting this exact column of data from R and see whether it goes in without a hitch.

          Try inserting it rowwise via odbc exec(), looping over observations. (Are you using block inserts?) See below for what I mean. (I'm using Microsoft's SQL Server and its ODBC driver, but I don't know of any reason why the same approach wouldn't work for Hive and unixODBC.)

          As I mentioned, I doubt that Stata itself, as an application, is making unusual demands on the ODBC driver or the database management system. The problem might arise from more pedestrian errors like those mentioned here (constraint violations, NULLs in a primary key, name confusion, lack of defaults for the other columns, etc.), with the driver reporting a transactions-not-supported fault only because that is how Hive ultimately surfaces the problem.

          Code:
          . 
          . version 16.1

          . 
          . clear *

          . 
          . quietly set obs 3

          . 
          . generate byte pid = _n

          . generate str stuff = char(_n + 64)

          . 
          . list, noobs

            +-------------+
            | pid   stuff |
            |-------------|
            |   1       A |
            |   2       B |
            |   3       C |
            +-------------+

          . 
          . odbc exec("USE Sandbox;"), dsn(Sandbox13)

          . 
          . local sql_statment ///
          >         CREATE TABLE dbo.Probe ( ///
          >                 pid INTEGER NOT NULL PRIMARY KEY, ///
          >                 stuff CHAR(1) NULL ///
          >         );

          . odbc exec("`sql_statment'"), dsn(Sandbox13)

          . 
          . *
          . * Like this
          . *
          . forvalues row = 1/`=_N' {
            2.         local pid = pid[`row']
            3.         local stuff = stuff[`row']
            4.         local sql_statement INSERT INTO dbo.Probe (pid, stuff) VALUES (`pid', '`stuff'');
            5.         display in smcl as text "`sql_statement'"
            6.         odbc exec("`sql_statement'"), dsn(Sandbox13)
            7. }
          INSERT INTO dbo.Probe (pid, stuff) VALUES (1, 'A');

          INSERT INTO dbo.Probe (pid, stuff) VALUES (2, 'B');

          INSERT INTO dbo.Probe (pid, stuff) VALUES (3, 'C');

          . 
          . local line_size `c(linesize)'

          . set linesize 72

          . 
          . odbc exec("SELECT * FROM dbo.Probe;"), dsn(Sandbox13)

              1. +---------------------------------------------------------+
                 |pid   | 1                                                 |
                 |stuff | A                                                 |
                 +---------------------------------------------------------+

              2. +---------------------------------------------------------+
                 |pid   | 2                                                 |
                 |stuff | B                                                 |
                 +---------------------------------------------------------+

              3. +---------------------------------------------------------+
                 |pid   | 3                                                 |
                 |stuff | C                                                 |
                 +---------------------------------------------------------+

          . 
          . set linesize `line_size'

          . 
          . odbc exec("DROP TABLE dbo.Probe;"), dsn(Sandbox13)

          . 
          . exit

          end of do-file

          . 



          • #6
            Hello,
            first of all thank you for your time.
            One by one.
            The TransactionMode you refer to is supported by another driver (which probably strips such requests from applications), not by the Cloudera one that we are using. We use Cloudera mainly because we have a licensed Cloudera Hadoop cluster.
            Anyway, I am focusing on the client because, as I mentioned before, the exact same scenario works from the same machine using the same ODBC connection.

            About the rest: I am inserting only one column because I want to test write-back. I haven't mentioned from the beginning that I am a Hadoop administrator (so this is my point of view); I configured this ODBC setup for our scientists, and I am currently doing the troubleshooting.
            About primary keys: there is no such concept in Hive, so there is no way the insert fails due to multiple identical values.

            Based on the code you provided (really helpful, and I appreciate it), I have tried the following and it does work (I had also mentioned in my original post that `odbc exec` works, since I create the table with it).
            Code:
            local my_values ""

            * build one VALUES list over all observations
            forvalues row = 1/`=_N' {
                local item_id = item_id[`row']
                if "`my_values'" == "" {
                    local my_values "(`item_id')"
                }
                else {
                    local my_values "`my_values',(`item_id')"
                }
            }

            local sql "insert into table default.test_from_stata values `my_values'"

            display in smcl as text "`sql'"
            odbc exec("`sql'"), dsn("Hive")
            So the problem is only with `odbc insert`.

            PS: I have modified it to build one big `insert` statement, because inserting data into Hive row by row would be catastrophic: Hive (with HDFS as the storage system) creates one file per row in HDFS, whereas the statement above generates only one file, which is ideal for HDFS.
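
            If the dataset ever gets large, a single macro may exceed Stata's macro length limit. A sketch of the same idea, batched every 1,000 rows (table and variable names as above), which still produces far fewer HDFS files than row-by-row inserts:
            Code:
            local chunk 1000
            local my_values ""

            forvalues row = 1/`=_N' {
                local item_id = item_id[`row']
                if "`my_values'" == "" local my_values "(`item_id')"
                else local my_values "`my_values',(`item_id')"

                * flush a batch every `chunk' rows, and at the end of the data
                if mod(`row', `chunk') == 0 | `row' == _N {
                    odbc exec("insert into table default.test_from_stata values `my_values'"), dsn("Hive")
                    local my_values ""
                }
            }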

