Netezza is a super-fast platform for databases… once you have data on it. Somehow, getting the data to the server always seems like a bit of a hassle (admittedly, not as big a hassle as old school punchcards). If you’re using Netezza, you’re probably part of a large organization that may also have some hefty ETL tools that can do the transfer. But if you’re not personally part of the team that does ETL, yet still need to put data onto Netezza, you’ve got to find another way. The EXTERNAL TABLE functionality may just be the solution for you.
The Situation
You have some sort of basic file on a non-server drive that you need to load into a table in Netezza. This is usually an unencrypted, delimited file. The delimiter can be whatever you like but standard might be comma- or pipe-delimited.
There are several ways to accomplish this task. If you are using the Aginity Workbench for Netezza, you can use the Import function. The wizard steps will ask you to specify field formats and a table name. This process does add a little bit of overhead and may take longer than you’d like. You can also use Workbench to create a table in the structure you need and then Import to it directly. However, if your file has any incongruous lines, the process is likely to error.
The EXTERNAL TABLE Solution
External tables basically form a link from the Netezza server to your local environment (desktop or shared drive) but do not immediately import the data. You just specify how it would. If you were to query the external table, it would read from the file as needed but would not store a copy of the data on the server.
The Code
Technically, this is explicit external table schema definition with options set. There are more streamlined ways to do this but, if you’re fussy about getting the schema absolutely right, I recommend using the method below. It also allows for better error checking and handling if you run into problems.
create external table EXTERNALTABLENAMEHERE ( tbl_id bigint, date_entered timestamp, name_of_stuff varchar(100), spend_amt numeric(16,2) ) USING ( DATAOBJECT('C:\Data Sources\some_random_file.txt') REMOTESOURCE 'odbc' DELIMITER '~' SKIPROWS 1 MAXERRORS 1000 LOGDIR 'C:\\' ); create table PERMANENTTABLENAMEHERE as select * from EXTERNALTABLENAMEHERE;
The Breakdown
The first part of EXTERNAL TABLE looks largely like a normal CREATE TABLE statement. You give the external table a name and provide the DDL. The second part is where all the fun stuff happens.
The USING statement indicates that it will be reading from a different source. You then state the source by referencing the file location as the DATAOBJECT and the source connection as the REMOTESOURCE. If the file is already located on the Netezza server somewhere (just not in a table in a database), you can skip the REMOTESOURCE piece. REMOTESOURCE can take one of three values: odbc, jdbc, or ole-db. Use the one appropriate for your connection type.
The DELIMITER allows you to set a custom field delimiter. It defaults to pipe if you don’t put anything in so be sure to check this before running your code.
Next in the example above is SKIPROWS. This is useful if you have headers in your first record (or however many). Just provide the number of records to ignore at the top.
MAXERRORS is the number of erred records that can be written to the log before the process will fail. This does not happen when you execute the CREATE EXTERNAL TABLE step. It will happen if you attempt to read something from the EXTERNAL TABLE either by running a SELECT statement or using CREATE TABLE AS to pull the data from the external table into a normal table.
Make sure that you specify a LOGDIR if you use MAXERRORS. This is the directory to which the log file and errors will be written (which means you need to have write access there). Any errors will show up in a file that gets written to the TABLENAME.DATABASE.NZBAD. You will also see TABLENAME.DATABASE.NZLOG as the log file written, whether or not any errors were encountered.
The final step is to CREATE TABLE from the external table. This reads the file from the link in the external table definition, reads the external table DDL, and combines the two into a regular table.
Other Options with External Tables
There are a bunch of options that are configurable during your CREATE EXTERNAL TABLE procedure. I recommend looking at the IBM External Table Load Options documentation, which also goes through what default behaviors are for this command. Some of the ones I’ve used before are below with explanation.
BOOLSTYLE Used to specify how BOOLEAN fields are encoded in your file. For example, if you have ‘True’ and ‘False’ as your values, you would use True_False as your option command.
FILLRECORD Sometimes you have a file with fields missing at the end (not optimal, but it happens). In this case, use FILLRECORD TRUE to allow for records with missing fields to be entered into the table.
ENCODING Files with character encoding not aligned to that used by the database must be specified. You can state Latin9 or UTF8 as the encoding structure.
MAXROWS If you are worried that you may not have it quite right yet and want to test load a sample, use the MAXROWS command to give a number of rows to load in. This will take less time than loading the whole file if you are still in development.
REQUIREQUOTES and QUOTEDVALUE These kind of go together to deal with quoting of string values. If you set REQUIREQUOTES TRUE in your options, that means any strings not enclosed in quotes will error. It also means that you have to specify the QUOTEDVALUE as YES, SINGLE or DOUBLE. YES and SINGLE are roughly synonymous. DOUBLE means that strings are enclosed in double-quotes.
TIMESTYLE This allows you to specify 24HOUR or 12HOUR to denote how times should be read. It defaults to 24HOUR so if you are using 12HOUR (with A.M. and P.M.), you need to state 12HOUR.
Again, these are only a handful of the available options. If you’re having trouble with a given file, check the documentation and/or leave a comment below.
How to create external file (csv file) having same schema with the existing table in netezza DB.
I tried # createExternalTableCmd =
CREATE EXTERNAL TABLE {externalTableName}
SAMEAS {dataTableName}
USING
(
DATAOBJECT (‘{csvFilePath}’)
Y2BASE 2000
ENCODING ‘internal’
REMOTESOURCE ‘ODBC’
ESCAPECHAR ‘\’
);
INSERT INTO {externalTableName}
SELECT *
FROM {dataTableName}
;
With the help of above query, external file got created but schema of csv file is not proper. set of columns as single column separated by pipe symbol instead of each column individually even I used the word SAMEAS.
another command which is also tried
createExternalTableCmd =
CREATE EXTERNAL TABLE ‘{csvFilePath}’
USING
(
Y2BASE 2000
ENCODING ‘internal’
REMOTESOURCE ‘ODBC’
ESCAPECHAR ‘\’
DELIMITER ‘{delimiter}’
) AS SELECT * FROM {dataTableName}’;
It also worked in creating external file but failed in having same schema of existing table say “dataTableName” . Can you please help me out how to create a external table (csv file) with same schema of existing table.
Is the issue that it is pipe-delimited rather than comma-delimited? Are the fields in a different order? What environment are you using to run this command (e.g. Aginity Workbench, Squirrel, etc.)?
Can you please provide an example of the output files for each of these?
In the first example, you had not specified a delimiter. If your environment default delimiter was set to a pipe, that would explain the lack of commas for delimiting as it may override the CSV designation.
In the second example, it would be most helpful to understand in what way the schema of the external file differed from the structure of your table in Netezza. Please provide specifics.
Hello and thank you for this information. I am having particular trouble understanding the use of the DATAOBJECT & REMOTESOURCE options… In our environment we have the Netezza host and a separate Linux ‘script-host’ where we keep most of the scripts used in the ETL process as well as some pipe-delimited flat files that we load using some utility.
It is this utility that I would like to deprecate and instead load this data via the external table functionality.
In all of the examples that I can find, the DATAOBJECT is defined as ‘C:\somedir\somefile.txt’. But how does the process know where ‘C:\’ is? Or, in my case where do I define the ‘host’ where my DATAOBJECT lives (which is not the Netezza host)?
Then, what exactly does the REMOTESOURCE specify and would there be some additional setup required for an odbc/jdbc/oledb connection?
It sounds like if my files lived on the Netezza host it would be a non-issue and perhaps I could simply provide the path to the file- but I do not have access to save files onto the Netezza host.
Hopefully I’ve explained the situation and our environment enough for you to offer some guidance.
Thank you-
Eric
When you see the ‘C:\’ directories referenced in these examples including the REMOTESOURCE option, it is typically pointing to the system of the user – meaning their local machine/terminal. This could be a virtual environment, another servicer, a thin client, etc. The REMOTESOURCE is the system to connect to in order to find the data. The DATAOBJECT is the path to the file you want to load. If you do not provide a REMOTESOURCE location, then it defaults to looking for that path on the Netezza server. The knowledge base article from IBM on loading from remote sources is a good place to start.
Thanks for the detailed explanation. It was very helpful.