Date Intervals in HiveQL

black-beehive-wig This picture is definitely not me. But as this is my first post regarding Hive, I felt the need to include a photo of a ridiculous beehive hairdo.

As is the case with many other Data Scientists, I am being pulled increasingly into the world of Hadoop and all the technologies associated with it. Lately this has meant trying to sort out how to do certain functions in HiveQL that I’ve grown familiar and comfortable with in various types of SQL.

Today’s conundrum was trying to determine if someone is at least 18 years old based on their birthday. Normally, I would use one of the following:

date_of_birth <= current_date - interval('18 years') -- check for True condition OR extract(years from age(current_date,date_of_birth)) --check for >= 18

There are a few functions that can be used here: DATE_ADD, DATE_SUB, MONTHS_BETWEEN, or ADD_MONTHS.

The DATE_ADD and DATE_SUB are roughly synonymous except that one adds days and the other subtracts them. I suppose that you could add a negative number of days though, if you wanted to just learn one of the two. That might look something like this:

date_of_birth <= date_add(current_date,(-18*365))

The number of days being added is calculated using 365 as a nice round number for dealing with years. However, it does not take into account that there could a few leap years in the mix. The DATE_ADD function does not allow decimals in the number of days to add so it cannot be used as -18*365.25 or something. This is not my preferred method. I like more precision.

Next up is the ADD_MONTHS function, which is sort of like using the interval except that you have to calculate the interval based on months rather than years.

date_of_birth <= add_months(current_date,(-18*12))

I prefer this method because it accounts for the leap year stuff by ignoring the actual number of days and just changing the months.

Similarly, the MONTHS_BETWEEN could be used along with division to get to the number of years.

months_between(current_date,date_of_birth)/12.0 --check for >= 18

Inserting Multiple Rows into Netezza Table

access-denied-715x400 This one has been irking me for quite a bit. If you have to insert multiple rows into a table from a list or something, you may be tempted to use the standard PostgreSQL method of…

INSERT INTO TABLE_NAME VALUES
(1,2,3),
(4,5,6),
(7,8,9)
;

Be warned – THIS WILL NOT WORK IN NETEZZA! Netezza does not allow for the insertion of multiple rows in one statement if you are using VALUES.

You have two options here: 1) create a file with your values and load it (see Loading Data into Netezza post) or 2) use individual INSERT INTO statements. With a small, simple set of records like the one above, the second method will do fine. It would look like this…

INSERT INTO TABLE_NAME VALUES (1,2,3);
INSERT INTO TABLE_NAME VALUES (4,5,6);
INSERT INTO TABLE_NAME VALUES (7,8,9);

If you have larger sets of data to insert or more complex row structures, consider using an external table.

Official documentation for IBM Netezza INSERT command.

Loading Data into Netezza Using Create External Table

punch_card.75dpi.rgb Netezza is a super-fast platform for databases… once you have data on it. Somehow, getting the data to the server always seems like a bit of a hassle (admittedly, not as big a hassle as old school punchcards). If you’re using Netezza, you’re probably part of a large organization that may also have some hefty ETL tools that can do the transfer. But if you’re not personally part of the team that does ETL, yet still need to put data onto Netezza, you’ve got to find another way. The EXTERNAL TABLE functionality may just be the solution for you. Continue reading

Common Table Expressions versus Temp Tables in Netezza

This seems to be either a controversial or overly-technical topic: should you use WITH (a common table expression a.k.a. CTE) or a TEMP TABLE. Both can serve similar purposes but each has their own strengths and weaknesses in how they work with other aspects of your query or procedure.

So today we’ll take a look at both without going into crazy detail but covering at least the basics.
Continue reading

Epic Epoch Time in Netezza and PostgreSQL

Epoch time is both a blessing and a curse. It is super-convenient for counting seconds (and doing calculations based on them) but can also be a pain to try to get into something readable as, or comparable to, a recognizable date. So today we’ll get into and out of epoch to show its flexibility without our brains having to be contortionists too. Continue reading

Netezza Stored Procedures and Optional Arguments

netezza Let’s say you have a process that requires a start date, but may or may not have an end date, as a parameter feeding in. How can you accomplish this in Netezza via a stored procedure? The internet has not had great answers to this question but it’s not as hard as some other programmers make it out to be. So today we tackle optional arguments – in easy mode. Continue reading

Fun Uses for Windowing Functions

I’ve recently come to love (read:be obsessed with) windowing functions in my coding. They’re just so useful and practical.

For those who haven’t experienced the joys of windowing, here’s the deal. They allow you to do calculations across multiple rows without actually having to group, thereby storing aggregate info on each record. That means you keep all the data associated with the row and can add calculated fields that rely on interaction with other rows. Pretty swiffy, huh?

Below are just a few funky functions that I’ve found helpful. I’m not saying that these aren’t resource intensive, but they may just save you from having to join to some crazy aggregation sub-queries and then export to Excel for further manipulation to get the same result. Continue reading

When Lag is Awesome

Imagine, if you will, begin able to calculate the time between visits to a website, transactions in a store, logs from a punch-clock, etc. in just one step. Well, I have found the way! Continue reading

Correlated Query Error in Netezza

Beta… you know… for correlation?

Fun and interesting error today. Here’s the actual error text:

Error 2: this form of correlated query is not supported – consider rewriting

I’d never heard of this “correlated query” business before so I had to look it up to sort out what was going on. Turns out that you can reference a table in the outside part of a query from within a subquery by calling the alias… Or, rather, you can’t in Netezza.

Tip of the day:

Check for alias references in your subquery and get rid of them Continue reading

Calculating Percentiles in PostgreSQL

Slicing data into manageable chunks for viewing is crucial when you start dealing with more records than will fit in something like Excel (without PowerPivot, of course). One of the most common ways to look at data in a more easily-digestible manner is to use percentiles, or some derivative thereof, to group records based on a ranking. This allows you to then compare equal-sized groups to one another in order to form conclusions as to relative behavior.

Most SQL variants have workarounds for how to accomplish this task that may or may not actually cover 100% of your data (may drop a few records here or there when trying to round a percentage to a whole number of records to pull). PostgreSQL, on the other hand, has a handy function built in for doing this sort of thing without having to worry about getting full coverage on your table. Continue reading

Lexy Kassan

The Truth Against the World

Coding Conundrums