Date Intervals in HiveQL

This picture is definitely not me. But as this is my first post regarding Hive, I felt the need to include a photo of a ridiculous beehive hairdo.

As is the case with many other Data Scientists, I am being pulled increasingly into the world of Hadoop and all the technologies associated with it. Lately this has meant trying to sort out how to do certain functions in HiveQL that I’ve grown familiar and comfortable with in various types of SQL.

Today’s conundrum was trying to determine if someone is at least 18 years old based on their birthday. Normally, I would use one of the following:

date_of_birth <= current_date - interval '18 years'   -- check for TRUE condition
-- or
extract(year from age(current_date, date_of_birth))   -- check for >= 18

There are a few functions that can be used here: DATE_ADD, DATE_SUB, MONTHS_BETWEEN, or ADD_MONTHS.

The DATE_ADD and DATE_SUB functions are roughly synonymous except that one adds days and the other subtracts them. I suppose you could just add a negative number of days, though, if you wanted to learn only one of the two. That might look something like this:

date_of_birth <= date_add(current_date,(-18*365))

The number of days being added is calculated using 365 as a nice round number for dealing with years. However, it does not take into account that there could be a few leap years in the mix. The DATE_ADD function does not allow decimals in the number of days to add, so you can’t pass something like -18*365.25 instead. This is not my preferred method. I like more precision.
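
To make the drift concrete, here’s a quick, hypothetical check run on an arbitrary date:

-- pretend today is 2018-03-01; there are four leap days (2004, 2008, 2012, 2016)
-- between 2000-03-01 and 2018-03-01, so 18*365 days falls four days short
date_of_birth <= date_add('2018-03-01', -18*365)  -- cutoff works out to 2000-03-05, not 2000-03-01
-- so someone born 2000-03-04, who isn't actually 18 yet, would slip through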

Next up is the ADD_MONTHS function, which is sort of like using the interval except that you have to calculate the interval based on months rather than years.

date_of_birth <= add_months(current_date,(-18*12))

I prefer this method because it accounts for the leap year stuff by ignoring the actual number of days and just changing the months.

Similarly, the MONTHS_BETWEEN could be used along with division to get to the number of years.

months_between(current_date, date_of_birth)/12.0  -- check for >= 18
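
Putting it all together, a minimal sketch of the full check in HiveQL might look like the following (the customers table and column names are just placeholders):

-- anyone at least 18 years old as of today
SELECT customer_id,
       date_of_birth
FROM customers
WHERE date_of_birth <= add_months(current_date, -18*12);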

Inserting Multiple Rows into Netezza Table

This one has been irking me for quite a while. If you have to insert multiple rows into a table from a list or something, you may be tempted to use the standard PostgreSQL method of…

INSERT INTO TABLE_NAME VALUES
(1,2,3),
(4,5,6),
(7,8,9)
;

Be warned – THIS WILL NOT WORK IN NETEZZA! Netezza does not allow for the insertion of multiple rows in one statement if you are using VALUES.

You have two options here: 1) create a file with your values and load it (see Loading Data into Netezza post) or 2) use individual INSERT INTO statements. With a small, simple set of records like the one above, the second method will do fine. It would look like this…

INSERT INTO TABLE_NAME VALUES (1,2,3);
INSERT INTO TABLE_NAME VALUES (4,5,6);
INSERT INTO TABLE_NAME VALUES (7,8,9);

If you have larger sets of data to insert or more complex row structures, consider using an external table.

Official documentation for IBM Netezza INSERT command.

Loading Data into Netezza Using Create External Table

Netezza is a super-fast platform for databases… once you have data on it. Somehow, getting the data to the server always seems like a bit of a hassle (admittedly, not as big a hassle as old school punchcards). If you’re using Netezza, you’re probably part of a large organization that may also have some hefty ETL tools that can do the transfer. But if you’re not personally part of the team that does ETL, yet still need to put data onto Netezza, you’ve got to find another way. The EXTERNAL TABLE functionality may just be the solution for you.
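
As a rough sketch of the idea (the file path, table, and column names here are made up, so check the options against your own setup), loading a delimited file through an external table looks something like this:

-- define an external table that points at a flat file
CREATE EXTERNAL TABLE ext_customers (
    customer_id   INTEGER,
    customer_name VARCHAR(100)
)
USING (
    DATAOBJECT ('/tmp/customers.csv')  -- path to the file
    DELIMITER ','
    SKIPROWS 1                         -- skip the header row
    REMOTESOURCE 'ODBC'                -- needed when the file sits on your client machine
);

-- then pull it into a real table
INSERT INTO customers SELECT * FROM ext_customers;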

Common Table Expressions versus Temp Tables in Netezza

This seems to be either a controversial or an overly technical topic: should you use WITH (a common table expression, a.k.a. CTE) or a TEMP TABLE? Both can serve similar purposes, but each has its own strengths and weaknesses in how it works with other aspects of your query or procedure.

So today we’ll take a look at both without going into crazy detail but covering at least the basics.
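
For orientation, here are the two shapes side by side – a minimal sketch with made-up table and column names (exact temp-table syntax can vary a bit by Netezza version):

-- CTE: defined inline and lives only for this one statement
WITH recent_orders AS (
    SELECT customer_id, order_amount
    FROM orders
    WHERE order_date >= '2015-01-01'
)
SELECT customer_id, SUM(order_amount) AS total_spend
FROM recent_orders
GROUP BY customer_id;

-- Temp table: materialized once, sticks around for the session, reusable by later queries
CREATE TEMP TABLE recent_orders AS
SELECT customer_id, order_amount
FROM orders
WHERE order_date >= '2015-01-01';

SELECT customer_id, SUM(order_amount) AS total_spend
FROM recent_orders
GROUP BY customer_id;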

Epic Epoch Time in Netezza and PostgreSQL

Epoch time is both a blessing and a curse. It is super-convenient for counting seconds (and doing calculations based on them) but can also be a pain to try to get into something readable as, or comparable to, a recognizable date. So today we’ll get into and out of epoch to show its flexibility without our brains having to be contortionists too.
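
As a quick taste in PostgreSQL (the specific value here is arbitrary), the round trip looks something like this:

-- epoch seconds to a readable timestamp
SELECT to_timestamp(1438387200);                             -- 2015-08-01 00:00:00+00 (shown in UTC)

-- and a timestamp back to epoch seconds
SELECT extract(epoch FROM timestamp '2015-08-01 00:00:00');  -- 1438387200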

Seeking Small: Why Big Retailers Envy Mom & Pop

There are a few local stores I visit often – enough so that they know my name, preferences, boyfriend’s name, work/travel schedule, and even my pets. They call me on my cell phone when something new is available that they know I’ll love. They make sure I know in advance about events they’re holding. In return, I’m glad to pick up the phone when I see that they’re calling.

These are the experts in personalization and targeted messaging. They are the kings and queens of the customer relationship. And the big box retailers of the world are tremendously jealous of them.

Netezza Stored Procedures and Optional Arguments

Let’s say you have a process that requires a start date, but may or may not have an end date, as a parameter feeding in. How can you accomplish this in Netezza via a stored procedure? The internet has not had great answers to this question, but it’s not as hard as some other programmers make it out to be. So today we tackle optional arguments – in easy mode.
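
One route to this (not necessarily the exact approach the full post settles on, and the procedure and variable names here are made up) is NZPLSQL’s VARARGS support, which lets the procedure check how many arguments were actually passed:

CREATE OR REPLACE PROCEDURE RUN_REPORT(VARARGS)
RETURNS INTEGER
LANGUAGE NZPLSQL
AS
BEGIN_PROC
DECLARE
    start_date DATE;
    end_date   DATE;
BEGIN
    start_date := $1;
    -- PROC_ARGUMENT_TYPES.count holds the number of arguments the caller supplied
    IF PROC_ARGUMENT_TYPES.count >= 2 THEN
        end_date := $2;
    ELSE
        end_date := CURRENT_DATE;  -- default when no end date was passed
    END IF;
    -- ...do the real work with start_date and end_date here...
    RETURN 0;
END;
END_PROC;

Calling it then works either way: CALL RUN_REPORT('2015-01-01'); or CALL RUN_REPORT('2015-01-01', '2015-06-30');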

Fun Uses for Windowing Functions

I’ve recently come to love (read: be obsessed with) windowing functions in my coding. They’re just so useful and practical.

For those who haven’t experienced the joys of windowing, here’s the deal. They allow you to do calculations across multiple rows without actually having to group, thereby storing aggregate info on each record. That means you keep all the data associated with the row and can add calculated fields that rely on interaction with other rows. Pretty swiffy, huh?

Below are just a few funky functions that I’ve found helpful. I’m not saying that these aren’t resource intensive, but they may just save you from having to join to some crazy aggregation sub-queries and then export to Excel for further manipulation to get the same result.
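
As a generic illustration of the kind of thing windowing makes easy (not necessarily the exact examples from the full post – the orders table and columns here are made up), a running total and a per-customer sequence can ride along on every row with no GROUP BY in sight:

SELECT customer_id,
       order_date,
       order_amount,
       SUM(order_amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total,
       ROW_NUMBER()      OVER (PARTITION BY customer_id ORDER BY order_date) AS order_seq
FROM orders;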