This picture is definitely not me. But as this is my first post regarding Hive, I felt the need to include a photo of a ridiculous beehive hairdo.
As is the case with many other Data Scientists, I am being pulled increasingly into the world of Hadoop and all the technologies associated with it. Lately this has meant trying to sort out how to do certain functions in HiveQL that I’ve grown familiar and comfortable with in various types of SQL.
Today’s conundrum was trying to determine if someone is at least 18 years old based on their birthday. Normally, I would use one of the following:
date_of_birth <= current_date - interval('18 years') -- check for True condition OR extract(years from age(current_date,date_of_birth)) --check for >= 18
There are a few functions that can be used here: DATE_ADD, DATE_SUB, MONTHS_BETWEEN, or ADD_MONTHS.
The DATE_ADD and DATE_SUB are roughly synonymous except that one adds days and the other subtracts them. I suppose that you could add a negative number of days though, if you wanted to just learn one of the two. That might look something like this:
date_of_birth <= date_add(current_date,(-18*365))
The number of days being added is calculated using 365 as a nice round number for dealing with years. However, it does not take into account that there could a few leap years in the mix. The DATE_ADD function does not allow decimals in the number of days to add so it cannot be used as -18*365.25 or something. This is not my preferred method. I like more precision.
Next up is the ADD_MONTHS function, which is sort of like using the interval except that you have to calculate the interval based on months rather than years.
date_of_birth <= add_months(current_date,(-18*12))
I prefer this method because it accounts for the leap year stuff by ignoring the actual number of days and just changing the months.
Similarly, the MONTHS_BETWEEN could be used along with division to get to the number of years.
months_between(current_date,date_of_birth)/12.0 --check for >= 18
This one has been irking me for quite a bit. If you have to insert multiple rows into a table from a list or something, you may be tempted to use the standard PostgreSQL method of…
Netezza is a super-fast platform for databases… once you have data on it. Somehow, getting the data to the server always seems like a bit of a hassle (admittedly, not as big a hassle as old school punchcards). If you’re using Netezza, you’re probably part of a large organization that may also have some hefty ETL tools that can do the transfer. But if you’re not personally part of the team that does ETL, yet still need to put data onto Netezza, you’ve got to find another way. The EXTERNAL TABLE functionality may just be the solution for you.

Let’s say you have a process that requires a start date, but may or may not have an end date, as a parameter feeding in. How can you accomplish this in Netezza via a stored procedure? The internet has not had great answers to this question but it’s not as hard as someĀ other programmers make it out to be. So today we tackle optional arguments – in easy mode.
I’ve recently come to love (read:be obsessed with) windowing functions in my coding. They’re just so useful and practical.
Imagine, if you will, begin able to calculate the time between visits to a website, transactions in a store, logs from a punch-clock, etc. in just one step. Well, I have found the way!
Slicing data into manageable chunks for viewing is crucial when you start dealing with more records than will fit in something like Excel (without PowerPivot, of course). One of the most common ways to look at data in a more easily-digestible manner is to use percentiles, or some derivative thereof, to group records based on a ranking. This allows you to then compare equal-sized groups to one another in order to form conclusions as to relative behavior.