This picture is definitely not me. But as this is my first post regarding Hive, I felt the need to include a photo of a ridiculous beehive hairdo.
As is the case with many other Data Scientists, I am being pulled increasingly into the world of Hadoop and all the technologies associated with it. Lately this has meant trying to sort out how to do certain functions in HiveQL that I’ve grown familiar and comfortable with in various types of SQL.
Today’s conundrum was trying to determine if someone is at least 18 years old based on their birthday. Normally, I would use one of the following:
date_of_birth <= current_date - interval('18 years') -- check for True condition OR extract(years from age(current_date,date_of_birth)) --check for >= 18
There are a few functions that can be used here: DATE_ADD, DATE_SUB, MONTHS_BETWEEN, or ADD_MONTHS.
The DATE_ADD and DATE_SUB are roughly synonymous except that one adds days and the other subtracts them. I suppose that you could add a negative number of days though, if you wanted to just learn one of the two. That might look something like this:
date_of_birth <= date_add(current_date,(-18*365))
The number of days being added is calculated using 365 as a nice round number for dealing with years. However, it does not take into account that there could a few leap years in the mix. The DATE_ADD function does not allow decimals in the number of days to add so it cannot be used as -18*365.25 or something. This is not my preferred method. I like more precision.
Next up is the ADD_MONTHS function, which is sort of like using the interval except that you have to calculate the interval based on months rather than years.
date_of_birth <= add_months(current_date,(-18*12))
I prefer this method because it accounts for the leap year stuff by ignoring the actual number of days and just changing the months.
Similarly, the MONTHS_BETWEEN could be used along with division to get to the number of years.
months_between(current_date,date_of_birth)/12.0 --check for >= 18