Fun new conundrum in dealing with data formats/displays in PostgreSQL! A client recently requested that we provide a data extract that included percentages as a three character number, left zero-padded, rather than a decimal/numeric. There’s not a great way in Postgres to show a numeric with leading zeros (actually, I’ve yet to find a data type that does this consistently as a built-in to any platform). Instead, you have to do a little bit of work to get to your end result as a character (or text) field.
Recently, in speaking with the President of Kobie regarding my team, I used the term “data munging” to describe a lot of the work that we do. He laughed, thinking I had said “data munching” (mmmm, tasty!) and asked if that was a technical term. The short answer is that yes, it is another term for data wrangling (which, incidentally, is one of my favorite terms in the industry).
In one of the PostgreSQL systems we work with often, there is only one street address field. To get around the need for a change to the table structure, our IT folks simply use a new line character (\n) to denote that there is a second line of address information. The problem becomes, when we go to generate a mailing list, how can we export a list to send to the mailhouse with all address information on one line?
The answer is to use the SPLIT_PART function. See the example below:
Hit a rather irksome problem recently in developing a WordPress website on the StudioPress Genesis framework. I could not find anywhere to change the default sort order (Newest to Oldest by time of publication). Nothing in the functions. Nothing I could find in the lib settings. Nothing in the theme options (which I believe should be changed).
Finally, after hearing back from the developers and having them point me to a nice, clean function for supposedly resolving this, I gave it a try. It had the unfortunate side effect of knocking out my static homepage because it went into the loop, thought the homepage had a loop, and decided to simply sort all the posts in the database into that page. At least it showed them in the specified order.
Thankfully, I came up with a quick way to resolve this pesky problem. All you have to do is wrap the function with a check to determine if it is a page. Here’s the code:
If you’ve worked with datetime formats at all in PostgreSQL (or any other SQL version) before, you’ve probably dealt with date_part (or extract). In pgSQL, this will pull out the numeric value of the part of the date that you have specified. For instance:
/* Pull the Month from the timestamp without time zone */ date_part('month','2012-03-13 12:45:22') /*Yields*/ 3
Another common one if you’re trying to eliminate the time portion is to truncate the date using date_trunc.
/* Truncate date so all timestamps map to midnight */ date_trunc('day','2012-03-13 12:45:22') /* Yields */ '2012-03-13 00:00:00'
Earlier today, I went looking for a similar construct to extract and isolate only the time portion and ran across a rather cool little feature. Seems there is a built-in function housed in the pg_catalog schema that allows for very quick parsing. It’s called, unsurprisingly, time.
Extracting Time Part from Timestamp Example
select pg_catalog.time('2012-03-13 12:45:22'); /* Yields */ '12:45:22'
It may not look like much code (which, to be fair, I greatly appreciated), but this works like a charm. You can then use the time syntax for any grouping or editing you want to do. For instance, this was being used to chunk up the day into meal periods for a client analysis so we could state that transaction_time between ’12:00:00′ and ’15:00:00′ is lunch with a simple case statement.
After finding this awesome little tidbit, I did some more research on this mysterious pg_catalog schema and it turns out that there are TONS of functions hidden away in there (including one simply called ‘date’ that may suit better than date_trunc if you want only the date portion). Found a pretty comprehensive schema of pg_catalog here and will be testing out the functions as needed.
Do you have a better way of handling different components of dates and timestamps? I’d love to hear about it. Please leave a comment below.
What is your name?
What is your quest?
What is your favorite color?
Arguably the most difficult skill for any analyst to learn is how to effectively communicate the numbers to the business. This act of translation is the most important moment in any analysis because it is when we, as number crunchers, get the buy-in (I know, corporate-speak) of people who actually do stuff with what we’re telling them (a.k.a. decision makers) … or not.
Tip for Translation: Remember the Dependent Variable
This one had been bugging me for a while now. There are a lot of analyses where it is useful to select multiple random groups. Usually, this would involve picking a bunch of numbers out of your head and trying them as the seed values (I like using phone numbers without the area codes – then I can call the person and tell them they rocked my randomization).
But today, as I struggled with pulling a multitude of sample sets, I decided to come up with a more elegant solution for generating random number seed values. Behold the random loop:
One of the most basic calculations on dates is to tell the time elapsed between two dates. Often it is more helpful to show the date as a number of months rather than a number of days. In PostgreSQL, subtracting one date from another will yield a number of days that then takes a tremendously complex formula to manipulate into months.
The best way I have found to get around this is to use the built in AGE function. The age function calculates the difference between two dates and returns an interval. This may not seem like we’ve gotten very far in calculating the number of months between the two dates, but stick with me on this one. You can then EXTRACT the pieces of the interval needed to calculate months and use them in a simple equation.
EXTRACT(year FROM age(end_date,start_date))*12 + EXTRACT(month FROM age(end_date,start_date))