ETL Powered by Rice and Beans

I’m told that it’s very important to keep one’s brain in shape as one grows older. Or “drifts gently towards the twilight” as I prefer to think of it. Trying to look on the optimistic side, obviously. Fortunately I end each day with my brain wrung out like a well-used sponge, courtesy of some ETL development work on a large-ish data warehouse project.

Although the regular ETL development is outsourced, we are responsible for the coding that processes historical data, so on deployment we will run our historical load code and then the “other team’s” regular load will take over. Yes, it is a strange arrangement. And by “we” I really mean my colleagues somewhere in the pestilent jungles of Brazil, surrounded by poisonous frogs and who-knows-what other kind of fearsome thing, and living on a diet of rice and beans in their houses on stilts.

My own role is not so bad. I don’t have to get too involved in the detailed business logic that the code has to implement, and thank goodness for that because frankly I’ve not seen anything so complex for … well, forever. So I sit here in a metaphorically lofty perch, making myself popular by criticising the hard work of others and encouraging them to greater efforts by calling them lazy if they stop work before midnight (the traditional management role). Yes, it’s a good life alright. I constantly remind them that the real aim of the project is to produce a process for loading five years of historical data that is faster than the “B-Team’s” load of a single day.

I think our methodology works quite well though. When you are down in the weeds of the logic (did I mention that it is particularly complex logic?) it is always difficult to see the slightly larger picture of where you’re taking the data next, how you optimise specific ETL stages, and how you can adjust the overall process to make everything more efficient. It’s a little like pair programming, I guess.

And we do get to do the fun bit — 10 million records joining 0-M:0-N through eight OR’d join conditions to 10 million other records, with a couple of billion joins resulting that we then artfully winnow down to a few million. The optimization process is a little mind bending, with all sorts of issues that we wouldn’t normally worry about too much becoming vitally important in an effort to avoid blowing out the host server’s memory with a degree of parallelism of 40. I have to say that there was initially some FUD over the historical process, with estimates of the load duration being bandied about that were in the order of months — we never thought it would take more than five days, and it’s now looking like it’ll run in under one day, which means that we’d be able to start the daily load process directly afterwards without altering it. And I think we still have a few more tricks up our sleeves as well in case we need to buy an extra hour somehow.
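To make the winnowing step concrete: one common pattern is to rank the candidate rows with an analytic function and keep only the best match per event. This is just an illustrative sketch — the table and column names (candidate_matches, event_id, matched_id, match_score) are invented for the example, not the project’s actual schema.

```sql
-- Hypothetical winnowing step: after the big OR'd join has produced
-- billions of candidate rows, rank the candidates within each event
-- and keep only the strongest match.
SELECT event_id, matched_id, match_score
FROM  (SELECT cm.event_id,
              cm.matched_id,
              cm.match_score,
              ROW_NUMBER() OVER (PARTITION BY cm.event_id
                                 ORDER BY     cm.match_score DESC) AS rn
       FROM   candidate_matches cm)
WHERE  rn = 1;
```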

In fact it’s turning into something of a stroll through the SQL Reference book. The list of features we’re touching on includes analytic functions (which have become a second language for everyone now, along with the “PCTFREE 0 NOLOGGING COMPRESS” which I nearly appended to my sandwich order the other day), connect by clauses, and even some rather exotic stuff with finding the lowest set bit in a something-or-other that I don’t understand yet but which Heitor thought of while lying in bed one night and which he used to help reduce an eight hour process to fifteen minutes. They say that they don’t have siestas in Brazil, but maybe they should start.


Oh well, another morning … Brazil is online … back to work!


10 thoughts on “ETL Powered by Rice and Beans”

  1. connect by clauses? Haven’t thought about them. Do you have hierarchical data or do you use them in a more creative way? Can you give an example of how they can be used in ETL?

  2. It’s not really hierarchical data of the “manager-employee” type, it’s a singly-linked list. We have sets of many millions of events, and a subset of them form chains of related events through being tagged with the identifier for the subsequent event. We have to identify the unique chain, if any, to which each event belongs. The chains may be a single event or they may be a couple of dozen.
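    A linked list like that can be unwound with a hierarchical query. A minimal sketch, assuming a table events(event_id, next_event_id) where next_event_id holds the identifier of the subsequent event in the chain (the names are mine, not the real schema): start from the events that no other event points to — the chain heads — and walk forwards, using CONNECT_BY_ROOT to stamp every member with its head’s identifier.

    ```sql
    -- Hypothetical schema: events(event_id, next_event_id), where
    -- next_event_id is NULL for the last (or only) event in a chain.
    SELECT event_id,
           CONNECT_BY_ROOT event_id AS chain_id,          -- the chain head's id
           LEVEL                    AS position_in_chain
    FROM   events
    START WITH event_id NOT IN (SELECT next_event_id      -- chain heads only
                                FROM   events
                                WHERE  next_event_id IS NOT NULL)
    CONNECT BY PRIOR next_event_id = event_id;
    ```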

  3. David hi,

    I have asked it before and will ask again – how does your “set oriented SQL”-powered approach stand up to error conditions? What method are you adopting for this?

    I used SQL some years back for a data migration project (this was when I was left alone by the “competency centre” and did not have to follow their “methodology”). The code was small – every piece of logic was within a screenful – it was fast – it worked (meaning very few bugs). I just loved it (well, I did not really think about error handling – but then, no error showed up! ;)

    And then the methodology guys came along and people do something else that I don’t want to talk about. It takes longer – leads to more bugs and yet they do it… so I just moved out of the ETL part of it all.

  4. Naresh,

    I started to write an answer, but it turned into something rather longer so I’ll start writing up something as a separate posting on ETL error handling.

  5. Funny you should say that, Naresh. Over time, I have been asked to do a lot of little projects, of the sort “duplicate this (real) warehouse configuration for this new warehouse, and we’ll give you a bunch of data in spreadsheets to do the initial inventory load.” So I’d do a few iterations cleaning up the data and ETL’ing it with unix and Oracle tools until I could cleanly load it and users could work with it with no errors showing up. Of course, this is a commercial OLTP system with some DDL outside of Oracle, and many unused fields. So of course, now I’m doing a major upgrade and discovering some fields are now being used, and some DDL (like NOT NULLs) is moving into the Oracle DDL. So when you say “no error showed up!” I say, “for what time frame?”

    It is pretty amazing how far apart reality and users’ perceived requirements can be.

  6. David, thanks – waiting for that post then :)

    Joel – “for what time frame?”

    Well, it was a data migration for a telecom billing system – there were no errors encountered during the ETL process run itself. There were some bugs in the data populated on the target system – but by and large the rating and billing ran correctly. Given the development timeframe, it was a pretty decent job done. And I think it was good because I used SQL – seeing the logic in one page of SQL is valuable – it is very easy to lose sight of the picture if the logic is spread over some ‘mapping tool’. Or maybe I am just plain too comfortable with SQL.

  7. Sorry Duke, your comment got trapped in the spam filter because of the two outgoing links I guess.

    Yes it’s redundant, but it’s a psychological thing – I somehow can’t stop doing it. Like a nervous twitch.

  8. Pingback: Resolving chains of events with CONNECT_BY « Todor Botev's Blog
