It has been a while since we had a nice list, so here is one on a subject I rarely touch — Extraction, Transformation and Load technology.
Full disclosure: I have seven years of Informatica ETL development experience, and aside from a brief flirtation with OWB some years ago (in which I was unimpressed) I have hands-on experience of no other tools. I have heard many of these sentiments echoed elsewhere, though.
So here are some reasons why third party ETL tools are inferior to in-database transformation, IMHO.
- They require little database expertise to use: This sounds at first blush like an advantage, but knowledge of the graphical interface used for generating transformations can mask a shallow appreciation of the performance enhancing internal mechanisms of the database, such as bulk operations, parallelism, partitioning etc.. When a performance problem arises you are into a multiperson analysis which is less efficient, more time consuming and potentially divisive. Another issue, particularly for smaller teams, is that dividing the expertise between database people and ETL people can require higher staffing levels — instead of a DBA and two database developers (who can at a pinch substitute for the DBA on vacations and sick days) you need a DBA, a developer and two ETL developers. You can still find yourself in a bind with one member of staff on leave and no margin of safety in case of illness or other duties.
- They are generally not good at bulk operations as the database: Bulk, nologging operations are the number one performance enhancer for data warehouse loads, and your external ETL tool may not allow them, or may make it tricky to tell whether they are being used through requiring multiple configuration items to be set.
- You will have to use some in-database transformations anyway: Related to issue number one. Your ETL developer may be unqualified to do this work, yet it is almost always going to be required. Gathering statistics, refreshing materialized views or other precalculated result sets, disabling and rebuilding indexes …
- They don’t know as much about the data as the database does: We expend a lot of effort in making sure that the database knows as much about the data as possible. Row counts, average row lengths, cardinalities, foreign keys, unique keys, dimensions etc. all allow the database to make the most efficient choices in optimizing operations. Your ETL tool rarely will take advantage of these features. Indeed when it starts to read data it often has no idea whether it will read 10 rows or 10 million … how can it make a choice on an efficient join mechnism under those circumstances?
- Database functionality has expanded to encompass many ETL tasks: Not by coincidence, Oracle has many features that now make it a very efficient load-and-transform tool. Full outer joins, external tables, analytic functions, regular expressions … every one of these helps to kick the feet out from under the external tool market.
- Difficult transaction control: External tools generally enforce coding in units across which transactions cannot span. You are often forced into smaller transactions than is ideal, leading to more complex recovery mechanisms in the event of a processing failure.
- External tools do not have a performance analysis interface as well developed as Oracle’s: Good luck looking for a wait interface. A performance problem can be in the tool, in the database, in the network …. Interestingly, despite being allegedly easier to use the creation of a performance test case on an external ETL tool can be a very tricky and time-consuming proposition. If you recall back in the days when everyone was using BCHRs and rules of thumb for Oracle performance tuning, that is where ETL tools are today. Scary, huh?
- Hardware issues: Buy separate hardware and pay the price, or host the tool on the database hardware and be unable to load share effectively between them? Poor database performance? Sorry, the ETL tool is using all your resources right now. High hardware costs? Sorry, the ETL tool needs it to avoid contention with the database, but sits idle twelve hours a day.
- Purchase Costs: Have you seen the price of these things?
- Smaller user base: There are good resources out there in terms of forums, but precious few in comparison to Oracle.
- Lack of Source Control: (Late addition).When your ETL tool stores it’s configuration in a bunch of tables in a database repository, you are pretty much stuck with whatever source control they give you, including “none” or what I think of as “marketingware” (features that marketing need the software to include to be able to sell the product but which are little more than a unreliable stub feature). With text-based code you are free to use best practices and any tool you like — including, of course, “none”.
So, to summarise … higher staff costs, lower staff efficiency, poorer performance, less optimal processing, higher hardware costs, higher software costs … other contributions gratefully accepted, in particular experiences of 100% in-database processing.