A List: Ten Reasons Not To Use An External ETL Tool

It has been a while since we had a nice list, so here is one on a subject I rarely touch — Extraction, Transformation and Load technology.

Full disclosure: I have seven years of Informatica ETL development experience, and aside from a brief flirtation with OWB some years ago (in which I was unimpressed) I have hands-on experience of no other tools. I have heard many of these sentiments echoed elsewhere, though.

So here are some reasons why third party ETL tools are inferior to in-database transformation, IMHO.

  1. They require little database expertise to use: This sounds at first blush like an advantage, but knowledge of the graphical interface used for generating transformations can mask a shallow appreciation of the performance enhancing internal mechanisms of the database, such as bulk operations, parallelism, partitioning etc.. When a performance problem arises you are into a multiperson analysis which is less efficient, more time consuming and potentially divisive. Another issue, particularly for smaller teams, is that dividing the expertise between database people and ETL people can require higher staffing levels — instead of a DBA and two database developers (who can at a pinch substitute for the DBA on vacations and sick days) you need a DBA, a developer and two ETL developers. You can still find yourself in a bind with one member of staff on leave and no margin of safety in case of illness or other duties.
  2. They are generally not good at bulk operations as the database: Bulk, nologging operations are the number one performance enhancer for data warehouse loads, and your external ETL tool may not allow them, or may make it tricky to tell whether they are being used through requiring multiple configuration items to be set.
  3. You will have to use some in-database transformations anyway: Related to issue number one. Your ETL developer may be unqualified to do this work, yet it is almost always going to be required. Gathering statistics, refreshing materialized views or other precalculated result sets, disabling and rebuilding indexes …
  4. They don’t know as much about the data as the database does: We expend a lot of effort in making sure that the database knows as much about the data as possible. Row counts, average row lengths, cardinalities, foreign keys, unique keys, dimensions etc. all allow the database to make the most efficient choices in optimizing operations. Your ETL tool rarely will take advantage of these features. Indeed when it starts to read data it often has no idea whether it will read 10 rows or 10 million … how can it make a choice on an efficient join mechnism under those circumstances?
  5. Database functionality has expanded to encompass many ETL tasks: Not by coincidence, Oracle has many features that now make it a very efficient load-and-transform tool. Full outer joins, external tables, analytic functions, regular expressions … every one of these helps to kick the feet out from under the external tool market.
  6. Difficult transaction control: External tools generally enforce coding in units across which transactions cannot span. You are often forced into smaller transactions than is ideal, leading to more complex recovery mechanisms in the event of a processing failure.
  7. External tools do not have a performance analysis interface as well developed as Oracle’s: Good luck looking for a wait interface. A performance problem can be in the tool, in the database, in the network …. Interestingly, despite being allegedly easier to use the creation of a performance test case on an external ETL tool can be a very tricky and time-consuming proposition. If you recall back in the days when everyone was using BCHRs and rules of thumb for Oracle performance tuning, that is where ETL tools are today. Scary, huh?
  8. Hardware issues: Buy separate hardware and pay the price, or host the tool on the database hardware and be unable to load share effectively between them? Poor database performance? Sorry, the ETL tool is using all your resources right now. High hardware costs? Sorry, the ETL tool needs it to avoid contention with the database, but sits idle twelve hours a day.
  9. Purchase Costs: Have you seen the price of these things?
  10. Smaller user base: There are good resources out there in terms of forums, but precious few in comparison to Oracle.
  11. Lack of Source Control: (Late addition).When your ETL tool stores it’s configuration in a bunch of tables in a database repository, you are pretty much stuck with whatever source control they give you, including “none” or what I think of as “marketingware” (features that marketing need the software to include to be able to sell the product but which are little more than a unreliable stub feature). With text-based code you are free to use best practices and any tool you like — including, of course, “none”.

So, to summarise … higher staff costs, lower staff efficiency, poorer performance, less optimal processing, higher hardware costs, higher software costs … other contributions gratefully accepted, in particular experiences of 100% in-database processing.

About these ads

38 thoughts on “A List: Ten Reasons Not To Use An External ETL Tool

  1. Some situations demand a third scenario: no ETL tool and no in-database processing. I see this when I run into business rules that are just insane when (or if they can be) translated into relational terms, yet simple when bringing the data out of the db, hacked with unix tools and shoved back in.

    This usually results from design decisions that just weren’t normalized to begin with (let’s have a general reporting program on a summary table with 23 indexes), followed by implementation decisions more appropriate for ’60’s COBOL (the first three characters of the part code mean…), followed by business rule changes (we bought our competitor, redistributed all our codes, now fix all the data. Oh, some people need to see new data as though it were old).

  2. The above said, some tools are better than others – and it is not always the big names that are best. Some of the BI reporting vendors’ ETL tools are particularly clunky with extensive use of the lowest common denominator, the row-by-row insert.

    In general, I prefer ETL tools that push as much of the workload on to the target system – strangely, they are mainly dabase vendor products.

  3. Good thought, Joel. In that case the issue becomes “how good is my ETL platform at integrating with other solutions” I suppose.

    Peter, quite so, yet one sometimes finds it difficult to elicit information from vendors on row-by-row etc. abilities. Whether a tool supports direct path inserts is a tricky business to track down sometimes, and I’ve had to resort to practical tests to demonstrate that they do or do not, even in the light of contrary opinion from tool experts.

    By the way I thought of a late addition: Control of source code. I know that up to v6 there was really nothing that could be effectively done with Informatica in this respect, at least not reliably. We had an archive repository for saving previous versions of mappings but it was hardly satisfactory. I recall a very long meeting with configuration experts from my client in which I described how the configuration for a single mapping was spread across multiple tables in a proprietary format, and was essentially untrackable. A very bad business, it is.

  4. David,

    The one issue I can imagine with using SQL is that it tends to be an all or nothing proposition – either the SQL works, or it does not – in which case the output is zero.

    I agree with your thoughts above – I prefer (enjoy!) doing the SQL thing and with features like analytics and multi-insert, what you can achieve is phenomenal if you can be a little creative. However, the “all or nothing” always bothers me.

    What are your thoughts on this?

    Thanks,
    Naresh

  5. Naresh, I’m not sure about the “all or nothing” point, you might like to expand on that.

    I’ve used analytics to implement some extremely complex record selection in ETL processes and it has been very performant, very simple to code, and of course easy to stuff inot source control.

    One tricky issue with pure SQL solutions is that of multiple targets. Multitable insert is a solution as far as inserts go, but when you want to merge into table A and insert into table’s B C and D then you have to get pretty creative — for example you can use views as targets and “instead of” triggers to convert an insert into a logical merge. Or you can insert to GTT’s and then merge from them. The code complexity is increased but performance is maintained.

  6. What I meant by all or nothing: In a row by row ETL process, you could log any error somewhere and do something about it later. In SQL, the same would not be possible? It would just fail the entire operation.

    Well, come to think of it, the 10G error logging feature would provide some possibilities – but I have never thought it through very much (we use 9i at present so not much practice).

  7. OK, I understand. Oracle has a couple of ways of dealing with that though. The best is probably the 10g “LOG ERRORS” syntax for DML statements, which allows individual rows to be flagged in an error logging table, along with the error itself and a statement identifying the statement that generated the error.

    My own feeling is that errors are always a problem, even in ETL tools — certainly Informatica is reduced to a crawl by them, and identifying the individual rows affected is generally no joke. A robust ETL solution should include such data validation as is necessary to ensure that problems are weeded out gracefully — IMHO Oracle’s DML error logging clause is just such a mechanism.

  8. David,

    Your response is encouraging – I would rather go the way of the “bulk DML in SQL” approach myself.

    You mentioned about using Analytics to implement some extremely complex record selection in ETL processes. Maybe you could blog about this sometime – I know it would be time-consuming to do it without exposing ocnfidential customer data models – however, any ideas in this realm would be very interesting indeed.

    Thank you for your responses and all that you write on your blog.

  9. Yes, I might do that, the analytics thing — there was a particular circumstance when I resorted to performing ETL in a view because I had absolutely no idea how to perform that task in Informatica in less than about a hundred transformations, and it turned into a single SQL statement.

  10. Analytics – to be honest I’ve used them more in ETL than in user queries – they can allow some very powerful transforms (and extracts!) to happen at very low cost.

  11. Yikes. Can’t say I agree with you there. I heavily use SSIS with our MSSQL-based datawarehouse. It’s FAST, easy to use, and extremely maintainable. I have no problem whipping out code, but I’d sure only want to if I need to!

  12. Arjun,

    I’m not familiar with SSIS (or its predecessor, DTS) so I can’t really make any intelligent comment on it. I will note however that there are many many people and organisations who are very happy with their choice of Informatica as an ETL tool as well, despite what I personally see as faults.

    I’m sure that the balance of pros and cons also changes according to the RDBMS platform in question. With Oracle we have PL/SQL and Oracle’s own SQL dialect, which are very robust, easy to work with (IMHO), and very powerful for ETL operations (more so with every subsequent version, with external tables, LOG ERRORS, analytics etc). It could be that the balance lies more towards SSIS than T-SQL in a Microsoft environment.

    I suppose that you’d have to look at theissues that I raised based on my own experience and treat them as caveats to consider when choosing a tool. For example, how is source/revision control in SSIS? Does it integrate with Subversion or other common systems, or have an internal mechanism?

  13. A key difference with MS SSIS is that it is really a “single target” system, it is designed to push data into SQLServer 2005 ,in much the same way as Oracle’s OWB pushes data into Oracle databases. Whether SSIS does that through T-SQL or some other form of native code is irrelevant.

    And as a aside, I would have thought the set of coders that could white partition exchange load routines in T-SQL is somewhat samller than the set who could drag and drop one on to the screen

  14. I would tend to agree with David on this one. My beef with ETL tools in general is that they keep a lid on what really happens behind the scenes. At least with pure SQL and PL/SQL design of an ETL solution, one can fine tune the operations and maintain better control over the entire ETL process. We recently are implementing an ETL solution with Informatica and it has been a growing pain process. I’d rather design it with Oracle PL/SQL and SQL code and have a better method but management has already bought into Informatica so now we all have to attend training and learn the bloody tool. Still- for me, it is good for the resume and CV as Informatica consultants are highly paid and in demand. David- is there a way within Informatica (since I am new to the tool) to modify the error tracking mechanism to dump notifications to Oracle PL/SQL and external tables?

  15. I agree with you. I think what you said is related to the E-LT concept. Basically, instead of doing transformation then load the data to database, the tool load the data to the target and do the transformation within database. The ETL features available from the DBMS can be fully leveraged.

    Please see also : How is Sunopsis different from other tools?

  16. Can I ask some questions about ETL tools? Informatica, Cognos, MicroStrategy: for their ETL products do any of these have a specialized data crunching programming language? Or do they rely upon a GUI(graphical user interface)?

    During a food fight with the SAS Users group (comp.soft-sys.sas on Usenet/google-groups), one fellow thought they did not have proprietary languages, but that they do interface with the Visual Basic tool.
    He also said that the SAS datastep was his ETL tool of choice( this is entirely line-by-line code, not GUI).

    What about IBM/Ascential ETL tools?

    I created a new statistical programming language, a data crunching programming language called Vilno. It’s strongest feature is the data-processing-function, a paragraph of code that is a fine replacement for the SAS datastep (and more than that as well). The folks blogging at the SAS Users group were quite closed-minded about it , slamming it without actually looking at it. After all, how can one unknown guy create something to compare with the creation of a multi-billion dollar company? You ought to look at something first, then criticize.

    Anyways, tell me about the ETL landscape: mainly GUI-based , or in-house specialized languages, or hook into all-purpose languages already out there (VB, Python, C# etc). Oh, and are these ETL tools tied to a specific system (such as Windows Server)?

    Thanks in advance for the info.

    Robert callingrw@yahoo.com http://www.xanga.com/datahelper

  17. Robert,

    Almost exclusively GUI, as far as I know. Generally with the ability to hook into external C modules and whatnot. I don’t know about SAS at all.

    You’d have to review the individual products to see what o/s they support. Informatica is windows, linux, HPUX as well possibly.

  18. David,

    I have five years of experience with Informatica and 12 years of experience with Oracle’s PL/SQL. In addition to your list of 11 reasons to not use an external ETL tool, I would suggest two others.

    (1) Due to their row-by-row processing orientation, external ETL tools lack the ability to effectively utilize SQL set processing. Set processing is usually much faster and much more efficient than row-by-row processing. Employing set processing in a tool like Informatica can get clunky, but set processing comes easily and naturally with PL/SQL.

    (2) Working with monster dimensions (ie., very wide tables containing millions of rows and experiencing a high degree of changes, eg. a CUSTOMER dimension for a telecom company) can be an exercise in grating exasperation when using a GUI like Informatica’s Designer tool. (It becomes even worse if the source and target table definitions are still undergoing changes.) Using a PL/SQL-friendly editor like TOAD, I find that it’s much easier to manage the code for monster dimensions and other ETL requirements of a highly-complexed nature.

    Ernest

  19. David,

    Being an experienced Oracle developer I’m a greate fan of the in-database processing and using SQL to its limit. I know one can do miracles with analytics an nologging. So I pretty much share you oppinion. I work since about a year in an Informatica project and that’s what I see – bad performance, poorly educated developers (databasewise). The business users have only 2 weeks access to the new month’s data – because the whole ETL monthly thing takes the other two weeks.

    But now I want to play the generous and write down some reasons why you might like Informatica. In the end, the whole Infornatica hipe should have some rational reasons.

    1. Automatic storing of metadata and data lineage
    This is what Informatica is really good at. Everything you draw on the screen is an information about your data processing. Just by mouseclick you can see that red line showing how the column flows trough the mapping. Nice and powerfull.

    2. Business users can understand it
    A well designed mapping can be a very good basis for discussion with the business users. Try to do it with a complex Select with analytics inside.

    3. You can easily integrate different platforms
    Now, this should be an advantage although I haven’t seen it so far in a real project. If you want to integrate data from Oracle, DB2, SAP and some VSAM files, Informatica might really help you. But of course there is that other approach without a special ETL tool: you export the data from your source systems and process them in your target system as flat/XML files.

    And finally, I’m not sure whether the Source Control is really an issue. You can export/import mapping as XML and base your configuration management on XML files.

    Todor

  20. I really do not agree with whatever criticism is being done here with Informatica or any ETL tool for instance. Yes, I agree that Row by Row basis may not be efficient and pure PL SQL would do wonders, especially when Oracle is pushing it way too far with Oracle 10g.

    But I have one Question. Is Informatica (or any other ETL tool) really a Database specific tool? I want to emphasize on the fact that these are Data Warehousing tools. How on earth can one integrate heterogeneous sources and put into yet another target? And If there are transformations to be done to source data before loading? I really do not have any idea how this can be achieved with PL SQL using set based processing.

    TODOR, I totally requoted your stuff here !

  21. Seemit,

    One approach to integrate heterogeneous sources is to export all the source data in flat/XML files. Then all the transformations take place in the target database. In fact this is the approach taken in the Data Warehouse I’m involved currently – and it works. Yes – we do transform the falt files with Informatica – but this is IMHO more organisational than technical decision. There is nothing preventing the transformations being done using the native DB functionality.

    Todor

  22. David,

    Your list is fairly extensive, however, none of the items have solid arguments.

    Nothing internal comes close to Informatica’s ability to interoperate with external sources and targets.

    1. Expertise.. you need expertise to implement any system. You can create bad implementations with internal and external ETL tools. Understanding the limitations of the tools, and designing to their strength is essential.

    2. Bulk operations, Nologging…
    like Create Table As.. Create Index nologging, .. can be done via SQL scripts in any ETL tool. If you want to minimize logging of inserts, drop indexes.

    3. “You will have to use some in-database transformations anyway”
    Since when..? What tool are you using..

    4. “They don’t know as much about the data as the database does”.
    Informatica gives you the choice to do a join in the database or with one of their external components.. (Also see 1). If you design a system assuming all the tables will be small… that is your fault… not the tools fault.

    5. “Database functionality has expanded to encompass many ETL tasks”.
    This is oracle specific.. what if you were using something other than oracle?

    6. Difficult transaction control:
    Informatica gives you both precise and generally mechanisms for transaction control. You can control is manually via expression logic.. or set a commit row count.

    7. External tools do not have a performance analysis interface as well developed as Oracle’s
    Again, this applies to oracle alone.. But informatica does have good profiling tools.. you should check them out.

    8. Hardware issues
    Hardware is cheap compared to a developmers time. Hardware is cheap if the tool allows you to avoid bad a implementaiton. The tools, Designer, Workflow Manager, Workflow Monitor, available to informatica users allow fast development, perform very well, and are quite flexible. I have not yet run into any design that could not be accomplished faster with Informatica vs PL/SQL.

    9. Purchase Costs .. See 8.
    10. Smaller user base
    Ahh compared to what.. Lots of people use 10g, not all of them use OWB. Informatica forums and their Online support database is great.

    10. Lack of Source Control:
    Informatica has it.. Its nice oracle is catching up.

    I will agree, if you are transforming data within Oracle, sure, it is probably just fine to stick with OWB. If want to talk to external systems, message brokers, etc… OWB doesn’t come close.

  23. I am total agree with most of comment about informatica limitation.
    (I have 7 year oracle dba,developer backgroud and 1 year informatica)
    but I do see informatica advantage in follow area:

    1) If you have a lots dimension tables(like 20 or more) need to populate in differe FACT tables,
    in pure sql, you have wroten very long sqls to populate each FACT tables. but
    in informatica, you can use one mapplet to do all reference lookup, which is pretty
    handy and easy to mantain.

    2) If you need get data from oracle,teradata,flat files,etc, informatica will be good
    choice, esp the table is not too big.

    3) Central meta data to make you easy to find out what was go wrong and informatica “click through” feather you can trace down the issue very easily. In pure sql,it will be more effort
    to do that.

    4)In pure sql enviornment,codes can be wroten in ksh,perl,sql,pl/sql,trigger,even Proc*C, it all depend the developer favor, it will be hard to find out each piece if you don’t have good documentation. but in informatica, you can easily use check “dependence” to find all the related mapping,lookup,etc,which it help to maintain code base.

  24. I think if the ETL tools can allow user to use set based operation and yet support heterogeneous data sources by pushing as much as possible to database, it will be faster and and also gives a lot of control to the enterprise user. By doing everything in pure SQL/PL SQL, it may be bit difficult to integrate the solution to the rest of business processes. ETL tools are no longer limited to building data warehouse.

    Check out what Sun is doing with their ETL offering () as part of their open ESB solution.

    Even if I agree that hand coding SQL may be faster for some cases, but it does not make Enterprise user happy all the time. An unified IDE and Integration platform is very important for Enterprise users. It is nice to see database vendors like Oracle and IBM are adding more and more ETL support into their SQL engine, but they alone can’t solve the problem, The E-LT tools like Sun ETL Integrator will be in a better position to take advantage of the database engine and yet provide a robust solution to build composite applications.

  25. Sorry .. we do ETL’s for very large corporate organisations. When the cruch comes, speed matters. We have developed a suite of ETL Routines over time, and on every project the are carefully reviewed and hand crafted. In the end they are robust, exact, optimised.

    However we do use a separate Tool DaMiT (Data Migration Tool) to maintain mappings and produce the specifications. This however has nothing to do with the actual code, which we prefer to code.

    The Oracle Package Libraries are well documented and well maintained.

    In the end, what matters is SPEED, SPEED, SPEED !!!

    :)

    Pieter

    .. and of course robustness and accuracy with the neccessary management controls ..!

  26. I agree with both camps–ETL and PL/SQL. There is one option that fits all situations. But like many things in life, there are situations where one of the options (ETL or PL/SQL) proves to be the best option. In light of that does it make sense for us to make a blanket statement that one of them will always be the best choice?

    In addition to the things mentioned in all previous emails, I would like us to consider these following things too before we make our choice. As somebody said before, it is not just a technical choice.

    (1) Documentation–code, metadata and system
    (2) Maintainability
    (3) Simplicity of the solution
    (4) Scalability
    (5) Flexibility for change
    (6) Learning curve (new member joins team)
    (7) Ease of finding the reqd skillset
    (8) Time and effort required to code
    (9) Enforceable standards
    (10) Expertise required

    That is my 2 cents.

  27. Having worked on many warehouse projects with big consulting companies heres
    what I can draw… Kimball methodology has always been at the heart of all these
    projects and I admit I am old school. So I believe ETL tools are good for standardizing
    data ( extract stages )… this is usually done in bulk ( SET logic )… my limited
    experience with Informatica has been its good for this but as Oracle states in its
    doc that target tables ( start schema ) should have single row processing and ( from
    my experience ) should have meta data tracking to allow the job to pickup when the
    corrupted row is fixed and continue from that point. I will say that I saw an Informatica
    job fail that was set orientated and they had to go through a large amount of data to
    fix the problem… Oracle does have a new feature that allows you to do a bulk set logic
    insert and have the error rows go to error processing tables instead of having the entire
    job fail… WOW is this ever key to Data warehousing…

  28. …as Oracle states in its doc that target tables ( start schema ) should have single row processing …

    Well if it is in there I wouldn’t pay too much attention — the 10g docs still say that tables and indexes should be in separate tablespace for performance reasons.

  29. I am an IT manager, who has worked in Oracle for 17 years. My experience is that trying to mainatin a complex ETL process in SQL and (more horribly) PL/SQL is just too problematic for very large warehouses. Our current ETL process, wich populates an ODS and several datamarts is written in PL/SQL. It has been optimized and optimized again, but even though only about 120 GB of source data are processed each day, it still takes several hours to complete. Worse, the code must be almost continuously retuned, as execution plans change on a near-weekly basis. If underlying structures change, large portions of the process must be recoded. Even worse than this, the solution is not very robust and we have frequent failures. I have never seen such problems with these with a product such as Informatica. PL/SQL is suitable only for simple internal transformations. It is also very expensive because of the context switches needed each time it makes a call to SQL*Plus. For more complex data transformations, external native languages are always faster and more robust and one never has to worry about latch waits, locks, etc.

  30. I’m suprised by nearly all of your comments, I’m afraid, and some in particular I do not understand (for example, PL/SQL making calls to SQL*Plus?).

    I thnk it’s worth pointing out that using PL/SQL does not have to mean using row-by-row processing. I believe that the role of PL/SQL should almost entirely confined to that of a wrapper for SQL statements that implement the work, and that with competent use of DW-oriented SQL features like multitable insert, compression, dynamic sampling, parallelism and direct path nologging inserts a SQL-based solution is almost always going to be faster than Informatica, and will be more robust also.

    In some cases there are problems that are nearly impossible to solve in a performant fasshion with either Informatica or SQL, for example this one: http://oraclesponge.wordpress.com/2008/04/02/250000-tests-for-uniqueness-per-second-aint-no-biggie/

  31. David

    I have worked in the DW field since Kimball’s book came out as a DBA, ETL developer
    and architect and agree with all your statements. On all the projects I have seen
    be it host processing ( Perl, syncsort,Pro*C…) or SQL based there are architectual
    aspects that are familiar to all…

    1. Separate metadata area that tracks stats and load to the row

    2. Rich error processing that puts the error rows in a “suspension bucket” so
    that they can be retrieved without going back to the load file. I see a lot
    of PL/SQL that just records the row number and error. This is poor processing
    especially for large extract files or projects that have sources from mainframes
    that typically can have dirty binary data. Nobody wants to be the poor soul that
    has to go through millions of row to find the one that failed. I think the error
    rows should be written to a file because inserting to an error table will only work
    if you insert the entire row into a field that can handle it. Binary data can blow
    a field or record out…. that is why its best to write the entire error record to
    a flat file.

    3. The ability to fix the error record and pick up the load from that point utilizing
    the metadata to process.

    this is why error logging DML has so much potential, your PL/SQL if using bulkload
    should offer the same functionality…

  32. Oh I have also been involved extensively with ETL development tools
    like Informatica and have seen when these jobs fail its a nightmare to find
    the rows that error out, of course it been a few years since I have worked
    with it and I have heard that there are many improvements. Still any high
    performance or complex loading ( like matching ie I recommend PostalSoft )
    is still done in Pro*C or PL/SQL… I think these ETL tools are good for
    standardizing data and low impact loading, have not seen it perform well
    in high performance arena…

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s