Pentaho KETTLE and SQL Server

Background

I have recently started on a Proof-of-Concept project where I have been asked to investigate replacing their ageing application with something more modern. Their current application is built using classic ASP 1.0 (Active Server Pages), COM (Common Object Model) components, MTS (Microsoft Transaction Services), and Microsoft SQL Server 2012 (upgraded from various older versions going back to 2000).

I can't talk about the domain directly at the moment, but in abstract terms, it is a large and comprehensive records management system. Whilst the application itself is crucial to the operation of the business, the data itself is much more interesting. There are some grand plans afoot to replace the current data model with something much more flexible, thus allowing enrichment of the data with further facts and links, and at the same time offering facilities for new insights and applications.

To help inform proposals for a new data model and application, I wanted to get a good look at the existing data and its model. Having been provided with SQL Server backups, I wanted to be able to extract the data from SQL Server into several different models that are under my consideration. After all, there is nothing quite like working with the data to help you understand if the model you have chosen is a potentially good fit or not.

Pentaho Data Integration - Community Edition, or KETTLE as it is commonly known is an Open Source ETL (Extract Transform and Load) tool. Compiled releases are available from SourceForge. KETTLE includes a GUI tool for visually designing workflows called Spoon, and its this tool that I initially want to work with.

KETTLE has the ability to connect to a vast array of different data sources. Unfortunately connecting to SQL Server out of the box does not work, as the SQL Server licence prohibits them from distributing their JDBC drivers directly, and so some manual configuration is needed.

Configuring KETTLE for SQL Server

These are the steps that I performed to get KETTLE (version 8.3.0.0-371) and Spoon to connect to SQL Server (version 2012 / 11.0.2100.60) as an input source. I am working on Windows 10, and I had previously installed KETTLE to C:\data-integration.

  1. Download and extract the Microsoft SQL Server JDBC drivers. I downloaded version 7.4. After downloading I ran the sqljdbc_7.4.1.0_enu.exe file, which self-unzips to the directory Microsoft JDBC Driver 7.4 for SQL Server.
  2. I copied the file Microsoft JDBC Driver 7.4 for SQL Server\sqljdbc_7.4\enu\mssql-jdbc-7.4.1.jre8-shaded.jar to the directory C:\data-integration\lib.
  3. If your SQL Server is using Integrated Authentication, and you wish to also connect in that way, then you will also need to copy the file Microsoft JDBC Driver 7.4 for SQL Server\sqljdbc_7.4\enu\auth\x64\sqljdbc_auth.dll to the directory C:\data-integration\libswt\win64.
  4. If you have Spoon already running, make sure to restart it so that it picks up the new drivers.
  5. In your transformation within Spoon, you can now add a New Database Connection with the following settings:
  • Connection Type: MS SQL Server (Native)
  • Access: Native (JDBC)
  • Host Name: localhost
  • Database Name: your-db-name-here
  • Instance Name: (leave blank)
  • Enter a Username and Password, or tick "Use Integrated Security".