This month, we extended our overview of ETL tools beyond open source to include Matillion, an ETL/ELT tool available from the AWS Marketplace that has been optimized for Amazon Redshift and the AWS ecosystem. Matillion pushes all data transformations down to Redshift. This push-down design, which sets Matillion apart, delivers strong performance both for data computation and during the development cycle. Because Matillion is web hosted and tied to the Redshift cluster, its processing scales almost linearly, and it’s rarely more than a couple of clicks to back up, restore, or suspend an instance. Since Amazon hosts the Matillion software, maintenance is vastly reduced. Matillion’s cloud-based architecture also means that you can scale processing power or memory up and down on demand, so you do not need large amounts of memory sitting idle (expensively): you scale up when your processing needs it and back down when it doesn’t. End users access the instance through a browser on their PC, tablet, or phone.
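To make the push-down idea concrete, here is a minimal Python sketch contrasting a client-side transformation with one executed inside the database. It is only an illustration under stated assumptions: the standard library's `sqlite3` stands in for Redshift, the `orders` and `region_totals` tables are hypothetical, and the SQL is not what Matillion actually generates.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a small, hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("east", 10.0), ("east", 5.0), ("west", 7.5)],
    )
    return conn


def totals_client_side(conn):
    """ETL style: pull every row to the client and aggregate in Python."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM orders"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def totals_pushed_down(conn):
    """ELT / push-down style: the database aggregates and stores the result.

    Only the small result table is read back by the client.
    """
    conn.execute(
        "CREATE TABLE region_totals AS "
        "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"
    )
    return dict(conn.execute("SELECT region, total FROM region_totals"))
```

Both functions return the same totals. The difference is where the work happens: in the push-down version the rows never leave the database, which is why performance depends on the cluster rather than on the machine that starts the job.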
The Breakdown
Installation Process | Since Matillion runs on AWS, there is no installation. Starting a Matillion instance takes just a couple of clicks, and the instance is up and running in under five minutes. |
User Interface | Orchestrations and transformations are the core building blocks, each containing steps. Jobs can be linked through AWS messaging and queues. The core blocks are easy to navigate, and more complicated requirements can be handled through code extensions that call additional services. |
User Community | The company website maintains a user community with helpful guides for each component; these answered all of our questions. |
Logging | Logs can be produced at the job or transformation level, with options for the level of detail. In addition, external alert options let you monitor jobs in real time through email or SMS notifications. |
Job Execution | Matillion jobs can be run interactively or through a scheduler that launches orchestration jobs automatically. A project can have multiple schedules, and all setup and execution are done in a web browser. In addition, jobs can be set to execute when triggered by events in the AWS environment. |
Performance | Performance scales almost linearly with the size of the Redshift cluster. Scripts are translated and pushed to Redshift, so performance is strong regardless of the client machine that initiates the process. |
Customizing Jobs | The design is very flexible: code can be shared across projects via orchestration calls. In addition, configurations and variables can be read from files or tables to customize each run. |
Work Repository | Matillion scripts are stored in the cloud and can be updated by multiple developers simultaneously. In upcoming releases, multiple people will even be able to update the same transformation step within a job. Matillion also provides version control for scripts, with branching and locking options. Specific branches of code can be identified and run, regardless of which job execution method you choose. |
Architecture | Scripts are stored in the cloud and can be multi-threaded through orchestrations. Software is not tied to any operating system, as all scripts are developed and executed via web browser. |
Tool Updates | Matillion develops its software using Agile methodologies, with new releases of the tool about every six weeks. Clients choose when to apply updates to a given instance, which prevents negative impacts on their ongoing projects. |
Additional Product Offering | Matillion also offers a Cloud Business Intelligence product. |
Matillion not only simplifies working with data stored on Amazon S3 and Redshift, but also makes it easy to work with hosted Amazon platforms such as RDS, with NoSQL stores, REST or SOAP APIs, social media sources, and other enterprise cloud data systems (Google Analytics, Salesforce.com, NetSuite, SAP, etc.). In addition to these integration points, the tool lets you create extensions natively or by running a piece of external code. The ability to run external code via orchestration means that you can extend the toolbox to handle almost any situation using common services, which reduces maintenance and complexity. Another feature that helped with our development is the ability to track row counts and sample data at any point in the process. This allows developers to identify potential data quality or mapping issues, as well as logic errors. When such an issue is encountered, you can add or enhance logging and error handling immediately.
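The value of checking row counts and sample data between steps can be sketched in a few lines of Python. This is a hypothetical illustration of the idea only: Matillion exposes this through its browser interface, and the function name `run_with_checkpoints`, the step names, and the sample records below are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Step = Tuple[str, Callable[[List[Row]], List[Row]]]


def run_with_checkpoints(rows: List[Row], steps: List[Step], sample_size: int = 2):
    """Run each step in order, recording a row count and a small sample after it."""
    report = [{"step": "input", "row_count": len(rows), "sample": rows[:sample_size]}]
    for name, transform in steps:
        rows = transform(rows)
        report.append(
            {"step": name, "row_count": len(rows), "sample": rows[:sample_size]}
        )
    return rows, report


def drop_missing_ids(rows: List[Row]) -> List[Row]:
    """Hypothetical step: discard records without a customer_id."""
    return [r for r in rows if r.get("customer_id") is not None]


def uppercase_country(rows: List[Row]) -> List[Row]:
    """Hypothetical step: normalize the country code to upper case."""
    return [{**r, "country": str(r["country"]).upper()} for r in rows]


SAMPLE_INPUT: List[Row] = [
    {"customer_id": 1, "country": "us"},
    {"customer_id": None, "country": "ca"},
    {"customer_id": 3, "country": "gb"},
]

PIPELINE: List[Step] = [
    ("drop_missing_ids", drop_missing_ids),
    ("uppercase_country", uppercase_country),
]
```

Calling `run_with_checkpoints(SAMPLE_INPUT, PIPELINE)` returns the transformed rows together with a per-step report. A drop in `row_count` after a step that should not filter anything, or an unexpected value in a sample, points to the step where a mapping or logic error was introduced.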
Our Conclusion
The Matillion software can significantly ease data integration for the Amazon Redshift environment. We gave a copy of the software to an ETL developer who had never touched a cloud-based data integration product, and that developer had orchestrations up and running in a matter of minutes. The browser-based, intuitive interface may help clients that are starting down the self-service data integration path, while the power and flexibility mean that you can schedule orchestrations or create event-driven jobs to perform just about any task. Given the very reasonable price of $1.37 per hour, billed through the AWS Marketplace, we found the tool to be an incredible value for the features delivered and will continue to use it in our AWS data lab.
Brian Rodrigue – Data Integration Specialty Lead
Claire Walsh – Data & Analytics Lead
@excellaco / excella.com