March 31, 2020

Free and open source Java ETLs

1. Apache Spark

Spark has become a popular addition to ETL workflows. The Spark quickstart shows you how to write a self-contained app in Java. You can get even more functionality with one of Spark’s many Java API packages.

Spark has all sorts of data processing and transformation tools built in. It’s designed to run computations in parallel, so even large data jobs run fast: up to 100 times faster than Hadoop, according to the Spark website. It scales up for big data operations and can run algorithms on streaming data. Spark has tools for fast data streaming, machine learning and graph processing that output to storage or live dashboards.

Spark is supported by the community. If you need help, try its mailing lists, in-person groups and issue tracker.

2. Jaspersoft ETL

Jaspersoft ETL is a free platform that works with Java. With this open source ETL tool, you can embed dynamic reports and print-quality files into your Java apps and websites. It extracts report data from any data source and exports to 10 formats.

If you’re a developer, Jaspersoft ETL is an easy-to-use choice for data integration projects. You can download the community edition for free. The open source version is recommended for small work groups. For larger enterprises and professional-level support, you might opt for the enterprise edition. Documentation and tutorial links on the community page tend to take you to info on the paid version.

3. Scriptella

Scriptella is an open source ETL tool that was written in Java. It was created for programmers to simplify data transformation work. To embed or invoke Java code in Scriptella, you need the Janino or JavaScript bridge driver or the Service Provider Interface (SPI). The SPI is a Scriptella API plug-in that’s a bit more complicated. See the Using Java Code section in the Scriptella documentation for more options on using Java in Scriptella.

Scriptella supports cross-database ETL scripts, and it works with multiple data sources in a single ETL file. This ETL tool is a good choice to use with Java when you’ve got source data in different database formats that needs to be run in a combined transformation.

4. Apatar

If you work with CRM systems, Apatar, a Java-based open source ETL tool, might be a good choice. It moves and synchronizes customer data between your own systems and third-party applications. Apatar can transform and integrate large, complex customer datasets. You can customize this free tool with the Java source code that’s included in the package.

The Apatar download saves time and resources by leveraging built-in app integration tools and reusing mapping schemas that you create. Even non-developers can work with Apatar’s user-friendly drag-and-drop UI.

No programming, design or coding is required with this cost-saving, but powerful, data migration tool that makes CRM work easier.

5. Pentaho Kettle

Pentaho’s Data Integration (PDI), or Kettle (Kettle E.T.T.L. Environment), is an open source ETL tool that uses Pentaho’s own metadata-based integration method. Kettle documentation includes Java API examples. And its wiki has documentation covering how to run Kettle transformations with Java.

With Kettle, you can move and transform data, create and run jobs, load balance data, pull data from multiple sources, and more. But you can’t sequence your transformations. You’ll need Spoon, the GUI for designing jobs and transformations that work with Kettle’s tools: Pan does data transformation, and Kitchen runs your jobs. However, Spoon has some reported issues.

6. Talend Open Source Data Integrator

Go past basic data analysis and storage with Talend Open Studio for Data Integration, a cloud-friendly ETL tool that can embed Java code libraries. Open Studio’s robust toolbox lets you work with code, manage files, and transform and integrate big data. It gives you graphical design and development tools and hundreds of data processing components and connectors.

With Talend’s Open Studio, you can import external code, create and expand your own, and view and test it in a runtime environment. Check your final products with Open Studio’s Data Quality & Profiling and Data Preparation features. You can get the open source download on the Talend website.

7. Spring Batch

Spring Batch is a full-service ETL tool that is heavy on documentation and training resources. This lightweight, easy-to-use tool delivers robust ETL for batch applications. With Spring Batch, you can build batch apps, process small or complex batch jobs, and scale up for high-volume data processing. It has reusable functions and advanced technical features like transaction management, chunk-based processing, a web-based admin interface and more.
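To make chunk-based processing more concrete, here is a minimal plain-Java sketch of the idea: read items one at a time, transform each one, and hand them to a writer in fixed-size chunks. It does not use Spring Batch's own API. The class name `ChunkSketch`, the `run` method and the chunk size of 2 are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Plain-Java illustration of chunk-oriented processing; NOT Spring Batch's API.
class ChunkSketch {

    // Reads items one at a time, transforms each, and passes them to the writer
    // in chunks of at most chunkSize. Returns the number of chunks written.
    static <I, O> int run(Iterable<I> reader, Function<I, O> processor,
                          Consumer<List<O>> writer, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<O> chunk = new ArrayList<>(chunkSize);
        int chunks = 0;
        for (I item : reader) {
            chunk.add(processor.apply(item));
            if (chunk.size() == chunkSize) {
                // In a real batch framework, a full chunk is typically one unit of commit.
                writer.accept(new ArrayList<>(chunk));
                chunk.clear();
                chunks++;
            }
        }
        if (!chunk.isEmpty()) {
            // Flush the final, partially filled chunk.
            writer.accept(new ArrayList<>(chunk));
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<List<String>> written = new ArrayList<>();
        Function<Integer, String> toRow = n -> "row-" + n;
        Consumer<List<String>> writer = written::add;
        int chunks = run(List.of(1, 2, 3, 4, 5), toRow, writer, 2);
        System.out.println(chunks + " chunks: " + written);
    }
}
```

With five input items and a chunk size of 2, the writer is called three times, and the last call receives the single leftover item.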

Learn more about this open source ETL on GitHub and this resource page.

8. Easy Batch

The Easy Batch framework uses Java to make batch processing easier. This open source ETL tool reads, filters and maps your source data in sequence. It processes your job in a pipeline, writes your output in batches to your data warehouse, and gives you a job report. With Easy Batch’s APIs, you can process different source data types consistently. The Easy Batch ETL tool transforms your Java code into usable data for reporting, testing and analysis.
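The read, filter, map, write and report flow described above can be sketched in plain Java. This is only an illustration of the pattern and not Easy Batch's API. `PipelineSketch`, `Report` and the sample CSV-style lines are hypothetical, and batching of the output is left out to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Plain-Java illustration of a read -> filter -> map -> write job; NOT Easy Batch's API.
class PipelineSketch {

    // Hypothetical job report: simple counters collected while the job runs.
    static final class Report {
        final int read;
        final int filtered;
        final int written;

        Report(int read, int filtered, int written) {
            this.read = read;
            this.filtered = filtered;
            this.written = written;
        }

        @Override
        public String toString() {
            return "read=" + read + ", filtered=" + filtered + ", written=" + written;
        }
    }

    // Processes the source in order: items rejected by 'keep' are counted as
    // filtered, the rest are mapped and appended to the sink.
    static <I, O> Report run(List<I> source, Predicate<I> keep,
                             Function<I, O> mapper, List<O> sink) {
        int read = 0;
        int filtered = 0;
        int written = 0;
        for (I item : source) {
            read++;
            if (!keep.test(item)) {
                filtered++;
                continue;
            }
            sink.add(mapper.apply(item));
            written++;
        }
        return new Report(read, filtered, written);
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        Predicate<String> notBlank = line -> !line.isEmpty();
        Function<String, String> nameOf = line -> line.split(",")[1];
        Report report = run(List.of("1,alice", "", "2,bob"), notBlank, nameOf, names);
        System.out.println(names + " (" + report + ")");
    }
}
```

The report here is just three counters. A real framework would usually add timings and error details as well.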

You can get the latest version of Easy Batch, check out its documentation, or try one of many beginning, intermediate and advanced tutorials.

9. Apache Camel

Apache Camel is an open source Java framework that integrates different apps by using multiple protocols and technologies. It’s a small ETL library with only one API for you to learn. To configure routing and mediation rules, Apache Camel provides a Java object-based implementation of Enterprise Integration Patterns (EIPs) using an API or a declarative Java domain-specific language. EIPs are design patterns that enable enterprise application integration and message-oriented middleware.

Apache Camel uses Uniform Resource Identifiers (URIs), a naming scheme, to refer to endpoints. An endpoint URI tells Camel which component is used, the context path, and the options applied to the component. This ETL tool has more than 100 components, including FTP, JMX and HTTP. It runs as a standalone application, in a web container like Apache Tomcat, in a Java EE application server like WildFly, or combined with a Spring container.
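To see how those three parts fit together, here is a small plain-Java sketch that splits an endpoint-style URI into its component, context path and options. It is not Camel's own parser, and `EndpointUriSketch` is a made-up name. The sample URI is only an illustration of the general shape.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration of the parts of an endpoint URI; NOT Camel's own parser.
class EndpointUriSketch {
    final String component;            // text before the first ':'
    final String contextPath;          // text between ':' and '?'
    final Map<String, String> options; // key=value pairs after '?'

    EndpointUriSketch(String component, String contextPath, Map<String, String> options) {
        this.component = component;
        this.contextPath = contextPath;
        this.options = options;
    }

    static EndpointUriSketch parse(String uri) {
        int colon = uri.indexOf(':');
        if (colon < 1) {
            throw new IllegalArgumentException("missing component name: " + uri);
        }
        String component = uri.substring(0, colon);
        String rest = uri.substring(colon + 1);
        int question = rest.indexOf('?');
        String path = question < 0 ? rest : rest.substring(0, question);

        Map<String, String> options = new LinkedHashMap<>();
        if (question >= 0) {
            for (String pair : rest.substring(question + 1).split("&")) {
                if (pair.isEmpty()) {
                    continue; // tolerate stray '&' characters
                }
                int eq = pair.indexOf('=');
                if (eq < 0) {
                    options.put(pair, "");
                } else {
                    options.put(pair.substring(0, eq), pair.substring(eq + 1));
                }
            }
        }
        return new EndpointUriSketch(component, path, options);
    }

    public static void main(String[] args) {
        EndpointUriSketch endpoint = parse("file:data/inbox?noop=true&delay=5000");
        System.out.println(endpoint.component + " | " + endpoint.contextPath + " | " + endpoint.options);
    }
}
```

In a real route you would pass the URI string straight to Camel and let the matching component interpret it. The sketch only shows which piece of the string plays which role.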

You can read more about Apache Camel on its GitHub repo.

10. Bender

Amazon’s AWS Lambda runs serverless code and does basic ETL, but you might need something more. Bender is a Java-based framework designed to build ETL modules in Lambda. For example, this open source ETL appends GeoIP info to your log data, so you can create data-driven geographical dashboards in Kibana. Out of the box, it reads, writes and transforms input from Amazon Kinesis Streams and Amazon S3.

Bender is a robust, strongly documented and supported ETL tool that enhances your data operations. It gives you multiple operations, handlers, deserializers and serializers, transporters and reporters that go beyond what’s available in Lambda.

11. Smooks

If you’re OK with using an ETL tool that’s no longer being developed, you might try Smooks. This open source ETL tool uses Java to build apps for processing data, including Java code. Although Smooks isn’t supported, it has useful functions beyond basic ETL. For example, it can populate Java and virtual object models from source data. Smooks also transforms and transmits multi-gigabyte messages to your data warehouse or output destination. From there, Smooks can enrich messages with data from your data sources.

You can clone the Smooks repo on GitHub, or else download it on Maven. And if you’re feeling ambitious, you can take over the project yourself.

12. Metl

Metl, from JumpMind, is a lightweight ETL tool that’s built and run on the JDK. This web-based, open source ETL was designed to make programmers’ data work easier. Although it’s a hands-on ETL tool, you don’t need to write custom code with Metl. But you can write your own components if you need to. It runs in the cloud or internally. Metl generates a WAR file that you can run either on a server like Tomcat or as a standalone app.

See the JumpMind Metl page for support, documentation and training resources.

13. PocketETL

PocketETL is an extensible Java library for batched ETL. This is another hands-on open source ETL tool that was designed for programmers. To make your data pipeline faster, it processes large batches in parallel instead of in series. With PocketETL, you can customize streams and split and reuse EtlStream objects as components in other streams. PocketETL can speed up the time it takes to call external APIs, and it merges multiple EtlStreams into a single loader. If the output is more than 128 MiB, the S3FastLoader splits it into part files.
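The difference between processing a batch in series and in parallel can be sketched with standard Java streams. This is a plain-Java illustration only. It does not use PocketETL's EtlStream API, and `ParallelBatchSketch` and its method names are hypothetical.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java illustration of serial vs. parallel batch transforms; NOT PocketETL's API.
class ParallelBatchSketch {

    // Applies the step to each item one after another.
    static <I, O> List<O> transformInSeries(List<I> batch, Function<I, O> step) {
        return batch.stream().map(step).collect(Collectors.toList());
    }

    // Applies the step on multiple threads from the common fork-join pool.
    // The collected list still follows the order of the input list.
    static <I, O> List<O> transformInParallel(List<I> batch, Function<I, O> step) {
        return batch.parallelStream().map(step).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Function<Integer, Integer> square = n -> n * n;
        List<Integer> batch = List.of(1, 2, 3, 4);
        System.out.println(transformInSeries(batch, square));
        System.out.println(transformInParallel(batch, square));
    }
}
```

Both methods return the same list. The parallel version tends to pay off when each step is slow, such as a call to an external API, and the step has no shared mutable state.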

If you need an ETL tool that saves time, PocketETL might be a good choice. It comes with a host of publicly available adapters that include extractors, transformers and loaders. So, you can get your ETL work started right away with just a bit of coding. PocketETL’s user documentation isn’t extensive, but it has an issue support page.