DataCleaner Changelog

New in version 3.5.5 (September 25th, 2013)

  • The 'Synonym lookup' transformation now has an option to look up every token of the input. This is useful if you're replacing synonyms within the values of a long text field.
  • Blocking execution of DataCleaner jobs through the monitor's web service could sometimes fail because of a bug in the blocking thread. This issue has been fixed.
  • An improvement was made in the way jobs and the sequence of components are closed / cleaned up after execution.
  • The JNLP / Java WebStart version of DataCleaner was affected by a bug in the Java runtime that, under certain circumstances, caused certain JAR files not to be recognized by the WebStart launcher. This issue has been fixed by making slight modifications to those JAR files.
  • A few dead links in the documentation were fixed.
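
The token-level option of 'Synonym lookup' can be pictured with a minimal Python sketch. This is not DataCleaner's implementation: the SYNONYMS catalog, the replace_synonyms function and its look_up_every_token flag are hypothetical names chosen for illustration.

```python
import re

# Hypothetical synonym catalog: master term -> known synonyms.
SYNONYMS = {
    "street": ["st", "str"],
    "avenue": ["ave", "av"],
}

# Reverse index from lower-cased synonym to its master term.
LOOKUP = {syn: master for master, syns in SYNONYMS.items() for syn in syns}


def replace_synonyms(text, look_up_every_token=True):
    """Replace synonyms with their master term."""
    if not look_up_every_token:
        # Whole-value lookup: the complete value must match a synonym.
        return LOOKUP.get(text.strip().lower(), text)
    # Token lookup: replace each word on its own and keep the separators.
    return re.sub(
        r"\w+",
        lambda m: LOOKUP.get(m.group(0).lower(), m.group(0)),
        text,
    )
```

With the token option switched off, a value such as "Main St" is left alone because the whole value is not a synonym; with it switched on, only the "St" token is replaced.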

New in version 3.5.4 (September 6th, 2013)

  • It is now possible to hide output columns of transformations. Hiding does not affect the processing flow at all; it simply hides the columns from the user interface, potentially making for a cleaner experience when interacting with other components.
  • A new web service has been added to the monitoring web application, which provides a way to poll the status of the execution of a particular job.
  • A bug that caused the HTML report to fail for certain analysis types when no records had been processed was fixed.
  • Six other minor bugs have been addressed.
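
A client of a status-polling service typically runs a loop like the Python sketch below. The changelog does not give the endpoint or the state names, so fetch_status is a hypothetical callable standing in for the HTTP request, and the "SUCCESS" and "FAILURE" states are assumptions.

```python
import time


def wait_for_job(fetch_status, poll_interval=1.0, timeout=60.0, sleep=time.sleep):
    """Poll fetch_status() until it reports a terminal state or the timeout elapses.

    fetch_status is a hypothetical callable that returns the current state
    of one job execution as a string.
    """
    terminal = {"SUCCESS", "FAILURE"}  # assumed state names
    waited = 0.0
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if waited >= timeout:
            raise TimeoutError(f"job still {status} after {waited:.0f}s")
        sleep(poll_interval)
        waited += poll_interval
```

Passing the sleep function as a parameter keeps the loop easy to test without real waiting.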

New in version 3.5.1 (June 13th, 2013)

  • Capture changed records:
  • A new filter was added to enable incremental processing of records that have not been processed before, e.g. for profiling or copying only modified records. The new filter's name is Capture changed records, referring to the concept of Change data capture.
  • Queued execution of jobs:
  • The DataCleaner monitor will now queue executions of the same job if it is triggered multiple times. This ensures that you don't accidentally run the same job concurrently, which may lead to all sorts of issues depending on what the job does.
  • Minor bugfixes:
  • Several bugfixes were implemented.
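
The idea behind change data capture can be sketched in a few lines of Python. This assumes each record carries a comparable "last_modified" value and that the caller stores the returned high-water mark between runs; the capture_changed name and the record layout are illustrative, not DataCleaner's API.

```python
def capture_changed(records, last_seen, key="last_modified"):
    """Return the records modified after last_seen and the new high-water mark.

    last_seen is None on the first run, in which case every record passes.
    """
    changed = [r for r in records if last_seen is None or r[key] > last_seen]
    # The next run only needs to look at records newer than this value.
    new_mark = max((r[key] for r in changed), default=last_seen)
    return changed, new_mark
```

A second run with the stored mark then skips everything that was already processed.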

New in version 3.5 (May 2nd, 2013)

  • Several wizards are now available for registering datastores, including file-upload to the server for CSV files, database connection entry, guided registration of credentials and more.
  • The job building wizards have also been extended with several enhanced features: selection of value distribution and pattern finding fields in the Quick analysis wizard, a completely new wizard for creating EasyDQ based customer cleansing jobs, and a new job wizard for firing Pentaho Data Integration jobs (read more below).
  • You can now ad-hoc query any datastore directly in the web user interface. This makes it easy to get quick or sporadic insights into the data without setting up jobs or other managed approaches of processing the data.
  • Once jobs or datastores are created, the user is guided to take action with the newly built object. For instance, you can very quickly run a job right after it's built, or query a datastore after it is registered.
  • Administrators can now directly upload jobs to the repository, which is especially handy if you want to hand-edit the XML content of the job files.
  • A lot of the technical cruft is now hidden away in favor of showing simple dialogs. For instance, when a job is triggered a large loading indicator is shown, and when finished the result will be shown. The advanced logging screen that was previously there can still be displayed upon clicking a link for additional details.

New in version 3.1.2 (January 22nd, 2013)

  • We've added a web service in the monitoring application for getting a (list of) metric values. This makes the monitoring even more usable as a key infrastructure component, as a way to monitor data (quality) and expose the results to third party applications.
  • The 'Table lookup' component has been improved by adding join semantics as a configurable property. Using the join semantics, you can choose whether the lookup works semantically like a LEFT JOIN or an INNER JOIN.
  • The EasyDQ components have been upgraded, adding further configuration options and a richer deduplication result interface.
  • Performance improvements have been a specific focus of this release. Improvements have been made in the engine of DataCleaner to further utilize a streaming processing approach in certain corner cases that were not covered previously.
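
The difference between the two join semantics of 'Table lookup' can be shown with a small Python sketch. The table_lookup function and its parameters are hypothetical; the sketch assumes that LEFT JOIN semantics keep unmatched rows with an empty (None) lookup value, while INNER JOIN semantics drop them.

```python
def table_lookup(rows, lookup_table, key, output, join="LEFT"):
    """Add lookup_table[row[key]] to each row under the name `output`.

    join="LEFT" keeps rows without a match (output is None);
    join="INNER" drops them.
    """
    if join not in ("LEFT", "INNER"):
        raise ValueError(f"unsupported join semantics: {join}")
    for row in rows:
        matched = row[key] in lookup_table
        if not matched and join == "INNER":
            continue
        yield {**row, output: lookup_table.get(row[key])}
```

For two input rows of which only one has a match, join="LEFT" yields two rows and join="INNER" yields one.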

New in version 3.1.1 (January 5th, 2013)

  • The date and time related analysis options have been expanded, adding distribution analyzers for week numbers, months and years. All analyzers related to date and time are now grouped within a submenu called "Date and time" under "Analyze".
  • An optional "descriptive statistics" option has been added to the Number analyzer and the Date/time analyzer. This option adds additional metrics to the results of these analyzers, such as Median, Skewness, percentiles and Kurtosis. These metrics are optional since their memory footprint is somewhat larger than the existing metrics.
  • The lines in the timeline charts of the monitoring web application now have small dots in them. This is especially useful for charts with few observations (or even only one), because the dots point out exactly where the observation points are.
  • The query parser used for ad-hoc queries has also been substantially improved. Queries can now contain DISTINCT clauses, *-wildcards and subqueries, and the parser is tolerant of text-case differences.
  • Two new transformers have been added for generating UUIDs and for generating timestamps.
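
For readers unfamiliar with the descriptive statistics metrics, the Python sketch below computes them for a list of numbers. It uses population moments for skewness and excess kurtosis and the nearest-rank method for percentiles; these definitions are assumptions, since the changelog does not say which variants DataCleaner uses.

```python
import math
import statistics


def descriptive_statistics(values):
    """Compute median, two percentiles, skewness and excess kurtosis."""
    data = sorted(values)  # all values are kept in memory, unlike a running sum
    n = len(data)
    mean = sum(data) / n
    # Central moments of the population.
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n

    def percentile(p):
        # Nearest-rank percentile: smallest value with at least p% at or below it.
        rank = max(1, math.ceil(p / 100 * n))
        return data[rank - 1]

    return {
        "median": statistics.median(data),
        "percentile_25": percentile(25),
        "percentile_75": percentile(75),
        "skewness": m3 / m2 ** 1.5 if m2 else 0.0,
        "kurtosis": m4 / m2 ** 2 - 3 if m2 else 0.0,
    }
```

Because the median and percentiles need the sorted values, this sketch retains every value, which illustrates why such metrics can cost more memory than simple counts or sums.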

New in version 3.1 (December 18th, 2012)

  • Metric formulas – elaborated Data Quality KPIs:
  • It is now possible to build much more elaborate Data Quality KPIs in DataCleaner’s monitoring web application. The user interface allows you to build complex formulas in a spreadsheet-like formula style, using variables collected by DataCleaner jobs.
  • Metric formulas can combine any number of metrics, constants and operations, as long as they can be expressed in a mathematical equation.
  • For instance, measure the rate of duplicate records as a percentage of the total record count, or the number of product codes that conform to a set of multiple string patterns.
  • Ad-hoc querying – of any datastore:
  • With DataCleaner 3.1 you can now perform ad-hoc queries to any datastore! Queries can be expressed in plain SQL and will be applied to databases as well as files, NoSQL databases and more, providing a truly helpful query mechanism to extend into your discovery and data profiling experience.
  • The query option is also available through a web service to monitoring users with the ADMIN role. The query is provided as an HTTP parameter or POST body, and the result is provided as an XHTML table.
  • Value matcher – a new analysis option:
  • Often you have a firm idea of which values should be allowed and expected for a particular field. In DataCleaner there’s always been the Value Distribution analysis option, which would help you assert your assumptions. In DataCleaner 3.1, though, you have a more precise offering: the Value matcher. This analysis option allows you to specify a set of expected values and then perform a value-distribution-like analysis, specifically to validate and identify unexpected values.
  • Copying, deleting and management of jobs:
  • Management of jobs and results in the DataCleaner monitor application has been improved greatly. You can now click a job in the Scheduling page of the monitor, and find management options available for operations such as renaming, copying, deleting and more. Each operation respects the linkages to other artifacts in the monitor, such as analysis results, schedules and more. This means that management of the monitoring repository has become a lot easier and more mature.
  • Manage data quality history:
  • Sometimes you’re facing situations where you actually want to do monitoring with historic data! It might be that you have historic dumps or backups of databases, which you wish to show and tell the story of. You can now do the analysis of this historic data, upload it to the DataCleaner monitor and, using a new web service, set a historic date for that particular analysis result. This means that your timelines will properly plot the results using their intended date, even though the results may have been collected at a later point in time.
  • Clustered scheduler support (EE only):
  • The scheduler of DataCleaner monitor has been externalized, so that it can be replaced by means of simple configuration. In the Enterprise Edition (EE) of DataCleaner, we provide a clustered scheduler, providing the ability to load balance and distribute your executions across a cluster of machines.
  • Single-signon (SSO) using CAS (EE only):
  • In the Enterprise Edition (EE) of DataCleaner we now provide a single-signon option for the monitor application. Now DataCleaner can be an integrated part of your IT infrastructure, also security-wise.
  • ... And a lot more:
  • The above is just a summary. More than thirty issues have been resolved in this release. We have solved several requests coming from the forums and community, and we encourage everyone to use this medium as a vehicle for change. We’re very happy for the development of DataCleaner to be heavily influenced by the currents in the community.
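
The duplicate-rate KPI mentioned under metric formulas can be sketched in Python as a tiny formula evaluator. This is an illustration only and not how DataCleaner parses formulas: evaluate_formula, the supported operators and the metric names "duplicates" and "total" are all assumptions.

```python
import ast
import operator

# Arithmetic operations the sketch supports.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate_formula(formula, metrics):
    """Evaluate an arithmetic formula whose variables are metric names."""

    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Name):
            return metrics[node.id]  # look up a collected metric value
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported formula element")

    return walk(ast.parse(formula, mode="eval").body)
```

For example, evaluate_formula("duplicates / total * 100", {"duplicates": 250, "total": 1000}) expresses the duplicate rate as a percentage of the total record count.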

New in version 3.0.3 (November 1st, 2012)

  • Adds a service for renaming jobs in the monitoring repository.
  • You can access this as a RESTful Web service or interactively in the UI.
  • A Web service was added for changing the historic date of an analysis result in the monitoring repository.
  • The Web application has been made compatible with legacy JSF containers.
  • Caching of configuration in the Web application was greatly improved, leading to faster page load and job initialization times.

New in version 3.0.2 (October 13th, 2012)

  • When triggering a job in the monitoring web application, the panel auto-refreshes every second to get the latest state of the execution.
  • File-based datastores (such as CSV or Excel spreadsheets) with absolute paths are now correctly resolved in the monitoring web application.
  • The "Select from key/value map" transformer now supports nested select expressions like "Address.Street" or "orderlines[0]".
  • The table lookup mechanism has been optimized for performance, using prepared statements when running against JDBC databases.
  • Administrators can now download file-based datastores directly from the "Datastores" page.
  • Exception handling in the monitoring web application has been improved a bit, making the error messages more precise and intuitive.
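
Nested select expressions such as "Address.Street" or "orderlines[0]" can be resolved against a key/value map roughly as in the Python sketch below. The select function is hypothetical, and returning None for a missing key or index is an assumption rather than documented DataCleaner behavior.

```python
import re

# A path token is either a key name or a bracketed list index.
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def select(data, expression):
    """Resolve an expression like 'Address.Street' or 'orderlines[0]'."""
    current = data
    for name, index in _TOKEN.findall(expression):
        try:
            current = current[int(index)] if index else current[name]
        except (KeyError, IndexError, TypeError):
            return None  # assumed behavior for paths that do not resolve
    return current
```

Keys and indexes can be combined freely, so an expression like "orderlines[0].product" walks into the first order line and then reads its "product" key.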