May 2nd, 2013
· Several wizards are now available for registering datastores, including file upload to the server for CSV files, database connection entry, guided registration of Salesforce.com credentials and more.
· The job building wizards have also been extended with several enhanced features: selection of value distribution and pattern finding fields in the Quick analysis wizard, a completely new wizard for creating EasyDQ based customer cleansing jobs and a new job wizard for firing Pentaho Data Integration jobs (read more below).
· You can now ad-hoc query any datastore directly in the web user interface. This makes it easy to get quick or sporadic insights into the data without setting up jobs or other managed approaches of processing the data.
· Once jobs or datastores are created, the user is guided to take action with the newly built object. For instance, you can very quickly run a job right after it's built, or query a datastore after it is registered.
· Administrators can now directly upload jobs to the repository, which is especially handy if you want to hand-edit the XML content of the job files.
· A lot of the technical cruft is now hidden away in favor of showing simple dialogs. For instance, when a job is triggered a large loading indicator is shown, and when finished the result will be shown. The advanced logging screen that was previously there can still be displayed upon clicking a link for additional details.
January 22nd, 2013
· We've added a web service in the monitoring application for getting a (list of) metric values. This makes the monitoring even more usable as a key infrastructure component, as a way to monitor data (quality) and expose the results to third party applications.
· The 'Table lookup' component has been improved by adding join semantics as a configurable property. Using the join semantics you can tweak if you wish the lookup to work semantically like a LEFT JOIN or an INNER JOIN.
· The EasyDQ components have been upgraded, adding further configuration options and a richer deduplication result interface.
· Performance improvements have been a specific focus of this release. Improvements have been made in the engine of DataCleaner to further utilize a streaming processing approach in certain corner cases that were not covered previously.
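To illustrate the join semantics of the 'Table lookup' component mentioned above, here is a minimal Python sketch. It is not DataCleaner's implementation or API: the function name table_lookup, the join parameter and the lookup_value field are assumptions made for this example only.

```python
from typing import Dict, Iterable, Iterator, Optional


def table_lookup(
    rows: Iterable[Dict[str, str]],
    lookup: Dict[str, str],
    key: str,
    join: str = "LEFT",
) -> Iterator[Dict[str, Optional[str]]]:
    """Enrich each row with a value looked up by row[key].

    join="LEFT" keeps unmatched rows (lookup_value is None);
    join="INNER" drops them. Both names are hypothetical.
    """
    if join not in ("LEFT", "INNER"):
        raise ValueError(f"unsupported join semantics: {join}")
    for row in rows:
        match = lookup.get(row[key])
        if match is None and join == "INNER":
            continue  # INNER JOIN: rows without a match are discarded
        yield {**row, "lookup_value": match}


rows = [{"country": "DK"}, {"country": "XX"}]
names = {"DK": "Denmark"}
left_rows = list(table_lookup(rows, names, "country", join="LEFT"))
inner_rows = list(table_lookup(rows, names, "country", join="INNER"))
```

With LEFT semantics every input row survives the lookup; with INNER semantics the lookup also acts as a filter.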
January 5th, 2013
· The date and time related analysis options have been expanded, adding distribution analyzers for week numbers, months and years. All analyzers related to date and time are now grouped within a submenu called "Date and time" under "Analyze".
· An optional "descriptive statistics" option has been added to the Number analyzer and the Date/time analyzer. This option adds additional metrics to the results of these analyzers, such as Median, Skewness, percentiles and Kurtosis. These metrics are optional since their memory footprint is somewhat larger than the existing metrics.
· The lines in the timeline charts of the monitoring web application now have small dots in them. This is especially useful for charts with few (or even only one) observations in them - to point out exactly where the observation points are.
· The query parser used when invoking ad-hoc queries has also been substantially improved. Queries can now contain DISTINCT clauses, *-wildcards and subqueries, and the parser is fault-tolerant towards text-case issues.
· Two new transformers have been added for generating UUIDs and for generating timestamps.
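The "descriptive statistics" metrics named in this release (median, percentiles, skewness, kurtosis) can be sketched in Python as follows. This is an illustration under stated assumptions, not DataCleaner's code: the function name and the choice of population moments and excess kurtosis are ours. Note that the sketch holds all values in memory, which hints at why such metrics cost more memory than simple running counts.

```python
import statistics
from typing import Dict, List


def descriptive_statistics(values: List[float]) -> Dict[str, float]:
    """Compute median, quartiles, skewness and excess kurtosis.

    Uses population (biased) central moments; other definitions exist.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("all values are identical")
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        "median": statistics.median(values),
        "percentile_25": q1,
        "percentile_75": q3,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis
    }


stats = descriptive_statistics([1, 2, 3, 4, 5])
```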
December 18th, 2012
Metric formulas – elaborated Data Quality KPIs:
· It is now possible to build much more elaborate Data Quality KPIs in DataCleaner’s monitoring web application. The user interface allows you to build complex formulas in a spreadsheet-like formula style, using variables collected by DataCleaner jobs.
· Metric formulas can combine any number of metrics, constants and operations, as long as the formula can be expressed as a mathematical equation.
· For instance, measure the rate of duplicate records as a percentage of the total record count, or measure the number of product codes that conform to a set of multiple string patterns.
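The duplicate-rate example above amounts to a simple formula over two metrics. A minimal Python sketch, assuming two hypothetical metric values named duplicate_count and total_count (the monitoring application's own formula syntax is not shown here):

```python
def duplicate_rate_percent(duplicate_count: int, total_count: int) -> float:
    """Rate of duplicate records as a percentage of all records."""
    if total_count == 0:
        return 0.0  # assumption: an empty dataset has no duplicates
    return 100.0 * duplicate_count / total_count


# Example: 25 duplicates out of 1000 records
rate = duplicate_rate_percent(25, 1000)
```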
Ad-hoc querying – of any datastore:
· With DataCleaner 3.1 you can now perform ad-hoc queries to any datastore! Queries can be expressed in plain SQL and will be applied to databases as well as files, NoSQL databases and more, providing a truly helpful query mechanism to extend into your discovery and data profiling experience.
· The query option is also available through a web service to monitoring users with the ADMIN role. The query is provided as an HTTP parameter or POST body, and the result is provided as an XHTML table.
Value matcher – a new analysis option:
· Oftentimes you have a firm idea of which values should be allowed and expected for a particular field. In DataCleaner there’s always been the Value Distribution analysis option which would help you assert your assumptions. In DataCleaner 3.1 though, you have a more precise offering: the Value matcher. This analysis option allows you to specify a set of expected values and then perform a value distribution like analysis, specifically to validate and identify unexpected values.
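The idea behind the Value matcher can be sketched in a few lines of Python. This is only an illustration of the concept; the function name value_matcher and its return shape are assumptions, not DataCleaner's API.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple


def value_matcher(
    values: Iterable[Hashable], expected: Iterable[Hashable]
) -> Tuple[Counter, Counter]:
    """Count occurrences of expected values and of everything else."""
    expected_set = set(expected)
    matched: Counter = Counter()
    unexpected: Counter = Counter()
    for value in values:
        if value in expected_set:
            matched[value] += 1
        else:
            unexpected[value] += 1  # includes None / missing values
    return matched, unexpected


matched, unexpected = value_matcher(["M", "F", "M", "X", None], {"M", "F"})
```

The unexpected counter is the interesting part of the result: it lists exactly those values that violate the assumption about the field.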
Copying, deleting and management of jobs:
· Management of jobs and results in the DataCleaner monitor application has been improved greatly. You can now click a job in the Scheduling page of the monitor, and find management options available for operations such as renaming, copying, deleting and more. Each operation respects the linkages to other artifacts in the monitor, such as analysis results, schedules and more. This means that management of the monitoring repository has become a lot easier and more mature.
Manage data quality history:
· Sometimes you’re facing situations where you actually want to do monitoring with historic data! It might be that you have historic dumps or backups of databases, which you wish to show and tell the story of. You can now do the analysis of this historic data, upload it to the DataCleaner monitor, and using a new web service, set a historic date of that particular analysis result. This means that your timelines will properly plot the results using their intended date, but with the results that you’ve collected maybe at a later point in time.
Clustered scheduler support (EE only):
· The scheduler of DataCleaner monitor has been externalized, so that it can be replaced by the means of simple configuration. In the Enterprise Edition (EE) of DataCleaner, we provide a clustered scheduler, providing the ability to load balance and distribute your executions across a cluster of machines.
Single-signon (SSO) using CAS (EE only):
· In the Enterprise Edition (EE) of DataCleaner we now provide a single-signon option for the monitor application. Now DataCleaner can be an integrated part of your IT infrastructure, also security-wise.
... And a lot more:
· The above is just a summary. More than thirty issues have been resolved in this release. We have solved several requests coming from the forums and community, and we encourage everyone to use this medium as a vehicle for change. We’re very happy to make the development of DataCleaner be heavily influenced by the streams in the community.
November 1st, 2012
· Adds a service for renaming jobs in the monitoring repository.
· You can access this as a RESTful Web service or interactively in the UI.
· A Web service was added for changing the historic date of an analysis result in the monitoring repository.
· The Web application has been made compatible with legacy JSF containers.
· Caching of configuration in the Web application was greatly improved, leading to faster page load and job initialization times.
October 13th, 2012
· When triggering a job in the monitoring web application, the panel auto-refreshes every second to get the latest state of the execution.
· File-based datastores (such as CSV or Excel spreadsheets) with absolute paths are now correctly resolved in the monitoring web application.
· The "Select from key/value map" transformer now supports nested select expressions like "Address.Street" or "orderlines.product.name".
· The table lookup mechanism has been optimized for performance, using prepared statements when running against JDBC databases.
· Administrators can now download file-based datastores directly from the "Datastores" page.
· Exception handling in the monitoring web application has been improved a bit, making the error messages more precise and intuitive.
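The nested select expressions mentioned for the "Select from key/value map" transformer (such as "Address.Street") can be pictured as a walk through nested maps. A minimal Python sketch, assuming plain nested dictionaries and a hypothetical helper name select_nested; how the real transformer treats lists or missing keys is not covered by these notes.

```python
from typing import Any, Mapping, Optional


def select_nested(record: Mapping[str, Any], expression: str) -> Optional[Any]:
    """Resolve a dotted expression like "Address.Street" in nested maps.

    Returns None if any step of the path is missing (an assumption).
    """
    current: Any = record
    for part in expression.split("."):
        if not isinstance(current, Mapping) or part not in current:
            return None
        current = current[part]
    return current


record = {"Name": "Jane", "Address": {"Street": "Main St", "City": "Oslo"}}
street = select_nested(record, "Address.Street")
missing = select_nested(record, "Address.Zip")
```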
October 2nd, 2012
· The primary bugfix in this release was about restoring the mapping of columns and specific enumerable categorizations. For instance in the new Completeness analyzer, we found that after reloading a saved job, the mapping was not always correct.
· Furthermore a few internal improvements have been made, making it easier to deploy the DataCleaner monitor web application in environments using the Spring Framework.
· Last but not least, the visualization settings in the desktop application have been improved by automatically taking a look at the job being visualized and toggling displayed artifacts based on the screen size and amount of details needed to show it nicely.
September 26th, 2012
· Display of timeline and trends of data quality metrics
· Centralized repository for managing and containing jobs, results, timelines etc.
· Scheduling and auditing of DataCleaner jobs
· Providing web services for invoking DataCleaner transformations
· Security and multi-tenancy
· Alerts and notifications when data quality metrics are out of their expected comfort zones.
· There is a new Completeness analyzer which is very useful for simply identifying records that have incomplete fields.
· You can now export DataCleaner results to nice-looking HTML reports that you can give to your manager, or send to your XML parser!
· The new monitoring environment is also closely integrated with the desktop application. Thus, the desktop application now has the ability to publish jobs and results to the monitor repository, and to be used as an interactive editor for content already in the repository.
· New date-oriented transformations are now available: Date range filter, which allows you to subset datasets based on date ranges, and Format date, which allows you to format a date using a date mask.
· The Regex Parser (which was previously only available through the ExtensionSwap) has now been included in DataCleaner. This makes it very convenient to parse and standardize rich text fields using regular expressions.
· There's a new Text case transformer available. With this transformation you can easily convert between upper/lower case and proper capitalization of sentences and words.
· Two new search/replace transformations have been added: Plain search/replace and Regex search/replace.
· The user experience of the desktop application has been improved. We've added several in-application help messages, made the colors look brighter and clearer and improved the font handling.
May 1st, 2012
Apache CouchDB support:
· We've added support for the NoSQL database Apache CouchDB. DataCleaner supports both reading from, analyzing and writing to your CouchDB instances.
Update table writer:
· Following our previous efforts to bring ETLightweight-style features into DataCleaner, we've added a writer which updates records in a table. You can use this for example to insert or update records based on specific conditions.
· Like the Insert into table writer, the new DataCleaner Update table writer is not restricted to SQL-based databases; it works with any datastore type which supports writing (currently relational databases, CSV files, Excel spreadsheets, CouchDB databases and MongoDB databases). The semantics are the same as with a traditional UPDATE statement in SQL.
Drill-to-detail information saved in result files:
· When using the Save result feature of DataCleaner 2.5, some users experienced that their drill-to-detail information was lost. In DataCleaner 2.5.2 we now also persist this information, making your DQ archives much more valuable when investigating historic data incidents.
Improved EasyDQ error handling:
· The EasyDQ components have been improved in terms of error handling. If a momentary network issue occurs or another similar issue causes a few records to fail, the EasyDQ components will now gracefully recover and most importantly - your batch work will prevail even in spite of errors.
Table mapping for NoSQL datastores:
· Since CouchDB and MongoDB are not table based, but have a more dynamic structure we provide two approaches to working with them: The default, which is to let DataCleaner autodetect a table structure, and the advanced which allows you to manually specify your desired table structure. Previously the advanced option was only available through XML configuration, but now the user interface contains appropriate dialogs for doing this directly in the application.
January 3rd, 2012
Feature enhancements:
· Batch loading features were greatly improved when writing data to database tables. Expect to see many orders of magnitude improvements here.
· Writing data has been made more conveniently available by adding the options to the window menu.
· You can now easily rename components of a job by double clicking their tabs.
· When reading from and writing to the same datastore (e.g. the DataCleaner staging area) we've made sure that the table cache of that datastore is refreshed. Previously some scenarios allowed you to see an out-of-date view of the tables.
· A potential deadlock when starting up the application was solved. This deadlock was a consequence of an issue in the JVM, but we worked around it by synchronizing all calls to the particular API in Java.
December 15th, 2011
· Duplicate detection (aka. Deduplication or Fuzzy matching of records), which is free to use for up to 500,000 values.
· Address data validation and cleansing. This allows you to check if addresses exist, if they are correctly formatted and even to suggest corrections in case you have mistakes.
· Name data validation and cleansing. With the Name service, EasyDQ does not only format your names consistently, but also checks for misspellings and interprets the name parts.
· Email and phone validation and cleansing. These services provide checking of email and phone data, making sure that email domains exist, that country codes are correct and much more.
September 30th, 2011
International data support:
· If you are working with international data, then you might have different character sets in your data, for example Chinese or Hebrew. We added the Character set distribution analyzer, which is a profiling option that lets you figure out which character sets are used in your data.
· Working with data containing different character sets can be problematic. Using the new Transliterate transformer you can now transliterate strings from different writing systems to Latin characters.
· There is also a new webcast demonstration, focusing on the international data capabilities of DataCleaner 2.3 in the documentation section.
Grouping of analysis results by a secondary column:
The Pattern analyzer is now able to group patterns based on a secondary column. This is useful for analyses like:
· Get patterns of phone numbers, grouped by country.
· Get patterns of email username based on email domain.
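Grouping patterns by a secondary column, as in the phone number example above, can be sketched as follows. The pattern rule used here (digits become "9", letters become "a" or "A", other characters are kept) is a simplifying assumption for illustration and not the actual Pattern analyzer's tokenization; the function names are hypothetical too.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Mapping


def to_pattern(value: str) -> str:
    """Map a string to a simple character-class pattern."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)  # keep punctuation and whitespace as-is
    return "".join(out)


def patterns_by_group(
    rows: Iterable[Mapping[str, str]], value_col: str, group_col: str
) -> Dict[str, Counter]:
    """Count patterns of value_col separately per value of group_col."""
    groups: Dict[str, Counter] = defaultdict(Counter)
    for row in rows:
        groups[row[group_col]][to_pattern(row[value_col])] += 1
    return groups


phones = [
    {"phone": "+45 1234 5678", "country": "DK"},
    {"phone": "+45 8765 4321", "country": "DK"},
    {"phone": "(555) 010-9999", "country": "US"},
]
grouped = patterns_by_group(phones, "phone", "country")
```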
Something similar has been done for the Value Distribution analyzer; this allows for analyses such as:
· Are all city names distinct, when grouped by postal code?
· What is the distribution of gender within particular customer types?
· The Pattern finder results can now be shown in a chart. This makes the distribution visible and shows how much of a "long tail" of patterns there is.
The output of the value distribution analyzer has been improved in a couple of areas:
· The readability of the chart has been improved.
· It shows the total number of rows and the distinct count over these rows: the number of different values that exist in the rows. This helps in figuring out how often duplicate values exist.
· If there are empty strings, we use the keyword for it, so that it is easier to recognize them.
· Next to the already existing output formats (CSV files and H2 datastores) we added writing output to Excel spreadsheets.
· After writing to a datastore, it is now possible to preview the output, so that you can check whether the output is according to your expectations.
· It is now also possible to add the output as a new datastore, so that it can be used as input for a new job.
· Documentation has been generally improved. In particular, logging and command line interface descriptions have been added.
· The extension mechanism has been improved by modularizing several pieces of the application and introducing Google Guice as a generally available dependency injection framework for extension developers.
· And of course we did more than twenty small improvements and bug fixes.
June 27th, 2011
· The main driver for this release has been a story about extensibility. While releasing the application we are simultaneously releasing a new DataCleaner website which features an important new area: The ExtensionSwap. The idea of the ExtensionSwap is to allow sharing of extensions to DataCleaner and installation simply by clicking a button in the browser!
· The DataCleaner extension API has been improved a lot in this release, making it possible to create your own transformers, analyzers and filters. If you feel your extensions could be of interest to other users, please share them on the ExtensionSwap and we provide a channel for you to easily distribute them to thousands of users. The Extension API and the ExtensionSwap are further explained in our new webcast demonstration for developers and other techies with an interest.
· We are also releasing a set of initial extensions on the ExtensionSwap: The HIquality Contacts for DataCleaner extension which provides advanced Name, Phone and Email cleansing, based on Human Inference's natural language processing DQ web services. We are also shipping a sample extension which will serve as an example for developers wanting to try out extension development themselves. In the coming months we will make sure to post even more extensions originating from our internal portfolio of tools that we use at Human Inference's knowledge gathering teams.
· In addition to extensibility we are also focusing on embeddability. We want to be able to embed DataCleaner easily into other applications to make profiling and data analysis possible anywhere! We've created a new bootstrapping API which allows applications to bundle DataCleaner and bootstrap it with a dynamic configuration or run it in a "single datastore mode", where the application is tuned towards just inspecting a single datastore (typically defined by the application that embeds DataCleaner). We already have some really interesting cases of embedding DataCleaner in the works - both in other open source applications as well as commercial applications.
· We've added support for analyzing SAS data sets. This is something we're quite proud of as we are, to our knowledge, the first major open source application to provide such functionality, ultimately liberating a lot of SAS users. The SAS interoperability part was created as a separate project, SassyReader, so we expect to see adoption in DataCleaner's complementary open source communities soon too!
· We've also added support for another type of datastore: Fixed width files. Fixed width files are text files where each column has a fixed width. Unlike CSV files, there is no separator or quote character; instead, every line is equal in length and is tokenized according to a set of value lengths.
· An option to "fail on inconsistencies" was added to CSV file and fixed width file datastores. These flags add a format integrity check when using these text file based datastores.
· A bug was fixed, which caused CSV separator settings not to be retained in the user interface, when editing a CSV datastore.
· Japanese and other characters are now supported in the user interface. This "bug" was a matter of investigating available fonts on the system and selecting a font that can render the particular characters. On most modern systems there will be capable fonts available, but on some Unix/Linux branches there might still be limitations.
· The documentation section has been updated! Ever since the initial 2.0 release the documentation has been far behind, but we've finally managed to get it up to date. There are still pieces missing in the docs, but it should definitely be useful for basic usage as well as a reference for most topics.
· Application startup time was improved by parallelizing the configuration loading and by delaying the initialization of those parts of the configuration that are not needed for the initial window display.
· The phonetic similarity finder analyzer has been removed from the main distribution, as this was quite experimental and serves mostly as a proof of concept and an appetizer to the community to create more advanced matching analyzers. You can now find and install the phonetic similarity finder on the ExtensionSwap.
· Handling of cancelled or erroneous jobs was improved, and the user interface responds more correctly by disabling buttons and progress indicators if a job has stopped.
· Fixed a few minor UI issues pertaining to table sizing and use of scrollbars.
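The fixed width file format and the "fail on inconsistencies" option described in this release can be pictured with a short Python sketch. It is a simplified illustration, not DataCleaner's parser: the function name, the trimming of values and the exact integrity check (line length must equal the sum of the value lengths) are assumptions.

```python
from typing import List, Sequence


def parse_fixed_width(
    line: str, widths: Sequence[int], fail_on_inconsistencies: bool = False
) -> List[str]:
    """Tokenize one line according to a set of value lengths."""
    expected_length = sum(widths)
    if fail_on_inconsistencies and len(line) != expected_length:
        raise ValueError(
            f"expected {expected_length} characters, got {len(line)}"
        )
    values = []
    position = 0
    for width in widths:
        # Slice the next column and trim the padding (trimming is assumed)
        values.append(line[position:position + width].strip())
        position += width
    return values


line = "John".ljust(10) + "Doe".ljust(5) + "042"
fields = parse_fixed_width(line, [10, 5, 3], fail_on_inconsistencies=True)
```

Without the strict flag a short line simply yields shorter or empty trailing values; with it, the malformed line is rejected up front.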
May 16th, 2011
Enhancements:
· Added a search/filtering text field on the datastores list. This enables you to quickly find your datastore if you have registered more datastores than available on the screen.
· Reference data for country codes was added to the standard distribution; thanks go to Graham Rhind for providing these.
· Added a horizontal scroll bar to the data previewing windows if there are more than 10 columns.
· Ability to add an extension package with new functionality in the Options dialog at runtime. More focus on extensions will follow in the upcoming releases.
· We've exposed an early preview of our Command-Line Interface (CLI) by allowing you to invoke the application with the "-usage" parameter which will show the CLI options.
· Added number formatting options to the "Convert to Number" transformer.
· Fixed an out-of-memory issue when querying tables with a LOT of columns (150+).
· Fixed an issue that caused the "Limit analysis" check box to not be checked correctly when a job was re-opened after saving.
· Not really a bugfix as it was never an official feature, but now we support restoring user preferences (the userpreferences.dat file) from previous versions of DataCleaner.
April 6th, 2011
There was a lot of work done on the user interface (see media page):
· We decided to remove the left-hand side window containing environment configuration options.
· Instead all these options have now been moved to the job building window so the user only has to focus on a single window for all the interactions needed to build a job.
· The welcome/login dialog has also been removed in favor of a more discreet panel that can be pulled in or hidden from the main window.
· Datastore selection and management is considered the first activity in the application, which is why it is also the first step to handle in the main window.
· You can now stop jobs in case you decide to change something before it is done.
· Bar and line charts were added to a lot of the analysis result screens, including String analyzer, Number analyzer, Date/time analyzer and Weekday distribution (see media page).
· All "preview data" windows now contain paging controls so you can move backwards and forwards in the data set.
· Most common database drivers (MySQL, PostgreSQL, Oracle, MS SQL Server and Sybase) have been added to a default set of drivers.
· Configuration of the Quick analysis function in the Options dialog.
· Various minor bugfixes.
· Transformer for extracting date parts (year, month, day etc.) from date columns.
March 7th, 2011
· Tabs and buttons in the workbench are disabled when no source columns have been selected.
· A special widget has been added to the "Source" tab, making it very easy to apply row count based sampling of the input data.
· When possible, filters now have the ability to optimize the query of a job (aka. Push-down optimization). This was implemented for the "Max rows", "Equals" and "Not null" filters.
· The growing number of transformers caused a long list in the "Add transformer" popup. Therefore transformers are now grouped by category and displayed accordingly.
· The visualization of execution flow now allows removing column items and filter outcome items, making the graph more comprehensible, especially for very large jobs.
· The "Coalesce string" transformer now has a "Consider empty strings as null" flag, which is particularly useful when dealing with CSV files.
· Text-based dictionaries and synonym catalogs will get their cached values flushed, if the file they read from changes.
· The "Convert to date" transformer now includes the ability to specify your own date masks, if date strings require it.
· A bug was fixed when passing null values to the email standardizer.
· A bug was fixed pertaining to proper presentation of "mixed" tokens in the Pattern finder.
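The push-down optimization mentioned in this release means that a filter is expressed in the query itself, so the database does the work instead of the application. A rough Python sketch using the standard sqlite3 module; the function build_query and the mapping of "Equals", "Not null" and "Max rows" to WHERE and LIMIT clauses are assumptions for illustration, not DataCleaner's generated SQL.

```python
import sqlite3
from typing import Any, Dict, List, Optional, Sequence, Tuple


def build_query(
    table: str,
    columns: Sequence[str],
    equals: Optional[Dict[str, Any]] = None,
    not_null: Optional[Sequence[str]] = None,
    max_rows: Optional[int] = None,
) -> Tuple[str, List[Any]]:
    """Push filters down into SQL. Identifiers must be trusted names."""
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    clauses: List[str] = []
    params: List[Any] = []
    for column, value in (equals or {}).items():
        clauses.append(f"{column} = ?")  # "Equals" filter
        params.append(value)
    for column in not_null or []:
        clauses.append(f"{column} IS NOT NULL")  # "Not null" filter
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    if max_rows is not None:
        sql += " LIMIT ?"  # "Max rows" filter
        params.append(max_rows)
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ann", "DK"), (None, "DK"), ("Cid", "DK"), ("Dee", "NL")],
)
sql, params = build_query(
    "customers", ["name"], equals={"country": "DK"}, not_null=["name"], max_rows=1
)
result = conn.execute(sql, params).fetchall()
```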
February 21st, 2011
· Filter outcomes were added to the flow visualization.
· A bug was fixed in the widget for selecting the tokenizer's separators.
· The "Equals" filter can now have multiple values to compare with.
· Some minor cosmetic improvements.
February 14th, 2011
· Data transformations, allowing you to preprocess, extract, refine, combine and calculate data items as a part of your data profiling jobs.
· Filtering, sampling and subflow management, allowing you to define criteria to exclude and include particular items of data.
· Richer reporting with charts, graphs, navigation trees and more.
· A bunch of new data quality functions for date gap analysis, phonetic similarity finding, synonym lookups and more.
· More configuration options and added data quality measures for existing data quality functions like the Pattern finder, String analyzer and more.
· Reusable profiling jobs, where you define your processing flow once and consequently run it on any data.
· Support for MS Excel 2007+ spreadsheets.
October 19th, 2009
· Improved Excel spreadsheet support.
· Improved SQL Server support.
· Improved performance for CSV files.
· A fix for a bug that caused re-opening of database dictionaries to throw an NPE.
· A fix for a bug related to dictionary lookups of null values.
· Support for Teradata databases. Connection templates for SQL Server connections.
· Selection of file encoding when reading CSV files.
· A fix for a minor bug relating to reading files on the classpath when running in Java WebStart mode.
July 14th, 2009
· The most notable improvement is in the Value Distribution Profile. Previously this profile caused quite a lot of memory problems which could result in out-of-memory errors in extreme cases. This has been fixed by using on-disk caching with Berkeley DB when necessary.
· Another notable feature is that we can now distribute DataCleaner as a single JAR file. This means that we will be serving the application as a Java WebStart application (i.e. run it as if it's an online application) and we are also considering other distribution options.
· When starting the application, it automatically downloads regular expressions from the RegexSwap.
· A bug in regards to dictionary-matching number-based columns was reported and fixed.
· A bug in regards to invalid characters in XML-export formats was reported and fixed.
· When opening files, we are now ignoring suffix case so that .CSV files can be opened as well as .csv.
· The number of columns shown in the preview window is automatically restricted if there are too many to show on a single screen.
April 21st, 2009
· An additional HTML export format has been added to the built-in export formats (usable when exporting Profiler results in the desktop app and when executing the runjob command-line tool).
· The export format is now choosable directly in the desktop app.
· Four new measures were added to the String Analysis profile: avg. chars and max/min/avg white spaces.
March 16th, 2009
· The license was changed to LGPL.
· The profiler and validator can be executed using multiple threads.
· DataCleaner tasks can be executed from the command line for batch operation.
· More elaborate status information is given during profiler and validator execution.
· Date mask matcher and regex matcher profiles were added.
· A regex is loaded from the online RegexSwap repository.
· Popular database drivers are automatically downloaded and installed.
· More file types are supported, such as .dat and .txt.
· XML file support was improved.
· Memory improvements were made in the Time analysis profile.
· Logging when running profiling and validation was improved.
· An information schema is provided for file-based datastores.
· Columns in the datastore-tree are lazy-loaded.
January 23rd, 2009
· This release adds multi-threaded execution, a command-line interface (runjob.sh/runjob.cmd), some UI updates, and a few bugfixes.
September 22nd, 2008
· The "Repeated values" profile was replaced with the better and more advanced "Value distribution" profile.
· Drill-to-details options were added for Dictionary Matcher profile.
· A new application logo was made.
· Lots of small bugfixes and UI beautifications were done.
· Lots of sample dictionaries and regexes were added.