LanguageTool Changelog

New in version 2.7

October 14th, 2014
  • Breton:
  • added and improved rules
  • New rule that checks if a weekday matches a date, e.g. detects "Gwener 28 a viz Eost 2014", as that date isn't a Friday.
  • Catalan:
  • added and improved rules
  • fixed false alarms
  • Dutch:
  • added and improved many rules
  • switched to Morfologik-based spell checker
  • English:
  • Do you want to be part of the team that develops the world's most powerful Open Source proofreading tool? We're looking for a maintainer for the English rules in LanguageTool. See http://wiki.languagetool.org/tasks-for-language-maintainers for details.
  • All English dictionaries have been extended to contain word frequency classes to improve the spell checker suggestions (the frequency data is taken from https://github.com/mozilla-b2g/gaia/tree/master/apps/keyboard/js/imes/latin/dictionaries, as for other languages that already use this feature).
  • Better suggestions for English learners: irregular verbs, nouns, and adjectives now usually have a better suggestion. For example, 'thinked' suggests 'thought', 'womans' suggests 'women'.
  • More misspellings provide suggestions now, e.g. 'garentee' (guarantee), 'greatful' (grateful). This may cause a performance decrease of ~ 10% (more for texts with a lot of unknown words).
  • New rule that checks if a weekday matches a date, e.g. detects "Monday, 7 October 2014", as that date isn't a Monday. This rule will only work if it detects the date format in use. So far, these formats are supported (this also works with abbreviated weekdays like Mo or Mon for Monday):
    * "Monday, 7 October 2014"
    * "Monday, 7 Oct 2014"
    * "Monday, October 7, 2014"
    * "Monday, Oct 7, 2014"
  • Esperanto:
  • New rule that checks if a weekday matches a date, e.g. detects "Vendredon la 28-an de Aŭgusto 2014", as that date isn't a Friday.
  • French:
  • updated POS tag dictionary and Hunspell dictionary to Dicollecte-5.2
  • added a synthesizer - the agreement rule can now make suggestions for some errors
  • added/improved several rules
  • New rule that checks if a weekday matches a date, e.g. detects "vendredi 28/08/2014", as that date isn't a Friday.
  • German:
  • Fixed a rare NullPointerException and an ArrayIndexOutOfBoundsException
  • Fixed several false alarms
  • Added and improved rules
  • New rule that checks for sentences without a verb (turned off by default due to the risk of false alarms)
  • New rule that checks if a weekday matches a date, e.g. detects "Dienstag, 29.9.2014", as that date isn't a Tuesday.
  • Performance improvements for spell check suggestions
  • Persian:
  • added initial support for Persian (Farsi)
  • Polish:
  • added and improved some rules
  • new rule that checks if a weekday matches a date
  • Portuguese:
  • added/improved several rules
  • added many dozens of compound words
  • Russian:
  • added new rules
  • fix SourceForge feature request #38 (check for different quotation marks)
  • added a few false friend rules (Russian/English)
  • new rule that checks if a weekday matches a date, e.g. detects "понедельник, 30 сентября 2014 г.", as that date isn't a Monday.
  • expanded Russian compound rule with new words from postag dictionary
  • Spanish:
  • Added new POS category Z (for spelled numbers, e.g. 'uno', 'dos', ...)
  • Spelled numbers can now be detected and managed both in disambiguation and rules.
  • Fixed some incorrect lemmas in POS dictionary.
  • Added Hybrid chunker-disambiguator.
  • Tamil:
  • Added initial support for Tamil. If the font for Tamil is not properly displayed on your computer and you're using Windows, you might need to apply the workaround described here: https://bugs.openjdk.java.net/browse/JDK-8008572
  • Ukrainian:
  • big update for POS dictionary (fixes and new words)
  • some POS tags renamed for consistency; new tags for abbreviations and rare words
  • many new rules and fixes for existing rules
  • new rule that checks if a weekday matches a date, e.g. detects "понеділок, 7 жов 2014", as that date isn't a Monday
  • token normalization performance improvement
  • LibreOffice integration:
  • Don't get confused by footnotes in LibreOffice 4.3 and later (it now provides us with the footnote positions as meta data, so we can ignore them).
  • API:
  • Major performance improvements for the multi-thread use case, where JLanguageTool gets created per thread, but the language object (e.g. 'German') gets created only once. Overhead for creating JLanguageTool should now be much lower.
  • Removed several classes and methods that had been deprecated since version 2.6
  • Removed DutchSpellerRule - use MorfologikDutchSpellerRule instead
  • The signature of Language.getRelevantRules() has changed
  • The JLanguageTool and MultiThreadedJLanguageTool constructors don't declare to throw an IOException anymore
  • WhitespaceRule has been renamed to MultipleWhitespaceRule (WhitespaceRule still exists but has been deprecated)
  • Deprecated some methods whose visibility will be decreased (e.g. from public to protected)
  • MorfologikSpellerRule.getRuleMatch(String, int) has been renamed to MorfologikSpellerRule.getRuleMatches(String, int)
  • The RuleMatch constructor now throws an exception if toPosition is not larger than fromPosition
  • Introduced a new abstract class TextLevelRule that extends Rule and that can be used for rules that cover more than single sentences.
  • Command line:
  • Enabling and disabling specific rules at the same time is now allowed. In order to test only some rules (disabling all the rest), which previously was done with "--enable LIST_OF_RULES", now use "--enabledonly --enable LIST_OF_RULES" (or "-eo -e LIST_OF_RULES").
  • Embedded server:
  • Two new options can be set in the properties file to make LanguageTool return the same XML format as After the Deadline (AtD), so that it can be used as a drop-in replacement for AtD:
    * mode - 'LanguageTool' or 'AfterTheDeadline'
    * afterTheDeadlineLanguage - code of the default language if mode is set to 'AfterTheDeadline'
    NOTE: the 'AfterTheDeadline' mode should be considered experimental for now.
  • The new option 'maxCheckThreads' allows setting the maximum number of threads working on requests in parallel. The default is 10, as it used to be.
  • Internals:
  • New abstract rule AbstractDateCheckFilter that checks whether a weekday and a date match. For example, "Tuesday, September 29, 2014" can be detected as an error, as September 29, 2014 is not actually a Tuesday. This uses the new experimental RuleFilter interface that can be called from XML with the new 'filter' element. 'filter' takes these attributes:
    * 'class': the fully-qualified name of a Java class that implements RuleFilter, e.g. "org.languagetool.rules.de.DateCheckFilter"
    * 'args': a string like "year:\1 month:\2 day:\3 weekDay:\4", i.e. a space-separated list of key/value pairs, where \x gets resolved to the pattern's token value (as in the 'message' element)
  • The compound rule now ignores tokens that have been immunized in the disambiguation.xml
  • The "filter" action in the disambiguator is now applied only to POS tags that match the POS tag given. If they don't match, the rule is not applied.
  • If you're extending the XML rules as described at http://wiki.languagetool.org/tips-and-tricks#toc2, the external rule and disambiguation files can now be hosted on a password-protected server by specifying a URL like this: http://user:password@example.org/path/user-rules.xml
  • The em dash ("—") is now a tokenizing character for all languages
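
The weekday/date check and the 'args' format described under "Internals" can be illustrated with a small stand-alone sketch. This is not LanguageTool's AbstractDateCheckFilter: the class and method names (DateCheckSketch, parseArgs, isMismatch) are hypothetical, only English weekday names are handled, and java.time is used for brevity, which assumes Java 8 or later. The input string shows what the arguments might look like after \1 to \4 have been resolved to token values.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch only; all names here are hypothetical. */
final class DateCheckSketch {

  /** Parses a space-separated list of key:value pairs, e.g. "year:2014 month:9". */
  static Map<String, String> parseArgs(String args) {
    Map<String, String> result = new HashMap<>();
    for (String pair : args.trim().split("\\s+")) {
      int colon = pair.indexOf(':');
      if (colon <= 0) {
        throw new IllegalArgumentException("Expected key:value but got '" + pair + "'");
      }
      result.put(pair.substring(0, colon), pair.substring(colon + 1));
    }
    return result;
  }

  /** Returns true if the stated weekday differs from the calendar weekday of the date. */
  static boolean isMismatch(Map<String, String> args) {
    LocalDate date = LocalDate.of(
        Integer.parseInt(args.get("year")),
        Integer.parseInt(args.get("month")),
        Integer.parseInt(args.get("day")));
    // Assumption: the weekday is a full English name such as "Tuesday".
    DayOfWeek stated = DayOfWeek.valueOf(args.get("weekDay").toUpperCase(Locale.ROOT));
    return date.getDayOfWeek() != stated;
  }

  public static void main(String[] unused) {
    Map<String, String> args = parseArgs("year:2014 month:9 day:29 weekDay:Tuesday");
    // September 29, 2014 was a Monday, so this prints "true".
    System.out.println(isMismatch(args));
  }
}
```

A real filter would additionally have to map localized and abbreviated weekday and month names to numbers, which is why each language needs its own subclass.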
  • New feature:
  • Use of language models
  • LanguageTool can now make use of ngram data. ngram data is information about how often phrases occur in a language. Currently, this uses phrases of length 3.
  • The data is used by an English rule to find homophone errors, like mixing up coarse/course or flair/flare. LanguageTool had some rules of this kind before, but the new rule now supports about 900 of such word pairs/sets.
  • The data needed for this is huge (7GB for English) and thus not part of LanguageTool.
  • The data (English only for now) and more documentation are available at http://wiki.languagetool.org/finding-errors-using-big-data
  • Using ngrams makes LanguageTool slightly slower when the data is stored on an SSD.
  • If not stored on an SSD, the performance might drastically decrease.
  • Use the new --languagemodel option with the command line client to activate the rule that uses the data. That option is not yet available for the stand-alone GUI.
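
How ngram counts can help with homophone pairs such as coarse/course can be sketched as follows. This is a simplified illustration, not the algorithm LanguageTool actually uses: it looks only at the one 3-gram centered on the word in question, the class name TrigramConfusionSketch and the 'factor' threshold are hypothetical, and the counts in main() are made-up toy values rather than real corpus data.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; all names and numbers here are hypothetical. */
final class TrigramConfusionSketch {

  private final Map<String, Long> counts;

  TrigramConfusionSketch(Map<String, Long> counts) {
    this.counts = counts;
  }

  private long count(String a, String b, String c) {
    Long n = counts.get(a + " " + b + " " + c);
    return n == null ? 0L : n;
  }

  /**
   * Returns the alternative if the 3-gram containing it is at least 'factor'
   * times more frequent than the 3-gram containing the original word,
   * otherwise null.
   */
  String suggest(List<String> tokens, int pos, String alternative, long factor) {
    if (pos < 1 || pos > tokens.size() - 2) {
      return null; // one token of context is needed on each side
    }
    long original = count(tokens.get(pos - 1), tokens.get(pos), tokens.get(pos + 1));
    long replaced = count(tokens.get(pos - 1), alternative, tokens.get(pos + 1));
    return replaced >= factor * Math.max(original, 1L) ? alternative : null;
  }

  public static void main(String[] unused) {
    Map<String, Long> toyCounts = new HashMap<>();
    toyCounts.put("of course not", 1000L); // made-up count
    toyCounts.put("of coarse not", 2L);    // made-up count
    TrigramConfusionSketch sketch = new TrigramConfusionSketch(toyCounts);
    System.out.println(sketch.suggest(Arrays.asList("of", "coarse", "not"), 1, "course", 10));
  }
}
```

With the toy counts above, "of coarse not" leads to the suggestion "course", while the correct "of course not" produces no suggestion. Every lookup hits the ngram index, which is consistent with the note above that storage speed matters for performance.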

New in version 2.4.1 (January 10th, 2014)

  • Updated Morfologik libraries to 1.8.3 to fix slow suggestions in the spell checker, which affected at least en-US

New in version 2.4 (January 3rd, 2014)

  • Breton:
  • SRX sentence tokenization
  • added/improved a few rules
  • fixed some false alarms
  • fixed incorrect suggestions thanks to added tests on corrections
  • Catalan:
  • added/improved several rules
  • fixed false alarms
  • made additions and fixes to the tagger dictionary
  • removed some words from synthesis dictionary (see filterarchaic.txt)
  • added frequency data to the tagger dictionary; the frequency wordlist comes from the Gaia project, with Apache License, version 2.0 (https://github.com/mozilla-b2g/gaia/tree/master/keyboard/dictionaries).
  • English:
  • added/improved a few rules
  • fixed some false alarms
  • French:
  • added/improved several rules
  • fixed some false alarms
  • German:
  • added/improved several rules
  • added a synthesizer - the agreement rule can now make suggestions for some errors (not all suggestions are correct, though)
  • Polish:
  • added/improved several rules, especially for hyphen and dash usage
  • added frequency information for the spellchecking dictionary; the frequency wordlist comes from the Gaia project, with Apache License, version 2.0 (https://github.com/mozilla-b2g/gaia/tree/master/keyboard/dictionaries).
  • fixed some false alarms
  • Portuguese:
  • added/improved several rules (it now includes gender rules "a"/"o")
  • it now has 3400+ compound words
  • the JAR file has been renamed from languagetool-standalone.jar to languagetool.jar to avoid confusion about what 'standalone' means in this context (github issue #29)
  • for languages with many rules (like French or German) performance on long texts has been increased by about 20-30%
  • fix for thread-safety (could cause hang in MultiWordChunker)
  • fixed a bug where chunk annotations were not tested in groups
  • fix: \1 and had not been evaluated in ...
  • fixed a bug in the unification mechanism that discarded some of the matching interpretations prematurely
  • added support for chunk annotations in the disambiguator, and fixed one bug in filtering tokens with chunk annotations
  • updated Morfologik libraries to 1.8.2 (bug fixes, stricter input sanity checking, add frequency data to dictionaries)
  • added the option of including frequency data in tagging or spelling dictionaries. The expected format of the frequency wordlists is the one in the Gaia project, with Apache License, version 2.0 (https://github.com/mozilla-b2g/gaia/tree/master/keyboard/dictionaries)
  • new command line tools to export and create binary dictionaries:
  • org.languagetool.dev.DictionaryExporter
  • org.languagetool.dev.POSDictionaryBuilder
  • LibreOffice/OpenOffice integration:
  • added a workaround for incorrect sentence detection for the case that a footnote appeared after a sentence full stop (Sourceforge bug #191)
  • standalone GUI:
  • The dialog opened by the "More..." item in the context menu of an error will now also display correct and incorrect example sentences
  • API:
  • SentenceTokenizer is now an interface, the implementation has been moved to RegexSentenceTokenizer, but this is deprecated and SRXSentenceTokenizer should be used instead
  • Some methods from org.languagetool.tools.StringTools have been moved to the org.languagetool.gui.Tools class in the languagetool-gui-commons project
  • LanguageToolListener.languageToolEventOccured() has been renamed to LanguageToolListener.languageToolEventOccurred()
  • org.languagetool.tools.SymbolLocator isn't public anymore (shouldn't affect anybody)
  • removed DanishSentenceTokenizer which had been deprecated for three years
  • Rule.getCorrectExamples() and Rule.getIncorrectExamples() don't return null anymore but an empty list if there are no examples. Consequently, setCorrectExamples() and setIncorrectExamples() don't accept null anymore.
  • Rule.getId() may return any string now, not just ASCII-only strings (actually this has been the case before, as the ASCII-only restriction was never enforced and only mentioned in the javadoc)
  • languagetool-wikipedia: the command line options for checking a Wikipedia dump have been simplified. The command can now be called like this:
      java -jar languagetool-wikipedia.jar check-data -l en -f enwiki-20130621-pages-articles.xml
    Call just "java -jar languagetool-wikipedia.jar check-data" to get a usage message. More than one file can be specified with the -f option. In addition to Wikipedia XML dumps, CSV files from Tatoeba (http://tatoeba.org) are now also supported; they need to be filtered first to contain only the relevant language.

New in version 2.3 (October 4th, 2013)

  • Breton:
  • added/improved a few rules
  • fixed false alarms
  • updated POS dictionary from Apertium (svn r47282)
  • Catalan:
  • added support for language code ca-ES-valencia (Catalan Valencian), to be used in LibreOffice 4.2.0
  • added a simple replace rule with hundreds of replacement suggestions
  • added/improved several rules
  • fixed false alarms
  • Chinese:
  • added a workaround for a StringIndexOutOfBoundsException (http://sourceforge.net/p/languagetool/bugs/186/)
  • English:
  • added replacement patterns for the spelling checker to make suggestions better (now offers 'taught' for 'teached')
  • added/improved a few rules
  • French:
  • added/improved a few rules
  • fixed false alarms
  • updated POS tag dictionary and Hunspell dictionary to Dicollecte-4.12
  • German:
  • added/improved several rules
  • Portuguese:
  • added/improved a few rules
  • it now has 3300+ compound words
  • Ukrainian:
  • added/improved several rules
  • the source code has been moved to github: https://github.com/languagetool-org/languagetool
  • LanguageTool requires Java 7 now
  • LanguageTool makes use of multiple threads now for text checking on modern hardware, improving performance (this affects the stand-alone version, the command line version and the LibreOffice/OpenOffice extension)
  • Rule syntax:
  • preliminary support for new min/max attributes that allow matching an element that appears the given number of times. For example:
    * <token min="0">foo</token> will match nothing or "foo", i.e. "foo" is optional
    * <token max="2">foo</token> will match "foo" or "foo foo"
    * <token min="0" max="2">foo</token> will match nothing, "foo", or "foo foo"
    Use max="-1" to allow unlimited occurrences. For min, only 0 or 1 is supported (1 is the default).
  • support for OR-statements. For example: a Internally and in run-time, a rule containing OR-statements is converted into several rules without OR-statements.
  • English now has a chunker to detect, amongst others, singular and plural noun chunks. This is documented at http://wiki.languagetool.org/using-chunks
  • standalone version:
  • The standalone version now underlines errors with a red (spelling errors) or blue (other errors) line (Panagiotis Minos)
  • Remember the language selection for the next start
  • Improved window and dialog placement in a multi-monitor setup
  • embedded server: uses default port (8081) again if started without arguments
  • updated the morfologik-stemming library to version 1.7.1 to enable better suggestions, including proper handling of diacritics and replacement patterns (equivalents of MAP and REP features in hunspell dictionaries)
  • OpenOffice/LibreOffice integration:
  • fix: the "About" dialog didn't work in Apache OpenOffice 4.0
  • fix: country specific rules (like for British English) didn't work
  • API:
  • In class Language, getCountryVariants() has been renamed to getCountries(), and a new method getVariant has been added.
  • Some methods have been deprecated
  • Some methods have been moved from the Tools class (languagetool-core) to the new CommandLineTools class (languagetool-commandline)
  • AbstractRuleDisambiguator has been renamed XmlRuleDisambiguator and is not abstract anymore. The RuleDisambiguator classes have been removed, XmlRuleDisambiguator can be used directly instead.
  • A new method JLanguageTool.check(AnnotatedText) has been introduced that allows you to check text with markup. Use AnnotatedTextBuilder to build up the input.
  • Thread-safety has been improved. The recommended use case is now to create a new JLanguageTool object for each thread, but to create the language only once (e.g. new English()) and use that for all JLanguageTool instances. This changed the API of some public classes, but for the standard use case of checking texts with the JLanguageTool object it shouldn't make a difference. (patch by Stefan Lotties)
  • JLanguageTool.loadFalseFriendRules() now behaves like JLanguageTool.loadPatternRules(): it looks in the class path first, and then, if the given file is not found there, in the filesystem
  • Introduced the Chunker interface that can assign chunks (also known as phrases) to tokens. For example, for noun phrases like "a fast computer" the chunker could assign an 'NP-singular' (noun phrase, singular) chunk to each of the tokens in that phrase. In the grammar.xml, such a token can then be matched with this syntax:
  • The new class MultiThreadedJLanguageTool makes use of as many threads as the computer has processors. In our tests this has improved text checking time by about 70% on an Intel i7 processor when used on 30KB text.
  • AnalyzedTokenReadings now implements Iterable so it can be used in foreach loops
  • AnalyzedGermanTokenReadings has been removed, AnalyzedTokenReadings can be used instead
  • Embedded HTTP server: the server now uses 10 threads instead of 1 (thanks to Panagiotis Minos)
  • text extraction from Wikipedia dumps has been improved
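
The min/max semantics listed under "Rule syntax" can be spelled out in a few lines. This is a sketch under assumptions, not LanguageTool's pattern matcher: it assumes the three cases in that entry correspond to min=0, to max=2, and to min=0 combined with max=2, and the class and method names (MinMaxSketch, matches) are hypothetical. It only answers whether a token sequence consists of the given word repeated an allowed number of times.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative sketch only; all names here are hypothetical. */
final class MinMaxSketch {

  /**
   * Returns true if 'tokens' consists solely of 'word', repeated at least
   * 'min' and at most 'max' times. A 'max' of -1 means unlimited.
   */
  static boolean matches(List<String> tokens, String word, int min, int max) {
    if (min != 0 && min != 1) {
      throw new IllegalArgumentException("For min, only 0 or 1 is supported");
    }
    for (String token : tokens) {
      if (!token.equals(word)) {
        return false;
      }
    }
    int occurrences = tokens.size();
    return occurrences >= min && (max == -1 || occurrences <= max);
  }

  public static void main(String[] unused) {
    List<String> none = Collections.emptyList();
    List<String> twice = Arrays.asList("foo", "foo");
    System.out.println(matches(none, "foo", 0, 2));  // optional element: true
    System.out.println(matches(none, "foo", 1, 2));  // min defaults to 1: false
    System.out.println(matches(twice, "foo", 1, 2)); // "foo foo" with max 2: true
  }
}
```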

New in version 2.2 (July 1st, 2013)

  • Many error detection rules have been updated, especially for French, Catalan, German, Portuguese, Russian, Esperanto, and Breton.
  • Several small bugs have been fixed.

New in version 2.1 (April 2nd, 2013)

  • This version adds many updates for the error detection rules for English, French, German, Portuguese, Catalan, Polish, Russian, Breton, Esperanto, and Italian.
  • LanguageTool is now modular, for easier use by Java developers.
  • Instead of one big JAR, there are now several small ones (soon to be found at Maven Central).
  • Several bugs have been fixed.

New in version 2.0 (January 3rd, 2013)

  • Many updates for the error detection rules for English, Spanish, French, German, Portuguese, Russian, Breton, Catalan, Esperanto, and Ukrainian have been added.
  • The embedded HTTP server can now be started from the context menu if LanguageTool is running in the system tray.
  • Some small bugs have been fixed.

New in version 1.9 (October 1st, 2012)

  • Many new error detection rules have been added and existing rules have been updated. The languages most affected are Danish, German, English, Catalan, Russian, Chinese, French, Breton, Portuguese, and Esperanto.
  • There is initial support for Japanese, with about 20 rules. Several bugs have been fixed.

New in version 1.8 (July 2nd, 2012)

  • Spell checking is now included in the LanguageTool stand-alone version (i.e. not used in LibreOffice/OpenOffice).
  • Many error detection rules have been improved and new rules have been added, especially for German, English, Catalan, Italian, French, Breton, Polish, and Esperanto.
  • Initial support for Greek and Portuguese with a few rules has been added. LanguageTool now supports language variants like British English, American English, Swiss German, etc.
  • Several bugs have been fixed.