Version 3.8.1 2023-01-02

* Resolve RCE vulnerability in localhost WordNet Browser (#3100)
* Remove unused tool scripts (#3099)
* Resolve XSS vulnerability in localhost WordNet Browser (#3096)
* Add Python 3.11 support (#3090)

Thanks to the following contributors to 3.8.1:
Francis Bond, John Vandenberg, Tom Aarsen

Version 3.8 2022-12-12

* Refactor dispersion plot (#3082)
* Provide type hints for LazyCorpusLoader variables (#3081)
* Throw warning when LanguageModel is initialized with incorrect vocabulary (#3080)
* Fix WordNet's all_synsets() function (#3078)
* Resolve TreebankWordDetokenizer inconsistency with end-of-string contractions (#3070)
* Support both iso639-3 codes and BCP-47 language tags (#3060)
* Avoid DeprecationWarning in Regexp tokenizer (#3055)
* Fix many doctests, add doctests to CI (#3054, #3050, #3048)
* Fix bool field not being read in VerbNet (#3044)
* Greatly improve time efficiency of SyllableTokenizer when tokenizing numbers (#3042)
* Fix encodings of Polish udhr corpus reader (#3038)
* Allow TweetTokenizer to tokenize emoji flag sequences (#3034)
* Prevent LazyModule from increasing the size of nltk.__dict__ (#3033)
* Fix CoreNLPServer non-default port issue (#3031)
* Add "acion" suffix to the Spanish SnowballStemmer (#3030)
* Allow loading WordNet without OMW (#3026)
* Use input() in nltk.chat.chatbot() for Jupyter support (#3022)
* Fix edit_distance_align() in distance.py (#3017)
* Tackle performance and accuracy regression of sentence tokenizer since NLTK 3.6.6 (#3014)
* Add the Iota operator to semantic logic (#3010)
* Resolve critical errors in WordNet app (#3008)
* Resolve critical error in CHILDES Corpus (#2998)
* Make WordNet information_content() accept adjective satellites (#2995)
* Add "strict=True" parameter to CoreNLP (#2993, #3043)
* Resolve issue with WordNet's synset_from_sense_key (#2988)
* Handle WordNet synsets that were lost in mapping (#2985)
* Resolve TypeError in Boxer (#2979)
* Add function to retrieve WordNet synonyms (#2978)
* Warn about nonexistent OMW offsets instead of raising an error (#2974)
* Fix missing ic argument in res, jcn and lin similarity functions of WordNet (#2970)
* Add support for the extended OMW (#2946)
* Fix LC cutoff policy of text tiling (#2936)
* Optimize ConditionalFreqDist.__add__ performance (#2939)
* Add Markdown corpus reader (#2902)

Thanks to the following contributors to 3.8:
Alexandre Perez-Lebel, David Lukes, Eric Kafe, Fernando Carranza, Heungson Lee,
Hoyeol Kim, James Huang, Jelle Zijlstra, Louis-Justin Tallot, M.K. Pawelkiewicz,
Jan Lennartz, Malinda Dilhara, Martin Kondratzky, Rob Malouf, Saud Kadiri,
Siddhesh Mhadnak, Stephan Hasler, Steve Smith, Tom Aarsen, Tyler Sheaffer,
Yue Zhao, cestwc, elespike, purificant, richardyy1188

Version 3.7 2022-02-09

* Improve and update the NLTK team page on nltk.org (#2855, #2941)
* Drop support for Python 3.6, support Python 3.10 (#2920)

Thanks to the following contributors to 3.7:
Tom Aarsen

Version 3.6.7 2021-12-28

* Resolve IndexError in `sent_tokenize` and `word_tokenize` (#2922)

Thanks to the following contributors to 3.6.7:
Tom Aarsen

Version 3.6.6 2021-12-21

* Refactor `gensim.doctest` to work for gensim 4.0.0 and up (#2914)
* Add Precision, Recall, F-measure, Confusion Matrix to Taggers (#2862)
* Added warnings if .zip files exist without any corresponding .csv files. (#2908)
* Fix `FileNotFoundError` when the `download_dir` is a non-existing nested folder (#2910)
* Rename omw to omw-1.4 (#2907)
* Resolve ReDoS opportunity by fixing incorrectly specified regex (#2906)
* Support OMW 1.4 (#2899)
* Deprecate Tree get and set node methods (#2900)
* Fix broken inaugural test case (#2903)
* Use Multilingual Wordnet Data from OMW with newer Wordnet versions (#2889)
* Keep NLTK's "tokenize" module working with pathlib (#2896)
* Make prettyprinter more readable (#2893)
* Update links to the nltk book (#2895)
* Add `CITATION.cff` to nltk (#2880)
* Resolve serious ReDoS in PunktSentenceTokenizer (#2869)
* Delete old CI config files (#2881)
* Improve Tokenize documentation + add TokenizerI as superclass for TweetTokenizer (#2878)
* Fix expected value for BLEU score doctest after changes from #2572
* Add multi Bleu functionality and tests (#2793)
* Deprecate 'return_str' parameter in NLTKWordTokenizer and TreebankWordTokenizer (#2883)
* Allow empty string in CFG's + more (#2888)
* Partition `tree.py` module into `tree` package + pickle fix (#2863)
* Fix several TreebankWordTokenizer and NLTKWordTokenizer bugs (#2877)
* Rewind Wordnet data file after each lookup (#2868)
* Correct __init__ call for SyntaxCorpusReader subclasses (#2872)
* Documentation fixes (#2873)
* Fix Levenshtein distance for duplicated letters (#2849)
* Support alternative Wordnet versions (#2860)
* Remove hundreds of formatting warnings for nltk.org (#2859)
* Modernize `nltk.org/howto` pages (#2856)
* Fix Bleu Score smoothing function from taking log(0) (#2839)
* Update third party tools to newer versions and remove MaltParser fixed version (#2832)
* Fix TypeError: _pretty() takes 1 positional argument but 2 were given in sem/drt.py (#2854)
* Replace `http` with `https` in most URLs (#2852)

Thanks to the following contributors to 3.6.6:
Adam Hawley, BatMrE, Danny Sepler, Eric Kafe, Gavish Poddar, Panagiotis Simakis,
RnDevelover, Robby Horvath, Tom Aarsen, Yuta Nakamura, Mohaned Mashaly

Version 3.6.5 2021-10-11

* modernised nltk.org website
* addressed LGTM.com issues
* support ZWJ sequences emoji and skin tone modifier emoji in TweetTokenizer
* METEOR evaluation now requires pre-tokenized input
* Code linting and type hinting
* implement get_refs function for DrtLambdaExpression
* Enable automated CoreNLP, Senna, Prover9/Mace4, Megam, MaltParser CI tests
* specify minimum regex version that supports regex.Pattern
* avoid re.Pattern and regex.Pattern which fail for Python 3.6, 3.7

Thanks to the following contributors to 3.6.5:
Tom Aarsen, Saibo Geng, Mohaned Mashaly, Dimitri Papadopoulos, Danny Sepler,
Ahmet Yildirim, RnDevelover, yutanakamura

Version 3.6.4 2021-10-01

* deprecate `nltk.usage(obj)` in favor of `help(obj)`
* resolve ReDoS vulnerability in Corpus Reader
* solidify performance tests
* improve phone number recognition in tweet tokenizer
* refactored CISTEM stemmer for German
* identify NLTK Team as the author
* replace travis badge with github actions badge
* add SECURITY.md

Thanks to the following contributors to 3.6.4:
Tom Aarsen, Mohaned Mashaly, Dimitri Papadopoulos Orfanos, purificant, Danny Sepler

Version 3.6.3 2021-09-19
* Dropped support for Python 3.5
* Run CI tests on Windows, too
* Moved from Travis CI to GitHub Actions
* Code and comment cleanups
* Visualize WordNet relation graphs using Graphviz
* Fixed large error in METEOR score
* Apply isort, pyupgrade, black, added as pre-commit hooks
* Prevent debug_decisions in Punkt from throwing IndexError
* Resolved ZeroDivisionError in RIBES with dissimilar sentences
* Initialize WordNet IC total counts with smoothing value
* Fixed AttributeError for Arabic ARLSTem2 stemmer
* Many fixes and improvements to lm language model package
* Fix bug in nltk.metrics.aline, C_skip = -10
* Improvements to TweetTokenizer
* Optional show arg for FreqDist.plot, ConditionalFreqDist.plot
* edit_distance now computes Damerau-Levenshtein edit-distance

Thanks to the following contributors to 3.6.3:
Tom Aarsen, Abhijnan Bajpai, Michael Wayne Goodman, Michał Górny, Maarten ter Huurne,
Manu Joseph, Eric Kafe, Ilia Kurenkov, Daniel Loney, Rob Malouf, Mohaned Mashaly,
purificant, Danny Sepler, Anthony Sottile
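
Note on the Damerau-Levenshtein entry in 3.6.3: compared with plain
Levenshtein distance, it counts a swap of two adjacent characters as a
single edit. The sketch below shows the restricted ("optimal string
alignment") variant of that idea in standalone Python. The function name
osa_distance is hypothetical and chosen for illustration only; this is
not NLTK's implementation and may differ from it in details.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Illustrative sketch only; not taken from NLTK.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] holds the cost of turning a[:i] into b[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        dist[0][j] = j  # insert every character of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: "ab" <-> "ba" counts as one edit.
            if (
                i > 1
                and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            ):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]
```

With this sketch, "abcd" and "acbd" are one edit apart (a single swap),
where plain Levenshtein distance would count two.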

Version 3.6.2 2021-04-20
* move test code to nltk/test
* clean up some doctests
* fix bug in NgramAssocMeasures (order preserving fix)
* fixes for compatibility with Pypy 7.3.4

Thanks to the following contributors to 3.6.2:
Ruben Cartuyvels, Rob Malouf, Dalton Pearson, Danny Sepler

Version 3.6 2021-04-07
* add support for Python 3.9
* add Tree.fromlist
* compute Minimum Spanning Tree of unweighted graph using BFS
* fix bug with infinite loop in Wordnet closure and tree
* fix bug in calculating BLEU using smoothing method 4
* Wordnet synset similarities work for all pos
* new Arabic light stemmer (ARLSTem2)
* new syllable tokenizer (LegalitySyllableTokenizer)
* remove nose in favor of pytest
* misc bug fixes, code cleanups, test cleanups, efficiency improvements

Thanks to the following contributors to 3.6:
Tom Aarsen, K Abainia, Akshita Bhagia, Andrew Bird, Thomas Bird,
Tom Conroy, Christopher Hench, Andrew Jorgensen, Eric Kafe,
Ilia Kurenkov, Yeting Li, Joseph Manu, Marius Mather, Denali Molitor,
Jacob Moorman, Philippe Ombredanne, Vassilis Palassopoulos, Ram Rachum,
Danny Sepler, Or Sharir, Brad Solomon, Hiroki Teranishi, Constantin Weisser,
Pratap Yadav, Louis Yang

Version 3.5 2020-04-13
* add support for Python 3.8
* drop support for Python 2
* create NLTK's own Tokenizer class distinct from the Treebank reference tokeniser
* update Vader sentiment analyser
* fix JSON serialization of some PoS taggers
* minor improvements in grammar.CFG, Vader, pl196x corpus reader, StringTokenizer
* change implementation of <= and >= for FreqDist so they are partial orders
* make FreqDist iterable
* correctly handle Penn Treebank trees with an unlabeled branching top node.

Thanks to the following contributors to 3.5:
Nicolas Darr, Gerhard Kremer, Liling Tan, Christopher Hench, Alexandre Dias, Hervé Nicol,
Pierpaolo Pantone, Bonifacio de Oliveira, Maciej Gawinecki, BLKSerene, hoefling, alvations,
pyfisch, srhrshr
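
Note on the FreqDist comparison entry in 3.5: under a partial order, two
distributions can be incomparable, so neither a <= b nor b <= a has to
hold. The sketch below illustrates one natural reading, comparing counts
sample by sample, using collections.Counter from the standard library.
That reading and the helper name is_sub_distribution are assumptions made
for illustration; NLTK's actual FreqDist code may differ.

```python
from collections import Counter


def is_sub_distribution(small: Counter, large: Counter) -> bool:
    """Return True if every sample in `small` occurs at least as often in `large`.

    Illustrative sketch only; not taken from NLTK.
    """
    # Counter returns 0 for missing keys, so samples absent from `large`
    # fail the comparison as soon as `small` has a positive count for them.
    return all(count <= large[key] for key, count in small.items())


a = Counter("aab")    # a: 2, b: 1
b = Counter("aabbc")  # a: 2, b: 2, c: 1
c = Counter("ccc")    # c: 3

a_below_b = is_sub_distribution(a, b)
b_below_a = is_sub_distribution(b, a)
# a and c are incomparable: neither is below the other.
a_c_comparable = is_sub_distribution(a, c) or is_sub_distribution(c, a)
```

Here a_below_b is True, b_below_a is False, and a_c_comparable is False,
which is what makes the relation a partial order and not a total one.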

Version 3.4.5 2019-08-20
* Fixed security bug in downloader: Zip slip vulnerability - for the unlikely
  situation where a user configures their downloader to use a compromised server
  (https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14751)

Thanks to the following contributors to 3.4.5:
Mike Salvatore
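
Note on the Zip slip entry in 3.4.5: an archive member whose name
contains ".." or an absolute path can be written outside the intended
extraction directory. The sketch below shows the general kind of check
that guards against this, using only the standard library. The helper
name is_safe_member and the example paths are hypothetical; this is not
the fix that was applied to the NLTK downloader.

```python
import os


def is_safe_member(dest_dir: str, member_name: str) -> bool:
    """Return True if extracting `member_name` would stay inside `dest_dir`.

    Illustrative sketch only; not taken from NLTK.
    """
    dest_root = os.path.realpath(dest_dir)
    # Resolve "..", symlinks and absolute member names before comparing.
    target = os.path.realpath(os.path.join(dest_root, member_name))
    # The target is safe only if the destination root is a prefix of it
    # in terms of whole path components.
    return os.path.commonpath([dest_root, target]) == dest_root
```

A caller would typically run such a check on every name returned by
zipfile.ZipFile.namelist() and refuse to extract the archive if any
member fails it.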

Version 3.4.4 2019-07-04
* fix bug in plot function (probability.py)
* add improved PanLex Swadesh corpus reader

Thanks to the following contributors to 3.4.4:
Devashish Lal, Liling Tan

Version 3.4.3 2019-06-07

* add Text.generate()
* add QuadgramAssocMeasures
* add SSP to tokenizers
* return confidence of best tag from AveragedPerceptron
* make plot methods return Axes objects
* don't require list arguments to PositiveNaiveBayesClassifier.train
* fix Tree classes to work with native Python copy library
* fix inconsistency for NomBank
* fix random seeding in LanguageModel.generate
* fix ConditionalFreqDist mutation on tabulate/plot call
* fix broken links in documentation
* fix misc Wordnet issues
* update installation instructions

Thanks to the following contributors to 3.4.3:
alvations, Bharat123rox, cifkao, drewmiller, free-variation, henchc,
irisxzhou, nick-ulle, ppartarr, simonepri, yigitsever, zhaoyanpeng

Version 3.4.1 2019-04-17

* add chomsky_normal_form for CFGs
* add meteor score
* add minimum edit/Levenshtein distance based alignment function
* allow access to collocation list via text.collocation_list()
* support corenlp server options
* drop support for Python 3.4
* other minor fixes

Thanks to the following contributors to 3.4.1:
Adrian Ellis, Andrew Martin, Ayush Kaushal, BLKSerene, Bharat
Raghunathan, Franklin Chen, KMiNT21 Kevin Brown, Liling Tan,
Matan Rak, Nat Quayle Nelson, Osman Zubair, Purificant,
Uday Krishna, Viresh Gupta

Version 3.4 2018-11-17
* Support Python 3.7
* Language Modeling incl Kneser-Ney, Witten-Bell, Good-Turing
* Cistem Stemmer for German
* Support Russian National Corpus incl POS tag model
* Decouple sentiment and twitter packages
* Minor extensions for WordNet
* K-alpha
* Fix warning messages for corenlp
* Comprehensive code cleanups
* Many other minor fixes
* Switch continuous integration from Jenkins to Travis

Special thanks to Ilia Kurenkov (Language Model package), Liling Tan (Python 3.7, Travis-CI),
and purificant (code cleanups). Thanks also to: Afshin Sadeghi, Ales Tamchyna, Alok Debnath,
aquatiko, Coykto, Denis Kataev, dnc1994, Fabian Howard, Frankie Robertson, Iaroslav Tymchenko,
Jayakrishna Sahit, LBenzahia, Leonie Weißweiler, Linghao Zhang, Rohit Kumar, sahitpj,
Tim Gianitsos, vagrant, 53X

Version 3.3 2018-05-06
* Support Python 3.6
* New interface to CoreNLP
* Support synset retrieval by sense key
* Minor fixes to CoNLL Corpus Reader, AlignedSent
* Fixed minor inconsistencies in APIs and API documentation
* Better conformance to PEP8
* Drop moses.py (incompatible license)

Special thanks to Liling Tan for leading our transition to Python 3.6.
Thanks to other contributors listed here: https://github.com/nltk/nltk/blob/develop/AUTHORS.md

Version 3.2.5 2017-09-24

* Arabic stemmers (ARLSTem, Snowball)
* NIST MT evaluation metric and added NIST international_tokenize
* Moses tokenizer
* Document Russian tagger
* Fix to Stanford segmenter
* Improve treebank detokenizer, VerbNet, Vader
* Misc code and documentation cleanups
* Implement fixes suggested by LGTM

Thanks to the following contributors to 3.2.5:
Ali Abdullah, Lakhdar Benzahia, Henry Elder, Campion Fellin,
Tsolak Ghukasyan, Thanh Ha, Jean Helie, Nelson Liu,
Nathan Schneider, Chintan Shah, Fábio Silva, Liling Tan,
Ziyao Wei, Zicheng Xu, Albert Au Yeung, AbdealiJK,
porqupine, sbagan, xprogramer

Version 3.2.4 2017-05-21

* remove load-time dependency on Python requests library
* add support for Arabic in StanfordSegmenter
* fix MosesDetokenizer on irregular quote tokens

Thanks to the following contributors to 3.2.4:
Alex Constantin, Hatem Nassrat, Liling Tan

Version 3.2.3 2017-05-16

* new interface to Stanford CoreNLP Web API
* improved Lancaster stemmer with customizable rules from Whoosh
* improved Treebank tokenizer
* improved support for GLEU score
* adopt new Abstract base class style
* support custom tab files for extending WordNet
* make synset_from_pos_and_offset a public method
* make non-English WordNet lemma lookups case-insensitive
* speed up TnT tagger
* speed up FreqDist and ConditionalFreqDist
* support additional quotes in TreebankWordTokenizer
* clean up Tk's postscript output
* drop explicit support for corpora not distributed with NLTK to streamline testing
* allow iterator in perceptron tagger training
* allow for curly bracket quantifiers in chunk.regexp.CHUNK_TAG_PATTERN
* new corpus reader for MWA subset of PPDB
* improved testing framework

Thanks to the following contributors to 3.2.3:
Mark Amery, Carl Bolz, Abdelhak Bougouffa, Matt Chaput, Michael Goodman,
Jaehoon Hwang, Naoya Kanai, Jackson Lee, Christian Meyer, Dmitrijs Milajevs,
Adam Nelson, Pierpaolo Pantone, Liling Tan, Vilhjalmur Thorsteinsson,
Arthur Tilley, jmhutch, Yorwba, eromoe and others

Version 3.2.2 2016-12-31
* added Kondrak's Aline algorithm
* added ChrF and GLEU MT evaluation metrics
* added Russian pos tagger model
* added Moses detokenizer
* rewrite Porter Stemmer
* rewrite FrameNet corpus reader
  (adds frame parameter to fes(), lus(), exemplars()
  see https://www.nltk.org/howto/framenet.html)
* updated FrameNet Corpus to version 1.7
* fixes to stanford_segmenter.py, SentiText, CoNLL Corpus Reader
* fixes to BLEU, naivebayes, Krippendorff's alpha, Punkt
* fixes to tests for TransitionParser, Senna, edit distance
* fixes to Moses Tokenizer and Detokenizer
* improved TweetTokenizer
* strip trailing whitespace when splitting sentences
* handle inverted exclamation mark in ToktokTokenizer
* resolved some issues with Python 3.5 support
* improvements to testing framework
* clean up dependencies

Thanks to the following contributors to 3.2.2:

Prasasto Adi, Mark Amery, Geoff Bacon, George Berry, Colin Carroll, Alexis Dimitriadis,
Nicholas Fabina, German Ferrero, Tsolak Ghukasyan, Hyuckin David Lim, Naoya Kanai,
Greg Kondrak, Igor Korolev, Tim Leslie, Rob Malouf, Heguang Miao, Dmitrijs Milajevs,
Adam Nelson, Dennis O'Brien, Qi Liu, Pierpaolo Pantone, Andy Reagan, Mike Recachinas,
Nathan Schneider, Jānis Šlapiņš, Richard Snape, Liling Tan, Marcus Uneson,
Linghao Zhang, drevicko, SaintNazaire

Version 3.2.1 2016-04-09
* Support for CCG semantics, Stanford segmenter, VADER lexicon
* Fixes to BLEU score calculation, CHILDES corpus reader
* Other miscellaneous fixes

Thanks to the following contributors to 3.2.1:
Andrew Giel, Casper Lehmann-Strøm, David Madl, Tanin Na Nakorn,
Guilherme Nardari, Philippe Ombredanne, Nathan Schneider, Liling Tan,
Josiah Wang, venticello

Version 3.2 2016-03-03
* Fixes for Python 3.5
* Code cleanups now Python 2.6 is no longer supported
* Improvements to documentation
* Comprehensive use of os.path for platform-specific path handling
* Support for PanLex
* Support for third party download locations for NLTK data
* Fix bugs in IBM method 3 smoothing and BLEU calculation
* Support smoothing for BLEU score and corpus-level BLEU
* Support RIBES score
* Improvements to TweetTokenizer
* Updates for Stanford API
* Add mathematical operators to ConditionalFreqDist
* Fix bug in sentiwordnet for adjectives
* Merged internal implementations of Trie

Thanks to the following contributors to 3.2:
Santiago Castro, Jihun Choi, Graham Christensen, Andrew Drozdov, Long
Duong, Kyriakos Georgiou, Michael Wayne Goodman, Clark Grubb, Tah Wei
Hoon, David Kamholz, Ewan Klein, Reed Loden, Rob Malouf, Philippe
Ombredanne, Josh Owen, Pierpaolo Pantone, Mike Recachinas, Elijah
Rippeth, Thomas Stieglmaier, Liling Tan, Philip Tzou, Pratap Vardhan.

Version 3.1 2015-10-15
* Fixes for Python 3.5 (drop support for capturing groups in regexp tokenizer)
* Drop support for Python 2.6
* Adopt perceptron tagger for new default POS tagger nltk.pos_tag
* Stanford Neural Dependency Parser wrapper
* Sentiment analysis package incl VADER
* Improvements to twitter package
* Multi word expression tokenizer
* Support for everygram and skipgram
* consistent evaluation metric interfaces, putting reference before hypothesis
* new nltk.translate module, incorporating the old align module
* implement stack decoder
* clean up Alignment interface
* CorpusReader method to support access to license and citation
* Multext East Corpus and MTECorpusReader
* include six module to streamline installation on MS Windows

Thanks to the following contributors to 3.1:
Le Tuan Anh, Petra Barancikova, Alexander Böhm, Francis Bond,
Long Duong, Anna Garbar, Matthew Honnibal, Tah Wei Hoon, Ewan Klein,
Rob Malouf, Dmitrijs Milajevs, Will Monroe, Sergio Oller, Pierpaolo
Pantone, Jacob Perkins, Lorenzo Rubio, Thomas Stieglmaier, Liling Tan,
Pratap Vardhan

Version 3.0.5 2015-09-05
* rewritten IBM models, and new IBM Model 4 and 5 implementations
* new Twitter package
* stabilized MaltParser API
* improved regex tagger
* improved documentation on contributing
* minor improvements to documentation and testing

Thanks to the following contributors to 3.0.5:
Álvaro Justen, Dmitrijs Milajevs, Ewan Klein, Heran Lin, Justin Hammar,
Liling Tan, Long Duong, Lorenzo Rubio, Pierpaolo Pantone, Tah Wei Hoon

Version 3.0.4 2015-07-13
* minor bug fixes and enhancements

Thanks to the following contributors to 3.0.4:
Nicola Bova, Santiago Castro, Len Remmerswaal, Keith Suderman, kabayan55,
pln-fing-udelar (NLP Group, Instituto de Computación, Facultad de Ingeniería, Universidad de la República, Uruguay).

Version 3.0.3 2015-06-12
* bug fixes (Stanford NER, Boxer, Snowball, treebank tokenizer,
  dependency graph, KneserNey, BLEU)
* code clean-ups
* default POS tagger permits tagset to be specified
* gensim illustration
* tgrep implementation
* added PanLex Swadesh corpora
* visualisation for aligned bitext
* support for Google App Engine
* POSTagger renamed StanfordPOSTagger, NERTagger renamed StanfordNERTagger

Thanks to the following contributors to 3.0.3:

Long Duong, Pedro Fialho, Dan Garrette, Helder, Saimadhav Heblikar,
Chris Inskip, David Kamholz, Dmitrijs Milajevs, Smitha Milli,
Tom Mortimer-Jones, Avital Pekker, Jonathan Pool, Sam Raker,
Will Roberts, Dmitry Sadovnychyi, Nathan Schneider, Anirudh W

Version 3.0.2 2015-03-13
* make pretty-printing method names consistent
* improvements to Portuguese stemmer
* transition-based dependency parsers
* dependency graph visualisation for ipython notebook
* interfaces for Senna, BLLIP, python-crfsuite
* NKJP corpus reader
* code clean ups, minor bug fixes

Thanks to the following contributors to 3.0.2:

Long Duong, Saimadhav Heblikar, Helder, Mikhail Korobov, Denis Krusko,
Alex Louden, Felipe Madrigal, David McClosky, Dmitrijs Milajevs,
Ondrej Platek, Nathan Schneider, Dávid Márk Nemeskey, 0ssifrage, ducki13, kiwipi.

Version 3.0.1 2015-01-12
* fix setup.py for new version of setuptools

Version 3.0.0 2014-09-07
* minor bugfixes
* added phrase extraction code by Liling Tan and Fredrik Hedman

Thanks to the following contributors to 3.0.0:
Mark Amery, Ivan Barria, Ingolf Becker, Francis Bond, Lars
Buitinck, Cristian Capdevila, Arthur Darcet, Michelle Fullwood,
Dan Garrette, Dougal Graham, Lauri
Hallila, Tyler Hartley, Fredrik Hedman, Ofer Helman, Bruce Hill,
Marcus Huderle, Nancy Ide, Nick Johnson, Angelos Katharopoulos,
Ewan Klein, Mikhail Korobov, Chris Liechti, Peter Ljunglof,
Joseph Lynch, Haejoong Lee, Peter Ljunglöf, Dean Malmgren, Rob
Malouf, Thorsten Marek, Dmitrijs Milajevs, Shari A’aidil
Nasruddin, Lance Nathan, Joel Nothman, Alireza Nourian, Alexander
Oleynikov, Ted Pedersen, Jacob Perkins, Will Roberts, Alex
Rudnick, Nathan Schneider, Geraldine Sim Wei Ying, Lynn Soe,
Liling Tan, Louis Tiao, Marcus Uneson, Yu Usami, Steven Xu, Zhe
Wang, Chuck Wooters, lade, isnowfy, onesandzeros, pquentin, wvanlint

Version 3.0b2 2014-08-21
* minor bugfixes and clean-ups
* renamed remaining parse_ methods to read_ or load_, cf issue #656
* added Paice's method of evaluating stemming algorithms

Thanks to the following contributors to 3.0.0b2: Lars Buitinck,
Cristian Capdevila, Lauri Hallila, Ofer Helman, Dmitrijs Milajevs,
lade, Liling Tan, Steven Xu

Version 3.0.0b1 2014-07-11
* Added SentiWordNet corpus and corpus reader
* Fixed support for 10-column dependency file format
* Changed Tree initialization to use fromstring

Thanks to the following contributors to 3.0b1: Mark Amery, Ivan
Barria, Ingolf Becker, Francis Bond, Lars Buitinck, Arthur Darcet,
Michelle Fullwood, Dan Garrette, Dougal
Graham, Tyler Hartley, Ofer Helman, Bruce Hill, Marcus Huderle, Nancy
Ide, Nick Johnson, Angelos Katharopoulos, Ewan Klein, Mikhail Korobov,
Chris Liechti, Peter Ljunglof, Joseph Lynch, Haejoong Lee, Peter
Ljunglöf, Dean Malmgren, Rob Malouf, Thorsten Marek, Dmitrijs
Milajevs, Shari A’aidil Nasruddin, Lance Nathan, Joel Nothman, Alireza
Nourian, Alexander Oleynikov, Ted Pedersen, Jacob Perkins, Will
Roberts, Alex Rudnick, Nathan Schneider, Geraldine Sim Wei Ying, Lynn
Soe, Liling Tan, Louis Tiao, Marcus Uneson, Yu Usami, Steven Xu, Zhe
Wang, Chuck Wooters, isnowfy, onesandzeros, pquentin, wvanlint

Version 3.0a4 2014-05-25
* IBM Models 1-3, BLEU, Gale-Church aligner
* Lesk algorithm for WSD
* Open Multilingual WordNet
* New implementation of Brill Tagger
* Extend BNCCorpusReader to parse the whole BNC
* MASC Tagged Corpus and corpus reader
* Interface to Stanford Parser
* Code speed-ups and clean-ups
* API standardisation, including fromstring method for many objects
* Improved regression testing setup
* Removed PyYAML dependency

Thanks to the following contributors to 3.0a4:
Ivan Barria, Ingolf Becker, Francis Bond, Arthur Darcet, Dan Garrette,
Ofer Helman, Dougal Graham, Nancy Ide, Ewan Klein, Mikhail Korobov,
Chris Liechti, Peter Ljunglof, Joseph Lynch, Rob Malouf, Thorsten Marek,
Dmitrijs Milajevs, Shari A’aidil Nasruddin, Lance Nathan, Joel Nothman,
Jacob Perkins, Lynn Soe, Liling Tan, Louis Tiao, Marcus Uneson, Steven Xu,
Geraldine Sim Wei Ying

Version 3.0a3 2013-11-02
* support for FrameNet contributed by Chuck Wooters
* support for Universal Declaration of Human Rights Corpus (udhr2)
* major API changes:
  - Tree.node -> Tree.label() / Tree.set_label()
  - Chunk parser: top_node -> root_label; chunk_node -> chunk_label
  - WordNet properties are now access methods, e.g. Synset.definition -> Synset.definition()
  - relextract: show_raw_rtuple() -> rtuple(), show_clause() -> clause()
* bugfix in texttiling
* replaced simplify_tags with support for universal tagset (simplify_tags=True -> tagset='universal')
* Punkt default behavior changed to realign sentence boundaries after trailing parenthesis and quotes
* deprecated classify.svm (use scikit-learn instead)
* various efficiency improvements

Thanks to the following contributors to 3.0a3:
Lars Buitinck, Marcus Huderle, Nick Johnson, Dougal Graham, Ewan Klein,
Mikhail Korobov, Haejoong Lee, Peter Ljunglöf, Dean Malmgren, Lance Nathan,
Alexander Oleynikov, Nathan Schneider, Chuck Wooters, Yu Usami, Steven Xu,
pquentin, wvanlint

Version 3.0a2 2013-07-12
* speed improvements in word_tokenize, GAAClusterer, TnT tagger, Baum Welch, HMM tagger
* small improvements in collocation finders, probability, modelling, Porter Stemmer
* bugfix in lowest common hypernym calculation (used in path similarity measures)
* code cleanups, docstring cleanups, demo fixes

Thanks to the following contributors to 3.0a2:
Mark Amery, Lars Buitinck, Michelle Fullwood, Dan Garrette, Dougal Graham,
Tyler Hartley, Bruce Hill, Angelos Katharopoulos, Mikhail Korobov,
Rob Malouf, Joel Nothman, Ted Pedersen, Will Roberts, Alex Rudnick,
Steven Xu, isnowfy, onesandzeros

Version 3.0a1 2013-02-14
* reinstated tkinter support (Haejoong Lee)

Version 3.0a0 2013-01-14
* alpha release of first version to support Python 2.6, 2.7, and 3.

Version 2.0.4 2012-11-07
* minor bugfix (removed numpy dependency)

Version 2.0.3 2012-09-24

* fixed corpus/reader/util.py to support Python 2.5
* make MaltParser safe to use in parallel
* fixed bug in inter-annotator agreement
* updates to various doctests (nltk/test)
* minor bugfixes

Thanks to the following contributors to 2.0.3:
Robin Cooper, Pablo Duboue, Christian Federmann, Dan Garrette, Ewan Klein,
Pierre-François Laquerre, Max Leonov, Peter Ljunglöf, Nitin Madnani, Ceri Stagg

Version 2.0.2 2012-07-05

* improvements to PropBank, NomBank, and SemCor corpus readers
* interface to full Penn Treebank Corpus V3 (corpus.ptb)
* made wordnet.lemmas case-insensitive
* more flexible padding in model.ngram
* minor bugfixes and documentation enhancements
* better support for automated testing

Thanks to the following contributors to 2.0.2:
Daniel Blanchard, Mikhail Korobov, Nitin Madnani, Duncan McGreggor,
Morten Neergaard, Nathan Schneider, Rico Sennrich.

Version 2.0.1 2012-05-15

* moved NLTK to GitHub: https://github.com/nltk
* set up integration testing: https://jenkins.shiningpanda.com/nltk/ (Morten Neergaard)
* converted documentation to Sphinx format: https://www.nltk.org/api/nltk.html
* dozens of minor enhancements and bugfixes: https://github.com/nltk/nltk/commits/
* dozens of fixes for conformance with PEP-8
* dozens of fixes to ensure operation with Python 2.5
* added interface to Lin's Dependency Thesaurus (Dan Blanchard)
* added interface to scikit-learn classifiers (Lars Buitinck)
* added segmentation evaluation measures (David Doukhan)

Thanks to the following contributors to 2.0.1 (since 2.0b9, July 2010):
Rami Al-Rfou', Yonatan Becker, Steven Bethard, Daniel Blanchard, Lars
Buitinck, David Coles, Lucas Cooper, David Doukhan, Dan Garrette,
Masato Hagiwara, Michael Hansen, Michael Heilman, Rebecca Ingram,
Sudharshan Kaushik, Mikhail Korobov, Peter Ljunglof, Nitin Madnani,
Rob Malouf, Tomonori Nagano, Morten Neergaard, David Nemeskey,
Joel Nothman, Jacob Perkins, Alessandro Presta, Alex Rudnick,
Nathan Schneider, Stefano Lattarini, Peter Stahl, Jason Yoder

Version 2.0.1 (rc1) 2011-04-11

NLTK:
* added interface to the Stanford POS Tagger
* updates to sem.Boxer, sem.drt.DRS
* allow unicode strings in grammars
* allow non-string features in classifiers
* modifications to HunposTagger
* issues with DRS printing
* fixed bigram collocation finder for window_size > 2
* doctest paths no longer presume unix-style pathname separators
* fixed issue with NLTK's tokenize module colliding with the Python tokenize module
* fixed issue with stemming Unicode strings
* changed ViterbiParser.nbest_parse to parse
* ChaSen and KNBC Japanese corpus readers
* preserve case in concordance display
* fixed bug in simplification of Brown tags
* a version of IBM Model 1 as described in Koehn 2010
* new class AlignedSent for aligned sentence data and evaluation metrics
* new nltk.util.set_proxy to allow easy configuration of HTTP proxy
* improvements to downloader user interface to catch URL and HTTP errors
* added CHILDES corpus reader
* created special exception hierarchy for Prover9 errors
* significant changes to the underlying code of the boxer interface
* path-based wordnet similarity metrics use a fake root node for verbs, following the Perl version
* added ability to handle multi-sentence discourses in Boxer
* added the 'english' Snowball stemmer
* simplifications and corrections of Earley Chart Parser rules
* several changes to the feature chart parsers for correct unification
* bugfixes: FreqDist.plot, FreqDist.max, NgramModel.entropy, CategorizedCorpusReader, DecisionTreeClassifier
* removal of Python >2.4 language features for 2.4 compatibility
* removal of deprecated functions and associated warnings
* added semantic domains to wordnet corpus reader
* changed wordnet similarity functions to include instance hyponyms
* updated to use latest version of Boxer

Data:
* JEITA Public Morphologically Tagged Corpus (in ChaSen format)
* KNB Annotated corpus of Japanese blog posts
* Fixed some minor bugs in alvey.fcfg, and added number of parse trees in alvey_sentences.txt
* added more comtrans data

Documentation:
* minor fixes to documentation
* NLTK Japanese book (chapter 12) by Masato Hagiwara

NLTK-Contrib:
* Viethen and Dale referring expression algorithms


692Version 2.0b9 2010-07-25
693
694NLTK:
* many code and documentation cleanups
* Added port of Snowball stemmers
* Fixed loading of pickled tokenizers (issue 556)
* DecisionTreeClassifier now handles unknown features (issue 570)
* Added error messages to LogicParser
* Replaced max_models with end_size to prevent Mace from hanging
* Added interface to Boxer
* Added nltk.corpus.semcor to give access to SemCor 3.0 corpus (issue 530)
* Added support for integer- and float-valued features in maxent classifiers
* Permit NgramModels to be pickled
* Added Sourced Strings (see test/sourcedstring.doctest for details)
* Fixed bugs in Good-Turing and Simple Good-Turing Estimation (issue 26)
* Add support for span tokenization, aka standoff annotation of segmentation (incl Punkt)
* allow unicode nodes in Tree.productions()
* Fixed WordNet's morphy to be consistent with the original implementation,
  taking the shortest returned form instead of an arbitrary one (issues 427, 487)
* Fixed bug in MaxentClassifier
* Accepted bugfixes for YCOE corpus reader (issue 435)
* Added test to _cumulative_frequencies() to correctly handle the case when no arguments are supplied
* Added a TaggerI interface to the HunPos open-source tagger
* Return 0, not None, when no count is present for a lemma in WordNet
* fixed pretty-printing of unicode leaves
* More efficient calculation of the leftcorner relation for left corner parsers
* Added two functions for graph calculations: transitive closure and inversion.
* FreqDist.pop() and FreqDist.popitems() now invalidate the caches (issue 511)

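The two graph calculations named in the 2.0b9 notes (transitive closure and inversion) can be illustrated with a minimal sketch. It assumes a graph stored as a dict that maps each node to a set of successor nodes; the function names, the representation, and the sample data are illustrative assumptions and are not taken from the NLTK source.

```python
def transitive_closure(graph):
    """Return a new graph mapping each node to every node reachable from it."""
    closure = {}
    for start in graph:
        reachable = set()
        stack = list(graph[start])
        while stack:
            node = stack.pop()
            if node not in reachable:
                reachable.add(node)
                # Nodes that only appear as targets have no outgoing edges.
                stack.extend(graph.get(node, ()))
        closure[start] = reachable
    return closure


def invert_graph(graph):
    """Return a new graph with every edge reversed."""
    inverted = {node: set() for node in graph}
    for source, targets in graph.items():
        for target in targets:
            inverted.setdefault(target, set()).add(source)
    return inverted


# Hypothetical sample data: a tiny hypernym chain.
hypernyms = {"dog": {"canine"}, "canine": {"mammal"}, "mammal": set()}
closure = transitive_closure(hypernyms)
hyponyms = invert_graph(hypernyms)
```

Here `closure["dog"]` contains both `canine` and `mammal`, and `hyponyms` maps each node back to the nodes that pointed at it.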
Data:
* Added SemCor 3.0 corpus (Brown Corpus tagged with WordNet synsets)
* Added LanguageID corpus (trigram counts for 451 languages)
* Added grammar for a^n b^n c^n

NLTK-Contrib:
* minor updates

Thanks to the following contributors to 2.0b9:

Steven Bethard, Francis Bond, Dmitry Chichkov, Liang Dong, Dan Garrette,
Simon Greenhill, Bjorn Maeland, Rob Malouf, Joel Nothman, Jacob Perkins,
Alberto Planas, Alex Rudnick, Geoffrey Sampson, Kevin Scannell, Richard Sproat


Version 2.0b8 2010-02-05

NLTK:
* fixed copyright and license statements
* removed PyYAML, and added dependency to installers and download instructions
* updates to LogicParser, DRT (Dan Garrette)
* WordNet similarity metrics return None instead of -1 when
  they fail to find a path (Steve Bethard)
* shortest_path_distance uses instance hypernyms (Jordan Boyd-Graber)
* clean_html improved (Bjorn Maeland)
* batch_parse, batch_interpret and batch_evaluate functions allow
  grammar or grammar filename as argument
* more Portuguese examples (portuguese_en.doctest, examples/pt.py)

NLTK-Contrib:
* Aligner implementations (Christopher Crowner, Torsten Marek)
* ScriptTranscriber package (Richard Sproat and Kristy Hollingshead)

Book:
* updates for second printing, correcting errata
  https://nltk.googlecode.com/svn/trunk/nltk/doc/book/errata.txt

Data:
* added Europarl sample, with 10 docs for each of 11 langs (Nitin Madnani)
* added SMULTRON sample corpus (Torsten Marek, Martin Volk)


Version 2.0b7 2009-11-09

NLTK:
* minor bugfixes and enhancements: data loader, inference package, FreqDist, Punkt
* added Portuguese example module, similar to nltk.book for English (examples/pt.py)
* added all_lemma_names() method to WordNet corpus reader
* added update() and __add__() extensions to FreqDist (enhances alignment with Python 3.0 counters)
* reimplemented clean_html
* added test-suite runner for automatic/manual regression testing

NLTK-Data:
* updated Punkt models for sentence segmentation
* added corpus of the works of Machado de Assis (Brazilian Portuguese)

Book:
* Added translation of preface into Portuguese, contributed by Tiago Tresoldi.

Version 2.0b6 2009-09-20

NLTK:
* minor fixes for Python 2.4 compatibility
* added words() method to XML corpus reader
* minor bugfixes and code clean-ups
* fixed downloader to put data in %APPDATA% on Windows

Data:
* Updated Punkt models
* Fixed utf8 encoding issues with UDHR and Stopwords Corpora
* Renamed CoNLL "cat" files to "esp" (different language)
* Added Alvey NLT feature-based grammar
* Added Polish PL196x corpus

Version 2.0b5 2009-07-19

NLTK:
* minor bugfixes (incl FreqDist, Python eggs)
* added reader for Europarl Corpora (contributed by Nitin Madnani)
* added reader for IPI PAN Polish Corpus (contributed by Konrad Goluchowski)
* fixed data.py so that it doesn't generate a warning for Windows Python 2.6

NLTK-Contrib:
* updated Praat reader (contributed by Margaret Mitchell)

Version 2.0b4 2009-07-10

NLTK:
* switched to Apache License, Version 2.0
* minor bugfixes in semantics and inference packages
* support for Python eggs
* fixed stale regression tests

Data:
* added NomBank 1.0
* uppercased feature names in some grammars

Version 2.0b3 2009-06-25

NLTK:
* several bugfixes
* added nombank corpus reader (Paul Bedaride)

Version 2.0b2 2009-06-15

NLTK:
* minor bugfixes and optimizations for parsers, updated some doctests
* added bottom-up filtered left corner parsers,
  LeftCornerChartParser and IncrementalLeftCornerChartParser.
* fixed dispersion plot bug which prevented empty plots

Version 2.0b1 2009-06-09

NLTK:
* major refactor of chart parser code and improved API (Peter Ljunglöf)
* added new bottom-up left-corner chart parser strategy
* misc bugfixes (ChunkScore, chart rules, chatbots, jcn-similarity)
* improved efficiency of "import nltk" using lazy module imports
* moved CCG package and ISRI Arabic stemmer from NLTK-Contrib into core NLTK
* misc code cleanups

Contrib:
* moved out of the main NLTK distribution into a separate distribution

Book:
* Ongoing polishing ahead of print publication

Version 0.9.9 2009-05-06

NLTK:
* Finalized API for NLTK 2.0 and the book, incl dozens of small fixes
* Names of the form nltk.foo.Bar now available as nltk.Bar
  for significant functionality; in some cases the name was modified
  (using old names will produce a deprecation warning)
* Bugfixes in downloader, WordNet
* Expanded functionality in DecisionTree
* Bigram collocations extended for discontiguous bigrams
* Translation toy nltk.misc.babelfish
* New module nltk.help giving access to tagset documentation
* Fix imports so that NLTK builds without Tkinter (Bjorn Maeland)

Data:
* new maxent NE chunker model
* updated grammar packages for the book
* data for new tagsets collection, documenting several tagsets
* added lolcat translation to the Genesis collection

Contrib (work in progress):
* Updates to coreference package (Joseph Frazee)
* New ISRI Arabic stemmer (Hosam Algasaier)
* Updates to Toolbox package (Greg Aumann)

Book:
* Substantial editorial corrections ahead of final submission

Version 0.9.8 2009-02-18

NLTK:
* New off-the-shelf tokenizer, POS tagger, and named-entity tagger
* New metrics package with inter-annotator agreement scores,
  distance metrics, rank correlation
* New collocations package (Joel Nothman)
* Many clean-ups to WordNet package (Steven Bethard, Jordan Boyd-Graber)
* Moved old pywordnet-based WordNet package to nltk_contrib
* WordNet browser (Paul Bone)
* New interface to dependency treebank corpora
* Moved MinimalSet class into nltk.misc package
* Put NLTK applications in new nltk.app package
* Many other improvements incl semantics package, toolbox, MaltParser
* Misc changes to many API names in preparation for 1.0, old names deprecated
* Most classes now available in the top-level namespace
* Work on Python egg distribution (Brandon Rhodes)
* Removed deprecated code remaining from 0.8.* versions
* Fixes for Python 2.4 compatibility

Data:
* Corrected identifiers in Dependency Treebank corpus
* Basque and Catalan Dependency Treebanks (CoNLL 2007)
* PE08 Parser Evaluation data
* New models for POS tagger and named-entity tagger

Book:
* Substantial editorial corrections

Version 0.9.7 2008-12-19

NLTK:
* fixed problems with accessing zipped corpora
* improved design and efficiency of grammars and chart parsers
  including new bottom-up combine strategy and a redesigned
  Earley strategy (Peter Ljunglof)
* fixed bugs in smoothed probability distributions and added
  regression tests (Peter Ljunglof)
* improvements to Punkt (Joel Nothman)
* improvements to text classifiers
* simple word-overlap RTE classifier

Data:
* A new package of large grammars (Peter Ljunglof)
* A small gazetteer corpus and corpus reader
* Organized example grammars into separate packages
* Children's stories added to gutenberg package

Contrib (work in progress):
* fixes and demonstration for named-entity feature extractors in nltk_contrib.coref

Book:
* extensive changes throughout, including new chapter 5 on classification
  and substantially revised chapter 11 on managing linguistic data

Version 0.9.6 2008-12-07

NLTK:
* new WordNet corpus reader (contributed by Steven Bethard)
* incorporated dependency parsers into NLTK (was NLTK-Contrib) (contributed by Jason Narad)
* moved nltk/cfg.py to nltk/grammar.py and incorporated dependency grammars
* improved efficiency of unification algorithm
* various enhancements to the semantics package
* added plot() and tabulate() methods to FreqDist and ConditionalFreqDist
* FreqDist.keys() and list(FreqDist) provide keys reverse-sorted by value,
  to avoid the confusion caused by FreqDist.sorted()
* new downloader module to support interactive data download: nltk.download()
  run using "python -m nltk.downloader all"
* fixed WordNet bug that caused min_depth() to sometimes give incorrect result
* added nltk.util.Index as a wrapper around defaultdict(list) plus
  a functional-style initializer
* fixed bug in Earley chart parser that caused it to break
* added basic TnT tagger nltk.tag.tnt
* new corpus reader for CoNLL dependency format (contributed by Kepa Sarasola and Iker Manterola)
* misc other bugfixes

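The 0.9.6 entry for nltk.util.Index describes a wrapper around defaultdict(list) with a functional-style initializer. A minimal sketch of that idea follows, assuming the initializer takes an iterable of (key, value) pairs; the class body and sample data here are illustrative and may differ from the shipped implementation.

```python
from collections import defaultdict


class Index(defaultdict):
    """A defaultdict(list) that is filled from (key, value) pairs at construction."""

    def __init__(self, pairs=()):
        super().__init__(list)
        for key, value in pairs:
            self[key].append(value)


# Hypothetical sample data: group words by their first letter in one expression.
words = ["apple", "avocado", "banana", "blueberry", "cherry"]
by_initial = Index((word[0], word) for word in words)
```

Because the pairs can come from a generator expression, an index can be built without an explicit loop at the call site.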
Contrib (work in progress):
* TIGERSearch implementation by Torsten Marek
* extensions to hole and glue semantics modules by Dan Garrette
* new coreference package by Joseph Frazee
* MapReduce interface by Xinfan Meng

Data:
* Corpora are stored in compressed format if this will not compromise speed of access
* Swadesh Corpus of comparative wordlists in 23 languages
* Split grammar collection into separate packages
* New Basque and Spanish grammar samples (contributed by Kepa Sarasola and Iker Manterola)
* Brown Corpus sections now have meaningful names (e.g. 'a' is now 'news')
* Fixed bug that forced users to manually unzip the WordNet corpus
* New dependency-parsed version of Treebank corpus sample
* Added movie script "Monty Python and the Holy Grail" to webtext corpus
* Replaced words corpus data with a much larger list of English words
* New URL for list of available NLTK corpora
  https://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

Book:
* complete rewrite of first three chapters to make the book accessible
  to a wider audience
* new chapter on data-intensive language processing
* extensive reworking of most chapters
* Dropped subsection numbering; moved exercises to end of chapters

Distributions:
* created Portfile to support Mac installation


Version 0.9.5 2008-08-27

NLTK:
* text module with support for concordancing, text generation, plotting
* book module
* Major reworking of the automated theorem proving modules (Dan Garrette)
* draw.dispersion now uses pylab
* draw.concordance GUI tool
* nltk.data supports reading corpora and other data files from within zipfiles
* trees can be constructed from strings with Tree(s) (cf Tree.parse(s))

Contrib (work in progress):
* many updates to student projects
  - nltk_contrib.agreement (Thomas Lippincott)
  - nltk_contrib.coref (Joseph Frazee)
  - nltk_contrib.depparser (Jason Narad)
  - nltk_contrib.fuf (Petro Verkhogliad)
  - nltk_contrib.hadoop (Xinfan Meng)
* clean-ups: deleted stale files; moved some packages to misc

Data:
* Cleaned up Gutenberg text corpora
* added Moby Dick; removed redundant copy of Blake songs.
* more tagger models
* renamed to nltk_data to facilitate installation
* stored each corpus as a zip file for quicker installation
  and access, and to solve a problem with the Propbank
  corpus including a file with an illegal name for MSWindows
  (con.xml).

Book:
* changed filenames to chNN format
* reworked opening chapters (work in progress)

Distributions:
* fixed problem with mac installer that arose when Python binary
  couldn't be found
* removed dependency of NLTK on nltk_data so that NLTK code can be
  installed before the data

Version 0.9.4 2008-08-01

NLTK:
- Expanded semantics package for first order logic, linear logic,
  glue semantics, DRT, LFG (Dan Garrette)
- new WordSense class in wordnet.synset supporting access to synsets
  from sense keys and accessing sense counts (Joel Nothman)
- interface to Mallet's linear chain CRF implementation (nltk.tag.crf)
- misc bugfixes incl Punkt, synsets, maxent
- improved support for chunkers incl flexible chunk corpus reader,
  new rule type: ChunkRuleWithContext
- new GUI for pos-tagged concordancing nltk.draw.pos_concordance
- new GUI for developing regexp chunkers nltk.draw.rechunkparser
- added bio_sents() and bio_words() methods to ConllChunkCorpusReader in conll.py
  to allow reading (word, tag, chunk_typ) tuples from the CoNLL-2000 corpus. Also
  modified ConllChunkCorpusView to support these changes.
- feature structures support values with custom unification methods
- new flag on tagged corpus readers to use simplified tagsets
- new package for ngram language modeling with Katz backoff nltk.model
- added classes for single-parented and multi-parented trees that
  automatically maintain parent pointers (nltk.tree.ParentedTree and
  nltk.tree.MultiParentedTree)
- new WordNet browser GUI (Jussi Salmela, Paul Bone)
- improved support for lazy sequences
- added generate() method to probability distributions
- more flexible parser for converting bracketed strings to trees
- made fixes to docstrings to improve API documentation

Contrib (work in progress):
- new NLG package, FUF/SURGE (Petro Verkhogliad)
- new dependency parser package (Jason Narad)
- new Coreference package, incl support for
  ACE-2, MUC-6 and MUC-7 corpora (Joseph Frazee)
- CCG Parser (Graeme Gange)
- first order resolution theorem prover (Dan Garrette)

Data:
- New NPS Chat Corpus and corpus reader (nltk.corpus.nps_chat)
- ConllCorpusReader can now be used to read CoNLL 2004 and 2005 corpora.
- Implemented HMM-based Treebank POS tagger and phrase chunker for
  nltk_contrib.coref in api.py. Pickled versions of these objects are checked
  in under data/taggers and data/chunkers.

Book:
- misc corrections in response to feedback from readers

Version 0.9.3 2008-06-03

NLTK:
- modified WordNet similarity code to use pre-built information content files
- new classifier-based tagger, BNC corpus reader
- improved unicode support for corpus readers
- improved interfaces to Weka, Prover9/Mace4
- new support for using MEGAM and SciPy to train maxent classifiers
- rewrite of Punkt sentence segmenter (Joel Nothman)
- bugfixes for WordNet information content module (Jordan Boyd-Graber)
- code clean-ups throughout

Book:
- miscellaneous fixes in response to feedback from readers

Contrib:
- implementation of incremental algorithm for generating
  referring expressions (contributed by Margaret Mitchell)
- refactoring WordNet browser (Paul Bone)

Corpora:
- included WordNet information content files

Version 0.9.2 2008-03-04

NLTK:
- new theorem-prover and model-checker module nltk.inference,
  including interface to Prover9/Mace4 (Dan Garrette, Ewan Klein)
- fixed bug in Reuters corpus reader that caused Python
  to complain about too many open files
- VerbNet and PropBank corpus readers

Data:
- VerbNet Corpus version 2.1: hierarchical verb lexicon linked to WordNet
- PropBank Corpus: predicate-argument structures, as stand-off annotation of Penn Treebank

Contrib:
- New work on WordNet browser, incorporating a client-server model (Jussi Salmela)

Distributions:
- Mac OS 10.5 distribution

Version 0.9.1 2008-01-24

NLTK:
- new interface for text categorization corpora
- new corpus readers: RTE, Movie Reviews, Question Classification, Brown Corpus
- fixed bug in ConcatenatedCorpusView that caused iteration to fail if it didn't start from the beginning of the corpus

Data:
- Question classification data, included with permission of Li & Roth
- Reuters 21578 Corpus, ApteMod version, from CPAN
- Movie Reviews corpus (sentiment polarity), included with permission of Lillian Lee
- Corpus for Recognising Textual Entailment (RTE) Challenges 1, 2 and 3
- Brown Corpus (reverted to original file structure: ca01-cr09)
- Penn Treebank corpus sample (simplified implementation, new readers treebank_raw and treebank_chunk)
- Minor redesign of corpus readers, to use filenames instead of "items" to identify parts of a corpus

Contrib:
- theorem_prover: Prover9, tableau, MaltParser, Mace4, glue semantics, docs (Dan Garrette, Ewan Klein)
- drt: improved drawing, conversion to FOL (Dan Garrette)
- gluesemantics: GUI demonstration, abstracted LFG code, documentation (Dan Garrette)
- readability: various text readability scores (Thomas Jakobsen, Thomas Skardal)
- toolbox: code to normalize toolbox databases (Greg Aumann)

Book:
- many improvements in early chapters in response to reader feedback
- updates for revised corpus readers
- moved unicode section to chapter 3
- work on engineering.txt (not included in 0.9.1)

Distributions:
- Fixed installation for Mac OS 10.5 (Joshua Ritterman)
- Generalize doctest_driver to work with doc_contrib

Version 0.9 2007-10-12

NLTK:
- New naming of packages and modules, and more functions imported into
  top-level nltk namespace, e.g. nltk.chunk.Regexp -> nltk.RegexpParser,
  nltk.tokenize.Line -> nltk.LineTokenizer, nltk.stem.Porter -> nltk.PorterStemmer,
  nltk.parse.ShiftReduce -> nltk.ShiftReduceParser
- processing class names changed from verbs to nouns, e.g.
  StemI -> StemmerI, ParseI -> ParserI, ChunkParseI -> ChunkParserI, ClassifyI -> ClassifierI
- all tokenizers are now available as subclasses of TokenizeI;
  selected tokenizers are also available as functions, e.g. wordpunct_tokenize()
- rewritten ngram tagger code, collapsed lookup tagger with unigram tagger
- improved tagger API, permitting training in the initializer
- new system for deprecating code so that users are notified of name changes.
- support for reading feature cfgs to parallel reading cfgs (parse_featcfg())
- text classifier package, maxent (GIS, IIS), naive Bayes, decision trees, weka support
- more consistent tree printing
- wordnet's morphy stemmer now accessible via stemmer package
- RSLP Portuguese stemmer (originally developed by Viviane Moreira Orengo, reimplemented by Tiago Tresoldi)
- promoted ieer_rels.py to the sem package
- improvements to WordNet package (Jussi Salmela)
- more regression tests, and support for checking coverage of tests
- miscellaneous bugfixes
- remove numpy dependency

Data:
- new corpus reader implementation, refactored syntax corpus readers
- new data package: corpora, grammars, tokenizers, stemmers, samples
- CESS-ESP Spanish Treebank and corpus reader
- CESS-CAT Catalan Treebank and corpus reader
- Alpino Dutch Treebank and corpus reader
- MacMorpho POS-tagged Brazilian Portuguese news text and corpus reader
- trained model for Portuguese sentence segmenter
- Floresta Portuguese Treebank version 7.4 and corpus reader
- TIMIT player audio support

Contrib:
- BioReader (contributed by Carlos Rodriguez)
- TnT tagger (contributed by Sam Huston)
- wordnet browser (contributed by Jussi Salmela, requires wxpython)
- lpath interpreter (contributed by Haejoong Lee)
- timex -- regular expression-based temporal expression tagger

Book:
- polishing of early chapters
- introductions to parts 1, 2, 3
- improvements in book processing software (xrefs, avm & gloss formatting, javascript clipboard)
- updates to book organization, chapter contents
- corrections throughout suggested by readers (acknowledged in preface)
- more consistent use of US spelling throughout
- all examples redone to work with single import statement: "import nltk"
- reordered chapters: 5->7->8->9->11->12->5
  * language engineering in part 1 to broaden the appeal
    of the earlier part of the book and to talk more about
    evaluation and baselines at an earlier stage
  * concentrate the partial and full parsing material in part 2,
    and remove the specialized feature-grammar material into part 3

Distributions:
- streamlined mac installation (Joshua Ritterman)
- included mac distribution with ISO image

Version 0.8 2007-07-01

Code:
- changed nltk.__init__ imports to explicitly import names from top-level modules
- changed corpus.util to use the 'rb' flag for opening files, to fix problems
  reading corpora under MSWindows
- updated stale examples in engineering.txt
- extended feature structure interface to permit chained features, e.g. fs['F','G']
- further misc improvements to test code plus some bugfixes
Tutorials:
- rewritten opening section of tagging chapter
- reorganized some exercises

Version 0.8b2 2007-06-26

Code (major):
- new corpus package, obsoleting old corpora package
  - supports caching, slicing, corpus search path
  - more flexible API
- global updates so all NLTK modules use new corpus package
- moved nltk/contrib to separate top-level package nltk_contrib
- changed wordpunct tokenizer to use \w instead of a-zA-Z0-9
  as this will be more robust for languages other than English,
  with implications for many corpus readers that use it
- known bug: certain re-entrant structures in featstruct
- known bug: when the LHS of an edge contains an ApplicationExpression,
  variable values in the RHS bindings aren't copied over when the
  fundamental rule applies
- known bug: HMM tagger is broken
Tutorials:
- global updates to NLTK and docs
- ongoing polishing
Corpora:
- treebank sample reverted to published multi-file structure
Contrib:
- DRT and Glue Semantics code (nltk_contrib.drt, nltk_contrib.gluesemantics, by Dan Garrette)

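The 0.8b2 change to the wordpunct tokenizer (using \w instead of a-zA-Z0-9) can be illustrated with two regular expressions. This is a sketch only: both patterns and the sample text are assumptions chosen for illustration, and the behavior shown is that of Python 3, where \w matches Unicode word characters in str patterns.

```python
import re

# ASCII-only pattern: only a-z, A-Z and 0-9 count as word characters.
ASCII_PATTERN = re.compile(r"[a-zA-Z0-9]+|[^a-zA-Z0-9\s]+")
# \w-based pattern: accented letters (and the underscore) count as word characters.
WORD_PATTERN = re.compile(r"\w+|[^\w\s]+")


def tokenize(pattern, text):
    """Split text into runs of word characters and runs of other non-space characters."""
    return pattern.findall(text)


# "naïve café!" written with escapes so the code points are unambiguous.
text = "na\u00efve caf\u00e9!"
ascii_tokens = tokenize(ASCII_PATTERN, text)
word_tokens = tokenize(WORD_PATTERN, text)
```

The ASCII-only pattern breaks words apart at each accented letter, while the \w-based pattern keeps `naïve` and `café` whole and separates only the trailing punctuation.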
Version 0.8b1 2007-06-18

Code (major):
- changed package name to nltk
- import all top-level modules into nltk, reducing need for import statements
- reorganization of sub-package structures to simplify imports
- new featstruct module, unifying old featurelite and featurestructure modules
- FreqDist now inherits from dict; fd.count(sample) becomes fd[sample]
- FreqDist initializer permits: fd = FreqDist(len(token) for token in text)
- made numpy optional
Code (minor):
- changed GrammarFile initializer to accept filename
- consistent tree display format
- fixed loading process for WordNet and TIMIT that prevented code installation if data not installed
- taken more care with unicode types
- incorporated pcfg code into cfg module
- moved cfg, tree, featstruct to top level
- new filebroker module to make handling of example grammar files more transparent
- more corpus readers (webtext, abc)
- added cfg.covers() to check that a grammar covers a sentence
- simple text-based wordnet browser
- known bug: parse/featurechart.py uses incorrect apply() function
Corpora:
- csv data file to document NLTK corpora
Contrib:
- added Glue semantics code (contrib.glue, by Dan Garrette)
- Punkt sentence segmenter port (contrib.punkt, by Willy)
- added LPath interpreter (contrib.lpath, by Haejoong Lee)
- extensive work on classifiers (contrib.classifier*, Sumukh Ghodke)
Tutorials:
- polishing on parts I, II
- more illustrations, data plots, summaries, exercises
- continuing to make prose more accessible to non-linguistic audience
- new default import that all chapters presume: from nltk.book import *
Distributions:
- updated to latest version of numpy
- removed WordNet installation instructions as WordNet is now included in corpus distribution
- added pylab (matplotlib)

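The 0.8b1 FreqDist entries (inheriting from dict, indexing with fd[sample], and initializing from an iterable) can be sketched with a toy class. `ToyFreqDist` is a hypothetical stand-in written for this illustration; it is not the NLTK class, and returning zero for unseen samples is an assumption of the sketch.

```python
class ToyFreqDist(dict):
    """A toy frequency distribution: a dict from samples to counts."""

    def __init__(self, samples=()):
        super().__init__()
        for sample in samples:
            # dict.get() does not call __missing__, so the default is explicit.
            self[sample] = self.get(sample, 0) + 1

    def __missing__(self, sample):
        # Unseen samples report a count of zero instead of raising KeyError.
        return 0


# Hypothetical sample data: count word lengths, as in the changelog's example.
text = ["the", "cat", "sat", "on", "the", "mat"]
fd = ToyFreqDist(len(token) for token in text)
```

Since the class is a dict, counts are read with ordinary indexing such as `fd[3]`, and the usual dict methods remain available.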
Version 0.7.5 2007-05-16

Code:
- improved WordNet and WordNet-Similarity interface
- the Lancaster Stemmer (contributed by Steven Tomcavage)
Corpora:
- Web text samples
- BioCreAtIvE-PPI - a corpus for protein-protein interactions
- Switchboard Telephone Speech Corpus Sample (via Talkbank)
- CMU Problem Reports Corpus sample
- CONLL2002 POS+NER data
- Patient Information Leaflet corpus
- WordNet 3.0 data files
- English wordlists: basic English, frequent words
Tutorials:
- more improvements to text and images

Version 0.7.4 2007-05-01

Code:
- Indian POS tagged corpus reader: corpora.indian
- Sinica Treebank corpus reader: corpora.sinica_treebank
- new web corpus reader corpora.web
- tag package now supports pickling
- added function to utilities.py to guess character encoding
Corpora:
- Rotokas texts from Stuart Robinson
- POS-tagged corpora for several Indian languages (Bangla, Hindi, Marathi, Telugu) from A Kumaran
Tutorials:
- Substantial work on Part II of book on structured programming, parsing and grammar
- More bibliographic citations
- Improvements in typesetting, cross references
- Resized images and tables for better use of page space
- Moved project list to wiki
Contrib:
- validation of toolbox entries using chunking
- improved classifiers
Distribution:
- updated for Python 2.5.1, Numpy 1.0.2

Version 0.7.3 2007-04-02

* Code:
- made chunk.Regexp.parse() more flexible about its input
- developed new syntax for PCFG grammars, e.g. A -> B C [0.3] | D [0.7]
- fixed CFG parser to support grammars with slash categories
- moved beta classify package from main NLTK to contrib
- Brill taggers now load correctly
- misc bugfixes
* Corpora:
- Shakespeare XML corpus sample and corpus reader
* Tutorials:
- improvements to prose, exercises, plots, images
- expanded and reorganized tutorial on structured programming
- formatting improvements for Python listings
- improved plots (using pylab)
- categorization of problems by difficulty
* Contrib:
- more work on kimmo lexicon and grammar
- more work on classifiers

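The PCFG rule syntax introduced in 0.7.3 (A -> B C [0.3] | D [0.7]) attaches a bracketed probability to each alternative. The sketch below shows one way such a line could be read. `parse_pcfg_rule` is a hypothetical helper written for this illustration, not NLTK's parser, and it assumes that symbols contain no whitespace, "|" or "->" characters.

```python
import re

# One alternative: right-hand-side symbols followed by a bracketed probability.
ALTERNATIVE = re.compile(r"^(?P<rhs>.*?)\s*\[(?P<prob>[0-9.]+)\]$")


def parse_pcfg_rule(line):
    """Parse 'A -> B C [0.3] | D [0.7]' into (lhs, rhs_tuple, probability) triples."""
    lhs, _, rest = line.partition("->")
    lhs = lhs.strip()
    productions = []
    for alternative in rest.split("|"):
        match = ALTERNATIVE.match(alternative.strip())
        if match is None:
            raise ValueError(f"missing probability in {alternative!r}")
        rhs = tuple(match.group("rhs").split())
        productions.append((lhs, rhs, float(match.group("prob"))))
    # The alternatives for one left-hand side should form a probability distribution.
    total = sum(prob for _, _, prob in productions)
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"probabilities for {lhs} sum to {total}, not 1")
    return productions


rules = parse_pcfg_rule("A -> B C [0.3] | D [0.7]")
```

Each alternative becomes its own production, so the example line yields one production for `A -> B C` and one for `A -> D`, with the check on the total guarding against mistyped probabilities.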
Version 0.7.2 2007-03-01

* Code:
- simple feature detectors (detect module)
- fixed problem when token generators are passed to a parser (parse package)
- fixed bug in Grammar.productions() (identified by Lucas Champollion and Mitch Marcus)
- fixed import bug in category.GrammarFile.earley_parser
- added utilities.OrderedDict
- initial port of old NLTK classifier package (by Sam Huston)
- UDHR corpus reader
* Corpora:
- added UDHR corpus (Universal Declaration of Human Rights)
  with 10k text samples in 300+ languages
* Tutorials:
- improved images
- improved book formatting, including new support for:
  - javascript to copy program examples to clipboard in HTML version,
  - bibliography, chapter cross-references, colorization, index, table-of-contents

* Contrib:
- new Kimmo system: contrib.mit.six863.kimmo (Rob Speer)
- fixes for: contrib.fsa (Rob Speer)
- demonstration of text classifiers trained on UDHR corpus for
  language identification: contrib.langid (Sam Huston)
- new Lambek calculus system: contrib.lambek
- new tree implementation based on elementtree: contrib.tree

Version 0.7.1 2007-01-14

* Code:
- bugfixes (HMM, WordNet)

Version 0.7 2006-12-22

* Code:
- bugfixes, including fixed bug in Brown corpus reader
- cleaned up wordnet 2.1 interface code and similarity measures
- support for full Penn treebank format contributed by Yoav Goldberg
* Tutorials:
- expanded tutorials on advanced parsing and structured programming
- checked all doctest code
- improved images for chart parsing

Version 0.7b1 2006-12-06

* Code:
- expanded semantic interpretation package
- new high-level chunking interface, with cascaded chunking
- split chunking code into new chunk package
- updated wordnet package to support version 2.1 of Wordnet.
- prototyped basic wordnet similarity measures
  (path distance, Wu + Palmer and Leacock + Chodorow, Resnik similarity measures.)
- bugfixes (tag.Window, tag.ngram)
- more doctests
* Contrib:
- toolbox language settings module
* Tutorials:
- rewrite of chunking chapter, switched from Treebank to CoNLL format as main focus,
  simplified evaluation framework, added ngram chunking section
- substantial updates throughout (esp programming and semantics chapters)
* Corpora:
- Chat-80 Prolog data files provided as corpora, plus corpus reader

Version 0.7a2 2006-11-13

* Code:
- more doctests
- code to read Chat-80 data
- HMM bugfix
* Tutorials:
- continued updates and polishing
* Corpora:
- toolbox MDF sample data

Version 0.7a1 2006-10-29

* Code:
- new toolbox module (Greg Aumann)
- new semantics package (Ewan Klein)
- bugfixes
* Tutorials:
- substantial revision, especially in preface, introduction, words,
  and semantics chapters.

Version 0.6.6 2006-10-06

* Code:
- bugfixes (probability, shoebox, draw)
* Contrib:
- new work on shoebox package (Stuart Robinson)
* Tutorials:
- continual expansion and revision, especially on introduction to
  programming, advanced programming and the feature-based grammar chapters.

Version 0.6.5 2006-07-09

* Code:
- improvements to shoebox module (Stuart Robinson, Greg Aumann)
- incorporated feature-based parsing into core NLTK-Lite
- corpus reader for Sinica treebank sample
- new stemmer package
* Contrib:
- hole semantics implementation (Peter Wang)
- Incorporating yaml
- new work on feature structures, unification, lambda calculus
- new work on shoebox package (Stuart Robinson, Greg Aumann)
* Corpora:
- Sinica treebank sample
* Tutorials:
- expanded discussion throughout, incl: left-recursion, trees, grammars,
  feature-based grammar, agreement, unification, PCFGs,
  baseline performance, exercises, improved display of trees

Version 0.6.4 2006-04-20

* Code:
- corpus readers for Senseval 2 and TIMIT
- clusterer (ported from old NLTK)
- support for cascaded chunkers
- bugfix suggested by Brent Payne
- new SortedDict class for regression testing
* Contrib:
- CombinedTagger tagger and marshalling taggers, contributed by Tiago Tresoldi
* Corpora:
- new: Senseval 2, TIMIT sample
* Tutorials:
- major revisions to programming, words, tagging, chunking, and parsing tutorials
- many new exercises
- formatting improvements, including colorized program examples
- fixed problem with testing on training data, reported by Jason Baldridge

Version 0.6.3 2006-03-09

* switch to new style classes
* repair FSA model sufficiently for Kimmo module to work
* port of MIT Kimmo morphological analyzer; still needs lots of code clean-up and inline docs
* expanded support for shoebox format, developed with Stuart Robinson
* fixed bug in indexing CFG productions, for empty right-hand-sides
* efficiency improvements, suggested by Martin Ranang
* replaced classeq with isinstance, for efficiency improvement, as suggested by Martin Ranang
* bugfixes in chunk eval
* simplified call to draw_trees
* names, stopwords corpora

Version 0.6.2 2006-01-29

* Peter Spiller's concordancer
* Will Hardy's implementation of Penton's paradigm visualization system
* corpus readers for presidential speeches
* removed NLTK dependency
* generalized CFG terminals to permit full range of characters
* used fully qualified names in demo code, for portability
* bugfixes from Yoav Goldberg, Eduardo Pereira Habkost
* fixed obscure quoting bug in tree displays and conversions
* simplified demo code, fixed import bug