nltk

ChangeLog

Version 3.8.1 2023-01-02

* Resolve RCE vulnerability in localhost WordNet Browser (#3100)
* Remove unused tool scripts (#3099)
* Resolve XSS vulnerability in localhost WordNet Browser (#3096)
* Add Python 3.11 support (#3090)

Thanks to the following contributors to 3.8.1:
Francis Bond, John Vandenberg, Tom Aarsen

Version 3.8 2022-12-12

* Refactor dispersion plot (#3082)
* Provide type hints for LazyCorpusLoader variables (#3081)
* Throw warning when LanguageModel is initialized with incorrect vocabulary (#3080)
* Fix WordNet's all_synsets() function (#3078)
* Resolve TreebankWordDetokenizer inconsistency with end-of-string contractions (#3070)
* Support both iso639-3 codes and BCP-47 language tags (#3060)
* Avoid DeprecationWarning in Regexp tokenizer (#3055)
* Fix many doctests, add doctests to CI (#3054, #3050, #3048)
* Fix bool field not being read in VerbNet (#3044)
* Greatly improve time efficiency of SyllableTokenizer when tokenizing numbers (#3042)
* Fix encodings of Polish udhr corpus reader (#3038)
* Allow TweetTokenizer to tokenize emoji flag sequences (#3034)
* Prevent LazyModule from increasing the size of nltk.__dict__ (#3033)
* Fix CoreNLPServer non-default port issue (#3031)
* Add "acion" suffix to the Spanish SnowballStemmer (#3030)
* Allow loading WordNet without OMW (#3026)
* Use input() in nltk.chat.chatbot() for Jupyter support (#3022)
* Fix edit_distance_align() in distance.py (#3017)
* Tackle performance and accuracy regression of sentence tokenizer since NLTK 3.6.6 (#3014)
* Add the Iota operator to semantic logic (#3010)
* Resolve critical errors in WordNet app (#3008)
* Resolve critical error in CHILDES Corpus (#2998)
* Make WordNet information_content() accept adjective satellites (#2995)
* Add "strict=True" parameter to CoreNLP (#2993, #3043)
* Resolve issue with WordNet's synset_from_sense_key (#2988)
* Handle WordNet synsets that were lost in mapping (#2985)
* Resolve TypeError in Boxer (#2979)
* Add function to retrieve WordNet synonyms (#2978)
* Warn about nonexistent OMW offsets instead of raising an error (#2974)
* Fix missing ic argument in res, jcn and lin similarity functions of WordNet (#2970)
* Add support for the extended OMW (#2946)
* Fix LC cutoff policy of text tiling (#2936)
* Optimize ConditionalFreqDist.__add__ performance (#2939)
* Add Markdown corpus reader (#2902)

Thanks to the following contributors to 3.8:
Alexandre Perez-Lebel, David Lukes, Eric Kafe, Fernando Carranza, Heungson Lee,
Hoyeol Kim, James Huang, Jelle Zijlstra, Louis-Justin Tallot, M.K. Pawelkiewicz,
Jan Lennartz, Malinda Dilhara, Martin Kondratzky, Rob Malouf, Saud Kadiri,
Siddhesh Mhadnak, Stephan Hasler, Steve Smith, Tom Aarsen, Tyler Sheaffer,
Yue Zhao, cestwc, elespike, purificant, richardyy1188

Version 3.7 2022-02-09

* Improve and update the NLTK team page on nltk.org (#2855, #2941)
* Drop support for Python 3.6, support Python 3.10 (#2920)

Thanks to the following contributors to 3.7:
Tom Aarsen

Version 3.6.7 2021-12-28

* Resolve IndexError in `sent_tokenize` and `word_tokenize` (#2922)

Thanks to the following contributors to 3.6.7:
Tom Aarsen

Version 3.6.6 2021-12-21

* Refactor `gensim.doctest` to work for gensim 4.0.0 and up (#2914)
* Add Precision, Recall, F-measure, Confusion Matrix to Taggers (#2862)
* Added warnings if .zip files exist without any corresponding .csv files. (#2908)
* Fix `FileNotFoundError` when the `download_dir` is a non-existing nested folder (#2910)
* Rename omw to omw-1.4 (#2907)
* Resolve ReDoS opportunity by fixing incorrectly specified regex (#2906)
* Support OMW 1.4 (#2899)
* Deprecate Tree get and set node methods (#2900)
* Fix broken inaugural test case (#2903)
* Use Multilingual Wordnet Data from OMW with newer Wordnet versions (#2889)
* Keep NLTK's "tokenize" module working with pathlib (#2896)
* Make prettyprinter more readable (#2893)
* Update links to the nltk book (#2895)
* Add `CITATION.cff` to nltk (#2880)
* Resolve serious ReDoS in PunktSentenceTokenizer (#2869)
* Delete old CI config files (#2881)
* Improve Tokenize documentation + add TokenizerI as superclass for TweetTokenizer (#2878)
* Fix expected value for BLEU score doctest after changes from #2572
* Add multi Bleu functionality and tests (#2793)
* Deprecate 'return_str' parameter in NLTKWordTokenizer and TreebankWordTokenizer (#2883)
* Allow empty string in CFG's + more (#2888)
* Partition `tree.py` module into `tree` package + pickle fix (#2863)
* Fix several TreebankWordTokenizer and NLTKWordTokenizer bugs (#2877)
* Rewind Wordnet data file after each lookup (#2868)
* Correct __init__ call for SyntaxCorpusReader subclasses (#2872)
* Documentation fixes (#2873)
* Fix Levenshtein distance for duplicated letters (#2849)
* Support alternative Wordnet versions (#2860)
* Remove hundreds of formatting warnings for nltk.org (#2859)
* Modernize `nltk.org/howto` pages (#2856)
* Fix Bleu Score smoothing function from taking log(0) (#2839)
* Update third party tools to newer versions and remove MaltParser fixed version (#2832)
* Fix TypeError: _pretty() takes 1 positional argument but 2 were given in sem/drt.py (#2854)
* Replace `http` with `https` in most URLs (#2852)

Thanks to the following contributors to 3.6.6:
Adam Hawley, BatMrE, Danny Sepler, Eric Kafe, Gavish Poddar, Panagiotis Simakis,
RnDevelover, Robby Horvath, Tom Aarsen, Yuta Nakamura, Mohaned Mashaly

Version 3.6.5 2021-10-11

* modernised nltk.org website
* addressed LGTM.com issues
* support ZWJ sequences emoji and skin tone modifier emoji in TweetTokenizer
* METEOR evaluation now requires pre-tokenized input
* Code linting and type hinting
* implement get_refs function for DrtLambdaExpression
* Enable automated CoreNLP, Senna, Prover9/Mace4, Megam, MaltParser CI tests
* specify minimum regex version that supports regex.Pattern
* avoid re.Pattern and regex.Pattern which fail for Python 3.6, 3.7

Thanks to the following contributors to 3.6.5:
Tom Aarsen, Saibo Geng, Mohaned Mashaly, Dimitri Papadopoulos, Danny Sepler,
Ahmet Yildirim, RnDevelover, yutanakamura

Version 3.6.4 2021-10-01

* deprecate `nltk.usage(obj)` in favor of `help(obj)`
* resolve ReDoS vulnerability in Corpus Reader
* solidify performance tests
* improve phone number recognition in tweet tokenizer
* refactored CISTEM stemmer for German
* identify NLTK Team as the author
* replace travis badge with github actions badge
* add SECURITY.md

Thanks to the following contributors to 3.6.4:
Tom Aarsen, Mohaned Mashaly, Dimitri Papadopoulos Orfanos, purificant, Danny Sepler

Version 3.6.3 2021-09-19
* Dropped support for Python 3.5
* Run CI tests on Windows, too
* Moved from Travis CI to GitHub Actions
* Code and comment cleanups
* Visualize WordNet relation graphs using Graphviz
* Fixed large error in METEOR score
* Apply isort, pyupgrade, black, added as pre-commit hooks
* Prevent debug_decisions in Punkt from throwing IndexError
* Resolved ZeroDivisionError in RIBES with dissimilar sentences
* Initialize WordNet IC total counts with smoothing value
* Fixed AttributeError for Arabic ARLSTem2 stemmer
* Many fixes and improvements to lm language model package
* Fix bug in nltk.metrics.aline, C_skip = -10
* Improvements to TweetTokenizer
* Optional show arg for FreqDist.plot, ConditionalFreqDist.plot
* edit_distance now computes Damerau-Levenshtein edit-distance

Thanks to the following contributors to 3.6.3:
Tom Aarsen, Abhijnan Bajpai, Michael Wayne Goodman, Michał Górny, Maarten ter Huurne,
Manu Joseph, Eric Kafe, Ilia Kurenkov, Daniel Loney, Rob Malouf, Mohaned Mashaly,
purificant, Danny Sepler, Anthony Sottile

Version 3.6.2 2021-04-20
* move test code to nltk/test
* clean up some doctests
* fix bug in NgramAssocMeasures (order preserving fix)
* fixes for compatibility with Pypy 7.3.4

Thanks to the following contributors to 3.6.2:
Ruben Cartuyvels, Rob Malouf, Dalton Pearson, Danny Sepler

Version 3.6 2021-04-07
* add support for Python 3.9
* add Tree.fromlist
* compute Minimum Spanning Tree of unweighted graph using BFS
* fix bug with infinite loop in Wordnet closure and tree
* fix bug in calculating BLEU using smoothing method 4
* Wordnet synset similarities work for all pos
* new Arabic light stemmer (ARLSTem2)
* new syllable tokenizer (LegalitySyllableTokenizer)
* remove nose in favor of pytest
* misc bug fixes, code cleanups, test cleanups, efficiency improvements

Thanks to the following contributors to 3.6:
Tom Aarsen, K Abainia, Akshita Bhagia, Andrew Bird, Thomas Bird,
Tom Conroy, Christopher Hench, Andrew Jorgensen, Eric Kafe,
Ilia Kurenkov, Yeting Li, Joseph Manu, Marius Mather, Denali Molitor,
Jacob Moorman, Philippe Ombredanne, Vassilis Palassopoulos, Ram Rachum,
Danny Sepler, Or Sharir, Brad Solomon, Hiroki Teranishi, Constantin Weisser,
Pratap Yadav, Louis Yang

Version 3.5 2020-04-13
* add support for Python 3.8
* drop support for Python 2
* create NLTK's own Tokenizer class distinct from the Treebank reference tokeniser
* update Vader sentiment analyser
* fix JSON serialization of some PoS taggers
* minor improvements in grammar.CFG, Vader, pl196x corpus reader, StringTokenizer
* change implementation of <= and >= for FreqDist so they are partial orders
* make FreqDist iterable
* correctly handle Penn Treebank trees with an unlabeled branching top node.

Thanks to the following contributors to 3.5:
Nicolas Darr, Gerhard Kremer, Liling Tan, Christopher Hench, Alexandre Dias, Hervé Nicol,
Pierpaolo Pantone, Bonifacio de Oliveira, Maciej Gawinecki, BLKSerene, hoefling, alvations,
pyfisch, srhrshr

Version 3.4.5 2019-08-20
* Fixed security bug in downloader: Zip slip vulnerability - for the unlikely
  situation where a user configures their downloader to use a compromised server
  (https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14751)

Thanks to the following contributors to 3.4.5:
Mike Salvatore

Version 3.4.4 2019-07-04
* fix bug in plot function (probability.py)
* add improved PanLex Swadesh corpus reader

Thanks to the following contributors to 3.4.4:
Devashish Lal, Liling Tan

Version 3.4.3 2019-06-07

* add Text.generate()
* add QuadgramAssocMeasures
* add SSP to tokenizers
* return confidence of best tag from AveragedPerceptron
* make plot methods return Axes objects
* don't require list arguments to PositiveNaiveBayesClassifier.train
* fix Tree classes to work with native Python copy library
* fix inconsistency for NomBank
* fix random seeding in LanguageModel.generate
* fix ConditionalFreqDist mutation on tabulate/plot call
* fix broken links in documentation
* fix misc Wordnet issues
* update installation instructions

Thanks to the following contributors to 3.4.3:
alvations, Bharat123rox, cifkao, drewmiller, free-variation, henchc,
irisxzhou, nick-ulle, ppartarr, simonepri, yigitsever, zhaoyanpeng

Version 3.4.1 2019-04-17

* add chomsky_normal_form for CFGs
* add meteor score
* add minimum edit/Levenshtein distance based alignment function
* allow access to collocation list via text.collocation_list()
* support corenlp server options
* drop support for Python 3.4
* other minor fixes

Thanks to the following contributors to 3.4.1:
Adrian Ellis, Andrew Martin, Ayush Kaushal, BLKSerene, Bharat
Raghunathan, Franklin Chen, KMiNT21 Kevin Brown, Liling Tan,
Matan Rak, Nat Quayle Nelson, Osman Zubair, Purificant,
Uday Krishna, Viresh Gupta

Version 3.4 2018-11-17
* Support Python 3.7
* Language Modeling incl Kneser-Ney, Witten-Bell, Good-Turing
* Cistem Stemmer for German
* Support Russian National Corpus incl POS tag model
* Decouple sentiment and twitter packages
* Minor extensions for WordNet
* K-alpha
* Fix warning messages for corenlp
* Comprehensive code cleanups
* Many other minor fixes
* Switch continuous integration from Jenkins to Travis

Special thanks to Ilia Kurenkov (Language Model package), Liling Tan (Python 3.7, Travis-CI),
and purificant (code cleanups). Thanks also to: Afshin Sadeghi, Ales Tamchyna, Alok Debnath,
aquatiko, Coykto, Denis Kataev, dnc1994, Fabian Howard, Frankie Robertson, Iaroslav Tymchenko,
Jayakrishna Sahit, LBenzahia, Leonie Weißweiler, Linghao Zhang, Rohit Kumar, sahitpj,
Tim Gianitsos, vagrant, 53X

Version 3.3 2018-05-06
* Support Python 3.6
* New interface to CoreNLP
* Support synset retrieval by sense key
* Minor fixes to CoNLL Corpus Reader, AlignedSent
* Fixed minor inconsistencies in APIs and API documentation
* Better conformance to PEP8
* Drop moses.py (incompatible license)

Special thanks to Liling Tan for leading our transition to Python 3.6.
Thanks to other contributors listed here: https://github.com/nltk/nltk/blob/develop/AUTHORS.md

Version 3.2.5 2017-09-24

* Arabic stemmers (ARLSTem, Snowball)
* NIST MT evaluation metric and added NIST international_tokenize
* Moses tokenizer
* Document Russian tagger
* Fix to Stanford segmenter
* Improve treebank detokenizer, VerbNet, Vader
* Misc code and documentation cleanups
* Implement fixes suggested by LGTM

Thanks to the following contributors to 3.2.5:
Ali Abdullah, Lakhdar Benzahia, Henry Elder, Campion Fellin,
Tsolak Ghukasyan, Thanh Ha, Jean Helie, Nelson Liu,
Nathan Schneider, Chintan Shah, Fábio Silva, Liling Tan,
Ziyao Wei, Zicheng Xu, Albert Au Yeung, AbdealiJK,
porqupine, sbagan, xprogramer

Version 3.2.4 2017-05-21

* remove load-time dependency on Python requests library
* add support for Arabic in StanfordSegmenter
* fix MosesDetokenizer on irregular quote tokens

Thanks to the following contributors to 3.2.4:
Alex Constantin, Hatem Nassrat, Liling Tan

Version 3.2.3 2017-05-16

* new interface to Stanford CoreNLP Web API
* improved Lancaster stemmer with customizable rules from Whoosh
* improved Treebank tokenizer
* improved support for GLEU score
* adopt new Abstract base class style
* support custom tab files for extending WordNet
* make synset_from_pos_and_offset a public method
* make non-English WordNet lemma lookups case-insensitive
* speed up TnT tagger
* speed up FreqDist and ConditionalFreqDist
* support additional quotes in TreebankWordTokenizer
* clean up Tk's postscript output
* drop explicit support for corpora not distributed with NLTK to streamline testing
* allow iterator in perceptron tagger training
* allow for curly bracket quantifiers in chunk.regexp.CHUNK_TAG_PATTERN
* new corpus reader for MWA subset of PPDB
* improved testing framework

Thanks to the following contributors to 3.2.3:
Mark Amery, Carl Bolz, Abdelhak Bougouffa, Matt Chaput, Michael Goodman,
Jaehoon Hwang, Naoya Kanai, Jackson Lee, Christian Meyer, Dmitrijs Milajevs,
Adam Nelson, Pierpaolo Pantone, Liling Tan, Vilhjalmur Thorsteinsson,
Arthur Tilley, jmhutch, Yorwba, eromoe and others

Version 3.2.2 2016-12-31
* added Kondrak's Aline algorithm
* added ChrF and GLEU MT evaluation metrics
* added Russian pos tagger model
* added Moses detokenizer
* rewrite Porter Stemmer
* rewrite FrameNet corpus reader
  (adds frame parameter to fes(), lus(), exemplars()
  see https://www.nltk.org/howto/framenet.html)
* updated FrameNet Corpus to version 1.7
* fixes to stanford_segmenter.py, SentiText, CoNLL Corpus Reader
* fixes to BLEU, naivebayes, Krippendorff's alpha, Punkt
* fixes to tests for TransitionParser, Senna, edit distance
* fixes to Moses Tokenizer and Detokenizer
* improved TweetTokenizer
* strip trailing whitespace when splitting sentences
* handle inverted exclamation mark in ToktokTokenizer
* resolved some issues with Python 3.5 support
* improvements to testing framework
* clean up dependencies

Thanks to the following contributors to 3.2.2:

Prasasto Adi, Mark Amery, Geoff Bacon, George Berry, Colin Carroll, Alexis Dimitriadis,
Nicholas Fabina, German Ferrero, Tsolak Ghukasyan, Hyuckin David Lim, Naoya Kanai,
Greg Kondrak, Igor Korolev, Tim Leslie, Rob Malouf, Heguang Miao, Dmitrijs Milajevs,
Adam Nelson, Dennis O'Brien, Qi Liu, Pierpaolo Pantone, Andy Reagan, Mike Recachinas,
Nathan Schneider, Jānis Šlapiņš, Richard Snape, Liling Tan, Marcus Uneson,
Linghao Zhang, drevicko, SaintNazaire

Version 3.2.1 2016-04-09
* Support for CCG semantics, Stanford segmenter, VADER lexicon
* Fixes to BLEU score calculation, CHILDES corpus reader
* Other miscellaneous fixes

Thanks to the following contributors to 3.2.1:
Andrew Giel, Casper Lehmann-Strøm, David Madl, Tanin Na Nakorn,
Guilherme Nardari, Philippe Ombredanne, Nathan Schneider, Liling Tan,
Josiah Wang, venticello

Version 3.2 2016-03-03
* Fixes for Python 3.5
* Code cleanups now that Python 2.6 is no longer supported
* Improvements to documentation
* Comprehensive use of os.path for platform-specific path handling
* Support for PanLex
* Support for third party download locations for NLTK data
* Fix bugs in IBM method 3 smoothing and BLEU calculation
* Support smoothing for BLEU score and corpus-level BLEU
* Support RIBES score
* Improvements to TweetTokenizer
* Updates for Stanford API
* Add mathematical operators to ConditionalFreqDist
* Fix bug in sentiwordnet for adjectives
* Merged internal implementations of Trie

Thanks to the following contributors to 3.2:
Santiago Castro, Jihun Choi, Graham Christensen, Andrew Drozdov, Long
Duong, Kyriakos Georgiou, Michael Wayne Goodman, Clark Grubb, Tah Wei
Hoon, David Kamholz, Ewan Klein, Reed Loden, Rob Malouf, Philippe
Ombredanne, Josh Owen, Pierpaolo Pantone, Mike Recachinas, Elijah
Rippeth, Thomas Stieglmaier, Liling Tan, Philip Tzou, Pratap Vardhan.

Version 3.1 2015-10-15
* Fixes for Python 3.5 (drop support for capturing groups in regexp tokenizer)
* Drop support for Python 2.6
* Adopt perceptron tagger for new default POS tagger nltk.pos_tag
* Stanford Neural Dependency Parser wrapper
* Sentiment analysis package incl VADER
* Improvements to twitter package
* Multi word expression tokenizer
* Support for everygram and skipgram
* consistent evaluation metric interfaces, putting reference before hypothesis
* new nltk.translate module, incorporating the old align module
* implement stack decoder
* clean up Alignment interface
* CorpusReader method to support access to license and citation
* Multext East Corpus and MTECorpusReader
* include six module to streamline installation on MS Windows

Thanks to the following contributors to 3.1:
Le Tuan Anh, Petra Barancikova, Alexander Böhm, Francis Bond,
Long Duong, Anna Garbar, Matthew Honnibal, Tah Wei Hoon, Ewan Klein,
Rob Malouf, Dmitrijs Milajevs, Will Monroe, Sergio Oller, Pierpaolo
Pantone, Jacob Perkins, Lorenzo Rubio, Thomas Stieglmaier, Liling Tan,
Pratap Vardhan

Version 3.0.5 2015-09-05
* rewritten IBM models, and new IBM Model 4 and 5 implementations
* new Twitter package
* stabilized MaltParser API
* improved regex tagger
* improved documentation on contributing
* minor improvements to documentation and testing

Thanks to the following contributors to 3.0.5:
Álvaro Justen, Dmitrijs Milajevs, Ewan Klein, Heran Lin, Justin Hammar,
Liling Tan, Long Duong, Lorenzo Rubio, Pierpaolo Pantone, Tah Wei Hoon

Version 3.0.4 2015-07-13
* minor bug fixes and enhancements

Thanks to the following contributors to 3.0.4:
Nicola Bova, Santiago Castro, Len Remmerswaal, Keith Suderman, kabayan55,
pln-fing-udelar (NLP Group, Instituto de Computación, Facultad de Ingeniería, Universidad de la República, Uruguay).

Version 3.0.3 2015-06-12
* bug fixes (Stanford NER, Boxer, Snowball, treebank tokenizer,
    dependency graph, KneserNey, BLEU)
* code clean-ups
* default POS tagger permits tagset to be specified
* gensim illustration
* tgrep implementation
* added PanLex Swadesh corpora
* visualisation for aligned bitext
* support for Google App Engine
* POSTagger renamed StanfordPOSTagger, NERTagger renamed StanfordNERTagger

Thanks to the following contributors to 3.0.3:

Long Duong, Pedro Fialho, Dan Garrette, Helder, Saimadhav Heblikar,
Chris Inskip, David Kamholz, Dmitrijs Milajevs, Smitha Milli,
Tom Mortimer-Jones, Avital Pekker, Jonathan Pool, Sam Raker,
Will Roberts, Dmitry Sadovnychyi, Nathan Schneider, Anirudh W

Version 3.0.2 2015-03-13
* make pretty-printing method names consistent
* improvements to Portuguese stemmer
* transition-based dependency parsers
* dependency graph visualisation for ipython notebook
* interfaces for Senna, BLLIP, python-crfsuite
* NKJP corpus reader
* code clean ups, minor bug fixes

Thanks to the following contributors to 3.0.2:

Long Duong, Saimadhav Heblikar, Helder, Mikhail Korobov, Denis Krusko,
Alex Louden, Felipe Madrigal, David McClosky, Dmitrijs Milajevs,
Ondrej Platek, Nathan Schneider, Dávid Márk Nemeskey, 0ssifrage, ducki13, kiwipi.

Version 3.0.1 2015-01-12
* fix setup.py for new version of setuptools

Version 3.0.0 2014-09-07
* minor bugfixes
* added phrase extraction code by Liling Tan and Fredrik Hedman

Thanks to the following contributors to 3.0.0:
Mark Amery, Ivan Barria, Ingolf Becker, Francis Bond, Lars
Buitinck, Cristian Capdevila, Arthur Darcet, Michelle Fullwood,
Dan Garrette, Dougal Graham, Lauri
Hallila, Tyler Hartley, Fredrik Hedman, Ofer Helman, Bruce Hill,
Marcus Huderle, Nancy Ide, Nick Johnson, Angelos Katharopoulos,
Ewan Klein, Mikhail Korobov, Chris Liechti, Peter Ljunglof,
Joseph Lynch, Haejoong Lee, Peter Ljunglöf, Dean Malmgren, Rob
Malouf, Thorsten Marek, Dmitrijs Milajevs, Shari A’aidil
Nasruddin, Lance Nathan, Joel Nothman, Alireza Nourian, Alexander
Oleynikov, Ted Pedersen, Jacob Perkins, Will Roberts, Alex
Rudnick, Nathan Schneider, Geraldine Sim Wei Ying, Lynn Soe,
Liling Tan, Louis Tiao, Marcus Uneson, Yu Usami, Steven Xu, Zhe
Wang, Chuck Wooters, lade, isnowfy, onesandzeros, pquentin, wvanlint

Version 3.0b2 2014-08-21
* minor bugfixes and clean-ups
* renamed remaining parse_ methods to read_ or load_, cf issue #656
* added Paice's method of evaluating stemming algorithms

Thanks to the following contributors to 3.0.0b2: Lars Buitinck,
Cristian Capdevila, Lauri Hallila, Ofer Helman, Dmitrijs Milajevs,
lade, Liling Tan, Steven Xu

Version 3.0.0b1 2014-07-11
* Added SentiWordNet corpus and corpus reader
* Fixed support for 10-column dependency file format
* Changed Tree initialization to use fromstring

Thanks to the following contributors to 3.0b1: Mark Amery, Ivan
Barria, Ingolf Becker, Francis Bond, Lars Buitinck, Arthur Darcet,
Michelle Fullwood, Dan Garrette, Dougal Graham, Tyler Hartley,
Ofer Helman, Bruce Hill, Marcus Huderle, Nancy
Ide, Nick Johnson, Angelos Katharopoulos, Ewan Klein, Mikhail Korobov,
Chris Liechti, Peter Ljunglof, Joseph Lynch, Haejoong Lee, Peter
Ljunglöf, Dean Malmgren, Rob Malouf, Thorsten Marek, Dmitrijs
Milajevs, Shari A’aidil Nasruddin, Lance Nathan, Joel Nothman, Alireza
Nourian, Alexander Oleynikov, Ted Pedersen, Jacob Perkins, Will
Roberts, Alex Rudnick, Nathan Schneider, Geraldine Sim Wei Ying, Lynn
Soe, Liling Tan, Louis Tiao, Marcus Uneson, Yu Usami, Steven Xu, Zhe
Wang, Chuck Wooters, isnowfy, onesandzeros, pquentin, wvanlint

Version 3.0a4 2014-05-25
* IBM Models 1-3, BLEU, Gale-Church aligner
* Lesk algorithm for WSD
* Open Multilingual WordNet
* New implementation of Brill Tagger
* Extend BNCCorpusReader to parse the whole BNC
* MASC Tagged Corpus and corpus reader
* Interface to Stanford Parser
* Code speed-ups and clean-ups
* API standardisation, including fromstring method for many objects
* Improved regression testing setup
* Removed PyYAML dependency

Thanks to the following contributors to 3.0a4:
Ivan Barria, Ingolf Becker, Francis Bond, Arthur Darcet, Dan Garrette,
Ofer Helman, Dougal Graham, Nancy Ide, Ewan Klein, Mikhail Korobov,
Chris Liechti, Peter Ljunglof, Joseph Lynch, Rob Malouf, Thorsten Marek,
Dmitrijs Milajevs, Shari A’aidil Nasruddin, Lance Nathan, Joel Nothman,
Jacob Perkins, Lynn Soe, Liling Tan, Louis Tiao, Marcus Uneson, Steven Xu,
Geraldine Sim Wei Ying

Version 3.0a3 2013-11-02
* support for FrameNet contributed by Chuck Wooters
* support for Universal Declaration of Human Rights Corpus (udhr2)
* major API changes:
  - Tree.node -> Tree.label() / Tree.set_label()
  - Chunk parser: top_node -> root_label; chunk_node -> chunk_label
  - WordNet properties are now access methods, e.g. Synset.definition -> Synset.definition()
  - relextract: show_raw_rtuple() -> rtuple(), show_clause() -> clause()
* bugfix in texttiling
* replaced simplify_tags with support for universal tagset (simplify_tags=True -> tagset='universal')
* Punkt default behavior changed to realign sentence boundaries after trailing parenthesis and quotes
* deprecated classify.svm (use scikit-learn instead)
* various efficiency improvements

Thanks to the following contributors to 3.0a3:
Lars Buitinck, Marcus Huderle, Nick Johnson, Dougal Graham, Ewan Klein,
Mikhail Korobov, Haejoong Lee, Peter Ljunglöf, Dean Malmgren, Lance Nathan,
Alexander Oleynikov, Nathan Schneider, Chuck Wooters, Yu Usami, Steven Xu,
pquentin, wvanlint

Version 3.0a2 2013-07-12
* speed improvements in word_tokenize, GAAClusterer, TnT tagger, Baum Welch, HMM tagger
* small improvements in collocation finders, probability, modelling, Porter Stemmer
* bugfix in lowest common hypernym calculation (used in path similarity measures)
* code cleanups, docstring cleanups, demo fixes

Thanks to the following contributors to 3.0a2:
Mark Amery, Lars Buitinck, Michelle Fullwood, Dan Garrette, Dougal Graham,
Tyler Hartley, Bruce Hill, Angelos Katharopoulos, Mikhail Korobov,
Rob Malouf, Joel Nothman, Ted Pedersen, Will Roberts, Alex Rudnick,
Steven Xu, isnowfy, onesandzeros

Version 3.0a1 2013-02-14
* reinstated tkinter support (Haejoong Lee)

Version 3.0a0 2013-01-14
* alpha release of first version to support Python 2.6, 2.7, and 3.

Version 2.0.4 2012-11-07
* minor bugfix (removed numpy dependency)

Version 2.0.3 2012-09-24

* fixed corpus/reader/util.py to support Python 2.5
* make MaltParser safe to use in parallel
* fixed bug in inter-annotator agreement
* updates to various doctests (nltk/test)
* minor bugfixes

Thanks to the following contributors to 2.0.3:
Robin Cooper, Pablo Duboue, Christian Federmann, Dan Garrette, Ewan Klein,
Pierre-François Laquerre, Max Leonov, Peter Ljunglöf, Nitin Madnani, Ceri Stagg

Version 2.0.2 2012-07-05

* improvements to PropBank, NomBank, and SemCor corpus readers
* interface to full Penn Treebank Corpus V3 (corpus.ptb)
* made wordnet.lemmas case-insensitive
* more flexible padding in model.ngram
* minor bugfixes and documentation enhancements
* better support for automated testing

Thanks to the following contributors to 2.0.2:
Daniel Blanchard, Mikhail Korobov, Nitin Madnani, Duncan McGreggor,
Morten Neergaard, Nathan Schneider, Rico Sennrich.

Version 2.0.1 2012-05-15

* moved NLTK to GitHub: https://github.com/nltk
* set up integration testing: https://jenkins.shiningpanda.com/nltk/ (Morten Neergaard)
* converted documentation to Sphinx format: https://www.nltk.org/api/nltk.html
* dozens of minor enhancements and bugfixes: https://github.com/nltk/nltk/commits/
* dozens of fixes for conformance with PEP-8
* dozens of fixes to ensure operation with Python 2.5
* added interface to Lin's Dependency Thesaurus (Dan Blanchard)
* added interface to scikit-learn classifiers (Lars Buitinck)
* added segmentation evaluation measures (David Doukhan)

Thanks to the following contributors to 2.0.1 (since 2.0b9, July 2010):
Rami Al-Rfou', Yonatan Becker, Steven Bethard, Daniel Blanchard, Lars
Buitinck, David Coles, Lucas Cooper, David Doukhan, Dan Garrette,
Masato Hagiwara, Michael Hansen, Michael Heilman, Rebecca Ingram,
Sudharshan Kaushik, Mikhail Korobov, Peter Ljunglof, Nitin Madnani,
Rob Malouf, Tomonori Nagano, Morten Neergaard, David Nemeskey,
Joel Nothman, Jacob Perkins, Alessandro Presta, Alex Rudnick,
Nathan Schneider, Stefano Lattarini, Peter Stahl, Jason Yoder

Version 2.0.1 (rc1) 2011-04-11

NLTK:
* added interface to the Stanford POS Tagger
* updates to sem.Boxer, sem.drt.DRS
* allow unicode strings in grammars
* allow non-string features in classifiers
* modifications to HunposTagger
* issues with DRS printing
* fixed bigram collocation finder for window_size > 2
* doctest paths no longer presume unix-style pathname separators
* fixed issue with NLTK's tokenize module colliding with the Python tokenize module
* fixed issue with stemming Unicode strings
* changed ViterbiParser.nbest_parse to parse
* ChaSen and KNBC Japanese corpus readers
* preserve case in concordance display
* fixed bug in simplification of Brown tags
* a version of IBM Model 1 as described in Koehn 2010
* new class AlignedSent for aligned sentence data and evaluation metrics
* new nltk.util.set_proxy to allow easy configuration of HTTP proxy
* improvements to downloader user interface to catch URL and HTTP errors
* added CHILDES corpus reader
* created special exception hierarchy for Prover9 errors
* significant changes to the underlying code of the boxer interface
* path-based wordnet similarity metrics use a fake root node for verbs, following the Perl version
* added ability to handle multi-sentence discourses in Boxer
* added the 'english' Snowball stemmer
* simplifications and corrections of Earley Chart Parser rules
* several changes to the feature chart parsers for correct unification
* bugfixes: FreqDist.plot, FreqDist.max, NgramModel.entropy, CategorizedCorpusReader, DecisionTreeClassifier
* removal of Python >2.4 language features for 2.4 compatibility
* removal of deprecated functions and associated warnings
* added semantic domains to wordnet corpus reader
* changed wordnet similarity functions to include instance hyponyms
* updated to use latest version of Boxer

Data:
* JEITA Public Morphologically Tagged Corpus (in ChaSen format)
* KNB Annotated corpus of Japanese blog posts
* Fixed some minor bugs in alvey.fcfg, and added number of parse trees in alvey_sentences.txt
* added more comtrans data

Documentation:
* minor fixes to documentation
* NLTK Japanese book (chapter 12) by Masato Hagiwara

NLTK-Contrib:
* Viethen and Dale referring expression algorithms


Version 2.0b9 2010-07-25

NLTK:
* many code and documentation cleanups
* Added port of Snowball stemmers
* Fixed loading of pickled tokenizers (issue 556)
* DecisionTreeClassifier now handles unknown features (issue 570)
* Added error messages to LogicParser
* Replaced max_models with end_size to prevent Mace from hanging
* Added interface to Boxer
* Added nltk.corpus.semcor to give access to SemCor 3.0 corpus (issue 530)
* Added support for integer- and float-valued features in maxent classifiers
* Permit NgramModels to be pickled
* Added Sourced Strings (see test/sourcedstring.doctest for details)
* Fixed bugs in Good-Turing and Simple Good-Turing Estimation (issue 26)
* Add support for span tokenization, aka standoff annotation of segmentation (incl Punkt)
* allow unicode nodes in Tree.productions()
* Fixed WordNet's morphy to be consistent with the original implementation,
  taking the shortest returned form instead of an arbitrary one (issues 427, 487)
* Fixed bug in MaxentClassifier
* Accepted bugfixes for YCOE corpus reader (issue 435)
* Added test to _cumulative_frequencies() to correctly handle the case when no arguments are supplied
* Added a TaggerI interface to the HunPos open-source tagger
* Return 0, not None, when no count is present for a lemma in WordNet
* fixed pretty-printing of unicode leaves
* More efficient calculation of the leftcorner relation for left corner parsers
* Added two functions for graph calculations: transitive closure and inversion.
* FreqDist.pop() and FreqDist.popitems() now invalidate the caches (issue 511)
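
For readers unfamiliar with the two graph calculations named in the list above (transitive closure and inversion), the following is a minimal standalone sketch. The function names and the dict-of-sets graph representation are assumptions made for this illustration; they are not taken from the NLTK source.

```python
def invert_graph(graph):
    """Reverse every edge of a graph given as {node: set_of_successors}."""
    inverted = {node: set() for node in graph}
    for node, successors in graph.items():
        for succ in successors:
            inverted.setdefault(succ, set()).add(node)
    return inverted


def transitive_closure(graph):
    """Map each node to every node reachable from it in one or more steps."""
    closure = {}
    for start in graph:
        reachable = set()
        stack = list(graph[start])
        while stack:
            node = stack.pop()
            if node not in reachable:
                reachable.add(node)
                stack.extend(graph.get(node, ()))
        closure[start] = reachable
    return closure


# Hypothetical three-node chain: a -> b -> c
graph = {"a": {"b"}, "b": {"c"}, "c": set()}
print(transitive_closure(graph))
print(invert_graph(graph))
```

In the closure, "a" reaches both "b" and "c"; in the inverted graph, every edge points the other way.
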

Data:
* Added SemCor 3.0 corpus (Brown Corpus tagged with WordNet synsets)
* Added LanguageID corpus (trigram counts for 451 languages)
* Added grammar for a^n b^n c^n

NLTK-Contrib:
* minor updates

Thanks to the following contributors to 2.0b9:

Steven Bethard, Francis Bond, Dmitry Chichkov, Liang Dong, Dan Garrette,
Simon Greenhill, Bjorn Maeland, Rob Malouf, Joel Nothman, Jacob Perkins,
Alberto Planas, Alex Rudnick, Geoffrey Sampson, Kevin Scannell, Richard Sproat


Version 2.0b8 2010-02-05

NLTK:
* fixed copyright and license statements
* removed PyYAML, and added the dependency to installers and download instructions
* updates to LogicParser, DRT (Dan Garrette)
* WordNet similarity metrics return None instead of -1 when
  they fail to find a path (Steve Bethard)
* shortest_path_distance uses instance hypernyms (Jordan Boyd-Graber)
* clean_html improved (Bjorn Maeland)
* batch_parse, batch_interpret and batch_evaluate functions allow
  grammar or grammar filename as argument
* more Portuguese examples (portuguese_en.doctest, examples/pt.py)
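
The change to the similarity metrics in the list above (None rather than -1 when no path is found) can be illustrated with a small standalone sketch. It runs on a toy graph, not on WordNet; the function names, the toy data and the 1 / (distance + 1) scoring are assumptions chosen for the example.

```python
from collections import deque


def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns None when goal is unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def path_similarity(graph, a, b):
    """1 / (distance + 1), or None when no path exists."""
    distance = shortest_path_length(graph, a, b)
    if distance is None:
        return None
    return 1.0 / (distance + 1)


# Hypothetical toy graph; "idea" is disconnected from the rest.
toy = {"dog": ["canine"], "canine": ["dog", "animal"], "animal": ["canine"], "idea": []}
print(path_similarity(toy, "dog", "animal"))
print(path_similarity(toy, "dog", "idea"))
```

Returning None keeps "no path" distinct from any numeric score, so callers can test for it with "is None" instead of comparing against a sentinel number.
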

NLTK-Contrib:
* Aligner implementations (Christopher Crowner, Torsten Marek)
* ScriptTranscriber package (Richard Sproat and Kristy Hollingshead)

Book:
* updates for second printing, correcting errata
  https://nltk.googlecode.com/svn/trunk/nltk/doc/book/errata.txt

Data:
* added Europarl sample, with 10 docs for each of 11 langs (Nitin Madnani)
* added SMULTRON sample corpus (Torsten Marek, Martin Volk)


Version 2.0b7 2009-11-09

NLTK:
* minor bugfixes and enhancements: data loader, inference package, FreqDist, Punkt
* added Portuguese example module, similar to nltk.book for English (examples/pt.py)
* added all_lemma_names() method to WordNet corpus reader
* added update() and __add__() extensions to FreqDist (enhances alignment with Python 3.0 counters)
* reimplemented clean_html
* added test-suite runner for automatic/manual regression testing

NLTK-Data:
* updated Punkt models for sentence segmentation
* added corpus of the works of Machado de Assis (Brazilian Portuguese)

Book:
* Added translation of preface into Portuguese, contributed by Tiago Tresoldi.

Version 2.0b6 2009-09-20

NLTK:
* minor fixes for Python 2.4 compatibility
* added words() method to XML corpus reader
* minor bugfixes and code clean-ups
* fixed downloader to put data in %APPDATA% on Windows

Data:
* Updated Punkt models
* Fixed utf8 encoding issues with UDHR and Stopwords Corpora
* Renamed CoNLL "cat" files to "esp" (different language)
* Added Alvey NLT feature-based grammar
* Added Polish PL196x corpus

Version 2.0b5 2009-07-19

NLTK:
* minor bugfixes (incl FreqDist, Python eggs)
* added reader for Europarl Corpora (contributed by Nitin Madnani)
* added reader for IPI PAN Polish Corpus (contributed by Konrad Goluchowski)
* fixed data.py so that it doesn't generate a warning for Windows Python 2.6

NLTK-Contrib:
* updated Praat reader (contributed by Margaret Mitchell)

Version 2.0b4 2009-07-10

NLTK:
* switched to Apache License, Version 2.0
* minor bugfixes in semantics and inference packages
* support for Python eggs
* fixed stale regression tests

Data:
* added NomBank 1.0
* uppercased feature names in some grammars

Version 2.0b3 2009-06-25

NLTK:
* several bugfixes
* added nombank corpus reader (Paul Bedaride)

Version 2.0b2 2009-06-15

NLTK:
* minor bugfixes and optimizations for parsers, updated some doctests
* added bottom-up filtered left corner parsers,
  LeftCornerChartParser and IncrementalLeftCornerChartParser.
* fixed dispersion plot bug which prevented empty plots

Version 2.0b1 2009-06-09

NLTK:
* major refactor of chart parser code and improved API (Peter Ljunglöf)
* added new bottom-up left-corner chart parser strategy
* misc bugfixes (ChunkScore, chart rules, chatbots, jcn-similarity)
* improved efficiency of "import nltk" using lazy module imports
* moved CCG package and ISRI Arabic stemmer from NLTK-Contrib into core NLTK
* misc code cleanups

Contrib:
* moved out of the main NLTK distribution into a separate distribution

Book:
* Ongoing polishing ahead of print publication

Version 0.9.9 2009-05-06

NLTK:
* Finalized API for NLTK 2.0 and the book, incl dozens of small fixes
* Names of the form nltk.foo.Bar now available as nltk.Bar
  for significant functionality; in some cases the name was modified
  (using old names will produce a deprecation warning)
* Bugfixes in downloader, WordNet
* Expanded functionality in DecisionTree
* Bigram collocations extended for discontiguous bigrams
* Translation toy nltk.misc.babelfish
* New module nltk.help giving access to tagset documentation
* Fix imports so that NLTK builds without Tkinter (Bjorn Maeland)
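
The entry above on discontiguous bigrams refers to pairing a word with words that follow it within a window, not only with its immediate neighbour. A minimal sketch of that counting scheme is given below; the function name and the window_size parameter are assumptions for this example and do not reproduce NLTK's collocation finder.

```python
from collections import Counter


def windowed_bigrams(words, window_size=2):
    """Count (first, second) pairs where second follows first within the window.

    window_size=2 gives ordinary adjacent bigrams; larger values also
    count pairs separated by up to window_size - 2 intervening words.
    """
    if window_size < 2:
        raise ValueError("window_size must be at least 2")
    counts = Counter()
    for i, first in enumerate(words):
        for second in words[i + 1 : i + window_size]:
            counts[(first, second)] += 1
    return counts


words = "a b c d".split()
print(windowed_bigrams(words, window_size=2))
print(windowed_bigrams(words, window_size=3))
```

With window_size=3 the pair ("a", "c") is counted even though "b" sits between the two words.
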

Data:
* new maxent NE chunker model
* updated grammar packages for the book
* data for new tagsets collection, documenting several tagsets
* added lolcat translation to the Genesis collection

Contrib (work in progress):
* Updates to coreference package (Joseph Frazee)
* New ISRI Arabic stemmer (Hosam Algasaier)
* Updates to Toolbox package (Greg Aumann)

Book:
* Substantial editorial corrections ahead of final submission

Version 0.9.8 2009-02-18

NLTK:
* New off-the-shelf tokenizer, POS tagger, and named-entity tagger
* New metrics package with inter-annotator agreement scores,
  distance metrics, rank correlation
* New collocations package (Joel Nothman)
* Many clean-ups to WordNet package (Steven Bethard, Jordan Boyd-Graber)
* Moved old pywordnet-based WordNet package to nltk_contrib
* WordNet browser (Paul Bone)
* New interface to dependency treebank corpora
* Moved MinimalSet class into nltk.misc package
* Put NLTK applications in new nltk.app package
* Many other improvements incl semantics package, toolbox, MaltParser
* Misc changes to many API names in preparation for 1.0, old names deprecated
* Most classes now available in the top-level namespace
* Work on Python egg distribution (Brandon Rhodes)
* Removed deprecated code remaining from 0.8.* versions
* Fixes for Python 2.4 compatibility

Data:
* Corrected identifiers in Dependency Treebank corpus
* Basque and Catalan Dependency Treebanks (CoNLL 2007)
* PE08 Parser Evaluation data
* New models for POS tagger and named-entity tagger

Book:
* Substantial editorial corrections

Version 0.9.7 2008-12-19

NLTK:
* fixed problems with accessing zipped corpora
* improved design and efficiency of grammars and chart parsers
  including new bottom-up combine strategy and a redesigned
  Earley strategy (Peter Ljunglof)
* fixed bugs in smoothed probability distributions and added
  regression tests (Peter Ljunglof)
* improvements to Punkt (Joel Nothman)
* improvements to text classifiers
* simple word-overlap RTE classifier

Data:
* A new package of large grammars (Peter Ljunglof)
* A small gazetteer corpus and corpus reader
* Organized example grammars into separate packages
* Children's stories added to gutenberg package

Contrib (work in progress):
* fixes and demonstration for named-entity feature extractors in nltk_contrib.coref

Book:
* extensive changes throughout, including new chapter 5 on classification
  and substantially revised chapter 11 on managing linguistic data

Version 0.9.6 2008-12-07

NLTK:
* new WordNet corpus reader (contributed by Steven Bethard)
* incorporated dependency parsers into NLTK (was NLTK-Contrib) (contributed by Jason Narad)
* moved nltk/cfg.py to nltk/grammar.py and incorporated dependency grammars
* improved efficiency of unification algorithm
* various enhancements to the semantics package
* added plot() and tabulate() methods to FreqDist and ConditionalFreqDist
* FreqDist.keys() and list(FreqDist) provide keys reverse-sorted by value,
  to avoid the confusion caused by FreqDist.sorted()
* new downloader module to support interactive data download: nltk.download();
  can also be run from the command line using "python -m nltk.downloader all"
* fixed WordNet bug that caused min_depth() to sometimes give incorrect result
* added nltk.util.Index as a wrapper around defaultdict(list) plus
  a functional-style initializer
* fixed bug in Earley chart parser that caused it to break
* added basic TnT tagger nltk.tag.tnt
* new corpus reader for CoNLL dependency format (contributed by Kepa Sarasola and Iker Manterola)
* misc other bugfixes
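
The nltk.util.Index entry above describes a defaultdict(list) that is populated from (key, value) pairs passed to its initializer. The following sketch shows one way such a wrapper can be written; it is an illustration of the idea under that description, not a copy of the NLTK source, and the example data is invented.

```python
from collections import defaultdict


class Index(defaultdict):
    """A defaultdict(list) that is filled from an iterable of (key, value) pairs."""

    def __init__(self, pairs=()):
        super().__init__(list)
        for key, value in pairs:
            self[key].append(value)


# Hypothetical usage: group words by their first letter.
words = ["apple", "avocado", "banana", "blueberry", "cherry"]
by_initial = Index((word[0], word) for word in words)
print(by_initial["a"])
print(by_initial["b"])
```

Because the initializer accepts any iterable of pairs, an index can be built in one expression from a generator.
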

Contrib (work in progress):
* TIGERSearch implementation by Torsten Marek
* extensions to hole and glue semantics modules by Dan Garrette
* new coreference package by Joseph Frazee
* MapReduce interface by Xinfan Meng

Data:
* Corpora are stored in compressed format if this will not compromise speed of access
* Swadesh Corpus of comparative wordlists in 23 languages
* Split grammar collection into separate packages
* New Basque and Spanish grammar samples (contributed by Kepa Sarasola and Iker Manterola)
* Brown Corpus sections now have meaningful names (e.g. 'a' is now 'news')
* Fixed bug that forced users to manually unzip the WordNet corpus
* New dependency-parsed version of Treebank corpus sample
* Added movie script "Monty Python and the Holy Grail" to webtext corpus
* Replaced words corpus data with a much larger list of English words
* New URL for list of available NLTK corpora
  https://nltk.googlecode.com/svn/trunk/nltk_data/index.xml

Book:
* complete rewrite of first three chapters to make the book accessible
  to a wider audience
* new chapter on data-intensive language processing
* extensive reworking of most chapters
* Dropped subsection numbering; moved exercises to end of chapters

Distributions:
* created Portfile to support Mac installation


Version 0.9.5 2008-08-27

NLTK:
* text module with support for concordancing, text generation, plotting
* book module
* Major reworking of the automated theorem proving modules (Dan Garrette)
* draw.dispersion now uses pylab
* draw.concordance GUI tool
* nltk.data supports reading corpora and other data files from within zipfiles
* trees can be constructed from strings with Tree(s) (cf Tree.parse(s))

Contrib (work in progress):
* many updates to student projects
  - nltk_contrib.agreement (Thomas Lippincott)
  - nltk_contrib.coref (Joseph Frazee)
  - nltk_contrib.depparser (Jason Narad)
  - nltk_contrib.fuf (Petro Verkhogliad)
  - nltk_contrib.hadoop (Xinfan Meng)
* clean-ups: deleted stale files; moved some packages to misc

Data:
* Cleaned up Gutenberg text corpora
* added Moby Dick; removed redundant copy of Blake songs.
* more tagger models
* renamed to nltk_data to facilitate installation
* stored each corpus as a zip file for quicker installation
  and access, and to solve a problem with the Propbank
  corpus including a file with an illegal name for MSWindows
  (con.xml).

Book:
* changed filenames to chNN format
* reworked opening chapters (work in progress)

Distributions:
* fixed problem with mac installer that arose when Python binary
  couldn't be found
* removed dependency of NLTK on nltk_data so that NLTK code can be
  installed before the data

Version 0.9.4 2008-08-01

NLTK:
- Expanded semantics package for first order logic, linear logic,
  glue semantics, DRT, LFG (Dan Garrette)
- new WordSense class in wordnet.synset supporting access to synsets
  from sense keys and accessing sense counts (Joel Nothman)
- interface to Mallet's linear chain CRF implementation (nltk.tag.crf)
- misc bugfixes incl Punkt, synsets, maxent
- improved support for chunkers incl flexible chunk corpus reader,
  new rule type: ChunkRuleWithContext
- new GUI for pos-tagged concordancing nltk.draw.pos_concordance
- new GUI for developing regexp chunkers nltk.draw.rechunkparser
- added bio_sents() and bio_words() methods to ConllChunkCorpusReader in conll.py
  to allow reading (word, tag, chunk_typ) tuples from the CoNLL-2000 corpus. Also
  modified ConllChunkCorpusView to support these changes.
- feature structures support values with custom unification methods
- new flag on tagged corpus readers to use simplified tagsets
- new package for ngram language modeling with Katz backoff nltk.model
- added classes for single-parented and multi-parented trees that
  automatically maintain parent pointers (nltk.tree.ParentedTree and
  nltk.tree.MultiParentedTree)
- new WordNet browser GUI (Jussi Salmela, Paul Bone)
- improved support for lazy sequences
- added generate() method to probability distributions
- more flexible parser for converting bracketed strings to trees
- made fixes to docstrings to improve API documentation

Contrib (work in progress):
- new NLG package, FUF/SURGE (Petro Verkhogliad)
- new dependency parser package (Jason Narad)
- new Coreference package, incl support for
  ACE-2, MUC-6 and MUC-7 corpora (Joseph Frazee)
- CCG Parser (Graeme Gange)
- first order resolution theorem prover (Dan Garrette)

Data:
- New NPS Chat Corpus and corpus reader (nltk.corpus.nps_chat)
- ConllCorpusReader can now be used to read CoNLL 2004 and 2005 corpora.
- Implemented HMM-based Treebank POS tagger and phrase chunker for
  nltk_contrib.coref in api.py. Pickled versions of these objects are checked
  in under data/taggers and data/chunkers.

Book:
- misc corrections in response to feedback from readers

Version 0.9.3 2008-06-03

NLTK:
- modified WordNet similarity code to use pre-built information content files
- new classifier-based tagger, BNC corpus reader
- improved unicode support for corpus readers
- improved interfaces to Weka, Prover9/Mace4
- new support for using MEGAM and SciPy to train maxent classifiers
- rewrite of Punkt sentence segmenter (Joel Nothman)
- bugfixes for WordNet information content module (Jordan Boyd-Graber)
- code clean-ups throughout

Book:
- miscellaneous fixes in response to feedback from readers

Contrib:
- implementation of incremental algorithm for generating
  referring expressions (contributed by Margaret Mitchell)
- refactoring WordNet browser (Paul Bone)

Corpora:
- included WordNet information content files

Version 0.9.2 2008-03-04

NLTK:
- new theorem-prover and model-checker module nltk.inference,
  including interface to Prover9/Mace4 (Dan Garrette, Ewan Klein)
- fixed bug in Reuters corpus reader that caused Python
  to complain about too many open files
- VerbNet and PropBank corpus readers

Data:
- VerbNet Corpus version 2.1: hierarchical verb lexicon linked to WordNet
- PropBank Corpus: predicate-argument structures, as stand-off annotation of Penn Treebank

Contrib:
- New work on WordNet browser, incorporating a client-server model (Jussi Salmela)

Distributions:
- Mac OS 10.5 distribution

Version 0.9.1 2008-01-24

NLTK:
- new interface for text categorization corpora
- new corpus readers: RTE, Movie Reviews, Question Classification, Brown Corpus
- bugfix in ConcatenatedCorpusView that caused iteration to fail if it didn't start from the beginning of the corpus

Data:
- Question classification data, included with permission of Li & Roth
- Reuters 21578 Corpus, ApteMod version, from CPAN
- Movie Reviews corpus (sentiment polarity), included with permission of Lillian Lee
- Corpus for Recognising Textual Entailment (RTE) Challenges 1, 2 and 3
- Brown Corpus (reverted to original file structure: ca01-cr09)
- Penn Treebank corpus sample (simplified implementation, new readers treebank_raw and treebank_chunk)
- Minor redesign of corpus readers, to use filenames instead of "items" to identify parts of a corpus

Contrib:
- theorem_prover: Prover9, tableau, MaltParser, Mace4, glue semantics, docs (Dan Garrette, Ewan Klein)
- drt: improved drawing, conversion to FOL (Dan Garrette)
- gluesemantics: GUI demonstration, abstracted LFG code, documentation (Dan Garrette)
- readability: various text readability scores (Thomas Jakobsen, Thomas Skardal)
- toolbox: code to normalize toolbox databases (Greg Aumann)

Book:
- many improvements in early chapters in response to reader feedback
- updates for revised corpus readers
- moved unicode section to chapter 3
- work on engineering.txt (not included in 0.9.1)

Distributions:
- Fixed installation for Mac OS 10.5 (Joshua Ritterman)
- Generalize doctest_driver to work with doc_contrib

Version 0.9 2007-10-12

NLTK:
- New naming of packages and modules, and more functions imported into
  top-level nltk namespace, e.g. nltk.chunk.Regexp -> nltk.RegexpParser,
  nltk.tokenize.Line -> nltk.LineTokenizer, nltk.stem.Porter -> nltk.PorterStemmer,
  nltk.parse.ShiftReduce -> nltk.ShiftReduceParser
- processing class names changed from verbs to nouns, e.g.
  StemI -> StemmerI, ParseI -> ParserI, ChunkParseI -> ChunkParserI, ClassifyI -> ClassifierI
- all tokenizers are now available as subclasses of TokenizeI,
  selected tokenizers are also available as functions, e.g. wordpunct_tokenize()
- rewritten ngram tagger code, collapsed lookup tagger with unigram tagger
- improved tagger API, permitting training in the initializer
- new system for deprecating code so that users are notified of name changes.
- support for reading feature cfgs to parallel reading cfgs (parse_featcfg())
- text classifier package, maxent (GIS, IIS), naive Bayes, decision trees, weka support
- more consistent tree printing
- wordnet's morphy stemmer now accessible via stemmer package
- RSLP Portuguese stemmer (originally developed by Viviane Moreira Orengo, reimplemented by Tiago Tresoldi)
- promoted ieer_rels.py to the sem package
- improvements to WordNet package (Jussi Salmela)
- more regression tests, and support for checking coverage of tests
- miscellaneous bugfixes
- remove numpy dependency

Data:
- new corpus reader implementation, refactored syntax corpus readers
- new data package: corpora, grammars, tokenizers, stemmers, samples
- CESS-ESP Spanish Treebank and corpus reader
- CESS-CAT Catalan Treebank and corpus reader
- Alpino Dutch Treebank and corpus reader
- MacMorpho POS-tagged Brazilian Portuguese news text and corpus reader
- trained model for Portuguese sentence segmenter
- Floresta Portuguese Treebank version 7.4 and corpus reader
- TIMIT player audio support

Contrib:
- BioReader (contributed by Carlos Rodriguez)
- TnT tagger (contributed by Sam Huston)
- wordnet browser (contributed by Jussi Salmela, requires wxpython)
- lpath interpreter (contributed by Haejoong Lee)
- timex -- regular expression-based temporal expression tagger

Book:
- polishing of early chapters
- introductions to parts 1, 2, 3
- improvements in book processing software (xrefs, avm & gloss formatting, javascript clipboard)
- updates to book organization, chapter contents
- corrections throughout suggested by readers (acknowledged in preface)
- more consistent use of US spelling throughout
- all examples redone to work with single import statement: "import nltk"
- reordered chapters: 5->7->8->9->11->12->5
  * language engineering in part 1 to broaden the appeal
    of the earlier part of the book and to talk more about
    evaluation and baselines at an earlier stage
  * concentrate the partial and full parsing material in part 2,
    and remove the specialized feature-grammar material into part 3

Distributions:
- streamlined mac installation (Joshua Ritterman)
- included mac distribution with ISO image

Version 0.8 2007-07-01

Code:
- changed nltk.__init__ imports to explicitly import names from top-level modules
- changed corpus.util to use the 'rb' flag for opening files, to fix problems
  reading corpora under MSWindows
- updated stale examples in engineering.txt
- extended feature structure interface to permit chained features, e.g. fs['F','G']
- further misc improvements to test code plus some bugfixes
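
The chained-feature access mentioned above, fs['F','G'], relies on the fact that Python passes a tuple to __getitem__ when several keys are given in the brackets. The class below is a minimal sketch of that mechanism over nested dicts; the class name ChainedFeatures is hypothetical and this is not NLTK's feature structure implementation.

```python
class ChainedFeatures(dict):
    """Nested dict whose values can be reached with a path of feature names."""

    def __getitem__(self, key):
        if isinstance(key, tuple):
            # fs['F', 'G'] arrives here as key == ('F', 'G'):
            # follow the path one feature name at a time.
            value = self
            for name in key:
                value = dict.__getitem__(value, name)
            return value
        return dict.__getitem__(self, key)


fs = ChainedFeatures({"F": {"G": "value"}, "H": 1})
print(fs["F", "G"])
print(fs["F"])
```

The chained form fs['F', 'G'] gives the same result as fs['F']['G'] in this sketch.
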
Tutorials:
- rewritten opening section of tagging chapter
- reorganized some exercises

Version 0.8b2 2007-06-26

Code (major):
- new corpus package, obsoleting old corpora package
  - supports caching, slicing, corpus search path
  - more flexible API
  - global updates so all NLTK modules use new corpus package
- moved nltk/contrib to separate top-level package nltk_contrib
- changed wordpunct tokenizer to use \w instead of a-zA-Z0-9
  as this will be more robust for languages other than English,
  with implications for many corpus readers that use it
- known bug: certain re-entrant structures in featstruct
- known bug: when the LHS of an edge contains an ApplicationExpression,
  variable values in the RHS bindings aren't copied over when the
  fundamental rule applies
- known bug: HMM tagger is broken
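
The wordpunct tokenizer change listed above (\w instead of a-zA-Z0-9) matters for text containing non-ASCII letters. The sketch below contrasts the two character classes using Python's re module; both patterns are assumptions about the general shape of such a tokenizer's regular expression, chosen to show the difference, and the sample sentence is invented.

```python
import re

# Old style: only ASCII letters and digits count as word characters.
ASCII_PATTERN = re.compile(r"[a-zA-Z0-9]+|[^a-zA-Z0-9\s]+")
# New style: \w also matches letters outside ASCII (for str patterns in Python 3).
WORD_PATTERN = re.compile(r"\w+|[^\w\s]+")

# "Ein schönes Café, oder?" written with escapes to avoid encoding ambiguity.
text = "Ein sch\u00f6nes Caf\u00e9, oder?"
ascii_tokens = ASCII_PATTERN.findall(text)
word_tokens = WORD_PATTERN.findall(text)
print(ascii_tokens)
print(word_tokens)
```

With the ASCII-only class the accented letters are split off as if they were punctuation, whereas the \w version keeps each word whole.
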
Tutorials:
- global updates to NLTK and docs
- ongoing polishing
Corpora:
- treebank sample reverted to published multi-file structure
Contrib:
- DRT and Glue Semantics code (nltk_contrib.drt, nltk_contrib.gluesemantics, by Dan Garrette)

Version 0.8b1 2007-06-18

Code (major):
- changed package name to nltk
- import all top-level modules into nltk, reducing need for import statements
- reorganization of sub-package structures to simplify imports
- new featstruct module, unifying old featurelite and featurestructure modules
- FreqDist now inherits from dict, fd.count(sample) becomes fd[sample]
- FreqDist initializer permits: fd = FreqDist(len(token) for token in text)
- made numpy optional
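
The two FreqDist entries above (inheriting from dict, and an initializer that accepts an iterable of samples) can be illustrated with a small standalone class. FreqDistSketch is a hypothetical name; the class mimics only the behaviour described in those entries, and returning 0 for unseen samples is an assumption made for this example.

```python
class FreqDistSketch(dict):
    """Counts samples; a dict subclass, so counts are read with fd[sample]."""

    def __init__(self, samples=()):
        super().__init__()
        for sample in samples:
            self[sample] = self.get(sample, 0) + 1

    def __missing__(self, sample):
        # Assumption for this sketch: unseen samples have a count of zero
        # instead of raising KeyError. Nothing is stored for them.
        return 0


text = ["the", "cat", "sat", "on", "the", "mat"]
# Count word lengths, in the style of the initializer shown in the entry above.
fd = FreqDistSketch(len(token) for token in text)
print(fd[3])
print(fd[2])
print(fd[7])
```

Since the class is a dict, the usual dict operations such as len(fd), iteration and membership tests work without extra code.
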
Code (minor):
- changed GrammarFile initializer to accept filename
- consistent tree display format
- fixed loading process for WordNet and TIMIT that prevented code installation if data not installed
- taken more care with unicode types
- incorporated pcfg code into cfg module
- moved cfg, tree, featstruct to top level
- new filebroker module to make handling of example grammar files more transparent
- more corpus readers (webtext, abc)
- added cfg.covers() to check that a grammar covers a sentence
- simple text-based wordnet browser
- known bug: parse/featurechart.py uses incorrect apply() function
Corpora:
- csv data file to document NLTK corpora
Contrib:
- added Glue semantics code (contrib.glue, by Dan Garrette)
- Punkt sentence segmenter port (contrib.punkt, by Willy)
- added LPath interpreter (contrib.lpath, by Haejoong Lee)
- extensive work on classifiers (contrib.classifier*, Sumukh Ghodke)
Tutorials:
- polishing on parts I, II
- more illustrations, data plots, summaries, exercises
- continuing to make prose more accessible to non-linguistic audience
- new default import that all chapters presume: from nltk.book import *
Distributions:
- updated to latest version of numpy
- removed WordNet installation instructions as WordNet is now included in corpus distribution
- added pylab (matplotlib)

Version 0.7.5 2007-05-16

Code:
- improved WordNet and WordNet-Similarity interface
- the Lancaster Stemmer (contributed by Steven Tomcavage)
Corpora:
- Web text samples
- BioCreAtIvE-PPI - a corpus for protein-protein interactions
- Switchboard Telephone Speech Corpus Sample (via Talkbank)
- CMU Problem Reports Corpus sample
- CONLL2002 POS+NER data
- Patient Information Leaflet corpus
- WordNet 3.0 data files
- English wordlists: basic English, frequent words
Tutorials:
- more improvements to text and images

Version 0.7.4 2007-05-01

Code:
- Indian POS tagged corpus reader: corpora.indian
- Sinica Treebank corpus reader: corpora.sinica_treebank
- new web corpus reader corpora.web
- tag package now supports pickling
- added function to utilities.py to guess character encoding
Corpora:
- Rotokas texts from Stuart Robinson
- POS-tagged corpora for several Indian languages (Bangla, Hindi, Marathi, Telugu) from A Kumaran
Tutorials:
- Substantial work on Part II of book on structured programming, parsing and grammar
- More bibliographic citations
- Improvements in typesetting, cross references
- Resized images and tables for better use of page space
- Moved project list to wiki
Contrib:
- validation of toolbox entries using chunking
- improved classifiers
Distribution:
- updated for Python 2.5.1, Numpy 1.0.2

Version 0.7.3 2007-04-02

* Code:
 - made chunk.Regexp.parse() more flexible about its input
 - developed new syntax for PCFG grammars, e.g. A -> B C [0.3] | D [0.7]
 - fixed CFG parser to support grammars with slash categories
 - moved beta classify package from main NLTK to contrib
 - Brill taggers now load correctly
 - misc bugfixes
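
The PCFG syntax shown in the list above, A -> B C [0.3] | D [0.7], attaches a bracketed probability to each alternative of a rule. The following sketch reads one such line with the standard re module. The function name, the regular expression and the check that probabilities sum to one are assumptions for this illustration; NLTK's own grammar reader is not reproduced here.

```python
import re

# One alternative: right-hand-side symbols followed by a bracketed probability.
ALTERNATIVE = re.compile(r"^(?P<rhs>.*?)\s*\[(?P<prob>[0-9.]+)\]$")


def parse_pcfg_rule(line):
    """Split 'A -> B C [0.3] | D [0.7]' into (lhs, rhs_tuple, probability) triples."""
    lhs, _, rest = line.partition("->")
    productions = []
    for alternative in rest.split("|"):
        match = ALTERNATIVE.match(alternative.strip())
        if match is None:
            raise ValueError(f"missing probability in {alternative!r}")
        rhs = tuple(match.group("rhs").split())
        productions.append((lhs.strip(), rhs, float(match.group("prob"))))
    # In a PCFG the alternatives for one left-hand side should sum to 1.
    total = sum(prob for _, _, prob in productions)
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"probabilities for {lhs.strip()} sum to {total}, not 1")
    return productions


rules = parse_pcfg_rule("A -> B C [0.3] | D [0.7]")
print(rules)
```

Each alternative becomes its own production, so the example line yields two triples that share the left-hand side A.
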
* Corpora:
 - Shakespeare XML corpus sample and corpus reader
* Tutorials:
 - improvements to prose, exercises, plots, images
 - expanded and reorganized tutorial on structured programming
 - formatting improvements for Python listings
 - improved plots (using pylab)
 - categorization of problems by difficulty
* Contrib:
 - more work on kimmo lexicon and grammar
 - more work on classifiers

Version 0.7.2 2007-03-01

* Code:
 - simple feature detectors (detect module)
 - fixed problem when token generators are passed to a parser (parse package)
 - fixed bug in Grammar.productions() (identified by Lucas Champollion and Mitch Marcus)
 - fixed import bug in category.GrammarFile.earley_parser
 - added utilities.OrderedDict
 - initial port of old NLTK classifier package (by Sam Huston)
 - UDHR corpus reader
* Corpora:
 - added UDHR corpus (Universal Declaration of Human Rights)
     with 10k text samples in 300+ languages
* Tutorials:
 - improved images
 - improved book formatting, including new support for:
   - javascript to copy program examples to clipboard in HTML version,
   - bibliography, chapter cross-references, colorization, index, table-of-contents

* Contrib:
  - new Kimmo system: contrib.mit.six863.kimmo (Rob Speer)
  - fixes for: contrib.fsa (Rob Speer)
  - demonstration of text classifiers trained on UDHR corpus for
      language identification: contrib.langid (Sam Huston)
  - new Lambek calculus system: contrib.lambek
  - new tree implementation based on elementtree: contrib.tree

Version 0.7.1 2007-01-14

* Code:
  - bugfixes (HMM, WordNet)

Version 0.7 2006-12-22

* Code:
  - bugfixes, including fixed bug in Brown corpus reader
  - cleaned up wordnet 2.1 interface code and similarity measures
  - support for full Penn treebank format contributed by Yoav Goldberg
* Tutorials:
  - expanded tutorials on advanced parsing and structured programming
  - checked all doctest code
  - improved images for chart parsing

Version 0.7b1 2006-12-06

* Code:
  - expanded semantic interpretation package
  - new high-level chunking interface, with cascaded chunking
  - split chunking code into new chunk package
  - updated wordnet package to support version 2.1 of WordNet
  - prototyped basic wordnet similarity measures
    (path distance, Wu + Palmer and Leacock + Chodorow, Resnik similarity measures)
  - bugfixes (tag.Window, tag.ngram)
  - more doctests
* Contrib:
  - toolbox language settings module
* Tutorials:
  - rewrite of chunking chapter, switched from Treebank to CoNLL format as main focus,
    simplified evaluation framework, added ngram chunking section
  - substantial updates throughout (esp programming and semantics chapters)
* Corpora:
  - Chat-80 Prolog data files provided as corpora, plus corpus reader
1404

Version 0.7a2 2006-11-13

* Code:
  - more doctests
  - code to read Chat-80 data
  - HMM bugfix
* Tutorials:
  - continued updates and polishing
* Corpora:
  - toolbox MDF sample data

Version 0.7a1 2006-10-29

* Code:
  - new toolbox module (Greg Aumann)
  - new semantics package (Ewan Klein)
  - bugfixes
* Tutorials:
  - substantial revision, especially in preface, introduction, words,
    and semantics chapters

Version 0.6.6 2006-10-06

* Code:
  - bugfixes (probability, shoebox, draw)
* Contrib:
  - new work on shoebox package (Stuart Robinson)
* Tutorials:
  - continued expansion and revision, especially of the introduction to
    programming, advanced programming, and feature-based grammar chapters

Version 0.6.5 2006-07-09

* Code:
  - improvements to shoebox module (Stuart Robinson, Greg Aumann)
  - incorporated feature-based parsing into core NLTK-Lite
  - corpus reader for Sinica treebank sample
  - new stemmer package
* Contrib:
  - hole semantics implementation (Peter Wang)
  - incorporation of YAML
  - new work on feature structures, unification, lambda calculus
  - new work on shoebox package (Stuart Robinson, Greg Aumann)
* Corpora:
  - Sinica treebank sample
* Tutorials:
  - expanded discussion throughout, including: left-recursion, trees, grammars,
    feature-based grammar, agreement, unification, PCFGs,
    baseline performance, exercises, improved display of trees

Version 0.6.4 2006-04-20

* Code:
  - corpus readers for Senseval 2 and TIMIT
  - clusterer (ported from old NLTK)
  - support for cascaded chunkers
  - bugfix suggested by Brent Payne
  - new SortedDict class for regression testing
* Contrib:
  - CombinedTagger tagger and marshalling taggers, contributed by Tiago Tresoldi
* Corpora:
  - new: Senseval 2, TIMIT sample
* Tutorials:
  - major revisions to programming, words, tagging, chunking, and parsing tutorials
  - many new exercises
  - formatting improvements, including colorized program examples
  - fixed problem with testing on training data, reported by Jason Baldridge

Version 0.6.3 2006-03-09

* switch to new-style classes
* repair FSA model sufficiently for Kimmo module to work
* port of MIT Kimmo morphological analyzer; still needs lots of code clean-up and inline docs
* expanded support for shoebox format, developed with Stuart Robinson
* fixed bug in indexing CFG productions, for empty right-hand sides
* efficiency improvements, suggested by Martin Ranang
* replaced classeq with isinstance, for efficiency improvement, as suggested by Martin Ranang
* bugfixes in chunk eval
* simplified call to draw_trees
* names, stopwords corpora

Version 0.6.2 2006-01-29

* Peter Spiller's concordancer
* Will Hardy's implementation of Penton's paradigm visualization system
* corpus readers for presidential speeches
* removed NLTK dependency
* generalized CFG terminals to permit full range of characters
* used fully qualified names in demo code, for portability
* bugfixes from Yoav Goldberg, Eduardo Pereira Habkost
* fixed obscure quoting bug in tree displays and conversions
* simplified demo code, fixed import bug