.\"                                      Hey, EMACS: -*- nroff -*-
.TH LEXHOLDER\-RU 1 "October 28, 2006"
.SH NAME
lexholder\-ru \- rulex database holding utility
.SH SYNOPSIS
.B lexholder\-ru
[\fIoptions\fR] <\fIdb_path\fR>
.SH DESCRIPTION
\fBlexholder\-ru\fP is a small utility intended for use from the
command line or shell scripts. It allows one to construct, test,
manage and query a lexical database, as well as to extract its content
in textual form.
.PP
This database is primarily intended for use with the Russian
TTS engine \fBru_tts\fP to provide stress and pronunciation
information for Russian words.
.PP
When filling and updating the database,
new records are read from the standard input.
When extracting data from the database,
the result is printed to the standard output.
This behaviour can be changed by the \fB\-f\fP switch.
.SH OPTIONS
All options recognized on the command line are described below.
For convenience they are arranged into several groups
by functionality.
.PP
The first group consists of options specifying an action to be performed.
These options are mutually exclusive: only one action can be performed
per invocation. If no action is specified, the program reads
its standard input (or a file specified by the \fB\-f\fP option)
and stores its content in the database. Here are the other actions:
.TP
.B \-h
.br
Print a summary of options and exit. When this option is given, all other
command line specifications are ignored. It is the only case when the database
path is not required.
.TP
.B \-l
.br
List the database content in textual form. This action requires
the dataset to be specified explicitly by one of the \fB\-X\fP,
\fB\-M\fP, \fB\-G\fP, \fB\-L\fP, \fB\-P\fP or \fB\-C\fP options.
.TP
.B \-s <key>
.br
Search for the specified key in the lexical database. If the word is found,
the program prints its pronunciation string and exits successfully;
otherwise it prints the lowercased original word and exits
with a non-zero exit code. This action is affected by
the search mode options described below.
If the \fB\-q\fP switch is specified on the command line, nothing
is printed to the standard output, but the exit code can still
be used to find out whether the word was found.
.TP
.B \-b <key>
.br
Treat the specified word as an implicit form and discover basic forms
(if any) that could be used in the \fBImplicit\fP dictionary.
If quiet mode is not in use, all possible basic forms
for the word are printed to the standard output
(or to the file specified by the \fB\-f\fP option)
along with the numbers of the corresponding \fBClassifiers\fP.
The program exits successfully if it can suggest some basic forms
for the specified word and returns a non-zero exit code otherwise.
In quiet mode nothing is printed to the standard output,
but the exit code can still be used to determine
the result of the operation.
.TP
.B \-t <dictionary_file>
.br
Test the database against the specified dictionary. The test dictionary file
is read line by line. Each line is treated as a record consisting
of two fields separated by a space. The first field is
a key word and the second one gives its pronunciation string.
If this pronunciation string differs from the one obtained from
the database, the record is printed to the standard output
or written to the file specified by the \fB\-f\fP option. Specifying
"\-" as the test dictionary file name causes the test records
to be read from the standard input. This action is affected
by the search mode options described below.
.TP
.B \-d <key>
.br
Delete the record for the specified key. This action requires the dataset
to be specified explicitly by one of the \fB\-X\fP, \fB\-M\fP,
\fB\-G\fP, \fB\-L\fP, \fB\-P\fP or \fB\-C\fP options. For rules, the rule
number in the ruleset is used as the key.
.TP
.B \-D
.br
Discard the dataset. The dataset must be chosen by one of the
\fB\-X\fP, \fB\-M\fP, \fB\-G\fP, \fB\-L\fP, \fB\-P\fP or \fB\-C\fP
options.
.TP
.B \-c
.br
Clean the database by removing redundant entries from the dictionaries. By
default all records that certainly do not affect any search result are
removed. These are the entries of the \fBImplicit\fP dictionary that do
not represent any lexical base and the entries of the \fBExplicit\fP
dictionary that merely duplicate the result of the usual lookup process.
If one of the \fB\-X\fP or \fB\-M\fP options is specified as well,
then only the chosen dictionary is cleaned. If the
\fBImplicit\fP dictionary is chosen, an extensive cleanup is
performed on it, which can drop some useful records. Be careful.
.PP
The next group of options is responsible for choosing the dataset.
These options are mutually exclusive and affect deletion, insertion
and listing operations. For listing and deletion the dataset must be
specified explicitly. If none of these options is given when
inserting new data, an appropriate dataset is chosen according
to the input data. Only lexical data can be inserted in this manner.
For rules the target dataset must be specified explicitly.
.TP
.B \-X
.br
Explicit dictionary.
.TP
.B \-M
.br
Implicit dictionary.
.TP
.B \-G
.br
General rules.
.TP
.B \-L
.br
Lexical classification rules.
.TP
.B \-P
.br
Prefix detection rules.
.TP
.B \-C
.br
Correction rules.
.PP
The next group contains options that specify the search mode.
These options affect the search and test operations. By default (no options)
a full search is performed; otherwise only the stages specified
explicitly are included in the search process.
.TP
.B \-x
.br
Search in the explicit dictionary.
.TP
.B \-m
.br
Try to treat the word as an implicit form.
.TP
.B \-g
.br
Try to apply general rules.
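.PP
As an illustration only, a search restricted to chosen stages could be
driven from a script. The following Python sketch is not part of rulex.
The helper names and the database path used with them are hypothetical,
and decoding the output as koi8\-r is an assumption. The sketch only
combines the \fB\-s\fP action with the \fB\-x\fP, \fB\-m\fP and
\fB\-g\fP switches described above.
.PP
.nf
```python
import subprocess


def build_search_command(word, db_path, stages=""):
    # stages is a string of stage letters, e.g. "xm" for -x -m;
    # an empty string leaves the default full search in effect.
    cmd = ["lexholder-ru"]
    for flag in stages:
        if flag not in ("x", "m", "g"):
            raise ValueError("unknown search stage: " + flag)
        cmd.append("-" + flag)
    cmd += ["-s", word, db_path]
    return cmd


def lookup(word, db_path, stages=""):
    # Assumption: the program writes koi8-r text to its standard output.
    # How the key must be encoded on the command line is not handled here.
    result = subprocess.run(build_search_command(word, db_path, stages),
                            capture_output=True)
    text = result.stdout.decode("koi8-r").strip()
    # Exit code 0 means the word was found and text is its pronunciation.
    return result.returncode == 0, text
```
.fi
.PP
With these helpers a call such as lookup(word, path, "x") would consult
the explicit dictionary only, and the first element of the returned pair
reflects the exit code.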
.PP
The next group contains only one option, which affects the insertion
of new data into the lexical database.
.TP
.B \-r
.br
Replace mode. For a dictionary, this mode causes new records
to replace existing ones with the same key. By default such records
are ignored. For rules, this mode means that the ruleset content
is fully replaced by the new data. Otherwise new rules
are appended to the ruleset.
.PP
The last group contains several options affecting program behaviour
in general.
.TP
.B \-f <file>
.br
Use the specified file instead of standard input or output.
.TP
.B \-q
.br
Be more quiet than usual: do not print search results or
warnings about duplicate records.
.TP
.B \-v
.br
Be more verbose than usual: print messages about work stages
and final statistical information when finishing.
.SH DATA REPRESENTATION
Externally all data are represented textually. For Russian
letters the \fBkoi8\-r\fP character set is used and only lower case
is allowed.
.PP
The database itself consists of two dictionaries and four sets
of rules. The \fBExplicit\fP dictionary contains words that
are described individually and do not imply any information about
other forms. This dictionary is looked up first if the search
includes this stage. The \fBImplicit\fP dictionary contains
words in some basic form. This dictionary is used to construct
pronunciation strings for various forms of these words. The basic
form of a word is guessed according to the rules from the
\fBClassifiers\fP and \fBPrefix detectors\fP rulesets. This is the
second stage of the search process. If these stages do not yield a result
or are not performed, the rules from the \fBGeneral\fP ruleset are used
to guess the stress position in the word. If none of these rules can be
applied, then no guess is made and the search fails. By default, all three
stages are performed, but it can be specified explicitly which ones
should be taken into account.
.PP
Externally dictionary data are represented by text lines
consisting of two fields separated by a space. The first field is
a Russian word. It serves as the key when searching. Only lowercase
Russian letters are allowed here. The second field provides
the pronunciation string for this word. The pronunciation string
is the word itself, written the way it should
be pronounced. Three additional symbols are allowed
in the pronunciation string along with the lowercase
Russian letters. The "+" sign marks the stressed
letter. It should be placed immediately after that letter. The "=" sign
is used in some cases in the same manner to mark so-called
weak stress. The "-" sign can serve as a separator in some complex
words. All other symbols are treated as illegal.
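.PP
The record format just described can be checked mechanically. The
Python sketch below is only an illustration and not the validation
code of the utility itself. The function name is hypothetical, and it
assumes that the letter "ё" counts as a lowercase Russian letter and
that exactly one space separates the two fields.
.PP
.nf
```python
RUSSIAN_LOWER = set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")
STRESS_MARKS = set("+=-")


def parse_record(line):
    # Split a dictionary line into (key, pronunciation) or raise ValueError.
    fields = line.rstrip("\n").split(" ")
    if len(fields) != 2:
        raise ValueError("expected two fields separated by a space")
    key, pron = fields
    if not key or not set(key) <= RUSSIAN_LOWER:
        raise ValueError("key must consist of lowercase Russian letters")
    if not pron or not set(pron) <= (RUSSIAN_LOWER | STRESS_MARKS):
        raise ValueError("illegal symbol in pronunciation string")
    return key, pron
```
.fi
.PP
For instance, the line "молоко молоко+" is accepted, while a key
containing an uppercase letter is rejected.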
.PP
There are four rulesets in the database: \fBGeneral\fP rules,
\fBClassifiers\fP, \fBPrefix detectors\fP and
\fBCorrectors\fP. Externally all these rules are represented by
strings consisting of one or two fields separated by a space. The first
field always contains a regular expression that is matched against
the word to decide whether the rule can be applied.
.PP
The only task of the \fBGeneral\fP rules is to guess the stress
in a word when dictionary lookup fails. The rules are tried
sequentially until one matches or the list is exhausted. If a match succeeds,
the "+" sign is inserted into the word right after the first
subexpression match to mark the stress position.
These rules do not contain a second field.
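.PP
The way a general rule places the stress mark can be sketched as
follows. This is an illustration under stated assumptions, not the
rulex implementation: Python's re module stands in for whatever regular
expression syntax the database actually uses, and both the function
name and the sample rules are hypothetical.
.PP
.nf
```python
import re


def apply_general_rules(word, rules):
    # Try each rule in order; return the stressed word or None.
    for pattern in rules:
        m = re.search(pattern, word)
        if m is None or m.re.groups < 1 or m.end(1) < 0:
            continue  # no match, or no first subexpression to use
        pos = m.end(1)  # position right after the first subexpression
        return word[:pos] + "+" + word[pos:]
    return None  # list exhausted, no guess is made


# Hypothetical rules: the first does not match, the second does.
sample_rules = ["^(.*о)сть$", "^(.*а)ция$"]
stressed = apply_general_rules("информация", sample_rules)
```
.fi
.PP
Here stressed holds "информа+ция", with the "+" sign placed right after
the text matched by the first subexpression.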
.PP
For the \fBClassifiers\fP ruleset the rules are checked one by one
until a match occurs. Then the part from the beginning of the word
through the end of the first subexpression match is extracted,
and if a second field is present, it is appended to the extracted
part as a suffix. The resulting string is treated as a basic form
of the word, so it is looked up in the \fBImplicit\fP dictionary.
If nothing is found, the process continues
until the ruleset is exhausted.
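.PP
The classifier pass can be sketched in the same spirit. The Python code
below is an illustration only. It assumes that Python's re module
approximates the regular expression syntax in use, that rules are
numbered from 1, and that the \fBImplicit\fP dictionary can be modelled
as a plain mapping. The function name and the sample data are
hypothetical. How the pronunciation of the derived form is then built
from the basic form is not shown.
.PP
.nf
```python
import re


def find_basic_form(word, classifiers, implicit):
    # classifiers: list of (pattern, suffix or None) pairs.
    # Returns (rule number, basic form, its pronunciation) or None.
    for number, (pattern, suffix) in enumerate(classifiers, start=1):
        m = re.search(pattern, word)
        if m is None or m.re.groups < 1 or m.end(1) < 0:
            continue
        # Part from the start of the word through the first subexpression.
        base = word[:m.end(1)]
        if suffix:
            base += suffix
        if base in implicit:
            return number, base, implicit[base]
        # Not in the dictionary: keep trying the remaining rules.
    return None


# Hypothetical data: the first rule yields a form that is not listed.
sample_classifiers = [("^(.+)и$", "я"), ("^(.+)и$", "а")]
sample_implicit = {"книга": "кни+га"}
found = find_basic_form("книги", sample_classifiers, sample_implicit)
```
.fi
.PP
With this sample data the first rule produces "книгя", which is not in
the dictionary, so the second rule is tried and found becomes
(2, "книга", "кни+га").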
.PP
When nothing is found in the database for a word in its original form,
\fBPrefix detection\fP rules are applied to it sequentially until
a match occurs. The matched prefix is stripped and replaced by the
replacement string, if any. Then the resulting word is looked up in the
\fBImplicit\fP dictionary. On success the original prefix
is restored in the pronunciation string.
.PP
The rules from the \fBCorrectors\fP ruleset are applied
to the pronunciation strings instead of the original words.
The second field in these rules specifies a
replacement string in which digits serve as subexpression numbers.
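.PP
One possible reading of such a replacement string is sketched below in
Python. This is an assumption\-laden illustration, not the rulex
implementation: it supposes that every digit in the second field is
replaced by the text of the subexpression with that number, that only
the matched part of the pronunciation string is rewritten, and that
Python's re module approximates the regular expression syntax in use.
The function names and the sample rule are hypothetical.
.PP
.nf
```python
import re


def expand_replacement(match, template):
    # Replace each digit in template with the numbered subexpression.
    parts = []
    for ch in template:
        if ch in "0123456789":
            parts.append(match.group(int(ch)) or "")
        else:
            parts.append(ch)
    return "".join(parts)


def apply_corrector(pron, pattern, template):
    # Rewrite the first match of pattern in the pronunciation string.
    return re.sub(pattern,
                  lambda m: expand_replacement(m, template),
                  pron, count=1)


# Hypothetical rule: pattern "^(.*)тся$" with replacement "1ца".
corrected = apply_corrector("учи+тся", "^(.*)тся$", "1ца")
```
.fi
.PP
Under these assumptions corrected holds "учи+ца", and a pronunciation
string that does not match the pattern is returned unchanged.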
.SH SEE ALSO
.BR ru_tts (1), /usr/share/doc/rulex/README.
.SH AUTHOR
Igor B. Poretsky <poretsky@mlbox.ru>.