.\" Hey, EMACS: -*- nroff -*-
.TH RULEXDB_OPEN 3 "February 19, 2012"
rulexdb_open \- open or create a rulex database
.BI "RULEXDB *rulexdb_open(const char *" path ", int " mode );
function opens the rulex database in the file whose name is the string
and allocates and initializes all necessary internal data structures
specifies a database access mode. It may accept one of the following
Open the database only for searching (read only mode).
Open an existing database for searching and updating (read and write
Create a new database and open it for updating and searching.
.SH "DATABASE STRUCTURE"
The rulex database consists of two dictionaries and four sets
of rules. The \fBExplicit\fP dictionary contains words that
are described individually and imply nothing about
other forms. This dictionary is looked up first if the search
includes this stage. The \fBImplicit\fP dictionary contains
words in some basic form. It is used to construct the
pronunciation string for various forms of these words. The basic
form of a word is guessed according to the rules from the
\fBClassifiers\fP and \fBPrefix detectors\fP rulesets. This is the
second stage of the search process. If these stages yield no result
or are not performed, the rules from the \fBGeneral\fP ruleset are used
to guess the stress position in the word. If none of these rules applies,
no guess is made and the search fails.
.PP
Externally all the data are represented textually. For the Russian
letters the \fBkoi8\-r\fP character set is used and only lower case
.PP
Each dictionary record consists of two fields. The first field
contains the Russian word that serves as the search key. Only
lowercase Russian letters are allowed here. The second field provides
the pronunciation string for this word. The pronunciation string
is the word itself, spelled the way it should
be pronounced. Three additional symbols are allowed
in the pronunciation string along with the lowercase
Russian letters. The "+" sign marks the stressed
letter and should be placed immediately after that letter. The "=" sign
is used in the same manner in some cases to mark so-called
weak stress. The "-" sign can serve as a separator in some complex
words. All other symbols are illegal.
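.PP
The record format above can be checked mechanically. The following is a
minimal sketch, not part of the rulex API: the helper names
is_koi8r_lower and pronunciation_is_valid are hypothetical, and the check
that "+" and "=" must directly follow a letter is an assumption drawn
from the description above. It relies on the KOI8-R layout, where
lowercase Russian letters occupy bytes 0xC0 through 0xDF plus 0xA3.
.nf
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: nonzero if byte c is a lowercase Russian letter
 * in KOI8-R (0xC0..0xDF, plus 0xA3 for the letter "yo"). */
static int is_koi8r_lower(unsigned char c)
{
    return (c >= 0xC0 && c <= 0xDF) || c == 0xA3;
}

/* Hypothetical helper: nonzero if s looks like a legal pronunciation
 * string: non-empty, built only from lowercase KOI8-R letters and the
 * signs "+", "=" and "-", with every "+" or "=" placed directly after
 * a letter (an assumption, see the lead-in). */
static int pronunciation_is_valid(const char *s)
{
    size_t i;

    assert(s != NULL);
    if (s[0] == 0)
        return 0;
    for (i = 0; s[i] != 0; i++) {
        unsigned char c = (unsigned char)s[i];

        if (is_koi8r_lower(c) || c == '-')
            continue;
        if (c == '+' || c == '=') {
            /* A stress mark needs a letter immediately before it. */
            if (i == 0 || !is_koi8r_lower((unsigned char)s[i - 1]))
                return 0;
            continue;
        }
        return 0; /* any other symbol is illegal */
    }
    return 1;
}
```
.fi
.PP
A caller could run such a check on the second field of a record before
storing it. The same letter test alone would serve for the first field.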
.PP
There are four rulesets in the database: \fBGeneral\fP rules,
\fBClassifiers\fP, \fBPrefix detectors\fP and
\fBCorrectors\fP. Externally all these rules are represented by
records consisting of one or two fields. The first field always
contains a regular expression that is matched against the word to
decide whether the rule can be applied.
.PP
The only task of the \fBGeneral\fP rules is to guess the stress
in a word when dictionary lookup fails. The rules are tried
sequentially until one matches or the list is exhausted. If a match succeeds,
the "+" sign is inserted into the word right after the first
subexpression match to mark the stress position.
These rules do not contain a second field.
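.PP
The behaviour of the General rules can be illustrated with a short
sketch. It is not the library's implementation: apply_general_rule and
guess_stress are hypothetical names, POSIX extended regular expressions
stand in for whatever regex flavour rulex actually uses, and ASCII words
stand in for koi8-r text.
.nf
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: try one rule pattern against word. On a match,
 * copy word to out with "+" inserted right after the end of the first
 * subexpression. Returns 1 on success, 0 if the rule does not apply,
 * -1 on a bad pattern or a too small buffer. */
static int apply_general_rule(const char *pattern, const char *word,
                              char *out, size_t outsize)
{
    regex_t re;
    regmatch_t m[2];
    size_t len, pos;
    int rc;

    assert(pattern != NULL && word != NULL && out != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, word, 2, m, 0);
    regfree(&re);
    if (rc != 0 || m[1].rm_eo < 0)
        return 0;
    len = strlen(word);
    if (len + 2 > outsize)
        return -1;
    pos = (size_t)m[1].rm_eo;
    memcpy(out, word, pos);
    out[pos] = '+';
    /* Copy the tail together with its terminating zero byte. */
    memcpy(out + pos + 1, word + pos, len - pos + 1);
    return 1;
}

/* Hypothetical helper: try the rules in order until one matches or
 * the list is exhausted. */
static int guess_stress(const char *const *rules, size_t nrules,
                        const char *word, char *out, size_t outsize)
{
    size_t i;

    for (i = 0; i < nrules; i++) {
        int rc = apply_general_rule(rules[i], word, out, outsize);

        if (rc != 0)
            return rc;
    }
    return 0;
}
```
.fi
.PP
With the made-up rule "^([^aeiou]*[aeiou])" the word "moloko" becomes
"mo+loko", because the first subexpression ends after the first vowel.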
.PP
The rules of the \fBClassifiers\fP ruleset are checked one by one
until a match occurs. Then the part from the beginning of the word
through the end of the first subexpression match is extracted,
and if a second field is present, it is appended to the extracted
part as a suffix. The resulting string is treated as the basic form
of the word and is looked up in the \fBImplicit\fP dictionary.
If nothing is found, the process continues
until the ruleset is exhausted.
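.PP
How a single classifier rule derives a basic form can be sketched as
follows. This is an illustration under assumptions, not the library
code: classify_base_form is a hypothetical name, POSIX extended regular
expressions are used as a stand-in, and the examples are ASCII rather
than koi8-r.
.nf
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: apply one classifier rule to word. On a match,
 * out receives the part of word from its beginning through the end of
 * the first subexpression, followed by suffix if one is given (suffix
 * may be NULL for a rule without a second field). Returns 1 on
 * success, 0 if the rule does not apply, -1 on a bad pattern or a too
 * small buffer. */
static int classify_base_form(const char *pattern, const char *suffix,
                              const char *word, char *out, size_t outsize)
{
    regex_t re;
    regmatch_t m[2];
    size_t stem, slen;
    int rc;

    assert(pattern != NULL && word != NULL && out != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, word, 2, m, 0);
    regfree(&re);
    if (rc != 0 || m[1].rm_eo < 0)
        return 0;
    stem = (size_t)m[1].rm_eo;
    slen = suffix != NULL ? strlen(suffix) : 0;
    if (stem + slen + 1 > outsize)
        return -1;
    memcpy(out, word, stem);
    if (slen > 0)
        memcpy(out + stem, suffix, slen);
    out[stem + slen] = 0;
    return 1;
}
```
.fi
.PP
A caller would look the result up in the Implicit dictionary and, if
nothing is found, move on to the next rule. Note that the extracted part
always starts at the beginning of the word, even when the first
subexpression itself starts later.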
.PP
When nothing is found in the database for a word in its original form,
the \fBPrefix detectors\fP rules are applied to it sequentially until
a match occurs. The matched prefix is stripped and replaced by the
replacement string, if any. The resulting word is then looked up in the
\fBImplicit\fP dictionary. On success the original prefix
is restored in the pronunciation string.
.PP
The rules from the \fBCorrectors\fP ruleset are applied
to pronunciation strings rather than to the original words.
The second field in these rules specifies a
replacement string in which digits serve as subexpression numbers.
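.PP
One possible reading of the corrector replacement syntax is sketched
below. Several points are assumptions rather than documented behaviour:
that a bare digit in the replacement stands for the subexpression with
that number (with 0 meaning the whole match), that only the first match
is rewritten, and that POSIX extended regular expressions are an
adequate stand-in. The names append_bytes and apply_corrector are
hypothetical.
.nf
```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: append len bytes of src to out at offset *n,
 * always leaving room for a terminating zero byte. */
static int append_bytes(char *out, size_t outsize, size_t *n,
                        const char *src, size_t len)
{
    if (*n + len + 1 > outsize)
        return -1;
    memcpy(out + *n, src, len);
    *n += len;
    return 0;
}

/* Hypothetical helper: rewrite the first match of pattern in pron.
 * Digits in repl are replaced by the text of the subexpression with
 * that number; all other characters are copied literally. Returns 1
 * on success, 0 if the rule does not apply, -1 on error. */
static int apply_corrector(const char *pattern, const char *repl,
                           const char *pron, char *out, size_t outsize)
{
    regex_t re;
    regmatch_t m[10];
    size_t n = 0, i;
    int rc;

    assert(pattern != NULL && repl != NULL && pron != NULL && out != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, pron, 10, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;
    /* Text before the match is kept unchanged. */
    if (append_bytes(out, outsize, &n, pron, (size_t)m[0].rm_so) != 0)
        return -1;
    for (i = 0; repl[i] != 0; i++) {
        if (repl[i] >= '0' && repl[i] <= '9') {
            const regmatch_t *g = &m[repl[i] - '0'];

            /* Unset subexpressions expand to nothing. */
            if (g->rm_so >= 0 &&
                append_bytes(out, outsize, &n, pron + g->rm_so,
                             (size_t)(g->rm_eo - g->rm_so)) != 0)
                return -1;
        } else if (append_bytes(out, outsize, &n, repl + i, 1) != 0) {
            return -1;
        }
    }
    /* Text after the match is kept unchanged as well. */
    if (append_bytes(out, outsize, &n, pron + m[0].rm_eo,
                     strlen(pron + m[0].rm_eo)) != 0)
        return -1;
    out[n] = 0;
    return 1;
}
```
.fi
.PP
Under these assumptions the made-up rule "(o)=" with replacement "1+"
turns the pronunciation string "mo=loko" into "mo+loko".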
Upon successful completion
pointer that should be used in other database access functions for
referencing the database.
Otherwise, NULL is returned.
.BR rulexdb_classify (3),
.BR rulexdb_close (3),
.BR rulexdb_dataset_name (3),
.BR rulexdb_discard_dictionary (3),
.BR rulexdb_discard_ruleset (3),
.BR rulexdb_fetch_rule (3),
.BR rulexdb_lexbase (3),
.BR rulexdb_load_ruleset (3),
.BR rulexdb_remove_item (3),
.BR rulexdb_remove_rule (3),
.BR rulexdb_remove_this_item (3),
.BR rulexdb_retrieve_item (3),
.BR rulexdb_search (3),
.BR rulexdb_subscribe_item (3),
.BR rulexdb_subscribe_rule (3)
Igor B. Poretsky <poretsky@mlbox.ru>.