Hunspell-fi - Kesäkoodi 2006 - blog
2006-10-06
This blog is not in use anymore! I will continue writing notes at
http://www.puimula.org/htp/notes.html.
2006-09-03
As promised, the final report of my project is now available as a
PDF and in the original
LyX format. I want to thank all of my sponsors and especially
those who worked with me for free during this summer! I have not yet decided how I will continue to
write about my progress, but there will be something to replace this blog. It will not however be
a proper blog (with comments enabled) as I want to keep the discussions on our mailing list.
2006-08-27
Results from this week
- Spent a day trying to port Voikko to Mac OS X. The work was done on an Intel Mac
(OS X 10.4). There were only small issues that needed to be fixed in Malaga
(missing #includes, did not write a patch because Malaga has no public development tree
and the bugs were in files for which I already have sent patches after the last release) and
libvoikko (fixed in revision
389).
After these, it was possible to get spellchecking working with the test programs supplied
with libvoikko. The OpenOffice.org integration component could not be compiled because the
OOo SDK seems to be totally broken on that platform. There were even trivial bugs like configuration
failing if the OOo installation path had spaces in it. And the OOo installation path has spaces
in it by default on OS X, which suggests that nobody has ever actually tested the SDK there.
- Redesigned the word search facility in Joukahainen. There are not that many search options available, but
the most important ones have been implemented. It is also possible to plug in different output
formatters, which can reside either in the common code or language packs. Two formatters have
been implemented: the ordinary list of search results on a html page and Finnish specific
format for use in Suomi-malaga.
Related SVN revisions:
391,
392,
396 and
397.
- Added a simple tool to help users find the correct inflection class for new words.
Related SVN revisions:
398 and
399.
- Released a new version of tmispell-voikko with some fixes to make the program easier to package
for Linux distributions and working Finnish message translations.
Related SVN revisions:
390,
404 and
405.
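The pluggable output formatters mentioned above could be sketched roughly as follows. This is only an illustration of the idea: the registry, the decorator and the formatter names are all hypothetical and are not taken from the Joukahainen source.

```python
# Hypothetical sketch of pluggable search result formatters.
# None of these names come from the actual Joukahainen code.
from html import escape

FORMATTERS = {}


def register_formatter(name):
    """Decorator that makes a formatter selectable by name."""
    def decorator(func):
        FORMATTERS[name] = func
        return func
    return decorator


@register_formatter("html")
def format_html(words):
    # Formatter living in the common code: an ordinary HTML list.
    items = "".join(f"<li>{escape(w)}</li>" for w in words)
    return f"<ul>{items}</ul>"


@register_formatter("plain")
def format_plain(words):
    # Stand-in for a formatter supplied by a language pack.
    return "\n".join(words)


def render_results(words, formatter_name="html"):
    """Format search results with the formatter registered under the given name."""
    return FORMATTERS[formatter_name](words)


print(render_results(["kissa", "koira"], "plain"))
```

A language pack would then only need to register its own function under a new name, without touching the search code itself.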
Plan for week ending 2006-09-03
- Write installation instructions for Joukahainen.
- Write the final report.
Next week will be my last on Kesäkoodi. The week will be spent mainly on writing documentation
and the final report, so I do not expect to be writing any more of these blog entries. A quick look at
the server statistics shows that this page has been relatively popular; apparently some of you have
actually found my writings interesting enough to read them regularly. I have to consider whether continuing
some sort of blog might make sense even after I am no longer required to do so.
Obviously it would not be very similar to this one: here I have intentionally written in the style
of semi-formal status reports, which I have assumed would be most useful for the target audience.
I still think that it makes sense for me to keep people informed of what I am doing if I continue
to work on matters of general interest. Mailing lists are good, but not optimal if I
choose to work on things that are not directly related to the subjects of that list.
I will think about this, and if I decide to continue blogging in some form or another I will add a link
to the blog at the top of this page. In any case I will publish my final report here as well.
2006-08-20
Results from this week
There are no references to individual SVN commits this week, because there
have been so many of them. If you are interested, my source code related
commits from the past week (there seem to have been 31 of them) are listed
in trunk commit log.
This does not include any web site related commits.
- Joukahainen is now almost fully translatable. The only exception is word
searching functionality, which I am planning to totally rewrite next week.
All language dependent files have been moved under new subdirectory langpacks.
This should make it quite easy to add support for new languages. I thus consider
this part of my project to be completed (apart from some documentation, which I
will be writing during the last week of Kesäkoodi.)
- User management got mostly implemented. Users can be added and they can change their
passwords through the www interface. Editing other parts of user data and deleting
users still have to be done directly with the database administrator tools. These will
of course be fixed, but I consider these to be low priority items at the moment.
- Added some of the most important missing verb classes to the database. The words with
no inflection data can now be imported as well. Installed the full database (22662 words,
of which 19765 have usable inflection data) at joukahainen.lokalisointi.org, created
some user accounts and posted instructions on how to use the new system to proofread
the included verbs.
- Worked on Suomi-malaga: merged some changes from Hannu and tried to clean up the code in
suomi.mor. Managed to remove some no longer needed checks and improved the analysis speed
by almost 5 %.
- Worked on libvoikko: small improvements to the suggestion code, added ability to accept
alternative Unicode sequences for certain characters as suggested by Teemu Likonen (who
also compiled the list of possible character replacements).
- Worked on libenchant plugin and tmispell: wrote a patch to add support for Voikko to
upstream version of Enchant. Filed this to
upstream bugzilla and
found out six hours later that Anssi Hannula had been working on exactly the same thing.
Because our patches were very similar, Anssi merged them and agreed to continue pushing
this to upstream. Hopefully this succeeds, because Enchant seems to be the sanest and
most actively developed of all currently available options for universal spell checker
backend.
- Worked on our web site to make it a bit more useful for the users of Voikko. Teemu has
continued to do most of the work here.
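As an illustration of what accepting alternative Unicode sequences can mean, here is a small sketch. It is not how libvoikko does it: libvoikko is written in C and uses the replacement list compiled by Teemu Likonen, which is not reproduced here. The sketch swaps in standard NFC normalisation from the Python standard library, and the function name and the optional replacement table are assumptions made for this example.

```python
# Illustrative only: standard Unicode normalisation instead of libvoikko's
# own replacement list, which is not reproduced here.
import unicodedata


def normalise_word(word, replacements=None):
    """Return the word in a canonical form before spell checking.

    replacements is an optional, hypothetical mapping of alternative
    character sequences to their preferred forms.
    """
    for alternative, preferred in (replacements or {}).items():
        word = word.replace(alternative, preferred)
    # NFC composes "a" + COMBINING DIAERESIS (U+0308) into a single "ä".
    return unicodedata.normalize("NFC", word)


decomposed = "ha\u0308n"   # "hän" typed with a combining diaeresis
print(normalise_word(decomposed) == "hän")
```

With this kind of preprocessing the vocabulary only needs to contain one form of each word.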
Plan for week ending 2006-08-27
- Do nothing on Monday: I have been working every day since the release of Voikko, so I have
to take a break now.
- Go through my project plan and finish anything that still is not properly implemented.
There does not seem to be any big items left.
2006-08-13
Results from this week
- Task support was finished in Joukahainen.
Related SVN revisions:
275,
278.
- Made Joukahainen translatable by adding gettext support. Only one module has
been translated by changing the original strings to English and moving Finnish
strings to fi.po, but I hope to be able to translate the rest of the modules next week.
Related SVN revisions:
290,
291.
- Small fixes to Suomi-malaga and libvoikko.
Related SVN revisions:
295,
301,
304.
- Released Voikko 1.0. This involved lots of changes to our website at
http://www.hunspell-fi.org, creating a new download
site at http://www.lemi.fi/voikko and fixes
to the download site of the old non-free spellchecker Oo2-soikko at
http://www.lemi.fi/oo2-soikko. Teemu Likonen
has helped a lot here.
Plan for week ending 2006-08-20
- Finish translation of Joukahainen to English.
- Implement user management and update the test installation. It would be nice to
use the test installation for real work and still be able to discard the modifications.
One possibility is to set up
a task for proofreading verb inflections. For the included verbs the inflection
data should be complete enough for that. If there is time, I should add a few more verb classes
to the database.
2006-08-06
Results from this week
- All components of Voikko 1.0 are now ready. The release will be next week as
planned. After the release we should try to merge the changes to Hannu's version
of Suomi-malaga. This will require some difficult decisions because some of the
differences between these versions are quite fundamental. Maybe we should only
share some of the files and keep separate versions of the rest. I am mainly thinking
about suomi.mor which seems to be more or less complete for use in indexing
applications but may need quite a lot of changes for our spellchecker application if
we want to develop new features or improve the performance.
Related SVN revisions:
256,
257,
258,
259,
260 and
262.
- Implemented Wiki links in Joukahainen and improved the word addition support
based on feedback from Kevin Scannell and Reijo Tomperi. This turned out to be
more challenging than I had first thought because there are quite a lot of error cases
to cover and it is hard to make that process intuitive and efficient at the same time.
Additionally there should be some way to add language specific improvements to the
workflow such as automatic classification suggestions. I would love to work on these
but maybe they would not be so useful for Finnish and this whole thing is getting
too much beyond my original project plan. Wrote a mail to the OpenOffice.org lingu-dev
mailing list to let people know of my progress. If some of the other language
teams get seriously interested in using Joukahainen it will be easier to
figure out what is the best way to proceed here.
Related SVN revisions:
266,
267,
268 and
271.
- Worked on the design of the "systematic evaluation of inflections" feature. I
decided to generalise this to a concept of "vocabulary wide review tasks". The idea
is that a task is defined as a SQL query returning a list of word identifiers that
should be processed in that task. When a user wants to work on a task, she will be
given a list of unprocessed words from that query. If a word needs to be changed, she
can either do the change herself or add a "needs review" flag with a comment stating
what seems to be wrong with the word. Checking the inflections for Finnish will be done
by seeing if the inflections displayed in the green box are correct or not. Implemented
the database schema changes, example tasks and a page that lists the available tasks and
their progress (words to be checked / total number of words within this task). See
SVN revision 273
- Updated the test installation at
joukahainen.lokalisointi.org
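The "vocabulary wide review tasks" idea described above can be sketched as follows. The table and column names are invented for this example and do not match the real Joukahainen schema, and sqlite3 is used instead of PostgreSQL only so that the sketch runs without a database server.

```python
# Hypothetical schema: a task is stored as an SQL query returning word ids.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word (wid INTEGER PRIMARY KEY, word TEXT, wclass TEXT);
CREATE TABLE task (tid INTEGER PRIMARY KEY, descr TEXT, query_sql TEXT);
CREATE TABLE task_word (tid INTEGER, wid INTEGER, PRIMARY KEY (tid, wid));
INSERT INTO word VALUES (1, 'juosta', 'verb'), (2, 'talo', 'noun'),
                        (3, 'syödä', 'verb');
INSERT INTO task VALUES (1, 'Check verb inflections',
    'SELECT wid FROM word WHERE wclass = ''verb'' ORDER BY wid');
""")


def task_word_ids(conn, tid):
    """Run the query stored for the task and return the word ids it selects."""
    (query,) = conn.execute(
        "SELECT query_sql FROM task WHERE tid = ?", (tid,)).fetchone()
    return [row[0] for row in conn.execute(query)]


def unprocessed(conn, tid):
    """Words of the task that nobody has checked yet."""
    done = {row[0] for row in conn.execute(
        "SELECT wid FROM task_word WHERE tid = ?", (tid,))}
    return [wid for wid in task_word_ids(conn, tid) if wid not in done]


def mark_checked(conn, tid, wid):
    """Record that a word has been processed within the task."""
    conn.execute("INSERT OR IGNORE INTO task_word VALUES (?, ?)", (tid, wid))


def progress(conn, tid):
    """Return (words to be checked, total number of words within the task)."""
    return len(unprocessed(conn, tid)), len(task_word_ids(conn, tid))


mark_checked(conn, 1, 1)
print(progress(conn, 1))
```

Storing the query itself means that a new kind of review task needs no new code, only a new row in the task table. The stored queries are run as they are, so only trusted administrators should be able to define them.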
Plan for week ending 2006-08-13
- Complete the task support.
- Release Voikko 1.0: choose the material to be copied to www.lemi.fi and write the
download page, announce the release.
2006-07-28
Results from this week
- Verbs are now supported within Joukahainen, and they can be imported from and
exported to Suomi-malaga.
Related SVN revisions:
216,
217,
218,
231,
252 and
254.
- It is now possible to add words from a list of (possibly incorrect) candidate
words (revision
249).
This list may have been generated for example by crawling the web with
a language recognising crawler.
Additionally, the initial word list may have been split into categories. In these
cases it is possible to add words either from a selected category or from the combined list.
(The final phase in the word adding process is currently broken; I will fix it next week.)
- Did lots of small things related to tmispell, Suomi-malaga and libvoikko. For example
all small fixes from Suomi-malaga 0.8 beta versions have been ported to our Voikko edition.
The plan is to merge these versions soon after we have released Voikko 1.0.
- Read the second discussion draft of GPL version 3 and the first discussion draft of
LGPL version 3. At least to me it seems like people have been way too worried about
the upcoming changes to these licenses. The changes to GPL from first to second draft
were mostly clarifications, but there were surprisingly many of them. The new DRM clause
was reworded exactly as I had hoped to see it written: it is not really forbidden to
implement DRM mechanisms, but the author of the code cannot use that as an excuse to
prevent further modifications to the program. The new LGPL is no longer a separate license
but a list of additional rights granted on top of the GPL. This seems like a smart thing
to do, as it makes the license a lot shorter. It will also clarify a fundamental point that many
people seem to have missed: it is possible to convert a work from LGPL to GPL without
explicit permission from the copyright holder (this point has affected our vocabulary
licensing discussions in the past). All things considered I am pretty happy about these
new drafts. I think that I would not have any problems with distributing my work under
either of these, although I will not be changing from "GPL v2 or later" to "GPL v3 or later"
until at least all of the important projects that might want to use libvoikko have expressed
their views on the issue. One thing seems quite clear at the moment: I will not be interested
in contributing to projects that drop "or later" from their GPL v2 products, the risk of
having the work wasted if everyone else later moves to v3 is just not worth it.
Plan for week ending 2006-08-06
- Package the final version of Suomi-malaga 0.7.1 (Voikko edition). I will try to fix
some compounding issues first though.
- Implement some small but interesting features in Joukahainen. For example cross references
to fi.wiktionary.org might be a nice addition
(thanks to Kalle Lampila for reminding me again about this great resource). We cannot use any
material from Wiktionary directly, as it is under the GFDL, but linking is OK and it is all that we
really need anyway. Update the test installation at
joukahainen.lokalisointi.org because there has
been more interest from testers lately.
- Start working on support for systematic evaluation of inflections.
2006-07-23
Results from this week
- Adding single new words works in Joukahainen (revision
212).
I did not yet implement adding words from a list of unchecked words because
I received a suggestion about allowing some additional metadata in the unchecked
word list as well. This seems reasonable, I just have to think how to handle that.
- Tried to port Voikko to Mac OS X on SourceForge compile farm. After compiling a
lot of dependencies I got Malaga compiled, but libvoikko failed to compile. The
reason was that the version of Mac OS X on the compile farm host (10.2) does not
support wide character variants of the C library string functions. This support has
been added in version 10.3 of OS X, but I do not have access to such a system. Almost
everything in libvoikko uses the wchar_t type, so fixing this
within libvoikko does not make much sense. The only third party library that could
have been used to add wide character support to earlier versions of OS X is
under the original BSD license which contains the GPL incompatible advertising
clause. So I had to give up here. However, it does look like porting to the modern
versions of OS X should not be that difficult. It would be great if someone wanted
to help here, either by doing the porting or offering me access to a suitable
machine.
- Wrote a patch to Malaga 7.5 that contains better fixes for issues that already
have been worked around in our Debian packages. I also included some changes from
the new Gentoo packages made by Flammie Pirinen.
Plan for week ending 2006-07-30
- Work on supporting verbs in Joukahainen (adding them into the database and
supporting them in the inflection tool).
2006-07-07
Results from this week
- Implemented word flag editing (as planned), alternative word form editing and
commenting (which were not on the plan, but seemed reasonable to do now). Also
other small improvements to Joukahainen, not worth listing here (see revision logs
if you are interested).
Related SVN revisions:
199,
204,
207,
208 and
209.
- More fixes to libvoikko and Suomi-malaga. It seems that one attempt is not
enough for me to fix a simple logic bug in libvoikko: I needed revision 200 to fix 172
from last week. Thanks to Teemu Likonen for noticing the odd behaviour this bug
was causing.
- Wrote a status report and instructions for packagers in order
to prepare for the release of "Voikko 1.0" next month.
- No testers showed up for the editing capabilities in Joukahainen. The new version
is not all that interesting if you cannot log in to it, so I have
therefore decided not to update the test installation. This should save quite
a lot of time and allow me to make last minute changes to the database structure without
needing to move data from the old installation. I believe that I will update
joukahainen.lokalisointi.org
only after the database contains all nouns, adjectives and verbs, which may not
be until late August or perhaps September. Meanwhile testing and feedback will be
organised by publishing static versions of some pages from the system (see for
example Helsinki and šamanistinen),
and perhaps by allowing some testers to access the development version on my desktop machine.
Plan for week ending 2006-07-23
(No plan for week ending 2006-07-16 as I will be totally unreachable in Italy until 2006-07-21.)
- Fully implement support for adding new words. There might even be a staging area for unchecked
word lists (a feature that will not be needed for our Finnish installation but might be useful
for other languages).
2006-07-02
Results from this week
- Used Valgrind to find bugs in tmispell-voikko. Two bugs were found: the
first was in libvoikko (fixed in revision
172)
and another seemed to be in Malaga, but I could not reproduce that so fixing
it was impossible.
- Implemented login/logout, text field editing and event logging in
Joukahainen.
Related SVN revisions:
173,
174,
177,
178,
181 and
182.
- Lots of fixes to Suomi-malaga. Some were found while moving records between
Suomi-malaga and Joukahainen and others were found by Kalle Lampila by analysing
Finnish Wikipedia content. I have also taken a few important words from our public
missing word collector application.
Related SVN revisions:
175,
176,
179,
183,
184,
185,
186,
187,
188 and
191.
Plan for week ending 2006-07-09
- Make it possible to edit word flags in Joukahainen.
- Implement database dumping and restoring to make it possible to anonymously
get an up to date snapshot of the entire database.
- Document current functionality, at least so that interested people can use our test installation
while I will be away the following week.
- Send a status report in Finnish to the mailing list.
2006-06-22
Results from this week
- Joukahainen is now installed at
joukahainen.lokalisointi.org.
- Copied version 0.7 of Suomi-Malaga to SVN, added a bit of documentation
and fixed the remaining important bugs that affected only spellchecking.
Related SVN revisions:
166,
167,
168,
169,
170 and
171.
Plan for week ending 2006-07-02
- Work on user authentication and editing capabilities of Joukahainen.
2006-06-17
Results from this week
- The converter seems to work: I have now converted 12349 words from Suomi-Malaga to Joukahainen,
and they can all be converted back. A few errors still happen in the conversion, but most of them seem
irrelevant (conversion from Suomi-Malaga to Joukahainen and then back to Suomi-Malaga changes the
inflection class, but in a way that should not affect the actual inflections in any way). There are also
bugs that need to be fixed manually. For example word "alkemisti" would be inflected incorrectly after
the conversion, but that is just because the word contains a prefix ("al|kemisti") and this information
is not present in the original dataset.
Related SVN revisions:
154,
155,
156,
157 and
160.
- A few more features in Joukahainen: search for word and support for related and alternative word forms
and compound words. See the SVN revisions above.
- Added some missing GPL headers to tmispell-voikko and noticed that it fails to compile with the new
GCC 4.1 in Debian unstable; fixed this.
Related SVN revisions:
158 and
159.
- Read a bit more about Finnish grammar (Pirkko Leino: Hyvää suomea) which inspired me to fix a few small
hyphenation and suggestion bugs in libvoikko (revision
161).
Plan for week ending 2006-06-25
- I will be visiting Hämeenlinna on Monday and Tuesday. During that time I can be contacted by e-mail but will
not be doing anything related to Joukahainen (this is why I did some extra work last Saturday and today).
And Friday is Midsummer eve, so the week will be quite short for me.
- A public test setup of Joukahainen should be made available. I plan to do this on Wednesday, but it
is still impossible to say whether this will be possible or not. I have been given an account with root access
on lokalisointi.org where the application should be placed, but
the system is running a development version of Debian and hosts a few other services that I know nothing about.
So installing anything there needs to be done quite carefully, and if the versions of Python, PostgreSQL and
Apache do not fit together (I cannot safely update any of those as that would require manual changes to the
configuration of services used by other people) I am out of luck. I have asked on our mailing list if anyone
there is willing to host a test installation of Joukahainen. If someone reading this blog is interested, please
read my mail and get
in touch.
- Take a copy of Suomi-Malaga, put it to SVN and merge all missing Voikko specific changes to it. Try to
fix any known bugs, check whether the performance is good enough and optimise if necessary.
2006-06-10
Results from this week
- Did a slightly overlong week to allow for a few days off later.
- Discussed the possibility of using a word list and classification from
Kotus
instead of the one in Suomi-malaga. I will not go into details here, but the conclusion was
that we cannot safely rely on that material becoming available under any useful licence. In order to
give time for these discussions I did not start the week by writing the converter for
Suomi-Malaga/Joukahainen as I had planned. I designed and implemented a page template system, noun
inflection display and a new word attribute type (flag) instead.
Related SVN revisions:
147,
149,
150 and
153.
- After it became clear that it was not possible to wait for alternative word lists I started writing the
converter. It seems like quite a lot of work is needed to make the conversion work in both directions without any
loss of information. This is due to the way consonant gradation is done in Suomi-malaga: each combination
of inflection class and gradation class has its own name there, but I do not want to do the same in Joukahainen.
So I chose to use the inflection classes from Suomi-Malaga but take the simple gradation classification from my
earlier work (av1 - av6 in Hunspell-fi) and combine those. I actually do not know for sure if that will even work,
but I believe it will and it should make classifying new words somewhat easier. At the moment I have converted
only a small part of the vocabulary (1177 words) but that should change soon.
Related SVN revisions:
151 and
152.
- Fixed Oo2-voikko to build, install and work correctly on Linux/x86_64. This involved two small fixes, one
in Oo2-voikko (revision 146)
and another in OpenOffice.org (bug 66162)
Plan for week ending 2006-06-18
2006-06-02
Results from this week
- The suggestion code was finished much faster than I had expected. I only needed
one day to rewrite the Python code in C and another to implement suggestion sorting
based on estimated frequencies of the suggested words in Finnish text. Unfortunately we
do not currently have any way of telling how common a given word in the vocabulary is, so
this estimation relies on counting the number of elements in compounds and some other
available properties of the words. I will return to this detail later when Joukahainen is
ready to be used, as it will allow tagging words with this kind of frequency information.
A few cases of suboptimal order in candidate string generation need to be fixed next week.
Related SVN revisions:
138 and
139.
- The structure of Joukahainen is starting to shape up pretty well. Instead of choosing
the widely used combination of PHP and MySQL I have decided to use Python and PostgreSQL. These
are the tools I have used earlier and I have some very useful code already written in Python
in our SVN repository. I have written and committed to SVN a database schema and example data
that should be nearly sufficient to implement read only access to the application. There is
also some not so useful code to display a list of words and links to a page where they can
be edited (this latter page does not exist yet). Related SVN revisions:
142 and
143.
- Other things worth mentioning after the last report: debugged and fixed a problem with tmispell
libenchant plugin and Debian Sarge (revision
137), reviewed and
committed a patch from Kai Solehmainen to make libvoikko work better on Windows (revision
140) and
added checks for malloc failures and similar ugly but necessary checks in libvoikko now that
we have reached the goals of version 0.9 (revision
144).
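The suggestion sorting idea from the first item above (prefer words that are likely to be more common, estimated from the number of elements in compounds) can be illustrated with a short sketch. This is not the libvoikko implementation, which is written in C and also uses other word properties. The input format, where each candidate arrives with its compound element count already known, is an assumption made for the example.

```python
# Illustrative sketch only: rank suggestions so that words with fewer
# compound elements come first. The (word, element_count) input format
# is hypothetical.

def rank_suggestions(candidates):
    """Sort (word, element_count) pairs by element count.

    sorted() is stable, so candidates with the same count keep the order
    in which they were generated.
    """
    ranked = sorted(candidates, key=lambda candidate: candidate[1])
    return [word for word, _count in ranked]


# Made-up element counts, standing in for the result of a real analysis.
print(rank_suggestions([("talonmies", 2), ("talo", 1), ("talonpoika", 2)]))
```

Keeping the generation order as the tie breaker matters, because the order in which candidate strings are generated already carries some information about how likely each correction is.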
Plan for week ending 2006-06-11
- Write a converter that can be used to import data from Suomi-Malaga to Joukahainen and
export it back to Suomi-Malaga.
2006-05-27
Results from this week
- The Python prototype was implemented (revisions
125 and
129).
Not much testing was needed, the suggestions were generally quite good right from the
beginning. It is not possible
to find suggestions for complex spelling errors (those with more than one independent
error in the same word) because Malaga (or actually Suomi-Malaga) is quite slow to
check the validity of candidate strings.
- Case-correcting suggestions, and a few other suggestion types, were implemented
in C (revisions
119,
126,
130 and
135).
- Since there was time, I worked on a few other things as well: manual pages (revision
129), hyphenation
(revision 131) and
tmispell-voikko, an ispell style spellchecker interface originally developed for soikko
by Pauli Virtanen (revisions
133 and
134)
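To give an idea of what generating and checking candidate strings means, here is a generic single-edit candidate generator. It is a textbook technique and not the algorithm from the prototype in revisions 125 and 129. The is_valid callback stands in for the slow Malaga check, and the max_candidates cap of 300 merely borrows the figure mentioned in the plan for this week. All names are assumptions.

```python
# Generic single-edit suggestion sketch; not the actual Voikko prototype.
ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def single_edit_candidates(word, alphabet=ALPHABET):
    """Return all distinct strings one edit away from word, in a fixed order."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    seen = {word}
    result = []
    for candidate in deletes + swaps + replaces + inserts:
        if candidate not in seen:
            seen.add(candidate)
            result.append(candidate)
    return result


def suggest(word, is_valid, max_candidates=300, limit=5):
    """Check at most max_candidates strings and return up to limit valid ones."""
    suggestions = []
    for candidate in single_edit_candidates(word)[:max_candidates]:
        if is_valid(candidate):  # the expensive step in a real spell checker
            suggestions.append(candidate)
            if len(suggestions) == limit:
                break
    return suggestions


lexicon = {"talo", "tali", "kissa"}
print(suggest("tlao", lexicon.__contains__))
```

Because every candidate has to be validated, the number of generated strings directly limits how complex errors can be corrected, which is why words with several independent errors are out of reach when validation is slow.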
Plan for week ending 2006-06-04
- Finish suggestion code in libvoikko.
- Start working on Joukahainen: design database schema and write a skeleton
application where actual functionality can be added later.
2006-05-21
Plan for week ending 2006-05-28
- C implementation of case-correcting suggestions
- Python prototype implementing other suggestion generation algorithms
- Tests with Python prototype: adjust algorithms so that they produce about 300
variations for each input string
Any questions should be sent to hatapitk@iki.fi.