Morph-it!
A Free Morphological Lexicon for the Italian Language
by Marco Baroni (marco.baroni@unitn.it) and Eros Zanchetta (eros@sslmit.unibo.it)
Morph-it! is a free (as in free speech and in free beer) morphological resource for the Italian language. Morph-it! is a lexicon of inflected forms with their lemma and morphological features. For example:
gattini | gattino | NOUN-M:p |
andarono | andare | VER:ind+past+3+p |
fastidiosetto | fastidioso | ADJ:dim+m+s |
As of version 0.4.7 the list contains 504,906 entries and 34,968 lemmas. Morph-it! can be used as a data source for an Italian lemmatizer / morphological analyzer / morphological generator.
As example applications, on the Morph-it! site you can download the lexicon compiled for the SFST [1] and Finite State Utilities [2] packages.
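As a further illustration of the lemmatizer / analyzer / generator use case, the following Python sketch builds two lookup tables from a few entries. It assumes the lexicon is a tab-separated text file with form, lemma and features on each line (as the MySQL loading step later in this document suggests); the names `read_entries` and `build_indexes` are hypothetical and not part of Morph-it!.

```python
from collections import defaultdict

# Sample entries copied from the examples above; the tab-separated layout
# (form, lemma, features) is an assumption based on the MySQL loading step.
SAMPLE = (
    "gattini\tgattino\tNOUN-M:p\n"
    "andarono\tandare\tVER:ind+past+3+p\n"
    "fastidiosetto\tfastidioso\tADJ:dim+m+s\n"
)


def read_entries(lines):
    """Yield (form, lemma, features) triples, skipping malformed lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            yield tuple(fields)


def build_indexes(entries):
    """Return (analyzer, generator) dictionaries built from the entries."""
    analyzer = defaultdict(list)   # form -> [(lemma, features), ...]
    generator = defaultdict(list)  # (lemma, features) -> [form, ...]
    for form, lemma, features in entries:
        analyzer[form].append((lemma, features))
        generator[(lemma, features)].append(form)
    return analyzer, generator


analyzer, generator = build_indexes(read_entries(SAMPLE.splitlines()))
print(analyzer["andarono"])                 # analyses of an inflected form
print(generator[("gattino", "NOUN-M:p")])   # forms for a lemma + feature tag
```

Both tables map to lists because a form can have several analyses, and a lemma and feature combination can have several orthographic variants.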
The data for Morph-it! were prepared by Marco Baroni and Eros Zanchetta using a mixture of corpus-based methods, regular-expression-based rules and manual checking.
Morph-it! is still under development and there may still be gaps, unlikely forms, etc. We will be very grateful if you let us know about missing forms, problems, and ideas/resources that can help us expand or clean the list (sslmitdevonline@sslmit.unibo.it).
Notice in particular that, since we extracted data from an Italian newspaper corpus (the la Repubblica corpus, also accessible here), we have many gaps in basic, everyday vocabulary. Also, the current version does not distinguish between coordinative and subordinative conjunctions. We plan to do this in the near future. More generally, we are not fully satisfied with our current features for function words, and we plan to revise them.
A more ambitious plan we would like to pursue is the identification of derivational structure and derivationally related lemmas. Then, we will add full semantic representations. Then, we will take over the world and reign supreme for the next 100 years.
The remainder of this document contains a commented list of the morphological features used in the lexicon, instructions for creating a MySQL version, licensing information and acknowledgments.
Features
We distinguish between derivational features, which pertain to the lemma, and inflectional features, which pertain to the wordform.
Derivational and inflectional features are separated by a colon.
The derivational features are in upper case and they are dash-delimited. The inflectional features are in lower case and they are plus-sign-delimited.
For example, we represent gender as a derivational feature of nouns (we take “cameriere” and “cameriera” to belong to different lemmas), whereas we treat number as an inflectional feature of nouns. Thus, gender and number are represented as in the following examples:
cameriere | cameriera | NOUN-F:p |
cameriera | cameriera | NOUN-F:s |
camerieri | cameriere | NOUN-M:p |
cameriere | cameriere | NOUN-M:s |
For adjectives, gender is considered an inflectional feature. Thus, gender is represented differently in adjectives and nouns:
azzurre | azzurra | NOUN-F:p |
azzurra | azzurra | NOUN-F:s |
azzurri | azzurro | NOUN-M:p |
azzurro | azzurro | NOUN-M:s |
azzurra | azzurro | ADJ:pos+f+s |
azzurri | azzurro | ADJ:pos+m+p |
azzurro | azzurro | ADJ:pos+m+s |
azzurre | azzurro | ADJ:pos+f+p |
Changes that are purely orthographical/phonological but do not affect morphology/syntax/meaning are not reflected in the features. For example, the following variants of “cento” share the same lemma and the same features:
cent' | cento | DET-NUM-CARD |
cento | cento | DET-NUM-CARD |
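The tag conventions described in this section (a colon between derivational and inflectional features, dashes within the former, plus signs within the latter) can be sketched as a small parser. This is a minimal illustration, not code shipped with Morph-it!; the function name `parse_features` is hypothetical.

```python
def parse_features(tag):
    """Split a tag such as 'NOUN-F:p' into (derivational, inflectional) lists."""
    # Everything before the colon is derivational, everything after it is
    # inflectional; tags without a colon have no inflectional features.
    derivational, _, inflectional = tag.partition(":")
    return (
        derivational.split("-"),
        inflectional.split("+") if inflectional else [],
    )


print(parse_features("NOUN-F:p"))      # gender is derivational for nouns
print(parse_features("ADJ:pos+f+s"))   # gender is inflectional for adjectives
print(parse_features("DET-NUM-CARD"))  # no inflectional features at all
```

Note that the sketch does not try to tell a multi-part category name (such as DET-NUM-CARD) from a category followed by derivational features (such as NOUN-F); that distinction would need the list of categories given below.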
We now present the full list of features we used, organized by major syntactic categories.
ABL
Abbreviated locutions, such as “a.C.”, “ecc.” and “i.e.”
ADJ
Adjectives, with the following inflectional features:
pos/comp/sup
That is: positive, comparative, superlative. Although these are not true inflectional features, given their high productivity we decided to represent them as properties of inflected forms.
f/m
That is: feminine, masculine.
s/p
That is: singular, plural.
ADV
Adverbs.
ART
Articles, with gender as a derivational feature (F/M) and number as an inflectional feature (s/p).
ARTPRE
Preposition+article compounds (“col”, “della”, “nei”…), with gender as a derivational feature (F/M) and number as an inflectional feature (s/p).
ASP
Aspectuals (“stare” in “stare per”). Same inflectional features as VER (see below).
AUX
Auxiliaries (“essere”, “avere”, “venire”). Same inflectional features as VER (see below).
CAU
Causatives (“fare” in “far sapere”). Same inflectional features as VER (see below).
CE
Clitic “ce” as in “ce l'ho fatta”.
CI
Clitic “ci” as in “ci prova”.
CON
Conjunctions.
DET-DEMO
Demonstrative determiners (such as “questa” in “questa sera”), with inflectional gender (f/m) and number (s/p) features.
DET-INDEF
Indefinite determiners (such as “molti” in “molti amici”), with inflectional gender (f/m) and number (s/p) features.
DET-NUM-CARD
Cardinal number determiners (e.g., “cinque” in “cinque amici”). Pure-digit numbers are not included (i.e., the list includes “100mila” but not “100000” nor “100,000”, “100.000”, etc.)
DET-POSS
Possessive determiners (e.g., “mio”, “suo”), with inflectional gender (f/m) and number (s/p) features.
DET-WH
Wh determiners (e.g., “quale” in “quale amico”), with inflectional gender (f/m) and number (s/p) features.
INT
Interjections.
MOD
Modal verbs (e.g. “dover” in “dover ricostruire”). Same inflectional features as VER (see below).
NE
Clitic “ne” (as in: “ne hanno molte”).
NOUN
Nouns, with gender as a derivational feature (F/M) and number as an inflectional feature (s/p).
PON
Non-sentential punctuation marks (e.g. , “ $).
PRE
Prepositions.
PRO-DEMO
Demonstrative pronouns (e.g. “questa” in “voglio questa”), with both gender and number as derivational features (F/M, S/P).
PRO-INDEF
Indefinite pronouns (e.g., “molti” in “vengono molti”), with both gender and number as derivational features (F/M, S/P).
PRO-NUM
Numeral pronouns (e.g., “cinque” in “cinque sono sopravvissuti”). Pure-digit numbers are not included (i.e., the list includes “100mila” but not “100000” nor “100,000”, “100.000”, etc.)
PRO-PERS
Personal pronouns, such as “lui” and “loro”. Clitic personal pronouns (such as pronominal “lo” and “si”) are marked by the derivational feature CLI. Person, gender and number are also encoded as derivational features (1/2/3, F/M, S/P).
PRO-POSS
Possessive pronouns (such as “loro” in “non era uno dei loro”), with gender and number encoded as derivational features (F/M, S/P).
PRO-WH
Wh-pronouns, such as “quale” in “quale è venuto?”.
SENT
End of sentence marker (! . … : ?).
SI
Clitic “si” as in “di cui si discute”.
TALE
“Tale” in constructions such as “una fortuna tale che…”, “la tal cosa”, “tali amici”, etc. Gender (f/m) and number (s/p) as inflectional features.
VER
Verbs, with the following inflectional features:
cond/ger/impr/ind/inf/part/sub
Conditional, gerund, imperative, indicative, infinitive, participle, subjunctive.
pre/past/impf/fut
Present, past, imperfective, future.
1/2/3
Person.
s/p
Number.
f/m
Gender (only relevant for participles).
cela/cele/celi/celo/cene/ci/gli/gliela/gliele/glieli/glielo/gliene/la/
le/li/lo/mela/mele/meli/melo/mene/mi/ne/sela/sele/seli/selo/sene/si/
tela/tele/teli/telo/tene/ti/vela/vele/veli/velo/vene/vi
Clitics attached to the verb.
WH
Wh elements (“come”, “qualora”, “quando”…)
WH-CHE
“Che” as a wh element (e.g., “l'uomo che hai visto”, “hai detto che”).
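As an illustration of how the verbal features listed under VER can be read back from a tag, here is a minimal Python sketch that assigns each plus-delimited feature to a slot (mood, tense, person, number, gender, clitic). The value sets are copied from the VER entry above; the function name `decode_verb` and the slot names are hypothetical, and the sketch assumes each slot occurs at most once per tag.

```python
MOODS = {"cond", "ger", "impr", "ind", "inf", "part", "sub"}
TENSES = {"pre", "past", "impf", "fut"}
PERSONS = {"1", "2", "3"}
NUMBERS = {"s", "p"}
GENDERS = {"f", "m"}
CLITICS = set(
    "cela cele celi celo cene ci gli gliela gliele glieli glielo gliene la "
    "le li lo mela mele meli melo mene mi ne sela sele seli selo sene si "
    "tela tele teli telo tene ti vela vele veli velo vene vi".split()
)

# The value sets do not overlap, so a feature identifies its own slot.
SLOTS = [
    ("mood", MOODS),
    ("tense", TENSES),
    ("person", PERSONS),
    ("number", NUMBERS),
    ("gender", GENDERS),
    ("clitic", CLITICS),
]


def decode_verb(tag):
    """Turn a tag such as 'VER:ind+past+3+p' into a slot -> value dictionary."""
    category, _, inflection = tag.partition(":")
    decoded = {"category": category}
    for feature in filter(None, inflection.split("+")):
        for slot, values in SLOTS:
            if feature in values:
                decoded[slot] = feature
                break
        else:
            # Keep anything not covered by the documented value sets.
            decoded.setdefault("unknown", []).append(feature)
    return decoded


print(decode_verb("VER:ind+past+3+p"))
```

Because ASP, AUX, CAU and MOD share the inflectional features of VER, the same function should work for their tags; features outside the documented sets end up under the `unknown` key rather than being dropped.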
Creating a MySQL version
This mini-howto guides you through the creation of a MySQL version of Morph-it!
You need a working installation of MySQL on your server and CREATE privileges.
Step 1: create a new database:
CREATE DATABASE morphit;
Step 2: change to the newly created database:
USE morphit;
Step 3: create a new table (adjust version number in COMMENT):
CREATE TABLE `morphit` ( `form` VARCHAR(255) NOT NULL DEFAULT '', `lemma` VARCHAR(255) NOT NULL DEFAULT '', `features` VARCHAR(255) NOT NULL DEFAULT '' ) ENGINE=MyISAM COMMENT='Version X.XX';
Step 4: load data into table:
LOAD DATA LOCAL INFILE '/home/user/filename.txt' INTO TABLE morphit FIELDS TERMINATED BY '\t';
Step 5: add a primary key:
ALTER TABLE `morphit`.`morphit` ADD COLUMN `id` INT(11) NOT NULL AUTO_INCREMENT FIRST, ADD PRIMARY KEY(`id`);
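For readers without a MySQL server, the same steps can be approximated with SQLite through Python's standard `sqlite3` module. This is a sketch of an alternative, not part of the original howto: it uses an in-memory database and two sample rows in place of the real lexicon file, assumes the tab-separated layout implied by Step 4, and adds an index on `form` that the MySQL steps do not create.

```python
import sqlite3

# Sample rows copied from the examples earlier in this document; a real run
# would read the tab-separated lexicon file instead (assumed layout).
SAMPLE = "gattini\tgattino\tNOUN-M:p\nandarono\tandare\tVER:ind+past+3+p\n"

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only

# Steps 3 and 5 combined: the table is created with its primary key.
conn.execute(
    "CREATE TABLE morphit ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " form TEXT NOT NULL DEFAULT '',"
    " lemma TEXT NOT NULL DEFAULT '',"
    " features TEXT NOT NULL DEFAULT '')"
)

# Step 4: load the tab-separated rows, skipping malformed lines.
rows = [line.split("\t") for line in SAMPLE.splitlines() if line.count("\t") == 2]
conn.executemany(
    "INSERT INTO morphit (form, lemma, features) VALUES (?, ?, ?)", rows
)
conn.execute("CREATE INDEX idx_form ON morphit (form)")  # not in the howto
conn.commit()

# Look up all analyses of an inflected form.
result = conn.execute(
    "SELECT lemma, features FROM morphit WHERE form = ?", ("gattini",)
).fetchall()
print(result)
```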
Licensing information
This program is dual-licensed free software; you can redistribute it and/or modify it under the terms of the Creative Commons Attribution ShareAlike 2.0 License and the GNU Lesser General Public License.
Creative Commons Attribution ShareAlike 2.0
Morph-it! is licensed under the Creative Commons Attribution ShareAlike 2.0 License.
You are free:
- to copy, distribute and display the resource;
- to make derivative works;
- to make commercial use of the resource;
under the following conditions:
- you must give the original authors credit;
- if you alter, transform, or build upon this work, you may distribute the resulting work only under a license identical to this one;
- for any reuse or distribution, you must make clear to others the license terms of this work;
- any of these conditions can be waived if you get permission from the copyright holders.
Your fair use and other rights are in no way affected by the above.
You can find a link to the full license from the Morph-it! website.
Copyright (C) 2004-2007 Marco Baroni and Eros Zanchetta.
GNU Lesser General Public License
Morph-it! A free morphological lexicon for the Italian Language Copyright (C) 2004-2007 Marco Baroni and Eros Zanchetta
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
Credits
The main data source for the Morph-it! lexicon was the “la Repubblica” corpus. Thus, we would like to thank the colleagues who developed this resource with us: Lorenzo Piccioni, Guy Aston, Silvia Bernardini, Federica Comastri, Alessandra Volpi and Marco Mazzoleni.
We would like to thank the developers of the tools we used to tag, lemmatize and index the Repubblica corpus: the (Italian) TreeTagger (Helmut Schmid, Achim Stein), the ACOPOST taggers (Ingo Schroeder) and the IMS Corpus WorkBench (Oli Christ, Arne Fitschen and Stefan Evert).
Thanks to Helmut Schmid also for converting the Morph-it! lexicon into an SFST transducer.
We would like to thank Aldo Calpini, who developed the Perl module Lingua::IT::Conjugate.
We are also very grateful to Jan Daciuk for creating his finite-state utilities and for helping us learn to use them.
Finally, a big thanks to the members of the FoLUG, SannioLUG and Scuola (software libero nella scuola) mailing lists, for advice about licensing and dissemination.