Morph-it!

A Free Morphological Lexicon for the Italian Language

by Marco Baroni (marco.baroni@unitn.it) and Eros Zanchetta (eros@sslmit.unibo.it)

Morph-it! is a free (as in free speech and in free beer) morphological resource for the Italian language. Morph-it! is a lexicon of inflected forms with their lemma and morphological features. For example:

gattini	gattino	NOUN-M:p
andarono	andare	VER:ind+past+3+p
fastidiosetto	fastidioso	ADJ:dim+m+s

As of version 0.4.7 the list contains 504,906 entries and 34,968 lemmas. Morph-it! can be used as a data source for an Italian lemmatizer / morphological analyzer / morphological generator.
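As a rough illustration of the lemmatizer / generator use case, the sketch below builds two in-memory indexes from lexicon lines. It assumes each line holds form, lemma and features separated by tabs (as the `FIELDS TERMINATED BY '\t'` clause in the MySQL section suggests); the function name `load_lexicon` and the sample data layout are illustrative, not part of Morph-it! itself.

```python
from collections import defaultdict

def load_lexicon(lines):
    """Index lexicon lines for analysis (form -> analyses) and generation."""
    analyses = defaultdict(list)    # form -> [(lemma, features), ...]
    generation = defaultdict(list)  # (lemma, features) -> [form, ...]
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Assumption: exactly three tab-separated fields per line.
        form, lemma, features = line.split("\t")
        analyses[form].append((lemma, features))
        generation[(lemma, features)].append(form)
    return analyses, generation

sample = [
    "gattini\tgattino\tNOUN-M:p",
    "andarono\tandare\tVER:ind+past+3+p",
    "fastidiosetto\tfastidioso\tADJ:dim+m+s",
]
analyses, generation = load_lexicon(sample)
print(analyses["gattini"])                          # lemmatization / analysis
print(generation[("andare", "VER:ind+past+3+p")])   # generation
```

With the full lexicon you would pass an open file object instead of the `sample` list.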

As example applications, on the Morph-it! site you can download the lexicon compiled for the SFST [1] and Finite State Utilities [2] packages.

The data for Morph-it! were prepared by Marco Baroni and Eros Zanchetta using a mixture of corpus-based methods, regular-expression-based rules and manual checking.

Morph-it! is still under development and there may still be gaps, unlikely forms, etc. We will be very grateful if you let us know about missing forms, problems, and ideas/resources that can help us expand or clean the list (sslmitdevonline@sslmit.unibo.it).

Notice in particular that, since we extracted data from an Italian newspaper corpus (the “la Repubblica” corpus, also accessible online), we have many gaps in basic, everyday vocabulary. Also, the current version does not distinguish between coordinative and subordinative conjunctions. We plan to do this in the near future. More generally, we are not fully satisfied with our current features for function words, and we plan to revise them.

A more ambitious plan we would like to pursue is the identification of derivational structure and derivationally related lemmas. Then, we will add full semantic representations. Then, we will take over the world and reign supreme for the next 100 years.

The remainder of this document contains a commented list of the morphological features used in the lexicon, licensing information and acknowledgments.

Download

Morph-IT

Compiled Automata

Features

We distinguish between derivational features, which pertain to the lemma, and inflectional features, which pertain to the wordform.

Derivational and inflectional features are separated by a colon.

The derivational features are in upper case and they are dash-delimited. The inflectional features are in lower case and they are plus-sign-delimited.
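The delimiter conventions just described can be sketched as a small parser. The function name `parse_features` is hypothetical; the sketch only assumes what is stated above: a colon separates the two groups, dashes separate derivational features and plus signs separate inflectional ones.

```python
def parse_features(tag):
    """Split a feature string into (derivational, inflectional) lists."""
    head, sep, tail = tag.partition(":")
    derivational = head.split("-")                         # upper case, dash-delimited
    inflectional = tail.split("+") if sep and tail else []  # lower case, plus-delimited
    return derivational, inflectional

print(parse_features("NOUN-M:p"))
print(parse_features("VER:ind+past+3+p"))
print(parse_features("DET-NUM-CARD"))  # no colon: no inflectional features
```

The first element of the derivational list is the major category (NOUN, VER, DET, ...).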

For example, we represent gender as a derivational feature of nouns (we take “cameriere” and “cameriera” to belong to different lemmas), whereas we treat number as an inflectional feature of nouns. Thus, gender and number are represented as in the following examples:

cameriere	cameriera	NOUN-F:p
cameriera	cameriera	NOUN-F:s
camerieri	cameriere	NOUN-M:p
cameriere	cameriere	NOUN-M:s

For adjectives, gender is considered an inflectional feature. Thus, gender is represented differently in adjectives and nouns:

azzurre	azzurra	NOUN-F:p
azzurra	azzurra	NOUN-F:s
azzurri	azzurro	NOUN-M:p
azzurro	azzurro	NOUN-M:s
azzurra	azzurro	ADJ:pos+f+s
azzurri	azzurro	ADJ:pos+m+p
azzurro	azzurro	ADJ:pos+m+s
azzurre	azzurro	ADJ:pos+f+p
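Because gender sits on different sides of the colon for nouns and adjectives, code that needs the gender of an entry has to look in both places. A minimal sketch follows; the helper name `gender_of` and its return convention are assumptions made for illustration.

```python
def gender_of(tag):
    """Return (gender, kind) where kind says where the gender was encoded."""
    head, _, tail = tag.partition(":")
    derivational = head.split("-")[1:]           # skip the category itself
    inflectional = tail.split("+") if tail else []
    for feature in derivational:
        if feature in ("F", "M"):                # e.g. NOUN-F:p
            return feature.lower(), "derivational"
    for feature in inflectional:
        if feature in ("f", "m"):                # e.g. ADJ:pos+f+p
            return feature, "inflectional"
    return None, None                            # no gender encoded

print(gender_of("NOUN-F:p"))
print(gender_of("ADJ:pos+f+p"))
```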

Changes that are purely orthographical/phonological but do not affect morphology/syntax/meaning are not reflected in the features. For example, the following variants of “cento” share the same lemma and the same features:

cent'	cento	DET-NUM-CARD
cento	cento	DET-NUM-CARD

We now present the full list of features we used, organized by major syntactic categories.

ABL

Abbreviated locutions, such as “a.C.”, “ecc.” and “i.e.”

ADJ

Adjectives, with the following inflectional features:

pos/comp/sup

That is: positive, comparative, superlative. Although these are not true inflectional features, given their high productivity we decided to represent them as properties of inflected forms.

f/m

That is: feminine, masculine.

s/p

That is: singular, plural.

ADV

Adverbs.

ART

Articles, with gender as a derivational feature (F/M) and number as an inflectional feature (s/p).

ARTPRE

Preposition+article compounds (“col”, “della”, “nei”…), with gender as a derivational feature (F/M) and number as an inflectional feature (s/p).

ASP

Aspectuals (“stare” in “stare per”). Same inflectional features as VER (see below).

AUX

Auxiliaries (“essere”, “avere”, “venire”). Same inflectional features as VER (see below).

CAU

Causatives (“fare” in “far sapere”). Same inflectional features as VER (see below).

CE

Clitic “ce” as in “ce l'ho fatta”.

CI

Clitic “ci” as in “ci prova”.

CON

Conjunctions.

DET-DEMO

Demonstrative determiners (such as “questa” in “questa sera”), with inflectional gender (f/m) and number (s/p) features.

DET-INDEF

Indefinite determiners (such as “molti” in “molti amici”), with inflectional gender (f/m) and number (s/p) features.

DET-NUM-CARD

Cardinal number determiners (e.g., “cinque” in “cinque amici”). Pure-digit numbers are not included (i.e., the list includes “100mila” but not “100000” nor “100,000”, “100.000”, etc.)

DET-POSS

Possessive determiners (e.g., “mio”, “suo”), with inflectional gender (f/m) and number (s/p) features.

DET-WH

Wh determiners (e.g., “quale” in “quale amico”), with inflectional gender (f/m) and number (s/p) features.

INT

Interjections.

MOD

Modal verbs (e.g. “dover” in “dover ricostruire”). Same inflectional features as VER (see below).

NE

Clitic “ne” (as in: “ne hanno molte”).

NOUN

Nouns, with gender as a derivational feature (F/M) and number as an inflectional feature (s/p).

PON

Non-sentential punctuation marks (e.g. , “ $).

PRE

Prepositions.

PRO-DEMO

Demonstrative pronouns (e.g. “questa” in “voglio questa”), with both gender and number as derivational features (F/M, S/P).

PRO-INDEF

Indefinite pronouns (e.g., “molti” in “vengono molti”), with both gender and number as derivational features (F/M, S/P).

PRO-NUM

Numeral pronouns (e.g., “cinque” in “cinque sono sopravvissuti”). Pure-digit numbers are not included (e.g., the list includes “100mila” but not 100000 nor 100,000, 100.000, etc.)

PRO-PERS

Personal pronouns, such as “lui” and “loro”. Clitic pronouns (such as pronominal “lo” and “si”) are marked by the derivational feature CLI. Person, gender and number are also encoded as derivational features (1/2/3, F/M, S/P).

PRO-POSS

Possessive pronouns (such as “loro” in “non era uno dei loro”), with gender and number encoded as derivational features (F/M, S/P).

PRO-WH

Wh-pronouns, such as “quale” in “quale è venuto?”

SENT

End of sentence marker (! . … : ?).

SI

Clitic “si” as in “di cui si discute”.

TALE

“Tale” in constructions such as “una fortuna tale che…”, “la tal cosa”, “tali amici”, etc. Gender (f/m) and number (s/p) as inflectional features.

VER

Verbs, with the following inflectional features:

cond/ger/impr/ind/inf/part/sub

Conditional, gerundive, imperative, indicative, infinitive, participle, subjunctive.

pres/past/impf/fut

Present, past, imperfective, future.

1/2/3

Person.

s/p

Number.

f/m

Gender (only relevant for participles).

cela/cele/celi/celo/cene/ci/gli/gliela/gliele/glieli/glielo/gliene/la/
le/li/lo/mela/mele/meli/melo/mene/mi/ne/sela/sele/seli/selo/sene/si/
tela/tele/teli/telo/tene/ti/vela/vele/veli/velo/vene/vi

Clitics attached to the verb.
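The verbal features listed above can be decoded into named slots roughly as follows. This is only a sketch: the function name `decode_verb` and the slot names are hypothetical, and the present-tense label is accepted in two spellings ("pre" and "pres") as a precaution rather than as a statement about the lexicon files.

```python
MOODS = {"cond", "ger", "impr", "ind", "inf", "part", "sub"}
# Assumption: accept both "pre" and "pres" for the present tense.
TENSES = {"pre", "pres", "past", "impf", "fut"}
PERSONS = {"1", "2", "3"}
NUMBERS = {"s", "p"}
GENDERS = {"f", "m"}  # only relevant for participles
CLITICS = set("""
cela cele celi celo cene ci gli gliela gliele glieli glielo gliene la
le li lo mela mele meli melo mene mi ne sela sele seli selo sene si
tela tele teli telo tene ti vela vele veli velo vene vi
""".split())

SLOTS = (("mood", MOODS), ("tense", TENSES), ("person", PERSONS),
         ("number", NUMBERS), ("gender", GENDERS), ("clitic", CLITICS))

def decode_verb(tag):
    """Map the inflectional features of a verbal tag onto named slots."""
    category, _, tail = tag.partition(":")
    decoded = {"category": category}
    decoded.update({slot: None for slot, _ in SLOTS})
    for feature in (tail.split("+") if tail else []):
        for slot, values in SLOTS:
            if feature in values:
                decoded[slot] = feature
                break
        else:
            # Feature not in any of the sets listed in this section.
            raise ValueError(f"unknown feature {feature!r} in {tag!r}")
    return decoded

print(decode_verb("VER:ind+past+3+p"))
```

Since ASP, AUX, CAU and MOD are described as sharing the inflectional features of VER, the same function should apply to their tags as well.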

WH

Wh elements (“come”, “qualora”, “quando”…)

WH-CHE

“Che” as a wh element (e.g., “l'uomo che hai visto”, “hai detto che”).

Creating a MySQL version

This mini-howto guides you through the creation of a MySQL version of Morph-it!

You need a working installation of MySQL on your server and CREATE privileges.

Step 1: create a new database:

CREATE DATABASE morphit;

Step 2: change to the newly created database:

USE morphit;

Step 3: create a new table (adjust version number in COMMENT):

CREATE TABLE `morphit` (
  `form` VARCHAR(255) NOT NULL DEFAULT '',
  `lemma` VARCHAR(255) NOT NULL DEFAULT '',
  `features` VARCHAR(255) NOT NULL DEFAULT ''
) ENGINE=MyISAM COMMENT='Version X.XX';

Step 4: load data into table:

LOAD DATA LOCAL INFILE '/home/user/filename.txt'
INTO TABLE morphit
FIELDS TERMINATED BY '\t';

Step 5: add a primary key:

ALTER TABLE `morphit`.`morphit`
ADD COLUMN `id` INT(11) NOT NULL AUTO_INCREMENT FIRST,
ADD PRIMARY KEY(`id`);
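Once the table is populated, looking up a wordform is a simple SELECT on the `form` column. To keep the example self-contained, the sketch below uses Python's built-in `sqlite3` module as a stand-in for MySQL (an assumption made for illustration only; with MySQL you would issue the same kind of query through a client of your choice). The table and column names follow the steps above; the helper name `lookup` is hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL "morphit" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE morphit ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " form TEXT NOT NULL DEFAULT '',"
    " lemma TEXT NOT NULL DEFAULT '',"
    " features TEXT NOT NULL DEFAULT '')"
)
rows = [
    ("cameriere", "cameriera", "NOUN-F:p"),
    ("cameriera", "cameriera", "NOUN-F:s"),
    ("camerieri", "cameriere", "NOUN-M:p"),
    ("cameriere", "cameriere", "NOUN-M:s"),
]
conn.executemany(
    "INSERT INTO morphit (form, lemma, features) VALUES (?, ?, ?)", rows
)

def lookup(conn, form):
    """Return all (lemma, features) analyses stored for a wordform."""
    cur = conn.execute(
        "SELECT lemma, features FROM morphit WHERE form = ? ORDER BY id",
        (form,),
    )
    return cur.fetchall()

print(lookup(conn, "cameriere"))  # an ambiguous form yields several rows
```

On a table with hundreds of thousands of rows, an index on `form` would likely be worth adding for this kind of lookup.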

Licensing information

This program is dual-licensed free software; you can redistribute it and/or modify it under the terms of the Creative Commons Attribution ShareAlike 2.0 License and the GNU Lesser General Public License.

Creative Commons Attribution ShareAlike 2.0

Morph-it! is licensed under the Creative Commons Attribution ShareAlike 2.0 License.

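You are free:

to copy, distribute, display, and perform the work;
to make derivative works;
to make commercial use of the work;

under the following conditions:

Attribution. You must give the original author credit.
Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under a license identical to this one.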

Your fair use and other rights are in no way affected by the above.

You can find a link to the full license from the Morph-it! website.

Copyright (C) 2004-2007 Marco Baroni and Eros Zanchetta.

GNU Lesser General Public License

Morph-it! A free morphological lexicon for the Italian Language
Copyright (C) 2004-2007 Marco Baroni and Eros Zanchetta

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

Credits

The main data source for the Morph-it! lexicon was the “la Repubblica” corpus. Thus, we would like to thank the colleagues who developed this resource with us: Lorenzo Piccioni, Guy Aston, Silvia Bernardini, Federica Comastri, Alessandra Volpi and Marco Mazzoleni.

We would like to thank the developers of the tools we used to tag, lemmatize and index the Repubblica corpus: the (Italian) TreeTagger (Helmut Schmid, Achim Stein), the ACOPOST taggers (Ingo Schroeder) and the IMS Corpus WorkBench (Oli Christ, Arne Fitschen and Stefan Evert).

Thanks to Helmut Schmid also for converting the Morph-it! lexicon into a SFST transducer.

We would like to thank Aldo Calpini, who developed the Perl module Lingua::IT::Conjugate.

We are also very grateful to Jan Daciuk for creating his finite-state utilities and for helping us learn to use them.

Finally, a big thanks to the members of the FoLUG, SannioLUG and Scuola (software libero nella scuola) mailing lists, for advice about licensing and dissemination.