Probabilistic Reasoning and Learning for the Semantic Web

Zese, Riccardo

The Semantic Web introduced a new vision of the World Wide Web where the information resources published on the Internet are readable and understandable by machines. However, incompleteness and/or uncertainty are intrinsic to much information, specially when it is collected from different sources. Thus we need a way to manage this kind of data. In this thesis we address this problem and we present a complete framework for handling uncertainty in the Semantic Web. Description Logics (DLs) are the basis of the Semantic Web. DL knowledge bases (KBs) contains both assertional and terminological information regarding individuals, classes of individuals and relationships among them. We first defined a probabilistic semantics for DLs, called DISPONTE. It is inspired by the distribution semantics, a well known approach in probabilistic logic programming. DISPONTE permits to associate degrees of belief to pieces of information and to compute the probability of queries to KBs. The thesis then proposes a suite of algorithms for reasoning with KBs following DISPONTE: • BUNDLE, for ``Binary decision diagrams for Uncertain reasoNing on Description Logic thEories'', computes the probability of queries w.r.t. DISPONTE KBs by means of the tableau algorithm and knowledge compilation. BUNDLE is based on Pellet, a state of the art reasoner, and is written in Java. • TRILL, for ``Tableau Reasoner for descrIption Logics in Prolog'', performs inference over DISPONTE KBs with the tableau algorithm implemented in the declarative Prolog language. Prolog is useful for managing the non-determinism of the reasoning process. • TRILLP, for ``TRILL powered by Pinpointing formulas'', differs from TRILL because it encodes the set of all explanations for queries with a more compact Boolean formula. A second problem to address is the fact that the probability values are difficult to set for humans. However, usually information is available which can be leveraged for tuning these parameters. Moreover, terminological information in KBs may be incomplete or poorly structured. We thus need of learning systems able to cope with these problems. We present two learning systems, one for each problem: • EDGE, for ``Em over bDds for description loGics paramEter learning'', learns the parameters of a DISPONTE KB. • LEAP, for ``LEArning Probabilistic description logics'', learns terminological axioms together with their parameters by using EDGE. However, the size of the data is constantly increasing, leading to the so-called Bid Data, Dataset are often too huge to be handled by a single machine in a reasonable time. Modern computing infrastructures such as clusters and clouds must be used where the work is divided among different machines. We thus extended both EDGE and LEAP to exploit these facilities by implementing EDGEMR and LEAPMR that distribute the work using a MapReduce approach. All systems were tested on real life problems and their performances was comparable or superior to the state of the art.

Il Semantic Web è basato su una nuova visione del World Wide Web in cui le informazioni contenute nelle risorse pubblicate sono leggibili e gestibili dalle macchine. Tuttavia, queste informazioni sono spesso incomplete e/o incerte, specialmente quando vengono raccolte da diverse sorgenti. Risulta quindi necessario gestirle in maniera appropriata. In questa tesi ci siamo concentrati su questo problema, presentando un insieme completo di strumenti per gestire l'incertezza nel contesto del Semantic Web. Le logiche descrittive (LD) rappresentano la base del Semantic Web. Le basi di conoscenza espresse con le LD contengono informazioni sia asserzionali sia terminologiche riguardanti individui, classi di individui e le relazioni che intercorrono fra loro. Il primo passo è stato la definizione di una semantica probabilistica per LD, chiamata DISPONTE. Essa è ispirata alla semantica distributiva, molto diffusa nel campo della programmazione logico-probabilistica. DISPONTE permette di associare gradi di fiducia a porzioni di informazione e di calcolare la probabilità delle interrogazioni basandosi su basi di conoscenza probabilistiche. La tesi propone inoltre un insieme di algoritmi capaci di ragionare su basi di conoscenza che seguono DISPONTE: • BUNDLE, acronimo di ``Binary decision diagrams for Uncertain reasoNing on Description Logic thEories'', calcola la probabilità di interrogazioni rispetto ad una base di conoscenza DISPONTE sfruttando l'algoritmo tableau e tecniche di knowledge compilation. BUNDLE è basato sul noto ragionatore Pellet ed è interamente scritto in Java. • TRILL, acronimo di ``Tableau Reasoner for descrIption Logics in Prolog'', esegue inferenza su basi di conoscenza DISPONTE sfruttando un'implementazione dell'algoritmo tableau scritta in Prolog, utile per gestire il non-determinismo intrinseco nel processo di inferenza. • TRILLP, acronimo di ``TRILL powered by Pinpointing formulas'', differisce da TRILL nella codifica dell'insieme di spiegazioni che risulta essere, in questo secondo algoritmo, più compatta. Un secondo problema risiede nel fatto che i valori di probabilità sono difficili da definire per gli esseri umani. Normalmente però si hanno a disposizione informazioni sul dominio che possono essere sfruttate per definire questi parametri. Inoltre, le informazioni terminologiche contenute nelle basi di conoscenza sono spesso incomplete e scarsamente strutturate. In questa tesi vengono presentati due sistemi di apprendimento che risolvono i problemi sopra citati: • EDGE, acronimo di ``Em over bDds for description loGics paramEter learning'', apprende i parametri di una base di conoscenza DISPONTE. • LEAP, acronimo di ``LEArning Probabilistic description logics'', apprende assiomi terminologici insieme ai parametri associati usando EDGE. Va inoltre notato che negli ultimi anni la quantitàdi dati da gestire sta costantemente e rapidamente crescendo, portando alla nascita del concetto di Big Data. La quantità di dati risulta troppo grande per poter essere gestita da una singola macchina in tempi ragionevoli. Le moderne infrastrutture di calcolo come i cluster e il cloud devono essere sfruttati per poter dividere il carico di lavoro su più nodi. Abbiamo quindi esteso EDGE e LEAP per utilizzare queste infrastrutture implementando EDGEMR e LEAPMR che impiegano un approccio MapReduce per distribuire il lavoro. Tutti i sistemi sono stati testati su problemi reali e le loro prestazioni sono risultate comparabili o superiori agli approcci considerati lo stato dell'arte.