
Inference and Learning Systems for Uncertain Relational Data

COTA, Giuseppe
2018

Abstract

Representing uncertain information and being able to reason on it is of foremost importance for real-world applications. The research field of Statistical Relational Learning (SRL) tackles these challenges. SRL combines principles and ideas from three important subfields of Artificial Intelligence: machine learning, knowledge representation and reasoning under uncertainty. The distribution semantics provides a powerful mechanism for combining logic and probability theory. It has so far been applied to extend Logic Programming (LP) languages such as Prolog and represents one of the most successful approaches to Probabilistic Logic Programming (PLP), with several PLP languages adopting it, such as PRISM, ProbLog and LPADs. However, with the birth of the Semantic Web, which uses Description Logics (DLs) to represent knowledge, it has become increasingly important to have Probabilistic Description Logics (PDLs). The DISPONTE semantics was developed for this purpose and applies the distribution semantics to Description Logics.

The main objective of this dissertation is to propose approaches for reasoning and learning on uncertain relational data. The first part concerns reasoning over uncertain data. In particular, with regard to reasoning in PLP, we present the latest advances in the cplint system, which supports hybrid programs, i.e. programs where some of the random variables are continuous, as well as causal inference. Moreover, cplint has a web interface, named cplint on SWISH, which allows the user to easily experiment with the system. To perform inference on PDLs that follow DISPONTE, a suite of algorithms was developed: BUNDLE (“Binary decision diagrams for Uncertain reasoNing on Description Logic thEories”), TRILL (“Tableau Reasoner for descrIption Logics in Prolog”) and TRILL^P (“TRILL powered by Pinpointing formulas”).

The second part, which focuses on learning, considers two problems: parameter learning and structure learning. We describe the systems EDGE (“Em over bDds for description loGics paramEter learning”) for parameter learning and LEAP (“LEArning Probabilistic description logics”) for structure learning of PDLs. The execution of these algorithms, and of those for PLP such as EMBLEM for parameter learning and SLIPCOVER for structure learning, is rather expensive from a computational point of view, taking a few hours on datasets on the order of megabytes. In order to efficiently manage larger datasets in the era of Big Data and Linked Open Data, it is extremely important to develop fast learning algorithms. One solution is to distribute the algorithms using modern computing infrastructures such as clusters and clouds. We thus extended EMBLEM, SLIPCOVER, EDGE and LEAP to exploit these facilities by developing their MapReduce versions: EMBLEM^MR, SEMPRE, EDGE^MR and LEAP^MR. We tested the proposed approaches on real-world datasets and their performance was comparable or superior to that of state-of-the-art systems.
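As a rough illustration of the distribution semantics on which cplint is based, the sketch below shows a tiny LPAD run with the pita inference module of cplint; the program and query are hypothetical examples chosen for this summary, not taken from the thesis.

    % Minimal LPAD/cplint sketch (assumes the cplint pack for SWI-Prolog).
    :- use_module(library(pita)).
    :- pita.
    :- begin_lpad.

    % Disjunctive probabilistic clause: a tossed coin lands heads or tails,
    % each with probability 0.5.
    heads(C):0.5 ; tails(C):0.5 :- toss(C).

    toss(coin1).

    :- end_lpad.

    % Query the probability of a ground atom:
    % ?- prob(heads(coin1), P).
    % P = 0.5.

Under the distribution semantics, the probability of a query is the sum of the probabilities of the possible worlds (here, the choice of heads or tails for coin1) in which the query is true.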
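For the hybrid programs mentioned above, where some random variables are continuous, cplint provides approximate inference by sampling through the MCINTYRE module; the following sketch, again a hypothetical example rather than code from the thesis, declares a Gaussian-distributed variable and estimates its expected value.

    % Minimal hybrid-program sketch with cplint/MCINTYRE (illustrative only).
    :- use_module(library(mcintyre)).
    :- mc.
    :- begin_lpad.

    % Continuous random variable: X follows a Gaussian with mean 0 and variance 1.
    value(X): gaussian(X, 0, 1).

    :- end_lpad.

    % Estimate E[X] by drawing 1000 samples:
    % ?- mc_expectation(value(X), 1000, X, E).

Both sketches can be tried in the cplint on SWISH web interface mentioned in the abstract.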
LAMMA, Evelina
RIGUZZI, Fabrizio
TRILLO, Stefano
Files in this item:

File: Tesi.pdf
Description: Inference and Learning Systems for Uncertain Relational Data
Type: Doctoral thesis
Access: Open Access since 20/02/2019
Size: 10.02 MB
Format: Adobe PDF

Documents in SFERA are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11392/2478778