Dissertations & Diploma Theses

Dissertations

in Progress

Werner Klieber

Automatic orchestration of knowledge discovery services

Knowledge discovery processes need to be flexibly configurable for various domain scenarios. This includes the specification of the tasks and their orchestration, integration and optimisation. Modelling such processes allows implementation-independent and intuitive usage. However, for complex tasks it becomes difficult to select the appropriate services and combine them into a correct workflow.

Current research examines approaches such as service-oriented architectures, grid (parallel) computing and semantic description languages to deal with complex tasks. The main goal of this work is to investigate how well these approaches can be adapted to the knowledge discovery domain. Furthermore, it will be examined to what extent processes based on semantic descriptions can be orchestrated automatically.

Mark Kröll

 

Discovery of Relations in Semi-Structured Datasets

Knowledge-intensive work plays an increasingly important role in organisations of all types. Knowledge workers contribute their effort to achieve a common purpose; they are part of (business) processes. Workflow Management Systems support them in their daily work by offering guidance and intelligent resource delivery. However, the emergence of richly structured, heterogeneous datasets requires a reassessment of existing mining techniques, which do not take possible relations between individual instances into account. Neglecting these relations might lead to inappropriate conclusions about the data. To maintain the quality of support for knowledge workers, it is necessary to apply mining methods that consider structural information rather than content information.

Structural information is obtained and maintained by representing user interaction patterns, e.g., relations between users, resources and tasks, as graphs. It can be used to improve predictive accuracy of learnt models: attributes of linked objects are often correlated, and links are more likely to exist between objects that have something in common. The graph structure itself may be an important feature to be included in the mining procedure.
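
As a minimal sketch of the graph representation described above, the following Python fragment builds an undirected graph from interaction records and derives one simple structural feature, the overlap between the neighbourhoods of two nodes. The record format, the node names and the choice of Jaccard overlap are assumptions made purely for illustration; the thesis does not prescribe them.

```python
from collections import defaultdict

def build_graph(interactions):
    """Build an undirected adjacency map from (user, resource, task) records."""
    graph = defaultdict(set)
    for user, resource, task in interactions:
        # Link every pair of entities that occur in the same interaction.
        for a, b in ((user, resource), (user, task), (resource, task)):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def neighbour_overlap(graph, a, b):
    """Jaccard overlap of the neighbour sets of two nodes (a structural feature)."""
    na, nb = graph.get(a, set()), graph.get(b, set())
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0

# Hypothetical interaction log, invented for this example.
interactions = [
    ("user:anna", "doc:report", "task:review"),
    ("user:ben", "doc:report", "task:review"),
    ("user:carl", "doc:invoice", "task:billing"),
]
graph = build_graph(interactions)
print(neighbour_overlap(graph, "user:anna", "user:ben"))
print(neighbour_overlap(graph, "user:anna", "user:carl"))
```

Users who work on the same resources and tasks obtain a high overlap even if the content of their documents is never inspected, which is the kind of link-based signal the paragraph above refers to.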

In the course of this work, experiments will show which selection of features succeeds in the mining challenge. Under what circumstances is there a need for a balance between content features and structural features? Are certain graph mining techniques better suited than others?

To sum up, this work aims to answer the following questions:

  • What is an adequate graph representation of the datasets at hand?
  • Which features have to be taken into account for further analysis?
  • Which graph mining techniques are appropriate to ensure continuous assistance of knowledge workers in knowledge-intensive business environments?

Finished

Michael Granitzer

KnowMiner: Conception and Development of a Generic Knowledge Discovery Framework

Steadily growing amounts of information require new techniques for making knowledge usable efficiently and in a goal-oriented way. The lack of structure in information, the incomplete assignment of metadata, and the vagueness of human language make things even more difficult. To make information usable for different user groups, various techniques from different domains must be combined: Knowledge Discovery and Information Retrieval provide approaches and techniques for the semantic enrichment of information and thus for better use of information that partly lies idle.

Against this background, the work at hand deals with the conceptualisation and development of an integrative and flexible software framework. This framework supports the development of knowledge discovery and information retrieval services. The main goal of the work is the integration of different algorithms and techniques from the above-mentioned domains, whereby the framework should be applicable in different application scenarios.

The analysis of processes, data flows and application areas of Knowledge Discovery provides the conceptual foundation for the realisation of the software framework. The conceptualisation and realisation were divided into several iteration cycles, following the spiral model of software development. Each cycle ended with applying and evaluating the framework in practical scenarios and projects. On the one hand, this allowed for the further development of the framework; on the other hand, it showed its applicability in different scenarios and technical areas.

The resulting KnowMiner framework was successfully realised and applied in five large and a number of smaller projects. Experience shows that the framework can easily be integrated into various application areas and that techniques from Knowledge Discovery can easily be applied in practical scenarios.

Mathias Lux

The Role of Metadata in Knowledge Discovery

Metadata is a broad term ranging from simple attribute structures describing data to huge and complex ontologies trying to formalize the knowledge about a resource. In knowledge discovery the extracted knowledge has to be codified and used for retrieval and inference. One major aspect is the comparison of knowledge and the retrieval of similar codified knowledge.

Different ways of representing knowledge based on graph structures exist: for instance RDF, which provides a foundation for the Semantic Web; the MPEG-7 semantic descriptor scheme, which allows the semantic description of multimedia resources; and conceptual graphs, which are a general model for formalizing knowledge. The PhD thesis concentrates on the retrieval and comparison of graph structures used to store semantic metadata, and on their application.

Master's Theses

in Progress

Georg Öttl

Recent information extraction (IE) systems extract named entities using rules, machine learning approaches, or a mixture of both. The main drawback of a rule-based approach is that it requires the manual adaptation of rules to a particular dataset. A machine learning algorithm, on the other hand, typically needs to be trained on a dataset.

This study introduces mechanisms to support and improve the rule adaptation process by learning rules. An important detail of this rule learning process is the semi-automatic extension of the training dataset used. If the quality of the learned rules is good enough in terms of precision and recall, the created set of rules can be reused to create multiple instances of an ontology.

The hybrid approach is evaluated by comparison with state-of-the-art machine learning algorithms and purely rule-based information extraction systems.

Finished

Michael Granitzer

 

Classification of Hierarchical Document Spaces Using Machine Learning Technologies

Due to the permanently growing amount of textual data, automatic methods for organizing the data are required. Automatic text classification is one of these methods. Based on the textual content of a document, it automatically assigns the document to one or more of a predefined set of classes.
Normally, the set of classes is hierarchically structured, but most of today's classification approaches ignore hierarchical structures, thereby losing information. This thesis exploits the hierarchical organization of classes to improve accuracy and reduce computational complexity. Classification methods from machine learning, namely BoosTexter and the newly introduced CentroidBoosting algorithm, are used for learning hierarchies. Two problems arising in this setting are considered in this thesis: error propagation from higher-level nodes, and the comparison of decisions between independently trained leaf nodes.
Experiments are performed on the Reuters-21578, the Reuters Corpus Volume 1 and the Ohsumed data sets (the version used is available for download), which are well known in the literature. Rocchio and Support Vector Machines, which are state-of-the-art algorithms in the field of text classification, serve as baseline classifiers. Algorithms are compared by applying statistical significance tests. Results show that, depending on the structure of a hierarchy, accuracy improves and computational complexity decreases when hierarchical classification is used. In addition, the introduced model for comparing leaf nodes yields an increase in performance.
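
To illustrate the general idea of top-down hierarchical classification, the following Python sketch routes a document from the root of a class hierarchy to a leaf, choosing at each level the child whose centroid is most similar to the document. It is not BoosTexter or CentroidBoosting; it uses a plain centroid (Rocchio-style) classifier per node, and the hierarchy, training texts and function names are invented for the example.

```python
import math
from collections import Counter

# Hypothetical two-level class hierarchy and toy training texts.
HIERARCHY = {
    "root": ["sports", "finance"],
    "sports": ["football", "tennis"],
    "finance": ["stocks", "banking"],
}
TRAINING = {
    "football": ["goal striker penalty match", "football league goal"],
    "tennis": ["serve racket set match", "tennis grand slam serve"],
    "stocks": ["shares market index trading", "stock shares dividend"],
    "banking": ["loan interest bank account", "bank credit loan"],
}

def vectorise(text):
    """Bag-of-words term frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def texts_below(node):
    """All training texts of the leaves underneath a node."""
    if node in TRAINING:
        return TRAINING[node]
    return [t for child in HIERARCHY[node] for t in texts_below(child)]

def centroid(texts):
    """Summed term vector; cosine similarity ignores the scale, so no averaging is needed."""
    total = Counter()
    for text in texts:
        total.update(vectorise(text))
    return total

CENTROIDS = {
    child: centroid(texts_below(child))
    for children in HIERARCHY.values()
    for child in children
}

def classify(text):
    """Descend from the root, picking the most similar child at every level."""
    vec = vectorise(text)
    node, path = "root", []
    while node in HIERARCHY:
        node = max(HIERARCHY[node], key=lambda c: cosine(vec, CENTROIDS[c]))
        path.append(node)
    return path

print(classify("the striker scored a penalty goal"))
print(classify("bank loan with low interest"))
```

At each level only the children of the chosen node are compared, which is where the reduction in computational cost comes from. The sketch also makes the error propagation problem visible: a wrong decision at the top level can never be corrected further down.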

     

Philip Hofmair

 

Asset- and Rightsmanagement in the context of digital libraries

One reaction to the ever-increasing flood of information, especially in the digital sector, is the growing desire to find better methods to organize and control it, be it documenting and coping with the general information appearing daily on billions of internet sites or dealing with the highly specialized information found within schools and universities. Digital libraries provide a very good method of collecting and logging information in a controlled manner. However, if the volume of data in such a library exceeds a certain limit and the library also contains highly confidential information, the use of systems requiring access authorization becomes a must.
This thesis covers the possibilities currently available for constructing a DRM system for use in the field of digital libraries. The heterogeneity of existing DRM solutions has resulted in individual standards that are not compatible with each other. Therefore a DRM system is presented which, on the basis of an ontology, is to a large extent able to bridge these incompatibilities. Finally, using a prototype implementation of a digital Handapparat, the practical possibilities of a DRM system in connection with information retrieval are demonstrated.

     

Mathias Lux

 

Magick - A Tool for Cross-Media Clustering and Visualization

The rising tide of digital information in the 21st century provides ample motivation for research and development in the area of information retrieval. The convergence of different media such as TV, radio, the Internet, newspapers and telephone leads to a heterogeneous information landscape in which uniform navigation and search are hard to apply. Information retrieval solves common problems in handling textual and image data, while metadata allows data to be enriched with semantic, computable content descriptions, evaluations and classifications independent of the actual medium. The application Magick combines these well-known and tested techniques to enable cross-media retrieval and to present the information landscape to the user in a more homogeneous way.

     

Vedran Sabol

 

Visualisation Islands: Interactive Visualisation and Clustering of Search Result Sets

The amount of knowledge available electronically is increasing exponentially. Huge amounts of information are available over the Internet and searching for a specific topic often results in a large number of matches. A significant portion of hits is often not at all of interest and the retrieved information contains no explicit relations between different hits, making it hard to obtain an overview and find relevant information.
Visualisation is a powerful technique for distinguishing relevant from non-relevant information and for locating information of interest easily and efficiently. This thesis describes Visualisation Islands, a system for topically organising documents returned in response to a search query according to their similarity. Search results are visualised in the form of an explorable, intuitive, topically organised topographic map, where relationships between documents are encoded by proximity. Topically similar documents are grouped together, forming densely populated areas that are visualised as mountains and labeled with the corresponding documents' keywords. These areas are separated by lower areas or water containing less similar objects.
The topical map visualisation is constructed in three steps: clustering algorithms are applied to the documents in vectorised form to create groups of similar documents; the documents are then positioned in the 2-D viewport space according to the similarity of their vectors by a force-directed placement algorithm; finally, a topographic background image is generated from the computed 2-D document coordinates. For platform independence, Visualisation Islands is implemented in Java.
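
The placement step can be sketched roughly as follows. This Python fragment is a generic spring-based force-directed layout, not the algorithm or the Java code of Visualisation Islands: each pair of documents is connected by a spring whose rest length is one minus the cosine similarity of their vectors, and positions are adjusted iteratively. The document vectors, the parameter values and the circular starting layout are assumptions chosen to keep the example small and deterministic.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def force_directed_layout(vectors, iterations=500, step=0.05):
    """Place documents in 2-D so that similar documents end up close together."""
    n = len(vectors)
    # Desired distance between two documents: small when they are similar.
    target = [[1.0 - cosine(vectors[i], vectors[j]) for j in range(n)]
              for i in range(n)]
    # Deterministic start: documents evenly spaced on the unit circle.
    pos = [[math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
           for i in range(n)]
    for _ in range(iterations):
        moves = []
        for i in range(n):
            dx = dy = 0.0
            for j in range(n):
                if i == j:
                    continue
                vx = pos[j][0] - pos[i][0]
                vy = pos[j][1] - pos[i][1]
                dist = max(math.hypot(vx, vy), 1e-9)
                # Positive: pull i towards j; negative: push i away from j.
                pull = (dist - target[i][j]) / dist
                dx += pull * vx
                dy += pull * vy
            moves.append((step * dx, step * dy))
        # Apply all displacements at once after the forces are computed.
        for p, (mx, my) in zip(pos, moves):
            p[0] += mx
            p[1] += my
    return pos

# Hypothetical 2-term document vectors: documents 0 and 1 share a topic,
# as do documents 2 and 3.
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
layout = force_directed_layout(vectors)
for index, (x, y) in enumerate(layout):
    print(index, round(x, 3), round(y, 3))
```

In the resulting coordinates, the two documents of each topic lie closer to each other than to the documents of the other topic, which is the property the topographic map relies on when it renders dense areas as mountains.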

     

Werner Klieber

 

Using MPEG-7 for Multimedia retrieval

The aim of knowledge retrieval is to find knowledge efficiently in complex knowledge spaces. In this thesis a Multimedia Query Framework is realized that supports multimedia queries composed of different media types linked together into a single query request. The user does not have to enter low-level data; instead, the framework supports the direct use of multimedia data for query specification. The metadata standard MPEG-7 is used to ensure a uniform representation of the information and to make feature-based and semantic information explicitly available to the system. The Multimedia Query Framework is integrated into an existing distributed Web environment based on XML.
XML and Web technologies are examined with respect to the requirements of a Multimedia Query Framework in a distributed environment. MPEG-7 and its meaningful integration into the framework are investigated. A user interface whose components are “speaking MPEG-7” is designed. The user interface, the search logic and the integration into an existing digital library are implemented. The result is a user interface that supports the query specification of five prototypically implemented multimedia search types. The interface is integrated into an existing digital library and can be used for multimedia metadata queries. The search results and their metadata can be reused for further query specification.

     

Thomas Neidhart

 

Semiautomatic creation of knowledge maps using knowledge mining techniques

The abundance of existing information creates a need to structure this flood of data suitably, in order to make it easier, or possible at all, for a user or system to extract knowledge.
The first step in structuring given data sets consists of finding and defining suitable concepts that are used to group similar documents together. The individual concepts and the relations among them form the structure (ontology, taxonomy) into which all documents of the data set are arranged. This procedure is not only very time-consuming, but usually also leads to problems with the automatic assignment of documents to concepts.
The goal of this work is to visualize existing, unstructured data sets as knowledge maps with the help of machine learning algorithms (clustering) in a semi-automatic process. The process involves manipulating the data representation to retrieve relevant concepts for further tasks, e.g. text classification.

     

Andreas Juffinger

 

Focused Crawling in the Context of Digital Libraries

Focused crawling is gathering increasing momentum not only in combination with search engines but also in the context of digital libraries. Crawling is useful for discovering new documents on certain topics. Focused crawling can also help people find data on the World Wide Web by suggesting sites and pages of interest. The overall crawling process is split into a crawling part and a web mining part.
In the crawling part, different crawling algorithms are evaluated and a new reinforcement learning algorithm for focused crawling is proposed. Furthermore, the impact of whitelisting and blacklisting is considered.
In the web mining part, a heuristic control mechanism for optimal bandwidth utilisation is shown. In addition, a possible way to deal with the huge amount of data is presented.
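
As a rough illustration of the crawling part, the following Python sketch implements a simple best-first focused crawler with host-based whitelisting and blacklisting. It is a common baseline strategy, not the reinforcement learning algorithm proposed in the thesis. The in-memory "web", the topic vocabulary, the relevance measure and all names are hypothetical; a real crawler would fetch pages over HTTP.

```python
import heapq

# Hypothetical in-memory "web": url -> (page text, outgoing links).
WEB = {
    "lib.example/start": ("digital library portal",
                          ["lib.example/catalog", "shop.example/ads",
                           "news.example/sport"]),
    "lib.example/catalog": ("library catalog metadata search",
                            ["lib.example/archive"]),
    "lib.example/archive": ("digital archive of library documents", []),
    "shop.example/ads": ("cheap offers buy now", ["shop.example/more"]),
    "shop.example/more": ("library of discount offers", []),
    "news.example/sport": ("football results", []),
}
TOPIC = {"digital", "library", "archive", "catalog", "metadata"}

def relevance(text):
    """Fraction of words on a page that belong to the topic vocabulary."""
    words = text.split()
    return sum(word in TOPIC for word in words) / len(words)

def host(url):
    return url.split("/", 1)[0]

def focused_crawl(seed, max_pages, blacklist=frozenset(), whitelist=None):
    """Visit pages in order of the relevance of the page that linked to them."""
    frontier = [(-1.0, seed)]  # min-heap, so priorities are negated
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = WEB[url]
        visited.append(url)
        score = relevance(text)
        for link in links:
            if link in seen:
                continue
            if host(link) in blacklist:
                continue
            if whitelist is not None and host(link) not in whitelist:
                continue
            seen.add(link)
            # Links inherit the relevance of the page they were found on.
            heapq.heappush(frontier, (-score, link))
    return visited

print(focused_crawl("lib.example/start", 10, blacklist={"shop.example"}))
```

Links found on relevant pages are followed first, while blacklisted hosts are never queued. Passing a whitelist instead restricts the crawl to the listed hosts.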

     

Andreas Augustin

 

Acquisition of semantic information from encyclopedic data

In recent years the amount of data has been growing massively and continues to grow; hence it has become necessary to develop new methods to cope with this large amount of data. Besides improvements in search capability, one of the main driving forces in current research on data mining is the need to expose and understand the underlying knowledge inside the data.

Encyclopedias are known as a reflection of a decade's knowledge. As encyclopedias have always been a great resource for people to gain common knowledge, there is a need to build such common knowledge for computer systems too. The primary objective of this thesis is to extract knowledge from the textual representation of such encyclopedias and to prepare the extracted knowledge for exploitation in different applications and domains.

To implement this process, appropriate methods of Ontology Learning are applied to the text to create taxonomies and concept hierarchies. These structures are combined in a computer-processable ontology. Furthermore, the extracted information is evaluated and refined using additional methods such as online validation and clustering.

This thesis shows that, provided suitable methods are used, the semi-automatic extraction of high-quality semantic information from an encyclopedic dataset to build a base ontology is possible.