What I Think About...: 2007

Tuesday 3 July 2007

Using SemWeb with MySQL

Getting SemWeb to work with MySQL is pretty much the same as for SQLite (see the last post). Again, using the news.rdf file from Mozilla, the command line to enter data into the MySQL database now becomes:

rdfstorage news.rdf --out "mysql:rdf:database=news;username=test"

This assumes that a new database (schema) called "news" has been created in MySQL and that a user "test" exists that doesn't require a password (both these things can be done using the MySQL Administrator UI).

Initially running this command line in the installed version of the SemWeb\bin directory results in the following error being generated:

System.IO.FileNotFoundException: Could not load file or assembly 'MySql.Data, Version=1.0.8.0, Culture=neutral, PublicKeyToken=c5687fc88969c44d' or one of its dependencies. The system cannot find the file specified.

This error means that the MySQL.Data.dll is missing, so the first step is to get this from MySQL ADO.Net Data Provider 5.0.7

After placing this DLL in the bin directory and running the command line again, the following error will appear:

System.IO.FileLoadException: Could not load file or assembly 'MySql.Data, Version=1.0.8.0, Culture=neutral, PublicKeyToken=c5687fc88969c44d' or one of its dependencies. The located assembly's manifest definition does not match the assembly reference. (Exception from HRESULT: 0x80131040)

In this case the important part is the "manifest definition does not match the assembly reference" bit - as happened with the SQLite DLL, this MySQL DLL is newer than the version that was used to build the SemWeb.MySQLStore.dll and so the 2 DLLs are incompatible. We need to rebuild this SemWeb DLL.

The steps to do this are as follows:

Open the SemWeb solution in Visual Studio
Add a new project of type "Class Library" and called "SemWeb.MySQLStore"
Delete the default "Class1.cs" that gets added to the project
Choose to add an existing item to the project and pick "MySQLStore.cs" from the SemWeb\src directory
Add a reference to the "MySQL.Data.dll " that should already be in the SemWeb\bin directory
Add a reference to the SemWeb project
In the project properties, set the output path to the SemWeb\bin directory

These are the exact same steps that were required for the SQLiteStore DLL (although obviously referencing different files). However, building the MySQLStore DLL now results in the following compiler error:

The type or namespace name 'MySqlConnection' could not be found (are you missing a using directive or an assembly reference?)

To fix this error the following needs to be added to the top of the MySQLStore.cs file:

using MySql.Data.MySqlClient;

So the complete set of using directives at the top of the file should now appear as:

using System;
using System.Collections;
using MySql.Data.MySqlClient;
using System.Data;

The project should now rebuild successfully and the new SemWeb.MySQLStore.dll should now appear in the SemWeb\bin directory.

Unfortunately this still isn't the end of the process - running the command line again results in the following error:

MySql.Data.MySqlClient.MySqlException: Table 'news.rdf_statements' doesn't exist

The DLL is now ok, its just that the database that its creating has a few problems; inspecting the database in MySQL Administrator shows that the "rdf_literals" and "rdf_entities" tables have been created but the "rdf_statements" table hasn't been made.

Setting the SEMWEB_DEBUG_SQL environment variable (type "set SEMWEB_DEBUG_SQL=1" at the command prompt) and then running the command line again generates some debug information (alternatively put a break point on the "
CreateTable" function in the SQLStore.cs file and run the program in the debugger, after first setting up the command line arguments in the project properties). The error generated is:

System.FormatException: Input string was not in a correct format.

This is caused by the SemWeb database "Open" function (from MySQLStore.cs) querying the version of the database and then trying to set this into the .NET Version class; however, the version string returned from my version of MySQL is "5.0.41-community-nt" and this doesn't match with the expected .NET version string format, which is expecting only integers seperated by a period character ('.') - hence an error is thrown on the first call to "Open" and so the creation of the first database table ("rdf_statements") fails.

To fix this I've just put a try-catch around the setting of the version string, so basically I just ignore when the version can't be set.



private void Open()
{
  if (connection != null)
    return;
  MySqlConnection c = new MySqlConnection(connectionString);
  c.Open();
  connection = c; // only set field if open was successful

  using (IDataReader reader = RunReader(@"show variables like 'version'"))
  {
    reader.Read();
    try
    {
      version = new Version(reader.GetString(1));
    }
    catch(Exception e)
    {
      if (Debug) Console.Error.WriteLine(e.Message);
    }
  }
}

Finally, rebuilding the SemWeb DLL and re-running the command line successfully adds the RDF to the database:

news.rdf 0m0s, 81 statements, 1728 st/sec
Total Time: 0m2s, 81 statements, 30 st/sec

Monday 2 July 2007

Using SemWeb with SQLite

As mentioned in the previous post I've had a few problems getting SemWeb, the C# RDF library, to persist data to a database. Here I describe the steps I've had to follow to get things working.

I've been using Visual Studio on Windows - I suspect that these problems may not exist when using Mono on Linux.

I'm using the following versions:

SQLite

Following the approach described in the SemWeb documentation, I first tried to set up a database connection to a SQLite database.

Following the SemWeb documentation I performed the following steps:

Placed the "sqlite3.dll" (from the SQLite download) into my windows System32 directory
Added the "Mono.Data.SqliteClient.dll" from the the Mono install directory "Mono-1.2.4\lib\mono\2.0" into my SemWeb\bin directory
Downloaded the sample RDF file from http://www.mozilla.org/news.rdf
Entered the specified command line at a DOS prompt in the SemWeb\bin directory:
rdfstorage.exe news.rdf --out "sqlite:rdf:Uri=file:news.sqlite;version=3"
Pressed return and stood well back.

I was greeted with the following:

System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.IO.FileLoadException: The located assembly's manifest definition with the name 'Mono.Data.SqliteClient' does not match the assembly reference.

Basically my Mono SQLite adapter was newer than the version that the SemWeb.SqliteStore.dll had been built with.

Unfortunately the SemWeb Visual Studio solution doesn't come with a SemWeb.SqliteStore project, so the following steps are required:

Open the SemWeb solution in Visual Studio
Add a new project of type "Class Library" and called "SemWeb.SqliteStore"
Delete the default "Class1.cs" that gets added to the project
Choose to add an existing item to the project and pick "SQLiteStore.cs" from the SemWeb\src directory
Add a reference to the "Mono.Data.SqliteClient.dll" that should already be in the SemWeb\bin directory
Add a reference to the SemWeb project
In the project properties, set the output path to the SemWeb\bin directory

I also rebuilt the SemWeb and RDFStorage projects (after setting their output directories to the bin directory). Once built I re-ran the command line and this time obtained the expected result:

news.rdf 0m0s, 81 statements, 1036 st/sec
Total Time: 0m1s, 81 statements, 42 st/sec

Next time I'll describe the steps I've had to follow to get SemWeb up and running with MySQL.

Topic Maps, OWL and RDF

It's been quite a while since my last post. This has mostly been due to spending a large amount of time playing with some Topic Map (the Ontopia Omnigator) and OWL (Protege) software. Both of these are excellent applications and I was especially impressed by the ability of OWL to form advanced object relationships.

However, at the end of all this research, I think that neither Topic Maps nor OWL are suitable for my purposes; yes, both can perform complex classifications and describe advanced relationships between classes but, for my purposes of creating a heterogeneous database, I only require simple classifications and relationships. Therefore using either of these technologies would probably be like taking the proverbial hammer to a nut. For me, a stronger argument against using these technologies is that, at this moment in time, neither is freely available for C# - the language that I'd like to work with.

So my requirements are a technology that can describe simple classifications and relationships and that is freely available in C#. The one application that seems to match these requirements is Joshua Tauberer's SemWeb - an RDF library for C#. The RDF and RDFS support provided by this package can perform categorisation and describe the relationships between objects to a level suitable for my requirements. An added bonus is that this library is still under active development - version 1.0 was only released on the 10th June 2007 (unlike the other C# RDF library Drive - the web page of which had actually expired the last time I looked).

This doesn't mean to say that SemWeb is perfect - like most things I've encountered on this journey, using SemWeb is not as straightforward as it could be. The main reason for this is that SemWeb has been developed on Mono, as opposed to Microsoft's Visual Studio that I want to use. As a result of this the supplied Visual Studio solutions are incomplete (I needed to refer to the makefiles to create the required projects) and the package currently doesn't support SQL Server, the Express version of which I'd been using for some other areas of this project (such as user administration). I've therefore now switched databases to use MySQL and am now struggling to add and retrieve some RDF data from this. I'll give more details of this struggle in the next post.

Wednesday 16 May 2007

RDF - An alternative to Topic Maps?

In my last post I described how Topic Maps appeared to be a possible mechanism for creating a flexible database, capable of storing a range of different, and possibly unrelated, items - a heterogeneous database. However, as I also mentioned previously, a couple of things worried me about Topic Maps; namely that work in the area seems to have gone a bit quiet in the last couple of years and, secondly, that there appears to be quite a large European bias to the companies and institutions working in the field - not that this is a bad thing, but what are everyone else up to?

These two things were playing on my mind, so rather than rushing headlong into Topic Maps, I decided to take a step back and spend a bit more time investigating other possible solutions. This search revealed that there is another possible solution to the problem - RDF - the Resource Description Framework (I'd already suggested in my last post that this could be a possibility). Indeed RDF and Topic Maps appear to cover very similar ground and, from an initial study of the available information, it's hard to decide between the two. It may be that subtle differences exist between the two technologies but my initial impression is that they are pretty much going head-to-head against each other – another VHS versus Betamax, but with probably slightly less interest from the general public.

So what are the actual differences between RDF and Topic Maps and which technology is the most suitable for my purposes? As with Topic Maps, the documentation for RDF is rather sparse and also seems to mostly be a couple of years old - the only book I could find that described both subjects was The Explorer's Guide to the Semantic Web, published in 2004. While this provided a good overview of both technologies, it didn't really cover them at a low enough level for my needs and, by the end, I was still left uncertain of the differences in the two. In addition, RDF has other related technologies - namely RDFS (RDF Schema) and OWL (Web Ontology Language - which surely should be called WOL, although this would have reduced the oportunities to publish books with bird covers) and I felt that these weren't covered in enough detail. The only thing for it was to dig a bit deeper into each subject.

RDF, RDFS and OWL
The definitive (and about the only) text on RDF is Shelley Power's Practical RDF. This book did give low level descriptions of RDF, RDFS and OWL and I'd definitely recommend it for anyone interested in RDF. My only criticism would be that the chapters describing RDF tools are now a bit out of date (the book was published in 2003) and I found quite a few typos and mistakes in the RDF examples - a new edition of the book is surely required. However, by the end, I felt that I'd gained enough knowledge to pursue RDF a bit further.

I think I now understand how RDF, RDFS and OWL fit together. My interpretation is that RDF is a mechanism that can describe simple relationships (subject-predicate-object expressions, known as triples e.g. apples have the colour green). RDF can be serialized using RDF\XML (i.e. an XML based language can be used to describe the RDF relationships). RDF Schema can then be used, in much the same way as XML Schema (XSD) are used with XML, to describe the contents and structure of the RDF, allowing relationships to be created. OWL is effectively built on top of RDFS, extending it to allow the generation of complex class and sub-class relationships.

I'm still not sure about the differences in Topic Maps and RDF (the RDF book didn't cover Topic Maps), so I think the only thing for it is to play with the two technologies. I'll let you know how it goes in the next post.

Sunday 15 April 2007

Information Architecture and Topic Maps

Umm... things aren't as straightforward as I'd initially imagined. On investigating what's required for heterogeneous databases I've discovered that there's a whole world of stuff out there that I know little or nothing about: facets, synonym rings, and controlled vocabularies to name but a few. In fact this world has a name: Information Architecture and even its own bible, the legendary Polar Bear Book. Hidden amongst all this lies the mysterious field of Topic Maps. From my initial skimming of the available resources and documentation on the subject, it sounds as if these may be exactly what I’m after.

I describe Topic Maps as being mysterious for two reasons: Firstly, after an initial flurry of activity in the area at the turn of the century, it appears as if things have now gone a bit quiet. The Topic Maps standard, ISO/IEC 13250 and the related XML Topic Maps (XTM), were last published in 2002 and 2001 respectively. The only book I could find on the subject, XML Topic Maps, was also published in 2002. Additionally, many of the main published links on the subject no longer exist. The second reason that Topic Maps are a bit mysterious is the fact that the vast majority of the research, and companies working in the area, seems to be concentrated in Europe – America seems to be ignoring their existence; either Europe is at the cutting edge of this technology, or something equivalent must exist that’s being used in America and I haven’t discovered it yet! (RDF and OWL perhaps?).

Topic Maps – The Basics

In Topic Maps nearly everything is represented as a standalone entity - a topic, and relationships can be created between these topics. For example, if we wished to describe a book, we could create separate topics for the name of the author, the title of the book, the publisher of the book, etc. and other topics to describe the relationships between these topics – for example, that this book was written by the author. This makes it possible to create relationships for practically anything and therefore it’s also possible to create a relational database that can be used to represent anything, which is exactly what I’m after. In addition, new topics and relationships can be added as more information is known about a subject.

In the next post I’ll hopefully describe how a topic map database can be created and how a range of different subjects can be stored. For anyone who can’t wait till then, some useful Topic Map information can be found at the following links:

"Metadata? Thesauri? Taxonomies? Topic Maps!" http://www.ontopia.net/topicmaps/materials/tm-vs-thesauri.html

Topic Maps and Relational Databases
http://www.xml.com/pub/a/2003/03/05/tmrdb.html

Monday 2 April 2007

In search of database flexibility

Initially, to refresh my knowledge of ASP.NET and to get up to speed with SQL and relational databases (RDBs), I built a simple web site capable of storing information book and CD information. (Mainly this was based on the examples given in this book: Beginning ASP.NET 2.0).

However, even at this stage, I was beginning to realise that it was going to be difficult to expand the RDB tables to provide a generic means of storage. To solve the problem I tried applying some object orientated design, creating a base table, to contain all common information and from which all others would be derived. However, it doesn't appear that relational databases lend themselves particularly well to object orientation, and I was still left with the situation were a new table would need to be created for each new type of item to be stored.

To categorize the items added to the database, and as a means of navigating, I used a three level hierarchy of Department, Section and Category. So, for example, "Eyes Open" by Snow Patrol was stored into "Entertainment - Music - Indie". This seemed to provide a fairly sensible breakdown for music and books, but seemed that it might prove a bit restrictive for other sorts of item. It forced items to always be placed into a category (i.e. items couldn't be stored into departments or sections) and categories couldn't be broken down any further to give more specific classifications.

Indeed, when I came to try to store more esoteric information, I soon discovered that the rigid nature of my database made it very difficult to add anything that didn't conform to a 3-layer classification scheme. For example, I wanted to add information about buildings but, knowing nothing about architecture, I had no idea what parent category ("Department" in my classification scheme) architecture would belong to, nor what sub-categories it would contain.

Now, whilst I could have gone off and studied more about architecture to know how it should be classified, this wasn't really what I wanted. Instead I wanted a database that would give me the flexibility to add items at any level, plus the ability to reclassify items and insert hierarchies as my knowledge of a subject increased. Clearly my intial database wasn't up to the job. As a result, I've now set off in search of a database structure that can give me the flexibility I require.

Saturday 31 March 2007

Developing a heterogeneous database

This, hopefully, is the start of a blog that will describe the development of a database for anything and everything - a heterogeneous database.

What exactly is it that I'm trying to create?

Basically, as mentioned above, a database that can store and index absolutely anything - books, CDs, buildings, restaurants, baked beans, etc. I want to create something into which I can store information about anything, and be able to retrieve, relate and search those items as easily as possible.

Why's it hard to create this database?

Since the information I want to store can potentially be anything, the indexes for each item can also be anything. So, for example, the main indexes for a book could be author, title and publisher, whereas a building could be name, location and architect; with standard database tables it wouldn't be easy to create a structure capable of storing this information, without creating a table for each type of item to be stored.

Extra problems also arise when relationships between items need to be created. For example, if a book contained information about a building, how would this association be expressed or, if categories and sub-categories existed for a subject (e.g. fiction and non-fiction for books), how could the structure of the stored data be made to represent this.

Why the blog?

There are 3 main reasons why I've chosen to create a blog on this:

1. My knowledge of some of the required subject areas is a bit limited, so hopefully there are people out there who will take the time to step in when I'm heading off in the wrong direction.

2. From my initial searches I haven't been able to find an easy way to create what I'm after - hopefully anything posted here will be of use to anyone following along the same path.

3. Finally, just documenting the steps that I take should hopefully give me a clearer view of what's been done and where I'm going. If you feel like adding anything to this journey, please do.

What I Think About...