TopBraid offers a "Semantic XML" feature that transforms XML into an RDF data structure, preserving all XML tag hierarchies, tag orderings, attributes, etc. You can also convert back to XML, which supports round-trip XML access. The conversion has a 100% success rate on well-formed XML files. RDFa is also supported.
There is a PDF describing this and other import capabilities at http://www.topquadrant.com/resources/Import_and_Transformation_with_TBS.pdf. There is also a screencam video titled "Importing arbitrary XML files" at http://www.topquadrant.com/resources/videos.html.
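To give a feel for what a structure-preserving XML-to-RDF conversion involves, here is a minimal Python sketch. It is not TopBraid's implementation or vocabulary: the `EX` namespace, the property names (`tag`, `parent`, `index`, `attr/...`, `text`) and the function name `xml_tree_to_triples` are all assumptions made up for illustration. It records each element's tag, parent, position among its siblings, attributes and leading text as plain triples.

```python
import xml.etree.ElementTree as ET
from itertools import count

EX = "http://example.org/xml#"  # hypothetical vocabulary, not TopBraid's


def xml_tree_to_triples(xml_text):
    """Walk the XML tree and return (subject, predicate, object) tuples."""
    ids = count()
    triples = []

    def visit(elem, parent_id, index):
        node_id = "_:n%d" % next(ids)  # blank-node style identifier
        triples.append((node_id, EX + "tag", elem.tag))
        if parent_id is not None:
            # The parent link keeps the hierarchy; the index keeps sibling order.
            triples.append((node_id, EX + "parent", parent_id))
            triples.append((node_id, EX + "index", index))
        for name, value in elem.attrib.items():
            triples.append((node_id, EX + "attr/" + name, value))
        text = (elem.text or "").strip()
        if text:
            triples.append((node_id, EX + "text", text))
        for i, child in enumerate(elem):
            visit(child, node_id, i)

    visit(ET.fromstring(xml_text), None, 0)
    return triples


if __name__ == "__main__":
    for triple in xml_tree_to_triples('<a x="1"><b>hi</b><c/></a>'):
        print(triple)
```

Note that this sketch ignores text that follows a child element (mixed content), comments and namespaces, so it would not be enough for a true round trip on its own.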
It really depends on what semantics you want to extract! Things like the page title are easy, but section headings etc. are used in pretty random ways.
You can convert anything digital into RDF. The question isn't how to convert it into RDF, it's what do you plan to do with the RDF (or hope other people will do)?
Being able to pull out the title and the actual content HTML from a page might be useful, as might the link and meta tags, but it really depends on what you're trying to produce. Start with a use case and work backwards.
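For the "easy" part (title, meta and link tags), a rough Python sketch using only the standard library could look like the following. The names `HeadExtractor` and `page_to_triples` and the `EX` namespace are hypothetical; `DCT` is the Dublin Core Terms namespace. Which properties you actually emit should follow from your use case.

```python
from html.parser import HTMLParser

DCT = "http://purl.org/dc/terms/"   # Dublin Core Terms namespace
EX = "http://example.org/vocab/"    # hypothetical namespace for illustration


class HeadExtractor(HTMLParser):
    """Collects the <title> text plus <meta> and <link> attributes."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}     # name or property -> content
        self.links = []    # (rel, href) pairs
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]
        elif tag == "link" and attrs.get("rel") and attrs.get("href"):
            self.links.append((attrs["rel"], attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_to_triples(page_url, html):
    """Return (subject, predicate, object) tuples describing the page."""
    parser = HeadExtractor()
    parser.feed(html)
    parser.close()
    triples = []
    title = parser.title.strip()
    if title:
        triples.append((page_url, DCT + "title", title))
    for key, value in parser.meta.items():
        triples.append((page_url, EX + "meta/" + key, value))
    for rel, href in parser.links:
        triples.append((page_url, EX + "link/" + rel, href))
    return triples
```

Section headings and body content are deliberately left out here, since how to interpret them depends on the page.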
answered 27 Oct '10, 11:15
Your question is very generic. My answer assumes that you are talking about extracting RDF metadata from documents. There are two open-source frameworks I would recommend you look at:
- Aperture: a Java framework for extracting and querying full-text content and metadata from various information systems (e.g. file systems, web sites, mailboxes) and the file formats (e.g. documents, images) occurring in these systems.
- Any23: a library, a Web service and a set of command-line tools for extracting structured data in RDF format from a variety of Web documents. This project was developed by the Sindice team at DERI and has recently become an Apache Incubator project.
answered 13 May '13, 10:08
Alchemy offers a pretty good API that supports entity extraction, relation extraction, sentiment analysis etc. You can try it below:
As I understand it, Open Calais supports something similar:
answered 14 May '13, 12:00
To add to the previous answers: if the XML or HTML has a lot of unstructured text (paragraphs, etc.), then you may want to think about some type of text analytic software to do basic entity extraction. That way, the RDF version of the document at least has statements describing the metadata within the running text.
I haven't tried TopBraid's tool listed above, but you may be better served by just using XSLT, since, as I'm sure you know, the structure of an XML document does not map exactly onto its semantic representation (a prominent example of that today is any XBRL document).
answered 23 Dec '10, 21:37
You may want to try Tripliser, a library/command-line tool for XML to RDF conversion/extraction.
Unlike most other solutions, it does not use XSLT, which can become hard to read for more complex mappings.
It maps XML to RDF using XPath to extract the data.
For the source XML:
The mappings could look like this...
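Independently of Tripliser's own mapping format (which is not shown here), the general idea of driving XML-to-RDF extraction from XPath expressions can be sketched in Python. Everything below is an assumption for illustration only: the `MAPPING` layout, the `EX` namespace, the function name `xml_to_triples` and the sample element names are invented and are not Tripliser syntax. The sketch uses the limited XPath subset supported by the standard library's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

EX = "http://example.org/"  # hypothetical namespace

# Hypothetical mapping: one RDF subject per matched element,
# one property per relative XPath expression.
MAPPING = {
    "subject_path": ".//book",   # elements that become subjects
    "subject_id": "id",          # attribute used to build the subject URI
    "properties": [
        (EX + "title", "title"),
        (EX + "author", "author/name"),
    ],
}


def xml_to_triples(xml_text, mapping, base=EX + "resource/"):
    """Apply the mapping and return (subject, predicate, object) tuples."""
    root = ET.fromstring(xml_text)
    triples = []
    for node in root.findall(mapping["subject_path"]):
        ident = node.get(mapping["subject_id"])
        if ident is None:
            continue  # skip elements that cannot be given a URI
        subject = base + ident
        for predicate, path in mapping["properties"]:
            for match in node.findall(path):
                text = (match.text or "").strip()
                if text:
                    triples.append((subject, predicate, text))
    return triples
```

Keeping the mapping as data rather than as a stylesheet is the point of this approach: each line pairs one property with one path, which tends to stay readable as the number of mappings grows.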
answered 22 Jun '11, 06:28