
I'm wondering if there's a really fast and scalable command-line tool to convert Turtle (TTL) files to N-Triples. I'm aware of the Jena command-line tools, but I'm sure at least a factor of two could be gained by eliminating the Java tax. One critical feature is that it has to be streaming: it can't load everything into a model and then write it all back out.

asked Jan 24 at 11:23

database_animal


serdi (http://drobilla.net/software/serd/) is a streaming tool, so it doesn't need much memory and can work with huge input files. It converts in both directions: Turtle to N-Triples and N-Triples to Turtle.
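For anyone scripting this, here is a minimal sketch of calling serdi from Python. It assumes serdi is installed and on the PATH; the function names and file paths are made up for illustration. The `-i` and `-o` options select the input and output syntax, and serdi writes its result to standard output, which is redirected straight to a file so nothing is buffered in Python.

```python
import subprocess


def serdi_command(ttl_path):
    """Build the argument list for a Turtle to N-Triples conversion with serdi."""
    return ["serdi", "-i", "turtle", "-o", "ntriples", ttl_path]


def convert(ttl_path, nt_path):
    """Run serdi and send its standard output directly to nt_path.

    Assumes the serdi binary is on the PATH; raises CalledProcessError
    if serdi exits with a non-zero status.
    """
    with open(nt_path, "wb") as out:
        subprocess.run(serdi_command(ttl_path), stdout=out, check=True)


# Example call (hypothetical file names):
# convert("data.ttl", "data.nt")
```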

answered Feb 13 at 18:07

oesxyl

I'm not sure if you consider Java itself taxing, or just using it, but Sesame's RIO parsers and writers are streaming. Writing that conversion program would take a handful of lines of code at most; it would probably take you five minutes to write something that takes the statements reported by the Turtle parser and runs them out through the N-Triples writer. If you don't want to use Java, that won't help you, but I'd go with RIO.

answered Jan 24 at 11:28

mhgrove

Anything involving text processing in Java pays a heavy tax because inexpensive UTF-8 text needs to be puffed up to an expensive 2-byte representation and then converted back to UTF-8.

Java makes up for this by being thread-safe and parallelizable, in contrast to the GNU world on Linux, where writing thread-safe code in C is quite difficult. I use Java a lot.

If a job is parallelizable or needs complex functionality, I can accept Java, but for serial tasks at the head end with high data bulk I'd like to avoid the tax.
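To put a number on the expansion from UTF-8 to a 2-byte representation described above, here is a small sketch in Python. It is not a Java benchmark; it only compares the encoded size of one N-Triples line (the example triple is made up) in UTF-8 and in UTF-16.

```python
# One N-Triples line containing only ASCII characters (hypothetical data).
line = '<http://example.org/s> <http://example.org/p> "hello" .\n'

# Size as stored on disk in UTF-8: one byte per ASCII character.
utf8_size = len(line.encode("utf-8"))

# Size in UTF-16: two bytes per character here.
# "utf-16-le" is used so that no byte-order mark is counted.
utf16_size = len(line.encode("utf-16-le"))

print(utf8_size, utf16_size)
```

For mostly-ASCII data such as IRIs, the UTF-16 form is twice the size of the UTF-8 form, before counting the cost of the two conversions.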

(Jan 24 at 12:20) database_animal

@database_animal The UTF-8 to UTF-16 conversion might not happen if you use a JVM later than 1.6.0_21; have a look at the command-line options to the java command (e.g. -XX:+UseCompressedStrings). The thing is, the task is parallel-friendly: the reading and writing could use different threads, which is especially interesting if compressing and decompressing are involved. Otherwise you are more likely to be limited by disk I/O.
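The reader-thread and writer-thread split can be sketched roughly as follows. This is an illustration in Python, not Sesame or Jena code; `pipe_in_threads` and its parameters are hypothetical names, and the `transform` hook stands in for whatever per-statement conversion is done. The bounded queue keeps memory use flat when the reader is faster than the writer.

```python
import io
import queue
import threading

_DONE = object()  # sentinel that marks the end of the input


def pipe_in_threads(source, sink, transform=lambda line: line, max_pending=1024):
    """Read lines from source in one thread and write them to sink in another."""
    pending = queue.Queue(maxsize=max_pending)  # bounded: reader blocks when full

    def reader():
        try:
            for line in source:
                pending.put(transform(line))
        finally:
            pending.put(_DONE)  # always signal the writer, even after an error

    def writer():
        while True:
            item = pending.get()
            if item is _DONE:
                break
            sink.write(item)

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


# Usage with in-memory streams (made-up triples); real use would pass file objects.
src = io.StringIO("<a> <b> <c> .\n<d> <e> <f> .\n")
dst = io.StringIO()
pipe_in_threads(src, dst)
```

Whether this helps in practice depends on where the time goes; if the disk is the bottleneck, the second thread will mostly wait.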

(Jan 25 at 05:24) Jerven

PS: when I was doing a lot of conversions, I found Sesame with RIO to be just a bit faster than Redland's rapper, but not significantly so.

(Jan 25 at 05:37) Jerven

Try the redland utilities like rapper (http://librdf.org/raptor/). It's written in C and very fast. The Windows version is pretty old, but still does the job, and there are much more up-to-date Linux versions.

Bob

answered Jan 24 at 23:16

bobdc

How did I forget Redland! The main issue, if I remember correctly, is getting it running on non-*nix systems, and you may also struggle if you have a funky Linux environment. That's why we don't use it ourselves: our Linux environments don't have the requisite combination of packages for raptor2.

(Jan 25 at 01:50) Rob Vesse ♦

I was going to answer "rapper" myself, but thought I'd give it a try first, and lo and behold, it doesn't stream. It loads the whole graph into memory before serialising.

(Jan 25 at 05:21) tobyink

In Jena RIOT, N-Triples is faster to parse than Turtle. A lot of this comes from system-related effects, not just Java ones: despite the larger number of bytes, the benefit of sitting in a tight loop outweighs the byte count because of stream I/O. Note that the RIOT parsers need enabling: if you just use Jena core code, with no ARQ etc., they may not be wired in. There are command-line tools which stream (modulo bNode labels, which is a language design issue) from Turtle to N-Triples. Validation can be added with --validate (e.g. it checks the lexical forms of XSD literals; this, unsurprisingly, slows parsing down).

A good practice is to always parse new data to N-Triples with as much checking as possible, then store the N-Triples compressed (expect 8x to 10x compression). RIOT reads from gzip files.
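Because N-Triples is line-based, compressed files can be written and read back one statement at a time. Here is a minimal sketch in Python using only the standard library; it is not RIOT, and the function names and sample triples are made up for illustration.

```python
import gzip
import os
import tempfile


def write_ntriples_gz(path, lines):
    """Write an iterable of N-Triples lines to a gzip file, one at a time."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for line in lines:
            out.write(line)


def read_ntriples_gz(path):
    """Yield N-Triples lines from a gzip file without loading it all into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as src:
        for line in src:
            yield line


# Round trip with hypothetical data in a temporary directory.
triples = [
    '<http://example.org/a> <http://example.org/p> "one" .\n',
    '<http://example.org/b> <http://example.org/p> "two" .\n',
]
path = os.path.join(tempfile.mkdtemp(), "data.nt.gz")
write_ntriples_gz(path, triples)
roundtrip = list(read_ntriples_gz(path))
```

The actual compression ratio depends on the data; long shared IRI prefixes are what make N-Triples compress well.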

answered Jan 25 at 06:22

AndyS

There is rdf2rdf, which wraps the Sesame RIO parsers and writers, but I know from experience within our company that they still use a lot of memory over time for very large data. This may be an artifact of the data (lots of BNodes), but in our experience RIO only scales so far.

We've had decent success with RIOT (from the Jena ARQ package), and that's probably our go-to tool for conversion.

My own toolkit has a command-line conversion tool, though it's nowhere near as performant as Sesame RIO or Jena RIOT, in the released version at least. It certainly struggles on large data (millions of triples), and the current version has a bug where BNode IDs may not be correctly written in N-Triples when converting from some formats.

answered Jan 24 at 19:20

Rob Vesse ♦

Asked: Jan 24 at 11:23

Seen: 476 times

Last updated: Feb 13 at 18:07