|
I'm wondering if there's a really fast and scalable command-line tool to convert Turtle (TTL) files to N-Triples. I'm aware of the Jena command-line tools, but I'm sure at least a factor of two could be gained by eliminating the Java tax. One critical requirement is that it has to be streaming: it can't load everything into a model and then write it out. |
|
serdi (http://drobilla.net/software/serd/) is a streaming tool, so it doesn't need much memory and can work with huge input files. It converts in both directions: Turtle to N-Triples and N-Triples to Turtle. |
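As a rough sketch of how this conversion could be scripted, the Python wrapper below shells out to serdi and hands its standard output straight to a file, so the data never passes through Python. It assumes a `serdi` binary is on the PATH and that its `-i` and `-o` options select the input and output syntax; the function names and file names are made up for illustration.

```python
import subprocess


def build_serdi_command(source_path):
    """Build the argument list for a Turtle -> N-Triples conversion.

    Assumes serdi's -i/-o options select the input and output syntax.
    """
    return ["serdi", "-i", "turtle", "-o", "ntriples", source_path]


def convert_turtle_to_ntriples(source_path, target_path):
    """Run serdi and let it write directly into target_path."""
    with open(target_path, "wb") as target:
        # stdout is given to the child process, so the converted data
        # is never buffered in Python's memory.
        subprocess.run(build_serdi_command(source_path), stdout=target, check=True)
```

Calling `convert_turtle_to_ntriples("data.ttl", "data.nt")` would then raise `CalledProcessError` if serdi exits with an error, which is usually what you want in a batch job.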
|
I'm not sure if you consider Java itself taxing, or just using it, but Sesame's RIO parsers and writers are streaming. Writing that conversion program would take a handful of lines of code at most; it would probably take you five minutes to write something that takes the statements reported by the Turtle parser and runs them out through the N-Triples writer. If you don't want to use Java, that won't help you, but I'd go with RIO. Anything involving text processing in Java pays a heavy tax because inexpensive UTF-8 text needs to be puffed up to an expensive 2-byte representation and then converted back to UTF-8. Java makes up for this by being threadsafe and parallelizable, in contrast to the GNU world on Linux, where writing threadsafe code in C is quite difficult. I use Java a lot. If a job is parallelizable or needs complex functionality in Java I can accept it, but for serial tasks at the head end with high data bulk I'd like to avoid the tax.
@database_animal The UTF-8 to UTF-16 conversion might not happen if you use a JVM later than 1.6.0_21; have a look at the command-line options to the java command (e.g. -XX:+UseCompressedStrings). The thing is, the task is parallel-friendly: the reading and writing could use different threads, which is especially interesting if compression and decompression are involved. Otherwise you are more likely to be limited by disk I/O. PS: when I was doing a lot of conversions I found Sesame with RIO to be just a bit faster than Redland's rapper, but not significantly so. |
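To make the size side of that encoding tax concrete, here is a small sketch comparing how many bytes the same text takes in UTF-8 and in a 2-byte-per-character UTF-16 encoding. It only measures storage size, not conversion time, and the example triple is invented.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    utf8_size = len(text.encode("utf-8"))
    # "utf-16-le" omits the byte-order mark, so each character in the
    # Basic Multilingual Plane costs exactly two bytes.
    utf16_size = len(text.encode("utf-16-le"))
    return utf8_size, utf16_size


if __name__ == "__main__":
    line = '<http://example.org/s> <http://example.org/p> "hello" .\n'
    utf8_size, utf16_size = encoded_sizes(line)
    print(f"UTF-8: {utf8_size} bytes, UTF-16: {utf16_size} bytes")
```

For pure ASCII input, which covers most IRIs, the UTF-16 form is exactly twice the size of the UTF-8 form; for text with many non-ASCII characters the gap narrows.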
|
Try the Redland utilities like rapper (http://librdf.org/raptor/). It's written in C and very fast. The Windows version is pretty old, but still does the job, and there are much more up-to-date Linux versions. Bob, how did I forget Redland! The main issue, if I remember correctly, is getting it running on non-*nix systems, and you may always struggle if you have a funky Linux environment. I know that's why we don't use it ourselves: our Linux environments don't have the requisite combination of packages for using raptor2.
I was going to answer "rapper" myself, but thought I'd give it a try first, and lo and behold, it doesn't stream. It loads the whole graph into memory before serialising. |
|
In Jena RIOT, N-Triples is faster to parse than Turtle. Many of the effects are system-related, not just Java ones: despite the larger byte count, the benefit of sitting in a tight loop outweighs the extra bytes because of stream I/O. Note that the RIOT parsers need enabling: if you use only the Jena core code (no ARQ etc.), they may not be wired in. There are command-line tools which stream (modulo bNode labels, which is a language design issue) from Turtle to N-Triples. Validation can be added with --validate (e.g. it checks the lexical forms of XSD literals; this, unsurprisingly, slows parsing down). A good practice is to always parse new data to N-Triples with as much checking as possible, then store compressed N-Triples (expect x8-x10 compression). RIOT reads from gzip files. |
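Part of why compressed N-Triples is convenient to store is that the format is line-oriented, so it can be consumed one statement at a time straight from a gzip file. The sketch below illustrates that in Python rather than with RIOT; it treats every non-blank, non-comment line as one triple and does no validation, and the helper names are hypothetical.

```python
import gzip


def iter_ntriples_lines(path):
    """Yield one statement line at a time from a gzipped N-Triples file."""
    # gzip.open in text mode decompresses incrementally, so only the
    # current chunk is held in memory.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            # Skip blank lines and whole-line comments.
            if stripped and not stripped.startswith("#"):
                yield stripped


def count_triples(path):
    """Count statement lines without loading the file into memory."""
    return sum(1 for _ in iter_ntriples_lines(path))
```

Something like `count_triples("dump.nt.gz")` then gives a cheap sanity check on a dump before handing it to a real parser.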
|
There is rdf2rdf, which wraps the Sesame RIO parsers and writers, but I know from experience within our company that they still use a lot of memory over time for very large data. This may be an artifact of the data (lots of BNodes), but in our experience RIO only scales so far. We've had decent success with RIOT (from the Jena ARQ package), and that's probably our go-to tool for conversion. My own toolkit has a command-line conversion tool, though it's nowhere near as performant as Sesame RIO or Jena RIOT, in the released version at least. It certainly struggles on large data (millions of triples), and the current version has a bug where BNode IDs may not be written correctly in N-Triples when converting from some formats. |


