Hello!

I have 5 different triplestores on my local hard drive which I needed to dump to text files (N-Triples). I tried doing so using tdbdump on a Windows machine. For 3 of the triplestores this was not a problem. The other two give me the following exception:

com.hp.hpl.jena.tdb.base.file.FileException: ObjectFileStorage.read[nodes.dat](5712499)[filesize=115185505][file.size()=115185505]: Impossibly large object : 1768974624 bytes > filesize-(loc+SizeOfInt)=109473002
    at com.hp.hpl.jena.tdb.base.objectfile.ObjectFileStorage.read(ObjectFileStorage.java:319)
    at com.hp.hpl.jena.tdb.lib.NodeLib.fetchDecode(NodeLib.java:72)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableNative.readNodeFromTable(NodeTableNative.java:178)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableNative._retrieveNodeByNodeId(NodeTableNative.java:103)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableNative.getNodeForNodeId(NodeTableNative.java:74)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableCache._retrieveNodeByNodeId(NodeTableCache.java:103)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableCache.getNodeForNodeId(NodeTableCache.java:74)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableWrapper.getNodeForNodeId(NodeTableWrapper.java:55)
    at com.hp.hpl.jena.tdb.nodetable.NodeTableInline.getNodeForNodeId(NodeTableInline.java:67)
    at com.hp.hpl.jena.tdb.lib.TupleLib.triple(TupleLib.java:137)
    at com.hp.hpl.jena.tdb.lib.TupleLib.triple(TupleLib.java:114)
    at com.hp.hpl.jena.tdb.lib.TupleLib.access$000(TupleLib.java:45)
    at com.hp.hpl.jena.tdb.lib.TupleLib$3.convert(TupleLib.java:76)
    at com.hp.hpl.jena.tdb.lib.TupleLib$3.convert(TupleLib.java:72)
    at org.openjena.atlas.iterator.Iter$4.next(Iter.java:301)
    at org.openjena.atlas.iterator.Iter$4.next(Iter.java:301)
    at org.openjena.atlas.iterator.Iter.next(Iter.java:828)
    at org.openjena.atlas.iterator.IteratorCons.next(IteratorCons.java:89)
    at org.openjena.atlas.iterator.Iter.sendToSink(Iter.java:572)
    at org.openjena.riot.out.NQuadsWriter.write(NQuadsWriter.java:45)
    at org.openjena.riot.out.NQuadsWriter.write(NQuadsWriter.java:37)
    at org.openjena.riot.RiotWriter.writeNQuads(RiotWriter.java:41)
    at tdb.tdbdump.exec(tdbdump.java:49)
    at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
    at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
    at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
    at tdb.tdbdump.main(tdbdump.java:31)

The curious thing is that one of the triplestores that causes this exception is smaller than all of the ones that could be dumped without any problems. The sizes of the ones that worked are 1.3 GB, 0.7 GB and 3.75 GB; the sizes of the ones that cause the problem are 0.6 GB and 6.7 GB.

I guess my problem is related to this issue. Due to poor programming on my part, the program was terminated a few times during population without properly closing the triplestore. The suggestion in the referenced issue of simply rebuilding the triplestore would work in theory, but it is not desirable, since the triples were collected over an API and it would probably take over a week to do so.

The mentioned issue also points out that it could be a bug in TDB versions before 0.9, but I am using 0.9.3. I also did not use concurrent access (unless I started the triple-collection program twice, which I am pretty certain did not happen).

So is there anything else that could be done? I already tried running tdbrecovery, which didn't help. I also tried iterating over the triples using Java, which caused the same exception. My probably very naive first approach was to iterate over all the triples in the model; afterwards I tried to reduce the object size by iterating over the subjects and, for each subject, iterating over its statements.
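
For reference, this is roughly what my first Java attempt looked like (just a sketch; the database directory is a placeholder):

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.StmtIterator;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class DumpAttempt {
        public static void main(String[] args) {
            // Open the TDB-backed default model (directory is a placeholder)
            Model model = TDBFactory.createModel("C:/data/tdb");
            StmtIterator it = model.listStatements();
            while (it.hasNext()) {
                // Fails with the same FileException on the corrupt stores
                System.out.println(it.nextStatement());
            }
            model.close();
        }
    }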

Any help would be highly appreciated!

asked 21 Nov '12, 19:50 by knut_ (edited 21 Nov '12, 20:18)

This is quite a specialist question. A few Jena devs and contributors hang out around here, but have you tried the mailing list?

http://jena.apache.org/help_and_support/index.html

You might get a faster response and you'll probably get more expert eyeballs on your question through that.

In any case, I wish you luck. Losing data like that sucks.

(21 Nov '12, 20:31) Signified ♦

This size test is usually triggered by a corrupt node table in the database. You don't actually have impossibly large objects; rather, the node table is broken and data has overwritten a length field. This is a problem that occurs at data-write time, even though it only shows up later, at read time.

Not using transactions and crashing the JVM while the data was being written is a possible cause. The pre/post-0.9 version comment refers to using transactions. It looks like you were not using transactions at some point in the past and exited the JVM without syncing the caches.
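
To illustrate, a minimal sketch of transactional loading with the 0.9.x API (the directory and the loading step are placeholders):

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.query.ReadWrite;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class TransactionalLoad {
        public static void main(String[] args) {
            Dataset dataset = TDBFactory.createDataset("C:/data/tdb");
            dataset.begin(ReadWrite.WRITE);
            try {
                // ... add triples to dataset.getDefaultModel() here ...
                dataset.commit();   // make the changes durable
            } finally {
                dataset.end();      // aborts if commit() was not reached
            }
        }
    }

If the JVM dies before commit(), the on-disk database is left untouched.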

Version 0.9.4 is safer again and has had some systematic user testing for crash situations. It fixed some cases of bad recovery after a crash, although those were in the indexes, not the node table. I'm afraid this does not help you directly.

With transactions, you can take a backup of a live database by simply starting a read transaction and writing the triples out. tdbdump, being a separate JVM, requires exclusive access to the database.
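
For example, something along these lines (a sketch; file names are placeholders, and it assumes the RiotWriter.writeNQuads entry point visible in the stack trace above):

    import java.io.FileOutputStream;
    import java.io.OutputStream;

    import org.openjena.riot.RiotWriter;

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.query.ReadWrite;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class LiveBackup {
        public static void main(String[] args) throws Exception {
            Dataset dataset = TDBFactory.createDataset("C:/data/tdb");
            dataset.begin(ReadWrite.READ);   // consistent snapshot; writers can continue
            try {
                OutputStream out = new FileOutputStream("backup.nq");
                RiotWriter.writeNQuads(out, dataset.asDatasetGraph());
                out.close();
            } finally {
                dataset.end();
            }
        }
    }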

answered 22 Nov '12, 03:52 by AndyS ♦

Thank you for that explanation! So it seems there is nothing that can be done, or at least you do not know of any solution, besides using transactions in the first place? If that is the case, I think this should be pointed out more prominently in the TDB manual. Anyway, I restarted the data collection so it could run over the weekend, but unfortunately I didn't have enough time to implement transactions. So I just have to hope it works this time, or redo it properly with transactions.

(23 Nov '12, 03:05) knut_

Suggestions for and contributions to the documentation are more than welcome.

If used non-transactionally, then a clean shutdown (TDB.sync(dataset)) will be usable for starting and stopping. It's the aborted JVM that looks like it caused the problem: the in-memory caches were ahead of the disk.
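
In other words, something like this (a sketch; the directory and the update step are placeholders):

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.tdb.TDB;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class NonTransactionalLoad {
        public static void main(String[] args) {
            Dataset dataset = TDBFactory.createDataset("C:/data/tdb");
            // ... non-transactional updates here ...
            TDB.sync(dataset);   // flush the in-memory caches to disk
            dataset.close();
        }
    }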

(23 Nov '12, 09:21) AndyS ♦