Hi All,

I was wondering if LDSpider could be used to crawl an entire website (i.e. all the links), such as bbc.co.uk, using the sitemap referenced in robots.txt, and gather the triples from that crawl. If I give the seed URI as bbc.co.uk, I do not get any triples back. LDSpider does not seem to read robots.txt and follow it to sitemap.xml to crawl all the links in the sitemap.

Can this be accomplished using LDSpider?

I am using a breadth-first crawling strategy. I understand that the kind of crawling I am trying to do is not necessarily LOD in nature, but I want to crawl multiple sites and intend to leverage LDSpider because it can be hooked up with the Any23Handler.

Thank You

asked 08 Jan '13, 00:21

altruist

@Signified

Thank you. I was actually also looking for suggestions besides LDSpider, where the crawler would crawl a predefined set of sites and "triplify" the data.

I was in favour of LDSpider only because it has an Any23Handler, which seems to handle both RDFa and Microdata.

(08 Jan '13, 13:10) altruist

No problem. I should add that I'm sort of slightly half involved in the LDSpider project.

Another option to check out is slug (haven't tried it myself):

https://code.google.com/p/slug-semweb-crawler/

Also, you can have a skim through these questions:

http://answers.semanticweb.com/search/?q=crawler&Submit=Search&t=question

(08 Jan '13, 13:22) Signified ♦

Thanks again, Signified. LDSpider seems to work if I give a specific page from a website as the seed URI. However, I am not sure how I would leverage robots.txt and sitemap.xml from LDSpider to crawl the entire set of pages in a website.

Do you have any suggestions?

(08 Jan '13, 14:03) altruist
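
One possible workaround for the robots.txt and sitemap.xml question above, offered as a rough sketch and not as an LDSpider feature: expand the sitemap into a flat list of page URLs beforehand and use that list as the seed URIs. The Python below is a minimal illustration under several assumptions: the site lists its sitemaps on `Sitemap:` lines in robots.txt, the sitemaps are plain (not gzipped) XML following the sitemaps.org format, and the crawler accepts a plain seed list with one URI per line. The function names and the `seeds.txt` file name are made up for this example.

```python
import urllib.request
import xml.etree.ElementTree as ET


def sitemaps_from_robots(robots_txt):
    """Return the URLs given on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def _local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def locs_from_sitemap(xml_data):
    """Return (kind, urls); kind is 'index' for a sitemap index, else 'urlset'."""
    root = ET.fromstring(xml_data)
    kind = "index" if _local(root.tag) == "sitemapindex" else "urlset"
    urls = []
    for entry in root:            # <url> or <sitemap> elements
        for child in entry:       # only direct children, so nested image locs are skipped
            if _local(child.tag) == "loc" and child.text:
                urls.append(child.text.strip())
    return kind, urls


def fetch_bytes(url):
    """Download a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def collect_page_urls(site_root, fetch=fetch_bytes, limit=10000):
    """Follow robots.txt -> sitemap(s) -> page URLs, up to `limit` pages."""
    robots = fetch(site_root.rstrip("/") + "/robots.txt").decode("utf-8", "replace")
    pending = sitemaps_from_robots(robots)
    seen, pages = set(), []
    while pending and len(pages) < limit:
        sitemap_url = pending.pop()
        if sitemap_url in seen:
            continue
        seen.add(sitemap_url)
        kind, urls = locs_from_sitemap(fetch(sitemap_url))
        if kind == "index":
            pending.extend(urls)  # a sitemap index points at further sitemaps
        else:
            pages.extend(urls)
    return pages[:limit]


if __name__ == "__main__":
    # Hypothetical usage: write one seed URI per line for the crawler to read.
    with open("seeds.txt", "w", encoding="utf-8") as out:
        for page in collect_page_urls("http://www.bbc.co.uk/"):
            out.write(page + "\n")
```

The sketch only builds the seed list. It does not check the `Disallow` rules in robots.txt, so politeness and rate limiting are still left to the crawler, and a large site may list far more pages than the `limit` default allows.
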
question asked: 08 Jan '13, 00:21

question was seen: 1,059 times

last updated: 08 Jan '13, 17:19