|
There seems to be a lot of research on anonymisation methods for graph-structured data to protect privacy in online social networks (Zhou et al. 2008), but where can I find similar methods for anonymising RDF, and for privacy in the Semantic Web in general? It seems to me that privacy preservation in the Linked Data model is even more complex, given that: (I'm new to this, so please correct me)
For someone who wants to publish anonymised personal data as Linked Data, if only to create FOAF Persons and attributes (on behalf of my friends :), where do I turn? |
|
Good question. My gut says that anonymisation on the infrastructural level of RDF/RDFS/OWL would be difficult, and that it would be up to domain experts to make informed decisions. My initial impression was that, as a crude analogy, it would be like asking how to anonymise JSON ... i.e., it depends on the JSON. On the other hand, there's an interesting parallel between "entity disambiguation" or "sameAs mining" (that look to resolve identity through looking for potentially complex or approximate "keys" ... e.g., birthday and full name, or username and site, etc.) and de-anonymisation techniques. So maybe you could take that research and invert it. :) /2 cents |
|
Wouldn't this be as easy as making your RDF data not public? And if you have a SPARQL endpoint, you would also make that private. Otherwise it doesn't make much sense to use Linked Data, which is all about collaboration and sharing data, if you're not willing to share.

That's a bit of an oversimplification. There is tons of data (for example, in government organizations) that can be of great interest if published, yet may contain privacy-sensitive things (details of individuals). One goal of anonymisation is to be able to 'filter' such datasets, so that the "safe-to-publish" bit can be extracted and made public.

I'm not sure about that. I would separate the privacy-sensitive triples and make them private. So you could imagine a system where you have a public.rdf and a private.rdf dataset. Triples offer this kind of flexibility, so why not take advantage of it? You could protect your private.rdf dataset with standard HTTP authentication, making it extremely easy to implement.

@lmatteis, that's fine, but how do you decide which triples to make private? :) That is not as simple as it may seem at first glance...

@Jeen What do you mean by saying it's not so simple?
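
The public.rdf / private.rdf split suggested above can be sketched roughly as follows. This is only an illustration: it treats triples as plain Python tuples rather than using an RDF library, and the set of "sensitive" predicates is a hypothetical choice that a domain expert would have to make.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical list: which predicates count as sensitive is a domain decision.
SENSITIVE_PREDICATES = {FOAF + "phone", FOAF + "mbox", FOAF + "birthday"}


def split_triples(triples, sensitive=SENSITIVE_PREDICATES):
    """Partition (subject, predicate, object) tuples into (public, private) lists."""
    public, private = [], []
    for triple in triples:
        _, predicate, _ = triple
        (private if predicate in sensitive else public).append(triple)
    return public, private


triples = [
    ("ex:bob", FOAF + "name", "Bob"),
    ("ex:bob", FOAF + "phone", "tel:+1-555-0100"),
    ("ex:bob", FOAF + "knows", "ex:jim"),
]

public, private = split_triples(triples)
print(len(public), "public triples,", len(private), "private triples")
```

Note that a per-predicate filter like this says nothing about whether the remaining public triples, taken together, still identify someone.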

I don't claim to be an expert, but what I mean is that it is often more involved than just hiding the phone-number and lastname properties: deciding what to show and what to hide is a bit of a balancing act. Is leaving birthdays in OK if we leave out names? Can links to external documents be left in if those documents potentially mention individuals? Apart from the obvious fields (name, birthday, email), which combinations of properties make individuals potentially identifiable? You can of course just filter everything out, but that way you will likely end up with an uninteresting dataset.

If by separation of triples you mean removing certain nodes or edges, this would distort the graph and therefore break the chain of inferences required to answer queries. Grouping triples (aggregating classes and/or properties) is a viable alternative, but it will introduce some level of uncertainty in results, e.g.: "Bob" (:knows OR :worksFor) ("Jim" or "Sally"). Apparently graph anonymization is much more complex than tabular data anonymization, where you can remove certain values, columns or rows to achieve k-anonymity. Zhou et al. (2008) explain why in their paper (2.1).

Maybe I am oversimplifying this, but if you want to anonymize your data, you most likely need to know which data you want to anonymize. This is true not only in a triple system, but in any other medium where you have data (even on a piece of paper). So of course you need to identify what is considered privacy-sensitive in your system. All I'm trying to say is that it doesn't seem to be different from anonymizing any other type of data. You simply identify it, and you anonymize it using the exact same methods you would use with any other type of data.
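
To make the question "which combinations of properties make individuals potentially identifiable?" concrete, here is a minimal sketch of a k-anonymity check on a flattened, tabular view of the data. The records and property names are made up for illustration, and, as noted above, a tabular check like this ignores the graph structure, so it is at best a first sanity check for RDF.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records that share the same values for the
    given properties; the data is k-anonymous for any k up to this size."""
    counts = Counter(
        tuple(record.get(prop) for prop in quasi_identifiers) for record in records
    )
    return min(counts.values())


# Made-up records; names have already been left out.
people = [
    {"birthday": "1980-01-01", "city": "Galway", "employer": "ACME"},
    {"birthday": "1980-01-01", "city": "Galway", "employer": "Initech"},
    {"birthday": "1975-06-30", "city": "Galway", "employer": "ACME"},
    {"birthday": "1975-06-30", "city": "Galway", "employer": "ACME"},
]

print(smallest_group(people, ["birthday", "city"]))              # 2
print(smallest_group(people, ["birthday", "city", "employer"]))  # 1
```

In this toy data, birthday and city together leave every person hidden in a group of two, but adding the employer singles out two individuals even though no names are present.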
|

