PySparql

The Semantic Web defines a number of technologies and standards for delivering structured data over the internet. It aims to make it easy to produce and consume data with clearly defined semantics.

DBpedia provides a collection of structured data extracted from Wikipedia. The data is maintained as a set of triples, each of which associates an entity (identified by a unique URI) with a value or another entity via a named relationship. Relationships are defined by ontologies such as FOAF (Friend of a Friend), which standardises relationships such as name.
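
As a rough illustration of the triple model, the following Python 3 sketch stores a few triples as plain tuples and matches them against a pattern in which None acts as a wildcard. The abbreviated identifiers (dbr:, dbo:) and the match helper are assumptions made for this example; they are not part of DBpedia or of any library.

```python
# A triple is (subject, relationship, value); the data below is illustrative only.
triples = [
    ("dbr:Blade_Runner", "foaf:name", "Blade Runner"),
    ("dbr:Blade_Runner", "dbo:director", "dbr:Ridley_Scott"),
    ("dbr:Ridley_Scott", "foaf:name", "Ridley Scott"),
]

def match(triples, subject=None, rel=None, value=None):
    """Return the triples that agree with every non-None argument."""
    pattern = (subject, rel, value)
    return [t for t in triples
            if all(p is None or p == x for p, x in zip(pattern, t))]

# Who is linked to Blade Runner via the director relationship?
directors = [v for (_, _, v) in match(triples, "dbr:Blade_Runner", "dbo:director")]
```

A SPARQL query plays much the same role as match, except that the pattern is written as text and evaluated by the end-point.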

SPARQL is a language for querying and extracting data from a set of triples. One way to execute such a query is to send it to a SPARQL end-point (a web-service which processes the query and returns a document in XML or JSON with the results).
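
To make the idea of an end-point concrete, here is a minimal Python 3 sketch that prepares (but does not send) an HTTP GET request carrying a query. The helper name build_sparql_request and the sample query are assumptions made for illustration; the query parameter and the application/sparql-results+xml media type come from the SPARQL protocol and results format.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_sparql_request(endpoint, query):
    """Build (but do not send) a GET request for a SPARQL end-point."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for XML results; an end-point may also offer JSON.
    return Request(url, headers={"Accept": "application/sparql-results+xml"})

req = build_sparql_request(
    "http://dbpedia.org/sparql",
    "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5",
)
```

Passing req to urllib.request.urlopen would then return the result document, assuming the end-point is reachable.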

The PySparql snippet provides some example Python scripts which hide some of the complexity of SPARQL: they locate and download information from DBpedia and make it available for processing. These scripts could form the basis of a semantic web crawler.

The following SPARQL query can be used to list the relationships and values of an entity (in this case, the entity is the film "Blade Runner").

SELECT
    ?rel ?val 
WHERE {
    <http://dbpedia.org/resource/Blade_Runner> ?rel ?val .
} 
		
The query can be evaluated via the Virtuoso SPARQL end-point to yield:
http://dbpedia.org/ontology/director 	http://dbpedia.org/resource/Ridley_Scott
http://dbpedia.org/ontology/Film/director 	http://dbpedia.org/resource/Ridley_Scott
http://dbpedia.org/ontology/producer 	http://dbpedia.org/resource/Michael_Deeley
http://dbpedia.org/ontology/Film/producer 	http://dbpedia.org/resource/Michael_Deeley
http://dbpedia.org/ontology/writer 	http://dbpedia.org/resource/Philip_K._Dick
http://dbpedia.org/ontology/writer 	http://dbpedia.org/resource/Hampton_Fancher
http://dbpedia.org/ontology/writer 	http://dbpedia.org/resource/David_Peoples
http://dbpedia.org/ontology/starring 	http://dbpedia.org/resource/Edward_James_Olmos
http://dbpedia.org/ontology/starring 	http://dbpedia.org/resource/Sean_Young
http://dbpedia.org/ontology/starring 	http://dbpedia.org/resource/Rutger_Hauer
http://dbpedia.org/ontology/starring 	http://dbpedia.org/resource/Daryl_Hannah
http://dbpedia.org/ontology/starring 	http://dbpedia.org/resource/M._Emmet_Walsh
http://dbpedia.org/ontology/starring 	http://dbpedia.org/resource/Harrison_Ford
...
		

Constructing queries and processing the results directly is somewhat messy. For example, an end-point usually limits the number of results returned from each query. SPARQL's LIMIT and OFFSET clauses, combined with ORDER BY to keep the ordering stable, allow the full set of results to be reconstructed from a series of queries.
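
The paging idea can be sketched in Python 3 independently of any particular end-point. In the sketch below, fetch_all and fake_endpoint are hypothetical names: run_query stands for any callable that returns one page of rows for a given limit and offset, and the fake end-point simply slices a list.

```python
def fetch_all(run_query, page_size=25):
    """Collect every row by requesting pages until a short page arrives."""
    rows, offset = [], 0
    while True:
        page = run_query(limit=page_size, offset=offset)
        rows.extend(page)
        if len(page) < page_size:
            return rows
        offset += page_size

# Stand-in for an end-point holding 60 rows (for illustration only).
data = list(range(60))
calls = []

def fake_endpoint(limit, offset):
    calls.append((limit, offset))
    return data[offset:offset + limit]

all_rows = fetch_all(fake_endpoint)
```

Here three requests are made, with offsets 0, 25 and 50; the third returns fewer rows than the page size, which ends the loop.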

This snippet builds up a small library to automate the construction and execution of SPARQL queries. First, consider some basic code for issuing SPARQL queries and processing the results.

core.py
import urllib2
from xml.dom import minidom
import StringIO

# run a SPARQL query (already encoded as an end-point URL), return the resulting document
def runSparqlQuery(url):
    req = urllib2.Request(url)
    rsp = urllib2.urlopen(req)
    try:
        return rsp.read()
    finally:
        rsp.close()
    
# utility: concatenate text from a list of XML nodes
def getText(nodelist):
    rc = ""
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
    return rc

# extract a value that is either a uri or a literal
# if successful return (<type>,<valuetext>) where
# <type> is one of uri|literal
# on error return ("?","?")
def extractBinding(binding):
    for tag in ("uri", "literal"):
        elements = binding.getElementsByTagName(tag)
        if elements:
            return (tag, getText(elements[0].childNodes))
    return ("?","?")

# parse the response XML returned by a sparql end point
def parseSparqlResponse(responsexml):
    results = []
    dom = minidom.parse(StringIO.StringIO(responsexml))
    for result in dom.getElementsByTagName("result"):
        bindings = result.getElementsByTagName("binding")
        row = []
        for binding in bindings:
            rowval = extractBinding(binding)
            row.append(rowval)
        results.append(row)
    return results        

# fetch a query URL (as built by constructQueryURL), parse the response and
# return it as a list of rows, each row being a list of (type, value) tuples
def processSparqlQuery(url):
    responseXML = runSparqlQuery(url)
    return parseSparqlResponse(responseXML)
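
For readers on Python 3, the same parsing step might look roughly like the sketch below, which uses xml.etree.ElementTree instead of minidom. The SAMPLE document is hand-written for illustration (it is not captured DBpedia output) and parse_response is a hypothetical name; the namespace is the one defined by the SPARQL Query Results XML Format.

```python
import xml.etree.ElementTree as ET

NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

# Hand-written sample response, for illustration only.
SAMPLE = """<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="rel"/><variable name="val"/></head>
  <results>
    <result>
      <binding name="rel"><uri>http://xmlns.com/foaf/0.1/name</uri></binding>
      <binding name="val"><literal xml:lang="en">Blade Runner</literal></binding>
    </result>
  </results>
</sparql>"""

def parse_response(xml_text):
    """Return one list per result row; each cell is a (type, text) tuple."""
    rows = []
    for result in ET.fromstring(xml_text).iterfind(".//sr:result", NS):
        row = []
        for binding in result.iterfind("sr:binding", NS):
            value = binding[0]                  # the <uri> or <literal> child
            kind = value.tag.split("}")[1]      # strip the "{namespace}" prefix
            row.append((kind, value.text or ""))
        rows.append(row)
    return rows

rows = parse_response(SAMPLE)
```

The row layout matches what parseSparqlResponse produces, so code written against one should be straightforward to adapt to the other.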
    
        

The following code provides some helpers for constructing SPARQL queries against DBpedia.

dbpedia.py
from urllib import quote


def constructQueryURL(query):
    url = "http://dbpedia.org/sparql?default-graph-uri=http%3A//dbpedia.org&output=xml&query="
    url += quote("""
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        PREFIX dbpedia: <http://dbpedia.org/property/>
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        """)      
    url += quote(query)
    return url
    
lookup_template = """
    SELECT
        ?obj 
    WHERE {
        ?obj foaf:name ?name .
        FILTER(?name = "%s"@%s) .
    }
    LIMIT 10
    """
    
sparql_template = """
    SELECT
        ?rel ?val 
    WHERE {
        ?object ?rel ?val .
        FILTER(?object = <%s>) .
    } 
    ORDER BY ?rel ?val
    LIMIT %d
    OFFSET %d
    """   

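To show how the paging template is meant to be filled in, here is a small Python 3 sketch. It repeats the template so that it is self-contained, and uses urllib.parse.quote, the Python 3 home of the quote function imported in dbpedia.py. The page number chosen is arbitrary.

```python
from urllib.parse import quote

# Copy of the paging template from dbpedia.py, repeated for self-containment.
sparql_template = """
    SELECT
        ?rel ?val
    WHERE {
        ?object ?rel ?val .
        FILTER(?object = <%s>) .
    }
    ORDER BY ?rel ?val
    LIMIT %d
    OFFSET %d
    """

uri = "http://dbpedia.org/resource/Blade_Runner"
page_size, page = 25, 2

# Third page of results: skip the first 2 * 25 rows.
query = sparql_template % (uri, page_size, page_size * page)

# Percent-encode the query so it can be appended to the end-point URL.
encoded = quote(query)
```

The ORDER BY clause matters here: without a stable ordering, successive pages could overlap or skip rows.
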
The Node class builds on the basic SPARQL query capabilities to collect all known information about a particular object, identified by its URI:

node.py
from core import processSparqlQuery
from dbpedia import constructQueryURL, sparql_template

class Node(object):
    
    def __init__(self,uri):
        self.uri = uri
        self.page_count = 0
        self.page_size = 25
        
        self.loaded = False
        self.literals = {}
        self.nodes = {}
        
    def __str__(self):
        return "Node("+self.uri+")"
        
    def __repr__(self):
        return str(self)
        
    def load(self):
        if self.loaded:
            return
        self.page_count = 0
        while(True):
            if not self.loadpage():
                break
        self.loaded = True
                
    def loadpage(self):
        try:
            query = sparql_template%(self.uri,self.page_size,self.page_size*self.page_count)
            url = constructQueryURL(query)
            results = processSparqlQuery(url)
            for result in results:
                rel = result[0]
                val = result[1]
                if val[0] == 'uri':
                    self.add(self.nodes,rel[1],Node(val[1]))          
                else:
                    self.add(self.literals,rel[1],val[1])
            self.page_count += 1
            return len(results) == self.page_size
        except Exception, ex:
            print "Exception:"+str(ex)
            return False
            
    def __getitem__(self,key):
        if not self.loaded:
            self.load()
        if key in self.literals:
            return self.literals[key]
        if key in self.nodes:
            return self.nodes[key]
        return None
    
    def add(self,attrs,rel,val):    
        if rel in attrs:
            attrs[rel].append(val)
        else:
            attrs[rel] = [val]
              
    def getReferencedNodes(self):
        if not self.loaded:
            self.load()
        return self.nodes
        
    def getLiterals(self):
        if not self.loaded:
            self.load()
        return self.literals
            
    def dump(self):
        print self.uri
        print "================================================"
        for key in self.literals:
            print key + " --> " + str(self.literals[key])
        for key in self.nodes:
            print key + " --> " + str(self.nodes[key])
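
The bookkeeping done by loadpage and add, which splits each (relationship, value) row into literals and referenced URIs and keeps a list of values per relationship, can be sketched on its own in Python 3. The group_rows function and the sample rows are assumptions made for illustration; unlike Node, the sketch keeps plain URI strings rather than wrapping them in Node objects.

```python
from collections import defaultdict

def group_rows(rows):
    """Split (rel, val) rows into literals and referenced URIs, keyed by relationship."""
    literals, references = defaultdict(list), defaultdict(list)
    for (_, rel), (kind, value) in rows:
        target = references if kind == "uri" else literals
        target[rel].append(value)
    return dict(literals), dict(references)

# Rows in the shape returned by parseSparqlResponse (sample data only).
rows = [
    [("uri", "http://xmlns.com/foaf/0.1/name"), ("literal", "Blade Runner")],
    [("uri", "http://dbpedia.org/property/starring"),
     ("uri", "http://dbpedia.org/resource/Harrison_Ford")],
    [("uri", "http://dbpedia.org/property/starring"),
     ("uri", "http://dbpedia.org/resource/Sean_Young")],
]
literals, references = group_rows(rows)
```

Keeping a list per relationship is what allows a film to have several values for starring, or a person to have more than one name.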
                
                    
    

The Factory class can be used to list URIs that are associated with a particular name, and construct a set of zero or more Nodes which wrap any associated resources:

factory.py
from core import processSparqlQuery
from node import Node
from dbpedia import lookup_template, constructQueryURL
   
   
class Factory:

    def __init__(self):
        pass
        
    def createNodes(self,name):
        query = lookup_template%name
        url = constructQueryURL(query)
        results = processSparqlQuery(url)
        return [Node(uri) for [(t,uri)] in results if t == 'uri' ]  

if __name__ == '__main__':
    f = Factory()
    nodes = f.createNodes(("Blade Runner","en"))
    for node in nodes:
        print str(node)
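
Note that createNodes substitutes the name directly into lookup_template, so a name containing a double quote or a backslash would produce a malformed query. One possible guard is sketched below in Python 3; sparql_literal is a hypothetical helper, not part of the snippet, and the language-tag check is deliberately conservative.

```python
import re

def sparql_literal(text, lang):
    """Render text as a double-quoted SPARQL literal with a language tag."""
    if not re.fullmatch(r"[A-Za-z]+(-[A-Za-z0-9]+)*", lang):
        raise ValueError("unexpected language tag: %r" % lang)
    # Escape the backslash first so later escapes are not doubled.
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n")
                   .replace("\r", "\\r"))
    return '"%s"@%s' % (escaped, lang)

lookup_filter = "FILTER(?name = %s) ." % sparql_literal('The "Final Cut"', "en")
```

With such a helper, the template would contain a single %s placeholder in place of "%s"@%s.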

The following script brings everything together, crawling DBpedia to find out who directed the film "Blade Runner":

explore.py
from factory import Factory

director_rel = "http://dbpedia.org/property/director"
producer_rel = "http://dbpedia.org/property/producer"

f = Factory()

print "Who directed Blade Runner?"

# search for nodes which may be associated with the name "Blade Runner"
nodes = f.createNodes(("Blade Runner","en"))

# for each of the nodes
for node in nodes:
    # query dbpedia to load in details
    node.load()
    # if the node is the subject of the director relationship, look at the associated node
    if node[director_rel]:
        directors = node[director_rel]
        for director in directors:
            # load the director node's details before reading its name
            director.load()
            print director['http://xmlns.com/foaf/0.1/name']

Running this script should produce the following output (bear in mind that DBpedia changes over time):

$ python explore.py
Who directed Blade Runner?
[u'Sir Ridley Scott', u'Ridley Scott']
	

 
