====== Federated SPARQL query processing with Corese/KGRAM ======
===== Motivations =====
Most scientific disciplines, and especially the biomedical sciences, nowadays face "data deluge" challenges. In spite of the continuously increasing computing and storage capabilities of computing infrastructures, managing massive scientific data through centralized approaches has become inappropriate, for several reasons: (i) they do not guarantee the autonomy of data providers, which are constrained, for ethical or legal reasons, to keep control over the data they host; (ii) they do not scale and adapt to the massive scientific data produced through e-Science platforms.
Federated approaches, which strongly rely on distributed query processing techniques, have therefore gained a lot of interest as a way to cope with the autonomy constraints of biomedical data providers. In addition, knowledge engineering and Semantic Web technologies have become popular means to enhance data and knowledge sharing.
In this page, we briefly introduce the **federation extension** of the [[http://wimmics.inria.fr/corese|Corese/KGRAM Semantic Web engine]], which supports distributed data sources and addresses the underlying performance issues.
===== Demonstration =====
We will now demonstrate how to federate two datasets from the French national statistical institute (INSEE), namely the [[http://rdf.insee.fr/demo/index.html|demographic dataset]] and the [[http://rdf.insee.fr/geo/index.html|geographic dataset]]. We first show the content of the two datasets, then distribute a query over these two endpoints. As a result, we show the provenance of intermediate results, and we monitor the cost of distributed query processing in terms of the volume of issued intermediate requests and of transferred intermediate results.
The latest stable version of Corese-server can be downloaded from ftp://ftp-sop.inria.fr/wimmics/soft/corese-server-3.1.3.jar, and the current development version can be found, after compiling the project, in ''trunk/kgserver/target/corese-server-3.1.4-SNAPSHOT-jar-with-dependencies.jar''.
=== Corese server launching ===
Here is the help description of **corese-server** :
<code>
java -jar corese-server-3.1.4-SNAPSHOT-jar-with-dependencies.jar -h
usage: kgserver [-e] [-h] [-l <arg>] [-p <arg>] [-v]
 -e,--entailments   enable RDFS entailments
 -h,--help          print this message
 -l,--load <arg>    data file or directory to be loaded
 -p,--port <arg>    specify the server port
 -v,--version       print the version information and exit
</code>
Once launched, the server can be managed through a web user interface, available at ''http://localhost:'' followed by the configured port.
We now launch a first server on port ''9091''. We will then configure it from the web user interface to load the datasets.
java -jar corese-server-3.1.4-SNAPSHOT-jar-with-dependencies.jar -p 9091
Once launched, the **Data querying** interface can be accessed through http://localhost:9091 :
{{ :public:corese:capture_d_e_cran_2014-07-25_a_11.31.50.png?direct&500 |SPARQL querying interface }}
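The interface accepts any SPARQL 1.1 query. As a minimal sanity check (this query is not one of those shown in the screenshots), the following lists a few triples once data has been loaded (see the next section):
<code sparql>
# List a handful of triples to check that the endpoint serves data
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
</code>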
=== RDF Data loading ===
Data loading is then performed through the **Data loading** tab :
{{ :public:corese:capture_d_e_cran_2014-07-25_a_11.37.34.png?direct&500 |SPARQL data loading}}
{{ :public:corese:capture_d_e_cran_2014-07-25_a_11.40.56.png?direct |Datasets available online}}
We load the **INSEE Geographic** dataset.
Finally, we can check the content served by the SPARQL endpoint through a set of statistical queries :
Counting RDF triples :
{{ :public:corese:capture_d_e_cran_2014-07-25_a_11.44.02.png?direct&500 |Counting RDF triples}}
Counting and sorting triples by property :
{{ :public:corese:capture_d_e_cran_2014-07-25_a_11.45.26.png?direct&500 |Counting and sorting triples using properties}}
Counting and sorting class instances :
{{ :public:corese:capture_d_e_cran_2014-07-25_a_11.47.06.png?direct&500 |Counting and sorting class instances}}
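For reference, the statistical queries shown in the screenshots can be written along the following lines (a sketch; the exact queries behind the screenshots may differ slightly). Each query is submitted separately through the **Data querying** interface:
<code sparql>
# 1. Count all RDF triples in the dataset
SELECT (COUNT(*) AS ?nbTriples)
WHERE { ?s ?p ?o }

# 2. Count triples per property, most frequent properties first
SELECT ?p (COUNT(*) AS ?nb)
WHERE { ?s ?p ?o }
GROUP BY ?p
ORDER BY DESC(?nb)

# 3. Count instances per class, most populated classes first
SELECT ?class (COUNT(?x) AS ?nb)
WHERE { ?x a ?class }
GROUP BY ?class
ORDER BY DESC(?nb)
</code>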
=== Second Corese-server launching and populating ===
We launch a second server on port ''9092''
java -jar corese-server-3.1.4-SNAPSHOT-jar-with-dependencies.jar -p 9092
and configure it similarly with the **INSEE Demographic** dataset.
=== Federated querying configuration ===
We launch a third server that will be responsible for distributing a SPARQL query onto the two SPARQL endpoints :
java -jar corese-server-3.1.4-SNAPSHOT-jar-with-dependencies.jar -p 9093
We add the two SPARQL endpoint URLs through the **Federated querying** tab :
{{ :public:corese:capture_d_e_cran_2014-07-25_a_11.55.12.png?direct&500 |}}
And we should obtain the following configuration :
{{ :public:corese:capture_d_e_cran_2014-07-25_a_12.03.41.png?direct&500 |}}
=== Federated querying (DQP) ===
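The federation server accepts a single SPARQL query whose triple patterns span both datasets; DQP distributes its evaluation over the two registered endpoints and joins the intermediate results. The query below is only an illustrative sketch: the prefixes and the ''geo:'' / ''demo:'' class and property names are placeholders and may differ from the actual INSEE vocabularies.
<code sparql>
# Illustrative federated query (namespaces, class and property names are
# placeholders, not necessarily the exact INSEE vocabulary terms).
# Geographic triple patterns are answered by the geographic endpoint,
# population figures by the demographic endpoint; DQP joins the results.
PREFIX geo:  <http://rdf.insee.fr/def/geo#>
PREFIX demo: <http://rdf.insee.fr/def/demo#>

SELECT ?region ?name ?population
WHERE {
  ?region a geo:Region ;
          geo:nom ?name .
  ?region demo:population ?population .
}
ORDER BY DESC(?population)
</code>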
{{ :public:corese:capture_d_e_cran_2014-07-25_a_14.11.18.png?direct&500 |}}
== Results provenance ==
{{ :public:corese:capture_d_e_cran_2014-07-25_a_14.53.46.png?direct&500 |}}
== DQP live monitoring ==
{{ :public:corese:capture_d_e_cran_2014-07-25_a_14.12.33.png?direct&500 |}}
{{ :public:corese:capture_d_e_cran_2014-07-25_a_14.12.38.png?direct&500 |}}
{{ :public:corese:capture_d_e_cran_2014-07-25_a_14.12.24.png?direct&500 |}}
{{ :public:corese:capture_d_e_cran_2014-07-25_a_14.13.19.png?direct&500 |}}
===== Source code organization =====
  * Server :
    * SPARQL Endpoint
    * WebApp
  * DQP :
    * QueryProcessDQP
    * Metaproducer
    * RemoteProducer
    * ServiceProvider
    * QueryRewriting (Path / Named Graphs / Services)
    * Provenance
    * Cost
    * Triple Pattern grouping