SPARQL endpoint and interoperability for enviroCar data


Introduction

This is the wiki page for the SPARQL endpoint project, a Google Summer of Code 2015 project to create an endpoint for enviroCar's Linked Open Data. For this project I will be enhancing the enviroCar server to support a SPARQL endpoint. The idea is to connect open data sources and provide a standard for all useful information coming out of the web. The project will make heavy use of Apache Jena. You can follow my progress on my GitHub repo, where I modify the server code: https://github.com/gotodeepak1122/enviroCar-server . I hope you find this project as exciting as I do, and I will try to write the code to be as object-oriented and reusable as possible. Feel free to collaborate or discuss new ideas about the project with me at any time.

Weekly Reports

Week 1

Status

  • Wrote a module for writing the linked data we receive into Apache Jena Fuseki, a triple store (a minimal sketch follows this list)
  • Added IntelliJ .gitignore entries to the local repo (repo is pull-request ready)
  • Considered making the inserter generic rather than Fuseki-specific, so the module stays reusable if we switch to another triple store in the future
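
In outline, such a Fuseki writer could look like the sketch below, using Jena's RDFConnection API; the dataset URL and sample track URI are hypothetical, and the actual module may use a different Jena API.

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdfconnection.RDFConnection;
    import org.apache.jena.rdfconnection.RDFConnectionFactory;
    import org.apache.jena.vocabulary.RDF;

    public class FusekiWriter {
        // Hypothetical Fuseki dataset URL; the dataset name is configurable.
        private static final String FUSEKI_URL = "http://localhost:3030/enviroCar";

        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            // Toy triple standing in for converted enviroCar data.
            model.createResource("http://envirocar.org/api/tracks/example")
                 .addProperty(RDF.type,
                     model.createResource("http://envirocar.org/ontology/Track"));
            try (RDFConnection conn = RDFConnectionFactory.connect(FUSEKI_URL)) {
                conn.load(model); // append the model to the default graph
            }
        }
    }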

Problems faced

  • Fuseki occasionally throws unexpected write exceptions (needs a cleanup procedure)

Next

  • Plug the triple-store writer into the server code
  • Ask for a development server to clone the database, as the DB syncing features will require DB testing
  • Test the Fuseki endpoint for all possible queries and loopholes

Week 2

Status

  • Uploaded a repository that takes in sample enviroCar data and stores it in a triple store
  • Read about the web and data services in the enviroCar server and looked for places to add an event for data storing
  • Making sample database dumps to test the module for cloning the existing data into Fuseki

Problems

Next

  • Create a process that runs on the existing enviroCar database and extracts the entities from it
  • Learn to use the existing RDF converter of the enviroCar server

Week 3

Status

  • Writing a module to clone data from MongoDB into Java, and making it easier to integrate with other databases (a minimal sketch follows this list)
  • Making the module re-runnable so that it clones only new, not-yet-cloned data
  • Progress on the cloner can be found here, and the DataDump here
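
A minimal driver-level sketch of such a cloner follows; the connection string, database, and collection names are hypothetical, and the actual module integrates with Morphia rather than using the raw driver.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class MongoCloner {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> tracks =
                    client.getDatabase("enviroCar").getCollection("tracks");
                for (Document track : tracks.find()) {
                    // Each document would be mapped to a POJO / RDF model here;
                    // remembering the last seen _id lets re-runs skip old data.
                    System.out.println(track.getObjectId("_id"));
                }
            }
        }
    }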

Problems

  • The code is written in a functional style and needs to be integrated with Morphia
  • Morphia joins may or may not survive the cloning

Next

  • Convert the remaining entities into POJOs and a representable format
  • Make the data writable to the triple store by converting it with the necessary linkers

Week 4

Status

Problems

  • Decoupling the processes from the frameworks being used

Next

  • Maintain extensive documentation for newly created designs
  • Implement MongoLoader
  • Ask for dev servers to test the module with more enviroCar data

Week 5

Status

  • Implemented the Loader for converting data from MongoDB to Fuseki; the code can be found here
  • Fitted all the pieces together to get a sample endpoint running on the datastore server (link)

Problems

  • The running documentation and logging need improvement
  • The software does not have an install script yet; installation has to be done manually

Next

  • Conversion works only for tracks; the other entities still need to be converted
  • Explore new data sources to add to the linkers of enviroCar data
  • Plug the running module into the enviroCar server, or a simulated enviroCar server environment, and check for any bugs or exceptions that pop up
  • Plug in another triple store and try to store data there to check the extensibility of the module

Week 6

Status

Problems

  • Getting stuck on the SSN linking conversions
  • Needs a few configuration modifications to run on the demo server

Next

  • Upload the Apache Shiro shiro.ini file to make changes on the demo server
  • Add a video screencast to the project
  • Check for MongoDB 3.0 compatibility
  • Complete the linkers for the other entities

Week 7

Status

  • Added a different method of inserting into Fuseki using the SPARQL protocol (a minimal sketch follows this list)
  • Discussed authentication with the mentors and org admins, and got up to speed with Apache Shiro
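
A minimal sketch of a protocol-based insert using Jena's update API; the endpoint URL and the sample triple are hypothetical.

    import org.apache.jena.update.UpdateExecutionFactory;
    import org.apache.jena.update.UpdateFactory;
    import org.apache.jena.update.UpdateProcessor;
    import org.apache.jena.update.UpdateRequest;

    public class SparqlProtocolInsert {
        public static void main(String[] args) {
            // Fuseki exposes the update service at <dataset>/update by default.
            String updateEndpoint = "http://localhost:3030/enviroCar/update";
            UpdateRequest request = UpdateFactory.create(
                "PREFIX ec: <http://envirocar.org/ontology/> " +
                "INSERT DATA { <http://envirocar.org/api/tracks/example> a ec:Track }");
            UpdateProcessor processor =
                UpdateExecutionFactory.createRemote(request, updateEndpoint);
            processor.execute(); // sent over the SPARQL 1.1 protocol
        }
    }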

Problems

  • The lack of authorization is causing deployment problems one way or the other
  • The server doesn't seem to allow mongo client commands for non-sudo users

Next

  • Upload the dataset to the development server
  • Solve the authentication problem as quickly as possible

Week 8

Status

  • Proof-tested the authentication methods: OAuth, a reverse proxy (mentor-aided), and Shiro
  • Installed Silk and started interlinking new data sources with an LSL configuration
  • Found a version of MongoDB that works with the existing code and can be installed from source

Problems

  • None so far

Next

  • Focus initially on interlinking datasets with the GeoNames dataset until Daniel comes back
  • Keep trying Shiro to proof-test the only non-sudo way of getting authentication on the server
  • Make the endpoint JAR configurable and easy to run

Week 9

Status

  • Downloaded the behemoth 8.5 GB GeoNames dataset and tried linking existing names to GeoNames resources with the owl:sameAs property (a minimal sketch follows this list)
  • Ran Silk on the GeoNames RDF dump and the DBpedia endpoint to find links between the enviroCar dataset and these datasets
  • Installed a script (requires root login) that uses iptables to block all URLs unsafe for public use and gives access to the endpoint via a reverse-proxy method
  • Made the software read from a config file, so parameters can easily be changed on a server without needing a JDK
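
A minimal Jena sketch of the owl:sameAs linking; the enviroCar place URI is hypothetical, and the GeoNames resource shown (Münster) is purely illustrative.

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.vocabulary.OWL;

    public class SameAsLinker {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            // Link a (hypothetical) enviroCar place to its GeoNames resource.
            model.createResource("http://envirocar.org/api/places/muenster")
                 .addProperty(OWL.sameAs,
                     model.createResource("http://sws.geonames.org/2867543/"));
            model.write(System.out, "TURTLE");
        }
    }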

Problems

  • The GeoNames dump and the DBpedia endpoint are far too massive for Silk to start threads and produce links in a runnable amount of time
  • The GSoC demo server requires iptables to be installed
  • My existing test dataset contains far too little information on cities or any GeoNames-related data, and shows no results from the dump files for the API's GET request
  • LinkedGeoData's RDF can be linked directly to measurements via the POINT in the geometry features, but this would require putting the POINT resources into a dul:hasMember on measurements, which isn't currently in the enviroCar schema
  • Linking tracks is a little tricky schema-wise and would require tracks -> measurements -> points -> linked (sameAs) point, which adds a layer to the schema that needs to be approved

Next

  • Consider writing a GeoNames parser that would crawl and extract only the RDF links and their descriptors, using a key-value pair mapping to reduce the breadth of the endpoint
  • Work on getting the LinkedGeoData set working
  • Get the config file working on the server, as it currently does not behave as it does on the local machine
  • Read more about possible security loopholes and possibly suggest a better way of securing the data on Fuseki using NTLM, Kerberos, or TLS

Week 10

Status

  • Installed MongoDB from source on the demo server and uploaded the test data
  • Outlined methods to write into a triple store using a normal SPARQL query by appending the query to a file
  • Tried inserting, querying, putting, and deleting data in Fuseki using both the Graph Store Protocol and SPARQL queries; SPARQL queries seem to be the more fail-safe way (a comparison sketch follows this list)
  • cURL and normal HTTP POST results for the queries seem to fail depending on the SPARQL server, so I outlined all test cases with tests for the major implementations (OpenLink Virtuoso, Jena Fuseki)
  • Tried deploying the code to the demo server; everything seems fine except Fuseki, which would need a non-default configuration file to use the --update option rather than fuseki.war
  • On certain browsers, such as Safari, the username/password pop-up can simply be closed without entering anything and Fuseki can still be accessed, so Basic auth can be vulnerable
  • Uploaded the Silk jobs to the Silk Workbench and continued working on linking the latitudes and longitudes of the enviroCar server's measurements; only the source path needs to be extracted correctly before linking can begin
  • Wrote initial plans for the final deployable endpoint's process flow and a script that manages everything: configuring Fuseki and MongoDB, extracting data from MongoDB, reading from a temporary file and preparing SPARQL UPDATE or INSERT queries, running the Silk JARs with an LSL configuration and modifying the RDF dumps, and then calling the SPARQL update query over HTTP. All these tasks should be performed in one package or executable (ideally run as a cron job to keep the endpoint updated)
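
A minimal sketch contrasting the two write paths via Jena's RDFConnection; the dataset URL, graph name, and triple are hypothetical.

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdfconnection.RDFConnection;
    import org.apache.jena.rdfconnection.RDFConnectionFactory;

    public class GspVersusUpdate {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            try (RDFConnection conn =
                     RDFConnectionFactory.connect("http://localhost:3030/enviroCar")) {
                // Graph Store Protocol: replaces a named graph wholesale;
                // behaviour can differ between server implementations.
                conn.put("http://envirocar.org/graphs/tracks", model);
                // SPARQL 1.1 Update: plain queries, more portable across servers.
                conn.update("INSERT DATA { <http://example.org/s> "
                    + "<http://example.org/p> <http://example.org/o> }");
            }
        }
    }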

Problems

  • POST requests with Basic auth do not seem to pass, while similar cURL HTTP SPARQL updates work (a sketch of the request shape follows this list)
  • Fuseki deployed as a WAR file can't add new datasets from the admin panel and hence needs to be configured from a configuration file with a writable location for TDB storage
  • The SPARQL Graph Store Protocol (SPARQL 1.1) seems to need fine-tuning for each different SPARQL server, so I considered backporting from HTTP requests to generating plain SPARQL queries
  • Can't use SPARQL over HTTP (SOH): although it works, it requires Ruby to be installed as a dependency
  • Might need to create a new Shiro filter even after loading the reverse proxy to support authentication, but that might cause problems for the SPARQL update queries
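
For reference, this is a minimal sketch of a SPARQL update POSTed with Basic auth from plain Java; the endpoint and credentials are hypothetical.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class BasicAuthUpdate {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:3030/enviroCar/update");
            String credentials = Base64.getEncoder()
                .encodeToString("user:password".getBytes(StandardCharsets.UTF_8));
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Authorization", "Basic " + credentials);
            conn.setRequestProperty("Content-Type", "application/sparql-update");
            conn.setDoOutput(true);
            String update = "INSERT DATA { <http://example.org/s> "
                + "<http://example.org/p> <http://example.org/o> }";
            try (OutputStream os = conn.getOutputStream()) {
                os.write(update.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("HTTP " + conn.getResponseCode()); // expect 200/204
        }
    }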

Next

  • Deployment and configuration of Fuseki on remote servers (this will solve half of the problems)
  • Find out how Fuseki deals with SPARQL queries when authentication is used
  • Work out the Silk Workbench configurations and transform them into the LSL XML configuration file
  • Deploy everything from source on a server to emulate the difficulties, and request immediate access to deploy on the final server when it is ready, given the large number of configurations and dependency files the final script has to deal with
  • Write a unified SPARQL client that builds the SPARQL queries purely via Java file handling

Week 11

Status

  • Wrote LSL config files to link data between enviroCar and LinkedGeoData; found some issues and reported them to the Silk community
  • Explored the Jena Fuseki source code, trying to figure out how to modify the software and deploy a WAR tailored to our existing needs; the main areas of focus were the default Shiro localhost filter and the handling of SPARQL updates
  • Completed more RDF linkers; generating dumps for all entities is almost done
  • Went back through my code completely to rewire it for deployment flexibility

Problems

  • Only the existing erratic-behaviour issues with Silk

Next

  • Deploy all the flexibility work alongside the existing features
  • Determine which new features take priority in finishing up
  • Put the auth problem on hold and instead focus on completing code features, documentation, and cleanup
  • Scan the code and ensure the highest quality for all code deployed since project inception

Week 12

Status

  • Fixed the missing base URL error with the served linked data
  • A few RDF linkers are left; penned down thoughts about the final data consistency
  • Worked extensively on solving the Silk errors; resolved many, but issues remain before interlinking can happen. As of now, the data is parsed but 0 entities are getting cached

Problems

  • Entities are not getting referenced in Silk even though the syntax, after revisiting all the documentation, seems to be correct

Next

  • Make a video showcasing all features
  • Keep the DB-sync base and start configuring the final script and deployment instructions
  • Solve the Silk errors and find out why entities are not getting cached
  • Extensive documentation

Week 13

Status

  • Checked whether GeoSPARQL can be used to help interlink the data with LinkedGeoData's RDF without needing Silk
  • Awaiting a reply from the Silk community to the problem that is keeping us stuck in Silk
  • Scanning through the Silk source (in Scala) to find out what the problem is and why no measurement entities are being discovered with our input paths
  • Embarked on a Java project that can compare two points, one WGS84 (enviroCar) and one GeoSPARQL (LinkedGeoData), and get the distance between them; it can be used for static dumping until Silk provides a better solution. Trying it with GeoCalc (a minimal sketch follows this list)
  • Converted all scripts to run inside Java using the Runtime class so that execution is easy with only one JAR file
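
As a self-contained sketch of the distance comparison: the project tries the GeoCalc library, but the same idea with a plain haversine formula looks like this (the sample coordinates are illustrative).

    public final class Wgs84Distance {
        private static final double EARTH_RADIUS_M = 6_371_000.0;

        /** Great-circle (haversine) distance in metres between two lat/lon points. */
        public static double haversine(double lat1, double lon1,
                                       double lat2, double lon2) {
            double dLat = Math.toRadians(lat2 - lat1);
            double dLon = Math.toRadians(lon2 - lon1);
            double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                     + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                     * Math.sin(dLon / 2) * Math.sin(dLon / 2);
            return 2 * EARTH_RADIUS_M * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        }

        public static void main(String[] args) {
            // Two nearby points (approximate coordinates in Münster).
            System.out.printf("%.1f m%n", haversine(51.9607, 7.6261, 51.9620, 7.6300));
        }
    }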

Problems

  • Sometimes the user directory picked up by Java depends on where 'mvn clean install' is run, and hence the config file is not picked up; an unusual case, but it can be error-proofed by getting the project base directory with a static block (a sketch follows this list)
  • Need to produce a faster traversal method for scraping through the dump and collecting the WGS84 information
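
A sketch of the static-block fix mentioned above; the class and config file names are hypothetical.

    import java.io.File;
    import java.nio.file.Path;

    public final class Config {
        // Resolve the config file relative to the code location rather than the
        // working directory, so it is found no matter where the JAR is run from.
        private static final Path CONFIG_FILE;

        static {
            try {
                File codeLocation = new File(Config.class.getProtectionDomain()
                    .getCodeSource().getLocation().toURI());
                Path baseDir = codeLocation.isFile()
                    ? codeLocation.toPath().getParent()  // running from the JAR
                    : codeLocation.toPath();             // running from target/classes
                CONFIG_FILE = baseDir.resolve("endpoint.properties");
            } catch (Exception e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        public static Path configFile() {
            return CONFIG_FILE;
        }
    }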

Next

  • Migrate the JAR and the module to run on actual enviroCar data and fix any issues that pop up with the real data, so it is up and running, can produce real-world dumps, and can upload them
  • Some parts of my video are a bit soft-spoken, as per feedback from Ann; I shall re-record it to make it better
  • Explore the enviroCar server's production code, see what configurations are present, and work out how we can run Fuseki and our modules to effectively create a live SPARQL endpoint, working with the mentors Daniel has assigned for this (Eike and Christoph, if I'm not mistaken)
  • Do everything left that is required to get the Summer of Code repository merged with the enviroCar-server master on GitHub
  • Work out possible steps to convert the RDF into GeoSPARQL (which would be fantastic for applications), as discussed with Carsten; only the implementation is left
  • Try working on issues on the enviroCar-server GitHub to get a better understanding of the server code, to improve the quality of the code in my repository and for better integration

