Open Sensor Search
Student: Mohammad Ahmed Hamed Yakoub
Mentors: Simon Jirka, Daniel Nüst
The project follows an agile software development method, so we do not have a fixed set of tasks. We use Scrum (to a certain degree) with one-week sprints and telcos on Mondays and Fridays.
- This week I explored the code to understand its different components. Since our OSS depends on the current implementation of SIR, we needed extensive integration tests to make sure those components work as expected.
- First the code was moved to a GitHub account, and Daniel and I forked our own repos.
- The test cases were all implemented according to the F.I.R.S.T. methodology using JUnit 4.
- The following tests were implemented:
We found two bugs in the system:
- The configuration and the system variables weren't set up in a way that allows writing integration tests.
- The insertSensorInfo method doesn't work as expected: calls to it don't insert new sensors.
This week we concentrated on the data backend. We implemented the following:
- An indexing mechanism for the sensors' keywords, location and bounding box (BBox) center.
- JMeter test cases for the following two scenarios:
- Autocomplete servlet.
- Apache Solr search test, to be compared to the existing PGSQL implementation.
- Unit tests were implemented and run for all of the previous tasks.
Results for Next weeks
- Apache Solr contains a stored index of sensor keywords, locations and BBox centers.
- Unit tests were verified and run successfully.
- Complete JMeter test cases.
- BBox wasn't available as a supported data type in Apache Solr, so we have to represent it by a position (the center) and modify the implementations accordingly.
- The JMeter tests weren't conclusive due to the lack of sensor test files, so a set of test sensors is to be generated via a harvesting mechanism or via random dummy generation.
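Reducing a bounding box to its center, as described in the notes above, can be sketched as follows. This is a minimal illustration and not the project's actual code: the class name, the method name and the handling of boxes that cross the antimeridian are assumptions.

```java
/** Hypothetical helper: reduces a bounding box to its center point for indexing. */
final class BoundingBoxCenter {

    /** Returns {latitude, longitude} of the center of the given box (degrees). */
    static double[] center(double south, double west, double north, double east) {
        if (south > north) {
            throw new IllegalArgumentException("south must not exceed north");
        }
        double lat = (south + north) / 2.0;
        // A box with west > east crosses the antimeridian, so unwrap its width first.
        double width = east >= west ? east - west : east + 360.0 - west;
        double lon = west + width / 2.0;
        if (lon > 180.0) {
            lon -= 360.0; // wrap back into [-180, 180]
        }
        return new double[] {lat, lon};
    }

    public static void main(String[] args) {
        double[] c = center(51.0, 7.0, 52.0, 8.0);
        System.out.println("lat=" + c[0] + ", lon=" + c[1]);
    }
}
```

Indexing only the center loses the extent of the box, so a search that should match any sensor whose box intersects a region needs an additional radius or tolerance on the query side.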
All of this week's commits were made on the solr branch of the GitHub repo and can be found at: https://github.com/Moh-Yakoub/OpenSensorSearch/commits/solr
This week we completed the indexing and storing of all sensor metadata in Apache Solr. We added the following fields:
- Classification values from the discovery profile.
- Valid time.
- Interface values.
- Inputs, outputs and measurements.
- Identification values.
For each of the previous fields there was a unit test that indexes, stores and retrieves the field and tests the autocompletion.
To solve the lack of sensors, we generated sensors in two ways:
- Dummy test sensors: generated randomly with the help of the site http://www.databasetestdata.com/. The test data file is at: https://github.com/Moh-Yakoub/OpenSensorSearch/blob/solr/52n-sir/src/test/resources/data/randomSensors.json
- We configured the harvest service to insert sensors into the Solr backend as well.
- We ran some tests with JMeter to compare results.
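Random dummy sensors like the ones mentioned above could also be produced locally. The following is a minimal sketch under stated assumptions: the JSON template, the field names and the keyword list are hypothetical, and the real randomSensors.json layout may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Random;

/** Hypothetical generator for random test sensors; not the project's own tool. */
final class DummySensorGenerator {
    // Assumed template; the fields are illustrative only.
    private static final String TEMPLATE =
            "{\"id\":\"%s\",\"keyword\":\"%s\",\"lat\":%.4f,\"lng\":%.4f}";
    private static final String[] KEYWORDS = {"temperature", "humidity", "pressure", "wind"};

    static List<String> generate(int count, long seed) {
        Random random = new Random(seed); // a fixed seed keeps the test data reproducible
        List<String> sensors = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            String keyword = KEYWORDS[random.nextInt(KEYWORDS.length)];
            double lat = -90.0 + 180.0 * random.nextDouble();
            double lng = -180.0 + 360.0 * random.nextDouble();
            // Locale.ROOT guarantees a '.' decimal separator regardless of the machine locale.
            sensors.add(String.format(Locale.ROOT, TEMPLATE, "sensor-" + i, keyword, lat, lng));
        }
        return sensors;
    }

    public static void main(String[] args) {
        for (String sensor : generate(3, 42L)) {
            System.out.println(sensor);
        }
    }
}
```

Using a seed means that two performance runs against different backends see exactly the same documents.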
TODOs for the next week
- The harvest service didn't work because it depends on validating the sensors, and validation fails for all sensors without classifications.
- The JMeter tests had to be modified to test against the ordinary SIR interface.
- Include the SIR interface in the JMeter test to compare the PGSQL backend with the Apache Solr one.
- The autocompletion should accept any combination of indexed fields, in any order.
- The test sensor dataset should be bigger (~1000 sensors) for good comparisons.
- This week we concentrated on performance testing of the OSS with the Solr and PGSQL backends. We used JMeter for the following tests:
- OpenSearch binding with Solr backend.
- OpenSearch binding with PGSQL backend.
- Performance test of the OpenSearch autocompletion.
The results can be found here
- We implemented the InsertSensorInfoRequest so that inserted sensors are added to both the Solr and PGSQL backends.
- We implemented template-based dummy sensor generation for testing purposes.
- We implemented searching by all indexed fields for autocompletion and search purposes.
- We extended the implementation of the XmlListener so that it returns the sensor results if they contain persistent data fields.
- We allowed the autocompletion to use multiple words for different indexed fields, for a Google-like autocompletion UI.
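The multi-word autocompletion over several indexed fields can be illustrated with a small in-memory sketch. It is not the Solr-based implementation: the class and its methods are hypothetical, and plain prefix matching stands in for whatever query Solr actually runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical in-memory stand-in for multi-field, multi-word autocompletion. */
final class MultiFieldAutocomplete {
    // Each entry holds the indexed terms of one sensor, taken from any of its fields.
    private final List<List<String>> sensorTerms = new ArrayList<>();

    void add(String... terms) {
        List<String> lower = new ArrayList<>();
        for (String term : terms) {
            lower.add(term.toLowerCase(Locale.ROOT));
        }
        sensorTerms.add(lower);
    }

    /** Returns the sensors for which every query word is a prefix of some term, in any order. */
    List<List<String>> suggest(String query) {
        String[] words = query.trim().toLowerCase(Locale.ROOT).split("\\s+");
        List<List<String>> hits = new ArrayList<>();
        for (List<String> terms : sensorTerms) {
            if (matchesAll(terms, words)) {
                hits.add(terms);
            }
        }
        return hits;
    }

    private static boolean matchesAll(List<String> terms, String[] words) {
        for (String word : words) {
            boolean found = false;
            for (String term : terms) {
                if (term.startsWith(word)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        MultiFieldAutocomplete index = new MultiFieldAutocomplete();
        index.add("temperature", "muenster", "celsius");
        index.add("humidity", "berlin");
        System.out.println(index.suggest("muen temp"));
    }
}
```

Because each word is checked independently, "muen temp" and "temp muen" return the same sensors, which is the order-independent behaviour the autocompletion is meant to have.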
Tasks for next weeks
- The harvestSensor is still not working; the server log indicates a problem related to the SirConfigurator.
- The listeners don't support all of the different data representations for sensor data (JSON, feeds, HTML, ...).
- Work on the harvest sensor bug.
- Implement the listeners for the different data types.
This week we took two important steps in the implementation of the OSS:
- Merge Daniel's and my development branches, so that we have a shared codebase and I can use his implementation of the unique sensor ID generator.
- Implement temporal and spatial search so that it complies with the temporal and spatial search specifications for OpenSearch.
1) For the merge, I merged my solr branch into the local master and then merged the result with Daniel's master, which included the following modifications:
- Unique sensor ID generator.
- License header modifications.
2) The second part included the implementation of the following parts:
- Temporal search built on the dtstart and dtend fields of the SensorDescription.
- Change the SolrSearchSensorDao to support searching by the valid range given by those two fields, and to filter query results based on temporal data.
- Implementation of the following unit tests: keyword+temporal for the search sensor DAO, keyword+temporal+spatial for the search sensor DAO, and the integration test of search by all fields with temporal filtering.
- The SIR interface has a different schema for search criteria, so we transformed the SIRSearchCriteria to SearchCriteria to comply with both standards.
- Spatial search built on the location, bbox_x (bounding box center longitude) and bbox_y fields of the SensorDescription. The search uses those fields together with a radius to implement a full spatial search based on bounding box and location.
- Change the SolrSearchSensorDao to include the search methods SearchByGeoBox, SearchByGeoBoxWithQuery and SearchByLocation (lng/lat/radius).
- Implementation of the following unit tests: keyword+spatial for the search sensor DAO, keyword+temporal+spatial for the search sensor DAO, temporal+spatial, and an integration test of search by all fields with spatial filtering for the OpenSearch interface.
- The SIR interface has a different schema for search criteria, so we transformed the SIRSearchCriteria to SearchCriteria to comply with both standards. An incomplete integration test was also implemented for the temporal extension.
Other tasks:
Notes for next weeks
Tasks for next weeks
- Start working on the harvesting mechanism.
- Implement the listeners for HTML, Feed and KML for the OSS interface.
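The temporal and spatial filters from this week can be illustrated with plain Java. In the project the filtering presumably happens inside Solr; the sketch below swaps in simple in-memory checks (interval overlap and the haversine great-circle distance) to show the logic. All names are hypothetical, and times are assumed to be epoch milliseconds.

```java
/** Hypothetical in-memory version of the temporal and radius filters. */
final class SpatioTemporalFilter {
    private static final double EARTH_RADIUS_KM = 6371.0;

    /** True if the sensor's valid range [dtstart, dtend] overlaps the queried range. */
    static boolean overlaps(long dtstart, long dtend, long queryStart, long queryEnd) {
        return dtstart <= queryEnd && queryStart <= dtend;
    }

    /** Great-circle distance in kilometres between two points (haversine formula). */
    static double distanceKm(double lat1, double lng1, double lat2, double lng2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLng = Math.toRadians(lng2 - lng1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLng / 2) * Math.sin(dLng / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** True if the sensor location lies within radiusKm of the query point. */
    static boolean withinRadius(double sensorLat, double sensorLng,
                                double queryLat, double queryLng, double radiusKm) {
        return distanceKm(sensorLat, sensorLng, queryLat, queryLng) <= radiusKm;
    }

    public static void main(String[] args) {
        System.out.println(overlaps(10L, 20L, 15L, 30L));
        System.out.println(withinRadius(51.96, 7.63, 52.0, 7.6, 10.0));
    }
}
```

A combined keyword+temporal+spatial search then amounts to keeping only those keyword hits for which both `overlaps` and `withinRadius` hold.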
This week the work covered the following themes:
- The harvesting mechanism.
- The sandbox of the harvesting scripts.
- Harvesting scheduling.
- Dependency injection using Guice.
1) The harvesting mechanism
- The DB script was modified so that harvest scripts are persisted for later use. It stores the user, scriptId, version, lastRunTime and last update, and a DAO was implemented accordingly.
- A form was implemented that lets users upload their scripts for harvesting.
- The harvesting uses a RESTful interface built on Jersey and Guice.
2) The sandbox of harvesting
To protect the server from the scripts, we sandbox them using Rhino's class shutter, which restricts the classes that a script can access. A class was implemented for that functionality; it allows access only to the insertion DAO.
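The allow-list idea behind the sandbox can be sketched as below. The class mirrors the single method of Rhino's `org.mozilla.javascript.ClassShutter` interface, `visibleToScripts(String)`, but deliberately does not implement it so that the example compiles without Rhino on the classpath. The class name and the DAO class name are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical allow-list with the same shape as Rhino's ClassShutter. */
final class AllowListShutter {
    private final Set<String> allowed;

    AllowListShutter(String... classNames) {
        this.allowed = Collections.unmodifiableSet(new HashSet<>(Arrays.asList(classNames)));
    }

    /** Scripts may only see classes that were explicitly allowed. */
    public boolean visibleToScripts(String fullClassName) {
        return allowed.contains(fullClassName);
    }

    public static void main(String[] args) {
        // "example.InsertionDao" is a placeholder for the project's insertion DAO class.
        AllowListShutter shutter = new AllowListShutter("example.InsertionDao");
        System.out.println(shutter.visibleToScripts("example.InsertionDao"));
        System.out.println(shutter.visibleToScripts("java.io.File"));
    }
}
```

With Rhino available, the same predicate would be declared as `implements ClassShutter` and registered on the script `Context` via `setClassShutter` before the script is evaluated.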
3) Harvest scheduling
The scheduling relies on the Quartz scheduler, which lets the server run harvest scripts at given times by implementing jobs. A listener initializes the service at server startup, and users can call RESTful routes to have their sensors harvested at certain times. The Quartz functionality is not yet complete and needs some modifications.
4) Dependency injection using Guice
Dependency injection using Guice was the most important part of the implementation, since it allows better testing and injects all the dependencies. The listener and the Guice configuration were implemented to run at startup.
Tasks for the next week
- Configuring Guice with Tomcat was poorly documented and required a lot of searching to find the right configuration.
- The sandbox only restricts classes; we need to extend it to restrict method access as well, even for the Java standard library.
- The scheduler is not fully implemented and still has to be covered by unit and integration tests.
- The harvest scripts DAO needs to support searching for a script by ID and updating the script version.
- This week we worked more on the harvesting mechanism. Mainly, we implemented harvesters for different sensor data sources:
- A sensor data harvester for sensors stored in the SmartCitizen platform.
- A sensor data harvester for sensors stored in an OWS service.
The problem we faced was letting developers use libraries like jQuery and other JS frameworks. We tried an environment like env.js, but ran into many bugs and couldn't install it properly, so we wrote an HTTP utility class to be called by JS scripts. All of the previous tasks were tested using unit and integration tests.
Tasks for the next weeks
- Using libraries like jQuery wasn't successful, so we used Java code instead for HTTP request tasks.
- The Quartz mechanism wasn't tested properly (i.e. checking that the sensor was harvested at the indicated time).
- Search for an alternative way to use jQuery and other JS frameworks, or use a different harvesting mechanism.
- Implement proper unit and integration tests for the scheduling mechanism to make sure that it works as expected.
This week we worked on the following tasks:
- Quartz scheduling testing: a unit test was implemented to make sure that all of the units involved in the scheduling work as expected, including the binding, the job and the DAO.
- The alternative harvesting mechanism was implemented completely, including the RESTful methods, the harvest job and the DAO.
- An initial unit test was implemented for the alternative mechanism.
- The unit tests use mocks built with the WireMock library.
- The initial UI for the web interface was implemented using Twitter Bootstrap as the UI library.
The initial UI that will be used for the whole harvesting web app
Tasks for the next weeks
- The unit tests need to be rewritten, as most of the units were not as modular as necessary.
- The Jersey framework responses were always chunked, i.e. sent as a stream, which didn't work with our Ajax calls. I had to send the Content-Length header manually.
- Complete the implementation of the UI.
- Improve the poorly modularized units and their unit tests.
- Use a framework (MVC, Play, Spring, ...) for the implementation of the UI, especially authentication and views.
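Setting Content-Length by hand, as mentioned in the notes above, requires the length of the encoded body in bytes rather than in characters. A minimal sketch, assuming UTF-8 responses; the class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that computes the value for a manual Content-Length header. */
final class ContentLengthExample {

    /** Content-Length counts the bytes of the encoded body, not its characters. */
    static int contentLength(String body) {
        return body.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String body = "{\"org\":\"52\u00b0North\"}";
        System.out.println(body.length() + " chars, " + contentLength(body) + " bytes");
    }
}
```

In a JAX-RS resource the value would then be attached to the response, for example through the `header` method of the response builder, with the same byte array used as the entity so that header and body cannot disagree.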
Alternative harvesting technique
- Another alternative, which depends on a RESTful mechanism, is shown in the following diagram:
This figure shows an alternative harvesting mechanism.
The mechanism works as follows:
- The harvest script developer implements a server, which is a RESTful web service with certain specified routes.
- The harvest developer makes a POST call to the OSS with the URL of the web service.
- The OSS returns the script ID and an auth_token to the user. The ID identifies the sensor and the auth_token authorizes the user's access to it.
- The OSS then makes a harvest call to the server and returns the result.
- The user can then make GET, POST and DELETE calls to the OSS using the ID and auth_token to update, reharvest and delete the script.
The harvesting mechanism can support the following methods on the OSS side:
- POST /harvest/url=... : lets the OSS harvest a URL; returns the ID and auth_token.
- POST /harvest/ : reharvests an inserted script; takes one parameter, the ID.
- DELETE /harvest : deletes a script from the scripts list; takes two parameters, ID and auth_token.
- GET /harvest/:id : checks the state of the sensor harvesting process (pending, successfully harvested, failure on harvesting).
- GET /harvest/:id/status : a description of the process output.
- POST /harvest/schedule : schedules a harvest; params: id (script ID), date (time to harvest); returns a job ID.
- GET /harvest/job/status/:id : gets the status of a scheduled job.
The developer's role is to implement a RESTful web service with the following route(s):
- GET /sensors : returns a JSON- or XML-encoded array of sensors. The developer can draw the data from different sources, as indicated in the diagram above (a DB, custom sensors, an HTTP call to harvest platforms like SmartCitizen, ThingSpeak, and so on).
Advantages of this approach:
- Lets users develop their harvesting scripts in any language and still add their sensors to the OSS.
- Decouples the harvesting mechanism: the server implements only a thin harvesting layer, and we avoid interpreting and loading code from another language into Java, which is insecure, inefficient and not very stable.
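The OSS side of this protocol (register a URL, hand out an ID and auth_token, check state, delete with the token) can be sketched with an in-memory registry. This is an illustration only: the class, the state names and the use of random UUIDs as tokens are assumptions, and no HTTP layer or persistence is shown.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory model of the harvest registration flow. */
final class HarvestRegistry {
    enum State { PENDING, HARVESTED, FAILED }

    static final class Registration {
        final String id;
        final String authToken;
        final String url;
        volatile State state = State.PENDING;

        Registration(String id, String authToken, String url) {
            this.id = id;
            this.authToken = authToken;
            this.url = url;
        }
    }

    private final Map<String, Registration> byId = new ConcurrentHashMap<>();

    /** Models the POST that registers a service URL; returns the ID and auth_token. */
    Registration register(String url) {
        Registration r = new Registration(
                UUID.randomUUID().toString(), UUID.randomUUID().toString(), url);
        byId.put(r.id, r);
        return r;
    }

    /** Models GET /harvest/:id; returns null for an unknown ID. */
    State status(String id) {
        Registration r = byId.get(id);
        return r == null ? null : r.state;
    }

    /** Models DELETE /harvest; only the holder of the matching token may delete. */
    boolean delete(String id, String authToken) {
        Registration r = byId.get(id);
        if (r == null || !r.authToken.equals(authToken)) {
            return false;
        }
        return byId.remove(id, r);
    }

    public static void main(String[] args) {
        HarvestRegistry registry = new HarvestRegistry();
        Registration r = registry.register("http://example.org/sensors");
        System.out.println(registry.status(r.id));
        System.out.println(registry.delete(r.id, "wrong-token"));
        System.out.println(registry.delete(r.id, r.authToken));
    }
}
```

Keeping the token separate from the ID means the ID can be shared, for example in status URLs, without giving others the right to reharvest or delete the entry.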
Week 9 & 10
In these weeks we completed work on the OSS-UI for uploading both scripts and remote servers using Spring MVC. We implemented a complete authentication and permission level system. The following screens were implemented and tested:
- Login with success/failure indication
- Remote sensor upload URL
- Remote sensor schedule harvesting
Fig - Sign in window
Fig - Wrong authentication
Fig - Remote server upload
Fig - Remote server with auth_token returned
Fig - Script uploaded message
- The two harvesting jobs, remote harvest and script harvest, were rewritten to be unit testable.
- Unit tests for the previous harvest jobs were implemented.
- Daniel worked on many parts of SIR, changing its configuration to Guice and its name to OSS. He made a pull request, and I later merged the harvestCallback branch with his master to get a shared codebase.
Tasks for the next weeks
- An authentication method needs to be implemented for script uploading so that each user can access only his/her sensors; a cross-domain policy will not help in that case.
- The scripts need to include license headers.
- An authentication method needs to be implemented for script uploading so that each user can access only his/her sensors.
- Implement a useful API for users to validate and convert their sensors.
This week we concentrated on the OSSUI. We implemented the license agreement check and the automatic appending of the license to the uploaded script.
Fig - How the license agreement works in the OSSUI
2 - The user access DAO was enhanced to allow auth token authentication for all scripts. A user access resource was implemented and tested. The scenario works as follows:
- The user logs in using username and password.
- On successful authentication an auth token is returned.
- Whenever the user needs to access a restricted resource, he sends the auth token. The OSS then looks up which user matches it and checks whether that user is allowed to access the resource.
3 - The harvest resources and the DAO were updated to support the authentication scheme mentioned in point #2.
4 - Two public API methods were implemented:
- Validator, to validate a given SensorML document.
- Converter, to convert a sensor to a given format (JSON, ebRIM).
- The integration test was implemented and run for the validator.
This week we concentrated on finalizing our tasks and listing future tasks:
- The whole project web app was written using Twitter Bootstrap v3.
- The content assist was implemented according to the guidelines of the OpenSearch specification.
- The autocomplete was added to the OpenSearch document.
- The client web app was validated using an HTML validator and tested in the major browsers: Mozilla Firefox, Google Chrome and IE.
- A search-nearby mechanism was implemented that allows searching near the user's location, based on the browser's Geolocation API.
- The API documentation was generated using annotations and the Swagger UI framework.
- A GitHub page was created containing the most important developer documentation of the project.
- A script view page was implemented that allows users to view and list scripts.
- The registration form and mechanism were completely implemented, with user validation requiring admin permission.
- The integration tests were reviewed to see which tests need to be written in future releases.
- A few bugs were found and reported in the GitHub repo issues. The most important ones were:
- The auth_token wasn't sent correctly in the headers of requests from the OSS-UI.
- The response from the Jersey web service was always chunked, which didn't work well with Ajax calls.
Apache Solr vs PGSQL backend
One of the most critical aspects of the OSS interface is speed, so we need a reliable, high-performance backend. To check that indexing the data in the Solr backend actually improves performance, we ran some tests using the Apache JMeter tool.
The test environment contained the following:
- A set of 1000 randomly generated test sensors that were inserted into both PGSQL and Apache Solr using HTTP requests (https://github.com/Moh-Yakoub/OpenSensorSearch/blob/solr/52n-sir/src/test/resources/data/randomSensors.json)
- The OSS was deployed to Apache Tomcat 6.0.
- A search query was run once against PGSQL and once against Apache Solr, with 100 calling threads each; the time for each thread was recorded.
- The result set format was XML.
Fig 1 - The test performance of OSS with the PGSQL backend
Fig 2 - The OSS test performance with the Solr backend
As the results show:
For the OSS binding with Solr:
A total of 100 threads issuing HTTP requests took around 1843 ms
operation / minute
For the OSS binding with PGSQL:
A total of 100 threads issuing HTTP requests took around 4665 ms
operation / minute
While the results show a large performance improvement of Solr over PGSQL (4665 ms vs 1843 ms, roughly 2.5 times faster in this run), they are not conclusive yet. There are a few things to add to the next tests:
- The dataset was randomly generated; we need real datasets harvested from remote sensors to test against.
- The results cover only the XML response format; other formats and outputs (HTML, KML, JSON, ...) still need to be tested.
- The tests ran on a local machine; it remains to be seen what the results are when both backends are accessed remotely.
Autocomplete Test Performance
The autocomplete is a crucial part of the search engine. In OSS it supports a Google-like autocomplete with many words related to different fields. The autocomplete needs to be very fast because it runs in real time. We ran a test using Apache JMeter with 100 threads on a set of 1000 dummy sensors generated as in the previous section. The results were as follows:
Fig 3 - Test performance for the autocomplete servlet
As we can see, 700 samples took 928 ms at a throughput of 1687.628 operations/minute. This is less than the throughput of the Solr backend search, but the difference can be attributed to the manipulation of result sets in the Java implementation of the autocomplete servlet, so the throughput is not that bad.
I'm Mohammad Ahmed Hamed Yakoub, a Computer Engineering premasters student. I'm interested in software development in general and mostly use Java. I hope to implement the OpenSensorSearch idea and have a successful and productive summer at GSoC 2013 here at 52°North.
Original Project Idea
We want to take sensor data in the web to the next level by implementing the one stop shop for finding sensor data. This project fights this battle on two fronts: First, the existing 52°North Sensor Instance Registry (SIR) implementation must be made more open: Registered users should be able to request a harvesting of their data source as well as provide their own harvesting implementation to integrate their metadata into the catalog of sensors. The potential student would also implement such harvesting mechanisms for popular sites such as Cosm and Thingspeak.
Second, the database interface must be switched out to something that scales and is really (!) quick, such as Apache Lucene and/or commercial cloud storage. The infrastructure should be tested with a simple search form with a good auto-suggest search field - you could call it "Google for Sensor Data" :-).
The student should have some experience in Java and an interest to scrape data from the net and put it in a high performance data structure. Experiences with Apache Lucene, XML and Database indices as well as cloud infrastructures are great but not mandatory. The complexity of this project itself is quite scalable.
Expected results: Scalable database backend for 52°North SIR, user management for open sensor search, public harvesting API with harvesting configuration UI for open sensor search.
Community and Code License: Sensor Web, Apache 2.0
I hope the development cycle goes as follows:
- In the first weeks I hope to understand the code, then improve and test it via unit and integration tests covering all the specifications and details of the software.
- Later on I hope to make a concise comparison between the products that can be used to implement the backend search engine, with sample tasks and tests to settle that decision.
- After we settle those decisions I hope to implement the system as specified above and do general integration and unit testing. Then I hope to work on enhancing the UI and extending the capabilities of the system.
- I hope we can implement this system with an agile methodology (specifically the Scrum framework) and apply test-driven development for the best use of time and resources and higher quality.