Follow the steps below to get started with the contest:
STEP 1: Register for the contest
Read the post on HOW TO PARTICIPATE for details.
STEP 2: Analyze the Problem Statement and the datasets
Read the Problem Statement here.
Two datasets (Training and Prediction datasets) are provided as part of the problem statement.
The links to data sets are:
* Training Data
* Prediction Data
STEP 3: Log in to StreamAnalytix and build your application
Log in to the StreamAnalytix Cloud instance that you received in the email confirming your registration. Use the username and password that you provided while registering for the contest.
You will be able to log in to your own StreamAnalytix Workspace.
Workspaces are logical containers in StreamAnalytix. You can read more about workspaces here.
Contestant Workspaces are pre-configured with the following components:
* Ready-to-use schema definition of the ‘NetworkConnectionsDetails’ data
* Data generator for ingesting streaming data into Spark data flows
* Connections to DFS, HBase, ES, and RMQ
* Dashboard for creating widgets & graphs
Contest Pipeline development entails the following tasks:
* Create a pipeline on the pipeline canvas
* Use the ‘Data generator channel’ to ingest a data file into the streaming application
In order to create a pipeline, navigate to the ‘Data Pipeline’ link on the vertical sidebar.
Read more on how to create a pipeline here.
When saving a pipeline, most of the parameters are auto-populated with fixed values for contestants.
Details on the Data Generator Channel are available here.
Enrich the data with latitude and longitude values using IP-to-geo mapping: a static IP-to-geo mapping dataset is provided in an HBase table (ip2geo).
The ‘ip2geo’ table in HBase maps IP addresses to latitude and longitude values. These values are stored in the ‘geocoordinates’ column of the ‘geopoint’ column family.
The LookupHbase function can be used to look up geopoints by providing the row key (IP) as input. The result can be assigned to the message field geopoint by using a general-purpose programming operator named ‘enricher’. Enricher (help link) allows creating complex data manipulations from a variety of data sources, MVEL expressions, and embedded Java routines.
For instance, the MVEL expression
will look up the ‘EMPColumnFamily’ column family of the HBase table ‘Employee’, using the ‘ID’ field of the StreamAnalytix message ‘EmployeeDetails’ as the row key, and extract the value of the ‘Salary’ column from the first row of the result.
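The IP-to-geo enrichment described above can be illustrated outside StreamAnalytix with a minimal Python sketch. It does not use the LookupHbase function or the enricher operator. It only models the lookup logic with an in-memory dictionary standing in for the ‘ip2geo’ table. The message field name `srcIp`, the sample IP addresses, the coordinate values, and the "latitude,longitude" string format are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical in-memory stand-in for the HBase table 'ip2geo':
# row key (IP) -> column family 'geopoint' -> column 'geocoordinates'.
# The IPs and coordinates below are invented sample values.
IP2GEO = {
    "192.0.2.10": {"geopoint": {"geocoordinates": "37.77,-122.42"}},
    "198.51.100.7": {"geopoint": {"geocoordinates": "51.51,-0.13"}},
}


def lookup_geopoint(table: dict, ip: str) -> Optional[str]:
    """Return the 'geocoordinates' value for the given row key, or None."""
    row = table.get(ip)
    if row is None:
        return None
    return row.get("geopoint", {}).get("geocoordinates")


def enrich(message: dict, ip_field: str = "srcIp") -> dict:
    """Return a copy of the message with a 'geopoint' field added.

    'srcIp' is an assumed field name, not taken from the contest schema.
    """
    enriched = dict(message)
    enriched["geopoint"] = lookup_geopoint(IP2GEO, message.get(ip_field, ""))
    return enriched
```

Calling `enrich({"srcIp": "192.0.2.10"})` adds the stored coordinates as the `geopoint` field. An IP with no row in the table yields `geopoint` set to `None`, so downstream steps can decide how to treat unmapped addresses.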
* The Data Generator operator streams file contents only once
* Restrictions apply to the number of cores, memory, and data rate for pipelines
Data partitioning across Workspaces: click the ‘View Data’ icon in the sidebar to view and query ES, RMQ, and HDFS.
To run the pipeline multiple times when using the data generator, stop the pipeline and start it again to replay the data generator data.