Indexing Data to Elastic Search Part-1

PART1:

  • Indexing data through the Elasticsearch API from various sources (CSV, XML, databases, flat files, etc.)
  • Indexing data from relational databases, for which Elasticsearch offers a separate route.

Elasticsearch offers two methods for indexing relational data:

  • Using River Method
  • Using Feeder Method

Rivers are deprecated from Elasticsearch 1.5.x onwards. In their place, a new concept was introduced for indexing relational database data: the Elasticsearch Feeder, which is a standalone component.

1. Indexing Data Using API From Files:

For example, we can index data from a CSV file.

Note:

Because Elasticsearch is a NoSQL store, we do not have to create an index explicitly: it builds its own index structure from the data we push to it.

This is a good feature, but handing index creation over to the Elasticsearch API while we are pushing data has a small cost in terms of memory.

What is the problem?

There are two main reasons to create the index explicitly.

1. Field types. Suppose I am indexing an integer field from a CSV file. If the index is created implicitly while my code runs, ES may map this field as long, even though the integer type is sufficient for it. A long occupies more memory than an integer (8 bytes versus 4), so memory is wasted on that field.

2. Full-text search per field. We can enable or disable full-text search for each field in the schema. By default all string fields are analyzed at index time, which some fields do not need. This choice is made once, at index creation time: after data has been indexed we cannot change the behavior of an existing field (analyzed or not_analyzed), although the multi-field concept lets us give a particular field a differently indexed variant.
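Both reasons can be seen in a single mapping fragment. The sketch below assembles one as a plain JSON string: an explicit integer type for a numeric field, and an analyzed string field that carries a not_analyzed sub-field through the ES 1.x "fields" (multi-field) syntax. The field names quantity, title and raw, and the class ExplicitMappingSketch, are hypothetical and only serve the illustration.

```java
class ExplicitMappingSketch {

    // Mapping entry for a field whose type we choose ourselves,
    // instead of letting ES infer it (for example integer instead of long).
    static String typedField(String name, String type) {
        return "\"" + name + "\":{\"type\":\"" + type + "\"}";
    }

    // Mapping entry for an analyzed string field that also carries a
    // not_analyzed sub-field, so exact matching stays possible.
    static String multiField(String name, String subField) {
        return "\"" + name + "\":{\"type\":\"string\",\"fields\":{\""
                + subField + "\":{\"type\":\"string\",\"index\":\"not_analyzed\"}}}";
    }

    // "quantity", "title" and "raw" are hypothetical field names.
    static String mapping() {
        return "{\"properties\":{"
                + typedField("quantity", "integer") + ","
                + multiField("title", "raw")
                + "}}";
    }

    public static void main(String[] args) {
        System.out.println(mapping());
    }
}
```

With such a mapping the sub-field would be addressed as title.raw in queries, while title itself remains available for full-text search.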

Why Memory Is Important in Elasticsearch:

Memory matters in ES mainly when we run aggregation operations. To calculate the requested aggregate (sum, average, etc.), ES loads the field values of all records into memory, so a field mapped as long occupies more memory there than the same field mapped as integer. If we save memory per field, these operations run well and the cluster is less likely to run out of memory.

Field data is not the only reason a cluster can hit an OOM issue, but these small per-field savings add up: we get better performance from the cluster for index operations, and we avoid frequently scaling the cluster vertically because of memory issues.
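To put a rough number on this, the sketch below compares the raw space needed for one value per document at integer width and at long width. It is a back-of-the-envelope estimate only: it assumes every value is held at its declared width, whereas ES may store field data more compactly. The document count and the class name FieldWidthEstimate are hypothetical.

```java
class FieldWidthEstimate {

    // Raw bytes needed to hold one fixed-width value per document.
    static long rawBytes(long docCount, int bytesPerValue) {
        return docCount * bytesPerValue;
    }

    public static void main(String[] args) {
        long docs = 100_000_000L; // hypothetical number of documents

        long asInteger = rawBytes(docs, Integer.BYTES); // 4 bytes per value
        long asLong = rawBytes(docs, Long.BYTES);       // 8 bytes per value

        long mib = 1024L * 1024L;
        System.out.println("integer: " + asInteger / mib + " MiB");
        System.out.println("long:    " + asLong / mib + " MiB");
        System.out.println("extra:   " + (asLong - asInteger) / mib + " MiB");
    }
}
```

Under this assumption the long mapping needs twice the space for the same field, and the difference is multiplied by every field that is mapped wider than necessary.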

Now let us see how to index data from a CSV file.

First we need a TransportClient connection to the ES cluster. The following snippet connects to the cluster using the transport protocol (it does not apply to the node client).

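import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// "RND" is the name of the cluster to join.
Settings settings = ImmutableSettings.settingsBuilder().put("cluster.name", "RND").build();
// Arguments: IP of a node in the cluster, transport port.
TransportClient transportClient = new TransportClient(settings)
        .addTransportAddress(new InetSocketTransportAddress("10.0.0.40", 9300));
if (transportClient.connectedNodes().isEmpty()) {
    System.out.println("unable to reach es…");
} else {
    // will go with our logic
}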
“RND”: the cluster name.
“IP”: the IP of any node in the cluster. We can add a list of IPs through addTransportAddress so that the client also connects to other nodes in the cluster.
Port: the port used for communication. For the transport protocol we use port 9300.
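When several node addresses are configured, they usually arrive as one string. The sketch below, using only the JDK, parses such a comma-separated list into host and port pairs and falls back to 9300 when no port is given. The class TransportAddressList, the list format and the second IP are assumptions for illustration; each resulting pair would then be passed to addTransportAddress as in the snippet above.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

class TransportAddressList {

    static final int DEFAULT_TRANSPORT_PORT = 9300;

    // Parses "host:port,host,host:port" into unresolved socket addresses.
    static List<InetSocketAddress> parse(String nodes) {
        List<InetSocketAddress> result = new ArrayList<>();
        for (String entry : nodes.split(",")) {
            String node = entry.trim();
            if (node.isEmpty()) {
                continue; // tolerate trailing commas
            }
            int colon = node.lastIndexOf(':');
            String host = colon < 0 ? node : node.substring(0, colon);
            int port = colon < 0
                    ? DEFAULT_TRANSPORT_PORT
                    : Integer.parseInt(node.substring(colon + 1));
            // No DNS lookup happens here; the address stays unresolved.
            result.add(InetSocketAddress.createUnresolved(host, port));
        }
        return result;
    }

    public static void main(String[] args) {
        // 10.0.0.41 is a hypothetical second node.
        for (InetSocketAddress address : parse("10.0.0.40:9300, 10.0.0.41")) {
            System.out.println(address.getHostString() + " -> " + address.getPort());
        }
    }
}
```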
Next we use the CsvReader class to read the CSV file.
The following code reads the records from the CSV file line by line.
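// CsvReader as provided by the javacsv library (assumed from the method names used here).
import com.csvreader.CsvReader;

CsvReader metaProps = new CsvReader(path);
metaProps.readHeaders(); // the first line holds the column names
while (metaProps.readRecord()) {
    // Look up each column of the current record by its header name.
    String field1 = metaProps.get("field1").trim();
    String field2 = metaProps.get("field2").trim();
    String field3 = metaProps.get("field3").trim();
    String field4 = metaProps.get("field4").trim();
    String field5 = metaProps.get("field5").trim();
    String field6 = metaProps.get("field6").trim();
    String field7 = metaProps.get("field7").trim();
    String field8 = metaProps.get("field8").trim();
    // Build and index the document for this record here.
}
metaProps.close();
path: the path of the CSV file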
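The header-then-records pattern used above can also be sketched with the JDK alone, which makes the mechanics visible. This is not a replacement for CsvReader: it assumes a plain comma separator and does not handle quoted values or embedded commas. The class SimpleCsvRecords and the sample data in main are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SimpleCsvRecords {

    // Reads a header line, then maps each following line to header -> trimmed value.
    static List<Map<String, String>> read(Reader source) {
        List<Map<String, String>> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String headerLine = reader.readLine();
            if (headerLine == null) {
                return records; // empty input
            }
            String[] headers = headerLine.split(",", -1);
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                String[] values = line.split(",", -1);
                Map<String, String> record = new LinkedHashMap<>();
                for (int i = 0; i < headers.length; i++) {
                    // Missing trailing columns become empty strings.
                    String value = i < values.length ? values[i].trim() : "";
                    record.put(headers[i].trim(), value);
                }
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "field1,field2\n 12923 ,Logo\n12924,Folder\n";
        for (Map<String, String> record : read(new StringReader(csv))) {
            System.out.println(record);
        }
    }
}
```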

Indexing to ElasticSearch:

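import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.common.xcontent.XContentBuilder;

// Runs once per CSV record, inside the while loop that reads the file.
XContentBuilder builder = jsonBuilder()
        .startObject()
        .field("test1", field1)
        .field("test2", field2)
        .field("test3", field3)
        .field("test4", field4)
        .field("test5", field5)
        .field("test6", field6)
        .field("test7", field7)
        .field("test8", field8)
        .endObject();
String json = builder.string();
// Send the document and block until ES answers.
IndexResponse response = transportClient.prepareIndex(indexName, type)
        .setSource(json)
        .execute()
        .actionGet();
System.out.println("index response is:::" + response.getId());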
indexName: the name of the index
type: the mapping type within that index

The call response.getId() in the last line returns the id of the inserted record. It acts as an acknowledgment from ES that the record was inserted into the index.

The above code can index any kind of CSV data into Elasticsearch.

Creating the ES Index Explicitly:

Define the index on the ES side with a JSON structure like the following.

"test1": {
  "aliases": {},
  "mappings": {
    "test_index": {
      "properties": {
        "test1": {
          "type": "string",
          "index": "not_analyzed"
        },
        "test2": {
          "type": "string",
          "index": "not_analyzed"
        },
        "test3": {
          "type": "string",
          "index": "not_analyzed"
        },
        "test4": {
          "type": "string",
          "index": "not_analyzed"
        },
        "test5": {
          "type": "string"
        },
        "test6": {
          "type": "string",
          "index": "not_analyzed"
        },
        "test7": {
          "type": "string",
          "index": "not_analyzed"
        },
        "test8": {
          "type": "string"
        },
        "test9": {
          "type": "date",
          "format": "dateOptionalTime"
        }
      }
    }
  }
}
After indexing, a record looks like the following:
{
  "_index": "space_auditlog",
  "_type": "space_auditlog",
  "_id": "AVCpMrUKjLjAmZCH_ZuM",
  "_score": 1,
  "_source": {
    "test1": "12923",
    "test2": "Logo",
    "test3": "Folder",
    "test4": "admin",
    "test5": "admin",
    "test6": "test me",
    "test7": "10.0.3.79",
    "test8": "Delete",
    "test9": "2015-10-27T08:03:55"
  }
}

Note:

For the fields indexed from CSV into ES, the keys declared in the jsonBuilder must match the field names in the index. Otherwise a new field is added under the wrong name, which leads to inconsistent data and unreliable search results.

For example, test1 in the code needs to map to the “test1” key in ES. Suppose that after creating the index we change the key in the code from test1 to “tests1”. When we then index data, a new field “tests1” is added to the schema, so the schema is disturbed. Once a field has been created it cannot be removed from the schema; we have to delete the complete index and reindex, which is a very costly operation on large indices.
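One way to catch such a mismatch before it reaches the cluster is to compare the document keys with the field names of the mapping on the client side. The sketch below is a plain JDK check under that assumption; the class MappingKeyCheck and its method are hypothetical helpers, not part of the ES API, and the set of mapped fields would have to be kept in sync with the real mapping.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class MappingKeyCheck {

    // Returns the document keys that have no counterpart in the mapping.
    static Set<String> unknownKeys(Set<String> mappedFields, Map<String, ?> document) {
        Set<String> unknown = new TreeSet<>(document.keySet());
        unknown.removeAll(mappedFields);
        return unknown;
    }

    public static void main(String[] args) {
        // Field names as declared in the explicit mapping.
        Set<String> mapped = new HashSet<>(Arrays.asList(
                "test1", "test2", "test3", "test4", "test5",
                "test6", "test7", "test8", "test9"));

        Map<String, String> document = new LinkedHashMap<>();
        document.put("tests1", "12923"); // misspelled key
        document.put("test2", "Logo");

        Set<String> unknown = unknownKeys(mapped, document);
        if (!unknown.isEmpty()) {
            // Refuse to index instead of letting ES add a new field.
            System.out.println("unmapped keys, not indexing: " + unknown);
        }
    }
}
```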

Please refer to our next post, Indexing Data to Elastic Search 2.