
Indexing Data to Elastic Search Part-2

Posted 1 year, 6 months and 3 days ago

PART 2:

Indexing data from River or Feeder.

  • Indexing data from a river (deprecated as of ES 1.5.x).
  • Rivers were a useful Elasticsearch feature that offered a way to index relational data into Elasticsearch. The river mechanism was implemented by Shay Banon.

Reference: https://www.elastic.co/guide/en/elasticsearch/rivers/current/index.html

Why were rivers deprecated?

Disadvantages:

  • Rivers are a major cause of cluster instability, because they inherently depend on external systems and external libraries.
  • The river plugin has to be installed on every node in the cluster, which makes upgrades difficult to manage.
  • The plugin runs inside Elasticsearch itself, so a fault in the plugin can destabilize the cluster.
  • The river has to be closed explicitly once indexing is complete.
  • A river can pull data from only one table at a time.

River Concepts:

Rivers are deprecated as of ES 1.5.2. In their place, a Feeder can be used to index data into Elasticsearch. The Feeder is a standalone application that runs outside the cluster, so it does not have to be installed on every ES node.

The MySQL (JDBC) river is implemented by Jprante, and it can be scheduled according to our needs.

The following JSON script defines a river that populates Elasticsearch with data from a MySQL database. It can be executed from the Sense tool.

PUT /_river/indexName/_meta
{
  "type": "jdbc",
  "jdbc": {
    "url": "jdbc:mysql://10.0.1.41:3306/test",
    "user": "test",
    "password": "secret",
    "sql": "SELECT * FROM `test_DB`",
    "index": "test_news",
    "type": "tn2",
    "bulk_size": "160",
    "max_bulk_requests": "5",
    "timezone": "America/Los_Angeles"
  }
}
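If the river definition has to be generated from a script rather than typed into Sense, the request can be assembled programmatically. The following is only a minimal sketch: `build_river_request` is a hypothetical helper, not part of any Elasticsearch client, and it only builds the method, path and body; sending them to the cluster is left to whichever HTTP client you use. The time zone is written with the standard tz database ID `America/Los_Angeles`.

```python
import json


def build_river_request(river_name, jdbc_settings):
    """Return (method, path, body) for registering a JDBC river.

    Hypothetical helper for illustration; it does not contact the cluster.
    """
    path = "/_river/{}/_meta".format(river_name)
    body = json.dumps({"type": "jdbc", "jdbc": jdbc_settings}, indent=2)
    return "PUT", path, body


# Same values as the river script in this post.
settings = {
    "url": "jdbc:mysql://10.0.1.41:3306/test",
    "user": "test",
    "password": "secret",
    "sql": "SELECT * FROM `test_DB`",
    "index": "test_news",
    "type": "tn2",
    "bulk_size": "160",
    "max_bulk_requests": "5",
    "timezone": "America/Los_Angeles",
}

method, path, body = build_river_request("indexName", settings)
print(method, path)
print(body)
```

Serializing with `json.dumps` guarantees plain double quotes in the body, which avoids the typographic quotes that tend to creep in when JSON is copied from a web page.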

Note: Once the data has been populated, we have to delete the _river mapping. Otherwise the same script runs again every time the server restarts, and the data is populated into Elasticsearch a second time.

Deleting River:

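DELETE /_river/indexName/

This deletes the river mapping registered under the name indexName. Note that DELETE /_river/ without a river name removes every river defined on the cluster.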

Data Validations:

Once the data has been fully populated into Elasticsearch, we have to validate it: compare record counts, net amounts, or any other calculated totals against the source database.
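A check of this kind can be scripted. The sketch below assumes the source rows and the indexed documents have already been fetched into lists of dictionaries, and that the amount lives in a field called `net_amount`; both the field name and the helper names are hypothetical and only serve the illustration.

```python
from decimal import Decimal


def summarize(rows, amount_field="net_amount"):
    """Return (record count, total amount) for a list of dict rows."""
    total = sum((Decimal(str(r[amount_field])) for r in rows), Decimal("0"))
    return len(rows), total


def validate(source_rows, indexed_docs, amount_field="net_amount"):
    """Compare source and index; return a list of mismatch descriptions."""
    src_count, src_total = summarize(source_rows, amount_field)
    idx_count, idx_total = summarize(indexed_docs, amount_field)
    problems = []
    if src_count != idx_count:
        problems.append("count mismatch: {} vs {}".format(src_count, idx_count))
    if src_total != idx_total:
        problems.append("total mismatch: {} vs {}".format(src_total, idx_total))
    return problems


source = [{"id": 1, "net_amount": "10.50"}, {"id": 2, "net_amount": "4.25"}]
indexed = source[:1]  # simulate one record lost during indexing

print(validate(source, source))   # no problems reported
print(validate(source, indexed))  # count and total both differ
```

`Decimal` is used instead of `float` so that monetary totals compare exactly.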

Index Fixes from river scripts:

To index data in bulk, or from a very large table, we have to set limits on the following parameters:

"bulk_size" : "160",

"max_bulk_requests" : "5",

With these settings, Elasticsearch receives 160 records per bulk request while the table is being read, with at most 5 bulk requests at a time. These limits reduce the chance of records going missing during indexing.
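The effect of `bulk_size` can be pictured as splitting the result set into fixed-size batches. The following sketch only illustrates that idea; it is not the river's actual implementation, and `batches` is a hypothetical helper.

```python
def batches(records, bulk_size=160):
    """Yield lists of at most bulk_size records from any iterable."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == bulk_size:
            yield batch
            batch = []
    if batch:  # emit the final, partially filled batch
        yield batch


# 500 rows with bulk_size 160 give three full batches and one remainder.
sizes = [len(b) for b in batches(range(500), bulk_size=160)]
print(sizes)  # [160, 160, 160, 20]
```

Because the function is a generator, only one batch is held in memory at a time, which is the point of bounding the batch size for large tables.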

"timezone" : "America/Los_Angeles" (the most important parameter when indexing data from a separate database server into Elasticsearch)

When indexing data from a database server into an Elasticsearch server, ES and the database should be on the same server (and therefore share a time zone). Otherwise we have to set the timezone parameter to the time zone of the server where the database is hosted.
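To see why the time zone matters, consider a timestamp stored in the database without any zone information. The sketch below uses fixed UTC offsets as a simplification (a real zone name also handles daylight saving time), and `to_utc` is a hypothetical helper: the same stored value lands on a different instant depending on which zone the indexer assumes.

```python
from datetime import datetime, timedelta, timezone


def to_utc(naive_timestamp, db_utc_offset_hours):
    """Interpret a zone-less DB timestamp in the DB server's offset, return UTC."""
    db_tz = timezone(timedelta(hours=db_utc_offset_hours))
    return naive_timestamp.replace(tzinfo=db_tz).astimezone(timezone.utc)


stored = datetime(2015, 6, 1, 9, 30)   # value as stored, no zone attached

assumed_pacific = to_utc(stored, -7)   # DB server on Pacific daylight time
assumed_utc = to_utc(stored, 0)        # indexer wrongly assumes UTC

print(assumed_pacific.isoformat())     # 2015-06-01T16:30:00+00:00
print(assumed_utc.isoformat())         # 2015-06-01T09:30:00+00:00
print(assumed_pacific - assumed_utc)   # 7:00:00
```

A seven-hour shift like this is enough to push records into the wrong day in date-based queries, which is why the database server's zone should be passed explicitly.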

Please refer to our next post, Indexing Data to Elastic Search Part-3.
