Populate your Search index, NEST 2016-01
Presented at a Meetup by New England Search Technology, 2016-01-14.
Published on: Mar 4, 2016
Populating your Search Index
NEST Meetup, 2016-01
Indexing Considerations, Pipelines, and Apache NiFi
A Proposal for a Document Pipeline
How we do it at TIAA-CREF with Solr
How we do it at DRG with Solr
Logstash and Beats with Elasticsearch
Indexing considerations to think about when building out a search index
What do I mean?
How do you plan to get data into the index?
Schedule & Monitor?
Realtime search requirements?
What software? (pipelines, crawlers, …)
Common in the “enterprise search” space
What crawler will you use?
Nutch is well-known but too complex for smaller scale
Many more exist.
Security access control metadata to federate?
Try ManifoldCF, which excels at this.
Plan for a “bulk reindex” use-case
When changing schemas / ingestion extraction rules
Or recovering when there’s no backup
Not having a backup is typical; esp. if re-indexing is fast
Optimize settings for this to be fast
May need to toggle after ingestion into “normal” settings
Use multiple machines during indexing (e.g. via Hadoop)?
“Optimize” (merge) Lucene segments at the end?
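As a concrete sketch of toggling between "bulk" and "normal" settings, here is what the two phases might look like for Elasticsearch (the index name and exact values are assumptions, not from the talk); the bodies would be PUT to the index's `_settings` endpoint before and after the load:

```python
# Sketch: Elasticsearch index settings for a fast bulk reindex, then the
# "normal" settings to restore afterwards. Values are illustrative.

# Before the bulk load: no background refresh, no replicas, so segments
# are written once and not copied around while indexing.
BULK_SETTINGS = {
    "index": {
        "refresh_interval": "-1",
        "number_of_replicas": 0,
    }
}

# After the load completes: re-enable refresh and replication.
NORMAL_SETTINGS = {
    "index": {
        "refresh_interval": "1s",
        "number_of_replicas": 1,
    }
}

def settings_for(phase):
    """Return the settings body to PUT to /<index>/_settings for a phase."""
    return BULK_SETTINGS if phase == "bulk" else NORMAL_SETTINGS
```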
(adding new/updated content)
Detect deletes how?
A: Flag for removal upstream before eventually removing
B: Track all IDs somewhere; find the ones that went missing
Maybe don’t need to synchronize deletes until off-hours?
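Option B above reduces to a set difference; a minimal sketch (the function name is mine, not from the talk):

```python
def detect_deletes(source_ids, indexed_ids):
    """Option B: IDs present in the index but gone from the source
    are the documents to delete."""
    return set(indexed_ids) - set(source_ids)

# Example: doc "b" disappeared upstream, so it should be purged:
# detect_deletes({"a", "c"}, {"a", "b", "c"}) -> {"b"}
```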
Realtime Indexing, separate?
Backups (DR: Disaster Recovery)
Admin accidentally deleted 30k random docs; oh %#?!
Not solved by replication/redundancy
Useful in other scenarios, like testing
Might not need it; especially if bulk re-indexing is fast
Take snapshots (e.g. AWS, or via the search platform)
Recovery: Deploy snapshot then sync it back up to date.
Solr: see BloomReach’s “HAFT” project
Mapping source data (e.g. HTML doc or database record) to a search document
Text from PDF extraction
Enrichment (e.g. Named Entity Recognition)
Text pre-processing before search platform gets it
Merging multiple data sources; joining
Home-grown or use an existing ETL / “pipeline”?
Do some of this directly on the search platform?
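A minimal sketch of the mapping step: turning a source record into a flat search document, merging fields and leaving a hook for enrichment (field names are illustrative, not a real schema from the talk):

```python
def to_search_doc(record):
    """Map a source record (a dict here) to a flat search document."""
    doc = {
        "id": record["id"],
        "title": record.get("title", "").strip(),
        # Merge multiple source fields into one searchable text field.
        "text": " ".join(
            filter(None, [record.get("body"), record.get("summary")])
        ),
    }
    # Enrichment hook: e.g. named-entity tags could be added here.
    return doc
```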
How will a bulk index be triggered? Incremental updates?
Unix Cron? Basic but crude.
A Web UI to control this is great.
A CI server (e.g. Jenkins) can work! (web, logs, alerting)
Monitor/alert for problems?
Perhaps via general log monitoring (e.g. ELK)
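A cron- or Jenkins-friendly wrapper illustrates how a job can surface failures to those tools: a non-zero exit code is what triggers their alerting (the function names are mine):

```python
import sys

def run_index_job(index_fn):
    """Run an indexing function and map its outcome to an exit code,
    so cron email-on-output or Jenkins build status can alert on it."""
    try:
        index_fn()
        return 0  # success: silent under cron
    except Exception as e:
        print(f"indexing failed: {e}", file=sys.stderr)
        return 1  # non-zero exit: cron/Jenkins flags the run as failed

# Usage: sys.exit(run_index_job(my_bulk_index))
```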
Open-Source ETL Software
A summary of an investigation I did on open-source options in 2013.
Extract Transform Load – a general idea
Software that calls itself ETL tends to be very similar.
Pentaho Data Integration, AKA Kettle
Talend Open Studio, Data Integration
Two are GPL/LGPL, Talend is Apache
Freemium model – pay for “enterprise” features
The Good: (in a word, mature)
GUI wire diagram builder
Books / resources
Text-editing the pipeline is not recommended: thus you need the GUI
Data model is table-like; no native multi-valued fields
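One common workaround for the table-like data model is packing a multi-valued field into a single delimited cell and splitting it back out before it reaches the search platform; a sketch (the delimiter is an assumption and must not occur in the values):

```python
DELIM = "|"  # assumed safe delimiter not present in the values

def pack_multivalue(values):
    """Pack a multi-valued field into one cell for a row-oriented ETL."""
    return DELIM.join(values)

def unpack_multivalue(cell):
    """Split the cell back into values before indexing."""
    return cell.split(DELIM) if cell else []
```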
“is an easy to use, powerful, and reliable system to process and distribute data.”
Apache NiFi overview
Runtime modification of flow control
Data provenance features
Extensible (of course)
Security, role based access control