Wednesday 2 July 2008

Full text search in rails with postgres





Summary
There are basically four options: Ferret, Solr, Sphinx and native Postgres search (which used to be the tsearch2 contrib module but has been built into the database since PostgreSQL 8.3). Each has advantages and disadvantages.

Ferret - advantages
1. Fast indexing
2. Indexing on ActiveRecord save
3. Set boost values independently per field and per record
4. Write custom text tokenizers, stemmers and stop lists (and even use different ones per field)
5. Highlight matches in results using the same engine that does the searching
6. Manage your own indexes, merging them at will, or just merging results from them
7. Index content generated on the fly, without having to store it in the SQL database (for example, pull in all the associated tags for a post as you index it)
8. Store original data in the index (though most people use it to index an SQL database anyway)
Ferret - disadvantages
1. Can corrupt indexes if your app uses transactions, because of its after_update callback (it updates the index before the actual save to the database)
2. Unstable on the production server if you load balance (round-robin, for example) across Mongrel instances on different machines; the workaround is the added burden of running a separate DRb server
3. Slow searching


Solr - advantages
1. Index is updated on ActiveRecord save
2. Built-in support for highlighting search keywords, as you see in Google Search, and many more advanced features

Solr - disadvantages
1. Runs on JBoss or some other Java stack
2. Slow to reindex and query compared with Sphinx, and uses 50x more memory

Sphinx - advantages
1. Very fast to search and index (though slow to update)
2. searching and ranking across multiple models
3. delta index support
4. excerpt highlighting
5. Google-style query parser
6. spellcheck
7. faceting on text, date, and numeric fields
8. field weighting, merging, and aliasing
9. geodistance
10. belongs_to and has_many includes
11. drop-in compatibility with will_paginate
12. drop-in compatibility with Interlock
13. multiple deployment environments
14. comprehensive Rake tasks

Sphinx - disadvantages
1. Closely tied to MySQL and PHP; it can run with Postgres but needs to be compiled with Postgres support
2. Difficult to integrate compared with Ferret or Solr
3. You have to write a lot of SQL in the configuration file for indexing and searching data
4. Not hooked into the ActiveRecord save or the life cycle of an object, so there are no automatic updates and you need a cron job to rebuild the index periodically (plugins use delta indexes, so model changes are added to the live indexes automatically, but regular periodic reindexing is still needed)
5. 'Shared hosts do not support sphinx'

Postgres - advantages
1. Can use triggers to index on save
2. No overhead of another system
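
As a rough sketch of the trigger approach, the helper below builds the SQL for a tsvector column, a GIN index and a trigger that calls the tsvector_update_trigger function built into Postgres. The posts table, its title and body columns and the search_vector column name are assumptions for illustration; the statements would still have to be run from a migration (with execute) or from psql, and existing rows would need a one-off UPDATE to fill the new column.

```ruby
# Builds the SQL needed to keep a tsvector column up to date on every save.
# All table and column names passed in are hypothetical examples.
def search_trigger_sql(table, columns, vector_column = "search_vector")
  column_list = columns.join(", ")
  [
    # Column that stores the pre-computed search document
    "ALTER TABLE #{table} ADD COLUMN #{vector_column} tsvector",
    # GIN index so that @@ matches do not scan the whole table
    "CREATE INDEX #{table}_#{vector_column}_idx ON #{table} USING gin(#{vector_column})",
    # tsvector_update_trigger rebuilds the column from the listed text columns
    "CREATE TRIGGER #{table}_#{vector_column}_update " \
      "BEFORE INSERT OR UPDATE ON #{table} FOR EACH ROW " \
      "EXECUTE PROCEDURE tsvector_update_trigger(" \
      "#{vector_column}, 'pg_catalog.english', #{column_list})"
  ]
end

puts search_trigger_sql("posts", %w[title body])
```

Because the trigger lives in the database, the index stays current no matter which application or script writes the row.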

Postgres - disadvantages
1. Limited plugin support (we will need to write our own)
2. Will need to hand code pagination and search term highlighting (Postgres has a built-in highlighting function, ts_headline, but it must be called via SQL)
3. Not hooked into ActiveRecord automatically
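
To give an idea of what the hand coding might look like, here is a minimal sketch that builds a ranked, highlighted and paginated query using the built-in ts_rank, ts_headline and plainto_tsquery functions. The posts table, the body column and the search_vector column are assumptions, as is the helper name; the returned array is in the [sql, bind values...] form that find_by_sql accepts.

```ruby
# Returns [sql, *bind_values] for a ranked, highlighted, paginated search.
# Table and column names are hypothetical.
def search_sql(query, page = 1, per_page = 10)
  # Pages are 1-based; anything below 1 is treated as the first page
  offset = ([page.to_i, 1].max - 1) * per_page
  sql = <<-SQL
    SELECT posts.*,
           ts_rank(search_vector, plainto_tsquery('english', ?)) AS rank,
           ts_headline('english', body, plainto_tsquery('english', ?)) AS excerpt
    FROM posts
    WHERE search_vector @@ plainto_tsquery('english', ?)
    ORDER BY rank DESC
    LIMIT ? OFFSET ?
  SQL
  [sql, query, query, query, per_page, offset]
end
```

In a Rails app this could be used as Post.find_by_sql(search_sql("full text search", 2)), with rank and excerpt then available as extra attributes on each result.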

Requirements:
Stemming
Stop words
Wildcards
Search across multiple fields in multiple tables and rank by specific fields
Paginate results
Highlight search terms in text
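
For the wildcard requirement, one possible approach is to turn the user's input into a prefix-matching tsquery string, leaving stemming and stop words to the text search configuration. This is only a sketch: prefix_tsquery is a hypothetical helper, and the :* prefix syntax needs PostgreSQL 8.4 or later, so it would not work on 8.3.

```ruby
# Converts free-form user input into a prefix-matching tsquery string.
# Punctuation is discarded so the result is safe to hand to to_tsquery.
def prefix_tsquery(input)
  terms = input.to_s.downcase.scan(/[[:alnum:]]+/)
  terms.map { |term| "#{term}:*" }.join(" & ")
end

puts prefix_tsquery("Full-text sear")  # prints: full:* & text:* & sear:*
```

The result would be passed as a bound parameter, for example to_tsquery('english', ?). Blank input produces an empty string, so the caller should skip the search in that case.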

Notes

Postgres native search (used to be tsearch2)
http://groups.google.com/group/acts_as_tsearch/browse_thread/thread/6437f86a2540f406
and
http://www.pervasivecode.com/blog/2008/01/24/acts_as_tsearch-adjustments-needed-for-postgresql-83rc2/

It looks like only minor changes are needed to get acts_as_tsearch working with Postgres 8.3.

Ferret generally gets a bad press. I thought the problems went away after moving to DRb, but apparently not.

'We’ve used ferret on past projects… and now use sphinx. We’re not likely going back to ferret. ;-)'

and lots more comments like this in
http://www.ruby-forum.com/topic/137629
And most convincing of all ...
http://deadprogrammersociety.blogspot.com/2008/05/in-search-of-search.html



Sphinx plugins:
Ultrasphinx (only works with Rails 2.0)
Thinking Sphinx
acts_as_sphinx
Sphincter