Similarities:
- Both Solr and Sphinx can meet all your requirements. They are fast and designed to index and search large amounts of data efficiently.
- Both have a list of high-traffic sites that use them (Solr, Sphinx)
- Both provide commercial support. (Solr, Sphinx)
- Both provide client API bindings for multiple platforms / languages (Sphinx, Solr)
- Both can be distributed to increase speed and capacity (Sphinx, Solr)
Differences:
- Solr supports field collapsing (currently only as an additional patch) to avoid repeating similar results. Sphinx does not seem to provide any such functionality.
- While Sphinx is designed to retrieve only document IDs, Solr can directly return entire documents containing almost any type of data, which makes it more independent of any external data store and saves the extra round trip.
- Except when embedded, Solr runs in a Java web container such as Tomcat or Jetty, which requires additional container-specific configuration and tuning (or you can use the bundled Jetty and simply start it with java -jar start.jar). Sphinx needs no such additional configuration.
- In Sphinx, all document IDs must be unique unsigned non-zero integers. Solr doesn’t even need unique keys for many operations, and the unique keys can be integers or strings.
- Sphinx does not allow partial index updates to field data.
- Solr comes with a spell checker.
- Solr can index proprietary formats such as Microsoft Word and PDF, while Sphinx cannot.
- Sphinx is tightly integrated with RDBMS, especially MySQL.
- Solr can be integrated with Hadoop to build distributed applications
- Solr can be integrated with Nutch to quickly build a mature Web search engine with crawler functionality.
- Solr is built on Lucene, a proven technology that is more than 8 years old and has a large user base (this is only a small part). Whenever Lucene gains a new feature or gets faster, Solr benefits too. Many of the developers committed to Solr are also Lucene contributors.
- Solr can be easily embedded in Java applications.
- Solr is an Apache project and is therefore Apache 2 licensed. Sphinx is GPLv2. This means that if you ever need to embed or extend Sphinx (not just “use” it) in a commercial application, you must purchase a commercial license (reasonably priced).
- Solr is still a solution for exposing an indexing / search server over HTTP, but I think ElasticSearch provides an excellent distributed model and is easy to use (although it currently lacks some search features, that should not be the case for long; in any case, the plan is to bring all of Compass's features into ElasticSearch).
- Essentially, a distributed Lucene solution needs to be sharded. Also, now that HTTP and JSON have become ubiquitous APIs, such a solution can easily be used from many different systems written in different languages.
- Using pure Lucene is challenging. There are many things you need to take care of if you want it to perform really well, and because it is only a library, an embedded Java library that you have to maintain yourself, it has no distributed support.
- As for Lucene versus database full-text search, I think Lucene’s performance is unmatched. As long as the Lucene index is set up correctly, you should be able to complete almost any search in under 10 milliseconds, no matter how many records you have to search.
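
To make the points about HTTP/JSON access and about Solr returning whole documents more concrete, here is a minimal Python sketch. It only builds a query URL for Solr's standard select handler and parses a JSON response of the usual shape; it does not contact a server. The host, the core name "articles", the field names, and the sample response are all made up for illustration, and the exact URL layout depends on your Solr version and setup.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url, core, query, rows=10):
    """Build a URL for Solr's /select handler asking for a JSON response.

    The /solr/<core>/select layout is an assumption; older or differently
    configured installations may use another path.
    """
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/solr/{core}/select?{params}"


def extract_docs(response_text):
    """Return (total hit count, list of stored documents) from a Solr JSON response."""
    body = json.loads(response_text)["response"]
    return body["numFound"], body["docs"]


# Hand-written sample in the shape of a Solr JSON response (not real output).
# Note that each hit carries its stored fields, not just an ID, and that the
# unique key here is a string.
SAMPLE_RESPONSE = """
{
  "responseHeader": {"status": 0, "QTime": 3},
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {"id": "doc-1", "title": "Lucene in Action"},
      {"id": "42", "title": "Scaling Lucene"}
    ]
  }
}
"""

if __name__ == "__main__":
    # Hypothetical host and core name.
    print(build_select_url("http://localhost:8983", "articles", "title:lucene"))
    total, docs = extract_docs(SAMPLE_RESPONSE)
    for doc in docs:
        print(doc["id"], doc["title"])
```

With Sphinx, the equivalent flow would typically return matching document IDs, and you would then fetch the actual rows from your database in a second step.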