Sphider Pro - Crawler Indexing Script

HostCheetah · April 25, 2018, 2:31pm

Sphider Pro is a PHP search engine based around the idea of the original Sphider by Ando Saabas.

We have taken the original scripts and rebuilt them from the ground up to make Sphider Pro a lightweight, dynamic, simple-to-install package for running a powerful PHP search engine on your website. Sphider Pro has its own bot to index internet content and images.

We have worked with a UK Web Hosting Company to make sure that this version of Sphider can and will work on shared web hosting packages. They will even install it for you! Webreger.com

Support is available 24/7; help is always on hand.

Sphider Pro comes with very little CSS styling and is ready for you to implement your own style around our scripts.

http://www.sphiderpro.eu/

This version has been tested with over two million keywords, 900,000 links, 29,540 domains and 37,000 images. My Safe Search: http://www.mysafesearch.co.uk/

Introduction - Installation - Change Logs - Documentation - Reviews

Introduction
Sphider Pro was first redeveloped on the 16th April 2009, for a UK community interest company, from the original Sphider created by Ando Saabas. Its first release was Sphider Pro version 1.0.

Since its release on the 16th April 2009, Sphider Pro went through a number of versions within the first year, up to version 1.8 (stable release) in 2010, when it was made available to all charities and community interest companies throughout the world.

Over the previous three years Sphider Pro has undergone a number of changes until it was finally released to the general public as version 3.2 on 29th November 2012, when the website, support and downloads were made available.

Sphider Pro is maintained by a small team of developers who are on hand most hours of the day, 7 days a week, for any support issues and requests you may have.

Sphider Pro offers a wide range of customisation options for the index and search procedures through an admin backend management system.

Modification requests are always welcome via our support, which is open to all members of the Sphider Pro community, and we will do our very best to produce any modifications you request.

http://support.sphiderpro.eu

Liquid Layer Networks

Powered by:

HostCheetah Networks
Global Web Hosting, Domain Registration and Internet Services
http://hostcheetah.net | http://hostcheetah.uk

HostCheetah · April 25, 2018, 2:34pm

Another fork of the original, still active as of 04/2018

Live Search Site
http://www.search.sphider-plus.eu

and

Original Script - end of life
http://www.sphider.eu

HostCheetah · April 25, 2018, 3:27pm

http://bigdata-madesimple.com/top-50-open-source-web-crawlers-for-data-mining

A web crawler (also known as an ant, automatic indexer, bot, web spider, web robot or web scutter) is an automated program or script that methodically scans, or “crawls”, through web pages to create an index of the data it is set to look for. This process is called web crawling or spidering.

There are various uses for web crawlers, but essentially a web crawler is used to collect or mine data from the Internet. Most search engines use crawlers to provide up-to-date data and to find what’s new on the Internet. Analytics companies and market researchers use web crawlers to determine customer and market trends in a given geography. In this article, we present the top 50 open source web crawlers available on the web for data mining.
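
To make the crawl-and-index idea concrete, here is a minimal Python sketch. It is not taken from Sphider Pro or any crawler in the linked article. The names `PageParser`, `crawl` and `PAGES` are hypothetical, and the `fetch` argument stands in for a real HTTP request so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects link targets and text words from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(w.lower() for w in data.split())


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns {word: set of URLs containing it}."""
    index = {}
    seen = {start_url}
    queue = deque([start_url])
    pages_done = 0
    while queue and pages_done < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched, skip it
            continue
        pages_done += 1
        parser = PageParser()
        parser.feed(html)
        for word in parser.words:
            index.setdefault(word, set()).add(url)
        for link in parser.links:
            if link not in seen:  # visit every page at most once
                seen.add(link)
                queue.append(link)
    return index


# Hypothetical in-memory "web" standing in for real HTTP requests.
PAGES = {
    "/a": '<p>hello world</p><a href="/b">next</a>',
    "/b": '<p>hello again</p><a href="/a">back</a>',
}

index = crawl("/a", PAGES.get)
print(sorted(index["hello"]))  # ['/a', '/b']
```

A real crawler would also resolve relative links, honour robots.txt and rate-limit its requests; the sketch leaves all of that out.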

HostCheetah · April 25, 2018, 3:32pm

Sphinx overview

Sphinx is an open source full text search server, designed from the ground up with performance, relevance (aka search quality), and integration simplicity in mind. It’s written in C++ and works on Linux (RedHat, Ubuntu, etc), Windows, MacOS, Solaris, FreeBSD, and a few other systems.

Sphinx lets you either batch index and search data stored in an SQL database, NoSQL storage, or just files quickly and easily — or index and search data on the fly, working with Sphinx pretty much as with a database server.

A variety of text processing features enable fine-tuning Sphinx for your particular application requirements, and a number of relevance functions ensure you can tweak search quality as well.

Searching via SphinxAPI is as simple as 3 lines of code, and querying via SphinxQL is even simpler, with search queries expressed in good old SQL.
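
As a rough illustration of a SphinxQL query "expressed in good old SQL", the Python sketch below only builds the statement text. The helper `sphinxql_match` and the index name `myindex` are hypothetical, and actually sending the statement (through any MySQL client library) is left out, so nothing here should be read as the official client API.

```python
def sphinxql_match(index_name, text, limit=10):
    """Build a SphinxQL SELECT with a MATCH() clause as a string."""
    if not index_name.isidentifier():
        raise ValueError("index name must be a plain identifier")
    # Escape backslashes and single quotes for the SQL string literal.
    # Full-text operator characters in `text` are deliberately left alone.
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return f"SELECT * FROM {index_name} WHERE MATCH('{escaped}') LIMIT {int(limit)}"


print(sphinxql_match("myindex", "web crawler"))
# SELECT * FROM myindex WHERE MATCH('web crawler') LIMIT 10
```

In practice you would probably prefer your client library's own parameter binding over hand-built strings, where it is available.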

Sphinx clusters scale up to tens of billions of documents and hundreds of millions of search queries per day, powering top websites such as Craigslist, Living Social, MetaCafe and Groupon. To view a complete list of known users, please visit our Powered-by page.

And last but not least, it’s licensed under GPLv2.

Performance and scalability

Indexing performance. Sphinx indexes up to 10-15 MB of text per second per single CPU core, that is 60+ MB/sec per server (on a dedicated indexing machine).
Searching performance. Searching through the 1,000,000-document, 1.2 GB text collection that we use for everyday development and testing runs at 500+ queries/sec on a 2-core desktop machine with 2 GB of RAM.
Scalability. Biggest known Sphinx cluster indexes 25+ billion documents, resulting in over 9TB of data. Busiest known one is Craigslist, serving 300+ million search queries/day.
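
To put the indexing figures above in perspective, here is a small back-of-the-envelope calculation in Python. It assumes the 1.2 GB test collection is 1,200 MB (decimal units) and that the quoted rates are sustained throughout, neither of which the text above states; `indexing_seconds` is a hypothetical helper.

```python
def indexing_seconds(collection_mb, mb_per_second):
    """Time to index a collection at a constant throughput."""
    return collection_mb / mb_per_second


COLLECTION_MB = 1200  # 1.2 GB taken as 1,200 MB (an assumption)

slow_core = indexing_seconds(COLLECTION_MB, 10)  # single core, low end
fast_core = indexing_seconds(COLLECTION_MB, 15)  # single core, high end
server = indexing_seconds(COLLECTION_MB, 60)     # whole server at 60 MB/sec

print(slow_core, fast_core, server)  # 120.0 80.0 20.0
```

So, under these assumptions, the test collection would take roughly 80 to 120 seconds to index on one core and about 20 seconds on a dedicated server.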

Key features

Batch and Real-Time full-text indexes. Two index backends that support both efficient offline index construction and incremental on-the-fly index updates are available.
Non-text attributes support. An arbitrary number of attributes (product IDs, company names, prices, etc) can be stored in the index and used either just for retrieval (to avoid hitting the DB), or for efficient Sphinx-side search result set post-processing.
SQL database indexing. Sphinx can directly access and index data stored in MySQL (all storage engines are supported), PostgreSQL, Oracle, Microsoft SQL Server, SQLite, Drizzle, and anything else that supports ODBC.
Non-SQL storage indexing. Data can also be streamed to batch indexer in a simple XML format called XMLpipe, or inserted directly into an incremental RT index.
Easy application integration. Sphinx comes with three different APIs: SphinxAPI, SphinxSE, and SphinxQL. SphinxAPI is a native library available for Java, PHP, Python, Perl, C, and other languages. SphinxSE, a pluggable storage engine for MySQL, enables huge result sets to be shipped directly to MySQL server for post-processing. SphinxQL lets the application query Sphinx using the standard MySQL client library and query syntax.
Advanced full-text searching syntax. Our querying engine supports arbitrarily complex queries combining boolean operators, phrase, proximity, strict order, and quorum matching, field and position limits, exact keyword form matching, substring searches, etc.
Rich database-like querying features. Sphinx does not limit you to just keyword searching. On top of the full-text search result set, you can compute arbitrary arithmetic expressions, add WHERE conditions, do ORDER BY and GROUP BY, use MIN/MAX/AVG/SUM aggregates, etc. Essentially, full-blown SQL SELECT is supported.
Better relevance ranking. Unlike many other engines, Sphinx does not solely rely on 30-year-old statistical ranking that only considers keyword frequencies, nor limits you to it. By default, Sphinx additionally analyzes keyword proximity, and ranks closer phrase matches higher, with perfect matches ranked on top. Also, ranking is flexible: you can choose from a number of built-in relevance functions, tweak their weights by using expressions, or develop new ones.
Flexible text processing. Sphinx indexing features include full support for SBCS and UTF-8 encodings (meaning that effectively all world’s languages are supported); stopword removal and optional hit position removal (hitless indexing); morphology and synonym processing through word forms dictionaries and stemmers; exceptions and blended characters; and many more.
Distributed searching. Searches can be distributed across multiple machines, enabling horizontal scale-out and HA (High Availability).
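
The relevance-ranking point above (keyword frequency plus keyword proximity) can be illustrated with a toy scorer. This is not Sphinx's actual ranking formula; the function `score` and its weighting are hypothetical and only show why two documents with the same keyword counts can rank differently.

```python
def score(query, document):
    """Toy relevance score: keyword frequency plus a proximity bonus."""
    terms = query.lower().split()
    words = document.lower().split()
    # Frequency part: how many document words are query keywords.
    frequency = sum(1 for w in words if w in terms)
    # Proximity part: longest prefix of the query found as consecutive words.
    best_run = 0
    for start in range(len(words)):
        run = 0
        while (run < len(terms) and start + run < len(words)
               and words[start + run] == terms[run]):
            run += 1
        best_run = max(best_run, run)
    # Longer in-order runs earn a larger bonus (hypothetical weight of 2).
    return frequency + 2 * best_run


query = "search engine"
docs = ["engine for fast search", "open source search engine"]
ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
print(ranked[0])  # open source search engine
```

Both documents contain each keyword once, so a frequency-only ranking would tie them; the proximity bonus puts the exact phrase match first.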

License

The Sphinx Search server is dual-licensed, thus it can be either commercially licensed or freely available via the Downloads page if used in accordance with the terms of the GPL v.2.

For those interested in commercial licensing, typically needed for embedding Sphinx in non-GPL products (OEMs/ISVs), please refer to the Commercial Licensing page for additional information, or reach out to the Sphinx Licensing team directly via our Contact page.