A search engine is specialized software that stores data, indexes it, and performs fast, relevant searches. The biggest difference between a search engine and conventional data storage is that a search engine searches significantly faster and can perform an overall full-text search in addition to searching by a specific field. Examples of search engines are Elasticsearch, Lucene, and Solr. To use an analogy to relational databases, a search engine is similar to a database management system (DBMS).

A search index (sometimes referred to simply as an index) is the actual storage where a specific scope of search data is kept. One search engine may work with several indexes. A search index is structured, i.e. it can organize data in complex multilevel structures; the minimum number of levels required to work with an Oro application is two. A search index can validate the data type of the stored data. A search index is a document-based storage where each document represents one specific entity from the main relational database, so it can be considered a specialized reflection of that database. To use an analogy to relational databases, a search index is similar to a database (DB).

An entity alias is a text representation of an entity name stored inside a specific search index. The entity alias represents the first level of the search index structure. To use an analogy to relational databases, an entity alias is similar to a table name.

An entity field is a text representation of an entity property name assigned to a specific entity alias. The entity field represents the second level of the search index structure. Each entity field has an assigned data type (text, integer, decimal, datetime), and the search engine uses this information to validate the data stored inside the index. To use an analogy to relational databases, an entity field is similar to a column name.

An entity field value is the actual value of an entity property assigned to a specific entity field. To use an analogy to relational databases, an entity field value is similar to the value of a column.

The all-text field is a special entity field used for overall full-text search. The value of this field is usually calculated automatically from the text entity field values.

A search document is a combination of entity fields and entity field values that represents the data of one specific entity from the main relational database. A search document has a flat structure, i.e. field values must not contain other documents. To use an analogy to relational databases, a search document is similar to a table row.

Indexation is the process of updating data in a search index. It can mean extracting the required data from entities and saving it to the search index, or removing documents from the search index.

Search mapping is a combination of entity aliases and entity field definitions. To use an analogy to relational databases, search mapping is similar to a database schema.

A search placeholder is a variable part of an entity alias or entity field name that is substituted with an actual value during indexation or when a search request is performed.

The standard index type (sometimes referred to as the default index type or backend index type) is used in all applications based on OroPlatform. Each entity is represented by one plain entity alias and contains plain field names that represent data assigned directly to the main entity or data from related entities.

The standard index type triggers the following events:

oro_search.mapping_config – during the mapping collection process, used to alter mapping information;

oro_search.prepare_entity_map – during the indexation process, triggered for each entity, used to change data stored inside the index;

oro_search.before_search – before the search request, used to change search request before its execution;

oro_search.prepare_result_item – after the extraction of documents from search index, used to populate additional information (entity objects, URLs etc).

The standard index type performs indexation on the entity level, i.e. the indexation process is triggered for each entity. Search field values can be calculated automatically based on the defined search mapping. Each search document contains one all-text field value, calculated automatically as a concatenation of all text entity field values. A field may have several values; in this case it is represented as an array, and comparison operations iterate over all of the values. During reindexation, the entities that are being reindexed are not available for search.
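The all-text concatenation and the handling of multi-value fields can be illustrated with a short sketch. This is not OroSearchBundle code: the function names and the sample document are hypothetical, and the "match if any value is equal" rule is an assumption about how iterating over all values behaves.

```python
from typing import Dict, List, Union

Scalar = Union[str, int, float]
FieldValue = Union[Scalar, List[Scalar]]


def build_all_text(document: Dict[str, FieldValue], text_fields: List[str]) -> str:
    """Concatenate the values of all text fields into one all-text value."""
    parts: List[str] = []
    for name in text_fields:
        value = document.get(name)
        if value is None:
            continue
        # A field may hold several values; treat a single value as a one-item list.
        values = value if isinstance(value, list) else [value]
        parts.extend(str(v) for v in values)
    return " ".join(parts)


def field_equals(document: Dict[str, FieldValue], name: str, expected: Scalar) -> bool:
    """Compare against every value of a field (assumed rule: any equal value matches)."""
    value = document.get(name)
    values = value if isinstance(value, list) else [value]
    return any(v == expected for v in values)


# Hypothetical document with one multi-value text field.
doc = {"name": "Blue shirt", "tags": ["cotton", "summer"], "price": 19.99}
all_text = build_all_text(doc, text_fields=["name", "tags"])
```

Here `all_text` contains the name followed by both tags, and `field_equals(doc, "tags", "summer")` matches because one of the array values is equal.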

Search mapping can be stored inside any bundle in the file Resources/config/oro/search.yml. Engine-specific services have to be defined in the file Resources/config/oro/search_engine/<engine>.yml. The main logic of this index type is located in OroSearchBundle in the platform package.

The website index type (sometimes referred to as the frontend index type) is used only in the OroCommerce application, which uses the standard index type in the backend part and the website index type in the frontend part. Each entity is represented by one alias with an optional search placeholder (e.g. oro_product_WEBSITE_ID), so each website may have its own entity alias (e.g. oro_product_1 and oro_product_2). Entity fields may contain search placeholders as well (e.g. name_LOCALIZATION_ID), so each field may have several values depending on the provided placeholders (e.g. name_1, name_2 and name_3).

The website index type triggers the following events:
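Placeholder substitution can be pictured as simple token replacement. The sketch below only illustrates the idea; the function name is hypothetical, and the real placeholders are resolved by classes implementing the placeholder interface in OroWebsiteSearchBundle.

```python
from typing import Dict


def resolve_placeholders(name: str, values: Dict[str, object]) -> str:
    """Replace each known placeholder token in a name with its actual value."""
    # Replace longer tokens first so that one token cannot clobber part of another.
    for token in sorted(values, key=len, reverse=True):
        name = name.replace(token, str(values[token]))
    return name


# Alias for website 1 and the localized name field for localization 2.
alias = resolve_placeholders("oro_product_WEBSITE_ID", {"WEBSITE_ID": 1})
field = resolve_placeholders("name_LOCALIZATION_ID", {"LOCALIZATION_ID": 2})
```

With these inputs the alias resolves to oro_product_1 and the field name to name_2, matching the examples given in the text.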

oro_website_search.reindexation_request – triggers the reindexation process for the specified scope of entities;

oro_website_search.event.website_search_mapping.configuration – during the mapping collection process, used to alter mapping information;

oro_website_search.event.collect_context – before the indexation, used to collect context which will be used during the indexation;

oro_website_search.event.restrict_index_entity – during the indexation for all entities, used to decrease the number of entities that have to be indexed;

oro_website_search.event.restrict_index_entity.<alias> – during the indexation for the specific entity, used to decrease the number of specific entities that have to be indexed (note that <alias> is a standard entity alias, not a search entity alias);

oro_website_search.event.index_entity – during the indexation for all entities, used to collect data to put into the search index;

oro_website_search.event.index_entity.<alias> – during the indexation for the specific entity, used to collect data for a specific entity to put into the search index (note that <alias> is a standard entity alias, not a search entity alias);

oro_website_search.before_search – before the search request, used to change search request before its execution.

The website index type performs reindexation on the entity batch level, i.e. the indexation process is triggered for a batch of entities (the default batch size is 100). Search field values have to be calculated and set manually in a listener to the oro_website_search.event.index_entity event. Each search document contains an all-text field for each available localization (all_text_LOCALIZATION_ID) and one all-text field that includes values from all localizations. These values are calculated automatically based on a flag set during the indexation, i.e. the developer may specify exactly which values should be included in the all-text field value. Each field without a placeholder must have only one value. During reindexation, entities that are being reindexed remain available with the old (outdated) data.

Placeholders are defined in code in classes that implement Oro\Bundle\WebsiteSearchBundle\Placeholder\PlaceholderInterface. Here are the most commonly used placeholders (note that there are more of them in the code):
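Batch-level processing means that entity identifiers are split into fixed-size groups before indexation. A minimal sketch of such a split, using the default batch size of 100 mentioned above, could look like this; the function name is hypothetical and the code is not taken from OroWebsiteSearchBundle.

```python
from typing import Iterator, List, Sequence


def split_into_batches(entity_ids: Sequence[int], batch_size: int = 100) -> Iterator[List[int]]:
    """Yield consecutive batches of at most batch_size identifiers."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(entity_ids), batch_size):
        yield list(entity_ids[start:start + batch_size])


# 250 entity identifiers produce two full batches and one partial batch.
batches = list(split_into_batches(range(1, 251)))
```

Each batch would then be indexed as one unit, so a listener receives up to 100 entities at a time rather than a single entity.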

WEBSITE_ID – integer identifier of a current website

LOCALIZATION_ID – integer identifier of a current localization

CUSTOMER_ID – integer identifier of a current customer user

CURRENCY – string identifier of a current currency

CPL_ID – integer identifier of a current combined price list

Search mapping can be stored inside any bundle in the file Resources/config/oro/website_search.yml. Engine-specific services have to be defined in the file Resources/config/oro/website_search_engine/<engine>.yml. The main logic of this index type is located in OroWebsiteSearchBundle in the commerce package.

The ORM search engine doesn’t use an actual document-based storage. Instead, it emulates such storage inside the application’s relational database using the EAV model. As a consequence, the performance of the ORM engine is limited, and it is recommended only for small applications with a couple of thousand entities.
The ORM search engine uses a separate entity manager and connection called “search”, so search requests can be processed independently of the default DB connection.
The ORM search engine for the standard index type is implemented in OroSearchBundle in the platform package; for the website index type, it is implemented in OroWebsiteSearchBundle in the commerce package.

The Elasticsearch search engine can store large amounts of data and perform fast search queries. It is recommended for bigger applications with hundreds of thousands or millions of entities.

Note

The Elasticsearch feature is only available in the Enterprise edition.

The Elasticsearch search engine requires credentials to connect to the actual index. Credentials include optional WWW-auth parameters and SSH connection support.
The Elasticsearch search engine for the standard index type is implemented in OroElasticSearchBundle in the platform-enterprise package; for the website index type, it is implemented in OroWebsiteElasticSearchBundle in the commerce-enterprise package.
The current implementation supports only Elasticsearch 2.* versions.

The search index stores documents grouped by entity alias. Each document contains scalar fields with values; one of the fields is the all-text field used to perform overall search. Data from related entities may be stored as well, but in this case it has to be denormalized to a flat structure.
There are four supported entity field value data types: text, integer, decimal, and datetime.

The ORM search engine stores data using the EAV model. There is one main entity and four related entities, one for each of the supported field data types.
The main entity contains the main information about the document: entity class, entity ID, entity alias, default title, a flag that indicates whether the entity was changed, and createdAt/updatedAt fields. All four related entities have a similar structure: they store the name of the entity field and the actual entity field value. The following diagram shows this structure:

Each of the supported index types uses its own set of entities – i.e. 5 entities for standard index type and 5 more entities for website index type.
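To make the EAV layout concrete, the sketch below builds one main table and four typed value tables in an in-memory SQLite database. The table and column names are hypothetical and chosen only for illustration; the real schema is defined by the Oro search bundles.

```python
import sqlite3

# Hypothetical value tables, one per supported field data type.
VALUE_TABLES = {"text": "TEXT", "integer": "INTEGER", "decimal": "REAL", "datetime": "TEXT"}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item (id INTEGER PRIMARY KEY, entity_class TEXT, "
    "entity_id INTEGER, alias TEXT, title TEXT)"
)
for type_name, sql_type in VALUE_TABLES.items():
    conn.execute(
        f"CREATE TABLE index_{type_name} (item_id INTEGER REFERENCES item(id), "
        f"field TEXT, value {sql_type})"
    )


def save_document(entity_class, entity_id, alias, title, fields):
    """Store one document: a row in item plus one row per field value."""
    cursor = conn.execute(
        "INSERT INTO item (entity_class, entity_id, alias, title) VALUES (?, ?, ?, ?)",
        (entity_class, entity_id, alias, title),
    )
    item_id = cursor.lastrowid
    for (field, type_name), value in fields.items():
        # The field's data type decides which value table receives the row.
        conn.execute(
            f"INSERT INTO index_{type_name} (item_id, field, value) VALUES (?, ?, ?)",
            (item_id, field, value),
        )
    return item_id


item_id = save_document(
    "Product", 42, "oro_product", "Blue shirt",
    {("name", "text"): "Blue shirt", ("stock", "integer"): 7, ("price", "decimal"): 19.99},
)
# Searching by a field requires a join per condition, which is why EAV is slow at scale.
rows = conn.execute(
    "SELECT i.entity_id FROM item i JOIN index_integer v ON v.item_id = i.id "
    "WHERE v.field = 'stock' AND v.value > 5"
).fetchall()
```

Every additional search condition adds another join against a value table, which illustrates why this storage model is recommended only for small data sets.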

The string representation is similar to a standard SQL query. The string may contain the keywords select, from, where, aggregate, order_by, offset and max_results. The string representation is used mostly in the API, where a user can request specific data with a specific condition. During processing, the string representation is converted to the object representation.

The object representation has two levels: low and high.
The low-level object (Oro\Bundle\SearchBundle\Query\Query, sometimes called the search query builder) represents a query and has parts similar to the string representation (select, where etc). The low-level query is not aware of any specific search engine and is used by all search engines as the main query representation. The low-level object is in fact a data transfer object.
The high-level object (an implementation of Oro\Bundle\SearchBundle\Query\SearchQueryInterface) hides search engine specific logic from the developer. It embeds the low-level object and proxies most of the calls. High-level objects are created by the query factory (an implementation of Oro\Bundle\SearchBundle\Query\Factory\QueryFactoryInterface). Each index type has its own implementation of the high-level object, which encapsulates the way the query has to be executed, and its own implementation of the query factory responsible for creating the high-level object. The high-level object is in fact a facade.
The following diagram demonstrates the connection between the low-level object, high-level object, query factory and search engine:
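The relationship between these objects can also be sketched in code. The following is a language-neutral illustration written in Python, not the Oro API: all class, method and field names are hypothetical, and the engine is a stand-in function that filters a fixed list of documents.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Query:
    """Low-level object: a plain data holder that knows nothing about engines."""
    select: List[str] = field(default_factory=list)
    from_aliases: List[str] = field(default_factory=list)
    where: Dict[str, object] = field(default_factory=dict)
    max_results: Optional[int] = None


Engine = Callable[[Query], List[dict]]


class SearchQuery:
    """High-level object: wraps a Query and hides how it gets executed."""

    def __init__(self, engine: Engine):
        self._engine = engine
        self._query = Query()

    def add_select(self, name: str) -> "SearchQuery":
        self._query.select.append(name)  # proxied to the low-level object
        return self

    def set_from(self, alias: str) -> "SearchQuery":
        self._query.from_aliases = [alias]
        return self

    def add_where(self, name: str, value: object) -> "SearchQuery":
        self._query.where[name] = value
        return self

    def execute(self) -> List[dict]:
        return self._engine(self._query)


class QueryFactory:
    """Creates high-level objects bound to one engine."""

    def __init__(self, engine: Engine):
        self._engine = engine

    def create(self) -> SearchQuery:
        return SearchQuery(self._engine)


def in_memory_engine(query: Query) -> List[dict]:
    """Stand-in engine that filters a fixed list of documents."""
    documents = [{"alias": "oro_product", "sku": "A1"}, {"alias": "oro_product", "sku": "B2"}]
    matched = [
        d for d in documents
        if d["alias"] in query.from_aliases
        and all(d.get(k) == v for k, v in query.where.items())
    ]
    return [{name: d[name] for name in query.select} for d in matched]


results = (
    QueryFactory(in_memory_engine)
    .create()
    .add_select("sku")
    .set_from("oro_product")
    .add_where("sku", "B2")
    .execute()
)
```

The calling code only talks to the factory and the high-level object, so swapping `in_memory_engine` for another engine would not change how the query is built.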

The recommended way to trigger a search request is to get an instance of the high-level object, build a query, execute it and get the results. It’s also recommended to isolate all search requests in a search repository (see the Best Practices section) to separate storage logic from business logic.
However, if you really need to work on a lower level (e.g. to write a functional test), you can get an instance of the appropriate search engine type. All of the following engines implement the standard search engine interface Oro\Bundle\SearchBundle\Engine\EngineInterface:

Most indexation operations are performed asynchronously using a message queue. The advantage of this approach is that the user does not have to wait until indexation finishes to see a response from the application; in addition, asynchronous indexations can be performed in parallel to speed up the overall indexation process. The disadvantage is that indexation may happen with a delay, which depends on the number of consumers, the server hardware and the queue length.
Every time an entity that should be represented by a document in a search index is changed, a new message containing the entity class and entity identifier is generated and sent to the message queue. A message queue consumer then receives this message and runs the appropriate message processor, which performs the actual indexation and changes the search index.
Remember that parallel indexation is possible only if several message queue consumers are running. Each consumer is able to run indexation, so the more consumers are running, the more indexations can be performed in parallel.
All automatically triggered reindexations are processed asynchronously.
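The flow described above can be modelled in a few lines. This is a simplified, single-process sketch with hypothetical names; a real deployment uses the Oro message queue and separate consumer processes.

```python
from collections import deque
from typing import Deque, Dict, Tuple

Message = Tuple[str, int]  # (entity class, entity identifier)

queue: Deque[Message] = deque()
entities: Dict[Message, dict] = {("Product", 1): {"name": "Blue shirt"}}
search_index: Dict[Message, dict] = {}


def on_entity_changed(entity_class: str, entity_id: int) -> None:
    """Called when an entity changes: only enqueue a small message."""
    queue.append((entity_class, entity_id))


def consume_one() -> bool:
    """What one consumer does per message: load the entity and update the index."""
    if not queue:
        return False
    key = queue.popleft()
    entity = entities.get(key)
    if entity is None:
        search_index.pop(key, None)  # entity was removed, drop its document
    else:
        search_index[key] = dict(entity)
    return True


on_entity_changed("Product", 1)
indexed_before_consumer = ("Product", 1) in search_index  # still queued, not indexed
while consume_one():
    pass
indexed_after_consumer = ("Product", 1) in search_index
```

The gap between `indexed_before_consumer` and `indexed_after_consumer` is the indexation delay: until a consumer processes the message, the change is not visible in the search index.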

Although asynchronous processing is convenient for the user, sometimes you need to track the process manually and make sure that indexation finishes right away. In this case, use synchronous indexation instead of asynchronous indexation. The advantage of this approach is that indexation happens immediately, without any delay. The disadvantage is that it might be slower and the user experience is worse than with asynchronous indexation.

The standard search index type provides the CLI command oro:search:reindex, which can be used to manually trigger full reindexation of all entities or only entities of a specific class. It has a flag called scheduled to run indexation asynchronously.

The website search index type provides a similar CLI command called oro:website-search:reindex, which is used to manually trigger full reindexation of all entities, only entities of a specific class, or entities for a specific website. It also has a flag called scheduled to run indexation asynchronously.

The website search index type provides an event called oro_website_search.reindexation_request to manually trigger the reindexation process for the specified scope of entities. It uses the event class Oro\Bundle\WebsiteSearchBundle\Event\ReindexationRequestEvent, which accepts a boolean parameter $scheduled to specify whether indexation has to be asynchronous (the default behaviour) or synchronous.

Both the standard and website search index types have synchronous and asynchronous indexers, which trigger the corresponding type of indexation. All of the following indexers implement the same standard indexer interface Oro\Bundle\SearchBundle\Engine\IndexerInterface:

Standard index type automatically adds owner and organization fields to all entities and fills them with data during the indexation process. Then during the search request ACL restrictions are automatically applied to a low-level query to show only entities which current user is allowed to see.

The website index type doesn’t have common ACL protection like the standard index type. Instead, each entity can be protected by custom conditions. For example, the visibility of a Product entity to a customer user is affected by the product status, the product inventory status and the product visibility settings on the customer, customer group and website levels.

The ORM search engine uses the DBMS as its main storage, so its scalability depends on the scalability of the DBMS. For example, PostgreSQL supports several clustering solutions, so the ORM search index can be scaled together with the main relational DB.
There is one more option. Because the ORM search engine uses its own connection and its own entity manager, all search related tables can be moved to a separate DB on a remote server. In this case, the application administrator has to override the configuration of the connection called search and point it to this remote server.

The Elasticsearch search engine is horizontally scalable out of the box: it can be organized as an Elasticsearch cluster with several nodes. The application administrator can specify how many Elasticsearch shards the index consists of (i.e. how many parts it is split into); the default number of shards is 5. Depending on the number of nodes in the cluster, the search engine can then move shards to different nodes and, as a consequence, perform distributed search.

The ORM search engine uses the DBMS as its main storage, so its ability to handle an unavailable storage depends on how the DBMS is organized. Both MySQL and PostgreSQL support replication, which can be used to manage such situations.
If you keep the search index in a separate DB, you can set up replication exclusively for this DB.

The Elasticsearch search engine supports replication as well. The application administrator can specify how many Elasticsearch replicas the index should have; the default number of replicas is 1. The Elasticsearch cluster then creates the appropriate number of replica shards and distributes them over the available nodes.
For example, by default an index is created with 5 shards and 1 replica, which means that Elasticsearch creates 10 shards: 5 primary shards and 5 replica shards.
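The shard arithmetic generalizes to other settings: every replica adds one full copy of each primary shard. A small helper (hypothetical name, shown only to make the calculation explicit) expresses this:

```python
def total_shards(primary_shards: int = 5, replicas: int = 1) -> int:
    """Each replica adds one full copy of every primary shard."""
    return primary_shards * (1 + replicas)


default_total = total_shards()      # 5 primary shards plus 5 replica shards
two_replicas = total_shards(5, 2)   # 5 primary shards plus 10 replica shards
```

With the defaults this gives the 10 shards from the example above, and raising the replica count to 2 gives 15.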

Logging is an essential part of any component, and the search component is no exception. In dev mode, both the standard and website search indexes log all requests to the search index storage (DB or Elasticsearch); in prod mode, only exceptions are logged. In prod mode, all exceptions are also sent to the email addresses specified in the system configuration in the System Configuration > General setup > Application settings > Error logs notification section.
The standard search index may also log all search queries to the database table oro_search_query (the entity name is Oro\Bundle\SearchBundle\Entity\Query); by default this logging is turned off.
The Elasticsearch engine implementations use their own Monolog logger channels: oro_elastic_search for the standard index type and oro_website_elastic_search for the website index type.

There is a special datagrid search datasource that works with search index. Search datasource works with high-level search query object, so both index types are supported. Datagrids based on this datasource use configuration similar to ORM configuration – developer can specify shown columns, filters, sorters, properties and actions.

Implemented search datagrids:

search results grid (search-grid) – used to show results of overall search at all Oro applications, grid may show any entity;

products frontend grid (frontend-product-search-grid) – used to represent the list of products and show product search results in the OroCommerce frontend part, grid shows only Product entities.

Search index is a document based storage and it does not support relations. Instead you can denormalize them and store related information at the main entity level – this way you can increase search speed.
For example, if Product entity has a relation to Brand entity then Product document at the search index may store some brand information (ID, label etc) which might be required to do effective and fast search.
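A minimal sketch of such denormalization is shown below. The input shape and the field names brand_id and brand_label are hypothetical and chosen only for illustration.

```python
def product_to_document(product: dict) -> dict:
    """Flatten a product and its related brand into one level of scalar fields."""
    brand = product.get("brand") or {}
    return {
        "sku": product["sku"],
        "name": product["name"],
        # Brand data is copied into the product document instead of being referenced.
        "brand_id": brand.get("id"),
        "brand_label": brand.get("label"),
    }


document = product_to_document(
    {"sku": "A1", "name": "Blue shirt", "brand": {"id": 7, "label": "Acme"}}
)
```

The resulting document contains no nested structures, so a search by brand label does not need a second lookup. The trade-off is that the product has to be reindexed when its brand changes.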

Every time you want to request custom information from the search index, you should get it via a search repository. A search repository provides a storage abstraction layer (similar to Doctrine entity repositories), so business logic is aware of the repository but not of the search index structure.
To create a repository, create a new class that extends the appropriate search repository class (Oro\Bundle\SearchBundle\Query\SearchRepository for the standard index type or Oro\Bundle\WebsiteSearchBundle\Query\WebsiteSearchRepository for the website index type), declare it as a service, add custom methods and inject it into the business logic service that needs it. Search repositories work with the high-level search query object representation. You can optionally pass an entity name as well; in this case, queries are executed only for that entity by default.
Here is an example of a search repository for the standard index type and its definition:

Please don’t use the search engine and search indexer directly in production code.
If you need to perform a search request, it’s better to encapsulate it inside a search repository.
If you need to trigger reindexation, remember that most of the data is reindexed automatically. If you need manual reindexation, you can usually do it on a higher level, e.g. by triggering an event in the website index type.
If for some reason you still have to work directly with the search engine or search indexer, encapsulate all your logic inside an intermediate storage layer service (similar to a search repository) and use it in the business logic layer. This way business logic is aware of the storage but not of the structure of the search index itself.

If you need to debug search requests or indexations but don’t know the entry point, you can set a breakpoint in the appropriate engine or indexer (or in all engines and indexers). Once the breakpoint is hit, you can inspect the whole stack trace. The most commonly used method of a search engine is search; the most commonly used methods of a search indexer are save and delete.
Remember that most indexations happen asynchronously. To debug real indexation, set a breakpoint in the appropriate indexer, run the message queue consumer in debug mode so that the breakpoint can be hit, and then trigger asynchronous indexation.
If you need to debug a remote server that you cannot access, configure logging there and check the results in the logs.

Search index interaction, like any storage interaction, has to be covered with functional tests. In functional tests you can work directly with the search repository, search engine or search indexer.
You can and should also cover search index interaction with Behat tests.

Elastic HQ is a useful plugin for developers because it provides a UI to manage an Elasticsearch cluster (indexes, mappings, queries etc) instead of the plain CLI. It is recommended to use this plugin only in the development environment because of possible security issues in a production environment.

There are several ways to optimize a search index and speed up search and indexation.

Change search engine. If you’re using the ORM search engine, you can switch to the Elasticsearch engine and do a full reindexation (might take some time). The ORM search engine uses the EAV model to store documents in relational storage, which is neither efficient nor fast. The Elasticsearch engine is document based by design, so it works faster and shows much better performance.

Storage optimization. You can improve hardware used for a search index storage – more RAM, more processing cores, faster disk (SSD) etc. If you’re using ORM search engine then you can check performance of a standard search queries – just get queries, do EXPLAIN and see how you can optimize DBMS.

Index data optimization. You can check what entity data is not required (or not used) at a search index, remove it from mapping and/or indexed data and trigger full reindexation (might take some time). After that each search document should become smaller and whole index should take less space on a disk.

Index structure optimization. If you’re using Elasticsearch you may change the way full-text index is built – you can change default index analyzers, index tokenizers and index token filters to the faster ones.

Accessibility optimization. You can measure delay required to connect to a search index storage and try to decrease this value – e.g. move it to a server with smaller network delay and connection time. If you’re using Elasticsearch cluster then you can check where shards and replicas are placed and optimize this infrastructure as well.

Here are a couple of recommendations on how to work properly with a search index that holds a large amount of data, 1 million entities or more.

Run indexation in parallel – split the scope of indexed entities into small chunks (1000-10000 entities) and make sure that there are enough consumers to process them in parallel.

Use Elasticsearch engine – it’s faster and performs much better than the ORM search engine with large amounts of data.

Use language optimization – if you know what languages are used at your application then you can optimize index structure and data according to these languages; see list of Elasticsearch language analyzers.

Keep search index at the separate server – this way it will not be affected by main relational DB load.

Use Elasticsearch cluster if needed – if your index is too big to keep on one server and/or you want to balance the search index load between several servers, you can use an Elasticsearch cluster.

Increase RAM – the recommended amount of RAM for search index is a half of index size or more, i.e. if you have 50GB of index data, it is recommended to have 25GB+ of RAM.

Use SSD – this type of disk provides faster read/write access than HDD, so parts of the search index data can be read faster.

The changed entity might not be indexed yet because the reindexation request message is still waiting in the message queue. Make sure that a consumer is running and all messages are processed, then try again.

First possible reason: The new entity might not be indexed yet because the reindexation request message is still waiting in the message queue. Make sure that a consumer is running and all messages are processed, then try again.
Second possible reason: The current user is not allowed to see the new entity.
Standard search index type: The current user doesn’t have permission to see the entity. Check the ownership and organization of the entity and verify that the current user has access to it.
Website search index type: The entity is invisible to the current user. Check the parameters that might affect the visibility of the entity to the current user (statuses, visibility restrictions, system configuration etc).

Please verify the credentials specified in the config/parameters.yml file: host, port, index name and authentication options. You can also try to connect to the Elasticsearch server manually with the CLI curl command to make sure that you have access to it.

oro:website-search:reindex – reindexes all entities, only entities of a specific class, entities for a specific website, or specific entities by their identifiers; indexation can be synchronous (default behaviour) or asynchronous.