Main application: InnoDB’s Full Text Search overview (with Shiny/R)

## Credits

Author: Emanuel Calvo

Company: Pythian

Thanks to Valerie Parham-Thompson @ Pythian and Daniel Prince @ Oracle.

Repository available at Github.

For the whole article and the Shinyapp application is available here.

## Some initial thoughts

A couple of days ago one of our customers came up with a question regarding FTS over InnoDB engine. Although the question is not answered in the current article, I came up with the conclusion that FTS is sometimes misunderstood.

In order to show the effects off the field sizes over the query expansion algorithm, you will see two main tables (bookContent and bookContentByLine) both containing the same books in different approaches: by line and by paragraph. You’ll see the noise generated by the QUERY EXPANSION algorithm when phrases are too large.

The current article has been developed using Shiny/R in order to allow you to see the effects of the algorithms.

For the sake of simplicity, in this article we won’t go through the FTS parsers. Probably that would be material for a future post.

## Why I consider FTS sometimes misunderstood?

FTS is a technology that can be use for any purpose, not only simple searches. There is a myth that the FTS only should be placed on clusters for that purpose, which I agree. However, certain bussines rules require complex searches, and having such feature can be a win.

RDBMS aren’t a good place for massive amount of FTS queries, without using any of the join capabilities that they offer, or the ACID complaints.

As I said above, FTS is totally acceptable in RDBMS, if you are using at least one RDBMS critical feature, required by your bussines model.

## Action!

Here is an example of how ranks differ among algorithms and field sizes using the word ‘country’:

To start showing the effects of the algorithms, the following example searches the word ‘country’ using query expansion. This means that we are not looking only the exact matches, but also the entries that appear the most when the the exact match has been found.

In the SELECT clause you’ll see both FTS expressions using NATURAL LANGUAGE with query expansion and BOOLEANmodes respectively.

set global innodb_ft_aux_table = 'ftslab/bookContentByLine';

SELECT  content, bookid, group_concat(it.POSITION) as pos,
round(MATCH(content) AGAINST ("country" IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION)) as QERank,
round(MATCH(content) AGAINST ("country" IN BOOLEAN MODE)) as BoolRank
FROM bookContentByLine bl join information_schema.INNODB_FT_INDEX_TABLE it
ON (bl.FTS_DOC_ID = it.DOC_ID)
WHERE  MATCH(content) AGAINST ("country" IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION)
AND it.WORD = 'country'
GROUP BY FTS_DOC_ID
ORDER BY 4 DESC
LIMIT 10 ;

Result:

+-----------------------------------------------------------------------------+--------+-----+--------+--------+
|content                                                                      | bookid | pos | QERank |BoolRank|
+-----------------------------------------------------------------------------+--------+-----+--------+--------+
|"country in September, migrating into Holland, and leave their mates behind" |  15707 | 1   |    105 |      7 |
|"unsatisfied desire to serve his country, the two prevalent enthusiasms at"  |  15707 | 33  |     98 |      7 |
|"Language, Vol. I. p. 212. In this country, where four or five horses travel"|  15707 | 35  |     93 |      7 |
|"inflicting immense damage upon the country. Whereupon the Florentines"      |   1232 | 36  |     89 |      7 |
|"made for a country of twenty or thirty millionsâ€™ population, can be laid" |  39064 | 12  |     89 |      7 |
|"The spiders of this country manufacture nets of various forms, adapted to"  |  15707 | 21  |     87 |      7 |
|"a velvet-covered arm-chair at my head! This country is too decadent"        |  33415 | 45  |     86 |      7 |
|"country may be ennobled, and under its auspices may be verified that"       |   1232 | 1   |     84 |      7 |
|"name. The writer of this unpublished pamphlet sees his country in a"        |  39064 | 56  |     84 |      7 |
|"In our country, Mr. Pennant informs us, that some quails migrate, and"      |  15707 | 8   |     83 |      7 |
|"all the morning in passing over the adjacent country." (Voyage to Senegal," |  15707 | 46  |     82 |      7 |
|"the electoral system of the country. Immediately an outcry burst out"       |  39064 | 29  |     82 |      7 |
|"country, under a most excellent president, wherein all cities had their"    |   1232 | 1   |     81 |      7 |
|"Though in this country horses shew little vestiges of policy, yet in the"   |  15707 | 16  |     81 |      7 |
|"country districts. As Lucca had five gates, he divided his own country"     |   1232 | 1,63|     80 |     14 |
+-----------------------------------------------------------------------------+--------+-----+--------+--------+
15 rows in set (1,16 sec)

The noise generated by the query expansion is expected and described in the official documentation here.

The interesting case is the following row, which has 2 exact occurrences and it is not the highest rank using query extension. Remember, this is expected.

Text: "country districts. As Lucca had five gates, he divided his own country"
bookid: 1232
pos: 1,63
QERank: 80
BoolRank: 14

This is even worser when using large sentences. In the example bellow you will see the same query, against the table storing by paragraph. The boolean rank shows some of the entries way above others, however the query extension locates at the top records that not necessarily has a lot of exact matches.

SET GLOBAL innodb_ft_aux_table = 'ftslab/bookContent';

SELECT bookid, FTS_DOC_ID,
group_concat(it.POSITION) as positions,
round(MATCH(content) AGAINST ("country" IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION)) as QERank,
round(MATCH(content) AGAINST ("country" IN BOOLEAN MODE)) as BooleanRank,
length(content) as len
FROM bookContent bl join information_schema.INNODB_FT_INDEX_TABLE it ON (bl.FTS_DOC_ID = it.DOC_ID)
WHERE  MATCH(content) AGAINST ("country" IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION)
AND it.WORD = 'country'
GROUP BY FTS_DOC_ID
ORDER BY QERank DESC
LIMIT 10 ;

Result:

+--------+------------+-----------------+--------+-------------+-------+
| bookid | FTS_DOC_ID | positions       | QERank | BooleanRank | len   |
+--------+------------+-----------------+--------+-------------+-------+
|  16452 |      17637 | 942,2552,9084   |  32494 |          10 | 51790 |
|  16452 |      17827 | 31699           |  30232 |           3 | 51701 |
|  16452 |      17761 | 667,47646       |  29517 |           7 | 50264 |
|  16452 |      17791 | 13566           |  28888 |           3 | 49129 |
|  16452 |      17927 | 23259,7044      |  26731 |           7 | 48983 |
|  16452 |      17839 | 9012,199        |  24933 |           7 | 44451 |
|  16452 |      17815 | 29318           |  24745 |           3 | 44011 |
|  16452 |      17729 | 895,16485,24034 |  23305 |          10 | 42612 |
|  16452 |      17621 | 1765            |  19935 |           3 | 36698 |
|  16452 |      17803 | 3942            |  17552 |           3 | 30586 |
+--------+------------+-----------------+--------+-------------+-------+
10 rows in set (1,88 sec)

The query expansion is useful when you intend to search which entries contain more words that appear frequently within the search term. Having large text fields increase the probability to have more words that appear among the search term. In the case of bookContent table (by paragraph table), the average field size is r rs$len characters. ## The INNODB_FT_INDEX_TABLE There is a way to play with the contents of the FTS indexes. As you may noticed in the previous examples, I used the set global innodb_ft_aux_table = 'ftslab/bookContent'; statement, which loads the index content to memory for an easy querying. If you use RDS, the option innodb_ft_aux_table is not available as it is GLOBAL and require SUPER privileges. i.e. You can easily get the most frequent tokens: SELECT WORD,count(*) FROM information_schema.INNODB_FT_INDEX_TABLE group by WORD having count(*) > 1000 order by 2 limit 10; Result: +--------+----------+ | WORD | count(*) | +--------+----------+ | should | 1023 | | yet | 1027 | | any | 1070 | | like | 1071 | | been | 1073 | | first | 1080 | | nor | 1087 | | your | 1106 | | thou | 1130 | | shall | 1164 | +--------+----------+ 10 rows in set (5,40 sec) Probably it isn’t a very useful information as most of this words appears too frequently and are modal verbs, adverbs, pronouns, determiners, etc. It could be the case that you are not interested on indexing those words. If that’s the case you can add them as stopwords in your own stopwords table. Specially if you are more interested in boolean searches, loosing some part of the language expressions. I built I query for this situation to allow us to build the stopwords table using the current words that we want to add to the filtering: (ftslab) > select group_concat(WORD) FROM (select distinct WORD FROM information_schema.INNODB_FT_INDEX_TABLE group by WORD having count(*) > 1000) d\G *************************** 1. row *************************** group_concat(WORD): all,and,any,been,but,can,first,had,has,have,her,him,his,into, its,like,may,more,nor,not,now,one,only,other,our,said,shall,she,should,some,such, than,their,them,then,there,these,they,those,thou,thus,thy,time,were,which,would, yet,you,your 1 row in set (5,28 sec) Let’s build our filter table using both default and new entries and keeping the alphabetical order: CREATE TABLE bookContentByLine_stopwords(value VARCHAR(30)) ENGINE = INNODB; INSERT INTO bookContentByLine_stopwords SELECT value FROM ( SELECT value FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD UNION SELECT DISTINCT WORD as value FROM information_schema.INNODB_FT_INDEX_TABLE GROUP BY WORD having count(*) > 1000 ) allEntries ORDER BY value ASC; DROP INDEX ftscontent ON bookContentByLine; SET GLOBAL innodb_ft_server_stopword_table = 'ftslab/bookContentByLine_stopwords'; CREATE FULLTEXT INDEX ftscontent ON bookContentByLine(content); Checking the contents of the index is easy as: (ftslab) > select * from information_schema.INNODB_FT_INDEX_TABLE WHERE lower(WORD) like '%country%'; +------------------+--------------+-------------+-----------+--------+----------+ | WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION | +------------------+--------------+-------------+-----------+--------+----------+ | country | 149 | 787 | 28 | 733 | 265 | | country | 149 | 787 | 28 | 733 | 1342 | | countrydistricts | 733 | 733 | 1 | 733 | 816 | | thecountry | 249 | 733 | 2 | 733 | 750 | +------------------+--------------+-------------+-----------+--------+----------+ 4 rows in set (0,08 sec) (ftslab) > select * from information_schema.INNODB_FT_INDEX_TABLE WHERE DOC_ID = 155 AND lower(WORD) like '%country%'; +---------+--------------+-------------+-----------+--------+----------+ | WORD | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION | +---------+--------------+-------------+-----------+--------+----------+ | country | 149 | 787 | 28 | 155 | 31 | | country | 149 | 787 | 28 | 155 | 495 | | country | 149 | 787 | 28 | 155 | 158 | | country | 149 | 787 | 28 | 155 | 525 | +---------+--------------+-------------+-----------+--------+----------+ 4 rows in set (0,09 sec) In the example shown before the is no intention to compare ranks score as they are based in different algorithms. The idea there is to show that QUERY EXPANSION can have non desire results in some cases due to its mechanism. ### Going ahead on choosing stop words The full article is amazingly interesting. In a brief, it says that the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on (rank-frequency distribution is an inverse relation). The idea here is to measure how much index do we safe cutting those words that are extremely frequent and don’t add a necessary meaning to the search. ## Considerations and recommendations • Use QUERY EXPANSION only if you are interested to search relations over exact matches. Remember that the field size is crucial when using this. • FTS is not the best fit for exact string matches in single columns. You don’t want to use FTS for searching emails in a single column, name and lastname fields , i.e. For those, you’ll probably use other techniques as reverse searches or exact match operator (=). • Keep your FTS indexes short. Do not add ALL the text columns. Parse first from your application the user search and adapt the query. • If you are using BOOLEAN MODE, you can use the rank score to filter rows. MySQL is clever enough to optimize the FTS functions to avoid double executions. You can do this using something like: match(content,title) against ("first (<second >third)") > 1 Generally, scores lower than 1 can be ignored when using boolean or natural mode searches. • OPTIMIZE TABLE does a rebuild of the table. To avoid this, set innodb_optimize_fulltext_only=1 in order to do an incremental maintance on the table. • Recall that NATURAL LANGUAGE MODE does not take the operands as the BOOLEAN MODE. This affects the ranking score (try +bad (thing) i.e.) • If you plan to order by rank, it is not necessary to specify the clause ORDER BY as InnoDB does the order after retrieve the doc ids . Also,the behavior is different from the default as it returns the heaviest at the top (like an ORDER BY rank DESC). • If you come from MyISAM’s FTS implementation, recall that the ranking scoring is different. • Create the FULLTEXT index after the data is loaded InnoDB bulk load. When restoring FTS backups, you will probably hit the “ERROR 182 (HY000) at line nn: Invalid InnoDB FTS Doc ID”. • Try to avoid using use more than one FTS expression in the where clause. Keep in mind that this affects the order in the results and it consumes a considerably amount of CPU. InnoDB orders by the latest expression in the WHERE clause. WL#7123 • Also, if avoiding the rank information in the projection (SELECT clause) and using other aggregations like count(*), will use the “no ranking” FT_hints. The limit hint won’t be used if invoked explicitely an ORDER BY and th MATCH clause in the projection. explain select * from bookContentByLine where match(content) against ("+home" IN BOOLEAN MODE) ORDER BY FTS_DOC_ID LIMIT 10\G select_type: SIMPLE table: bookContentByLine type: fulltext Extra: Using where; Ft_hints: no_ranking; Using filesort explain select * from bookContentByLine where match(content) against ("+home" IN BOOLEAN MODE) LIMIT 10\G table: bookContentByLine type: fulltext Extra: Using where; Ft_hints: no_ranking, limit = 10 explain select count(content) from bookContentByLine where match(content) against ("+home" IN BOOLEAN MODE) \G table: bookContentByLine type: fulltext Extra: Using where; Ft_hints: no_ranking • If you plan to use FTS_DOC_ID column with AUTO_INCREMENT option, have in mind that there is a limitation regarding this. You must declare a single column PRIMARY KEY constraint or as an UNIQUE index. Also, the data type is stricted as bigint unsigned. i.e: CREATE TABLE test ( FTS_DOC_ID bigint unsigned AUTO_INCREMENT, mainPk bigint, other text, PRIMARY KEY(mainPk), UNIQUE(FTS_DOC_ID) ); ### FT_QUERY_EXPANSION_LIMIT This variable controls the number of top matches when using WITH QUERY EXPANSION (affects only MyISAM). reference ### Bug 80347 - Invalid InnoDB FTS Doc ID emanuel@3laptop ~/sandboxes/rsandbox_5_7_9$ ./m dumpTest < full.dump
ERROR 182 (HY000) at line 73: Invalid InnoDB FTS Doc ID

emanuel@3laptop ~/sandboxes/rsandbox_5_7_9 $./m dumpTest < ddl.dump emanuel@3laptop ~/sandboxes/rsandbox_5_7_9$ ./m dumpTest < onlyData.dump
emanuel@3laptop ~/sandboxes/rsandbox_5_7_9 \$ ./m dumpTest < full.dump
ERROR 182 (HY000) at line 73: Invalid InnoDB FTS Doc ID

mysqldump is not very clever if you use FTS_DOC_ID:

2016-02-13T22:11:53.125300Z 19 [ERROR] InnoDB: Doc ID 10002 is too big. Its difference with largest used Doc ID 1 cannot exceed or equal to 10000

It takes dumps without considering the restriction coded in innobase/row/row0mysql.cc:

Difference between Doc IDs are restricted within
4 bytes integer. See fts_get_encoded_len()

The fix to this is backuping the table by chunks of 10000 documents.

### Maintenance

innodb_optimize_fulltext_only

### Parsers internals

Writting FTS parser plugins

## Version

This R Markdown document is made interactive using Shiny. Unlike the more traditional workflow of creating static reports, you can now create documents that allow your readers to change the assumptions underlying your analysis and see the results immediately.