Consolidation Query: CONTAINS performance

Given the query:
Select *
from A
left outer join B on A.bid=B.id
left outer join C on A.cid=C.id
Where contains(A.*, word)
or contains (B.*, word)
or contains (C.*, word)
I am concerned about performance does SQL Server retrieve every row from A (which might eventually contain a vast number of records) in order to establish the records from B and C which are candidate results for this query?
I am assuming that a number of columns in A, B and C are marked for text searches Does it make a difference if B.id and C.id are unique keys? Does it make a difference if those unique keys are the unique keys supplied as keyname for their respective
tables when sp_fulltext_table created the text index?
************************************************** ********************
Sent via Fuzzy Software @. http://www.fuzzysoftware.com/
Comprehensive, categorised, searchable collection of links to ASP & ASP.NET resources...
This is an interesting question. Essentially you will be doing 3 remote
scans of the SQL FTS catalogs and you are not limiting your results set
which is the number one performance factor. It doesn't look like you will be
doing any table scans, although it is a little hard to tell.
It seems to me that table A bears some relationship to table B and table C,
perhaps a parent child relationship.
It looks like with the left joins you are saying give me all bid's in A that
a exist in B or don't exist in B, and the same with table C. I think you
might benefit from some consolidation of your tables.
Can you tell me more about the relationships of these tables to one another
and what you are trying to accomplish.
Hilary Cotter
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
"Mal Earnest" <malcolm.earnest@.bom.co.uk> wrote in message
news:%23oUFgA2uEHA.908@.TK2MSFTNGP11.phx.gbl...
> Given the query:
> Select *
> from A
> left outer join B on A.bid=B.id
> left outer join C on A.cid=C.id
> Where contains(A.*, 'word')
> or contains (B.*, 'word')
> or contains (C.*, 'word')
> I am concerned about performance - does SQL Server retrieve every row from
A (which might eventually contain a vast number of records) in order to
establish the records from B and C which are candidate results for this
query?
> I am assuming that a number of columns in A, B and C are marked for text
searches - Does it make a difference if B.id and C.id are unique keys? Does
it make a difference if those unique keys are the unique keys supplied as
"keyname" for their respective tables when sp_fulltext_table created the
text index?
> ************************************************** ********************
> Sent via Fuzzy Software @. http://www.fuzzysoftware.com/
> Comprehensive, categorised, searchable collection of links to ASP &
ASP.NET resources...
|||Mal,
The performance problem is most likely not with your join, but the number of
FTS predicates, such as CONTAINS.
Specifically, see the online version of the "Best Practices Analyzer Tool
for Microsoft SQL Server" at http://www.sqldude.4t.com/rules.html and more
specifically "Duplicate Full-Text Predicates" and it's recommendation:
"This rule checks stored procedures, triggers, views and functions for use
of duplicate full text predicates (CONTAINS or FREETEXT) operating on the
same column in the same query.
As an example, the following query:
SELECT col1 FROM dbo.tab1 WHERE CONTAINS(ft_col, 'merry') or CONTAINS
(ft_col, 'christmas')
should be rewritten as:
SELECT col1 FROM dbo.tab1 WHERE CONTAINS(ft_col, 'merry OR christmas')
The former query performs two separate scans of the full-text catalog, while
the latter does a single pass. Significant performance improvements can be
obtained by consolidating full-text predicates."
I'd recommend that you reduce the number of contains clause to one and test
your SQL FTS performance to confirm better performance.
Regards,
John
"Mal Earnest" <malcolm.earnest@.bom.co.uk> wrote in message
news:#oUFgA2uEHA.908@.TK2MSFTNGP11.phx.gbl...
> Given the query:
> Select *
> from A
> left outer join B on A.bid=B.id
> left outer join C on A.cid=C.id
> Where contains(A.*, 'word')
> or contains (B.*, 'word')
> or contains (C.*, 'word')
> I am concerned about performance - does SQL Server retrieve every row from
A (which might eventually contain a vast number of records) in order to
establish the records from B and C which are candidate results for this
query?
> I am assuming that a number of columns in A, B and C are marked for text
searches - Does it make a difference if B.id and C.id are unique keys? Does
it make a difference if those unique keys are the unique keys supplied as
"keyname" for their respective tables when sp_fulltext_table created the
text index?
> ************************************************** ********************
> Sent via Fuzzy Software @. http://www.fuzzysoftware.com/
> Comprehensive, categorised, searchable collection of links to ASP &
ASP.NET resources...
|||A valid point in general, however not valid in this case as there are three
tables which are being full text indexed and queries, A, B, and C. This also
involves not two scans of the full text catalog (termed a remote scan in
QO), but three.
Mal's only option is the three separate scans he is currently doing, or
consolidating the three tables into a single table (perhaps a child table),
which will reduce the three remote scans to one.
"John Kane" <jt-kane@.comcast.net> wrote in message
news:u$6wCN4uEHA.1400@.TK2MSFTNGP11.phx.gbl...
> Mal,
> The performance problem is most likely not with your join, but the number
> of
> FTS predicates, such as CONTAINS.
> Specifically, see the online version of the "Best Practices Analyzer Tool
> for Microsoft SQL Server" at http://www.sqldude.4t.com/rules.html and more
> specifically "Duplicate Full-Text Predicates" and it's recommendation:
> "This rule checks stored procedures, triggers, views and functions for use
> of duplicate full text predicates (CONTAINS or FREETEXT) operating on the
> same column in the same query.
> As an example, the following query:
> SELECT col1 FROM dbo.tab1 WHERE CONTAINS(ft_col, 'merry') or CONTAINS
> (ft_col, 'christmas')
> should be rewritten as:
> SELECT col1 FROM dbo.tab1 WHERE CONTAINS(ft_col, 'merry OR christmas')
> The former query performs two separate scans of the full-text catalog,
> while
> the latter does a single pass. Significant performance improvements can be
> obtained by consolidating full-text predicates."
> I'd recommend that you reduce the number of contains clause to one and
> test
> your SQL FTS performance to confirm better performance.
> Regards,
> John
>
>
> "Mal Earnest" <malcolm.earnest@.bom.co.uk> wrote in message
> news:#oUFgA2uEHA.908@.TK2MSFTNGP11.phx.gbl...
> A (which might eventually contain a vast number of records) in order to
> establish the records from B and C which are candidate results for this
> query?
> searches - Does it make a difference if B.id and C.id are unique keys?
> Does
> it make a difference if those unique keys are the unique keys supplied as
> "keyname" for their respective tables when sp_fulltext_table created the
> text index?
> ASP.NET resources...
>
|||The example from "Best Practices Analyzer Tool for Microsoft SQL Server" is
generic for two CONTAINS clauses, but it also applies for more than two
CONTAINS clauses as each contains clause is a full "round trip" to the FT
Catalog and if the FT Catalogs are large, then this can adversly affect the
FTS query performance.
Hilary, I'm sure Mal has reasons for having 3 tables and it is inappropriate
to recommend that he change his table structures for just this query as
changing the table schemas may be either not possible or detmentrial to his
application. What is a better recommendation is to make alterations to the
FTS query that is non-intrusive to the existing table structures.
Mal, could you provide more details on the three tables? Specifically, the
output of sp_help for each table. Meanwhile (and until we get back the
requested info), you should review the following simple Pubs database query
and stored proc as possible alternative to using multiple contains clauses:
use pubs
select pub_id, pr_info from pub_info
where contains(pr_info , '(("book" or "books") and ("publish" or
"publisher"))')
go
-- DROP PROCEDURE sp_FTSearchPubsInfo
go
CREATE PROCEDURE sp_FTSearchPubsInfo ( @.vcSearchText varchar(7800))
AS
declare @.s as varchar (8000)
set @.s='select pub_id, pr_info from pub_info where
contains(pr_info,'+''''+@.vcSearchText+''''+')'
exec (@.s)
go
EXEC sp_FTSearchPubsInfo '("book*") or ("publish*")'
EXEC sp_FTSearchPubsInfo '(("book" or "books") and ("publish" or
"publisher"))'
go
Regards,
John
"Hilary Cotter" <hilary.cotter@.gmail.com> wrote in message
news:e4$bLY4uEHA.1448@.TK2MSFTNGP10.phx.gbl...
> A valid point in general, however not valid in this case as there are
three
> tables which are being full text indexed and queries, A, B, and C. This
also
> involves not two scans of the full text catalog (termed a remote scan in
> QO), but three.
> Mal's only option is the three separate scans he is currently doing, or
> consolidating the three tables into a single table (perhaps a child
table),[vbcol=seagreen]
> which will reduce the three remote scans to one.
>
> "John Kane" <jt-kane@.comcast.net> wrote in message
> news:u$6wCN4uEHA.1400@.TK2MSFTNGP11.phx.gbl...
number[vbcol=seagreen]
Tool[vbcol=seagreen]
more[vbcol=seagreen]
use[vbcol=seagreen]
the[vbcol=seagreen]
CONTAINS[vbcol=seagreen]
christmas')[vbcol=seagreen]
be[vbcol=seagreen]
text[vbcol=seagreen]
as
>
|||It is obvious to me you are mounting a disinformation campaign for your own
purposes which are not in this newsgroups best interests. I will state my
views and follow up with the poster directly.
It should be obvious to the more seasoned dba the mendacity of your above
post.
This is the last time I will be responding directly to your posts.
If Mal is interested in solutions involving table changes to improve
performance he can post back here.
"John Kane" <jt-kane@.comcast.net> wrote in message
news:O5xPHk4uEHA.3376@.TK2MSFTNGP12.phx.gbl...
> The example from "Best Practices Analyzer Tool for Microsoft SQL Server"
> is
> generic for two CONTAINS clauses, but it also applies for more than two
> CONTAINS clauses as each contains clause is a full "round trip" to the FT
> Catalog and if the FT Catalogs are large, then this can adversly affect
> the
> FTS query performance.
> Hilary, I'm sure Mal has reasons for having 3 tables and it is
> inappropriate
> to recommend that he change his table structures for just this query as
> changing the table schemas may be either not possible or detmentrial to
> his
> application. What is a better recommendation is to make alterations to the
> FTS query that is non-intrusive to the existing table structures.
> Mal, could you provide more details on the three tables? Specifically, the
> output of sp_help for each table. Meanwhile (and until we get back the
> requested info), you should review the following simple Pubs database
> query
> and stored proc as possible alternative to using multiple contains
> clauses:
> use pubs
> select pub_id, pr_info from pub_info
> where contains(pr_info , '(("book" or "books") and ("publish" or
> "publisher"))')
> go
> -- DROP PROCEDURE sp_FTSearchPubsInfo
> go
> CREATE PROCEDURE sp_FTSearchPubsInfo ( @.vcSearchText varchar(7800))
> AS
> declare @.s as varchar (8000)
> set @.s='select pub_id, pr_info from pub_info where
> contains(pr_info,'+''''+@.vcSearchText+''''+')'
> exec (@.s)
> go
> EXEC sp_FTSearchPubsInfo '("book*") or ("publish*")'
> EXEC sp_FTSearchPubsInfo '(("book" or "books") and ("publish" or
> "publisher"))'
> go
> Regards,
> John
>
>
> "Hilary Cotter" <hilary.cotter@.gmail.com> wrote in message
> news:e4$bLY4uEHA.1448@.TK2MSFTNGP10.phx.gbl...
> three
> also
> table),
> number
> Tool
> more
> use
> the
> CONTAINS
> christmas')
> be
> text
> as
>
|||Hilary,
I'm sorry you feel that way, but your posts in this newsgroup lately leave
much to desire and in many cases have been incorrect and inappropriate and
perhaps you should reconsider your views and upgrade your own technical
responses.
Best regards,
John
"Hilary Cotter" <hilary.cotter@.gmail.com> wrote in message
news:errBer5uEHA.2012@.TK2MSFTNGP15.phx.gbl...
> It is obvious to me you are mounting a disinformation campaign for your
own[vbcol=seagreen]
> purposes which are not in this newsgroups best interests. I will state my
> views and follow up with the poster directly.
> It should be obvious to the more seasoned dba the mendacity of your above
> post.
> This is the last time I will be responding directly to your posts.
> If Mal is interested in solutions involving table changes to improve
> performance he can post back here.
>
> "John Kane" <jt-kane@.comcast.net> wrote in message
> news:O5xPHk4uEHA.3376@.TK2MSFTNGP12.phx.gbl...
Server"[vbcol=seagreen]
FT[vbcol=seagreen]
the[vbcol=seagreen]
the[vbcol=seagreen]
in[vbcol=seagreen]
recommendation:[vbcol=seagreen]
for[vbcol=seagreen]
catalog,[vbcol=seagreen]
can[vbcol=seagreen]
and[vbcol=seagreen]
row[vbcol=seagreen]
to[vbcol=seagreen]
this[vbcol=seagreen]
keys?[vbcol=seagreen]
supplied[vbcol=seagreen]
************************************************** ********************
>
|||Hilary, John-
Please don't fight. I guess one of you just had a bad day. It will be a
terrible loss if any of you stops responding. I've known Hilary and John for
the same period I know Microsoft FTS. You two are bound together like SQL
and BOL

Peace guys.
"John Kane" <jt-kane@.comcast.net> wrote in message
news:OgAOy%235uEHA.1288@.TK2MSFTNGP11.phx.gbl...[vbcol=seagreen]
> Hilary,
> I'm sorry you feel that way, but your posts in this newsgroup lately leave
> much to desire and in many cases have been incorrect and inappropriate and
> perhaps you should reconsider your views and upgrade your own technical
> responses.
> Best regards,
> John
>
> "Hilary Cotter" <hilary.cotter@.gmail.com> wrote in message
> news:errBer5uEHA.2012@.TK2MSFTNGP15.phx.gbl...
> own
my[vbcol=seagreen]
above[vbcol=seagreen]
> Server"
two[vbcol=seagreen]
> FT
affect[vbcol=seagreen]
as[vbcol=seagreen]
to[vbcol=seagreen]
> the
> the
This[vbcol=seagreen]
> in
or[vbcol=seagreen]
Analyzer[vbcol=seagreen]
and[vbcol=seagreen]
> recommendation:
> for
on[vbcol=seagreen]
> catalog,
> can
> and
> row
order[vbcol=seagreen]
> to
> this
for[vbcol=seagreen]
> keys?
> supplied
created[vbcol=seagreen]
> ************************************************** ********************
&
>
|||First and foremost, John and Hillary thank you both for your responses... I've not established what I set out to discover as yet, but there were several useful ideas, and you both asked me to clarify my question.
When I initially asked about performance - I suppose I should have made it clearer that I am primarily concerned by scalability. I can put-up with the wall-clock speed of execution at present but I need to establish that as the number of records grows ov
er time that my solution won't grind to a halt.
The advice to limit the number of results was good - I'd omitted that for brevity - the actual form of query which interests me includes such a limit - I'd have better stated the query:
Select top 100 * from
A left outer join B on A.bid=B.id
left outer join C on A.cid=C.id
Where
contains(A.*, word)
or contains (B.*, word)
or contains (C.*, word)
Assume that the word I choose to search upon is sufficiently unusual to limit my result-set after application of the where clause to between, say, 0 and 10 rows and that all these rows must be returned to my client hence for the circumstances I was co
nsidering the TOP 100 would likely have no significant effect on performance. As you might have guessed, the query I posted is a significant simplification of a series of real world problems (hence the generic names) the detailed explanation of which is
way beyond the scope of a post here. In my example the relationship from A to B and from A to C is 1..n. For some of the tables I actually use B.id is unique in B; in some C.id unique in C; for some both are unique; for some neither. Many queries and upd
ates within the system only act upon these tables individually (hence fixing the table structure.) I also require support for data mining searches for which the result set matches that returned for the query above. While I can see a potential for improvi
ng performance by using a simplified table structure (and avoiding the complexity of evaluation strategy introduced by left-outer-join) this is not practical as a consequence of other binding project constraints. I need to keep the distinct tables I intr
oduced as A, B and C their relationship can be considered set-in-stone. I realise that a single table would eliminate concerns about performance but that idea is not an option for this project.
Im still a little unclear about your terminology (specifically scans of the FTS catalogs and table scans) I suspect that this relate closely to what I consider the primary performance concern.
I assume by "table scans" you suggest that SQLServer might evaluate both outer joins for each record in A then establish if the candidate result rows match any of the three contains clauses? If this is the case then the performance would be worse than li
near in the number of records in A and hence exhibit relatively poor scalability.
I assume scans of the FTS catalogs you mean that SQLServer might use the text index to efficiently establish rows from A matching the contains predicate ranging over A; outer join those rows with B and C to get the initial result rows; establish in B us
ing the text index rows matching the contains predicate ranging over B then inner join those results with A, then outer join that result with C; finally use the text index for C to efficiently establish the remaining rows in C before inner joining those w
ith A and outer joining with B all the while carefully arranging not to return the exact same row twice? This would be a workable strategy but it feels unlikely as it both fails to explain observed performances during my (extremely limited and potent
ially very error prone) experiments and because when I specified two OUTER JOIN expressions in my query I wouldnt have expected the optimiser to implement this using two inner-join and four outer-join relationships unless I re-write my query to explicitl
y hint that this is the way I need the query to be evaluated. While this approach to establish the result set is relatively complex, it offers far better scalability especially if table A happens to have a large number of records.
I'm not in a position to give concrete sp_help output - primarily as this question is a massive simplification of the actual queries I use... Tables typically have 20-50 fields of various types (CLOB; BLOB; integer; date; etc. etc.); All tables have an in
teger primary key and in example above assume indices are defined on B.id and C.id - each of which is can be assumed ordinal (say integer). The textual information is free-text enabled in every table.
I didn't understand John's stored procedure - but the idea of a stored procedure to make explicit an evaluation strategy for the desired result-set is certainly an interesting and technically plausible possibility... all be it one which would prove quite
a challenge in the context of my project!
I'm not encountered the "Best Practices Analyzer Tool" - though the advice on http://www.sqldude.4t.com/rules.html looks familiar enough. Is this tool supplied with SQLServer or is it 3rd party product? Would this tool tell me how my query is being opti
mised?
At this stage I'm trying to establish if my hunch is accurate about the scalability of queries of the type illustrated above. I am aware that the three contains clauses would be detrimental to performance - and I am aware that if I had a single flat tabl
e that any performance issues would likely disappear. I'm also aware that I could use an alternative sequence of queries to efficiently establish the same result rows - but it would be a big task to re-write my query generator to make this optimisation a
nd I am reluctant to do this if there is an alternative. Queries of the form above work OK for my current test data on a quick server - (hundreds of records) but I want a solution which will scale to support hundreds of thousands to millions of records.
I'm still not clear how the query above will be optimised and, as such, I'm still not in a position to estimate scalability. I cant use a single table and avoid outer-joins but I might be able to alter the way in which the tables are indexed if tha
t would influence the optimiser in a positive way.
************************************************** ********************
Sent via Fuzzy Software @. http://www.fuzzysoftware.com/
Comprehensive, categorised, searchable collection of links to ASP & ASP.NET resources...
|||"John Kane" <jt-kane@.comcast.net> wrote in message
news:OgAOy%235uEHA.1288@.TK2MSFTNGP11.phx.gbl...
> Hilary,
> I'm sorry you feel that way, but your posts in this newsgroup lately leave
> much to desire and in many cases have been incorrect and inappropriate and
> perhaps you should reconsider your views and upgrade your own technical
> responses.
John, I've been reading the posts between you and Hilary in a couple of
threads, and in each case you appear to misunderstand both the original post
and Hilary's reply.
Take this thread for instance, you post stuff about combining the CONTAINS
into a single one, yet if you read the OP you'll see that the 3 CONTAINS are
acting on different tables, so a single CONTAINS clause isn't appropriate
unless the tables are combined into a single table, which is what Hilary
suggested. You both post useful information, but you seem to be taking an
aggressive stance against Hilary when none is necessary.
Dan

Friday, February 24, 2012

CONTAINS performance

No comments:

Post a Comment

Consolidation Query

Blog Archive

About Me