Last week, in “Bring Data and Applications Practices Together for Best Performance”, I mentioned that specific technical choices about how to host and code a computation can easily “… determine a hundred-fold speed-up …” Was I serious about that? In fact, innocent-looking changes frequently alter database retrieval speeds by factors of a thousand or more. Reaction to these enormous disparities is part of the cultural divide last week’s posting tried to describe: data people generally find the contrast routine, even as it astonishes managers and programmers on the application side.
Specific speed tests are sensitive to the details of configuration, content, and query; it’s hard to reproduce exact circumstances in just a few lines. Here are a few examples of choices that often result in big differences:
Index tables! More specifically, judicious indexing is widely recognized as a way to speed selected queries by enormous factors. Not so well known: there’s plenty of technique to getting the best from indexes. The size of indexed content, the distribution of indexed values, and the location of indexes in storage all have the potential to multiply or degrade performance significantly.
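A quick way to see an index take effect is to ask the database for its query plan before and after creating one. Here is a minimal sketch using Python’s built-in sqlite3 module; the table and index names are invented for illustration, and the exact plan wording varies by SQLite version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(i, 50000 + i) for i in range(10000)])

def plan(sql):
    # EXPLAIN QUERY PLAN reports whether SQLite will scan the whole
    # table or search it through an index.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(str(r) for r in rows)

query = "SELECT salary FROM employees WHERE employee_id = 4242"
before = plan(query)   # plan mentions a full table SCAN

conn.execute("CREATE INDEX idx_emp ON employees (employee_id)")
after = plan(query)    # plan now mentions a SEARCH USING INDEX
```

On a ten-thousand-row table the difference is measurable; on millions of rows it is the difference between milliseconds and minutes.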
One all-too-common practice is that application programmers with little understanding of SQL ask for large datasets which they filter programmatically. An SQL query such as select * from my_table feeds a big loop which then tests return values for employee_id, date-range, and so on.
If the resulting application meets its business requirements, I have no complaint. The application programmer would be happier in the long run, though, were he to ask a data specialist for a little help transforming the query to select salary, ... from my_table where employee_id in .... Not only is SQL likely to execute such filters several times as fast as the programming language’s loop, but reducing the result dataset to the minimum necessary can shrink the network burden of the query by a factor of hundreds or even millions.
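The contrast between the two approaches is easy to demonstrate. The sketch below, again using sqlite3 with invented table and column names, runs the fetch-everything-and-filter anti-pattern next to the equivalent WHERE … IN query; both produce the same answer, but the second returns only the rows and columns actually needed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (employee_id INTEGER, salary REAL, notes TEXT)")
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?)",
                 [(i, 40000 + i, "x" * 200) for i in range(5000)])

wanted = {17, 42, 1001}

# Anti-pattern: pull every row over the wire, then filter in a loop.
slow = [row[1] for row in conn.execute("SELECT * FROM my_table")
        if row[0] in wanted]

# Better: let SQL filter and project, returning only what is needed.
placeholders = ",".join("?" for _ in wanted)
fast = [row[0] for row in conn.execute(
    f"SELECT salary FROM my_table WHERE employee_id IN ({placeholders})",
    sorted(wanted))]

# Same answer either way; the second transfers three small rows
# instead of five thousand wide ones.
```

With a remote database server rather than an in-memory file, the loop version also pays for every discarded row in network traffic.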
I sometimes see even experienced SQL developers conservatively confining themselves to the SQL idioms they learned in their first years, rather than taking full advantage of expressions that have become standard in the last decade or two. One favorite of mine is SQL “window functions”. These enhancements appeared in the SQL:2003 standard, and are now supported in nearly all current SQL dialects. Window functions make it easy for SQL to express, for instance, the ranks of competitive swimmers timed in separate heats.
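The swimming example can be sketched in a few lines. This assumes an SQLite new enough to support window functions (3.25 or later, bundled with recent Python releases); the table and swimmer names are made up. RANK() OVER assigns each swimmer an overall place even though no two swimmers in different heats ever raced side by side:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE heats (swimmer TEXT, heat INTEGER, seconds REAL)")
conn.executemany("INSERT INTO heats VALUES (?, ?, ?)", [
    ("Ana", 1, 54.2), ("Ben", 1, 53.8),
    ("Cho", 2, 53.5), ("Dee", 2, 55.0),
])

# RANK() OVER (ORDER BY seconds) ranks every swimmer against the
# whole field, regardless of which heat each one swam in.
rows = conn.execute("""
    SELECT swimmer, seconds,
           RANK() OVER (ORDER BY seconds) AS overall_rank
    FROM heats
    ORDER BY overall_rank
""").fetchall()
```

Adding PARTITION BY heat inside the OVER clause would instead rank swimmers within each heat; expressing either computation without window functions takes a self-join or a correlated subquery that is both slower and harder to read.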
Part of the dialogue between data and application specialists last week’s article recommended involves the realization that SQL is not the best solution to all problems. The previous paragraphs illustrated a few ways in which a little SQL knowledge improves what an application might otherwise achieve. It’s possible to overdo SQL, though.
I see this most often when stored procedures or functions include an excess of filesystem computations. Proprietary extensions to SQL expose the operating system or a filesystem-aware virtual machine in various ways. As handy as this can be in special circumstances, it rarely makes sense to code application-specific calculations within the database, and is almost certain to be more difficult to maintain than the same computations expressed in an application language.
In all cases, the underlying idea is that application and data experts should talk over the architecture of their systems to make well-informed choices, rather than be confined within the capabilities their own ignorance defines.
A special bonus for those with an interest in coding higher-performance applications is “Web Performance 101”, which complements the database-oriented remarks above with 22 annotated links mostly on performance boosts through front-end programming.