Cassandra NoSQL Expertise to Optimize Your SQL Web Performance ?!?!

Cassandra NoSQL Expertise to Optimise Your SQL Web Performance?!?! No, I’m not drunk. In this article, I want to show and convince you that techniques used to design Cassandra NoSQL databases can actually be implemented in any SQL relational database to optimise your web application performance by many folds. I’m going to show techniques that may blow your mind, improve your database performance, and dramatically speed up your web application.


Point 1: What is Cassandra?

Cassandra is a highly scalable, distributed NoSQL database designed to handle vast amounts of data across many commodity servers. Developed initially at Facebook to power their inbox search feature, Cassandra is built on a decentralised, peer-to-peer architecture with no single point of failure. It excels in scenarios that demand high availability, fault tolerance, and linear scalability.

Cassandra uses a data model based on rows and columns but with a schema-less design, allowing flexibility in how data is stored and accessed. Data is partitioned and replicated across multiple nodes, ensuring that the system can handle a large volume of operations per second, even in the face of node failures. While Cassandra is designed for NoSQL use cases, its underlying principles, such as denormalisation and distributed data storage, can be adapted to optimise SQL databases for high-performance applications.

Point 2: JOIN queries are web performance killers

JOIN queries are an essential feature of SQL databases, enabling the combination of data from multiple tables into a single, coherent dataset. For one or two users accessing a database locally, JOINs are highly effective, allowing complex queries to be executed efficiently. The database can leverage local CPU and RAM resources to process these JOIN operations quickly, making them a valuable tool for data manipulation and retrieval.

However, the situation changes drastically in the context of web applications, where many concurrent users may be accessing the same database simultaneously. When a web application uses JOIN queries, the database must manage multiple complex operations at once. This increases CPU and RAM usage significantly, as each JOIN requires the database to lock and access multiple tables simultaneously. As the number of users grows, these locks become more frequent and longer in duration, leading to contention and bottlenecks. The database might need more CPU power and RAM to handle these operations, but even then, the efficiency of JOINs diminishes as the system becomes increasingly burdened by the need to serve multiple users concurrently.

In such scenarios, the performance impact is twofold: first, the CPU load increases dramatically, as it must process the complex logic of JOINs for numerous users; second, the RAM usage skyrockets because the database has to cache and manage large amounts of intermediate data generated by JOIN operations. This often leads to slower query execution times, increased latency, and poor user experience as the web application struggles to keep up with the demand. The more users accessing the database, the more severe the performance degradation, making JOINs a critical consideration in web application design.

Point 3: Cassandra doesn’t support JOIN

One of Cassandra’s core principles is that it does not support JOIN operations. This might initially seem like a limitation, but it’s actually a deliberate design decision aimed at optimising performance in large-scale, distributed environments. Instead of relying on JOINs, Cassandra encourages denormalisation—storing related data together in a single table or column family. This eliminates the need to perform costly JOIN operations at query time, significantly reducing the CPU and RAM load on the database.

In Cassandra, related data is often stored together in a way that makes it easily retrievable without needing to access multiple tables. This approach allows for much faster read operations and more predictable performance, especially under heavy loads. By adopting similar strategies in SQL databases, such as denormalising your data model or using techniques like materialised views or indexed views, you can reduce the reliance on JOINs and improve your web application’s performance.

Point 4: NoSQL databases are not normalized

Normalisation is a fundamental concept in SQL databases, where data is organised to reduce redundancy and ensure integrity. In a normalised database, related data is spread across multiple tables, which can then be combined using JOINs. While this is beneficial in terms of data integrity, it introduces significant overhead in terms of performance, particularly for web applications that require fast, concurrent access to data.

NoSQL databases like Cassandra take a different approach. They are often designed with denormalisation in mind, meaning related data is stored together to optimise read performance. This results in some data redundancy but dramatically improves the speed of read operations, which is crucial in environments where quick access to data is paramount.

When applying this principle to SQL databases, you might denormalise your schema, storing frequently accessed data together even if it means duplicating some information. This reduces the need for JOIN operations, thereby lowering the CPU and RAM requirements and improving overall performance. The trade-off is that your database might require more storage and more complex data management, but the performance gains in a web application environment are often well worth it.


Conclusion: Normalizing a SQL database is best practice, but is bad practice for web applications!

In traditional SQL database design, normalisation is considered a best practice to ensure data integrity and minimise redundancy. However, in the context of web applications, where performance is critical, this best practice can become a significant liability. JOIN operations, a byproduct of normalised schemas, can cripple your web application’s speed, particularly as the number of concurrent users increases.

By borrowing design principles from Cassandra and other NoSQL databases, such as denormalisation and avoiding JOINs, you can dramatically improve the performance of your SQL database. This might involve rethinking how you structure and query your data, potentially at the cost of increased data redundancy. However, the trade-off is a faster, more responsive web application that requires less CPU and RAM to handle large numbers of concurrent users effectively. Ultimately, while normalisation is ideal for maintaining data integrity, in the world of web applications, performance must come first.

 

Technique 1: Store Data by Queries, Not by Normalization Agenda

Explanation: In traditional SQL database design, normalisation separates data into related tables to minimise redundancy. However, this approach can lead to performance bottlenecks in web applications due to the need for JOIN operations to retrieve related data. Instead of normalising purely based on data integrity principles, this technique suggests storing data in a way that directly supports the most common queries your application will execute.

Example: Let’s consider an e-commerce application with the following normalised tables:

  • Users: Stores basic user information.
    • UserID, UserName, Email
  • Orders: Stores order details.
    • OrderID, UserID, OrderDate, TotalAmount

In a normalised setup, retrieving a user’s last order date, the sum of the last order, and the total number of orders would require multiple queries or a complex JOIN operation between the Users and Orders tables.

Normalized Query Example: To get a user’s last order date and total amount:

SELECT 
    u.UserID, 
    u.UserName, 
    o.OrderDate AS LastOrderDate, 
    o.TotalAmount AS LastOrderSum
FROM 
    Users u
JOIN 
    Orders o ON u.UserID = o.UserID
WHERE 
    o.OrderDate = (SELECT MAX(OrderDate) FROM Orders WHERE UserID = u.UserID);

To optimize this, you can denormalize the Users table by adding columns that store the user’s last order date, the sum of the last order, and the total number of orders up to today. This way, these values are readily available without the need for complex JOINs.

Denormalized Users Table:

  • Users: Stores user information along with order summary data.
    • UserID, UserName, Email, LastOrderDate, LastOrderSum, TotalOrdersToDate

Denormalized Query Example: Now, to get the user’s order details, you simply query the Users table:

SELECT 
    UserID, 
    UserName, 
    LastOrderDate, 
    LastOrderSum, 
    TotalOrdersToDate
FROM 
    Users
WHERE 
    UserID = 1;

By storing this precomputed data in the Users table, you eliminate the need for expensive JOINs and make these details immediately accessible, leading to faster query performance in your web application.


Technique 2: Store Aggregated Data

Explanation: In a typical SQL database, aggregates like sums, averages, or counts are calculated in real-time as part of a query. This can be resource-intensive, especially as the volume of data grows. Instead, by storing precomputed aggregated values directly in your tables, you can significantly reduce the need for on-the-fly calculations, improving both speed and efficiency.

Example: Continuing with the e-commerce application, let’s assume you have a normalised OrderItems table that tracks the items in each order:

  • Orders: Stores order details.
    • OrderID, UserID, OrderDate, TotalAmount
  • OrderItems: Stores details about items in each order.
    • OrderItemID, OrderID, ProductID, Quantity, Price

To calculate the total quantity of a particular product ordered across all orders, you would typically sum the Quantity column in the OrderItems table for that product.

Normalized Query Example:

SELECT 
    SUM(Quantity) AS TotalProductOrdered
FROM 
    OrderItems
WHERE 
    ProductID = 101;

To optimize performance, you can add precomputed aggregate values directly to the OrderItems or Orders table. For instance, each order can store the total quantity of a specific product ordered so far.

Enhanced Orders Table with Aggregated Data:

  • Orders: Stores order details along with aggregated data.
    • OrderID, UserID, OrderDate, TotalAmount, TotalProductOrderedToDate

In this setup, the TotalProductOrderedToDate column would store the running total of this product ordered across all previous orders, up to the current order.

Denormalized Query Example: Now, instead of summing the quantities across multiple rows in real-time, you can quickly retrieve the total directly:

SELECT 
    OrderID, 
    UserID, 
    OrderDate, 
    TotalAmount, 
    TotalProductOrderedToDate
FROM 
    Orders
WHERE 
    OrderID = 1;

Other Ideas for Aggregated Data:

  • TotalRevenuePerProduct: Store the total revenue generated by each product.
  • TotalOrdersByUser: Store the total number of orders placed by a user directly in the Users table.
  • TotalRevenuePerUser: Track the cumulative spending of each user.

By embedding these aggregates in your tables, you avoid the need to compute them dynamically, leading to faster queries and a more responsive web application, especially under heavy load.

 

Technique 3: Use Materialized Views

Explanation: Materialised views are database objects that store the result of a query physically, rather than dynamically generating the result each time the query is run. Unlike a regular view, which is just a stored query, a materialised view contains the actual data, precomputed and stored in a table-like format. This technique is particularly effective for queries that involve complex joins, aggregations, or filtering conditions that don’t need to be updated in real-time.

Example: Let’s say your application frequently needs to retrieve the top-selling products across different categories:

  • Sales: Stores individual sales records.
    • SaleID, ProductID, SaleDate, Quantity, TotalAmount
  • Products: Stores product information.
    • ProductID, ProductName, Category, Price

To optimize the retrieval of top-selling products, you can create a materialized view:

  • TopSellingProducts: Stores the top-selling products by category.
    • Category, ProductID, ProductName, TotalSales

This materialized view is updated periodically (e.g., hourly or daily) and can be quickly queried without recalculating the top-selling products each time. The view could be refreshed automatically at intervals, ensuring that the data remains reasonably up-to-date while providing significant performance benefits.

Benefits of Materialized Views:

  • Speed: Since the data is precomputed, queries are faster.
  • Reduced Load: Complex queries are run less frequently, reducing CPU and RAM usage.
  • Scalability: Materialised views can be distributed across multiple nodes in a distributed SQL setup, further improving performance.

You may also like...

Leave a Reply

Your email address will not be published. Required fields are marked *