Endless scroll and real-time data: Solution for preventing data duplication

Posted by Pixafy Team

Endless scroll is a feature on a site where content is continually loaded as a user continues to scroll down the page. It’s essentially automatic pagination. While it may be straightforward to implement endless scrolling with static/rarely changing data, it becomes more difficult once you start working with data that is constantly changing.

One main problem that I’ve run into is the duplication of data on a page. This duplication occurs because, in between trips to the backend, a record that appeared in one set is pushed out of that set and into the set of records that have yet to be retrieved. This could be due to a modification of the record itself or of another record. Either way, the next time we hit the backend for the next set of records, a record from the previously retrieved set is now in the new set, and its contents will be shown to the visitor again.

A good example of this would be a site that displays products that are continuously curated. Let’s say that these products are displayed in groups of 4, ordered by when they were last updated, which is to say the last product to be updated will be the first to appear. Let’s also assume we have the following products, listed from most to least recently updated: A, B, C, D, E, F, G, and H.

The query for the first dataset would look something like this (for the sake of the example, we’ll have the SQL query perform the ordering; each later dataset would use the same query with an offset):

Select [col1, col2, etc] from products order by updated_on desc limit 4

Here is how the duplication issue would develop:

–          Dataset 1 has been retrieved for user 1, displaying products A, B, C, & D.

–          Before dataset 2 is retrieved for user 1, the admin has updated product E.

–          Due to the update, product E is now in dataset 1 and the next user to visit the page will see products E, A, B, & C.

–          Due to the update of product E, product D is now moved into dataset 2.

–          User 1 passes the scroll threshold, the backend is hit, dataset 2 is retrieved, and user 1 is presented with products D, F, G, & H, with D being the duplicate.
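The sequence above can be reproduced with a small sketch. It is only an illustration under stated assumptions: an in-memory SQLite database stands in for the real backend, the name column and the integer updated_on values are made up for the example, and datasets are paged with LIMIT and OFFSET, which the original query does not spell out.

```python
import sqlite3

# Hypothetical schema and data, chosen to mirror the A-H example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT PRIMARY KEY, updated_on INTEGER)")
# A has the largest updated_on (most recently updated), H the smallest.
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(name, 100 - i) for i, name in enumerate("ABCDEFGH")],
)

PAGE_SIZE = 4


def fetch_dataset(page):
    """Return the product names in one dataset (page 0 is dataset 1)."""
    rows = conn.execute(
        "SELECT name FROM products ORDER BY updated_on DESC LIMIT ? OFFSET ?",
        (PAGE_SIZE, page * PAGE_SIZE),
    )
    return [name for (name,) in rows]


dataset_1 = fetch_dataset(0)  # user 1 lands on the page

# The admin updates product E before user 1 scrolls any further.
conn.execute("UPDATE products SET updated_on = 200 WHERE name = 'E'")

dataset_2 = fetch_dataset(1)  # user 1 passes the scroll threshold

print(dataset_1)                        # ['A', 'B', 'C', 'D']
print(dataset_2)                        # ['D', 'F', 'G', 'H']
print(set(dataset_1) & set(dataset_2))  # {'D'}
```

Product D shows up in both datasets because the update to E shifted every product after it down by one position between the two requests.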

The problem here is that the datasets are too volatile. There has to be a way to prevent external actions from affecting the sequence in which a visitor is presented data.

The way that I’ve combated this is by simply storing the time the user first landed on the page. Once I have that timestamp, I can use it as a part of the infinite scroll implementation. Here’s how the above query would change:

Select [col1, col2, etc] from products where updated_on < [user_timestamp] order by updated_on desc limit 4

Here’s how the duplication issue would be prevented with the inclusion of the timestamp:

–          Dataset 1 has been retrieved for user 1, displaying products A, B, C, & D.

–          Before dataset 2 is retrieved for user 1, the admin has updated product E.

–          Due to the update, product E is now in dataset 1 and the next user to visit the page will see products E, A, B, & C.

–          With the addition of the user timestamp, we now ignore product E as its updated_on time is greater than user 1’s timestamp.

–          User 1 passes the scroll threshold, the backend is hit, dataset 2 is retrieved, and user 1 is presented with products F, G, & H.
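Here is a minimal sketch of the same scenario with the timestamp filter in place. As before, the in-memory SQLite database, the name column, the integer timestamps, and the LIMIT/OFFSET paging are assumptions made for illustration; in a real application user_timestamp would be recorded when the visitor lands on the page and sent along with every later request.

```python
import sqlite3

# Hypothetical schema and data, chosen to mirror the A-H example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT PRIMARY KEY, updated_on INTEGER)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(name, 100 - i) for i, name in enumerate("ABCDEFGH")],
)

PAGE_SIZE = 4


def fetch_dataset(page, user_timestamp):
    """Return one dataset, ignoring anything updated at or after user_timestamp."""
    rows = conn.execute(
        "SELECT name FROM products WHERE updated_on < ? "
        "ORDER BY updated_on DESC LIMIT ? OFFSET ?",
        (user_timestamp, PAGE_SIZE, page * PAGE_SIZE),
    )
    return [name for (name,) in rows]


user_1_timestamp = 150  # stored once, when user 1 first lands on the page
dataset_1 = fetch_dataset(0, user_1_timestamp)

# The admin updates product E after user 1 has landed.
conn.execute("UPDATE products SET updated_on = 200 WHERE name = 'E'")

dataset_2 = fetch_dataset(1, user_1_timestamp)
next_visitor = fetch_dataset(0, 250)  # a later visitor gets a later timestamp

print(dataset_1)     # ['A', 'B', 'C', 'D']
print(dataset_2)     # ['F', 'G', 'H']
print(next_visitor)  # ['E', 'A', 'B', 'C']
```

One consequence of this approach is that user 1 will not see product E at all until the page is reloaded with a fresh timestamp, which matches the F, G, & H result above.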

Voilà! Problem solved!

Please send us a comment describing how you were able to combat this problem, or tweet us @Pixafy!