CatThink: see the cats of Instagram in realtime with RethinkDB and Socket.io

Modern frameworks and standards make it easy for developers to build web applications that support realtime updates. You can push the latest data to your users, offering a seamless experience that results in higher engagement and better usability. With the right architecture on the backend, you can put polling out to pasture and liberate your users from the tyranny of the refresh button.

In this tutorial, I’ll show you how I built a realtime Instagram client for the web. The application, which is called CatThink, displays a live feed of new Instagram pictures that have the #catsofinstagram tag. Why cats of Instagram? Because it’s one of the photo service’s most popular and beloved tags. People on the internet really, really like cats. Or maybe we just think we do because our feline companions have reprogrammed us with brain parasites.

The cat pictures appear in real time, as they are posted by their respective users. CatThink shows the pictures in a grid, accompanied by captions and other relevant metadata. In a secondary view, the application uses geolocation info to plot the cat pictures on a map.

CatThink’s architecture

The CatThink backend is built with Node.js and Express on top of RethinkDB. The HTML frontend uses jQuery and Handlebars to display the latest cat pictures. The frontend map view is built with Leaflet, a popular map library that uses tiles from OpenStreetMap. The application uses Socket.io to facilitate communication between the frontend and backend.

CatThink takes advantage of Instagram’s realtime APIs to determine when new images are available. Instagram offers a webhook-based system that allows a backend application to subscribe to updates on a given tag. When there are new posts with the #catsofinstagram tag, Instagram’s servers send an HTTP POST request to a callback URL on your server. The POST request doesn’t actually include the new content, it just includes a timestamp and the name of the updated tag—your application has to fetch the new records using Instagram’s conventional REST API endpoints.

When the CatThink backend receives a POST request from Instagram, it performs a RethinkDB query that uses the r.http command to fetch the latest records from the Instagram REST API and add them directly to the database. The database itself performs the HTTP GET request and parses the returned data.

Because the operation is performed entirely with ReQL, the backend application isn’t responsible for fetching or processing any of the new Instagram pictures. Of course, the backend application will still need to know about new cat pictures so that it can send them to the frontend with Socket.io. CatThink accomplishes that with changefeeds, a RethinkDB feature that lets applications subscribe to changes on a table. Whenever the database adds, removes, or changes a document in the table, it will notify subscribed applications.

CatThink subscribes to a changefeed on the table where the cat records are stored. Whenever the database inserts a new cat record, CatThink receives the data through the changefeed and then broadcasts it to all of the Socket.io connections.

Connect to the Instagram realtime API

To use the Instagram API, you will have to register an application key on the Instagram developer site. You will need to use the client ID and client secret provided by Instagram in order to hit the API endpoints. You don’t need to configure the key with a redirect URI, however, as you won’t be using authentication.

To subscribe to a tag with Instagram’s realtime API, make an HTTP POST request to the api.instagram.com/v1/subscriptions/. In the form data attached to the request, you will need to provide the application key data, the name of the tag, a verification token, and the callback URL where you want Instagram to send new data. The verification token is an arbitrary string that Instagram will pass back to your application when it hits the callback URL.

Note: the callback URL that you provide to Instagram must be publicly-accessible to outside networks. For development purposes, it can be helpful to use a tool like ngrok that exposes a local port to the public internet.

In CatThink, I use the request library to perform the initial request to the Instagram server:

var params = {
  client_id: "XXXXXXXXXXXXXXXXXXXXXXXXX",
  client_secret: "XXXXXXXXXXXXXXXXXXXXXXXXX",
  verify_token: "somestring",
  object: "tag", aspect: "media",
  object_id: "catsofinstagram",
  callback_url: "http://mycatapp.ngrok.com/publish/photo"
};

request.post({url: api + "subscriptions", form: params},
  function(err, response, body) {
    if (err) console.log("Failed to subscribe:", err);
    else console.log("Successfully subscribed.");
});

If the subscription API call is properly formed, Instagram will immediately attempt to make an HTTP GET request at the callback URL. It will send several query parameters, including the verification token and a challenge key. In order to complete the subscription, you have to make the GET request return the provided challenge key. With Express, create a GET handler for the callback URL:

app.get("/publish/photo", function(req, res) {
  if (req.param("hub.verify_token") == "somestring")
    res.send(req.param("hub.challenge"));
  else res.status(500).json({err: "Verify token incorrect"});
});

Fetch the latest cats

The next step is to implement the POST handler for the callback URL. When Instagram sends the application a POST request to inform it of new content on the subscribed tag, it includes several bits of information in the request body:

[{
        "changed_aspect": "media",
        "object": "tag",
        "object_id": "catsofinstagram",
        "time": 1414995025,
        "subscription_id": 14185203,
        "data": {}
}]

The object_id property is obviously the name of the subscribed tag. The time property is a UNIX timestamp that reflects when the event occurred. The subscription_id property is a value that uniquely identifies the individual subscription.

Whenever the application receives a POST request at the callback URL, it will tell the database to fetch the latest cat records from Instagram’s REST API. The application also provides a response so that Instagram knows that the POST request didn’t fail. If the POST requests that Instagram sends to the application start to fail, Instagram will automatically taper off requests and eventually cancel the tag subscription.

app.post("/publish/photo", function(req, res) {
  var update = req.body[0];
  res.json({success: true, kind: update.object});

  if (update.time - lastUpdate < 1) return;
  lastUpdate = update.time;

  var path = "https://api.instagram.com/v1/tags/" +
             "catsofinstagram/media/recent?client_id=" +
             instagramClientId;


  r.connect(config.database).then(function(conn) {
    this.conn = conn;
    return r.table("instacat").insert(
      r.http(path)("data").merge(function(item) {
        return {
          time: r.now(),
          place: r.point(
            item("location")("longitude"),
            item("location")("latitude")).default(null)
        }
      })).run(conn)
  })
  .error(function(err) { console.log("Failure:", err); })
  .finally(function() {
    if (this.conn)
      this.conn.close();
  });
});

In the code above, the ReQL query uses the r.point command in a merge operation to turn the geographical coordinates for each cat photo into a native geolocation point object. That’s not used in the application, but it might be useful later if you wanted to create a geospatial index and query for cat pictures based on location.

In order to avoid hitting the Instagram API limit, the application checks the timestamp provided with each POST request and does some basic throttling to ensure that new cat records aren’t typically going to be fetched more than once per second.

The path variable in the handler code is the URL of the Instagram REST API endpoint that the application uses to fetch the latest cat. In this example, the “catsofinstagram” tag is hard-coded into the URL path. It’s worth noting, however, that you could use the name of the subscribed tag from the object_id property if you wanted to use the same POST handler to deal with multiple tag subscriptions.

Verify the request origin

In cases where you rely on the object_id property, you’d probably also want to validate the source of the POST request to make sure that it actually came from Instagram. If you don’t verify the origin, somebody might figure out your URL endpoint and send you malicious POST requests that include an object_id for a rogue tag that you don’t want to appear in your application. You wouldn’t want some nefarious anti-cat vigilante to trick your application into showing dogs, for example.

Every POST request from Instagram will have an X-Hub-Signature header with a hash that you can validate using your secret key and the request body. The bodyParser middleware provides a verify option that is specifically intended for such purposes:

app.use("/publish/photo", bodyParser.json({
  verify: function(req, res, buf) {
    var hmac = crypto.createHmac("sha1", "XXXXXXXXXXXXXXX");
    var hash = hmac.update(buf).digest("hex");

    if (req.header("X-Hub-Signature") == hash)
      req.validOrigin = true;
  }
}));

At the beginning of your POST handler, you would simply check the value of req.validOrigin and make sure that it’s true before continuing.

Use changefeeds to handle new cats

The CatThink backend uses RethinkDB changefeeds to detect when the database adds new records to the cat table. In a ReQL query, the changes command returns a cursor that exposes every modification that is made to the specified table. The following code shows how to consume the data emitted by the changefeed and broadcast each new item with Socket.io:

r.table("instacat").changes().run(this.conn).then(function(cursor) {
  cursor.each(function(err, item) {
    if (item && item.new_val)
      io.sockets.emit("cat", item.new_val);
  });
})
.error(function(err) {
  console.log("Error:", err);
});

CatThink broadcasts every cat to every user, so you don’t need to worry about tracking individual Socket.io connections or routing messages to the right users.

In addition to broadcasting new cats, it’s also a good idea to pass the user a modest backlog of cats when they first establish their connection with the server so that their initial view of the application is populated with some data. In a Socket.io connection event handler, CatThink performs a ReQL query that fetches the 60 most recent cats and then sends the result set back to the user:

io.sockets.on("connection", function(socket) {
  r.connect(config.database).then(function(conn) {
    this.conn = conn;
    return r.table("instacat").orderBy({index: r.desc("time")})
            .limit(60).run(conn)
  })
  .then(function(cursor) { return cursor.toArray(); })
  .then(function(result) {
    socket.emit("recent", result);
  })
  .error(function(err) { console.log("Failure:", err); })
  .finally(function() {
    if (this.conn)
      this.conn.close();
  });
});

Implement the frontend

The CatThink frontend has a very simple user interface: It displays the grid of cats and the accompanying map view. A full-blown JavaScript MVC framework would likely be overkill, so it uses a pretty light dependency stack. It uses Leaflet for the map, jQuery for the UI logic, and Handlebars templating to generate the markup for each new cat picture.

After some initial setup for the tab switching and map view, the bulk of the frontend code is housed in a single addCat function that applies the template to the cat data, inserts the new markup into the grid, and then creates the location marker for cats with geolocation data:

var map = L.map("map").setView([0, 0], 2);
map.addLayer(L.tileLayer(mapTiles, {attribution: mapAttrib}));

var template = Handlebars.compile($("#cat-template").html());
var markers = [];

function addCat(cat) {
  cat.date = moment.unix(cat.created_time).format("MMM DD, h:mm a");
  $("#cats").prepend(template(cat));

  if (cat.place) {
    var count = markers.unshift(L.marker(L.latLng(
        cat.place.coordinates[1],
        cat.place.coordinates[0])));

    map.addLayer(markers[0]);
    markers[0].bindPopup(
        "<img src=\"" + cat.images.thumbnail.url + "\">",
        {minWidth: 150, minHeight: 150});

    markers[0].openPopup();

    if (count > 100)
      map.removeLayer(markers.pop());
  }
}

The map markers are stored in an array so that the application can easily remove old markers as it adds new ones. The marker cap is set to 100 in the code above, but you could likely raise it considerably if desired. It’s important to have some kind of cap, however, because Leaflet can sometimes exhibit odd behavior if you have too many.

The Handlebars template that the application applies to the cat data is embedded in the HTML page itself, using a script tag with a custom type:

<script id="cat-template" type="text/x-handlebars-template">
  <div class="cat">
    <div class="user">{{user.full_name}}</div>
    <div class="meta">
      <div class="time">Posted at {{date}}</div>
      <div class="caption">{{caption.text}}</div>
    </div>
    <img class="thumb" src="{{images.low_resolution.url}}">
  </div>
</script>

The last piece of the puzzle is implementing Socket.io on the client side. The application needs to establish a Socket.io connection with the server and then provide event handlers for the backlog and new cats. Both handlers will simply use the addCat function shown above.

var socket = io.connect();

socket.on("cat", addCat);
socket.on("recent", function(data) {
  data.reverse().forEach(addCat);
});

The handler for the “cat” event receives a single cat object, which is immediately passed into the addCat function. The handler for the “recent” event receives an array of cat objects from the server. It reverses the array before adding the cats so that the images will display in reverse-chronological order, consistent with how they are added in real time.

Next steps

Although CatThink is not particularly complex, changefeeds helped to simplify the application and reduce the total amount of necessary code. Without changefeeds, the CatThink backend would have to fetch, parse, and process all of the cat records on its own instead of offloading that work to the database with a simple ReQL query.

In larger realtime applications, changefeeds can potentially offer more profound architectural advantages. You can increase the modularity of your application by decoupling the parts that handle and process data from the parts that convey updates to the frontend. There are also cases where you can use changefeeds to eliminate the need for dedicated message queue systems.

In the current version of RethinkDB, changefeeds offer a useful way to monitor changes on individual tables. In future versions, changefeeds will support a richer set of capabilities. Users will be able to monitor filtered data sets and detect change events on complex aggregations, like a player leader board or realtime moving averages. You can look forward to seeing the first round of new changefeed features in an upcoming release.

Install RethinkDB and try the ten-minute guide to experience the database in action.

For additional information, you can refer to: