Meet the Baconator - ProPublica

ProPublica is a nonprofit newsroom that investigates abuses of power.

This post was co-published with Source.

As a member of the team responsible for keeping ProPublica’s website online, I have sometimes wished our site were static. Static sites have a simpler configuration with fewer moving parts between the requester and the requested webpage. All else being equal, a static site can handle more traffic than a dynamic one, and it is more stable and performant. However, there is a reason most sites today, including ProPublica’s, are dynamically generated.

In dynamic sites, the structure of a webpage — which includes items such as titles, bylines, article bodies, etc. — is abstracted into a template, and the specific data for each page is stored in a database. When requested by a web browser or other end client, a server-side language can then dynamically generate many different webpages with the same structure but different content. This is how frameworks like Ruby on Rails and Django, as well as content management systems like WordPress, work.
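
As a minimal illustration of this idea, the sketch below renders two pages from a single template using only Python's standard library. The `PAGES` dictionary stands in for a database, and the paths and field values are hypothetical; real frameworks and content management systems use far richer template engines.

```python
from string import Template

# One template shared by every article page.
ARTICLE_TEMPLATE = Template(
    '<h1>$title</h1>\n<p class="byline">By $byline</p>\n<div>$body</div>'
)

# Stand-in for a database table keyed by URL path (hypothetical data).
PAGES = {
    "/first-story": {"title": "First Story", "byline": "A. Reporter", "body": "Text one."},
    "/second-story": {"title": "Second Story", "byline": "B. Reporter", "body": "Text two."},
}


def render_page(path):
    """Generate a page on request by filling the template with stored data."""
    return ARTICLE_TEMPLATE.substitute(PAGES[path])
```

Calling `render_page("/first-story")` and `render_page("/second-story")` yields two pages with the same structure but different content.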

That dynamism comes at a cost. Instead of just HTML files and a simple web server, a dynamic site needs a database to hold its content. And while a server for a static site responds to incoming requests by simply fetching existing HTML files, a dynamic site’s server has the additional job of generating those files from scripts, templates and a database. With moderately high levels of traffic, this can become resource intensive and, consequently, expensive.

This is where caching comes into play. At its most basic, caching is the act of saving a copy of the output of a process. For example, your web browser caches images and scripts of sites you visit so subsequent visits to the same page will load much faster. By using locally cached assets, the web browser avoids the slow, resource-intensive process of downloading them again.

Caching is also employed by dynamic sites in the webpage generation process: at the database layer for caching the results of queries; in the content management system for caching partial or whole webpages; and by using a “reverse proxy,” which sits between the internet and a web server to cache entire webpages. (A proxy server can be used as an intermediary for requests originating from a client, like a browser. A reverse proxy server is used as an intermediary for traffic to and from a server.)

However, even with these caching layers, the demands of a dynamically generated site can prove high.

This was the case two years ago, shortly after we migrated ProPublica’s website to a new content management system. Our new CMS allowed for a better experience both for members of our production team, who create and update our articles, and for our designers and developers, who craft the end-user experience of the site. However, those improvements came at a cost. More complex pages, or pages requested very frequently, could tax the site to the point of making it crash. As a workaround we began saving fully rendered copies of resource-intensive pages and rerouting traffic to them. Everything else was still served by our CMS.

As we built tools to support this, our team was also having conversations about improving platform performance and stability. We kept coming back to the idea of using a static site generator. As the name suggests, a static site generator does for an entire site what our workaround did for resource-intensive pages: It generates and saves a copy of each page. It can be thought of as a kind of cache, saving our servers the work of responding to requests in real time. It also provides security benefits, reducing a website’s attack surface by minimizing how much users interact directly with potentially vulnerable server-side scripts.

In 2018, we brought the idea to a digital agency, Happy Cog, and began to workshop solutions. Because performance was important to us, they proposed that we use distributed serverless technologies like Cloudflare Workers or AWS Lambda@Edge to create a new kind of caching layer in front of our site. Over the coming months, we designed and implemented that caching layer, which we affectionately refer to as “The Baconator.” (Developers often refer to generating a static page as “baking a page out.” So naturally, the tool we created to do this programmatically for the entire site took on the moniker “The Baconator.”) While the tool isn’t exactly a static site generator, it has given us many of the benefits of one, while allowing us to retain the production and development workflows we love in our CMS.

How Does It Work?

There are five core components:

  • Cache Data Store: A place to store cached pages. This can be a file system, database or in-memory data store like Redis or Memcached, etc.
  • Source of Truth (or Origin): A CMS, web framework or “thing which makes webpages” to start with. In other words, the original source of the content we’ll be caching.
  • Reverse Proxy: A lightweight web server to receive and respond to incoming requests. There are a number of lightweight but powerful tools that can play this role, such as AWS Lambda or Cloudflare Workers. However, the same can be achieved with Apache or Nginx and some light scripting.
  • Queue: A queue to hold pending requests for cache regeneration. This could be as simple as a table in a database.
  • Queue Worker: A daemon to process pending queue requests. Here again, “serverless” technologies, like Google Cloud Functions, could be employed. However, a simple script run by cron could do the trick as well.

How Do the Components Interact?

When a resource (like a webpage) is requested, the reverse proxy receives the request and checks the cache data store. If a cached copy of that resource exists, its expiration, or time to live (TTL), is saved in a variable, and the cached copy is served. The TTL is then checked. If the cache has not yet expired, it is considered valid and nothing else is done. If it has expired, the reverse proxy adds a request to the queue for that resource’s cache to be updated.

Meanwhile, the queue worker is constantly checking the queue. As requests come into the queue, it generates the webpage from the origin and updates the corresponding cache in the data store.

And finally at the origin, anytime a page is created or edited, the cache data store is amended or updated.

How the elements in the Baconator interact.
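
The interaction described above can be modeled in a short Python sketch. This is a toy, in-memory model under stated assumptions, not ProPublica's implementation: the class and method names are hypothetical, the cache is a dictionary, the queue is a `deque`, and the behavior on a cache miss (which the article does not describe) is an assumption.

```python
import time
from collections import deque


class Baconator:
    """Toy model of the request flow; all names here are hypothetical."""

    def __init__(self, origin, ttl_seconds=60.0, clock=time.time):
        self.origin = origin            # callable: path -> HTML (source of truth)
        self.ttl_seconds = ttl_seconds
        self.clock = clock              # injectable so expiry can be simulated
        self.cache = {}                 # cache data store: path -> (html, expires_at)
        self.queue = deque()            # paths waiting to be regenerated

    def _store(self, path, html):
        self.cache[path] = (html, self.clock() + self.ttl_seconds)

    def _enqueue(self, path):
        # Skipping duplicates is an assumption, not something the article states.
        if path not in self.queue:
            self.queue.append(path)

    def handle_request(self, path):
        """Reverse proxy: serve the cached copy; queue a refresh if it expired."""
        entry = self.cache.get(path)
        if entry is None:
            # Assumption: on a miss, queue the page and return nothing.
            self._enqueue(path)
            return None
        html, expires_at = entry
        if self.clock() >= expires_at:
            self._enqueue(path)         # stale copy is still served below
        return html

    def run_worker_once(self):
        """Queue worker: regenerate one pending page from the origin."""
        if not self.queue:
            return False
        path = self.queue.popleft()
        self._store(path, self.origin(path))
        return True

    def publish(self, path):
        """Origin hook: update the cache when a page is created or edited."""
        self._store(path, self.origin(path))
```

Note that `handle_request` never calls the origin: an expired page is served as-is and only refreshed later by `run_worker_once`, which is what keeps the request path independent of the origin.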

For our team, the chief benefit of this system is the separation between our origin and web servers. Where previously the servers that housed our CMS (the origin) also responded to a percentage of incoming requests from the internet, now the two functions are completely separate. Our origin servers are only tasked with creating and updating content, and the reverse proxy is our web server that focuses solely on responding to requests. As a consequence, our origin servers could be offline and completely inaccessible, but our site would remain available, served by the reverse proxy from the content in our cache. In this scenario, we would be unable to update or create new pages, but our site would stay live. Moreover, because the web server simply retrieves and serves resources, and does not generate them, the site can handle more traffic and is more stable and performant.

Another important reason for moving to this caching system was to ease the burden on our origin servers. However, it should be noted that even with this caching layer it is possible to overload origin servers with too much traffic, though it’s far less likely. Remember, the reverse proxy will add expired pages to the queue, so if the cache TTLs are too short the queue will grow. And if the queue worker is configured to be too aggressive, the origin servers could be inundated with more traffic than they can handle. Conversely, if the queue worker does not run frequently enough, the queue will stay high, and stale pages will remain in cache and be served to end users for longer than desired.
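
One hypothetical way to keep a worker from being too aggressive is to cap how many pages it regenerates per run. The sketch below assumes an in-memory queue and a `regenerate` callable supplied by the caller; `drain_queue` and `max_per_run` are illustrative names, not part of any real tool.

```python
from collections import deque


def drain_queue(queue, regenerate, max_per_run):
    """Process at most max_per_run pending paths and return how many ran."""
    processed = 0
    while queue and processed < max_per_run:
        regenerate(queue.popleft())
        processed += 1
    return processed


# Usage: three pages are pending, but only two are regenerated in this run.
pending = deque(["/", "/article/1", "/article/2"])
regenerated = []
drain_queue(pending, regenerated.append, max_per_run=2)
# pending now holds only "/article/2"
```

A lower `max_per_run` protects the origin but lets the queue drain more slowly, which is the trade-off described above.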

The key to this system (as with any caching system) is proper configuration of TTLs: long enough so that the queue stays relatively low and the origin servers are not overwhelmed, but short enough to limit the time stale content is in cache. This will likely be different for different kinds of content (e.g., listing pages that change more frequently may need shorter TTLs than article pages). In our implementation, this has been the biggest challenge with moving to this system. It’s taken some time to get this right, and we continue to tweak our configurations to find the right balance.
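
A per-content-type TTL configuration might look like the following sketch. The URL prefixes and the values in seconds are invented for illustration; the article does not publish the settings ProPublica actually uses.

```python
# Hypothetical TTLs in seconds, matched by URL prefix in order.
TTL_BY_PREFIX = [
    ("/article/", 3600),  # article pages change rarely once published
    ("/topics/", 120),    # listing pages change often, so expire sooner
]
DEFAULT_TTL = 600         # fallback for everything else


def ttl_for(path):
    """Return the TTL for the first matching prefix, else the default."""
    for prefix, ttl in TTL_BY_PREFIX:
        if path.startswith(prefix):
            return ttl
    return DEFAULT_TTL
```

Keeping the values in one table makes it easier to adjust them over time as the balance between queue length and staleness becomes clearer.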

For those interested in this kind of caching system, we’ve built a simple open-source version that you can run on your own computer. You can use it to explore the ideas outlined above.

Source of this news: https://www.propublica.org/nerds/baconator-news-site-caching-reverse-proxy-queue-worker
