- Flask By Example
- Gareth Dwyer
- 2,690 words
- 2021-07-09 20:06:54
Introduction to RSS and RSS feeds
RSS is an old but still widely used technology to manage content feeds. It's been around for such a long time that there's some debate as to what the letters RSS actually stand for, with some saying Really Simple Syndication and others Rich Site Summary. It's a bit of a moot point as everyone just calls it RSS.
RSS presents content in an ordered and structured format using XML. It has several uses, with one of the more common uses being for people to consume news articles. On news websites, news is usually laid out similarly to a print newspaper with more important articles being given more space and also staying on the page for longer. This means that frequent visitors to the page will see some content repeatedly and have to look out for new content. On the other hand, some web pages are updated only very infrequently, such as some authors' blogs. Users have to keep on checking these pages to see whether they are updated, even when they haven't changed most of the time. RSS feeds solve both of these problems. If a website is configured to use RSS feeds, all new content is published to a feed. A user can subscribe to the feeds of his or her choice and consume these using an RSS reader. New stories from all feeds he or she has subscribed to will appear in the reader and disappear once they are marked as read.
As RSS feeds have a formal structure, they allow us to easily parse the headline, article text, and date programmatically in Python. We'll use some RSS feeds from major news publications to display news to our application's users.
Although RSS follows a strict format and we could, with not too much trouble, write the logic to parse the feeds ourselves, we'll use a Python library to do this. The library abstracts away things such as different versions of RSS and allows us to access the data we need in a completely consistent fashion.
There are several Python libraries that we could use to achieve this. We'll select feedparser. To install it, open your terminal and type the following:
pip install --user feedparser
Now, let's go find an RSS feed to parse! Most major publications offer RSS feeds, and smaller sites built on popular platforms, such as WordPress and Blogger, will often have RSS included by default as well. Sometimes, a bit of effort is required to find the RSS feed, as there is no standard as to where it should be located. However, you'll often see the RSS icon somewhere on the homepage (look at the headers and footers), which looks similar to this:

Also, look for links saying RSS or Feed. If this fails, try going to site.com/rss or site.com/feed, where site.com is the root URL of the site for which you're looking for RSS feeds.
We'll use the RSS feed for the main BBC news page. At the time of writing, it is located at http://feeds.bbci.co.uk/news/rss.xml. If you're curious, you can open the URL in your browser, right-click somewhere on the page, and click on View Source or an equivalent. You should see some structured XML with a format similar to the following:
    <?xml version="1.0" encoding="UTF-8"?>
    <rss version="2.0">
      <channel>
        <title>FooBar publishing</title>
        <link>http://dwyer.co.za</link>
        <description>A mock RSS feed</description>
        <language>en-gb</language>
        <item>
          <title>Flask by Example sells out</title>
          <description>Gareth Dwyer's new book, Flask by Example sells out in minutes</description>
          <link>http://dwyer.co.za/book/news/flask-by-example</link>
          <guid isPermaLink="false">http://dwyer.co.za/book/news/flask-by-example</guid>
          <pubDate>Sat, 07 Mar 2015 09:09:19 GMT</pubDate>
        </item>
      </channel>
    </rss>
At the very top of the feed, you'll see a line or two that describes the feed itself, such as which version of RSS it uses and possibly some information about the styles. After this, you'll see information relating to the publisher of the feed followed by a list of <item> tags. Each of these represents a story, in our case, a news article. These items contain information such as the headline, a summary, the date of publication, and a link to the full story. Let's get parsing!
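To make the structure described above concrete, here is a minimal sketch that walks the <item> tags of a mock feed using only Python's standard library. It is not how feedparser works internally; the MOCK_FEED string and the list_items() helper are assumptions made up for this illustration.

```python
import xml.etree.ElementTree as ET

# A cut-down copy of the mock feed shown earlier (illustrative data only).
MOCK_FEED = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>FooBar publishing</title>
    <link>http://dwyer.co.za</link>
    <description>A mock RSS feed</description>
    <item>
      <title>Flask by Example sells out</title>
      <description>Gareth Dwyer's new book, Flask by Example sells out in minutes</description>
      <link>http://dwyer.co.za/book/news/flask-by-example</link>
      <pubDate>Sat, 07 Mar 2015 09:09:19 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def list_items(xml_text):
    """Return one dictionary per <item> tag found in the feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext() returns None when a child tag is missing
            "title": item.findtext("title"),
            "summary": item.findtext("description"),
            "published": item.findtext("pubDate"),
            "link": item.findtext("link"),
        })
    return items


for story in list_items(MOCK_FEED):
    print(story["title"], "|", story["published"])
```

A hand-rolled parser like this only copes with the one layout it was written for, which is the main reason to hand the job to a library.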
Using RSS from Python
In our headlines.py file, we'll make modifications to import the feedparser library we installed, parse the feed, and grab the first article. We'll build up HTML formatting around the first article and show this in our application. If you're not familiar with HTML, it stands for HyperText Markup Language and is used to define the look and layout of text in web pages. It's pretty straightforward, but if it's completely new to you, you should take a moment now to go through a beginner tutorial to get familiar with its most basic usage. There are many free tutorials online, and a quick search should bring up dozens. A popular and very beginner-friendly one can be found at http://www.w3schools.com/html/.
Our new code adds the import for the new library, defines a new global variable for the RSS feed URL, and further adds a few lines of logic to parse the feed, grab the data we're interested in, and insert this into some very basic HTML. It looks similar to this:
    import feedparser
    from flask import Flask

    app = Flask(__name__)

    BBC_FEED = "http://feeds.bbci.co.uk/news/rss.xml"


    @app.route("/")
    def get_news():
        feed = feedparser.parse(BBC_FEED)
        first_article = feed['entries'][0]
        return """<html>
            <body>
                <h1> BBC Headlines </h1>
                <b>{0}</b> <br/>
                <i>{1}</i> <br/>
                <p>{2}</p> <br/>
            </body>
        </html>""".format(first_article.get("title"),
                          first_article.get("published"),
                          first_article.get("summary"))


    if __name__ == "__main__":
        app.run(port=5000, debug=True)
The first line of this function passes the BBC feed URL to our feedparser library, which downloads the feed, parses it, and returns a Python dictionary. In the second line, we grab just the first article from the feed and assign it to a variable. The entries key in the dictionary returned by feedparser contains a list of all the items, that is, the news stories we spoke about earlier, so we take the first of these and get the headline (title), the date (published), and the summary of the article (summary) from it. In the return statement, we build a basic HTML page within a single triple-quoted Python string. It includes the <html> and <body> tags that all HTML pages have; an <h1> heading that describes what our page is; <b>, a bold tag that shows the news headline; <i>, an italics tag that shows the date of the article; and <p>, a paragraph tag that shows the summary of the article. As nearly all items in an RSS feed are optional, we use the dictionary .get() method instead of index notation (square brackets). If any information is missing, .get() returns None, which appears as the text None in our final HTML rather than causing a runtime error.
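The difference between index notation and .get() can be seen with a plain dictionary. This is a small sketch using a made-up article dictionary, not a real feedparser entry; the fallback text is likewise an assumption.

```python
# A made-up entry with a title but no "summary" key (illustrative data only).
article = {"title": "Flask by Example sells out"}

# Index notation raises KeyError when the key is missing.
try:
    article["summary"]
    missing_raises = False
except KeyError:
    missing_raises = True

# .get() returns None for a missing key, or a fallback value if we supply one.
summary = article.get("summary")
summary_with_default = article.get("summary", "No summary available")

# str.format() renders None as the text "None".
snippet = "<p>{0}</p>".format(summary)
print(missing_raises, summary, summary_with_default, snippet)
```

Passing a second argument to .get() is one way to keep the literal text None out of the page.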
For the sake of clarity, we didn't do any exception handling in this example; however, note that our code may well throw an exception on attempting to parse the BBC URL. If your local Internet connection is unavailable, the BBC server is down, or the provided feed is malformed, then feedparser will not be able to turn the feed into a usable list of entries. In a real application, we would add some exception handling and retry logic here. In a real application, we'd also never build HTML within a Python string. We'll look at how to handle HTML properly in the next chapter.
Fire up your web browser and take a look at the result. You should see a very basic page that looks similar to the following (although your news story will be different):

This is a great start, and we're now serving dynamic content (that is, content that changes automatically in response to user or external events) to our application's hypothetical users. However, ultimately, it's not much more useful than the static string. Who wants to see a single news story from a single publication that they have no control over?
To finish off this chapter, we'll look at how to show an article from different publications based on URL routing. That is, our user will be able to navigate to different URLs on our site and view an article from any of several publications. Before we do this, let's take a slightly more detailed look at how Flask handles URL routing.
URL routing in Flask
Do you remember that we briefly mentioned Python decorators in the previous chapter? They're represented by the funny @app.route("/") line we had above our main function, and they indicate to Flask which parts of our application should be triggered by which URLs. Our base URL, which is usually something similar to site.com but in our case is the IP address of our VPS, is omitted, and we will specify the rest of the URL (that is, the path) in the decorator. Earlier, we used a single slash, indicating that the function should be triggered whenever our base URL was visited with no path specified. Now, we will set up our application so that users can visit URLs such as site.com/bbc or site.com/cnn to choose which publication they want to see an article from.
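To build some intuition for what a route decorator does, the following toy sketch registers functions against paths in a dictionary and looks them up on request. It is only an illustration of the idea, not Flask's actual implementation; the TinyApp class and its dispatch() method are invented for this example.

```python
class TinyApp:
    """A toy stand-in for a web framework's routing table (illustration only)."""

    def __init__(self):
        self.routes = {}  # maps a path such as "/cnn" to a function

    def route(self, path):
        """Return a decorator that registers the decorated function for path."""
        def decorator(func):
            self.routes[path] = func
            return func  # hand the function back unchanged
        return decorator

    def dispatch(self, path):
        """Call whichever function was registered for path."""
        return self.routes[path]()


app = TinyApp()


@app.route("/")
def index():
    return "BBC headline"


@app.route("/cnn")
def cnn():
    return "CNN headline"


print(app.dispatch("/"))
print(app.dispatch("/cnn"))
```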
The first thing we need to do is collect a few RSS URLs. At the time of writing, all of the following are valid:
- CNN: http://rss.cnn.com/rss/edition.rss
- Fox News: http://feeds.foxnews.com/foxnews/latest
- IOL: http://www.iol.co.za/cmlink/1.640
First, we will consider how we might achieve our goals using static routing. It's by no means the best solution, so we'll implement static routing for only two of our publications. Once we get this working, we'll consider how to use dynamic routing instead, which is a simpler and more generic solution to many problems.
Instead of declaring a global variable for each of our RSS feeds, we'll build a Python dictionary that encapsulates them all. We'll make our get_news() function generic and have our decorated functions call this with the relevant publication. Our modified code looks as follows:
    import feedparser
    from flask import Flask

    app = Flask(__name__)

    RSS_FEEDS = {'bbc': 'http://feeds.bbci.co.uk/news/rss.xml',
                 'cnn': 'http://rss.cnn.com/rss/edition.rss',
                 'fox': 'http://feeds.foxnews.com/foxnews/latest',
                 'iol': 'http://www.iol.co.za/cmlink/1.640'}


    @app.route("/")
    @app.route("/bbc")
    def bbc():
        return get_news('bbc')


    @app.route("/cnn")
    def cnn():
        return get_news('cnn')


    def get_news(publication):
        feed = feedparser.parse(RSS_FEEDS[publication])
        first_article = feed['entries'][0]
        return """<html>
            <body>
                <h1>Headlines </h1>
                <b>{0}</b> <br/>
                <i>{1}</i> <br/>
                <p>{2}</p> <br/>
            </body>
        </html>""".format(first_article.get("title"),
                          first_article.get("published"),
                          first_article.get("summary"))


    if __name__ == "__main__":
        app.run(port=5000, debug=True)
Tip (common mistakes): If you're copying and pasting functions and editing the @app.route decorator, it's easy to forget to edit the function name. Although the names of our functions are largely irrelevant as we don't call them directly, different view functions can't share the same name: Flask registers each view under its function name by default and raises an error at startup if two different functions claim the same one.
We still return the BBC news feed by default, but if our user visits the CNN or BBC routes, we will explicitly take the top article from the respective publication. Note that we can have more than one decorator per function so that our bbc() function gets triggered by a visit to our base URL or to the /bbc path. Also, note that the function name does not need to be the same as the path, but it is a common convention that we followed in the preceding example.
Following this, we can see the output for our application when the user visits the /cnn page. The headline displayed is now from the CNN feed.

Now that we know how routing works in Flask, wouldn't it be nice if it could be even simpler? We don't want to define a new function for each of our feeds. What we need is for the function to dynamically grab the right URL based on the path. This is exactly what dynamic routing does.
In Flask, if we specify a part of our URL path in angle brackets (< >), then it is taken as a variable and is passed to our application code. Therefore, we can go back to having a single get_news() function and pass in a <publication> variable, which can be used to make the selection from our dictionary. Any variables specified by the decorator must be accounted for in our function's definition. The first few lines of the updated get_news() function are shown as follows:
    @app.route("/")
    @app.route("/<publication>")
    def get_news(publication="bbc"):
        # rest of code unchanged
In the preceding code, we added <publication> to the route definition. This creates an argument called publication, which we need to add as a parameter of the function directly below the route. Thus, we can keep our default value for the publication parameter as bbc, but if the user visits /cnn, Flask will pass the cnn value as the publication argument instead.
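The way a path variable ends up as a function argument can be imitated in a few lines of plain Python. The sketch below is a toy model only: Flask does its own, more capable matching, and the match_route() and handle() helpers are invented for this illustration. To keep it self-contained, this get_news() simply returns the feed URL it would have parsed.

```python
import re

RSS_FEEDS = {'bbc': 'http://feeds.bbci.co.uk/news/rss.xml',
             'cnn': 'http://rss.cnn.com/rss/edition.rss'}


def match_route(pattern, path):
    """Return the variables captured from path, or None if it does not match.

    A pattern such as "/<publication>" becomes a regular expression with a
    named group, so "/cnn" yields {"publication": "cnn"}.
    """
    regex = re.sub(r"<(\w+)>", r"(?P<\1>[^/]+)", pattern)
    match = re.fullmatch(regex, path)
    return match.groupdict() if match else None


def get_news(publication="bbc"):
    # Simplified: return the feed URL instead of fetching and parsing it.
    return RSS_FEEDS[publication]


def handle(path):
    """Try each route in turn and call get_news() with whatever was captured."""
    for pattern in ("/", "/<publication>"):
        variables = match_route(pattern, path)
        if variables is not None:
            # "/" captures nothing, so the default value "bbc" applies.
            return get_news(**variables)
    return None


print(handle("/"))
print(handle("/cnn"))
```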
The rest of the code remains unchanged, but it's important to delete the now unused bbc() and cnn() function definitions as we need the default route to activate our get_news() function instead.
It's easy to forget to catch the URL variables in the function definition. Any dynamic part of the route must contain a parameter of the same name in the function in order to use the value, so look out for this. Note that we gave our publication variable a default value of bbc so that we don't need to worry about it being undefined when the user visits our base URL. However, again, our code will throw an exception if the user visits any URL that we don't have as a key in our dictionary of feeds. In a real web application, we'd catch cases such as this and show an error to the user, but we'll leave error handling for later chapters.
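As a rough idea of what catching an unknown publication could look like, here is a minimal sketch in plain Python. The lookup_feed() and get_news_or_error() helpers, the lowercasing of the name, and the wording of the messages are all assumptions made for this illustration. In a Flask view, one option would be to call flask.abort(404) instead of returning a message.

```python
RSS_FEEDS = {'bbc': 'http://feeds.bbci.co.uk/news/rss.xml',
             'cnn': 'http://rss.cnn.com/rss/edition.rss',
             'fox': 'http://feeds.foxnews.com/foxnews/latest',
             'iol': 'http://www.iol.co.za/cmlink/1.640'}


def lookup_feed(publication):
    """Return the feed URL for a publication, or None if we do not know it."""
    # Lowercasing is an assumption, so that "/CNN" and "/cnn" behave alike.
    return RSS_FEEDS.get(publication.lower())


def get_news_or_error(publication="bbc"):
    """Return a placeholder result, or a readable error for unknown names."""
    feed_url = lookup_feed(publication)
    if feed_url is None:
        known = ", ".join(sorted(RSS_FEEDS))
        return "Unknown publication '{0}'. Try one of: {1}".format(
            publication, known)
    # Placeholder: the real function would parse the feed here.
    return "Would fetch the top article from {0}".format(feed_url)


print(get_news_or_error("cnn"))
print(get_news_or_error("nytimes"))
```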
Publishing our Headlines application
This is as far as we'll take our application in this chapter. Let's push the results to our server and configure Apache to display our headlines application instead of our Hello World application by default.
First, add your changes to the Git repository, commit them, and push them to the remote. You can do this by running the following commands (after opening a terminal and changing directory to the headlines directory):
    git add headlines.py
    git commit -m "dynamic routing"
    git push origin master
Then, connect to the VPS with SSH and clone the new project there using the following commands:
    ssh -i yourkey.pem root@123.456.789.123
    cd /var/www
    git clone https://<yourgitrepo>
Don't forget to install the new library that we now depend on. Forgetting to install dependencies on your server is a common error that can lead to frustrating debugging, so keep this in mind. The following is the command for this:
pip install --user feedparser
Now, create the .wsgi file. I assume that you named your Git project headlines when creating the remote repository and that a directory named headlines was created in your /var/www directory when you did the preceding Git clone command. If you called your project something else and now have a directory with a different name, rename it to headlines (otherwise, you'll have to adapt a lot of the configuration we're about to do accordingly). To rename a directory in Linux, use the following command:
mv myflaskproject headlines
The preceding command will rename the directory called myflaskproject to headlines, which will ensure that all the configuration to follow will work. Now, run the following:
    cd headlines
    nano headlines.wsgi
Then, insert the following:
    import sys
    sys.path.insert(0, "/var/www/headlines")
    from headlines import app as application
Exit Nano by hitting the Ctrl + X key combo and enter Y when prompted to save changes.
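For context on the last line of headlines.wsgi: by default, mod_wsgi looks for a callable named application in the script file, which is why the Flask app object is imported under that name. The sketch below uses only the standard library to show the shape of such a callable. The call_app() helper is invented for this illustration and imitates a server making one request; it is not how Apache invokes the application.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A bare WSGI callable: takes the request environment, returns the body."""
    body = b"Hello from a bare WSGI app"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_app(app, path="/"):
    """Imitate a server calling app once; return the status and the body."""
    environ = {}
    setup_testing_defaults(environ)  # fill in a minimal fake request
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


print(call_app(application))
```

A Flask application object can be called in the same way, which is what lets Apache serve it through this one name.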
Now, navigate to the sites-available directory in Apache and create the new .conf file using the following commands:
    cd /etc/apache2/sites-available
    nano headlines.conf
Next, enter the following:
    <VirtualHost *>
        ServerName example.com
        WSGIScriptAlias / /var/www/headlines/headlines.wsgi
        WSGIDaemonProcess headlines
        <Directory /var/www/headlines>
            WSGIProcessGroup headlines
            WSGIApplicationGroup %{GLOBAL}
            Order deny,allow
            Allow from all
        </Directory>
    </VirtualHost>
Save the file and quit nano. Now, disable our old site, enable the new one, and reload Apache by running the following commands:
    sudo a2dissite hello.conf
    sudo a2ensite headlines.conf
    sudo service apache2 reload
Try and visit the IP address of your VPS from your local machine, and if all went as expected, you should see the news headline as before! If not, don't worry. It's easy to make a mistake in some piece of configuration. It's most likely that your headlines.wsgi or headlines.conf file has a small error. The easiest way to find this is by looking at the most recent errors in your Apache error log, which would have been triggered when you attempted to visit the site. View this again with the following command:
    sudo tail -fn 20 /var/log/apache2/error.log